How do CNNs avoid the vanishing gradient problem?

I have been reading a lot about convolutional neural networks and was wondering how they avoid the vanishing gradient problem. I know that deep belief networks stack single-level autoencoders or other pre-trained shallow networks and can thus avoid this problem, but I don't know how it is avoided in CNNs.

According to Wikipedia:

"Despite the vanishing gradient problem, the superior processing power of GPUs makes plain back-propagation feasible for deep feed-forward neural networks with many layers."

I don't understand why GPU processing would remove this problem.

— Ali

Did the Wikipedia article justify why GPUs help address the vanishing gradient problem? Is it because, even though the gradients are small, GPUs are so fast that we still manage to improve the parameters by taking a large number of steps?

— Charlie Parker

Exactly. The vanishing gradient problem is the reason the lower-layer weights are updated at a very small rate, so the network takes a very long time to train. With GPUs you can do more computation (i.e. more weight updates) in less time, so as GPU processing power grows, the vanishing gradient problem becomes less of an obstacle.

— Sangram

@CharlieParker, could you elaborate on how GPU speed relates to vanishing gradients? I can understand that high memory bandwidth makes it fast to process many matrix multiplications, but could you please explain what that has to do with the derivatives? The vanishing gradient issue seems to have more to do with weight initialization, doesn't it?

— Anu

The vanishing gradient problem requires us to use small learning rates with gradient descent which then needs many small steps to converge. This is a problem if you have a slow computer which takes a long time for each step. If you have a fast GPU which can perform many more steps in a day, this is less of a problem.

There are several ways to tackle the vanishing gradient problem. I would guess that the largest effect for CNNs came from switching from sigmoid nonlinear units to rectified linear units. If you consider a simple neural network whose error $E$ depends on weight $w_{ij}$ only through $y_j$, where

$y_j = f\left( \sum_i w_{ij} x_i \right),$

its gradient is

$\begin{aligned} \frac{\partial}{\partial w_{ij}} E &= \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial w_{ij}} \\ &= \frac{\partial E}{\partial y_j} \cdot f'\left(\sum_i w_{ij} x_i\right) x_i. \end{aligned}$

If $f$ is the logistic sigmoid function, $f'$ will be close to zero for large positive inputs as well as large negative inputs, because the sigmoid saturates at both ends. If $f$ is a rectified linear unit,

$f(u) = \max\left(0, u\right),$

the derivative is zero only for negative inputs and 1 for positive inputs.

Another important contribution comes from properly initializing the weights. This paper looks like a good source for understanding the challenges in more detail (although I haven't read it yet):

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
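To make the sigmoid versus ReLU comparison concrete, here is a minimal sketch in plain Python. It only multiplies the activation derivatives $f'(u)$ across layers and ignores the weights and the $\partial E / \partial y_j$ term, so it illustrates the effect rather than implementing full backpropagation. The depth of 10 and the pre-activation value of 2.0 are arbitrary assumptions chosen for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def sigmoid_grad(u):
    # Derivative of the logistic sigmoid; its maximum is 0.25, reached at u = 0.
    s = sigmoid(u)
    return s * (1.0 - s)

def relu_grad(u):
    # Derivative of max(0, u); the value at u == 0 is set to 0 by convention here.
    return 1.0 if u > 0 else 0.0

def chained_factor(grad_fn, pre_activations):
    """Product of f'(u) over layers: the activation part of a backpropagated gradient."""
    product = 1.0
    for u in pre_activations:
        product *= grad_fn(u)
    return product

# Hypothetical setup: the same positive pre-activation in each of 10 layers.
pre_activations = [2.0] * 10
sigmoid_factor = chained_factor(sigmoid_grad, pre_activations)
relu_factor = chained_factor(relu_grad, pre_activations)
print("sigmoid:", sigmoid_factor)
print("relu:   ", relu_factor)
```

Because the sigmoid derivative never exceeds 0.25, its product over many layers shrinks geometrically, while the ReLU factor stays at 1 as long as every unit on the path is active. A single inactive unit makes the ReLU factor exactly zero.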

— Lucas

I'm a little puzzled about the rectified linear units. Yes, for sigmoids etc. the gradient is often very small, but for rectified linear units it is often exactly zero. Isn't that worse? If the weights of a unit are unfortunate, they will never change.

— Hans-Peter Störr

Thinking about this, leaky and/or noisy ReLUs might be in use for that reason.

— sunside

Why is your first sentence true? I.e. "The vanishing gradient problem requires us to use small learning rates with gradient descent which then needs many small steps to converge." Why do we need small learning rates to deal with the vanishing gradient problem? If the gradients are already small due to vanishing gradients, I would have expected that making the steps even smaller would only make things worse.

— Charlie Parker

Good question, I should have explained that statement better. The vanishing gradient problem is not that all gradients are small (which we could easily fix by using large learning rates), but that the gradients vanish as you backpropagate through the network. I.e., the gradients are small in some layers but large in other layers. If you use large learning rates, the whole thing explodes (because some gradients are large), so you have to use a small learning rate. Using multiple learning rates is another approach to addressing the problem, at the cost of introducing more hyperparameters.
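As a rough illustration of gradients differing by layer, here is a small sketch that backpropagates through a chain of single-unit sigmoid layers, $y_k = f(w_k y_{k-1})$. The chain length of 8, the weights of 1.0, and the input of 0.5 are assumptions picked for the example, and `layer_gradients` is a hypothetical helper, not taken from any library.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def layer_gradients(x, weights):
    """Return dy_L/dw_k for every layer k of the chain y_k = sigmoid(w_k * y_{k-1})."""
    # Forward pass: remember every activation, starting with the input.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: delta holds dy_L/dy_k while walking from output to input.
    grads = [0.0] * len(weights)
    delta = 1.0
    for k in reversed(range(len(weights))):
        y_k = activations[k + 1]
        local = y_k * (1.0 - y_k)  # sigmoid'(u_k), written in terms of y_k
        grads[k] = delta * local * activations[k]
        delta *= local * weights[k]
    return grads

weights = [1.0] * 8  # hypothetical 8-layer chain, all weights set to 1
grads = layer_gradients(0.5, weights)
for k, g in enumerate(grads, start=1):
    print(f"layer {k}: gradient {g:.3e}")
```

In this toy setup the gradient of the first layer is several orders of magnitude smaller than that of the last layer, so a single learning rate that suits the last layer barely moves the first one.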

— Lucas

I would argue that the learning rate is mostly tied to the exploding gradient problem. Scaling the gradient down with an excessively low learning rate does not at all prevent vanishing gradients; it just delays the effect, as learning slows down considerably. The effect itself is caused by the repeated application of nonlinearities and the multiplication of small values. Of course there is a trend towards smaller learning rates (due to computing power), but that has nothing to do with vanishing gradients, as it only controls how well the state space is explored (given stable conditions).

— runDOSrun