Deep Neural Network - Backpropagation with ReLU

I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure whether I am on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the real value and $\hat y$ is the predicted value. Also assume that $x$ > 0 always.

1 Layer ReLU, where the weight at the 1st layer is $w_1$

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 Layer ReLU, where the weight at the 1st layer is $w_2$ and the weight at the 2nd layer is $w_1$, and I want to update the 1st layer $w_2$

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 Layer ReLU, where the weight at the 1st layer is $w_3$, the 2nd layer $w_2$, and the 3rd layer $w_1$

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

Here the chain rule only takes 2 derivatives, whereas with a sigmoid it could take as many as the number of layers $n$.

Say I want to update the weights of all 3 layers, where $w_1$ is the 3rd layer, $w_2$ is the 2nd layer, and $w_3$ is the 1st layer

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? With sigmoid we have a lot of multiplications by 0.25 in the equation, whereas ReLU does not have any constant value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

neural-network backpropagation

— user1157751

@NeilSlater Thanks for your reply! Could you elaborate? I am not sure what you meant.

— user1157751

Ah, I think I know what you meant. The reason I asked this question is that I am not sure whether the derivation is correct. I searched around and could not find an example of ReLU backpropagation derived completely from scratch.

— user1157751

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

The derivative is the unit step function. This ignores a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0 or 0.5 with no real impact on neural network performance.
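As a minimal sketch, the two definitions above can be written in plain Python. The function names `relu` and `step` are chosen here for illustration, and the value at exactly zero follows the convention of the formulas above.

```python
def relu(x: float) -> float:
    """ReLU as defined above: 0 for x < 0, x otherwise."""
    return x if x >= 0 else 0.0


def step(x: float) -> float:
    """Derivative of ReLU under the convention above (1 at x == 0)."""
    return 1.0 if x >= 0 else 0.0


# A few sample points on either side of zero.
samples = [-2.0, 0.0, 3.0]
relu_values = [relu(v) for v in samples]
step_values = [step(v) for v in samples]
```

Changing `step` to return 0 or 0.5 at exactly zero would only affect inputs that land precisely on that point.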

Simple network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I will add that for completeness - call it $z$ - and add some indexing by layer. I prefer lower case for vectors and upper case for matrices, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ its input, and $W^{(0)}$ the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix - why will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function w.r.t. an example estimate is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$
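The same result follows without expanding the square, by applying the chain rule with the substitution $U = y - r^{(1)}$, so that $C = \frac{1}{2}U^2$:

$\frac{\partial C}{\partial r^{(1)}} = \frac{\partial C}{\partial U}\frac{\partial U}{\partial r^{(1)}} = U \cdot (-1) = -(y - r^{(1)}) = r^{(1)} - y$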

Using the chain rule to back propagate to the pre-transform ( $z$ ) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

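This $\frac{\partial C}{\partial z^{(1)}}$ is an interim value that links the backpropagation steps together. To get the gradient with respect to the weight $W^{(0)}$, apply the chain rule once more: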

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

. . . because $z^{(1)} = W^{(0)}x$ therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
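A quick way to gain confidence in a formula like this is to compare it with a finite-difference estimate of the cost. The sketch below does that for the single-neuron case; the values of `w0`, `x` and `y` are arbitrary numbers picked for illustration, not taken from the question.

```python
def relu(z: float) -> float:
    return z if z >= 0 else 0.0


def step(z: float) -> float:
    return 1.0 if z >= 0 else 0.0


def cost(w0: float, x: float, y: float) -> float:
    """C = 1/2 * (y - ReLU(W0 * x))^2 for the 1 layer, 1 neuron network."""
    return 0.5 * (y - relu(w0 * x)) ** 2


def grad_w0(w0: float, x: float, y: float) -> float:
    """dC/dW0 = (ReLU(z1) - y) * Step(z1) * x, with z1 = W0 * x."""
    z1 = w0 * x
    return (relu(z1) - y) * step(z1) * x


# Hypothetical example values (assumptions for this sketch).
w0, x, y = 0.5, 2.0, 3.0

analytic = grad_w0(w0, x, y)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (cost(w0 + eps, x, y) - cost(w0 - eps, x, y)) / (2 * eps)
```

For these values $z^{(1)} = 1$, so the neuron is active and the analytic gradient is $(1 - 3) \cdot 1 \cdot 2 = -4$, which the numeric estimate should match closely.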

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer $(k)$ indexed by $i$ , and Layer $(k+1)$ indexed by $j$ . The weights are now a matrix. So our feed-forward equations look like this:

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient w.r.t. $r^{output}_j$ is still $r^{output}_j - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$ - just note that this is ultimately where the output cost function gradients enter. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
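The three equations can be sketched for one pair of layers in plain Python, using nested lists with `W[i][j]` matching the index order $W^{(k)}_{ij}$ above. The function names and the example numbers are assumptions made for this illustration.

```python
def relu(z: float) -> float:
    return z if z >= 0 else 0.0


def step(z: float) -> float:
    return 1.0 if z >= 0 else 0.0


def forward_layer(W, r_k):
    """Feed-forward: z_j = sum_i W[i][j] * r_k[i], then r_j = ReLU(z_j)."""
    n_in, n_out = len(W), len(W[0])
    z_next = [sum(W[i][j] * r_k[i] for i in range(n_in)) for j in range(n_out)]
    r_next = [relu(v) for v in z_next]
    return z_next, r_next


def backward_layer(W, r_k, z_next, dC_dr_next):
    """Apply the three chain-rule equations for layers (k) and (k+1)."""
    n_in, n_out = len(W), len(W[0])
    # Equation 1: gradient at the neuron input, before ReLU.
    dC_dz = [dC_dr_next[j] * step(z_next[j]) for j in range(n_out)]
    # Equation 2: gradient passed back to the previous layer's outputs.
    dC_dr_k = [sum(dC_dz[j] * W[i][j] for j in range(n_out)) for i in range(n_in)]
    # Equation 3: gradient for each weight.
    dC_dW = [[dC_dz[j] * r_k[i] for j in range(n_out)] for i in range(n_in)]
    return dC_dr_k, dC_dW


# Hypothetical 2-input, 2-output layer acting as the output layer.
W = [[1.0, -2.0], [0.5, 0.5]]
r_k = [1.0, 2.0]
y = [1.0, 1.0]

z_next, r_next = forward_layer(W, r_k)
dC_dr_next = [r_next[j] - y[j] for j in range(len(y))]  # r_output - y
dC_dr_k, dC_dW = backward_layer(W, r_k, z_next, dC_dr_next)
```

In this example the second output neuron has a negative input, so its step factor is 0 and none of its error reaches the weights or the previous layer.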

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$ , applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$ ), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$ . The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
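To get a feel for the size of the effect, the sketch below multiplies together only the activation-derivative factors across a stack of layers, deliberately ignoring the weights. The depth of 20 layers is an arbitrary choice for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, y * (1 - y) with y = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


n_layers = 20  # hypothetical depth

# Best case for sigmoid: every neuron sits at x = 0, giving 0.25 per layer.
best_case_sigmoid = sigmoid_grad(0.0) ** n_layers

# Active ReLU path: every neuron has positive input, giving 1 per layer.
active_relu = 1.0 ** n_layers
```

Even in its best case the sigmoid factor shrinks to roughly 1e-12 after 20 layers, while an all-active ReLU path contributes a factor of 1. A ReLU path that passes through any inactive neuron contributes 0 instead, which is why the advantage is not guaranteed.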

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
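The effect of the weights alone can be illustrated with a chain of single-neuron ReLU layers. In this sketch the gradient of the output with respect to the input is the product of each layer's step factor and weight; the weight values 0.5 and 1.5 and the depth of 50 are assumptions chosen to make the effect visible.

```python
def relu(z: float) -> float:
    return z if z >= 0 else 0.0


def step(z: float) -> float:
    return 1.0 if z >= 0 else 0.0


def input_gradient(weights, x):
    """d(output)/d(input) for a chain of layers r = ReLU(w * r_prev)."""
    r = x
    zs = []
    for w in weights:  # forward pass, remembering each pre-activation
        z = w * r
        zs.append(z)
        r = relu(z)
    grad = 1.0
    for w, z in zip(reversed(weights), reversed(zs)):  # backward pass
        grad *= step(z) * w
    return grad


# Positive input and positive weights keep every neuron active.
small = input_gradient([0.5] * 50, 1.0)  # weights below 1: gradient shrinks
large = input_gradient([1.5] * 50, 1.0)  # weights above 1: gradient grows
```

With every neuron active, the step factors are all 1 and the result is just the product of the weights, so it vanishes or explodes depending on their size regardless of the activation function.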

— Neil Slater

Was a chain rule performed on $\frac{dC}{d \hat y}$ ?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function $C$ is simple enough that you can take its derivative immediately. The only thing I have not shown there is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$. Wouldn't we need to apply the chain rule in order to take the derivative with respect to $\hat y$ ? $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$, where $U = y - \hat y$. Apologies for asking such simple questions, my maths ability is probably causing trouble for you :(

— user1157751

If you can make things simpler by expanding, then please do expand the square.

— user1157751

@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.

— Neil Slater