Step-by-step example of reverse-mode automatic differentiation

Not sure if this question belongs here, but it is related to gradient methods in optimization, which seems to be on-topic here. Anyway, feel free to migrate it if you think some other community has more expertise on the topic.

In short, I'm looking for a step-by-step example of reverse-mode automatic differentiation. There is not much literature on the topic, and existing implementations (like the one in TensorFlow) are hard to understand without knowing the theory behind them. So I'd be very thankful if somebody could show in detail what we pass in, how we process it, and what we take out of the computational graph.

A couple of questions I have the most difficulty with:

seeds - why do we need them at all?
reverse differentiation rules - I know how to do forward differentiation, but how do we go backward? E.g. in the example from this section, how do we know that $\bar{w_2}=\bar{w_3}w_1$ ?
do we work with symbols only, or pass through actual values? E.g. in the same example, are $w_i$ and $\bar{w_i}$ symbols or values?

— ffriend

"Hands-On Machine Learning with Scikit-Learn & TensorFlow", Appendix D, gives a very good explanation in my opinion. I recommend it.

— Agustin Barrachina

Let's say we have the expression $z = x_1x_2 + \sin(x_1)$ and want to find the derivatives $\frac{dz}{dx_1}$ and $\frac{dz}{dx_2}$ . Reverse-mode AD splits this task into 2 parts, namely the forward and reverse passes.

Forward pass

First, we decompose our complex expression into a set of primitive ones, i.e. expressions consisting of at most a single function call. Note that I also rename the input and output variables for consistency, though it's not necessary:

$w_1 = x_1$

$w_2 = x_2$

$w_3 = w_1w_2$

$w_4 = \sin(w_1)$

$w_5 = w_3 + w_4$

$z = w_5$

The advantage of this representation is that the differentiation rules for each separate expression are already known. For example, we know that the derivative of $\sin$ is $\cos$ , and so $\frac{dw_4}{dw_1} = \cos(w_1)$ . We will use this fact in the reverse pass below.

Essentially, the forward pass consists of evaluating each of these expressions and saving the results. Say our inputs are $x_1 = 2$ and $x_2 = 3$ . Then we have:

$w_1 = x_1 = 2$

$w_2 = x_2 = 3$

$w_3 = w_1w_2 = 6$

$w_4 = \sin(w_1) \approx 0.9$

$w_5 = w_3 + w_4 \approx 6.9$

$z = w_5 \approx 6.9$
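The forward pass can be written out directly in code. The following is only a minimal sketch in plain Python; the function name `forward_pass` and the dictionary of saved values are my own choices for illustration, not part of any library.

```python
import math

def forward_pass(x1, x2):
    """Evaluate z = x1*x2 + sin(x1) one primitive at a time, saving every result."""
    w1 = x1
    w2 = x2
    w3 = w1 * w2
    w4 = math.sin(w1)
    w5 = w3 + w4
    z = w5
    # The intermediates are kept because the reverse pass needs them later.
    return {"w1": w1, "w2": w2, "w3": w3, "w4": w4, "w5": w5, "z": z}

values = forward_pass(2.0, 3.0)
print(values)
```

With inputs 2 and 3 this gives $w_3 = 6$ and $z \approx 6.909$ , matching the rounded numbers above.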

Reverse pass

This is where the magic starts, and it starts with the chain rule. In its basic form, the chain rule states that if you have a variable $t(u(v))$ which depends on $u$ which, in turn, depends on $v$ , then:

$\frac{dt}{dv} = \frac{dt}{du}\frac{du}{dv}$

or, if $t$ depends on $v$ via several paths / variables $u_i$ , e.g.:

$u_1 = f(v)$

$u_2 = g(v)$

$t = h(u_1, u_2)$

then (see proof here):

$\frac{dt}{dv} = \sum_i \frac{dt}{du_i}\frac{du_i}{dv}$
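To make the multi-path rule concrete, here is a small numerical check. The particular functions are assumptions picked only for illustration: $f(v) = v^2$ , $g(v) = \sin(v)$ and $h(u_1, u_2) = u_1u_2$ . The sketch compares the chain-rule sum against a central finite difference.

```python
import math

def t_of_v(v):
    """t = h(u1, u2) with the illustrative choices u1 = v**2, u2 = sin(v), h = u1*u2."""
    u1 = v ** 2
    u2 = math.sin(v)
    return u1 * u2

def dt_dv_chain_rule(v):
    """Sum over both paths: dt/du1 * du1/dv + dt/du2 * du2/dv."""
    u1 = v ** 2
    u2 = math.sin(v)
    dt_du1, dt_du2 = u2, u1                # partial derivatives of h = u1*u2
    du1_dv, du2_dv = 2 * v, math.cos(v)    # derivatives of f and g
    return dt_du1 * du1_dv + dt_du2 * du2_dv

v = 1.5
eps = 1e-6
numeric = (t_of_v(v + eps) - t_of_v(v - eps)) / (2 * eps)  # central difference
analytic = dt_dv_chain_rule(v)
print(numeric, analytic)
```

The two numbers should agree to several decimal places, which is what the summation formula predicts.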

In terms of the expression graph, if we have a final node $z$ and input nodes $w_i$ , and the path from $z$ to $w_i$ goes through intermediate nodes $w_p$ (i.e. $z = g(w_p)$ where $w_p = f(w_i)$ ), we can find the derivative $\frac{dz}{dw_i}$ as

$\frac{dz}{dw_i} = \sum_{p \in parents(i)} \frac{dz}{dw_p} \frac{dw_p}{dw_i}$

In other words, to calculate the derivative of the output variable $z$ w.r.t. any intermediate or input variable $w_i$ , we only need to know the derivatives of its parents and the formula for the derivative of the primitive expression $w_p = f(w_i)$ .

The reverse pass starts at the end (i.e. $\frac{dz}{dz}$ ) and propagates backward to all dependencies. Here we have (expression for "seed"):

$\frac{dz}{dz} = 1$

That may be read as "change in $z$ results in exactly the same change in $z$ ", which is quite obvious.

Then we know that $z = w_5$ and so:

$\frac{dz}{dw_5} = 1$

$w_5$ linearly depends on $w_3$ and $w_4$ , so $\frac{dw_5}{dw_3} = 1$ and $\frac{dw_5}{dw_4} = 1$ . Using the chain rule we find:

$\frac{dz}{dw_3} = \frac{dz}{dw_5} \frac{dw_5}{dw_3} = 1 \times 1 = 1$

$\frac{dz}{dw_4} = \frac{dz}{dw_5} \frac{dw_5}{dw_4} = 1 \times 1 = 1$

From the definition $w_3 = w_1w_2$ and the rules of partial derivatives, we find that $\frac{dw_3}{dw_2} = w_1$ . Thus:

$\frac{dz}{dw_2} = \frac{dz}{dw_3} \frac{dw_3}{dw_2} = 1 \times w_1 = w_1$

Which, as we already know from the forward pass, is:

$\frac{dz}{dw_2} = w_1 = 2$

Finally, $w_1$ contributes to $z$ via $w_3$ and $w_4$ . Once again, from the rules of partial derivatives we know that $\frac{dw_3}{dw_1} = w_2$ and $\frac{dw_4}{dw_1} = \cos(w_1)$ . Thus:

$\frac{dz}{dw_1} = \frac{dz}{dw_3} \frac{dw_3}{dw_1} + \frac{dz}{dw_4} \frac{dw_4}{dw_1} = w_2 + \cos(w_1)$

And again, given known inputs, we can calculate it:

$\frac{dz}{dw_1} = w_2 + \cos(w_1) = 3 + \cos(2) \approx 2.58$

Since $w_1$ and $w_2$ are just aliases for $x_1$ and $x_2$ , we get our answer:

$\frac{dz}{dx_1} \approx 2.58$

$\frac{dz}{dx_2} = 2$

And that's it!
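The whole walkthrough, forward and reverse pass together, fits in a few lines of plain Python. This is only a hand-written sketch for this one expression, not a general AD engine; the name `reverse_mode` and the `dz_dw*` variables are mine. In the bar notation from the question, `dz_dw2` plays the role of $\bar{w_2}$ and `dz_dw3` the role of $\bar{w_3}$ , so the line computing `dz_dw2` is exactly $\bar{w_2}=\bar{w_3}w_1$ .

```python
import math

def reverse_mode(x1, x2):
    """Return z and its derivatives w.r.t. x1 and x2 for z = x1*x2 + sin(x1)."""
    # Forward pass: evaluate and keep every intermediate value.
    w1, w2 = x1, x2
    w3 = w1 * w2
    w4 = math.sin(w1)
    w5 = w3 + w4
    z = w5

    # Reverse pass: start from the seed dz/dz = 1 and walk back through the graph.
    dz_dz = 1.0
    dz_dw5 = dz_dz                                 # z = w5
    dz_dw3 = dz_dw5 * 1.0                          # dw5/dw3 = 1
    dz_dw4 = dz_dw5 * 1.0                          # dw5/dw4 = 1
    dz_dw2 = dz_dw3 * w1                           # dw3/dw2 = w1
    dz_dw1 = dz_dw3 * w2 + dz_dw4 * math.cos(w1)   # two paths: via w3 and via w4
    return z, dz_dw1, dz_dw2

z, dz_dx1, dz_dx2 = reverse_mode(2.0, 3.0)
print(round(dz_dx1, 2), dz_dx2)  # prints: 2.58 2.0
```

Note that every quantity here is a number, not a symbol: the reverse pass multiplies concrete values saved during the forward pass.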

This description concerns only scalar inputs, i.e. numbers, but in fact it can also be applied to multidimensional arrays such as vectors and matrices. Two things to keep in mind when differentiating expressions with such objects:

Derivatives may have much higher dimensionality than the inputs or the output, e.g. the derivative of a vector w.r.t. a vector is a matrix, and the derivative of a matrix w.r.t. a matrix is a 4-dimensional array (sometimes referred to as a tensor). In many cases such derivatives are very sparse.
Each component of the output array is an independent function of one or more components of the input array(s). E.g. if $y = f(x)$ and both $x$ and $y$ are vectors, $y_i$ never depends on $y_j$ , but only on a subset of the $x_k$ . In particular, this means that finding the derivative $\frac{dy_i}{dx_j}$ boils down to tracking how $y_i$ depends on $x_j$ .
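As a small illustration of both points, take an elementwise function, assumed here to be $y_i = \sin(x_i)$ purely as an example. Its derivative w.r.t. the input vector is a matrix, and because each $y_i$ depends only on $x_i$ , that matrix is diagonal (hence sparse). The helper names below are hypothetical.

```python
import math

def f(x):
    """Illustrative elementwise function: y_i = sin(x_i)."""
    return [math.sin(xi) for xi in x]

def jacobian(x):
    """Matrix of dy_i/dx_j: cos(x_i) on the diagonal, zero everywhere else."""
    n = len(x)
    return [[math.cos(x[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]

x = [0.0, 1.0, 2.0]
J = jacobian(x)
for row in J:
    print(row)
```

For a 3-element input the result is a 3x3 matrix with only 3 nonzero entries, which is why practical implementations usually avoid materializing such derivatives in full.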

The power of automatic differentiation is that it can deal with complicated structures from programming languages, such as conditionals and loops. However, if all you need is algebraic expressions and you have a good enough framework for working with symbolic representations, it's possible to construct fully symbolic expressions. In fact, in this example we could produce the expression $\frac{dz}{dw_1} = w_2 + \cos(w_1) = x_2 + \cos(x_1)$ and calculate this derivative for whatever inputs we want.
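To show how loops fit in, here is a toy reverse-mode implementation based on operator overloading. It is a sketch under my own assumptions (the `Var` class, `sin`, `backward` and `power_by_loop` are all hypothetical names, and only `+`, `*` and `sin` are supported); real frameworks are organized differently. The graph is recorded while ordinary Python code runs, so a loop simply produces a longer graph.

```python
import math

class Var:
    """A value plus the inputs it was computed from and the local derivatives."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs   # tuples of (input Var, d(self)/d(input))
        self.grad = 0.0        # will hold d(output)/d(self) after backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    """Propagate derivatives from `output` back to every node it depends on."""
    order, seen = [], set()

    def visit(node):           # depth-first search: inputs are listed before users
        if id(node) in seen:
            return
        seen.add(id(node))
        for inp, _ in node.inputs:
            visit(inp)
        order.append(node)

    visit(output)
    output.grad = 1.0          # the seed dz/dz = 1
    for node in reversed(order):
        for inp, local in node.inputs:
            inp.grad += node.grad * local   # sum over all paths (chain rule)

def power_by_loop(x, n):
    """x**n computed with a plain Python loop; n is assumed to be >= 1."""
    result = x
    for _ in range(n - 1):
        result = result * x
    return result

x = Var(2.0)
y = power_by_loop(x, 3)
backward(y)
print(y.value, x.grad)         # value of x**3 and its derivative 3*x**2 at x = 2

x1, x2 = Var(2.0), Var(3.0)
z = x1 * x2 + sin(x1)
backward(z)
print(z.value, x1.grad, x2.grad)
```

The second call reproduces the worked example: `x1.grad` comes out as $3 + \cos(2)$ and `x2.grad` as 2.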

— ffriend

Very useful question/answer. Thanks. Just a little criticism: you seem to move on a tree structure without explaining it (that's when you start talking about parents, etc.)

— MadHatter

Also it won't hurt clarifying why we need seeds.

— MadHatter

@MadHatter thanks for the comment. I tried to rephrase a couple of paragraphs (those that refer to parents) to emphasize the graph structure. I also added "seed" to the text, although this name itself may be misleading in my opinion: in AD the seed is always a fixed expression, $\frac{dz}{dz} = 1$ , not something you can choose or generate.

— ffriend

Thanks! I noticed when you have to set more than one "seed", generally one chooses 1 and 0. I'd like to know why. I mean, one takes the "quotient" of a differential w.r.t. itself, so "1" is at least intuitively justified.. But what about 0? And what if one has to pick more than 2 seeds?

— MadHatter

As far as I understand, more than one seed is used only in forward-mode AD. In this case you set the seed to 1 for an input variable you want to differentiate with respect to and set the seed to 0 for all the other input variables so that they don't contribute to the output value. In reverse-mode you set the seed to an output variable, and you normally have only one output variable. I guess, you can construct reverse-mode AD pipeline with several output variables and set all of them but one to 0 to get the same effect as in forward mode, but I have never investigated this option.

— ffriend
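The forward-mode seeding described in the last comment can be sketched with dual numbers. This is an illustrative toy, not any library's API: the `Dual` class and `dual_sin` are hypothetical names. Each input carries a derivative slot; setting it to 1 for one input and 0 for the others selects which partial derivative comes out.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. the chosen (seeded) input."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule applied to the derivative part.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_sin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def z(x1, x2):
    return x1 * x2 + dual_sin(x1)

# Seed 1 for x1 and 0 for x2 gives dz/dx1; swapping the seeds gives dz/dx2.
dz_dx1 = z(Dual(2.0, 1.0), Dual(3.0, 0.0)).deriv
dz_dx2 = z(Dual(2.0, 0.0), Dual(3.0, 1.0)).deriv
print(dz_dx1, dz_dx2)
```

Forward mode needs one run per input variable, whereas the reverse pass in the answer yields both derivatives from a single seed on the output.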