Why are second-order derivatives useful in convex optimization?

18

I guess this is a basic question, and it has to do with the direction of the gradient itself, but I'm looking for examples where second-order methods (e.g. BFGS) are more effective than simple gradient descent.

optimization

— Bar

3

Is it too simplistic to just observe that "finding the vertex of a parabola" is a better approximation to the problem of "finding the minimum" than "finding the minimum of this linear function" (which, of course, has no minimum because it is linear)?

20

Here's a common framework for interpreting both gradient descent and Newton's method, which may be a useful way to think about the difference, as a supplement to @Sycorax's answer. (BFGS approximates Newton's method; I won't talk about it in particular here.)

We're minimizing the function $f$, but we don't know how to do that directly. So, instead, we take a local approximation at our current point $x$ and minimize that.

Newton's method approximates the function using a second-order Taylor expansion:

$f(y) \approx N_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y - x)^T \, \nabla^2 f(x) \, (y - x) ,$

where $\nabla f(x)$ denotes the gradient of $f$ at the point $x$ and $\nabla^2 f(x)$ the Hessian at $x$. It then steps to $\arg\min_y N_x(y)$ and repeats.

Gradient descent, having only the gradient and not the Hessian, can't just make a first-order approximation and minimize that, since, as @Hurkyl noted, it has no minimum. Instead, we define a step size $t$ and step to $x - t \nabla f(x)$. But note that

$\begin{aligned} x - t \,\nabla f(x) &= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2 t} \lVert y - x \rVert^2\right] \\ &= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x)\right] . \end{aligned}$

Thus gradient descent minimizes the function

$G_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x).$

Thus gradient descent is kind of like using Newton's method, but instead of taking the second-order Taylor expansion, we pretend that the Hessian is $\tfrac1t I$. This $G$ is often a substantially worse approximation to $f$ than $N$, and hence gradient descent often takes much worse steps than Newton's method. This is counterbalanced, of course, by each step of gradient descent being much cheaper to compute than each step of Newton's method. Which is better depends entirely on the nature of the problem, your computational resources, and your accuracy requirements.
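To make the "same framework, different curvature" idea concrete, here is a minimal one-dimensional sketch in plain Python. The test function $f(x) = e^x - 2x$, the starting point, and the step size `t = 0.1` are assumptions chosen only for illustration; `minimize_model` and `iterate` are hypothetical helper names, not part of any library.

```python
import math

# Illustrative convex function (an assumption, not from the thread):
# f(x) = exp(x) - 2x, which is minimized at x = ln 2.
def f_prime(x):
    return math.exp(x) - 2.0

def f_second(x):
    return math.exp(x)

def minimize_model(x, slope, curvature):
    """Minimizer over y of slope*(y - x) + 0.5*curvature*(y - x)**2, for curvature > 0."""
    return x - slope / curvature

def iterate(curvature_at, x0=0.0, tol=1e-10, max_iter=10_000):
    """Repeatedly minimize the local quadratic model; return (x, iterations used)."""
    x = x0
    for k in range(max_iter):
        if abs(f_prime(x)) < tol:
            return x, k
        x = minimize_model(x, f_prime(x), curvature_at(x))
    return x, max_iter

t = 0.1  # gradient-descent step size (hypothetical choice)

# N_x: the model uses the true second derivative (Newton's method).
x_newton, newton_iters = iterate(f_second)
# G_x: the model pretends the curvature is 1/t everywhere (gradient descent).
x_gd, gd_iters = iterate(lambda x: 1.0 / t)

print(x_newton, newton_iters)
print(x_gd, gd_iters)
```

Both runs call the same `minimize_model`; only the curvature fed to it differs, which is the point of the comparison. On this particular function Newton's method needs far fewer iterations, though each of its iterations also evaluates the second derivative.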

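Looking at @Sycorax's example of minimizing a quadratic

$f(x) = \tfrac12 x^T A x + d^T x + c$

helps illustrate the difference. For Newton's method, the second-order Taylor expansion of a quadratic is exact, so $N = f$ and a single step lands on the exact minimizer (up to floating-point error).

Gradient descent, on the other hand, uses

$G_x(y) = f(x) + (A x + d)^T (y - x) + \tfrac12 (y - x)^T \tfrac1t I (y - x) ,$

whose tangent plane at $x$ is correct, but whose curvature is $\tfrac1t I$ regardless of $A$, so it throws away the differences between directions when the eigenvalues of $A$ vary substantially.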
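A small sketch of the quadratic case, in plain Python with a hand-rolled 2x2 solve so it has no dependencies. The matrix `A`, the vector `d`, the starting point, and the step size are made-up values for illustration, and the helper names are hypothetical.

```python
# Hypothetical 2-D quadratic f(x) = 0.5 x^T A x + d^T x (values chosen for illustration).
A = [[10.0, 2.0],
     [2.0, 1.0]]   # symmetric positive definite: determinant 6 > 0, trace 11 > 0
d = [-1.0, 3.0]

def grad(x):
    """Gradient A x + d."""
    return [A[0][0] * x[0] + A[0][1] * x[1] + d[0],
            A[1][0] * x[0] + A[1][1] * x[1] + d[1]]

def grad_norm(x):
    g = grad(x)
    return (g[0] ** 2 + g[1] ** 2) ** 0.5

def solve_2x2(M, b):
    """Solve M z = b for a 2x2 matrix M using Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def newton_step(x):
    """Minimize N_x: step by -A^{-1} (A x + d), since the Hessian of f is A."""
    p = solve_2x2(A, grad(x))
    return [x[0] - p[0], x[1] - p[1]]

def gradient_step(x, t):
    """Minimize G_x: step by -t (A x + d), i.e. pretend the Hessian is (1/t) I."""
    g = grad(x)
    return [x[0] - t * g[0], x[1] - t * g[1]]

x0 = [5.0, -5.0]

# Newton: one step, because N_x equals f for a quadratic.
x_newton = newton_step(x0)

# Gradient descent: count the steps needed to reach the same gradient tolerance.
x_gd, gd_steps = x0, 0
while grad_norm(x_gd) > 1e-8 and gd_steps < 100_000:
    x_gd = gradient_step(x_gd, t=0.1)
    gd_steps += 1

print(grad_norm(x_newton), gd_steps)
```

The eigenvalues of this `A` are roughly 10.4 and 0.58, so with `t = 0.1` the model $G_x$ is reasonable along the steep direction and far too stiff along the flat one, which is where gradient descent spends most of its steps.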

— Dougal

1

This is similar to @Aksakal's answer, but in more depth.

— Dougal

1

(+1) This is a great addition!

— Sycorax says Reinstate Monica

17

Essentially, the advantage of a second-derivative method like Newton's method is that it has the quality of quadratic termination. This means that it can minimize a quadratic function in a finite number of steps. A method like gradient descent depends heavily on the learning rate, which can cause optimization either to converge slowly because it is bouncing around the optimum, or to diverge entirely. Stable learning rates can be found... but involve computing the Hessian. Even when using a stable learning rate, you can have problems like oscillation around the optimum, i.e. you won't always take a "direct" or "efficient" path towards the minimum. So it can take many iterations to terminate, even if you're relatively close to it. BFGS and Newton's method can converge more quickly even though the computational effort of each step is more expensive.

To your request for examples: suppose you have the objective function

$F(x)=\frac{1}{2}x^TAx+d^Tx+c.$

The gradient is

$\nabla F(x)=Ax+d,$

and putting it into the steepest descent form with constant learning rate $\alpha$ gives

$x_{k+1}= x_k-\alpha(Ax_k+d) = (I-\alpha A)x_k-\alpha d.$

This will be stable if the magnitudes of the eigenvalues of $I-\alpha A$ are less than 1. We can use this property to show that a stable learning rate satisfies

$\alpha<\frac{2}{\lambda_{max}},$

where $\lambda_{max}$ is the largest eigenvalue of $A$. The steepest descent algorithm's convergence rate is limited by the largest eigenvalue, and the routine will converge most quickly in the direction of its corresponding eigenvector. Likewise, it will converge most slowly in the direction of the eigenvector of the smallest eigenvalue. When there is a large disparity between the large and small eigenvalues of $A$, gradient descent will be slow. Any problem whose $A$ has this property will converge slowly using gradient descent.
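The stability bound can be checked numerically with a short sketch. It assumes a diagonal $A$ so the eigenvalues can be read off directly; the eigenvalues 100 and 1, the vector `d`, and the two learning rates are illustrative choices, and `run_gradient_descent` is a hypothetical helper name.

```python
# Diagonal A so that its eigenvalues are simply the diagonal entries (illustrative values).
eigenvalues = [100.0, 1.0]   # lambda_max = 100, lambda_min = 1
d = [1.0, 1.0]

# The minimizer solves A x + d = 0, which is componentwise for diagonal A.
x_star = [-di / lam for di, lam in zip(d, eigenvalues)]

def run_gradient_descent(alpha, steps, x0=(1.0, 1.0)):
    """Iterate x_{k+1} = x_k - alpha * (A x_k + d); return the largest
    componentwise distance to the minimizer after `steps` iterations."""
    x = list(x0)
    for _ in range(steps):
        x = [xi - alpha * (lam * xi + di)
             for xi, lam, di in zip(x, eigenvalues, d)]
    return max(abs(xi - si) for xi, si in zip(x, x_star))

lam_max = max(eigenvalues)

# Just below the bound 2 / lambda_max: the iteration converges.
err_stable = run_gradient_descent(alpha=1.9 / lam_max, steps=200)
# Just above the bound: the component along the lambda_max direction blows up.
err_unstable = run_gradient_descent(alpha=2.1 / lam_max, steps=200)

# Per-direction contraction factors |1 - alpha * lambda_i| at the stable rate.
factors = [abs(1.0 - (1.9 / lam_max) * lam) for lam in eigenvalues]

print(err_stable, err_unstable, factors)
```

At the stable rate, the error along the large-eigenvalue direction shrinks by a factor of about 0.9 per step while the error along the small-eigenvalue direction shrinks by only about 0.981, so the remaining error after 200 steps is dominated by the small-eigenvalue direction. The largest eigenvalue caps the learning rate, and the smallest one then sets the pace.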

In the specific context of neural networks, the book Neural Network Design has quite a bit of information on numerical optimization methods. The above discussion is a condensation of section 9-7.

— Sycorax says Reinstate Monica

Great answer! I'm accepting @Dougal's answer as I think it provides a simpler explanation.

— Bar

6

In convex optimization you are approximating the function as a second-degree polynomial. In the one-dimensional case:

$f(x)=c+\beta x + \alpha x^2$

In this case the second derivative is

$\partial^2 f(x)/\partial x^2=2\alpha$

If you know the derivatives, then it's easy to get the next guess for the optimum:

$\text{guess}=-\frac{\beta}{2\alpha}$

The multivariate case is very similar, just use gradients for derivatives.
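As a rough sketch of how that guess is used on a function that is not itself quadratic: fit the quadratic that matches the first and second derivatives at the current point, jump to its vertex, and repeat. The function $f(x) = x^4/4 + x^2/2 - x$ and the starting point are assumptions for illustration; the helper names are hypothetical.

```python
def vertex_guess(c, beta, alpha):
    """Minimizer of c + beta*x + alpha*x**2, valid when alpha > 0."""
    return -beta / (2.0 * alpha)

# Illustrative convex function (an assumption): f(x) = x**4/4 + x**2/2 - x.
def f_prime(x):
    return x ** 3 + x - 1.0

def f_second(x):
    return 3.0 * x ** 2 + 1.0

def local_quadratic(x0):
    """Coefficients (beta, alpha) of the quadratic whose first and second
    derivatives match those of f at x0."""
    alpha = 0.5 * f_second(x0)
    beta = f_prime(x0) - f_second(x0) * x0
    return beta, alpha

x = 2.0
for _ in range(20):
    beta, alpha = local_quadratic(x)
    # The constant term c does not affect where the minimum is, so 0.0 is passed.
    x = vertex_guess(0.0, beta, alpha)

print(x, f_prime(x))
```

Expanding the coefficients shows that each guess equals $x_0 - f'(x_0)/f''(x_0)$, which is the one-dimensional Newton step.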

— Aksakal

2

@Dougal already gave a great technical answer.

The no-maths explanation is that while the linear (order 1) approximation provides a “plane” that is tangential to a point on an error surface, the quadratic approximation (order 2) provides a surface that hugs the curvature of the error surface.

The videos on this link do a great job of visualizing this concept. They display order 0, order 1 and order 2 approximations to the function surface, which just intuitively verifies what the other answers present mathematically.

Also, a good blogpost on the topic (applied to neural networks) is here.

— Zhubarb