18

I'm trying to understand the role of the derivative of the sigmoid function in neural networks

First I plotted the sigmoid function and its derivative at every point, computed from the definition using Python. What exactly is the role of this derivative?

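import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    # Logistic function: squashes any real input into (0, 1).
    return 1 / (1 + np.exp(-x))

def derivative(x, step):
    # Forward-difference approximation of the slope of sigmoid at x.
    return (sigmoid(x + step) - sigmoid(x)) / step

x = np.linspace(-10, 10, 1000)

y1 = sigmoid(x)
# A step of 1e-13 is close to floating-point precision and gives a noisy
# curve; 1e-6 keeps the finite difference numerically stable.
y2 = derivative(x, 1e-6)

plt.plot(x, y1, label='sigmoid')
plt.plot(x, y2, label='derivative')
plt.legend(loc='upper left')
plt.show()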

machine-learning neural-network

— lukassz

2

If you have any further questions, don't hesitate to ask.

— JahKnows

23

Derivatives are used in neural networks during the training process, in a procedure called backpropagation. This technique uses gradient descent to find a set of model parameters that minimizes a loss function. In your example you must use the derivative of the sigmoid because that is the activation your individual neurons are using.

The loss function

The essence of machine learning is to optimize a cost function, so that we either minimize or maximize some target function. This is typically called the loss or cost function, and we usually want to minimize it. The cost function, $C$, associates some penalty with the errors that result when passing data through your model, as a function of the model parameters.

Let's look at an example where we try to label whether an image contains a cat or a dog. If we had a perfect model, we could give it a picture and it would tell us whether it is a cat or a dog. However, no model is perfect, and it will make mistakes.

When we train our model to infer meaning from input data, we want to minimize the number of mistakes it makes. So we use a training set: this data contains a lot of pictures of dogs and cats, and we have the ground truth label associated with each image. Each time we run a training iteration of the model, we calculate the cost (the amount of mistakes) of the model. We want to minimize this cost.

Many cost functions exist, each serving its own purpose. A common one is the quadratic cost, which is defined as

$C = \frac{1}{N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$

This is the squared difference between the predicted label and the ground truth label, averaged over the $N$ images that we trained on. We want to minimize this in some way.
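As a rough illustration of the quadratic cost, here is a minimal sketch in plain Python. The labels and predictions are made-up values (1.0 standing for "dog", 0.0 for "cat"), and the function name `quadratic_cost` is only a name chosen for this example.

```python
def quadratic_cost(predictions, labels):
    """Mean of the squared differences between predictions and labels."""
    n = len(labels)
    return sum((p - t) ** 2 for p, t in zip(predictions, labels)) / n

# Hypothetical data: 1.0 means "dog", 0.0 means "cat".
labels = [1.0, 0.0, 1.0, 0.0]
predictions = [0.9, 0.2, 0.6, 0.1]

cost = quadratic_cost(predictions, labels)
perfect_cost = quadratic_cost(labels, labels)

print(f"cost of the imperfect model: {cost:.3f}")
print(f"cost of a perfect model:     {perfect_cost:.3f}")
```

The closer the predictions get to the labels, the closer the cost gets to zero, which is why training tries to push it down.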

Minimizing a loss function

Indeed, most of machine learning is simply a family of frameworks that are capable of determining a distribution by minimizing some cost function. The question we can ask is "how can we minimize a function?"

Let's minimize the following function

$y = x^2-4x+6$

If we plot this we can see that there is a minimum at $x = 2$. To find it analytically, we can take the derivative of this function and set it to zero:

$\frac{dy}{dx} = 2x - 4 = 0$

$x = 2$

However, finding a global minimum analytically is often not feasible, so instead we use optimization techniques. Many different ones exist here as well, such as Newton-Raphson, grid search, etc. Among these is gradient descent. This is the technique used by neural networks.

Gradient descent

Let's use a well-known analogy to understand this. Imagine a 2D minimization problem. This is equivalent to being on a mountain hike in the wilderness. You want to get back down to the village, which you know is at the lowest point. Even if you do not know the cardinal direction of the village, all you need to do is keep taking the steepest way down, and you will eventually get there. So we descend the surface based on the steepness of the slope.

Let's take our function

$y = x^2-4x+6$

We will determine the $x$ for which $y$ is minimized. The gradient descent algorithm first says we pick a random value for $x$. Let us initialize at $x=8$. Then the algorithm does the following iteratively until we reach convergence.

$x^{new} = x^{old} - \nu \frac{dy}{dx}$

where $\nu$ is the learning rate. We can set this to whatever value we like, but there is a smart way to choose it. Too big and we will never reach our minimum value; too small and we will waste a lot of time before we get there. It is analogous to the size of the steps you take down the steep slope. Take steps that are too small and you will die on the mountain, because you'll never get down. Take steps that are too large and you risk overshooting the village and ending up on the other side of the mountain. The derivative is the means by which we travel down this slope towards our minimum.
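To make the effect of the learning rate concrete, here is a small sketch that applies the update rule to $y = x^2-4x+6$ for a fixed number of steps. The three learning rates and the step count are hypothetical values picked only to show the three regimes.

```python
def run_gradient_descent(learning_rate, x0=8.0, steps=30):
    """Apply x <- x - learning_rate * dy/dx for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * (2 * x - 4)  # dy/dx of y = x^2 - 4x + 6
    return x

# Hypothetical learning rates, chosen only to illustrate the behaviour.
too_small = run_gradient_descent(0.001)
reasonable = run_gradient_descent(0.1)
too_large = run_gradient_descent(1.1)

print(f"too small : x = {too_small:.4f}")   # barely moved away from 8
print(f"reasonable: x = {reasonable:.4f}")  # close to the minimum at 2
print(f"too large : x = {too_large:.4f}")   # has moved far away from 2
```

For this particular function each update multiplies the distance to the minimum by $(1 - 2\nu)$, so the iteration only converges for $0 < \nu < 1$; other functions have different limits.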

$\frac{dy}{dx} = 2x - 4$

$\nu = 0.1$

Iterations, starting from $x = 8$ (values truncated to two decimal places):

$x^{new} = 8 - 0.1(2 * 8 - 4) = 6.8$
$x^{new} = 6.8 - 0.1(2 * 6.8 - 4) = 5.84$
$x^{new} = 5.84 - 0.1(2 * 5.84 - 4) = 5.07$
$x^{new} = 5.07 - 0.1(2 * 5.07 - 4) = 4.45$
$x^{new} = 4.45 - 0.1(2 * 4.45 - 4) = 3.96$
$x^{new} = 3.96 - 0.1(2 * 3.96 - 4) = 3.57$
$x^{new} = 3.57 - 0.1(2 * 3.57 - 4) = 3.25$
$x^{new} = 3.25 - 0.1(2 * 3.25 - 4) = 3.00$
$x^{new} = 3.00 - 0.1(2 * 3.00 - 4) = 2.80$
$x^{new} = 2.80 - 0.1(2 * 2.80 - 4) = 2.64$
$x^{new} = 2.64 - 0.1(2 * 2.64 - 4) = 2.51$
$x^{new} = 2.51 - 0.1(2 * 2.51 - 4) = 2.41$
$x^{new} = 2.41 - 0.1(2 * 2.41 - 4) = 2.32$
$x^{new} = 2.32 - 0.1(2 * 2.32 - 4) = 2.26$
$x^{new} = 2.26 - 0.1(2 * 2.26 - 4) = 2.21$
$x^{new} = 2.21 - 0.1(2 * 2.21 - 4) = 2.16$
$x^{new} = 2.16 - 0.1(2 * 2.16 - 4) = 2.13$
$x^{new} = 2.13 - 0.1(2 * 2.13 - 4) = 2.10$
$x^{new} = 2.10 - 0.1(2 * 2.10 - 4) = 2.08$
$x^{new} = 2.08 - 0.1(2 * 2.08 - 4) = 2.06$
$x^{new} = 2.06 - 0.1(2 * 2.06 - 4) = 2.05$
$x^{new} = 2.05 - 0.1(2 * 2.05 - 4) = 2.04$
$x^{new} = 2.04 - 0.1(2 * 2.04 - 4) = 2.03$
$x^{new} = 2.03 - 0.1(2 * 2.03 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.02$
$x^{new} = 2.02 - 0.1(2 * 2.02 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.01$
$x^{new} = 2.01 - 0.1(2 * 2.01 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$
$x^{new} = 2.00 - 0.1(2 * 2.00 - 4) = 2.00$

And we see that the algorithm converges to $x = 2$! We have found the minimum.
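The iterations above can be reproduced with a few lines of Python. This is only a sketch: the stopping tolerance `tol` and the cap `max_iters` are assumptions added here, since the original simply iterates until the value stops changing.

```python
def dy_dx(x):
    """Derivative of y = x^2 - 4x + 6."""
    return 2 * x - 4

def gradient_descent(x0, learning_rate, tol=1e-6, max_iters=1000):
    """Repeat x <- x - learning_rate * dy/dx until the step is tiny."""
    x = x0
    history = [x]
    for _ in range(max_iters):
        step = learning_rate * dy_dx(x)
        x = x - step
        history.append(x)
        if abs(step) < tol:  # assumed convergence test
            break
    return x, history

x_min, history = gradient_descent(x0=8.0, learning_rate=0.1)

print("first values:", [round(v, 2) for v in history[:4]])
print("converged to:", round(x_min, 4), "after", len(history) - 1, "updates")
```

Note that the derivative shrinks as $x$ approaches 2, so the steps get smaller on their own even though the learning rate stays fixed.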

Applied to neural networks

The first neural networks only had a single neuron, which took in some inputs $x$ and then provided an output $\hat{y}$ . A common function used is the sigmoid function

$\sigma(z) = \frac{1}{1+\exp(-z)}$

$\hat{y}(w^Tx) = \frac{1}{1+\exp(-(w^Tx + b))}$

where $w$ holds the weight associated with each input in $x$ and $b$ is a bias. We then want to minimize our cost function (the factor $\frac{1}{2}$ is a convenience that cancels when we differentiate)

$C = \frac{1}{2N} \sum_{i=1}^{N}(\hat{y}_i - y_i)^2$ .
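A minimal sketch of this single neuron and its cost, assuming the standard sigmoid $\sigma(z) = \frac{1}{1+\exp(-z)}$. The weights, bias, inputs and labels below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, x, b):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def cost(w, b, inputs, labels):
    """Quadratic cost with the 1/(2N) convention."""
    n = len(labels)
    total = sum((neuron_output(w, x, b) - y) ** 2
                for x, y in zip(inputs, labels))
    return total / (2 * n)

# Hypothetical weights, bias, and a batch of two samples.
w = [0.5, -1.0]
b = 0.0
inputs = [[2.0, 1.0], [0.0, 3.0]]
labels = [1.0, 0.0]

y_hat = [neuron_output(w, x, b) for x in inputs]
batch_cost = cost(w, b, inputs, labels)

print("predictions:", [round(p, 4) for p in y_hat])
print("cost:", round(batch_cost, 4))
```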

How to train the neural network?

We will use gradient descent to train the weights based on the output of the sigmoid function and we will use some cost function $C$ and train on batches of data of size $N$ .

$C = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

$\hat{y}$ is the predicted class obtained from the sigmoid function and $y$ is the ground truth label. We will use gradient descent to minimize the cost function with respect to the weights $w$ . To make life easier we will split the derivative as follows

$\frac{\partial C}{\partial w} = \frac{\partial C}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$ .

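Taking a single sample (the batch gradient is the average over the $N$ samples), the first factor is

$\frac{\partial C}{\partial \hat{y}} = \hat{y} - y$

and we have that $\hat{y} = \sigma(z)$ with $z = w^Tx + b$, and the derivative of the sigmoid function is $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$ . Applying the chain rule once more, with $\frac{\partial z}{\partial w} = x$, we have

$\frac{\partial \hat{y}}{\partial w} = \sigma(z)(1-\sigma(z))\,x = \frac{1}{1+\exp(-(w^Tx + b))} \left(1 - \frac{1}{1+\exp(-(w^Tx + b))}\right) x$ .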
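As a quick sanity check, the closed form $\sigma(z)(1-\sigma(z))$ can be compared against a numerical derivative, which is essentially what the plot in the question does. This is a sketch using a central difference; the step `h` and the test points are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-5):
    """Numerical slope of f at z using a symmetric difference."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 1.0, 5.0):
    analytic = sigmoid_prime(z)
    numeric = central_difference(sigmoid, z)
    print(f"z = {z:5.1f}  analytic = {analytic:.6f}  numeric = {numeric:.6f}")
```

The two columns should agree to several decimal places, with the largest slope, 0.25, at $z = 0$.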

So we can then update the weights through gradient descent as

$w^{new} = w^{old} - \eta \frac{\partial C}{\partial w}$

where $\eta$ is the learning rate.
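Putting the pieces together, here is a minimal sketch of training one sigmoid neuron with a single input by gradient descent on the quadratic cost. The toy data set, the learning rate of 0.5 and the 2000 epochs are all assumptions made for this illustration, and the bias is updated with the same rule as the weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradients(w, b, xs, ys):
    """Quadratic cost C = 1/(2N) * sum (y_hat - y)^2 and its gradients."""
    n = len(xs)
    cost = grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w * x + b)
        error = y_hat - y              # dC/dy_hat for this sample
        slope = y_hat * (1.0 - y_hat)  # derivative of the sigmoid
        cost += error ** 2 / (2 * n)
        grad_w += error * slope * x / n
        grad_b += error * slope / n
    return cost, grad_w, grad_b

# Hypothetical toy data: negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]
w, b, eta = 0.0, 0.0, 0.5

initial_cost, _, _ = cost_and_gradients(w, b, xs, ys)
for _ in range(2000):
    _, grad_w, grad_b = cost_and_gradients(w, b, xs, ys)
    w -= eta * grad_w
    b -= eta * grad_b
final_cost, _, _ = cost_and_gradients(w, b, xs, ys)

print(f"cost before training: {initial_cost:.4f}")
print(f"cost after training:  {final_cost:.4f}")
print(f"learned w = {w:.3f}, b = {b:.3f}")
```

The sigmoid derivative appears in every gradient as the `slope` factor, which is exactly the role the question asks about.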

— JahKnows

2

Please tell me, why is this process not described so nicely in books? Do you have a blog? What materials for learning neural networks do you recommend? I have test data and I want to train on it. Can I draw the function that I will minimize? I would like to visualize this process to better understand it.

— lukassz

Can you explain backpropagation in this simple way?

— lukassz

1

Amazing Answer...(+1)

— Aditya

1

Backprop is also similar to what JahKnows has explained above. It's just that the gradient is carried all the way from the outputs back to the inputs. A quick Google search will make this clear. The same goes for every other activation function.

— Aditya

1

@lukassz, notice that his equation is the same as the one I have for the weight update in the second-to-last equation: $\frac{\partial C}{\partial w} = (\hat{y} - y) * \text{derivative of sigmoid}$ . He uses the same cost function as me. Don't forget that you need to take the derivative of the loss function too; that becomes $\hat{y} - y$ , where $\hat{y}$ are the predicted labels and $y$ are the ground truth labels.

— JahKnows

2

During the phase where the neural network generates its prediction, it feeds the input forward through the network. For each layer, the layer's input $X$ first goes through an affine transformation $W \cdot X + b$ and is then passed through the sigmoid function $\sigma(W \cdot X + b)$ .

In order to train the network, the output $\hat y$ is then compared to the expected output (or label) $y$ through a cost function $L(y, \hat y)=L\left(y, \sigma(W \cdot X + b)\right)$ . The goal of the whole training procedure is to minimize that cost function. In order to do that, a technique called gradient descent is performed, which calculates how we should change $W$ and $b$ so that the cost reduces.

Gradient Descent requires calculating the derivative of the cost function w.r.t $W$ and $b$ . In order to do that we must apply the chain rule, because the derivative we need to calculate is a composition of two functions. As dictated by the chain rule we must calculate the derivative of the sigmoid function.

One of the reasons that the sigmoid function is popular with neural networks, is because its derivative is easy to compute.
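For completeness, here is a short derivation of that derivative, sketched under the standard definition $\sigma(z) = \frac{1}{1+\exp(-z)}$:

$\frac{d\sigma}{dz} = \frac{d}{dz}\left(1+\exp(-z)\right)^{-1} = \frac{\exp(-z)}{\left(1+\exp(-z)\right)^2}$

$= \frac{1}{1+\exp(-z)} \cdot \frac{\exp(-z)}{1+\exp(-z)} = \sigma(z)\left(1-\sigma(z)\right)$

The last step uses $\frac{\exp(-z)}{1+\exp(-z)} = 1 - \sigma(z)$ . So once the forward pass has computed $\sigma(z)$ , the derivative costs only one extra subtraction and multiplication.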

— M Sef

1

In simple words:

The derivative shows a neuron's ability to learn from a particular input.

For example, if the input is 0 or 1 or -2, the derivative (the "learning ability") is high and back-propagation will improve the neuron's weights for this sample dramatically.

On the other hand, if the input is 20, the derivative will be very close to 0. It means that back-propagation on this sample will not "teach" this neuron to produce a better result.

The things above are valid for a single sample.
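A small sketch of what this means for one weight update, assuming the quadratic cost and the sigmoid activation. Here `z` is the value fed into the sigmoid, and the sample values, the target and the learning rate `eta` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weight_update(z, x, y, eta=1.0):
    """Gradient-descent change to w for one sample under the quadratic cost."""
    y_hat = sigmoid(z)
    return -eta * (y_hat - y) * y_hat * (1.0 - y_hat) * x

# Both samples have target 0 and input x = 1 (hypothetical values).
update_active = weight_update(z=0.0, x=1.0, y=0.0)      # output 0.5
update_saturated = weight_update(z=20.0, x=1.0, y=0.0)  # output ~1, very wrong

print(f"update at z = 0:  {update_active:.3e}")
print(f"update at z = 20: {update_saturated:.3e}")
```

Even though the saturated neuron is far more wrong, its update is many orders of magnitude smaller, because the tiny derivative scales the whole gradient.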

Let's look at the bigger picture, for all samples in the training set. Here we have several situations:

If the derivative is 0 for all samples in your training set AND the neuron always produces wrong results, it means the neuron is saturated (dumb) and will not improve.
If the derivative is 0 for all samples in your training set AND the neuron always produces correct results, it means the neuron has been studying really well and is already as smart as it can be (side note: this case is good, but it may indicate potential overfitting, which is not good).
If the derivative is 0 on some samples and non-0 on other samples AND the neuron produces mixed results, it indicates that this neuron is doing some good work and may improve with further training (though not necessarily, as it depends on the other neurons and the training data you have).

So, when you are looking at the derivative plot, you can see how prepared the neuron is to learn and absorb new knowledge, given a particular input.

— VeganHunter

0

The derivative you see here is important in neural networks. It's the reason why people generally prefer something else, such as the rectified linear unit (ReLU).

Do you see the derivative drop at the two ends? What if your network is on the very left side, but it needs to move to the right side? Imagine you're at -10.0 but you want 10.0. The gradient will be too small for your network to converge quickly. We don't want to wait, we want quicker convergence. ReLU doesn't have this problem for positive inputs, where its gradient stays constant.

We call this problem "Neural Network Saturation".
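To put rough numbers on this, here is a sketch comparing the two gradients at a few inputs. The input values are arbitrary, and the function names are just labels for this example.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of max(0, z); the value at exactly z = 0 is a convention.
    return 1.0 if z > 0 else 0.0

for z in (-10.0, -1.0, 1.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid' = {sigmoid_grad(z):.2e}  relu' = {relu_grad(z):.1f}")
```

The sigmoid gradient is tiny at both ends, while the ReLU gradient stays at 1 for any positive input. Note that the ReLU gradient is exactly 0 for negative inputs, so a ReLU unit can also stop learning on that side.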

Please see https://www.quora.com/What-is-special-about-rectifier-neural-units-used-in-NN-learning

— SmallChess