What is the time complexity for training a neural network using back-propagation?

17

Suppose that a NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity of training this NN using back-propagation?

I have a basic idea of how the time complexity of an algorithm is found, but here there are 4 different factors to consider, i.e. iterations, layers, nodes in each layer and training examples, and maybe more. I found an answer here, but it was not clear enough.

Are there other factors, apart from those I mentioned above, that influence the time complexity of the training algorithm of a NN?

— DuttaA

See also https://qr.ae/TWttzq

— nbro

11

I haven't seen an answer from a trusted source, but I'll try to answer this myself with a simple example (with my current knowledge).

In general, note that training an MLP using back-propagation is usually implemented with matrices.

Time complexity of matrix multiplication

The time complexity of matrix multiplication for $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Notice that we are assuming the simplest multiplication algorithm here: there exist other algorithms with somewhat better time complexity.
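To make the $\mathcal{O}(i*j*k)$ count concrete, here is a minimal sketch of the naive triple-loop multiplication with a counter for the multiply-add steps. The function name `matmul_count` and the small example matrices are only illustrative assumptions.

```python
def matmul_count(A, B):
    """Multiply A (i x j) by B (j x k) naively.

    Returns the product and the number of multiply-add steps performed.
    """
    i, j, k = len(A), len(B), len(B[0])
    assert all(len(row) == j for row in A), "inner dimensions must match"
    C = [[0.0] * k for _ in range(i)]
    ops = 0
    for r in range(i):          # rows of A
        for c in range(k):      # columns of B
            for s in range(j):  # shared inner dimension
                C[r][c] += A[r][s] * B[s][c]
                ops += 1
    return C, ops


A = [[1, 2, 3], [4, 5, 6]]    # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]  # 3 x 2
C, ops = matmul_count(A, B)
print(C)    # [[4.0, 5.0], [10.0, 11.0]]
print(ops)  # 12, i.e. 2 * 3 * 2
```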

Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to $j$, you do

$S_j = W_{ji}*Z_i$

Then you apply the activation function

$Z_j = f(S_j)$

If we have $N$ layers (including the input and output layers), this will run $N-1$ times.
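The two steps can be sketched for a single input vector as follows. This is only an illustration: the sigmoid is an assumed choice for $f$, and the layer sizes and constant weights are made up. The loop body runs once per weight matrix, i.e. $N-1$ times.

```python
import math


def sigmoid(x):
    # Assumed activation function f; the text does not fix a particular one.
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, z):
    """Propagate the vector z through all layers.

    weights[n] has one row per node of layer n+1 and one column per node of layer n.
    Returns the output activations and the number of layer-to-layer steps.
    """
    steps = 0
    for W in weights:
        s = [sum(w * v for w, v in zip(row, z)) for row in W]  # S_j = W_ji * Z_i
        z = [sigmoid(x) for x in s]                            # Z_j = f(S_j)
        steps += 1
    return z, steps


layer_sizes = [3, 4, 2]  # N = 3 layers, so N - 1 = 2 weight matrices
weights = [
    [[0.1] * layer_sizes[n] for _ in range(layer_sizes[n + 1])]
    for n in range(len(layer_sizes) - 1)
]
out, steps = forward(weights, [1.0, 2.0, 3.0])
print(steps)  # 2
```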

Example

As an example, let's compute the time complexity of the forward pass for an MLP with $4$ layers, where $i$ denotes the number of nodes in the input layer, $j$ the number of nodes in the second layer, $k$ the number of nodes in the third layer, and $l$ the number of nodes in the output layer.

Since there are $4$ layers, you need $3$ matrices to represent the weights between these layers. Let's denote them by $W_{ji}$, $W_{kj}$ and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. For propagating from layer $i$ to $j$, we first have

$S_{jt} = W_{ji} * Z_{it}$

and this operation (i.e. matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function

$Z_{jt} = f(S_{jt})$

and this has $\mathcal{O}(j*t)$ time complexity, because it is an element-wise operation.

So, in total, we have

$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$

Using the same logic, for going $j \to k$, we have $\mathcal{O}(k*j*t)$, and, for $k \to l$, we have $\mathcal{O}(l*k*t)$.

In total, the time complexity for feedforward propagation will be

$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

I don't think this can be simplified further without extra assumptions about the layer sizes. $\mathcal{O}(t*i*j*k*l)$ would also be a valid upper bound, but a much looser one.
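One way to sanity-check the forward-pass total is to count the multiply-adds by looping exactly as a naive batched forward pass would, and then compare the count with $t*(ij + jk + kl)$. The sketch below does that for small, made-up layer sizes; the product $t*i*j*k*l$ is computed only for comparison.

```python
def count_forward_ops(layer_sizes, t):
    """Count multiply-adds of a naive batched forward pass over t examples."""
    ops = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        for _ in range(n_out):         # rows of the weight matrix
            for _ in range(t):         # one column per training example
                for _ in range(n_in):  # inner dimension of the product
                    ops += 1
    return ops


i, j, k, l, t = 3, 5, 4, 2, 10  # illustrative sizes only
counted = count_forward_ops([i, j, k, l], t)
formula = t * (i * j + j * k + k * l)
product = t * i * j * k * l
print(counted, formula, product)  # 430 430 1200
```

The count matches the sum-of-products formula exactly, while the product of all layer sizes overshoots it even at these small sizes.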

Back-propagation algorithm

The back-propagation algorithm proceeds as follows. Starting from the output layer $l \to k$, we compute the error signal, $E_{lt}$, a matrix containing the error signals for nodes at layer $l$:

$E_{lt} = f'(S_{lt}) \odot {(Z_{lt} - O_{lt})}$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns: each column is the error signal for one of the $t$ training examples.

We then compute the delta weights, $D_{lk} \in \mathbb{R}^{l \times k}$, between layer $l$ and layer $k$:

$D_{lk} = E_{lt} * Z_{tk}$

where $Z_{tk}$ is the transpose of $Z_{kt}$ .

We then adjust the weights

$W_{lk} = W_{lk} - D_{lk}$

For $l \to k$ , we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$ .
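The output-layer step can be sketched in plain Python as below. This is a rough illustration under stated assumptions, not the answer's own code: $f$ is taken to be the sigmoid (so $f'(S) = Z(1 - Z)$), $O$ is taken to hold the target outputs, and the tiny matrices are made up. As in the equations, no learning rate is applied.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output_layer_step(W, Z_prev, O):
    """One update of W (l x k) given activations Z_prev (k x t) and targets O (l x t)."""
    l, k, t = len(W), len(Z_prev), len(Z_prev[0])
    # Forward part: S = W * Z_prev, then Z = f(S)
    S = [[sum(W[a][b] * Z_prev[b][c] for b in range(k)) for c in range(t)]
         for a in range(l)]
    Z = [[sigmoid(s) for s in row] for row in S]
    # E = f'(S) * (Z - O), element-wise: O(l*t) work
    E = [[Z[a][c] * (1 - Z[a][c]) * (Z[a][c] - O[a][c]) for c in range(t)]
         for a in range(l)]
    # D = E * Z_prev^T, an (l x t) by (t x k) product: O(l*t*k) work
    D = [[sum(E[a][c] * Z_prev[b][c] for c in range(t)) for b in range(k)]
         for a in range(l)]
    # W = W - D, element-wise: O(l*k) work
    W_new = [[W[a][b] - D[a][b] for b in range(k)] for a in range(l)]
    return W_new, E, D


W = [[0.0, 0.0]]                   # l = 1, k = 2
Z_prev = [[1.0, 0.0], [0.0, 1.0]]  # k = 2, t = 2
O = [[1.0, 0.0]]                   # assumed targets, l = 1, t = 2
W_new, E, D = output_layer_step(W, Z_prev, O)
print(W_new)  # [[0.125, -0.125]]
```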

Now, going back from $k \to j$ . We first have

$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

Then

$D_{kj} = E_{kt} * Z_{tj}$

And then

$W_{kj} = W_{kj} - D_{kj}$

where $W_{kl}$ is the transpose of $W_{lk}$ . For $k \to j$ , we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t(l+j))$ .

And finally, for $j \to i$ , we have $\mathcal{O}(j*t(k+i))$ . In total, we have

$\mathcal{O}(ltk + tk(l + j) + tj (k + i)) = \mathcal{O}(t*(lk + kj + ji))$
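The per-layer sums can also be added up mechanically. The helper below is a hypothetical sketch that follows the terms listed above for each layer (element-wise work, the product with the transposed weights, the delta weights, and the weight update) and separately tracks the matrix-product terms, which dominate. The layer sizes are illustrative.

```python
def backward_cost(layer_sizes, t):
    """Return (all counted operations, matrix-product operations) for one backward pass."""
    total = 0   # every term in the per-layer sums
    matmul = 0  # only the matrix-product terms
    last = len(layer_sizes) - 1
    for n in range(last, 0, -1):
        cur, prev = layer_sizes[n], layer_sizes[n - 1]
        if n == last:
            total += 2 * cur * t    # f'(S) and the element-wise product with (Z - O)
        else:
            nxt = layer_sizes[n + 1]
            total += cur * t        # f'(S) and the element-wise product
            total += cur * nxt * t  # transposed weights times the error from the layer above
            matmul += cur * nxt * t
        total += cur * t * prev     # D = E * Z^T
        matmul += cur * t * prev
        total += cur * prev         # W = W - D
    return total, matmul


i, j, k, l, t = 3, 5, 4, 2, 10  # illustrative sizes only
total, matmul = backward_cost([i, j, k, l], t)
forward = t * (i * j + j * k + k * l)
print(total, matmul, forward)  # 883 710 430
```

In this example the backward pass costs roughly twice the forward pass, which is a constant factor and so leaves the order $\mathcal{O}(t*(lk + kj + ji))$ unchanged.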

which is the same as for the feedforward pass. Since they are the same, the total time complexity for one epoch will be

$\mathcal{O}(t*(ij + jk + kl)).$

This time complexity is then multiplied by the number of iterations (epochs). So, we have

$\mathcal{O}(n*t*(ij + jk + kl)),$

where $n$ is the number of iterations.
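As a purely illustrative plug-in (the layer sizes, number of examples and number of epochs are assumptions chosen for the arithmetic, not values from this answer), take $i = 784$, $j = 128$, $k = 64$, $l = 10$, $t = 60000$ and $n = 10$:

$ij + jk + kl = 100352 + 8192 + 640 = 109184$

$t*(ij + jk + kl) = 60000 * 109184 = 6551040000$

$n*t*(ij + jk + kl) = 10 * 6551040000 = 65510400000$

So the bound corresponds to roughly $6.6 \times 10^{10}$ multiply-add steps, up to the constant factor hidden by the $\mathcal{O}$ notation.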

Notes

Note that these matrix operations can be greatly parallelized by GPUs.

Conclusion

We tried to find the time complexity for training a neural network that has 4 layers with respectively $i$ , $j$ , $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$ .

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: note that batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)
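A small sketch of why the batch size should not change the per-epoch order: the matrix-product work is proportional to (examples in the batch) times (number of weights), and the batch sizes always sum to $t$. The function name and the numbers are illustrative assumptions.

```python
def epoch_cost(num_weights, t, batch_size):
    """Return (matrix-product multiply-adds, number of weight updates) for one epoch."""
    work, updates = 0, 0
    for start in range(0, t, batch_size):
        b = min(batch_size, t - start)  # the last batch may be smaller
        work += b * num_weights
        updates += 1
    return work, updates


w, t = 43, 10  # e.g. w = ij + jk + kl for layer sizes 3, 5, 4, 2
batch = epoch_cost(w, t, batch_size=t)  # batch gradient descent
mini = epoch_cost(w, t, batch_size=4)   # mini-batch
sgd = epoch_cost(w, t, batch_size=1)    # stochastic
print(batch, mini, sgd)  # (430, 1) (430, 3) (430, 10)
```

Only the number of weight updates changes. Each update is element-wise and costs about $w$ operations, so even with one update per example the extra work is about $t*w$, the same order as the matrix products.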

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise operations; hence, they will not affect the time complexity of the algorithm.
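For illustration, here is a sketch of a classical momentum update. The learning rate `lr` and momentum coefficient `beta` are hypothetical parameters that do not appear in the equations above. The loop touches each weight exactly once, so one update of an $l \times k$ matrix costs $\mathcal{O}(l*k)$, which is dominated by the $\mathcal{O}(l*t*k)$ matrix product.

```python
def momentum_step(W, D, V, lr=0.1, beta=0.9):
    """Classical momentum: V = beta*V + lr*D, then W = W - V (both element-wise).

    Updates W and V in place and returns them with the number of elements touched.
    """
    ops = 0
    for a in range(len(W)):
        for b in range(len(W[0])):
            V[a][b] = beta * V[a][b] + lr * D[a][b]
            W[a][b] -= V[a][b]
            ops += 1
    return W, V, ops


W = [[1.0, 2.0], [3.0, 4.0]]
D = [[1.0, 1.0], [1.0, 1.0]]  # delta weights from back-propagation
V = [[0.0, 0.0], [0.0, 0.0]]  # velocity, initially zero
W, V, ops = momentum_step(W, D, V)
print(ops)  # 4, one pass over the 2 x 2 weights
```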

I'm not sure what the results would be using other optimizers such as RMSprop.

Sources

The following article http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5 describes an implementation using matrices. Although this implementation is using "row major", the time complexity is not affected by this.

If you're not familiar with back-propagation, check this article:

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

— M.kazem Akhgary

Your answer is great..I could not find any ambiguity till now, but you forgot the no. of iterations part, just add it...and if no one answers in 5 days i'll surely accept your answer

— DuttaA

@DuttaA I tried to put every thing I knew. it may not be 100% correct so feel free to leave this unaccepted :) I'm also waiting for other answers to see what other points I missed.

— M.kazem Akhgary

4

For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons and have $\mathcal{O}(w)$, where $w$ is the number of weights, i.e., roughly $n * n_i^2$ when each of the $n$ layers has $n_i$ nodes, assuming full connectivity between your layers.

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity for learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$ .
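As a rough sketch, $w$ can be computed from the layer sizes under the full-connectivity assumption (biases ignored) and plugged into $w*m*e$. The function names and the numbers are illustrative only.

```python
def num_weights(layer_sizes):
    """Weights of a fully connected network: products of adjacent layer sizes, summed."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def training_cost(layer_sizes, m, e):
    """Order-of-magnitude operation count w * m * e."""
    return num_weights(layer_sizes) * m * e


sizes = [3, 5, 4, 2]                  # illustrative layer sizes
w = num_weights(sizes)
cost = training_cost(sizes, m=10, e=7)
print(w, cost)  # 43 3010
```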

The bad news is that there's no formula telling you what number of epochs $e$ you need.

— maaartinus

From the above answer, don't you think it depends on more factors?

— DuttaA

1

@DuttaA No. There's a constant amount of work per weight, which gets repeated e times for each of m examples. I didn't bother to compute the number of weights, I guess, that's the difference.

— maaartinus

1

I think the answers are same. in my answer I can assume number of weights w = ij + jk + kl. basically sum of n * n_i between layers as you noted.

— M.kazem Akhgary

1

A potential disadvantage of gradient-based methods is that they head for the nearest minimum, which is usually not the global minimum.

This means that the only difference between these search methods is the speed with which solutions are obtained, and not the nature of those solutions.

An important consideration is time complexity, which is the rate at which the time required to find a solution increases with the number of parameters (weights). In short, the time complexities of a range of different gradient-based methods (including second-order methods) seem to be similar.

Six different error functions exhibit a median run-time order of approximately $\mathcal{O}(N^4)$ on the N-2-N encoder in this paper:

Lister, R and Stone J "An Empirical Study of the Time Complexity of Various Error Functions with Conjugate Gradient Back Propagation" , IEEE International Conference on Artificial Neural Networks (ICNN95), Perth, Australia, Nov 27-Dec 1, 1995.

Summarised from my book: Artificial Intelligence Engines: A Tutorial Introduction to the Mathematics of Deep Learning.

— James V Stone

Hi J. Stone. Thanks for trying to contribute to the site. However, please, note that this is not a place for advertising yourself. Anyway, you can surely provide a link to your own books if they are useful for answering the questions and provided you're not just trying to advertise yourself.

— nbro

@nbro If James Stone can provide an insightful answer - and it seems so - then i'm fine with him also mentioning some of his work. Having experts on this network is a solid contribution to the quality and level.

— javadba

Dear nbro, That is a fair comment. I dislike adverts too. But it is possible for a book and/or paper to be relevant to a question, as I believe it is in this case. regards, Jim Stone

— James V Stone