Computational complexity of k-NN

What is the time complexity of the k-NN algorithm with a naive search approach (no k-d tree or similar)?

I am interested in its time complexity when the hyperparameter k is also taken into account. I have found contradictory answers:

O(nd + kn), where n is the cardinality of the training set and d the dimension of each sample. [1]
O(ndk), where again n is the cardinality of the training set and d the dimension of each sample. [2]

[1] http://www.csd.uwo.ca/courses/CS9840a/Lecture2_knn.pdf (p. 18/20)

[2] http://www.cs.haifa.ac.il/~rita/ml_course/lectures/KNN.pdf (p. 18/31)

k-nearest-neighbour time-complexity

Assuming $k$ is fixed (as both of the linked lectures do), your algorithmic choices determine whether the computation takes $O(nd+kn)$ runtime or $O(ndk)$ runtime.

First, let's consider a $O(nd+kn)$ runtime algorithm:

Initialize $selected_i = 0$ for all observations $i$ in the training set
For each training set observation $i$ , compute $dist_i$ , the distance from the new observation to training set observation $i$
For $j=1$ to $k$ : Loop through all training set observations, selecting the index $i$ with the smallest $dist_i$ value for which $selected_i=0$ . Select this observation by setting $selected_i=1$ .
Return the $k$ selected indices

Each distance computation requires $O(d)$ runtime, so the second step requires $O(nd)$ runtime. For each iteration of the third step, we perform $O(n)$ work by looping through the training set observations, so the step overall requires $O(nk)$ work. The first and fourth steps only require $O(n)$ work, so we get a $O(nd+kn)$ runtime.
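As a rough illustration, the four steps above might look like this in Python. This is only a sketch: the function name `knn_precomputed`, the use of Euclidean distance via `math.dist`, and the assumption that $1 \leq k \leq n$ are mine, not part of the lectures.

```python
import math

def knn_precomputed(train, query, k):
    """Indices of the k training points nearest to query (assumes 1 <= k <= n)."""
    n = len(train)
    selected = [False] * n                        # step 1: O(n)
    dist = [math.dist(x, query) for x in train]   # step 2: n distances, O(d) each
    result = []
    for _ in range(k):                            # step 3: k passes, O(n) each
        best = None
        for i in range(n):
            if not selected[i] and (best is None or dist[i] < dist[best]):
                best = i
        selected[best] = True
        result.append(best)
    return result                                 # step 4
```

The distances are computed once and only read afterwards, which is where the additive $nd + kn$ form comes from.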

Now, let's consider a $O(ndk)$ runtime algorithm:

Initialize $selected_i = 0$ for all observations $i$ in the training set
For $j=1$ to $k$ : Loop through all training set observations and compute the distance between each training set observation and the new observation. Select the index $i$ with the smallest distance for which $selected_i=0$ . Select this observation by setting $selected_i=1$ .
Return the $k$ selected indices

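For each iteration of the second step, we compute the distance between the new observation and every training set observation, requiring $O(nd)$ work per iteration and therefore $O(ndk)$ work overall.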
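For comparison, here is a sketch of this second variant in Python, under the same assumptions as before (Euclidean distance, $1 \leq k \leq n$); the name `knn_recompute` is hypothetical. The only structural difference is that the distance is recomputed inside the selection loop instead of being stored.

```python
import math

def knn_recompute(train, query, k):
    """Indices of the k nearest training points, recomputing distances each pass."""
    n = len(train)
    selected = [False] * n
    result = []
    for _ in range(k):                            # k passes
        best, best_dist = None, math.inf
        for i in range(n):                        # n distance computations per pass
            d_i = math.dist(train[i], query)      # O(d), nothing is cached
            if not selected[i] and d_i < best_dist:
                best, best_dist = i, d_i
        selected[best] = True
        result.append(best)
    return result
```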

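The difference between the two algorithms is that the first precomputes and stores the distances, requiring $O(n)$ extra memory, while the second does not. However, given that we already store the entire training set, requiring $O(nd)$ memory, as well as the $selected$ vector, requiring $O(n)$ storage, the storage of the two algorithms is asymptotically the same. As a result, the better asymptotic runtime for $k > 1$ makes the first algorithm more attractive.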

It's worth noting that it is possible to obtain an $O(nd)$ runtime using an algorithmic improvement:

For each training set observation $i$ , compute $dist_i$ , the distance from the new observation to training set observation $i$
Run the quickselect algorithm to compute the $k^{th}$ smallest distance in $O(n)$ runtime
Return all indices whose distance is no larger than the computed $k^{th}$ smallest distance

This approach takes advantage of the fact that efficient approaches exist to find the $k^{th}$ smallest value in an unsorted array.
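A minimal sketch of this idea in pure Python follows. The helper names `kth_smallest` and `knn_quickselect` are hypothetical, Euclidean distance and $1 \leq k \leq n$ are assumed, and the pivot is chosen at random, so the $O(n)$ selection bound holds in expectation rather than in the worst case.

```python
import math
import random

def kth_smallest(values, k):
    """Quickselect: the k-th smallest value (k is 1-based), expected O(n)."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        n_equal = sum(1 for v in items if v == pivot)
        if k <= len(lows):
            items = lows                          # answer lies below the pivot
        elif k <= len(lows) + n_equal:
            return pivot                          # the pivot is the answer
        else:
            k -= len(lows) + n_equal              # answer lies above the pivot
            items = [v for v in items if v > pivot]

def knn_quickselect(train, query, k):
    dist = [math.dist(x, query) for x in train]   # O(nd)
    threshold = kth_smallest(dist, k)             # expected O(n)
    # One more O(n) pass; ties at the threshold can yield more than k indices.
    return [i for i, d_i in enumerate(dist) if d_i <= threshold]
```

In practice a library routine such as NumPy's `argpartition` or C++'s `std::nth_element` would likely be used for the selection step instead of a hand-written quickselect.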

— josliber

Great answer and I especially like the advice towards the use of quickselect.

— usεr11852 says Reinstate Monic

One more question: for the third option I believe that the time complexity should be O(nd+k), as you still have to compute the most common label among the k-nearest neighbors to emit a prediction, right?

— Daniel López

@Daniel Since $k \leq n$ , $O(nd+k)$ is the same as $O(nd)$ .
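Spelled out, assuming $d \geq 1$ (not stated in the comment, but implied by the setting):

$$nd \;\leq\; nd + k \;\leq\; nd + n \;\leq\; 2nd$$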

— josliber

Last time I bother you: trying to determine the computational complexity of a modified version of k-NN I am working on, I get the following: $O(nd + nd/p)$ , where by definition n, d and p are integers greater than zero. Can I simplify that to $O(nd)$ ?

— Daniel López

@Daniel Yes, in that case $O(nd)$ works.
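The bound behind this, using only $p \geq 1$ :

$$nd \;\leq\; nd + \frac{nd}{p} \;\leq\; 2nd$$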

— josliber