Area under the ROC curve or area under the PR curve for imbalanced data?

I am unsure which performance measure to use: the area under the ROC curve (TPR as a function of FPR) or the area under the precision-recall curve (precision as a function of recall).

My data is imbalanced, i.e., the number of negative instances is much larger than the number of positive instances.

I am using the prediction output of weka. A sample is:

inst#,actual,predicted,prediction
1,2:0,2:0,0.873
2,2:0,2:0,0.972
3,2:0,2:0,0.97
4,2:0,2:0,0.97
5,2:0,2:0,0.97
6,2:0,2:0,0.896
7,2:0,2:0,0.973

And I am using the pROC and ROCR libraries.
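Before either curve can be computed, each row has to be reduced to a true label and a score for the positive class. The following is a minimal Python sketch of that step, not a pROC or ROCR recipe. It rests on assumptions that are not stated in the question: that the problem is binary, that the positive class is labelled "1", and that the `prediction` column holds the confidence in the predicted class rather than in the positive class. The helper name `parse_weka_predictions` is hypothetical.

```python
import csv
import io

# First rows of the sample from the question (inst#, actual, predicted, prediction).
SAMPLE = """inst#,actual,predicted,prediction
1,2:0,2:0,0.873
2,2:0,2:0,0.972
3,2:0,2:0,0.97
"""

POSITIVE_LABEL = "1"  # assumption: the positive class is labelled "1"


def parse_weka_predictions(text, positive_label=POSITIVE_LABEL):
    """Return (is_positive, positive_class_score) pairs from weka-style CSV text.

    Assumes a binary problem in which 'prediction' is the confidence in the
    *predicted* class, so the positive-class score is p when the predicted
    class is positive and 1 - p otherwise.
    """
    pairs = []
    for record in csv.DictReader(io.StringIO(text)):
        actual = record["actual"].split(":", 1)[1]        # "2:0" -> "0"
        predicted = record["predicted"].split(":", 1)[1]  # "2:0" -> "0"
        confidence = float(record["prediction"])
        score = confidence if predicted == positive_label else 1.0 - confidence
        pairs.append((actual == positive_label, score))
    return pairs


pairs = parse_weka_predictions(SAMPLE)
for is_positive, score in pairs:
    print(is_positive, round(score, 3))
```

If the `prediction` column already holds the positive-class probability in your setup, the `1 - p` conversion should be dropped.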

— M.M

You forgot to mention what you want to achieve with any of these curves.

— Marc Claesen

Note: it seems you want to choose between ROC curves (TPR as a function of FPR over the entire operating range) and PR curves (precision versus recall over the entire operating range). Terminology like "AUC-ROC of precision and recall" is very misleading, so I have edited this. Please revert if I misunderstood.

— Marc Claesen

Answers:

The question is quite vague, so I am going to assume you want to choose an appropriate performance measure to compare different models. For a good overview of the key differences between ROC and PR curves, you can refer to the following paper: The Relationship Between Precision-Recall and ROC Curves by Davis and Goadrich.

To quote Davis and Goadrich:

However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm's performance.

ROC curves plot FPR vs TPR. To be more explicit:

$FPR = \frac{FP}{FP+TN}, \quad TPR = \frac{TP}{TP+FN}.$

PR curves plot precision versus recall, or more explicitly:

$recall = \frac{TP}{TP+FN} = TPR, \quad precision = \frac{TP}{TP+FP}.$

Precision is directly influenced by class (im)balance, since $FP$ scales with the number of negatives, whereas TPR depends only on the positives (and FPR only on the negatives). This is why ROC curves do not capture such effects.
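To make this concrete, here is a small Python sketch with made-up counts. The classifier, its operating point (TPR = 0.8, FPR = 0.1) and the class sizes are all hypothetical; the only thing that changes between the two cases is the number of negatives.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Hypothetical classifier operating at TPR = 0.8 and FPR = 0.1.
# Balanced data: 100 positives, 100 negatives.
balanced = rates(tp=80, fn=20, fp=10, tn=90)

# Imbalanced data: the same 100 positives, but 10,000 negatives.
# The same FPR of 0.1 now produces 1,000 false positives.
imbalanced = rates(tp=80, fn=20, fp=1000, tn=9000)

for name, (tpr, fpr, precision) in [("balanced", balanced),
                                    ("imbalanced", imbalanced)]:
    print(f"{name:>10}: TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.3f}")
```

The ROC point is identical in both cases, while precision falls from roughly 0.89 to roughly 0.07.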

Precision-recall curves are better to highlight differences between models for highly imbalanced data sets. If you want to compare different models in imbalanced settings, area under the PR curve will likely exhibit larger differences than area under the ROC curve.
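As a toy illustration of that last claim, the sketch below compares two hypothetical rankings over 10 positives and 1,000 negatives. The rankings are invented for the example. AUROC is computed as the fraction of positive-negative pairs ranked correctly, and the area under the PR curve is approximated by average precision, which is one common estimate and not the only one.

```python
def auroc(labels, scores):
    """Area under the ROC curve: probability that a random positive
    outscores a random negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def scores_from_ranking(ranking):
    """Turn a best-first list of labels into strictly decreasing scores."""
    n = len(ranking)
    return [(n - i) / n for i in range(n)]


# Hypothetical best-first rankings: 1 = positive, 0 = negative.
ranking_a = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 995  # model A
ranking_b = [0] * 20 + [1] * 10 + [0] * 980          # model B

results = {}
for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    scores = scores_from_ranking(ranking)
    results[name] = (auroc(ranking, scores), average_precision(ranking, scores))
    print(name, "AUROC=%.4f" % results[name][0], "AP=%.4f" % results[name][1])
```

In this constructed case the two models differ by less than 0.02 in AUROC (about 0.998 versus 0.980) but by about 0.6 in average precision (about 0.81 versus 0.21), because 20 negatives ranked above the positives barely register against 1,000 negatives yet dominate the top of the ranking.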

That said, ROC curves are much more common (even if they are less suited). Depending on your audience, ROC curves may be the lingua franca, so using those is probably the safer choice. If one model completely dominates another in PR space (e.g. it always has higher precision over the entire recall range), it will also dominate in ROC space. If the curves cross in either space, they will also cross in the other. In other words, the main conclusions will be similar no matter which curve you use.

Shameless advertisement. As an additional example, you could have a look at one of my papers in which I report both ROC and PR curves in an imbalanced setting. Figure 3 contains ROC and PR curves for identical models, clearly showing the difference between the two. To compare area under the PR versus area under ROC you can compare tables 1-2 (AUPR) and tables 3-4 (AUROC) where you can see that AUPR shows much larger differences between individual models than AUROC. This emphasizes the suitability of PR curves once more.

— Marc Claesen

Thanks for the explanation. The question now is: why are PR curves more informative for imbalanced data? To me, ROC should be more informative because it considers both TPR and FPR.

— M.M

In addition, these two articles make me more confused! onlinelibrary.wiley.com/doi/10.1111/j.1466-8238.2007.00358.x/… riceanalytics.com/db3/00232/riceanalytics.com/_download/…

— M.M

@M.A edited my answer to clarify.

— Marc Claesen

I think there is a mixup in the equation for recall between TPR and FPR, no?

— Simon Thordal

You're right, it should be: recall = ... = TPR, not FPR. @Marc Claesen, I think only you can change that, because when I try to do it, I'm informed that: "Edits should have at least 6 characters", so it's impossible to correct small typos, such as this one.

— ponadto

ROC curves plot TPR on the y-axis and FPR on the x-axis, but it depends on what you want to portray. Unless there is some reason to plot it differently in your area of study, TPR/FPR ROC curves are the standard for showing operating tradeoffs and I believe they would be most well received.

Precision and recall alone can be misleading because they do not account for true negatives.
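A quick way to see this is to add correctly rejected negatives to a confusion matrix and watch which numbers move. The counts below are hypothetical and describe a single fixed threshold.

```python
def summarize(tp, fn, fp, tn):
    """Return precision, recall (TPR) and FPR from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts at one fixed threshold.
before = summarize(tp=40, fn=10, fp=20, tn=80)

# Add 900 negatives that the model rejects correctly: only TN changes.
after = summarize(tp=40, fn=10, fp=20, tn=980)

for key in ("precision", "recall", "fpr"):
    print(f"{key:>9}: before={before[key]:.3f}  after={after[key]:.3f}")
```

Precision and recall stay exactly the same, because neither formula contains TN, while FPR drops from 0.2 to 0.02.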

— Underminer

I consider the largest difference between ROC AUC and PR AUC to be that ROC AUC measures how well your model can "calculate" the positive class AND the negative class, whereas PR AUC really only looks at your positive class. So in a balanced class situation where you care about both the negative and the positive class, the ROC AUC metric works great. When you have an imbalanced situation, it is preferred to use PR AUC, but keep in mind that it only measures how well your model can "calculate" the positive class!

— David