Does Science endorse the Garden of Forking Paths?


29

The idea of adaptive data analysis is that you alter your plan for analyzing the data as you learn more about it. In the case of exploratory data analysis (EDA), this is generally a good idea (you are often looking for unforeseen patterns in the data), but for a confirmatory study, it is widely accepted as a deeply flawed method of analysis (unless all of the steps are clearly defined and properly planned out in advance).

That being said, adaptive data analysis is how many researchers actually conduct their analyses. As such, if one could do it in a statistically valid manner, it would revolutionize statistical practice.

The following Science article claims to have found a way to do just that (my apologies for the paywall, but if you are at a university, you likely have access): Dwork et al., 2015, "The reusable holdout: Preserving validity in adaptive data analysis".

Personally, I am always skeptical of statistics articles published in Science, and this one is no different. In fact, after reading the article twice, including the supplementary material, I cannot understand (at all) why the authors claim that their method prevents overfitting.

My understanding is that they have a holdout dataset which they reuse. They seem to claim that by "fuzzing" the output of the confirmatory analysis on the holdout dataset, overfitting will be prevented (it is worth noting that the fuzzing appears to be just adding noise if the statistic computed on the training data is sufficiently far from the statistic computed on the holdout data). As far as I can tell, there is no real reason this should prevent overfitting.
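To make my reading concrete, here is a minimal Python sketch of the mechanism as I understand it. The threshold and noise scale are numbers I made up for illustration, not the paper's calibration, and as far as I understand the published algorithm also has extra machinery (e.g., a budget on how often the threshold can be crossed) that I am ignoring here:

```python
import numpy as np

rng = np.random.default_rng(0)

def reusable_holdout_answer(train_stat, holdout_stat,
                            threshold=0.04, noise_scale=0.01):
    """My reading of the mechanism: if the training and holdout statistics
    agree to within a threshold, report the training value; otherwise
    report a noise-perturbed ("fuzzed") holdout value."""
    if abs(train_stat - holdout_stat) <= threshold:
        return train_stat
    return holdout_stat + rng.laplace(0.0, noise_scale)

# Example query: training and holdout accuracies of some candidate model
print(reusable_holdout_answer(0.81, 0.80))  # close: training value returned as-is
print(reusable_holdout_answer(0.90, 0.80))  # far apart: fuzzed holdout value returned
```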

Am I misunderstanding what the authors are proposing? Is there some subtle effect I am overlooking? Or has Science endorsed the worst statistical practice to date?


2
Those without Science access might want to consult this recent Science news article about how one can access paywalled papers.
amoeba says Reinstate Monica

1
Is this possibly a preprint: arxiv.org/pdf/1411.2664.pdf ?
Tim

1
@Tim: the Science article cites the preprint you posted. Also, the Laplacian Noise Addition section seems very similar, but not identical, to the methods in the published article.
Cliff AB

1
@CliffAB so they possibly used differential privacy to make them different ;)
Tim

4
This topic was actually covered in a tutorial at ICML last month, "Rigorous Data Dredging: Theory and Tools for Adaptive Data Analysis", by a fellow at Google. icml.cc/2016/?page_id=97
horaceT

Answers:


7

There is a blog posting by the authors that describes this at a high level.

To quote from early in that posting:

In order to reduce the number of variables and simplify our task, we first select some promising looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard F-test from our favorite statistics textbook and report the resulting p-value.

Freedman showed that the reported p-value is highly misleading - even if the data were completely random with no correlation whatsoever between the response variable and the data points, we’d likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but we never account for this fact. There is a huge number of possible subsets of variables that we selected from. The mere fact that we chose one test over the other by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.

Freedman’s paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false as was argued by Gelman and Loken who aptly refer to adaptivity as “garden of the forking paths”.
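To make the selection bias in the quoted example concrete, here is a small illustrative simulation in Python (the sample size, number of candidate predictors, and number kept are arbitrary choices, not Freedman's original setup): even though the response is pure noise, screening for the most correlated predictors before running the overall F-test routinely produces a "significant" p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, k = 100, 50, 10                      # observations, candidate predictors, predictors kept

X = rng.normal(size=(n, p))                # predictors: pure noise
y = rng.normal(size=n)                     # response: independent of X

# "Adaptive" step: keep the k predictors most correlated with y
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
keep = np.argsort(-corr)[:k]

# Fit OLS on the selected predictors and run the overall F-test
Xd = np.column_stack([np.ones(n), X[:, keep]])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
rss = ((y - Xd @ beta) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
F = ((tss - rss) / k) / (rss / (n - k - 1))
print("F =", round(F, 2), " p =", stats.f.sf(F, k, n - k - 1))
# The p-value is typically far below 0.05 even though nothing is going on.
```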

I can't see how their technique addresses this issue at all. So in answer to your question, I believe they don't address the Garden of Forking Paths, and in that sense their technique will lull people into a false sense of security, much as saying "I used cross-validation" lulls many -- who used non-nested CV -- into a false sense of security.

It seems to me that the bulk of the blog posting points to their technique as a better answer to how to keep participants in a Kaggle-style competition from climbing the test-set gradient, which is useful but doesn't directly address the Forking Paths. It has the flavor of Wolfram's and Google's "new science," where massive amounts of data will take over. That narrative has a mixed record, and I'm always skeptical of automated magic.


3

I'm sure I'm over-simplifying this differential privacy technique here, but the idea makes sense at a high level.

When your algorithm spits out a good result (wow, the accuracy on my test set has really improved), you don't want to jump to conclusions right away. You want to accept it only when the improvement is significantly larger than that of the previous algorithm. That's the reason for adding noise.

EDIT: This blog post has a good explanation and R code demonstrating the effectiveness of the noise adder: http://www.win-vector.com/blog/2015/10/a-simpler-explanation-of-differential-privacy/
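For what it's worth, here is a toy Python sketch of that acceptance rule; the margin and noise scale are made-up numbers for illustration, not the calibration used in the paper or the linked post:

```python
import numpy as np

rng = np.random.default_rng(1)

def accept_improvement(new_score, old_score, margin=0.01, noise_scale=0.005):
    """Toy rule: count a gain as real only if the *noisy* gain clears a margin,
    so small chance fluctuations on the holdout are not chased."""
    noisy_gain = (new_score - old_score) + rng.laplace(0.0, noise_scale)
    return noisy_gain > margin

print(accept_improvement(0.843, 0.840))  # tiny gain: usually rejected
print(accept_improvement(0.880, 0.840))  # clear gain: accepted
```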


But that's not an improvement over saying "I will only accept estimated effects > τ"... which will not prevent overfitting (although it will slightly dampen it). Interestingly, in their own plots, you can see evidence of overfitting (systematically lower reported error on holdout data than on fresh data).
Cliff AB

1
@CliffAB I have the same nagging feeling why this works better than just a simple threshold. But they have proofs!
horaceT

...except that their own example is inconsistent with their claim of preventing overfitting, and is consistent with what I would expect the results to be from "I will only accept estimated effects > τ".
Cliff AB

@CliffAB Can you elaborate? Where? That's an intriguing possibility....
horaceT

Using the slides from your earlier link (icml.cc/2016/?page_id=97), on slides 72 and 73, even when using the "Thresholdout" method, the holdout accuracy is greater than the accuracy on the fresh data in every single simulation, although it does do better than "standard holdout" (which is really "standard abuse of the validation dataset", not an actual valid statistical procedure). FYI, the plot on the slides appears to be the same one as in the Science paper (just in case you don't have access).
Cliff AB

3

The claim that adding noise helps prevent overfitting really does hold water here, since what they are really doing is limiting how the holdout is reused. Their method actually does two things: it limits the number of questions that can be asked of the holdout, and how much each of the answers reveals about the holdout data.

It might help to understand what the benchmarks are: on the one hand, you can just insist that the holdout be used only once. That has clear drawbacks. On the other hand, if you want to be able to use the holdout k times, you could chop it into k disjoint pieces and use each piece once. The problem with that method is that it loses a lot of power (if you had n data points in your holdout sample to begin with, you are now getting the statistical power of only n/k samples).
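A quick Python sketch of that second benchmark, just to show where the n/k power loss comes from (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
holdout = rng.normal(size=1000)           # stand-in for per-point holdout statistics, n = 1000
k = 10                                    # number of adaptive questions to allow

# Answer each question with a fresh, disjoint slice of the holdout
pieces = np.array_split(holdout, k)
answers = [piece.mean() for piece in pieces]

se_piece = holdout.std(ddof=1) / np.sqrt(len(pieces[0]))   # ~ sqrt(k) times larger
se_full = holdout.std(ddof=1) / np.sqrt(len(holdout))      # what one-shot use of the full holdout gives
print(answers[:3], se_piece, se_full)
```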

The Dwork et al paper gives a method which, even with adversarially posed questions, gives you an effective sample size of about n/k for each of the k questions you ask. Furthermore, they can do better if the questions are "not too nasty" (in a sense that is a bit hard to pin down, so let's ignore that for now).

The heart of their method is a relationship between algorithmic stability and overfitting, which dates back to the late 1970's (Devroye and Wagner 1978). Roughly, it says

"Let A be an algorithm that takes a data set X as input and outputs the description of a predicate q=A(X). If A is "stable" and X is drawn i.i.d from a population P, then the empirical frequency of q in x is about the same as the frequency of q in the population P."

Dwork et al. suggest using a notion of stability that controls how the distribution of answers changes as the data set changes (called differential privacy). It has the useful property that if A() is differentially private, then so is f(A()), for any function f. In other words, for the stability analysis to go through, the predicate q doesn't have to be the output of A --- any predicate that is derived from A's output will also enjoy the same type of guarantee.
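As a concrete (if generic) illustration of noise-based stability, here is the standard Laplace mechanism for releasing a bounded mean in Python, together with the post-processing idea mentioned above. This is the textbook differential-privacy construction, not the specific calibration used by Dwork et al.:

```python
import numpy as np

rng = np.random.default_rng(3)

def laplace_mechanism(values, epsilon, lower=0.0, upper=1.0):
    """Release the mean of `values` with epsilon-differential privacy via the
    Laplace mechanism. The sensitivity of a bounded mean is (upper - lower) / n,
    so the noise scale is sensitivity / epsilon."""
    values = np.clip(np.asarray(values), lower, upper)
    sensitivity = (upper - lower) / len(values)
    return values.mean() + rng.laplace(0.0, sensitivity / epsilon)

holdout_losses = rng.uniform(0, 1, size=500)
private_mean = laplace_mechanism(holdout_losses, epsilon=0.5)

# Post-processing: any function of the private output (here, thresholding it to
# decide whether to keep the model) inherits the same privacy/stability guarantee.
keep_model = private_mean < 0.45
print(private_mean, keep_model)
```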

There are now quite a few papers analyzing how different noise addition procedures control overfitting. A relatively readable one is that of Russo and Zou (https://arxiv.org/abs/1511.05219). Some more recent follow-up papers on the initial work of Dwork et al. might also be helpful to look at. (Disclaimer: I have two papers on the topic, the more recent one explaining a connection to adaptive hypothesis testing: https://arxiv.org/abs/1604.03924.)

Hope that all helps.


0

I object to your second sentence. The idea that one's complete plan of data analysis should be determined in advance is unjustified, even in a setting where you are trying to confirm a preexisting scientific hypothesis. On the contrary, any decent data analysis will require some attention to the actual data that has been acquired. The researchers who believe otherwise are generally researchers who believe that significance testing is the beginning and the end of data analysis, with little to no role for descriptive statistics, plots, estimation, prediction, model selection, etc. In that setting, the requirement to fix one's analytic plans in advance makes more sense because the conventional ways in which p-values are calculated require that the sample size and the tests to be conducted are decided in advance of seeing any data. This requirement hamstrings the analyst, and hence is one of many good reasons not to use significance tests.

You might object that letting the analyst choose what to do after seeing the data allows overfitting. It does, but a good analyst will show all the analyses they conducted, say explicitly what information in the data was used to make analytic decisions, and use methods such as cross-validation appropriately. For example, it is generally fine to recode variables based on the obtained distribution of values, but choosing for some analysis the 3 predictors out of 100 that have the closest observed association to the dependent variable means that the estimates of association are going to be positively biased, by the principle of regression to the mean. If you want to do variable selection in a predictive context, you need to select variables inside your cross-validation folds, or using only the training data.
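Here is a small sketch of that last point using scikit-learn; the data are simulated pure noise, and the choice of 3 predictors out of 100 mirrors the example above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
n, p = 100, 100
X = rng.normal(size=(n, p))   # predictors: pure noise
y = rng.normal(size=n)        # response unrelated to X

# Biased: pick the 3 "best" predictors using ALL the data, then cross-validate.
X_sel = SelectKBest(f_regression, k=3).fit(X, y).transform(X)
leaky = cross_val_score(LinearRegression(), X_sel, y, cv=5, scoring="r2")

# Correct: do the selection inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_regression, k=3), LinearRegression())
honest = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print("selection outside folds:", leaky.mean())   # optimistically biased
print("selection inside folds: ", honest.mean())  # near zero or negative, as it should be
```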


2
I believe a lot of what you are suggesting fits into the realm of exploratory data analysis (EDA), for which I did endorse adaptive data analysis methods. I also think EDA is underrated and should be given more credit. But all this is orthogonal to the question at hand, which is "Have these authors really allowed us to repeatedly reuse the validation data for model selection in a statistically valid way?" Your last sentence suggests that you, like myself, are somewhat skeptical of such findings.
Cliff AB

I don't think e.g. estimation is inherently exploratory, no. If you have a scientific hypothesis that says that the maximum length of a crocodile must be 12 feet and you try to estimate the maximum length of a crocodile to check this, you're doing a confirmatory analysis.
Kodiologist

2
+1, despite three existing downvotes. I do agree with the main point of this answer (your 2nd sentence), even though I am fully aware that it is quite controversial. In general I think the difference between the exploratory and the confirmatory analysis is overrated; real-life analysis is often somewhere in between. That said, I don't think you answered (or even attempted to answer) OP's question which was about Dwork et al. paper.
amoeba says Reinstate Monica

@amoeba "I don't think you answered (or even attempted to answer) OP's question which was about Dwork et al. paper" — True, although this still seemed worth posting as an answer because it casts doubt on what seems to be a premise of the question.
Kodiologist

2
+1 to @amoeba's comment. This would have been a great comment to the question, but it's not an answer.
S. Kolassa - Reinstate Monica