LASSO and ridge from a Bayesian perspective: what about the tuning parameter?

Penalized regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors. I guess (as I do not know enough about Bayesian statistics) that for a fixed tuning parameter there exists a concrete corresponding prior.

Now a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all? Or does the Bayesian approach effectively fix the tuning parameter before seeing the data? (I guess the latter would be detrimental to predictive performance.)

bayesian lasso ridge-regression

— Richard Hardy

I would imagine that a fully Bayesian approach would start with a given prior and not modify it, yes. But there is also an empirical-Bayes approach that optimizes over hyperparameter values: e.g. see stats.stackexchange.com/questions/24799

— amoeba says Reinstate Monica

Additional question (could be part of the main Q): Are there priors on the regularization parameter that would replace the cross-validation process?

— kjetil b halvorsen

Bayesians can put a prior on the tuning parameter, as it usually corresponds to a variance parameter. This is what is usually done to avoid CV and stay fully Bayesian. Alternatively, you can use REML to optimize the regularization parameter.

— guy

PS: for those aiming at the bounty, note my comment: I want to see an explicit answer that shows a prior which induces a MAP estimate equivalent to frequentist cross-validation.

— statslearner2

@statslearner2 I think it answers Richard's question very well. Your bounty seems to focus on a narrower aspect (about a hyperprior) than Richard's Q.

— amoeba says Reinstate Monica

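Penalized regression estimators such as LASSO and ridge are said to correspond to Bayesian estimators with certain priors.

Yes, that is correct. Whenever we have an optimisation problem involving maximisation of the log-likelihood function plus a penalty function on the parameters, this is mathematically equivalent to posterior maximisation where the penalty function is taken to be the negative logarithm of a prior kernel.$^\dagger$ To see this, suppose we have a penalty function $w$ using a tuning parameter $\lambda$. The objective function in these cases can then be written as: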

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta|\lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda)}{\int L_\mathbf{x}(\theta) \pi (\theta|\lambda) d\theta} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta|\mathbf{x}, \lambda) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where we use the prior $\pi(\theta|\lambda) \propto \exp ( -w(\theta|\lambda))$. Observe here that the tuning parameter in the optimisation is treated as a fixed hyperparameter in the prior distribution. If you are undertaking classical optimisation with a fixed tuning parameter, this is equivalent to undertaking a Bayesian optimisation with a fixed hyperparameter. For LASSO and ridge regression the penalty functions and corresponding prior-equivalents are:

$\begin{equation} \begin{aligned} \text{LASSO Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Laplace} \Big( 0, \frac{1}{\lambda} \Big) = \prod_{k=1}^m \frac{\lambda}{2} \cdot \exp ( -\lambda |\theta_k| ), \\[6pt] \text{Ridge Regression} & & \pi(\theta|\lambda) &= \prod_{k=1}^m \text{Normal} \Big( 0, \frac{1}{2\lambda} \Big) = \prod_{k=1}^m \sqrt{\lambda/\pi} \cdot \exp ( -\lambda \theta_k^2 ). \\[6pt] \end{aligned} \end{equation}$

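The former method penalises the regression coefficients according to their absolute magnitude, which is equivalent to imposing a Laplace prior located at zero. The latter method penalises the regression coefficients according to their squared magnitude, which is equivalent to imposing a normal prior located at zero.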
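As a quick numerical check of the ridge case, here is a minimal sketch. It assumes a Gaussian likelihood with known noise variance `sigma2` (an assumption not made explicit above) and the penalty $w(\theta|\lambda) = \lambda \sum_k \theta_k^2$. The function names are hypothetical, chosen only for illustration. Under those assumptions the penalised estimate has a closed form, and the gradient of the negative log-posterior should vanish there.

```python
import numpy as np

def ridge_map(X, y, lam, sigma2=1.0):
    """Closed-form maximiser of l_x(theta) - lam * ||theta||^2.

    Assumes y ~ Normal(X @ theta, sigma2 * I), so that
    l_x(theta) = -||y - X theta||^2 / (2 * sigma2) + const.
    """
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * sigma2 * lam * np.eye(m), X.T @ y)

def neg_log_posterior_grad(theta, X, y, lam, sigma2=1.0):
    """Gradient of -[l_x(theta) + ln pi(theta | lam)] under the
    Normal(0, 1 / (2 * lam)) prior, whose log-kernel is -lam * theta_k^2."""
    return -(X.T @ (y - X @ theta)) / sigma2 + 2.0 * lam * theta
```

Calling `neg_log_posterior_grad(ridge_map(X, y, lam), X, y, lam)` on any full-rank design should return a vector that is zero up to floating-point error, which is the sense in which the ridge estimate and the MAP estimate coincide for fixed $\lambda$.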

Now a frequentist would optimize the tuning parameter by cross validation. Is there a Bayesian equivalent of doing so, and is it used at all?

So long as the frequentist method can be posed as an optimisation problem (rather than, say, including a hypothesis test or something like this), there will be a Bayesian analogy using an equivalent prior. Just as frequentists may treat the tuning parameter $\lambda$ as unknown and estimate it from the data, the Bayesian may similarly treat the hyperparameter $\lambda$ as unknown. In a full Bayesian analysis this would involve giving the hyperparameter its own prior and finding the posterior maximum under this prior, which would be analogous to maximising the following objective function:

$\begin{equation} \begin{aligned} H_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - h(\lambda) \\[6pt] &= \ln \Big( L_\mathbf{x}(\theta) \cdot \exp ( -w(\theta|\lambda)) \cdot \exp ( -h(\lambda)) \Big) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda)}{\iint L_\mathbf{x}(\theta) \pi (\theta|\lambda) \pi (\lambda) \, d\theta \, d\lambda} \Bigg) + \text{const} \\[6pt] &= \ln \pi(\theta, \lambda|\mathbf{x}) + \text{const}. \\[6pt] \end{aligned} \end{equation}$

This method is indeed used in Bayesian analysis in cases where the analyst is not comfortable choosing a specific hyperparameter for their prior, and seeks to make the prior more diffuse by treating the hyperparameter as unknown and giving it a distribution. (Note that this is just an implicit way of giving a more diffuse prior to the parameter of interest $\theta$.)

(Comment from statslearner2 below) I'm looking for numerically equivalent MAP estimates. For instance, for a fixed-penalty ridge there is a Gaussian prior that gives me a MAP estimate exactly equal to the ridge estimate. Now, for k-fold CV ridge, what is the hyper-prior that would give me a MAP estimate similar to the CV-ridge estimate?
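To make the joint maximisation over $(\theta, \lambda)$ concrete, here is a rough sketch for the ridge case. Everything specific in it is an assumption for illustration: a Gaussian likelihood with known `sigma2`, a Gamma(`a`, `b`) hyperprior on $\lambda$ (so $h(\lambda) = -(a-1)\ln\lambda + b\lambda$), and hypothetical function names. Because $\lambda$ is now unknown, the sketch keeps the $\lambda$-dependent normalising term of the normal prior rather than only its kernel. It alternates two closed-form updates (coordinate ascent).

```python
import numpy as np

def joint_log_posterior(theta, lam, X, y, a, b, sigma2=1.0):
    """ln pi(theta, lam | x) up to an additive constant (assumed model)."""
    m = theta.size
    loglik = -0.5 * np.sum((y - X @ theta) ** 2) / sigma2
    # Normal(0, 1/(2*lam)) prior, including its lam-dependent normaliser.
    log_prior_theta = 0.5 * m * np.log(lam / np.pi) - lam * np.sum(theta ** 2)
    # Gamma(a, b) hyperprior on lam (shape a, rate b), up to a constant.
    log_hyperprior = (a - 1.0) * np.log(lam) - b * lam
    return loglik + log_prior_theta + log_hyperprior

def joint_map(X, y, a=2.0, b=1.0, sigma2=1.0, n_iter=50):
    """Coordinate ascent on (theta, lam); returns the objective history too."""
    m = X.shape[1]
    lam = 1.0
    history = []
    for _ in range(n_iter):
        # Given lam, theta has the ridge closed form.
        theta = np.linalg.solve(
            X.T @ X + 2.0 * sigma2 * lam * np.eye(m), X.T @ y
        )
        # Given theta, the objective is concave in lam with this maximiser.
        lam = (0.5 * m + a - 1.0) / (np.sum(theta ** 2) + b)
        history.append(joint_log_posterior(theta, lam, X, y, a, b, sigma2))
    return theta, lam, history
```

Each update maximises the objective over one block with the other held fixed, so the recorded objective values should never decrease. The resulting $\hat{\lambda}$ depends on the choice of `a` and `b`, which is one way to see the sensitivity to the hyperprior.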

Before proceeding to look at $K$ -fold cross-validation, it is first worth noting that, mathematically, the maximum a posteriori (MAP) method is simply an optimisation of a function of the parameter $\theta$ and the data $\mathbf{x}$ . If you are willing to allow improper priors then the scope encapsulates any optimisation problem involving a function of these variables. Thus, any frequentist method that can be framed as a single optimisation problem of this kind has a MAP analogy, and any frequentist method that cannot be framed as a single optimisation of this kind does not have a MAP analogy.

In the above form of model, involving a penalty function with a tuning parameter, $K$-fold cross-validation is commonly used to estimate the tuning parameter $\lambda$. For this method you partition the data vector $\mathbf{x}$ into $K$ sub-vectors $\mathbf{x}_1,...,\mathbf{x}_K$. For each sub-vector $k=1,...,K$ you fit the model with the "training" data $\mathbf{x}_{-k}$ and then measure the fit of the model with the "testing" data $\mathbf{x}_k$. In each fit you get an estimator for the model parameters, which then gives you predictions of the testing data, which can then be compared against the actual testing data to give a measure of "loss":

$\begin{matrix} \text{Estimator} & & \hat{\theta}(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Predictions} & & \hat{\mathbf{x}}_k(\mathbf{x}_{-k}, \lambda), \\[6pt] \text{Testing loss} & & \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda). \\[6pt] \end{matrix}$

The loss measures for each of the $K$ "folds" can then be aggregated to get an overall loss measure for the cross-validation:

$\mathscr{L}(\mathbf{x}, \lambda) = \sum_k \mathscr{L}_k(\hat{\mathbf{x}}_k, \mathbf{x}_k| \mathbf{x}_{-k}, \lambda)$

One then estimates the tuning parameter by minimising the overall loss measure:

$\hat{\lambda} \equiv \hat{\lambda}(\mathbf{x}) \equiv \underset{\lambda}{\text{arg min }} \mathscr{L}(\mathbf{x}, \lambda).$
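The cross-validation steps above can be sketched as follows for ridge regression. This is only an illustration under assumptions that the text does not fix: squared-error testing loss, a Gaussian likelihood with known `sigma2`, a finite grid of candidate $\lambda$ values in place of a continuous search, and hypothetical function names.

```python
import numpy as np

def ridge_map(X, y, lam, sigma2=1.0):
    """Estimator theta_hat(x_{-k}, lam): ridge / MAP fit for fixed lam."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * sigma2 * lam * np.eye(m), X.T @ y)

def kfold_cv_loss(X, y, lam, K=5, seed=0, sigma2=1.0):
    """Overall loss: sum over folds of the squared-error testing loss."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)  # same seed -> same folds for every lam
    folds = np.array_split(rng.permutation(n), K)
    total = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        theta_hat = ridge_map(X[train], y[train], lam, sigma2)  # fit on x_{-k}
        pred = X[test_idx] @ theta_hat                          # predict x_k
        total += np.sum((y[test_idx] - pred) ** 2)              # testing loss
    return total

def select_lambda(X, y, grid, K=5, seed=0, sigma2=1.0):
    """lam_hat = arg min over the grid of the overall CV loss."""
    grid = np.asarray(grid, dtype=float)
    losses = np.array([kfold_cv_loss(X, y, lam, K, seed, sigma2) for lam in grid])
    return grid[int(np.argmin(losses))], losses
```

Reusing the same fold assignment for every candidate $\lambda$ keeps the loss comparable across the grid; with different random folds per candidate, part of the difference in loss would be partition noise.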

We can see that this is an optimisation problem, and so we now have two separate optimisation problems (i.e., the one described in the sections above for $\theta$, and the one described here for $\lambda$). Since the latter optimisation does not involve $\theta$, we can combine these optimisations into a single problem, with some technicalities that I discuss below. To do this, consider the optimisation problem with objective function:

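$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda), \\[6pt] \end{aligned} \end{equation}$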

where $\delta > 0$ is a weighting value on the tuning-loss. As $\delta \rightarrow \infty$ the weight on optimisation of the tuning-loss becomes infinite and so the optimisation problem yields the estimated tuning parameter from $K$ -fold cross-validation (in the limit). The remaining part of the objective function is the standard objective function conditional on this estimated value of the tuning parameter. Now, unfortunately, taking $\delta = \infty$ screws up the optimisation problem, but if we take $\delta$ to be a very large (but still finite) value, we can approximate the combination of the two optimisation problems up to arbitrary accuracy.
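A small numerical illustration of this limiting argument, again only a sketch under assumptions: ridge penalty $w(\theta|\lambda) = \lambda \sum_k \theta_k^2$, Gaussian likelihood with known `sigma2`, squared-error cross-validation loss, a finite grid for $\lambda$, and hypothetical function names. For each candidate $\lambda$ the objective is first maximised over $\theta$ in closed form, and the combined objective is then compared across the grid.

```python
import numpy as np

def ridge_map(X, y, lam, sigma2=1.0):
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + 2.0 * sigma2 * lam * np.eye(m), X.T @ y)

def cv_loss(X, y, lam, K=5, seed=0, sigma2=1.0):
    """Overall K-fold squared-error loss for a given lam (fixed folds)."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    total = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        theta_hat = ridge_map(X[train], y[train], lam, sigma2)
        total += np.sum((y[test_idx] - X[test_idx] @ theta_hat) ** 2)
    return total

def profiled_objective(X, y, lam, delta, sigma2=1.0):
    """max over theta of l_x(theta) - w(theta|lam), minus delta * CV loss."""
    theta = ridge_map(X, y, lam, sigma2)
    loglik = -0.5 * np.sum((y - X @ theta) ** 2) / sigma2
    penalty = lam * np.sum(theta ** 2)
    return loglik - penalty - delta * cv_loss(X, y, lam, sigma2=sigma2)

def combined_argmax(X, y, grid, delta, sigma2=1.0):
    """Grid value of lam that maximises the combined objective."""
    values = [profiled_objective(X, y, lam, delta, sigma2) for lam in grid]
    return grid[int(np.argmax(values))]
```

On a finite grid, a sufficiently large `delta` makes `combined_argmax` agree with the cross-validated choice. With `delta = 0` the same objective simply picks the smallest $\lambda$ on the grid, since the penalty term can only lower it, which shows that the loss term is what carries the cross-validation information.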

From the above analysis we can see that it is possible to form a MAP analogy to the model-fitting and $K$-fold cross-validation process. This is not an exact analogy, but it is a close analogy, up to arbitrary accuracy. It is also important to note that the MAP analogy no longer shares the same likelihood function as the original problem, since the loss function depends on the data and is thus absorbed as part of the likelihood rather than the prior. In fact, the full analogy is as follows:

$\begin{equation} \begin{aligned} \mathcal{H}_\mathbf{x}(\theta, \lambda) &= \ell_\mathbf{x}(\theta) - w(\theta|\lambda) - \delta \mathscr{L}(\mathbf{x}, \lambda) \\[6pt] &= \ln \Bigg( \frac{L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda)}{\iint L_\mathbf{x}^*(\theta, \lambda) \pi (\theta, \lambda) \, d\theta \, d\lambda} \Bigg) + \text{const}, \\[6pt] \end{aligned} \end{equation}$

where $L_\mathbf{x}^*(\theta, \lambda) \propto \exp( \ell_\mathbf{x}(\theta) - \delta \mathscr{L}(\mathbf{x}, \lambda))$ and $\pi (\theta, \lambda) \propto \exp( -w(\theta|\lambda))$ , with a fixed (and very large) hyper-parameter $\delta$ .

$^\dagger$ This gives an improper prior in cases where the penalty does not correspond to the logarithm of a sigma-finite density.

— Reinstate Monica

Ok +1 already, but for the bounty I'm looking for these more precise answers.

— statslearner2

1. I do not get how (since frequentists generally use classical hypothesis tests, etc., which have no Bayesian equivalent) connects to the rest of what I or you are saying; parameter tuning has nothing to do with hypothesis tests, or does it? 2. Do I understand you correctly that there is no Bayesian equivalent to frequentist regularized estimation when the tuning parameter is selected by cross validation? What about empirical Bayes that amoeba mentions in the comments to the OP?

— Richard Hardy

3. Since regularization with cross validation seems to be quite effective for, say, prediction, doesn't point 2. suggest that the Bayesian approach is somehow inferior?

— Richard Hardy

@Ben, thanks for your explicit answer and the subsequent clarifications. You have once again done a wonderful job! Regarding 3., yes, it was quite a jump; it certainly is not a strict logical conclusion. But looking at your points w.r.t. 2. (that a Bayesian method can approximate the frequentist penalized optimization with cross validation), I no longer think that Bayesian must be "inferior". The last quibble on my side is, could you perhaps explain how the last, complicated formula could arise in practice in the Bayesian paradigm? Is it something people would normally use or not?

— Richard Hardy

@Ben (ctd) My problem is that I know little about Bayes. Once it gets technical, I may easily lose the perspective. So I wonder whether this complicated analogy (the last formula) is something that is just a technical possibility or rather something that people routinely use. In other words, I am interested in whether the idea behind cross validation (here in the context of penalized estimation) is resounding in the Bayesian world, whether its advantages are utilized there. Perhaps this could be a separate question, but a short description will suffice for this particular case.

— Richard Hardy

Indeed, most penalized regression methods correspond to placing a particular type of prior on the regression coefficients. For example, you get the LASSO using a Laplace prior, and the ridge using a normal prior. The tuning parameters are the “hyperparameters” under the Bayesian formulation, and you can place an additional prior on them in order to estimate them; for example, in the case of the ridge it is often assumed that the inverse variance of the normal distribution has a $\chi^2$ prior. However, as one would expect, the resulting inferences can be sensitive to the choice of the prior distributions for these hyperparameters. For example, for the horseshoe prior there are some theoretical results suggesting that the prior on the hyperparameters should reflect the number of non-zero coefficients you expect to have.

A nice overview of the links between penalized regression and Bayesian priors is given, for example, by Mallick and Yi.

— Dimitris Rizopoulos

Thank you for your answer! The linked paper is quite readable, which is nice.

— Richard Hardy

This does not answer the question. Can you elaborate on how the hyper-prior relates to k-fold CV?

— statslearner2