How to perform orthogonal regression (total least squares) via PCA?



I often use lm() in my research to perform a linear regression of $y$ on $x$. The function returns a coefficient $\beta$ such that

$$y = \beta x.$$

Today I learned about total least squares, and that the princomp() function (principal component analysis, PCA) can be used to perform it. It should be better for me (more accurate). I ran a test using princomp(), e.g.:

r <- princomp( ~ x + y)

My problem is how to interpret the output. How do I get the regression coefficient? By "coefficient" I mean the number $\beta$ that I have to multiply the $x$ value by to get a number close to $y$.


One moment guys, I'm a bit confused. Look at zoonek2.free.fr/UNIX/48_R/09.html. This is called PCA (Principal Component Analysis, aka "orthogonal regression" or "perpendicular sums of squares" or "total least squares"), so I think we are talking about TLS with princomp(), no?
Dail

No; those are two different things, see the Wikipedia article about PCA. The fact that it is used here is a hack (I don't know how exact it is, but I'm going to check); that's why the extraction of the coefficients is so complex.

A related question: stats.stackexchange.com/questions/2691/… and a blog post referenced by one of the answers: cerebralmastication.com/2010/09/…
Jonathan

Answers:



Ordinary least squares vs. total least squares

Let us first consider the simplest case of a single predictor variable $x$; for simplicity, let both $x$ and $y$ be centered, i.e. have zero mean. The figure below illustrates the difference between the two fits:

[Figure: OLS vs. TLS]

OLS fits the equation $y = \beta x$ by minimizing squared distances between observed values of $y$ and predicted values $\hat y$. TLS fits the same equation by minimizing squared distances between the $(x, y)$ points and their projection onto the line. In this simplest case the TLS line is simply the first principal component of the 2D data. To find $\beta$, do PCA on the $(x, y)$ points, i.e. construct the $2\times 2$ covariance matrix $\boldsymbol\Sigma$ and find its first eigenvector $\mathbf v = (v_x, v_y)$; then $\beta = v_y/v_x$.

In Matlab:

 v = pca([x y]);    % x and y are centered column vectors
 beta = v(2,1)/v(1,1);

In R:

 v <- prcomp(cbind(x,y))$rotation
 beta <- v[2,1]/v[1,1]

By the way, this will yield the correct slope even if $x$ and $y$ were not centered (because built-in PCA functions automatically perform centering). To recover the intercept, compute $\beta_0 = \bar y - \beta \bar x$.
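
For instance, a quick end-to-end check in R (my own simulated data; the intercept and slope values are arbitrary illustration choices):

set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)   # y is roughly 2 + 3*x plus noise

v     <- prcomp(cbind(x, y))$rotation   # prcomp centers the data itself
beta  <- v[2, 1] / v[1, 1]              # TLS slope
beta0 <- mean(y) - beta * mean(x)       # intercept: beta_0 = ybar - beta * xbar
c(beta0, beta)                          # roughly (2, 3)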

OLS vs. TLS, multiple regression

Given a dependent variable $y$ and many independent variables $x_i$ (again, all centered for simplicity), regression fits an equation

$$y = \beta_1 x_1 + \dots + \beta_p x_p.$$
OLS does the fit by minimizing the squared errors between observed values of $y$ and predicted values $\hat y$. TLS does the fit by minimizing the squared distances between the observed $(\mathbf x, y) \in \mathbb R^{p+1}$ points and the closest points on the regression plane/hyperplane.

Note that there is no "regression line" anymore! The equation above specifies a hyperplane: it is a 2D plane if there are two predictors, a 3D hyperplane if there are three predictors, etc. So the solution above does not work: we cannot get the TLS solution by taking the first PC only (which is a line). Still, the solution can easily be obtained via PCA.

As before, PCA is performed on the $(\mathbf x, y)$ points. This yields $p+1$ eigenvectors in the columns of $\mathbf V$. The first $p$ eigenvectors define the $p$-dimensional hyperplane $\mathcal H$ that we need; the last eigenvector (number $p+1$), $\mathbf v_{p+1}$, is orthogonal to it. The question is how to transform the basis of $\mathcal H$ given by the first $p$ eigenvectors into the $\boldsymbol\beta$ coefficients.

Observe that if we set $x_i = 0$ for all $i \ne k$ and only $x_k = 1$, then $\hat y = \beta_k$, i.e. the vector
$$(0, \dots, 1, \dots, \beta_k) \in \mathcal H$$
lies in the hyperplane $\mathcal H$. On the other hand, we know that
$$\mathbf v_{p+1} = (v_1, \dots, v_{p+1}) \perp \mathcal H$$
is orthogonal to it, i.e. their dot product must be zero:
$$v_k + \beta_k v_{p+1} = 0 \quad\Longrightarrow\quad \beta_k = -v_k / v_{p+1}.$$

In Matlab:

 v = pca([X y]);    % X is a centered n-by-p matrix, y is an n-by-1 column vector
 beta = -v(1:end-1,end)/v(end,end);

In R:

 v <- prcomp(cbind(X,y))$rotation
 beta <- -v[-ncol(v),ncol(v)] / v[ncol(v),ncol(v)]

Again, this will yield the correct slopes even if $\mathbf x$ and $y$ were not centered (because built-in PCA functions automatically perform centering). To recover the intercept, compute $\beta_0 = \bar y - \bar{\mathbf x}^\top \boldsymbol\beta$.

As a sanity check, notice that this solution coincides with the previous one in the case of only a single predictor $x$. Indeed, the $(x, y)$ space is then 2D, and so, given that the first PCA eigenvector is orthogonal to the second (last) one, $v^{(1)}_y / v^{(1)}_x = -v^{(2)}_x / v^{(2)}_y$.
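
A quick numerical illustration of that equivalence in R (simulated data of my choosing):

set.seed(1)
x <- rnorm(200); y <- 1.5 * x + rnorm(200)
v <- prcomp(cbind(x, y))$rotation
v[2, 1] / v[1, 1]    # slope from the first eigenvector
-v[1, 2] / v[2, 2]   # slope from the second (last) eigenvector: same number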

Closed form solution for TLS

Surprisingly, it turns out that there is a closed-form expression for $\boldsymbol\beta$. The argument below is taken from Sabine van Huffel's book "The Total Least Squares Problem" (section 2.3.2).

Let $\mathbf X$ and $\mathbf y$ be the centered data matrices. The last PCA eigenvector $\mathbf v_{p+1}$ is an eigenvector of the covariance matrix of $[\mathbf X\ \mathbf y]$, and hence of the scatter matrix $[\mathbf X\ \mathbf y]^\top [\mathbf X\ \mathbf y]$; let $\sigma^2_{p+1}$ denote the corresponding eigenvalue of the scatter matrix. If $\mathbf v_{p+1}$ is an eigenvector, then so is any rescaling of it, in particular $-\mathbf v_{p+1} / v_{p+1}$ (division by its last component), which by the formula above equals $(\boldsymbol\beta\ \ {-1})^\top$. Writing down the eigenvector equation:

$$\begin{pmatrix}\mathbf X^\top\mathbf X & \mathbf X^\top\mathbf y \\ \mathbf y^\top\mathbf X & \mathbf y^\top\mathbf y\end{pmatrix}\begin{pmatrix}\boldsymbol\beta \\ -1\end{pmatrix} = \sigma^2_{p+1}\begin{pmatrix}\boldsymbol\beta \\ -1\end{pmatrix},$$
and computing the product in the first block row, we immediately get that
$$\boldsymbol\beta_\text{TLS} = (\mathbf X^\top\mathbf X - \sigma^2_{p+1}\mathbf I)^{-1}\mathbf X^\top\mathbf y,$$
which is strongly reminiscent of the familiar OLS expression
$$\boldsymbol\beta_\text{OLS} = (\mathbf X^\top\mathbf X)^{-1}\mathbf X^\top\mathbf y.$$
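
Here is a sketch of this closed-form estimator in R, taking $\sigma^2_{p+1}$ as the squared smallest singular value of the centered $[\mathbf X\ \mathbf y]$; the function name and test data are my own, and the comparison against the PCA recipe above is just a sanity check:

tls_closed_form <- function(X, y) {
  X <- scale(X, scale = FALSE)      # center predictors
  y <- y - mean(y)                  # center response
  s <- min(svd(cbind(X, y))$d)      # smallest singular value of [X y]
  # beta_TLS = (X'X - sigma_{p+1}^2 I)^{-1} X'y
  solve(crossprod(X) - s^2 * diag(ncol(X)), crossprod(X, y))
}

# Compare with the PCA-based recipe (simulated data):
set.seed(7)
X <- matrix(rnorm(300), ncol = 3)
y <- X %*% c(1, -2, 0.5) + rnorm(100)
v <- prcomp(cbind(X, y))$rotation
cbind(tls_closed_form(X, y), -v[-4, 4] / v[4, 4])  # the two columns agree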

Multivariate multiple regression

The same formula can be generalized to the multivariate case, but even defining what multivariate TLS does would require some algebra; see the Wikipedia article on TLS. Multivariate OLS regression is equivalent to a bunch of univariate OLS regressions for each dependent variable, but in the TLS case this is not so.
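
To illustrate the OLS half of that claim (a small sketch with made-up data), a multivariate fit in R returns exactly the columns of the separate univariate fits:

set.seed(3)
X <- matrix(rnorm(200), ncol = 2)
Y <- X %*% matrix(c(1, 2, -1, 0.5), ncol = 2) + matrix(rnorm(200), ncol = 2)
coef(lm(Y ~ X))                                     # multivariate OLS
cbind(coef(lm(Y[, 1] ~ X)), coef(lm(Y[, 2] ~ X)))   # identical columns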


I do not know R, but still wanted to provide R snippets for future reference. There are many people here proficient in R. Please feel free to edit my snippets if needed! Thank you.
amoeba says Reinstate Monica

Nice post, but if I may ask, what guarantees that the vector $(0, \dots, 1, \dots, \beta_k)$ lies in the hyperplane?
JohnK

@JohnK, I am not sure what exactly is unclear. As I wrote, let all $x_i$ be equal to zero apart from $x_k = 1$. Then if you plug this into $y = \sum_j \beta_j x_j$, you will get $y = \beta_k \cdot 1 = \beta_k$. So the point $(0, \dots, 1, \dots, \beta_k)$ lies on the hyperplane defined by the equation $y = \sum_j \beta_j x_j$.
amoeba says Reinstate Monica
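
(One numerical way to see this, with simulated data of my own choosing: that vector has zero dot product with the last eigenvector.)

set.seed(2)
X <- matrix(rnorm(300), ncol = 3)
y <- X %*% c(1, -2, 0.5) + rnorm(100)
v <- prcomp(cbind(X, y))$rotation
beta <- -v[-4, 4] / v[4, 4]
# For k = 2: the point (0, 1, 0, beta_2) lies in the fitted hyperplane,
# so its dot product with the last eigenvector is (numerically) zero
sum(c(0, 1, 0, beta[2]) * v[, 4])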

I seem to have misread that part but now it is clear. Thanks for the clarification too.
JohnK

In R, you might prefer "eigen(cov(cbind(x, y)))$vectors" over "prcomp(cbind(x, y))$rotation" because the former is much faster for larger vectors.
Thomas Browne
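
(Both routes do give the same slope; a quick check with arbitrary simulated data. Note that eigen() may flip eigenvector signs, but the ratio of components is unaffected.)

set.seed(4)
x <- rnorm(1e5); y <- 2 * x + rnorm(1e5)
v1 <- prcomp(cbind(x, y))$rotation
v2 <- eigen(cov(cbind(x, y)))$vectors   # eigenvalues sorted in decreasing order
c(v1[2, 1] / v1[1, 1], v2[2, 1] / v2[1, 1])  # same slope either way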


Based on the naive GNU Octave implementation found here, something like this might (grain of salt, it's late) work.

tls <- function(A, b){
  # Total least squares via SVD of the augmented matrix [A b];
  # assumes A and b are centered and b is a single column (so VBB is a scalar)
  n <- ncol(A)
  C <- cbind(A, b)                          # augmented data matrix
  V <- svd(C)$v                             # right singular vectors, decreasing singular values
  VAB <- V[1:n, (n+1):ncol(V)]              # top part of the last column
  VBB <- V[(n+1):nrow(V), (n+1):ncol(V)]    # bottom part of the last column
  return(-VAB/VBB)                          # TLS coefficients
}
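
As a quick usage check (simulated data of my choosing), the function reproduces the slopes from the prcomp recipe in the accepted answer:

set.seed(5)
A <- scale(matrix(rnorm(200), ncol = 2), scale = FALSE)  # centered predictors
b <- A %*% c(2, -1) + rnorm(100)
b <- b - mean(b)                                         # centered response
tls(A, b)
v <- prcomp(cbind(A, b))$rotation
-v[-3, 3] / v[3, 3]                                      # same slopes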


princomp runs principal component analysis rather than total least squares regression. As far as I know, there is no R function or package that does TLS; at most there is Deming regression in the MethComp package.
Still, please treat this as a hint that it is most likely not worth the effort.


I thought Deming in the MethComp package was TLS - what's the difference?
mark999

You must give it the ratio of the error variances on x and y; pure TLS corresponds to fixing this ratio at 1.
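
For reference, here is a minimal base-R sketch of the Deming slope using the standard closed form (my own illustration, not MethComp code; delta is the ratio of error variances in y versus x, and delta = 1 recovers TLS/orthogonal regression):

deming_slope <- function(x, y, delta = 1) {
  sxx <- var(x); syy <- var(y); sxy <- cov(x, y)
  (syy - delta * sxx + sqrt((syy - delta * sxx)^2 + 4 * delta * sxy^2)) /
    (2 * sxy)
}

# With delta = 1 this matches the PCA-based TLS slope:
set.seed(9)
x <- rnorm(500); y <- 3 * x + rnorm(500)
v <- prcomp(cbind(x, y))$rotation
c(deming_slope(x, y), v[2, 1] / v[1, 1])  # agree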
Licensed under cc by-sa 3.0 with attribution required.