Question about how to normalize regression coefficients



I am not sure whether "normalize" is the right word to use here, but I will try my best to explain what I am asking. The estimator used here is least squares.

Suppose you have $y = \beta_0 + \beta_1 x_1$. You can mean-center it as $y = \beta_0' + \beta_1 x_1'$, where $\beta_0' = \beta_0 + \beta_1 \bar{x}_1$ and $x_1' = x - \bar{x}$, so that $\beta_0'$ no longer has any influence on estimating $\beta_1$.

By this I mean that $\hat\beta_1$ in $y = \beta_1 x_1'$ is equivalent to $\hat\beta_1$ in $y = \beta_0 + \beta_1 x_1$. We have reduced the equation to make the least squares calculation easier.

How do you apply this method in general? I now have the model $y = \beta_1 e^{x_1 t} + \beta_2 e^{x_2 t}$, and I am trying to reduce it to $y = \beta_1 x'$.


What type of data are you analyzing, and why do you want to remove the covariate $e^{x_1 t}$ from your model? Also, is there a reason you are removing the intercept? If you mean-center the data, the slope will be the same in the model with or without the intercept, but the model with the intercept will fit your data better.
caburke

@caburke I am not concerned about the fit of the model, because after I have calculated $\beta_1$ and $\beta_2$ I can put them back into the model. The point of this exercise is to estimate $\beta_1$. By reducing the original equation to just $y = \beta_1 x'$, the least squares calculation becomes easier ($x'$ is part of what I am trying to find out; it may include $e^{x_1 t}$). I am trying to learn the mechanics; this is a question from Tukey's book.
Saber CN

@ca The observation at the end of your comment is puzzling. It cannot possibly apply to the nonlinear expressions--they contain nothing that can reasonably be considered a "slope"--but it is not correct in the OLS setting either: the fit for mean-centered data is precisely as good as the fit with an intercept. Saber, your model is ambiguous: which of $\beta_1, \beta_2, x_1, x_2, t$ are variables and which are parameters? What is the intended error structure? (And which of Tukey's books is the question from?)
whuber

@whuber This is from Tukey's book "Data analysis and regression: a second course in statistics" chapter 14A. $\beta_1, \beta_2$ are the parameters we are trying to estimate, $x_1, x_2$ are the variables each with n observations, $t$ I assume is the time variable associated with the observations, however it did not specify. The error should be normal and can be ignored for this question.
Saber CN

@whuber I was mostly referring to the first part of the post, but this was not clear in my comment. What I meant was that if you only mean-center $x$, and not $y$, as it seemed was being suggested in the OP, and then remove the intercept, then the fit will be worse, since it's not necessarily the case that $\bar{y} = 0$. Slope is obviously not a good term for the coefficient in the model mentioned in the last line of the OP.
caburke

Answers:



Although I cannot do justice to the question here--that would require a small monograph--it may be helpful to recapitulate some key ideas.

The question

Let's begin by restating the question and using unambiguous terminology. The data consist of a list of ordered pairs $(t_i, y_i)$. Known constants $\alpha_1$ and $\alpha_2$ determine values $x_{1,i} = \exp(\alpha_1 t_i)$ and $x_{2,i} = \exp(\alpha_2 t_i)$. We posit a model in which

$$y_i = \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i$$

for constants $\beta_1$ and $\beta_2$ to be estimated. The $\varepsilon_i$ are random and--to a good approximation anyway--independent with a common variance (whose estimation is also of interest).

Background: linear "matching"

Mosteller and Tukey refer to the variables $x_1 = (x_{1,1}, x_{1,2}, \ldots)$ and $x_2$ as "matchers." They will be used to "match" the values of $y = (y_1, y_2, \ldots)$ in a specific way, which I will illustrate. More generally, let $y$ and $x$ be any two vectors in the same Euclidean vector space, with $y$ playing the role of "target" and $x$ that of "matcher". We contemplate systematically varying a coefficient $\lambda$ in order to approximate $y$ by the multiple $\lambda x$. The best approximation is obtained when $\lambda x$ is as close to $y$ as possible. Equivalently, the squared length of $y - \lambda x$ is minimized.
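For reference, the minimizing coefficient has a closed form whenever the matcher is not the zero vector. This is the standard calculus step, written here with $\langle\cdot,\cdot\rangle$ for the Euclidean inner product and $\hat\lambda$ for the minimizer (notation introduced only for this note):

$$\frac{d}{d\lambda}\,\lVert y - \lambda x\rVert^2 = -2\,\langle x,\ y - \lambda x\rangle = 0 \quad\Longrightarrow\quad \hat\lambda = \frac{\langle x, y\rangle}{\langle x, x\rangle}.$$

In particular, the residual vector $y - \hat\lambda x$ is orthogonal to the matcher $x$.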

One way to visualize this matching process is to make a scatterplot of $x$ and $y$ on which is drawn the graph of $x \to \lambda x$. The vertical distances between the scatterplot points and this graph are the components of the residual vector $y - \lambda x$; the sum of their squares is to be made as small as possible. Up to a constant of proportionality, these squares are the areas of circles centered at the points $(x_i, y_i)$ with radii equal to the residuals: we wish to minimize the sum of areas of all these circles.

Here is an example showing the optimal value of $\lambda$ in the middle panel:

[Figure: panels showing the matching of $y$ by $\lambda x$]

The points in the scatterplot are blue; the graph of $x \to \lambda x$ is a red line. This illustration emphasizes that the red line is constrained to pass through the origin $(0,0)$: it is a very special case of line fitting.
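As a minimal sketch of a single match, assuming made-up toy data (the values and variable names below are illustrative, not taken from Mosteller & Tukey), the optimal multiple can be computed directly and compared with a no-intercept `lm()` fit:

```r
# Hypothetical toy data (not from the original post).
set.seed(1)
x <- runif(20, 1, 5)                 # the matcher
y <- 2 * x + rnorm(20, sd = 0.5)     # the target

# Closed-form least-squares multiple of x: <x, y> / <x, x>.
lambda <- sum(x * y) / sum(x * x)

# The same coefficient from lm() with the intercept suppressed.
lambda.lm <- unname(coef(lm(y ~ x - 1)))

# The residual vector y - lambda * x is orthogonal to the matcher,
# so the inner product below is zero up to rounding error.
res <- y - lambda * x
c(lambda = lambda, lambda.lm = lambda.lm, inner.product = sum(res * x))
```

The `- 1` in the formula is what forces the fitted line through the origin.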

Multiple regression can be obtained by sequential matching

Returning to the setting of the question, we have one target $y$ and two matchers $x_1$ and $x_2$. We seek numbers $b_1$ and $b_2$ for which $y$ is approximated as closely as possible by $b_1 x_1 + b_2 x_2$, again in the least-distance sense. Arbitrarily beginning with $x_1$, Mosteller & Tukey match the remaining variables $x_2$ and $y$ to $x_1$. Write the residuals for these matches as $x_{2\cdot 1}$ and $y_{\cdot 1}$, respectively: the subscript $\cdot 1$ indicates that $x_1$ has been "taken out of" the variable.

We can write

$$y = \lambda_1 x_1 + y_{\cdot 1} \quad\text{and}\quad x_2 = \lambda_2 x_1 + x_{2\cdot 1}.$$

Having taken $x_1$ out of $x_2$ and $y$, we proceed to match the target residuals $y_{\cdot 1}$ to the matcher residuals $x_{2\cdot 1}$. The final residuals are $y_{\cdot 12}$. Algebraically, we have written

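$$\begin{aligned}
y_{\cdot 1} &= \lambda_3 x_{2\cdot 1} + y_{\cdot 12}; \text{ whence} \\
y &= \lambda_1 x_1 + y_{\cdot 1} = \lambda_1 x_1 + \lambda_3 x_{2\cdot 1} + y_{\cdot 12} = \lambda_1 x_1 + \lambda_3 (x_2 - \lambda_2 x_1) + y_{\cdot 12} \\
&= (\lambda_1 - \lambda_3 \lambda_2) x_1 + \lambda_3 x_2 + y_{\cdot 12}.
\end{aligned}$$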

This shows that the $\lambda_3$ in the last step is the coefficient of $x_2$ in a matching of $x_1$ and $x_2$ to $y$.
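One way to see why this decomposition really is the least-squares matching, sketched here with $\langle\cdot,\cdot\rangle$ denoting the Euclidean inner product and $y_{\cdot 12}$, $x_{2\cdot 1}$ the residuals defined above: the final residuals are orthogonal to both matchers, which is the condition (the normal equations) that characterizes the least-squares fit. By construction each match leaves residuals orthogonal to its matcher, so

$$\langle y_{\cdot 12},\, x_{2\cdot 1}\rangle = 0, \qquad \langle y_{\cdot 12},\, x_1\rangle = \langle y_{\cdot 1},\, x_1\rangle - \lambda_3 \langle x_{2\cdot 1},\, x_1\rangle = 0 - 0 = 0,$$

and therefore

$$\langle y_{\cdot 12},\, x_2\rangle = \lambda_2 \langle y_{\cdot 12},\, x_1\rangle + \langle y_{\cdot 12},\, x_{2\cdot 1}\rangle = 0.$$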

We could just as well have proceeded by first taking $x_2$ out of $x_1$ and $y$, producing $x_{1\cdot 2}$ and $y_{\cdot 2}$, and then taking $x_{1\cdot 2}$ out of $y_{\cdot 2}$, yielding a different set of residuals $y_{\cdot 21}$. This time, the coefficient of $x_1$ found in the last step--let's call it $\mu_3$--is the coefficient of $x_1$ in a matching of $x_1$ and $x_2$ to $y$.

Finally, for comparison, we might run a multiple (ordinary least squares) regression of $y$ against $x_1$ and $x_2$. Let those residuals be $y_{\cdot lm}$. It turns out that the coefficients in this multiple regression are precisely the coefficients $\mu_3$ and $\lambda_3$ found previously and that all three sets of residuals, $y_{\cdot 12}$, $y_{\cdot 21}$, and $y_{\cdot lm}$, are identical.
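The claim about the coefficients can be checked numerically. The following is a sketch only: the simulated values mimic the model in the question, and `match1` is a hypothetical helper name introduced for this illustration. It performs the three one-matcher fits, back-substitutes, and compares the result with `lm()`:

```r
# Simulated data in the spirit of the question (values are illustrative).
set.seed(17)
t.var <- 1:50
x1 <- exp(-0.1 * t.var)
x2 <- exp(0.025 * t.var)
y  <- 5 * x1 - 1 * x2 + rnorm(length(t.var), sd = 0.5)

# One-matcher least squares through the origin: coefficient and residuals.
# (match1 is a hypothetical helper, not a function from any package.)
match1 <- function(target, matcher) {
  lambda <- sum(matcher * target) / sum(matcher * matcher)
  list(lambda = lambda, resid = target - lambda * matcher)
}

m.y  <- match1(y,  x1)                 # y   = lambda1 * x1   + y.1
m.x2 <- match1(x2, x1)                 # x2  = lambda2 * x1   + x2.1
m.3  <- match1(m.y$resid, m.x2$resid)  # y.1 = lambda3 * x2.1 + y.12

# Back-substitute to express the fit in terms of x1 and x2.
b.seq <- c(x1 = m.y$lambda - m.3$lambda * m.x2$lambda, x2 = m.3$lambda)

fit <- lm(y ~ x1 + x2 - 1)             # multiple regression, no intercept
rbind(sequential = b.seq, lm = coef(fit))
```

The two rows of the printed matrix should agree up to floating-point error, and `m.3$resid` should agree with `resid(fit)` in the same sense.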

Depicting the process

None of this is new: it's all in the text. I would like to offer a pictorial analysis, using a scatterplot matrix of everything we have obtained so far.

[Figure: scatterplot matrix]

Because these data are simulated, we have the luxury of showing the underlying "true" values of $y$ on the last row and column: these are the values $\beta_1 x_1 + \beta_2 x_2$ without the error added in.

The scatterplots below the diagonal have been decorated with the graphs of the matchers, exactly as in the first figure. Graphs with zero slopes are drawn in red: these indicate situations where the matcher gives us nothing new; the residuals are the same as the target. Also, for reference, the origin (wherever it appears within a plot) is shown as an open red circle: recall that all possible matching lines have to pass through this point.

Much can be learned about regression through studying this plot. Some of the highlights are:

  • The matching of $x_2$ to $x_1$ (row 2, column 1) is poor. This is a good thing: it indicates that $x_1$ and $x_2$ are providing very different information; using both together will likely be a much better fit to $y$ than using either one alone.

  • Once a variable has been taken out of a target, it does no good to try to take that variable out again: the best matching line will be zero. See the scatterplots for $x_{2\cdot 1}$ versus $x_1$ or $y_{\cdot 1}$ versus $x_1$, for instance.

  • The values $x_1$, $x_2$, $x_{1\cdot 2}$, and $x_{2\cdot 1}$ have all been taken out of $y_{\cdot lm}$.

  • Multiple regression of $y$ against $x_1$ and $x_2$ can be achieved first by computing $y_{\cdot 1}$ and $x_{2\cdot 1}$. These scatterplots appear at (row, column) = $(8,1)$ and $(2,1)$, respectively. With these residuals in hand, we look at their scatterplot at $(4,3)$. These three one-variable regressions do the trick. As Mosteller & Tukey explain, the standard errors of the coefficients can be obtained almost as easily from these regressions, too--but that's not the topic of this question, so I will stop here.
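The second highlight (taking a variable out twice does nothing) is easy to check in isolation. This is a small sketch on hypothetical toy vectors, using a one-line function with the same name and behavior as the `take.out` helper defined in the code further down:

```r
# Hypothetical toy vectors (not from the original post).
set.seed(2)
x1 <- rnorm(30, mean = 3)
x2 <- 0.5 * x1 + rnorm(30)

# "Take out" a matcher: residuals of a least-squares fit through the origin.
take.out <- function(y, x) resid(lm(y ~ x - 1))

x2.1 <- take.out(x2, x1)

# Matching x2.1 to x1 a second time: the coefficient is zero up to rounding...
coef(lm(x2.1 ~ x1 - 1))
# ...so taking x1 out again leaves the residuals unchanged.
all.equal(take.out(x2.1, x1), x2.1)
```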

Code

These data were (reproducibly) created in R with a simulation. The analyses, checks, and plots were also produced with R. This is the code.

#
# Simulate the data.
#
set.seed(17)
t.var <- 1:50                                    # The "times" t[i]
x <- exp(t.var %o% c(x1=-0.1, x2=0.025) )        # The two "matchers" x[, 1] and x[, 2]
beta <- c(5, -1)                                 # The (unknown) coefficients
sigma <- 1/2                                     # Standard deviation of the errors
error <- sigma * rnorm(length(t.var))            # Simulated errors
y <- (y.true <- as.vector(x %*% beta)) + error   # True and simulated y values
data <- data.frame(t.var, x, y, y.true)

par(col="Black", bty="o", lty=0, pch=1)
pairs(data)                                      # Get a close look at the data
#
# Take out the various matchers.
#
take.out <- function(y, x) {fit <- lm(y ~ x - 1); resid(fit)}
data <- transform(transform(data, 
  x2.1 = take.out(x2, x1),
  y.1 = take.out(y, x1),
  x1.2 = take.out(x1, x2),
  y.2 = take.out(y, x2)
), 
  y.21 = take.out(y.2, x1.2),
  y.12 = take.out(y.1, x2.1)
)
data$y.lm <- resid(lm(y ~ x - 1))               # Multiple regression for comparison
#
# Analysis.
#
# Reorder the dataframe (for presentation):
data <- data[c(1:3, 5:12, 4)]

# Confirm that the three ways to obtain the fit are the same:
pairs(subset(data, select=c(y.12, y.21, y.lm)))

# Explore what happened:
panel.lm <- function (x, y, col=par("col"), bg=NA, pch=par("pch"),
   cex=1, col.smooth="red",  ...) {
  box(col="Gray", bty="o")
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))  {
    b <- coef(lm(y[ok] ~ x[ok] - 1))
    col0 <- ifelse(abs(b) < 10^-8, "Red", "Blue")
    lwd0 <- ifelse(abs(b) < 10^-8, 3, 2)
    abline(c(0, b), col=col0, lwd=lwd0)
  }
  points(x, y, pch = pch, col="Black", bg = bg, cex = cex)    
  points(matrix(c(0,0), nrow=1), col="Red", pch=1)
}
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y,  ...)
}
par(lty=1, pch=19, col="Gray")
pairs(subset(data, select=c(-t.var, -y.12, -y.21)), col="Gray", cex=0.8, 
   lower.panel=panel.lm, diag.panel=panel.hist)

# Additional interesting plots:
par(col="Black", pch=1)
#pairs(subset(data, select=c(-t.var, -x1.2, -y.2, -y.21)))
#pairs(subset(data, select=c(-t.var, -x1, -x2)))
#pairs(subset(data, select=c(x2.1, y.1, y.12)))

# Details of the variances, showing how to obtain multiple regression
# standard errors from the OLS matches.
norm <- function(x) sqrt(sum(x * x))
lapply(data, norm)
s <- summary(lm(y ~ x1 + x2 - 1, data=data))
c(s$sigma, s$coefficients["x1", "Std. Error"] * norm(data$x1.2)) # Equal
c(s$sigma, s$coefficients["x2", "Std. Error"] * norm(data$x2.1)) # Equal
c(s$sigma, norm(data$y.12) / sqrt(length(data$y.12) - 2))        # Equal

Could multiple regression of y against x1 and x2 still be achieved by first computing y.1 and x2.1 if x1 and x2 were correlated? Wouldn't it then make a big difference whether we sequentially regressed y on x1 and x2.1 or on x2 and x1.2 ? How does this relate to one regression equation with multiple explanatory variables?
miura

@miura, One of the leitmotifs of that chapter in Mosteller & Tukey is that when the $x_i$ are correlated, the partials $x_{i\cdot j}$ have low variances; because their variances appear in the denominator of a formula for the estimation variance of their coefficients, this implies the corresponding coefficients will have relatively uncertain estimates. That's a fact of the data, M&T say, and you need to recognize that. It makes no difference whether you start the regression with $x_1$ or $x_2$: compare y.21 to y.12 in my code.
whuber

I came across this today; here is what I think about the question by @miura. Think of a 2-dimensional space where y is to be projected as a combination of two vectors: y = ax1 + bx2 + res (= 0). Now think of y as a combination of 3 variables, y = ax1 + bx2 + cx3, where x3 = mx1 + nx2. Then certainly the order in which you choose your variables is going to affect the coefficients. The reason for this is that the minimum error here can be obtained by various combinations. However, in a few examples the minimum error can be obtained by only one combination, and that is where the order will not matter.
Gaurav Singhal

@whuber Can you elaborate on how this equation might be used for a multivariate regression that also has a constant term ? ie y = B1 * x1 + B2 * x2 + c ? It is not clear to me how the constant term can be derived. Also I understand in general what was done for the 2 variables, enough at least to replicate it in Excel. How can that be expanded to 3 variables ? x1, x2, x3. It seems clear that we would need to remove x3 first from y, x1, and x2. then remove x2 from x1 and y. But it is not clear to me how to then get the B3 term.
Fairly Nerdy

I have answered some of the questions I asked in the comment above. For a 3 variable regression, we would have 6 steps. Remove x1 from x2, from x3, and from y. Then remove x2,1 from x3,1 and from y1. Then remove x3,21 from y21. That results in 6 equations, each of which is of the form variable = lambda * different variable + residual. One of those equations has y as the first variable, and if you just keep substituting the other variables in, you get the equation you need.
Fairly Nerdy
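The six-step recipe above also covers a constant term if the constant is treated as one more matcher (a vector of ones); taking it out of a variable is the same as mean-centering that variable. The following is a sketch under that assumption, with made-up data and hypothetical names (`match1`, `one`, `c0`, `B1`, `B2`), not code from the answer:

```r
# Hypothetical toy data for y = c0 + B1 * x1 + B2 * x2 + error.
set.seed(3)
n   <- 40
x1  <- rnorm(n)
x2  <- 0.3 * x1 + rnorm(n)
y   <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n, sd = 0.3)
one <- rep(1, n)                       # the constant, treated as a matcher

# One-matcher least squares through the origin (hypothetical helper).
match1 <- function(target, matcher) {
  lambda <- sum(matcher * target) / sum(matcher * matcher)
  list(lambda = lambda, resid = target - lambda * matcher)
}

# Steps 1-3: take the constant out of x1, x2, and y (mean-centering).
a1 <- match1(x1, one); a2 <- match1(x2, one); ay <- match1(y, one)
# Steps 4-5: take centered x1 out of centered x2 and centered y.
b2 <- match1(a2$resid, a1$resid); by <- match1(ay$resid, a1$resid)
# Step 6: take what is left of x2 out of what is left of y.
cy <- match1(by$resid, b2$resid)

# Back-substitution, last matcher first.
B2 <- cy$lambda
B1 <- by$lambda - B2 * b2$lambda
c0 <- ay$lambda - B1 * a1$lambda - B2 * a2$lambda

rbind(sequential = c(c0, B1, B2), lm = unname(coef(lm(y ~ x1 + x2))))
```

Here `ay$lambda`, `a1$lambda`, and `a2$lambda` are just the sample means of y, x1, and x2, so the constant term comes out as the familiar mean of y minus B1 and B2 times the means of x1 and x2.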
Licensed under cc by-sa 3.0 with attribution required.