การวิเคราะห์องค์ประกอบหลักและการถดถอยในไพ ธ อน

11

ฉันกำลังพยายามหาวิธีที่จะสร้างผลงานใน Python ที่ฉันทำใน SAS การใช้ชุดข้อมูลนี้ซึ่งมีปัญหาความหลากหลายทางหลายระดับฉันต้องการทำการวิเคราะห์องค์ประกอบหลักใน Python ฉันได้ดู scikit-learn และ statsmodels แต่ฉันไม่แน่ใจว่าจะเอาท์พุทของพวกเขาและแปลงเป็นโครงสร้างผลลัพธ์เดียวกันกับ SAS สำหรับสิ่งหนึ่ง SAS ดูเหมือนว่าจะดำเนินการ PCA บนเมทริกซ์สหสัมพันธ์เมื่อคุณใช้PROC PRINCOMPแต่ส่วนใหญ่ (ทั้งหมด?) ของไลบรารี Python ดูเหมือนจะใช้ SVD

ในชุดข้อมูลคอลัมน์แรกคือตัวแปรตอบกลับและอีก 5 รายการถัดไปคือตัวแปรทำนายที่เรียกว่า pred1-pred5

ใน SAS เวิร์กโฟลว์ทั่วไปคือ:

/* Get the PCs */
proc princomp data=indata out=pcdata;
    var pred1 pred2 pred3 pred4 pred5;
run;

/* Standardize the response variable */
proc standard data=pcdata mean=0 std=1 out=pcdata2;
    var response;
run;

/* Compare some models */
proc reg data=pcdata2;
    Reg:     model response = pred1 pred2 pred3 pred4 pred5 / vif;
    PCa:     model response = prin1-prin5 / vif;
    PCfinal: model response = prin1 prin2 / vif;
run;
quit;

/* Use Proc PLS to to PCR Replacement - dropping pred5 */
/* This gets me my parameter estimates for the original data */
proc pls data=indata method=pcr nfac=2;
    model response = pred1 pred2 pred3 pred4 / solution;
run;
quit;

ฉันรู้ว่าขั้นตอนสุดท้ายใช้งานได้เพราะฉันเลือก PC1 และ PC2 ตามลำดับเท่านั้น

ดังนั้นใน Python สิ่งนี้เกี่ยวกับเท่าที่ฉันได้รับ:

import pandas as pd
import numpy  as np
from sklearn.decomposition.pca import PCA

source = pd.read_csv('C:/sourcedata.csv')

# Create a pandas DataFrame object
frame = pd.DataFrame(source)

# Make sure we are working with the proper data -- drop the response variable
cols = [col for col in frame.columns if col not in ['response']]
frame2 = frame[cols]

pca = PCA(n_components=5)
pca.fit(frame2)

จำนวนความแปรปรวนที่แต่ละพีซีอธิบายหรือไม่

print pca.explained_variance_ratio_

Out[190]:
array([  9.99997603e-01,   2.01265023e-06,   2.70712663e-07,
         1.11512302e-07,   2.40310191e-09])

พวกนี้คืออะไร? eigenvectors?

print pca.components_

Out[179]:
array([[ -4.32840645e-04,  -7.18123771e-04,  -9.99989955e-01,
         -4.40303223e-03,  -2.46115129e-05],
       [  1.00991662e-01,   8.75383248e-02,  -4.46418880e-03,
          9.89353169e-01,   5.74291257e-02],
       [ -1.04223303e-02,   9.96159390e-01,  -3.28435046e-04,
         -8.68305757e-02,  -4.26467920e-03],
       [ -7.04377522e-03,   7.60168675e-04,  -2.30933755e-04,
          5.85966587e-02,  -9.98256573e-01],
       [ -9.94807648e-01,  -1.55477793e-03,  -1.30274879e-05,
          1.00934650e-01,   1.29430210e-02]])

ค่าลักษณะเฉพาะเหล่านี้หรือไม่

print pca.explained_variance_

Out[180]:
array([  8.07640319e+09,   1.62550137e+04,   2.18638986e+03,
         9.00620474e+02,   1.94084664e+01])

ฉันรู้สึกว่าวิธีการได้รับจากผลลัพธ์ของ Python เป็นการสูญเสียเงินต้นที่สำคัญ (ใน Python) ไลบรารี Python ใด ๆ เติมช่องว่างให้เหมือนกับ SAS หรือไม่

เคล็ดลับใด ๆ ที่ชื่นชม ฉันเสียเปรียบนิดหน่อยจากการใช้ฉลากในการส่งออกของ SAS และฉันไม่ค่อยคุ้นเคยกับหมีแพนด้า, อ้วนท้วน, scipy หรือ scikit เรียนรู้

แก้ไข:

ดังนั้นดูเหมือนว่า sklearn จะไม่ทำงานโดยตรงกับดาต้าดาต้าแพนด้า สมมติว่าฉันแปลงเป็นอาร์เรย์ numpy:

npa = frame2.values
npa

นี่คือสิ่งที่ฉันได้รับ:

Out[52]:
array([[  8.45300000e+01,   4.20730000e+02,   1.99443000e+05,
          7.94000000e+02,   1.21100000e+02],
       [  2.12500000e+01,   2.73810000e+02,   4.31180000e+04,
          1.69000000e+02,   6.28500000e+01],
       [  3.38200000e+01,   3.73870000e+02,   7.07290000e+04,
          2.79000000e+02,   3.53600000e+01],
       ..., 
       [  4.71400000e+01,   3.55890000e+02,   1.02597000e+05,
          4.07000000e+02,   3.25200000e+01],
       [  1.40100000e+01,   3.04970000e+02,   2.56270000e+04,
          9.90000000e+01,   7.32200000e+01],
       [  3.85300000e+01,   3.73230000e+02,   8.02200000e+04,
          3.17000000e+02,   4.32300000e+01]])

ถ้าฉันเปลี่ยนcopyพารามิเตอร์ของ PCA ของ sklearn เป็นFalse,มันทำงานโดยตรงในอาร์เรย์ตามความคิดเห็นด้านล่าง

pca = PCA(n_components=5,copy=False)
pca.fit(npa)

npa

ตามผลลัพธ์มันดูเหมือนว่ามันจะแทนที่ค่าทั้งหมดnpaแทนการผนวกอะไรกับอาร์เรย์ npaตอนนี้มีคุณค่าอะไรบ้าง? คะแนนองค์ประกอบหลักสำหรับอาร์เรย์เดิมหรือไม่

Out[64]:
array([[  3.91846649e+01,   5.32456568e+01,   1.03614689e+05,
          4.06726542e+02,   6.59830027e+01],
       [ -2.40953351e+01,  -9.36743432e+01,  -5.27103110e+04,
         -2.18273458e+02,   7.73300268e+00],
       [ -1.15253351e+01,   6.38565684e+00,  -2.50993110e+04,
         -1.08273458e+02,  -1.97569973e+01],
       ..., 
       [  1.79466488e+00,  -1.15943432e+01,   6.76868901e+03,
          1.97265416e+01,  -2.25969973e+01],
       [ -3.13353351e+01,  -6.25143432e+01,  -7.02013110e+04,
         -2.88273458e+02,   1.81030027e+01],
       [ -6.81533512e+00,   5.74565684e+00,  -1.56083110e+04,
         -7.02734584e+01,  -1.18869973e+01]])

pca python scikit-learn

— ดินเหนียว
แหล่งที่มา

1

ใน Scikit-Learn แต่ละตัวอย่างจะถูกจัดเก็บเป็นแถวในเมทริกซ์ข้อมูลของคุณ PCA class ทำงานบนเมทริกซ์ข้อมูลโดยตรงเช่นจะดูแลการคำนวณเมทริกซ์ความแปรปรวนร่วมและจากนั้นก็ eigenvector สำหรับคำถาม 3 ข้อสุดท้ายใช่แล้วส่วนประกอบคือ eigenvectors ของเมทริกซ์ความแปรปรวนร่วมอธิบายอธิบายความแปรปรวนเป็นความแปรปรวนที่พีซีแต่ละเครื่องอธิบายและความแปรปรวนที่อธิบายควรสอดคล้องกับค่าลักษณะเฉพาะ

— เล่นแร่แปรธาตุ light

@lightalchemist ขอบคุณสำหรับการชี้แจง ด้วย sklearn คุณควรสร้างดาต้าเฟรมใหม่ก่อนดำเนินการ PCA หรือเป็นไปได้หรือไม่ที่จะส่งดาต้าโฟนแพนด้าที่ 'สมบูรณ์' และไม่มีการทำงานที่คอลัมน์ด้านซ้ายสุด (ตอบกลับ)?

— Clay

ฉันเพิ่มข้อมูลอีกเล็กน้อย ถ้าฉันแปลงเป็นอาร์เรย์ numpy ก่อนแล้วเรียกใช้ PCA ด้วยcopy=Falseฉันจะได้รับค่าใหม่ คะแนนเหล่านี้เป็นองค์ประกอบหลักหรือไม่

— Clay

ฉันไม่คุ้นเคยกับนุ่นดังนั้นฉันไม่ได้ตอบกลับส่วนหนึ่งของคำถามของคุณ เกี่ยวกับส่วนที่สองฉันไม่คิดว่าพวกเขาเป็นองค์ประกอบหลัก ฉันเชื่อว่าพวกเขาเป็นตัวอย่างข้อมูลดั้งเดิม แต่มีค่าเฉลี่ยลบออก อย่างไรก็ตามฉันไม่แน่ใจจริงๆเกี่ยวกับเรื่องนี้

— เล่นแร่แปรธาตุ light

16

Scikit-learn ไม่ได้มีการนำ PCA มาใช้ร่วมกับการถดถอยเช่นแพ็คเกจplsใน R แต่ฉันคิดว่าเราสามารถทำได้เหมือนด้านล่างหรือเลือก PLS regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression

%matplotlib inline

import seaborn as sns
sns.set_style('darkgrid')

df = pd.read_csv('multicollinearity.csv')
X = df.iloc[:,1:6]
y = df.response

Scikit เรียนรู้ PCA

pca = PCA()

ปรับขนาดและแปลงข้อมูลเพื่อรับส่วนประกอบหลัก

X_reduced = pca.fit_transform(scale(X))

ความแปรปรวน (% สะสม) อธิบายโดยองค์ประกอบหลัก

np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)

array([  73.39,   93.1 ,   98.63,   99.89,  100.  ])

ดูเหมือนว่าสององค์ประกอบแรกจะอธิบายความแปรปรวนส่วนใหญ่ในข้อมูลได้

CV 10 เท่าพร้อมสับเปลี่ยน

n = len(X_reduced)
kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

regr = LinearRegression()
mse = []

ทำ CV หนึ่งรายการเพื่อรับ MSE เพียงจุดตัด (ไม่มีองค์ประกอบหลักในการถดถอย)

score = -1*cross_validation.cross_val_score(regr, np.ones((n,1)), y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()    
mse.append(score)

ทำ CV สำหรับ 5 องค์ประกอบหลักโดยเพิ่มหนึ่งส่วนประกอบในการถดถอยในเวลานั้น

for i in np.arange(1,6):
    score = -1*cross_validation.cross_val_score(regr, X_reduced[:,:i], y.ravel(), cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(score)

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))
ax1.plot(mse, '-v')
ax2.plot([1,2,3,4,5], mse[1:6], '-v')
ax2.set_title('Intercept excluded from plot')

for ax in fig.axes:
    ax.set_xlabel('Number of principal components in regression')
    ax.set_ylabel('MSE')
    ax.set_xlim((-0.2,5.2))

Scikit เรียนรู้การถดถอย PLS

mse = []

kf_10 = cross_validation.KFold(n, n_folds=10, shuffle=True, random_state=2)

for i in np.arange(1, 6):
    pls = PLSRegression(n_components=i, scale=False)
    pls.fit(scale(X_reduced),y)
    score = cross_validation.cross_val_score(pls, X_reduced, y, cv=kf_10, scoring='mean_squared_error').mean()
    mse.append(-score)

plt.plot(np.arange(1, 6), np.array(mse), '-v')
plt.xlabel('Number of principal components in PLS regression')
plt.ylabel('MSE')
plt.xlim((-0.2, 5.2))

— จอร์ดี้
แหล่งที่มา

7

นี่คือ SVD ใน Python และ NumPy เท่านั้น (หลายปีต่อมา)
(สิ่งนี้ไม่ได้ตอบคำถามของคุณเกี่ยวกับ SSA / sklearn / pandas เลย แต่อาจช่วยนักหลามได้ในสักวันหนึ่ง)

#!/usr/bin/env python2
""" SVD straight up """
# geometry: see http://www.ams.org/samplings/feature-column/fcarc-svd

from __future__ import division
import sys
import numpy as np

__version__ = "2015-06-15 jun  denis-bz-py t-online de"

# from bz.etc import numpyutil as nu
def ints( x ):
    return np.round(x).astype(int)  # NaN Inf -> - maxint

def quantiles( x ):
    return "quantiles %s" % ints( np.percentile( x, [0, 25, 50, 75, 100] ))


#...........................................................................
csvin = "ccheaton-multicollinearity.csv"  # https://gist.github.com/ccheaton/8393329
plot = 0

    # to change these vars in sh or ipython, run this.py  csvin=\"...\"  plot=1  ...
for arg in sys.argv[1:]:
    exec( arg )

np.set_printoptions( threshold=10, edgeitems=10, linewidth=120,
    formatter = dict( float = lambda x: "%.2g" % x ))  # float arrays %.2g

#...........................................................................
yX = np.loadtxt( csvin, delimiter="," )
y = yX[:,0]
X = yX[:,1:]
print "read %s" % csvin
print "y %d  %s" % (len(y), quantiles(y))
print "X %s  %s" % (X.shape, quantiles(X))
print ""

#...........................................................................
U, sing, Vt = np.linalg.svd( X, full_matrices=False )
#...........................................................................

print "SVD: %s -> U %s . sing diagonal . Vt %s" % (
        X.shape, U.shape, Vt.shape )
print "singular values:", ints( sing )
    # % variance (sigma^2) explained != % sigma explained, e.g. 10 1 1 1 1

var = sing**2
var *= 100 / var.sum()
print "% variance ~ sing^2:", var

print "Vt, the right singular vectors  * 100:\n", ints( Vt * 100 )
    # multicollinear: near +- 100 in each row / col

yU = y.dot( U )
yU *= 100 / yU.sum()
print "y ~ these percentages of U, the left singular vectors:", yU

-> บันทึก

# from: test-pca.py
# run: 15 Jun 2015 16:45  in ~bz/py/etc/data/etc  Denis-iMac 10.8.3
# versions: numpy 1.9.2  scipy 0.15.1   python 2.7.6   mac 10.8.3

read ccheaton-multicollinearity.csv
y 373  quantiles [  2823  60336  96392 147324 928560]
X (373, 5)  quantiles [     7     47    247    573 512055]

SVD: (373, 5) -> U (373, 5) . sing diagonal . Vt (5, 5)
singular values: [2537297    4132    2462     592      87]
% variance ~ sing^2: [1e+02 0.00027 9.4e-05 5.4e-06 1.2e-07]
Vt, the right singular vectors  * 100:
[[  0   0 100   0   0]
 [  1  98   0 -12  17]
 [-10 -11   0 -99  -6]
 [  1 -17   0  -4  98]
 [-99   2   0  10   2]]
y ~ these percentages of U, the left singular vectors: [1e+02 15 -18 0.88 -0.57]

— เดนิส
แหล่งที่มา

ฉันไปสายไปงานเลี้ยง แต่คำตอบที่ดี

— plumbus_bouquet

3

ลองใช้ไปป์ไลน์เพื่อรวมการวิเคราะห์องค์ประกอบหลักและการถดถอยเชิงเส้น:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Principle components regression
steps = [
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('estimator', LinearRegression())
]
pipe = Pipeline(steps)
pca = pipe.set_params(pca__n_components=3)
pca.fit(X, y)

— โจ
แหล่งที่มา

3

คำตอบของฉันมาเกือบห้าปีแล้วและมีโอกาสที่คุณไม่ต้องการความช่วยเหลือเกี่ยวกับการทำ PCR ใน Python อีกต่อไป เราได้พัฒนาแพ็กเกจ Python ชื่อhoggormซึ่งสามารถทำสิ่งที่คุณต้องการได้อย่างแท้จริง กรุณาดูได้ที่ตัวอย่าง PCR ที่นี่ นอกจากนี้ยังมีแพ็คเกจการพล็อตเสริมชื่อhoggormplotสำหรับการแสดงภาพของผลลัพธ์ที่คำนวณด้วย hoggorm

— Oli
แหล่งที่มา