Is there a way to use cross-validation to do variable/feature selection in R?

10

I have a data set with about 70 variables that I'd like to cut down. What I'm looking to do is use CV to find the most useful variables in the following fashion.

1) Randomly select, say, 20 variables.

2) Use stepwise/LASSO/lars/etc. to choose the most important variables.

3) Repeat ~50x and see which variables are selected (not eliminated) most frequently.

This is along the lines of what a randomForest would do, but the rfVarSel package seems to work only for factors/classification, and I need to predict a continuous dependent variable.

I am using R, so any suggestions would ideally be implemented there.

— screechOwl

Do all the features matter? How many samples do you have? If I understand the problem correctly, you could try some variant of boosting: repeatedly select a subset of the samples, fit all the variables to them, and see which variables pop up most often.

— Ofelia

1

I think your procedure is unlikely to improve on LASSO, whose implementations in R (e.g. glmnet and penalized) by default use cross-validation to find the "optimal" regularization parameter. One thing you could consider is repeating the LASSO search for this parameter several times, to cope with the potentially large variance of cross-validation (repeated CV). Of course, no algorithm can beat your subject-specific prior knowledge.
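
The repeated-CV idea could be sketched roughly as follows. This is only a minimal sketch on simulated data, assuming the glmnet package is installed. The simulated data, the number of repeats (`n_repeats`), and the choice to summarise the repeats by the median of `lambda.min` are illustrative assumptions, not something the comment prescribes.

```r
library(glmnet)

set.seed(1)
# Simulated data (an assumption for illustration): 100 rows, 70 predictors,
# of which only V1 and V2 actually influence the continuous response.
n <- 100; p <- 70
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * x[, "V1"] - 1.5 * x[, "V2"] + rnorm(n)

# Repeat 10-fold CV several times. Each run draws new random folds,
# so lambda.min varies from run to run.
n_repeats <- 20
lambda_min <- replicate(
  n_repeats,
  cv.glmnet(x, y, alpha = 1, nfolds = 10)$lambda.min
)

# Summarise the repeats (here: the median) and read off the nonzero
# coefficients of the LASSO path at that penalty.
lambda_star <- median(lambda_min)
lasso_fit <- glmnet(x, y, alpha = 1)
coefs <- as.matrix(coef(lasso_fit, s = lambda_star))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
selected
```

Looking at the spread of `lambda_min` across repeats gives a feel for how much a single CV run can be trusted.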

— miura

9

I believe what you describe is already implemented in the caret package. Look at the rfe function or the vignette here: http://cran.r-project.org/web/packages/caret/vignettes/caretSelection.pdf

Now, having said that, why do you need to reduce the number of features? Going from 70 to 20 isn't really an order-of-magnitude decrease. I would think you'd need more than 70 features before you would have a firm prior belief that some of the features really and truly don't matter. But then again, that is where a subjective prior comes in, I suppose.
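
A call to rfe for a continuous response might look something like the sketch below. It assumes the caret package is installed and uses simulated data. The candidate subset sizes and the use of `lmFuncs` (caret's ready-made helper functions for linear regression) are illustrative choices, so check the vignette for the options that fit your problem.

```r
library(caret)

set.seed(1)
# Simulated data (an assumption for illustration): 10 predictors,
# of which only V1 and V2 influence the continuous response.
n <- 100; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), nrow = n))
y <- 2 * x$V1 - 1.5 * x$V2 + rnorm(n)

# Recursive feature elimination with linear models, where each candidate
# subset size is assessed by 10-fold cross-validation.
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 10)
profile <- rfe(x, y, sizes = c(2, 4, 6, 8), rfeControl = ctrl)

# Names of the predictors retained in the best-performing subset
kept <- predictors(profile)
kept
```

Printing `profile` shows the resampled performance for each subset size, which is usually more informative than the final list alone.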

— Shea Parkes

5

There is no reason why variable selection frequency provides any information that you don't already get from the apparent importance of the variables in the initial model. This is essentially a replay of initial statistical significance. You are also adding a new level of arbitrariness when trying to decide on a cutoff for selection frequency. Resampling-based variable selection is badly damaged by collinearity, in addition to the other problems.
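
One way to explore the collinearity point is a small simulation. The sketch below is only illustrative and rests on several assumptions: simulated data in which x2 is a noisy copy of x1, bootstrap resampling, and backward selection by AIC with step() from base R. The variable names and settings are made up for the example.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # unrelated to the response
y  <- x1 + rnorm(n)             # only x1 enters the true model
dat <- data.frame(y, x1, x2, x3)

# For each bootstrap resample, run stepwise selection by AIC and
# record which predictors survive.
n_boot <- 200
picked <- replicate(n_boot, {
  boot <- dat[sample(n, replace = TRUE), ]
  full <- lm(y ~ x1 + x2 + x3, data = boot)
  kept <- attr(terms(step(full, trace = 0)), "term.labels")
  c(x1 = "x1" %in% kept, x2 = "x2" %in% kept, x3 = "x3" %in% kept)
})

# Proportion of resamples in which each predictor was selected
freq <- rowMeans(picked)
freq
```

Comparing the frequencies for x1 and x2 with and without the correlation (for example by raising `sd` in the definition of x2) shows how much the selection frequency of a real predictor can depend on what else is in the candidate set.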

— Frank Harrell

2

I have revised my answer from earlier today. I have now generated some example data on which to run the code. Others have rightly suggested that you look into using the caret package, which I agree with. In some instances, however, you may find it necessary to write your own code. Below, I have attempted to demonstrate how to use the sample() function in R to randomly assign observations to cross-validation folds. I also use for loops to perform variable pre-selection (using univariate linear regression with a lenient p value cutoff of 0.1) and model building (using stepwise regression) on the ten training sets. You can then write your own code to apply the resulting models to the validation folds. Hope this helps!

################################################################################
## Load the MASS library, which contains the "stepAIC" function for performing
## stepwise regression, to be used later in this script
library(MASS)
################################################################################


################################################################################
## Generate example data, with 100 observations (rows), 70 variables (columns 1
## to 70), and a continuous dependent variable (column 71)
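## The data are generated in one vectorized call rather than cell by cell. The
## columns are automatically named V1 to V71. set.seed() makes the example
## reproducible.
set.seed(1)
Data <- as.data.frame(matrix(rnorm(100 * 71), nrow = 100))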

names(Data)[71] <- "Dependent"
################################################################################


################################################################################
## Create ten folds for cross-validation. Each observation in your data will
## randomly be assigned to one of ten folds.
Data$Fold <- sample(c(rep(1:10,10)))

## Each fold will have the same number of observations assigned to it. You can
## double check this by typing the following:
table(Data$Fold)

## Note: If you were to have 105 observations instead of 100, you could instead
## write: Data$Fold <- sample(c(rep(1:10,10),rep(1:5,1)))
################################################################################


################################################################################
## I like to use a "for loop" for cross-validation. Here, prior to beginning my
## "for loop", I will define the variables I plan to use in it. You have to do
## this first or R will give you an error code.
fit <- list()
stepw <- list()
training <- list()
testing <- list()
Preselection <- list()
Selected <- list()
variables <- list()
################################################################################


################################################################################
## Now we can begin the ten-fold cross validation. First, we open the "for loop"
for (CV in 1:10) {

## Now we define your training and testing folds. I like to store these data in
## a list, so at the end of the script, if I want to, I can go back and look at
## the observations in each individual fold
training[[CV]] <- Data[which(Data$Fold != CV),]
testing[[CV]]  <- Data[which(Data$Fold == CV),]

## We can preselect variables by analyzing each variable separately using
## univariate linear regression and then ranking them by p value. First we will
## define the container object to which we plan to output these data.
Preselection[[CV]] <- data.frame(Variable = 1:70, pValue = NA_real_)

## Now we will run a separate linear regression for each of our 70 variables.
## We will store the variable name and the coefficient p value in our object
## called "Preselection".
for (i in 1:70) {
Preselection[[CV]][i,1]  <- i
Preselection[[CV]][i,2]  <- summary(lm(Dependent ~ training[[CV]][,i] , data = training[[CV]]))$coefficients[2,4]
}

## Now we will remove "i" and also we will name the columns of our new object.
rm(i)
names(Preselection[[CV]]) <- c("Variable", "pValue")

## Now we will make note of those variables whose p values were less than 0.1.
Selected[[CV]] <- Preselection[[CV]][which(Preselection[[CV]]$pValue <= 0.1),] ; row.names(Selected[[CV]]) <- NULL

## Fit a model using the pre-selected variables to the training fold
## First we must save the variable names as a character string. The columns are
## referred to by name (V1, V2, ...) rather than as training[[CV]]$V1, so that
## the fitted model can later be applied to new data with predict(). This
## assumes that at least one variable passed the p value cutoff.
variables[[CV]] <- paste(paste0("V", Selected[[CV]]$Variable), collapse = " + ")

## Now we can use this string as the independent variables list in our model
form <- as.formula(paste("Dependent ~", variables[[CV]]))

## We can build a model using all of the pre-selected variables
fit[[CV]] <- lm(form, data = training[[CV]])

## Then we can build new models using stepwise removal of these variables using
## the MASS package
stepw[[CV]] <- stepAIC(fit[[CV]], direction="both")

## End for loop
}

## Now you have your ten training and validation sets saved as training[[CV]]
## and testing[[CV]]. You also have results from your univariate pre-selection
## analyses saved as Preselection[[CV]]. Those variables that had p values less
## than 0.1 are saved in Selected[[CV]]. Models built using these variables are
## saved in fit[[CV]]. Reduced versions of these models (by stepwise selection)
## are saved in stepw[[CV]].

## Now you might consider using the predict.lm function from the stats package
## to apply your ten models to their corresponding validation folds. You then
## could look at the performance of the ten models and average their performance
## statistics together to get an overall idea of how well your data predict the
## outcome.
################################################################################
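
As a rough sketch of that last step, the snippet below fits a stepwise model on each training fold, applies it to the held-out fold with predict(), and averages the root mean squared error. It is deliberately simplified and self-contained: it assumes simulated data with only five candidate predictors, skips the univariate pre-selection, and uses RMSE as the performance statistic, all of which are illustrative choices.

```r
library(MASS)

set.seed(1)
# Simulated data (an assumption for illustration): columns V1 to V5 are
# predictors and the sixth column is the continuous outcome.
Data <- as.data.frame(matrix(rnorm(100 * 6), nrow = 100))
names(Data)[6] <- "Dependent"
Data$Fold <- sample(rep(1:10, 10))

rmse <- numeric(10)
for (CV in 1:10) {
  train <- Data[Data$Fold != CV, ]
  test  <- Data[Data$Fold == CV, ]

  # All model building, including the stepwise selection, uses the
  # training fold only.
  full  <- lm(Dependent ~ V1 + V2 + V3 + V4 + V5, data = train)
  model <- stepAIC(full, direction = "both", trace = FALSE)

  # Apply the model to the held-out fold and record its error
  pred <- predict(model, newdata = test)
  rmse[CV] <- sqrt(mean((test$Dependent - pred)^2))
}

# Average performance across the ten folds
mean(rmse)
```

Because the selection is repeated inside every fold, the averaged error reflects the whole modelling procedure rather than one particular set of chosen variables.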

Before performing cross-validation, it is important that you read about its proper use. These two references offer excellent discussions of cross-validation:

Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May;12(3):203-14. Epub 2011 Feb 15. http://bib.oxfordjournals.org/content/12/3/203.long
Richard Simon, Michael D. Radmacher, Kevin Dobbin, and Lisa M. McShane. Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification. JNCI J Natl Cancer Inst (2003) 95 (1): 14-18. http://jnci.oxfordjournals.org/content/95/1/14.long

These papers are geared toward biostatisticians, but they would be useful for anyone.

Also, always keep in mind that using stepwise regression is dangerous (although using cross-validation should help alleviate overfitting). A good discussion of stepwise regression is available here: http://www.stata.com/support/faqs/stat/stepwise.html

Let me know if you have any additional questions!

— Alexander

0

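I just found something nice here: http://cran.r-project.org/web/packages/Causata/vignettes/Causata-vignette.pdf

Try this solution when using the glmnet package:

# extract nonzero coefficients
# (cv.glmnet.obj is assumed to be the object returned by an earlier cv.glmnet() call)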
coefs.all <- as.matrix(coef(cv.glmnet.obj, s="lambda.min"))
idx <- as.vector(abs(coefs.all) > 0)
coefs.nonzero <- as.matrix(coefs.all[idx])
rownames(coefs.nonzero) <- rownames(coefs.all)[idx]

— Simon Nehls