machine learning - R -- Can I apply the train function in caret to a list of data frames?
I am using the excellent R package, caret, and I'd like to run the train function on a list of multiple training data sets. Now, I realize that the documentation for the train function says that the data argument must be a data frame, so what I'm trying to do may not be possible and might be better as a suggested enhancement to caret, but I wanted to see if anyone had tried this.
Using the Sonar data for illustrative purposes, I've created a list (named both) consisting of 2 data frames, each of which is a separate training dataset. I'm using mapply to apply the train function to each element in the list. Unfortunately, I'm getting some scary-looking results. Specifically, I was hoping that the metrics in pls1.3..a[[2]] would be identical to the metrics in pls1.3..b2. As you can see, they are not. Oddly, pls1.3..a[[1]] does match pls1.3..b1. Is there something obvious that I'm doing wrong, or is this just not possible (for now)? (I'm running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)
Reproducible code (with output commented out) follows:
    require(doMC)
    registerDoMC(cores = 2)
    library(caret)
    library(mlbench)
    data(Sonar)

    set.seed(1234)
    inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
    training  <- Sonar[ inTrain, ]
    training2 <- Sonar[-inTrain, ]
    both <- list(training, training2)
    # both_test <- list(training[c(1:100), ], training2[c(1:35), ])  # silly test data for functionality testing

    set.seed(1234)
    labels <- list()
    for (i in 1:length(both)) {
      labels[i] <- list(both[[i]]$Class)
    }

    # New code, added based on @Josh W's comment: remove the label (Class)
    # variable from the feature matrix
    both <- lapply(both, function(x) {
      subset(x[, c(1:60)])
    })

    # New code: changed from the formula interface of caret to the
    # x (feature matrix), y (label/outcome vector) interface
    pls1.3..a <- mapply(function(x, y) train(x, y, method = "pls",
                                             preProc = c("center", "scale")),
                        x = both, y = labels, SIMPLIFY = FALSE)
    pls1.3..a

    # [[1]]
    # Partial Least Squares
    #
    # 157 samples
    #  60 predictors
    #   2 classes: 'M', 'R'
    #
    # Pre-processing: centered, scaled
    # Resampling: Bootstrapped (25 reps)
    # Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
    # Resampling results across tuning parameters:
    #
    #   ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #   1      0.6889679  0.3756821  0.06015197   0.11605511
    #   2      0.7393776  0.4742204  0.04962609   0.09775688
    #   3      0.7410997  0.4793703  0.04856698   0.09412599
    #
    # Accuracy was used to select the optimal model using the largest value.
    # The final value used for the model was ncomp = 3.
    #
    # [[2]]
    # Partial Least Squares
    #
    #  51 samples
    #  60 predictors
    #   2 classes: 'M', 'R'
    #
    # Pre-processing: centered, scaled
    # Resampling: Bootstrapped (25 reps)
    # Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
    # Resampling results across tuning parameters:
    #
    #   ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #   1      0.6452693  0.2929118  0.08076455   0.1525176
    #   2      0.6468405  0.2902136  0.09686340   0.1790924
    #   3      0.6559113  0.3087227  0.08025215   0.1547317
    #
    # Accuracy was used to select the optimal model using the largest value.
    # The final value used for the model was ncomp = 3.
    set.seed(1234)
    pls1.3..b1 <- train(both[[1]], labels[[1]], method = "pls",
                        preProc = c("center", "scale"))
    pls1.3..b1

    # Partial Least Squares
    #
    # 157 samples
    #  60 predictors
    #   2 classes: 'M', 'R'
    #
    # Pre-processing: centered, scaled
    # Resampling: Bootstrapped (25 reps)
    # Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...
    # Resampling results across tuning parameters:
    #
    #   ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #   1      0.6889679  0.3756821  0.06015197   0.11605511
    #   2      0.7393776  0.4742204  0.04962609   0.09775688
    #   3      0.7410997  0.4793703  0.04856698   0.09412599
    #
    # Accuracy was used to select the optimal model using the largest value.
    # The final value used for the model was ncomp = 3.

    set.seed(1234)
    pls1.3..b2 <- train(both[[2]], labels[[2]], method = "pls",
                        preProc = c("center", "scale"))
    pls1.3..b2

    # Partial Least Squares
    #
    #  51 samples
    #  60 predictors
    #   2 classes: 'M', 'R'
    #
    # Pre-processing: centered, scaled
    # Resampling: Bootstrapped (25 reps)
    # Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...
    # Resampling results across tuning parameters:
    #
    #   ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #   1      0.6127279  0.2518488  0.11925682   0.1959400
    #   2      0.6792163  0.3618657  0.09386771   0.1776549
    #   3      0.6673662  0.3343716  0.07524373   0.1476405
    #
    # Accuracy was used to select the optimal model using the largest value.
    # The final value used for the model was ncomp = 2.
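One difference between the two approaches that may be worth checking: the standalone calls each run right after their own set.seed(1234), while the mapply call is seeded only once, so its second model draws bootstrap resamples from an already advanced random number stream. The base R sketch below illustrates that effect without caret. The helper draw_resample is a hypothetical stand-in for the resampling step, not part of caret, and whether this fully accounts for the mismatch in the question is an assumption.

```r
# Hypothetical stand-in for a resampling step: draw one bootstrap index of size n.
draw_resample <- function(n) sample.int(n, size = n, replace = TRUE)

# Sample sizes taken from the question's two training sets.
sizes <- list(157, 51)

# Seed set once before the loop: the second call starts from an
# RNG state that the first call has already advanced.
set.seed(1234)
once <- lapply(sizes, draw_resample)

# Seed reset inside the function: every call starts from the same state.
per_call <- lapply(sizes, function(n) {
  set.seed(1234)
  draw_resample(n)
})

# Standalone draws, each seeded separately (like pls1.3..b1 and pls1.3..b2).
set.seed(1234)
alone1 <- draw_resample(157)
set.seed(1234)
alone2 <- draw_resample(51)

print(identical(once[[1]], alone1))      # first element agrees with the standalone draw
print(identical(once[[2]], alone2))      # second element started from a different RNG state
print(identical(per_call[[2]], alone2))  # reseeding per call restores agreement
```

Applied to the question's code, this would mean calling set.seed(1234) inside the anonymous function passed to mapply, just before train().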
If you use the following, you'll get the result (close to what) you expect:
    set.seed(1234)
    pls1.3..b <- train(labels[[2]] ~ ., data = both[[2]], method = "pls",
                       preProc = c("center", "scale"))
    pls1.3..b

I believe it's because of the way you have your formula specified. Object ~ . has the formula use everything in data that's not the column Object. As you specified it in the mapply call, it's basically External Object ~ Entire data.frame, including the class labels. I believe it's training on the response variable in your dataset.
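The point about how the dot in a formula expands can be checked in base R with terms(). This is a minimal sketch on a made-up three-column data frame; the names df and ext_labels are hypothetical. It only shows which columns the formula would treat as predictors and does not fit any model.

```r
# Toy data frame (hypothetical): two predictors plus the label column.
df <- data.frame(
  V1 = c(0.1, 0.4, 0.3, 0.9),
  V2 = c(1.2, 0.7, 0.5, 0.2),
  Class = factor(c("M", "R", "M", "R"))
)

# Response is a column of the data: "." expands to the other columns only.
inside <- attr(terms(Class ~ ., data = df), "term.labels")

# Response is an external object: "." expands to every column of the data,
# so the label column is kept as a predictor.
ext_labels <- df$Class
outside <- attr(terms(ext_labels ~ ., data = df), "term.labels")

print(inside)
print(outside)
```

Whether this applies depends on whether the Class column is still present in the data frame passed as data. In the question's revised code, the label column is dropped from both before train() is called.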