machine learning - R -- Can I apply the train function in caret to a list of data frames?


I am using the excellent R package, caret, and I'd like to run the train function on a list of multiple training data sets. Now, I realize that the documentation for the train function says the data argument must be a data frame, so what I'm trying to do may not be possible and might be better suggested as an enhancement to caret, but I wanted to see if anyone had tried this.

Using the Sonar data for illustrative purposes, I've created a list (named both) consisting of 2 data frames, each of which is a separate training dataset. I'm using mapply to apply the train function to each element in the list. Unfortunately, I'm getting scary-looking results. Specifically, I was hoping the metrics in pls1.3..a[[2]] would be identical to the metrics in pls1.3..b2. As you can see, they are not. Oddly, pls1.3..a[[1]] does match pls1.3..b1. Is there something obvious I'm doing wrong, or might this not be possible (for now)? (I'm running R 3.1.1 on a 1.4 GHz Intel Core i5 Mac.)

Reproducible code (with the output commented out) follows:

    require(doMC)
    registerDoMC(cores = 2)

    library(caret)
    library(mlbench)
    data(Sonar)
    set.seed(1234)

    inTrain <- createDataPartition(y = Sonar$Class,
                                   p = .75,
                                   list = FALSE)

    training <- Sonar[ inTrain,]
    training2  <- Sonar[-inTrain,]

    both <- list(training, training2)
    #both_test <- list(training[c(1:100),], training2[c(1:35),]) #Silly test data for functionality testing

    set.seed(1234)

    labels <- list()
    for(i in 1:length(both)) {
        labels[i] <- list(both[[i]]$Class)
        }

    #New code -- added based on @Josh W's comment -- removing the label (Class) variable from the feature matrix
    both <- lapply(both, function(x) {
        subset(x[,c(1:60)])
        })

    #New code -- changed from using the formula implementation of caret to x (feature matrix), y (label/outcome vector)
    pls1.3..a <- mapply(function(x,y) train(x, y, method = "pls", preProc = c("center", "scale")), x = both, y = labels, SIMPLIFY = FALSE)

    pls1.3..a

    #[[1]]
    #Partial Least Squares

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R'

    #Pre-processing: centered, scaled
    #Resampling: Bootstrapped (25 reps)

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3.

    #[[2]]
    #Partial Least Squares

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R'

    #Pre-processing: centered, scaled
    #Resampling: Bootstrapped (25 reps)

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #  1      0.6452693  0.2929118  0.08076455   0.1525176
    #  2      0.6468405  0.2902136  0.09686340   0.1790924
    #  3      0.6559113  0.3087227  0.08025215   0.1547317

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3.

    set.seed(1234)
    pls1.3..b1 <- train(both[[1]],
                        labels[[1]],
                        method = "pls",
                        preProc = c("center", "scale"))
    pls1.3..b1

    #Partial Least Squares

    #157 samples
    # 60 predictor
    #  2 classes: 'M', 'R'

    #Pre-processing: centered, scaled
    #Resampling: Bootstrapped (25 reps)

    #Summary of sample sizes: 157, 157, 157, 157, 157, 157, ...

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #  1      0.6889679  0.3756821  0.06015197   0.11605511
    #  2      0.7393776  0.4742204  0.04962609   0.09775688
    #  3      0.7410997  0.4793703  0.04856698   0.09412599

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 3.

    set.seed(1234)
    pls1.3..b2 <- train(both[[2]],
                        labels[[2]],
                        method = "pls",
                        preProc = c("center", "scale"))
    pls1.3..b2

    #Partial Least Squares

    #51 samples
    #60 predictors
    # 2 classes: 'M', 'R'

    #Pre-processing: centered, scaled
    #Resampling: Bootstrapped (25 reps)

    #Summary of sample sizes: 51, 51, 51, 51, 51, 51, ...

    #Resampling results across tuning parameters:

    #  ncomp  Accuracy   Kappa      Accuracy SD  Kappa SD
    #  1      0.6127279  0.2518488  0.11925682   0.1959400
    #  2      0.6792163  0.3618657  0.09386771   0.1776549
    #  3      0.6673662  0.3343716  0.07524373   0.1476405

    #Accuracy was used to select the optimal model using  the largest value.
    #The final value used for the model was ncomp = 2.

If you use the following, you'll get the result (close to what) you expect:

    set.seed(1234)
    pls1.3..b <- train(labels[[2]] ~ .,
                       data = both[[2]],
                       method = "pls",
                       preProc = c("center", "scale"))
    pls1.3..b
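One detail in the snippet above may matter independently of the formula: it calls set.seed(1234) immediately before train, whereas the mapply version in the question sets the seed once, before the whole loop. That would leave the first list element starting from the seeded state (consistent with pls1.3..a[[1]] matching pls1.3..b1) and the second starting from whatever state the first call left behind. This is an inference from the posted code, not something confirmed in the thread. The sketch below uses a hypothetical stand-in, draw(), in place of train so that it runs in base R:

```r
# Hypothetical stand-in for train(): any function that consumes random numbers.
draw <- function(n) sample(1000, n)

# Two "data sets" to iterate over; only their sizes matter for this sketch.
inputs <- list(10, 10)

# Seed set once before mapply: the second element continues from the
# RNG state left behind by the first element.
set.seed(1234)
once <- mapply(draw, inputs, SIMPLIFY = FALSE)

# Seed reset inside the function: every element starts from the same state.
reseeded <- mapply(function(n) {
  set.seed(1234)
  draw(n)
}, inputs, SIMPLIFY = FALSE)

# Standalone reference call, seeded the same way as pls1.3..b2 in the question.
set.seed(1234)
standalone <- draw(10)

print(identical(once[[1]], standalone))      # first element matches
print(identical(once[[2]], standalone))      # second element does not
print(identical(reseeded[[2]], standalone))  # matches once the seed is reset
```

In the question's code, the equivalent change would be to call set.seed(1234) inside the function passed to mapply, just before train(x, y, ...).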

I believe the discrepancy comes from the way you have the formula specified. object ~ . has the formula use everything in data that's not the column object. As you have it specified in your mapply call, it's basically external object ~ entire data.frame, including the Class labels. I believe it's training on the response variable in the dataset.
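The point about the dot can be checked without caret. The sketch below is only an illustration under stated assumptions: toy is a made-up data frame and ext_labels a hypothetical external label vector standing in for labels[[i]]. It uses terms() from base R to show which columns `.` expands to in each case:

```r
# Toy data frame (hypothetical): two predictors plus a Class column.
toy <- data.frame(
  V1 = c(0.1, 0.4, 0.3, 0.9),
  V2 = c(1.2, 0.7, 0.5, 0.2),
  Class = factor(c("M", "R", "M", "R"))
)

# Labels held outside the data frame, like labels[[i]] in the question.
ext_labels <- toy$Class

# Response is an external object: '.' expands to every column, Class included.
leaky <- attr(terms(ext_labels ~ ., data = toy), "term.labels")

# Response is a column of the data: '.' leaves it out of the predictors.
by_column <- attr(terms(Class ~ ., data = toy), "term.labels")

# External response again, but with Class dropped from the data first.
dropped <- attr(terms(ext_labels ~ ., data = toy[, c("V1", "V2")]),
                "term.labels")

print(leaky)      # "V1" "V2" "Class"
print(by_column)  # "V1" "V2"
print(dropped)    # "V1" "V2"
```

When the response is an object outside the data frame, the label column has to be removed from the data (as the lapply step in the question now does) or it ends up among the predictors.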

