python - scikit-learn pipeline



Each sample in my (i.i.d.) dataset looks like this:
x = [a_1, a_2, ..., a_n, b_1, b_2, ..., b_m]

I also have the label of each sample (this is supervised learning).

The a features are very sparse (namely, a bag-of-words representation), while the b features are dense (integers; there are ~45 of those).

I am using scikit-learn, and I want to use GridSearchCV with a Pipeline.

The question: is it possible to use one CountVectorizer on the features of type a and another CountVectorizer on the features of type b?

What I want can be thought of as:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('vect1', CountVectorizer()),  # will work only on features [0, n-1]
        ('vect2', CountVectorizer()),  # will work only on features [n, n+m-1]
        ('clf', SGDClassifier()),      # will use all features to classify
    ])

    parameters = {
        'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
        'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
        'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
        'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
        'clf__alpha': (0.00001, 0.000001),
        'clf__penalty': ('l2', 'elasticnet'),
        'clf__n_iter': (10, 50, 80),  # called max_iter in newer scikit-learn releases
    }

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    grid_search.fit(X, y)

Is that possible?

A nice idea was presented by @Andreas Mueller. However, I want to keep the original non-chosen features as well. Therefore, I cannot tell the column index for each phase of the pipeline upfront (before the pipeline begins).

For example, if I set CountVectorizer(max_df=0.75), it may remove some terms, and the original column indices will change.
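The index shift described above can be seen on a tiny corpus. This is only a minimal sketch: the four documents are made up for illustration, and the term "apple" is deliberately placed in every document so that `max_df=0.75` prunes it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus: "apple" occurs in all four documents.
docs = [
    "apple banana",
    "apple cherry",
    "apple date",
    "apple banana cherry",
]

# max_df=1.0 keeps every term; max_df=0.75 drops terms that occur
# in more than 75% of the documents (here: "apple").
full = CountVectorizer(max_df=1.0).fit(docs)
pruned = CountVectorizer(max_df=0.75).fit(docs)

# vocabulary_ maps each kept term to its column index in the output matrix.
print(sorted(full.vocabulary_.items()))
print(sorted(pruned.vocabulary_.items()))

# The same term ends up in a different column once "apple" is removed.
print("banana column:", full.vocabulary_["banana"], "->", pruned.vocabulary_["banana"])
```

Because the surviving terms are renumbered, any column index computed before the vectorizer runs is no longer reliable afterwards.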

Thanks.

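Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them. One way to do that is to make a pipeline of a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features as different values in a dictionary, but you don't need to do that. Also have a look at the related issue for selecting columns, which contains code for the transformer that you need.

It would look something like this with the current code: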

    # FeatureSelector is the column-selecting transformer you write yourself.
    make_pipeline(
        make_union(
            make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
            make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
        SGDClassifier())
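To make the outline above concrete, here is a minimal runnable sketch of one possible column selector combined with FeatureUnion and GridSearchCV. Several things in it are assumptions made for illustration and are not part of the original answer: the `ColumnSelector` class and its `column` parameter are hypothetical, the step names (`union`, `a`, `b`, `select`, `vect`, `clf`) are arbitrary, the toy data is made up, and both feature groups are stored as raw-text columns because CountVectorizer expects text documents as input.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: returns one column of a 2-D array as a 1-D array."""

    def __init__(self, column=0):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn, the selector is stateless

    def transform(self, X):
        return np.asarray(X, dtype=object)[:, self.column]


# Made-up toy data: column 0 holds the type-a text, column 1 the type-b text.
X = np.array([
    ["sweet apple pie", "low low mid"],
    ["apple tart recipe", "low mid"],
    ["fresh pear juice", "low low"],
    ["pear apple salad", "mid low low"],
    ["fast red car", "high high mid"],
    ["car engine repair", "high mid"],
    ["truck engine oil", "high high"],
    ["red truck tires", "mid high high"],
], dtype=object)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each branch selects its own column and then vectorizes it; FeatureUnion
# concatenates the two resulting sparse matrices side by side.
pipeline = Pipeline([
    ("union", FeatureUnion([
        ("a", Pipeline([("select", ColumnSelector(column=0)),
                        ("vect", CountVectorizer())])),
        ("b", Pipeline([("select", ColumnSelector(column=1)),
                        ("vect", CountVectorizer())])),
    ])),
    ("clf", SGDClassifier(random_state=0)),
])

# Nested step names are joined with double underscores, so each
# vectorizer can be tuned independently of the other.
parameters = {
    "union__a__vect__max_df": (0.75, 1.0),       # type a features only
    "union__a__vect__ngram_range": ((1, 1), (1, 2)),
    "union__b__vect__max_df": (0.75, 1.0),       # type b features only
    "clf__alpha": (1e-4, 1e-3),
}

grid_search = GridSearchCV(pipeline, parameters, cv=2)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Since each branch selects from the original input rather than from the output of the other vectorizer, pruning terms in one branch does not disturb the column indices that the other branch relies on.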
