python - scikit-learn pipeline
Each sample in my (i.i.d.) dataset looks like this:

x = [a_1, a_2, ..., a_n, b_1, b_2, ..., b_m]

I have a label for each sample (this is supervised learning). The a features are sparse (namely, a bag-of-words representation), while the b features are dense (integers; there are ~45 of them).

I am using scikit-learn, and I want to use GridSearchCV with a Pipeline.

The question: is it possible to use one CountVectorizer on the features of type a and another CountVectorizer on the features of type b?

What I want can be thought of as:
pipeline = Pipeline([
    ('vect1', CountVectorizer()),  # will work on features [0, n-1]
    ('vect2', CountVectorizer()),  # will work on features [n, n+m-1]
    ('clf', SGDClassifier()),      # will use all features to classify
])

parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__n_iter': (10, 50, 80),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(x, y)
Is this possible?
A nice idea was presented by @Andreas Mueller. However, I want to keep the original, non-chosen features as well... Therefore, I cannot tell the column index for each phase of the pipeline upfront (before the pipeline begins).

For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.
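To illustrate the concern, here is a small self-contained demonstration (the toy documents are my own) that max_df prunes frequent terms, so the columns of the resulting matrix shift relative to an unpruned vectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: 'red' appears in all 3 documents, 'blue' and 'green' in 2 of 3
docs = ['red blue', 'red green', 'red blue green']

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(max_df=0.75).fit(docs)

# 'red' has document frequency 3/3 > 0.75, so it is dropped from the
# pruned vocabulary, and the remaining terms get different column indices
print(sorted(full.vocabulary_))    # ['blue', 'green', 'red']
print(sorted(pruned.vocabulary_))  # ['blue', 'green']
```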
Thanks
Unfortunately, it is not as nice as it could be. You need to use a FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them. One way is to make a pipeline of a transformer that selects the columns (you need to write it yourself) and a CountVectorizer. There is an example that does something similar here. That example separates the features as different values in a dictionary, but you don't need that. Also have a look at the related issue for selecting columns, which contains code for the transformer you need.
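A minimal sketch of such a column-selecting transformer (the class name FeatureSelector and the assumption that each sample is an indexable list/array are mine) could look like this:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from array-like samples."""

    def __init__(self, columns):
        self.columns = columns  # indices of the columns to keep

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        # keep only the requested columns of each sample
        return np.asarray(X)[:, self.columns]
```

Deriving from BaseEstimator gives you get_params/set_params for free, which is what lets GridSearchCV clone the transformer inside the pipeline.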
Your code would then look something like this:
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())
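For a fully runnable end-to-end illustration of the same idea, here is a toy version (data, step names like 'union', 'text', 'vec', and the FunctionTransformer-based selectors are all my own; modern scikit-learn imports assumed). Naming the steps explicitly, rather than using make_pipeline/make_union, makes the GridSearchCV parameter keys predictable:

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# toy data: each sample is [raw_text, dense_feature_1, dense_feature_2]
X = [['cheap pills now', 1, 0],
     ['meeting at noon', 0, 3],
     ['cheap pills', 1, 1],
     ['noon meeting', 0, 2]]
y = [1, 0, 1, 0]

# selectors: one pulls out the raw strings for the vectorizer,
# the other pulls out the dense numeric columns as a float array
get_text = FunctionTransformer(
    lambda X: [row[0] for row in X], validate=False)
get_nums = FunctionTransformer(
    lambda X: np.array([row[1:] for row in X], dtype=float), validate=False)

pipeline = Pipeline([
    ('union', FeatureUnion([
        ('text', Pipeline([('sel', get_text), ('vec', CountVectorizer())])),
        ('nums', get_nums),  # dense features pass through unchanged
    ])),
    ('clf', SGDClassifier()),
])

# nested parameter keys follow the step names: step__substep__param
parameters = {
    'union__text__vec__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1e-4, 1e-3],
}

grid = GridSearchCV(pipeline, parameters, cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

FeatureUnion takes care of stacking the sparse CountVectorizer output alongside the dense columns, so SGDClassifier sees one combined feature matrix.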