machine learning - Python: Creating Term Document Matrix from list -


I want to train a Naive Bayes algorithm on documents, and the code below runs fine when the documents are plain strings. The issue is that my strings go through a series of pre-processing steps (stopword removal, lemmatization, etc.), and rather than returning a string, my custom conversion returns a list of n-grams, where n can be [1, 2, 3] depending on the context of the text. Since I have a list of n-grams instead of a string representing each document, I am confused about how to feed this input to CountVectorizer. Any suggestions?

The code below works fine when docs is an array of strings.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

count_vectorizer = CountVectorizer(binary=True)
data = count_vectorizer.fit_transform(docs)

tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
classifier = BernoulliNB().fit(tfidf_data, op)

You should combine your pre-processing steps into preprocessor and possibly tokenizer functions; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the source code of scikit-learn itself.
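Alternatively, since the documents are already converted into lists of n-grams, CountVectorizer can be told to skip its own preprocessing and tokenization entirely by passing a callable as the analyzer. A callable analyzer receives each raw document and must return the sequence of features, so an identity function passes the pre-built n-gram lists straight through. A minimal sketch (the example documents and the "good movie" bigram are illustrative, not from the original post):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each document is already a list of n-grams produced by the
# custom pre-processing step (unigrams and bigrams mixed).
docs = [
    ["good", "movie", "good movie"],
    ["bad", "movie"],
]

# analyzer=<callable>: CountVectorizer applies this function to each
# raw document and uses its return value as the feature sequence,
# bypassing the built-in preprocessor/tokenizer/ngram logic.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
X = vectorizer.fit_transform(docs)

# Multi-word n-grams survive as single features because no
# re-tokenization happens.
print(sorted(vectorizer.vocabulary_))
```

The same `X` can then be fed into the TfidfTransformer/BernoulliNB pipeline from the question unchanged. Note that a model built this way must receive pre-tokenized lists at prediction time as well, since the analyzer is part of the fitted vectorizer.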

