machine learning - Python: Creating Term Document Matrix from list


I want to train a Naive Bayes algorithm on some documents. The code below runs fine when the documents are strings. The issue is that my strings go through a series of pre-processing steps that do more than stopword removal, lemmatization, etc.; there is a custom conversion that returns a list of n-grams, where n can be 1, 2, or 3 depending on the context of the text. Since I now have a list of n-grams instead of a string representing each document, I am confused about how to pass it as input to CountVectorizer. Any suggestions?

The code works fine when docs is an array of documents of type string.

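    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import BernoulliNB

    # docs: list of document strings; op: the corresponding class labels
    count_vectorizer = CountVectorizer(binary=True)
    data = count_vectorizer.fit_transform(docs)
    tfidf_data = TfidfTransformer(use_idf=False).fit_transform(data)
    classifier = BernoulliNB().fit(tfidf_data, op)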

You should combine your pre-processing steps into preprocessor and maybe tokenizer functions; see section 4.2.3.10 and the CountVectorizer description in the scikit-learn docs. For examples of such tokenizers/transformers, see the related question or the source code of scikit-learn itself.
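If the n-gram lists are already computed, one alternative to the preprocessor/tokenizer route is to pass a callable as CountVectorizer's analyzer parameter, which bypasses the built-in preprocessing and tokenization. The following is only a minimal sketch under that assumption: the toy documents, the labels, and the helper name identity_analyzer are hypothetical and do not come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB


def identity_analyzer(ngrams):
    """Hypothetical helper: return the pre-computed n-gram list unchanged."""
    return ngrams


# Hypothetical toy data: each document is already a list of n-grams
# (unigrams and bigrams here), as produced by some custom pre-processing.
docs = [
    ["cheap", "pills", "cheap pills"],
    ["meeting", "tomorrow", "meeting tomorrow"],
    ["cheap", "meeting"],
]
labels = [1, 0, 0]  # hypothetical class labels

# A callable analyzer receives each raw document (here a list) and must
# return the sequence of features, so no string handling takes place.
vectorizer = CountVectorizer(analyzer=identity_analyzer, binary=True)
X = vectorizer.fit_transform(docs)

classifier = BernoulliNB().fit(X, labels)

# New documents must go through the same custom pre-processing first.
new_doc = ["cheap", "pills", "cheap pills"]
prediction = classifier.predict(vectorizer.transform([new_doc]))
```

A named function is used rather than a lambda so that the fitted vectorizer can still be pickled. Because the features are already binary, the TfidfTransformer step is left out of this sketch; it can be added back between the vectorizer and the classifier if normalization is wanted.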

