Training Data Set in NLTK Python


I am working on tagging with Python NLTK, and my input text is not Hindi. In order to tokenize my input text, the tagger must first be trained.

My question is: how do I train on the data?

I have this line of code, which was suggested to me here on Stack Overflow.

from nltk.corpus import indian
train_data = indian.tagged_sents('hindi.pos')

How would I do this for non-Hindi input data?

The short answer is: training a tagger requires a tagged corpus.

Assigning part-of-speech tags must be done according to some existing model. Unfortunately, unlike problems such as finding sentence boundaries, there is no way to choose the tags out of thin air. There are some experimental approaches that try to assign parts of speech using parallel texts and machine-translation alignment algorithms, but all real POS taggers must be trained on text that has already been tagged.
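To make "trained on text that has already been tagged" concrete, here is a minimal sketch of training a tagger on hand-tagged sentences. It assumes NLTK is installed and uses `UnigramTagger` with a `DefaultTagger` backoff as one possible choice, not the only one. The words and tag names are made-up placeholders standing in for your language; the data has the same shape that `indian.tagged_sents('hindi.pos')` returns, a list of sentences where each sentence is a list of `(word, tag)` pairs.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical hand-tagged sentences (placeholder words and tags).
# Each sentence is a list of (word, tag) tuples.
train_data = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Words never seen in training fall back to a default tag (assumed: NOUN).
backoff = DefaultTagger("NOUN")

# Learn the most frequent tag for each word in the training data.
tagger = UnigramTagger(train_data, backoff=backoff)

# Tag a new, already tokenized sentence.
tagged = tagger.tag(["a", "cat", "barks", "loudly"])
print(tagged)
```

With only three training sentences this is a toy; the point is that the tagger has nothing to learn from unless the training sentences already carry tags.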

Evidently you don't have a tagged corpus for your unnamed language, so you'll need to find or create one if you want to build a tagger. Creating a tagged corpus is a major undertaking, since you'll need a lot of training material to get any sort of decent performance. There may be ways to "bootstrap" a tagged corpus (put together a poor-quality tagger that makes it easier to retag the results by hand), but all of that depends on your situation.
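The bootstrapping idea might look roughly like the sketch below. It assumes NLTK is installed. The seed sentences, tag names, and the `review` helper are all hypothetical; `review` only simulates the manual correction step with a small dictionary of fixes, where in practice a person would check each draft sentence.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical seed corpus: a few sentences tagged by hand.
seed = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]

# Untagged, already tokenized sentences we would like to add.
untagged = [["the", "cat", "sleeps"]]


def train(tagged_sents):
    """Train a rough tagger; unknown words default to NOUN (an assumption)."""
    return UnigramTagger(tagged_sents, backoff=DefaultTagger("NOUN"))


def review(sentence, fixes):
    """Stand-in for manual correction: override tags for the listed words."""
    return [(word, fixes.get(word, tag)) for word, tag in sentence]


# 1. Train a poor-quality tagger on the seed data.
rough = train(seed)

# 2. Let it pre-tag new text so a human only has to fix the mistakes.
draft = [rough.tag(sent) for sent in untagged]

# 3. Correct the draft by hand (simulated here), then retrain on everything.
corrected = [review(sent, {"sleeps": "VERB"}) for sent in draft]
better = train(seed + corrected)

print(better.tag(["the", "dog", "sleeps"]))
```

Each round grows the tagged corpus, so later drafts should need fewer corrections. How well this works depends on the language and on how much seed data you start with.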

