Unicode Tagging in Python NLTK


I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. After tokenizing the text and running pos_tag on it, the output contains the NN tag only. With an English sentence as input, the tagging is correct. Kindly help. Versions: Python 3.4.1, NLTK 3.0.


Here is what I tried:

word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."

from nltk.corpus import indian
train_data = indian.tagged_sents('hindi.pos')[:300]
test_data = indian.tagged_sents('hindi.pos')[301:]
print(word_to_be_tagged)
print(train_data)

The output is different from what I expected:

ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
[[('पूर्ण', 'JJ'), ('प्रतिबंध', 'NN'), ('हटाओ', 'VFM'), (':', 'SYM'), ('इराक', 'NNP')], [('संयुक्त', 'NNC'), ('राष्ट्र', 'NN'), ('।', 'SYM')], ...]

The problem is that you should use a Hindi POS tagger:

import nltk
from nltk.corpus import indian
from nltk.tag import tnt
# Train a TnT part-of-speech tagger on the Hindi data
train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)
# Tokenize the sentence, then tag it with the trained tagger
print(tnt_pos_tagger.tag(nltk.word_tokenize(word_to_be_tagged)))

The underlying issue is that a part-of-speech tagger is accurate only in a specific domain (mostly a combination of language and topic). In English, most of the words the tagger has not seen yet are nouns (NN), so it tags all of your data as NN only.

If you train it on the same domain that you want to tag afterwards (Hindi), you should be OK.
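To make the domain point concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: `train_lookup_tagger` and `tag` are hypothetical helpers that build a most-frequent-tag lookup and fall back to NN for unseen words, which is an assumption chosen to mimic the behaviour described above. The two tiny training sets are made up for illustration.

```python
from collections import Counter, defaultdict


def train_lookup_tagger(tagged_sents):
    """Map each training word to the tag it carried most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_label in sent:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, tokens, default="NN"):
    """Tag tokens, falling back to `default` for words never seen in training."""
    return [(tok, model.get(tok, default)) for tok in tokens]


# Tiny hand-made training sets (hypothetical, for illustration only).
english_train = [[("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
hindi_train = [[("पूर्ण", "JJ"), ("प्रतिबंध", "NN"), ("हटाओ", "VFM")]]

tokens = ["पूर्ण", "प्रतिबंध", "हटाओ"]

english_model = train_lookup_tagger(english_train)
hindi_model = train_lookup_tagger(hindi_train)

# Out of domain: none of the Hindi tokens were seen, so all fall back to NN.
out_of_domain = tag(english_model, tokens)
# In domain: the same tokens get the tags learned from the Hindi data.
in_domain = tag(hindi_model, tokens)

print(out_of_domain)
print(in_domain)
```

A real tagger such as TnT uses context and smoothing rather than a plain lookup, but the failure mode is the same: words outside the training domain get a generic fallback tag.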

See this for more explanations.

