Unicode Tagging in Python NLTK
I am working on a Python NLTK tagging program. My input file is Hindi text containing several lines. After tokenizing the text and running pos_tag, every token in the output is tagged NN. With an English sentence as input, the tagging is correct. Versions: Python 3.4.1, NLTK 3.0.
Kindly help! Here is what I tried:
word_to_be_tagged = u"ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात."
from nltk.corpus import indian
train_data = indian.tagged_sents('hindi.pos')[:300]
test_data = indian.tagged_sents('hindi.pos')[301:]
print(word_to_be_tagged)
print(train_data)
And the output is different from what I expected:
ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
[[('पूर्ण', 'JJ'), ('प्रतिबंध', 'NN'), ('हटाओ', 'VFM'), (':', 'SYM'), ('इराक', 'NNP')], [('संयुक्त', 'NNC'), ('राष्ट्र', 'NN'), ('।', 'SYM')], ...]
To solve the problem, you should use a Hindi POS tagger:
import nltk
from nltk.corpus import indian
from nltk.tag import tnt

train_data = indian.tagged_sents('hindi.pos')
tnt_pos_tagger = tnt.TnT()
tnt_pos_tagger.train(train_data)  # train the TnT part-of-speech tagger on the Hindi data
print(tnt_pos_tagger.tag(nltk.word_tokenize(word_to_be_tagged)))
The problem is that a part-of-speech tagger is only accurate within a specific domain (mostly a combination of language and topic). In English, most of the words a tagger has not seen before are nouns (NN), so an English tagger tags your Hindi data as NN only.
If you train on the same domain that you want to tag afterwards (Hindi), it should be OK.
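To make the effect concrete, here is a minimal sketch in plain Python. It is not NLTK's tagger, just a toy lookup tagger that falls back to NN for unseen words, which is roughly the behavior described above. The function names `train_lookup_tagger` and `tag`, and the tiny training sets, are made up for illustration; the Hindi sample reuses the first sentence from the corpus output in the question.

```python
from collections import Counter, defaultdict


def train_lookup_tagger(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_label in sent:
            counts[word][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default_tag="NN"):
    """Use the learned tag if the word was seen, otherwise fall back to default_tag."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]


# Tiny hand-made training sets (hypothetical, for illustration only).
english_train = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), (".", ".")]]
hindi_train = [[("पूर्ण", "JJ"), ("प्रतिबंध", "NN"), ("हटाओ", "VFM"),
                (":", "SYM"), ("इराक", "NNP")]]

tokens = ["पूर्ण", "प्रतिबंध", "हटाओ"]

english_model = train_lookup_tagger(english_train)
hindi_model = train_lookup_tagger(hindi_train)

# Out of domain: no Hindi word was seen in training, so everything becomes NN.
out_of_domain = tag(tokens, english_model)
# In domain: the words were seen in training, so the learned tags are used.
in_domain = tag(tokens, hindi_model)

print(out_of_domain)
print(in_domain)
```

A real tagger such as TnT also uses context and word shape, so this only illustrates why a model trained on English has nothing useful to say about Hindi tokens.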
See this for more explanation.