Unicode Tagging of an input file in Python NLTK
I am reposting a question I asked earlier, this time with the code I have tried. I am working on a tagging program in Python with NLTK.
My input file is Konkani (an Indian language) text containing several lines. I guess I need to encode the input file. Kindly help.
My code and an input file of several sentences are given here.
Input file:
    ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात. दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा. आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
Code:
    import nltk

    file=open('kkn.txt')
    t=file.read();
    s=nltk.pos_tag(nltk.word_tokenize(t))

    print(s)
It gives this error in the output:
    >>>
    Traceback (most recent call last):
      File "g:/nltk/inputkonkanisentence.py", line 4, in <module>
        t=file.read();
      File "c:\python34\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
    >>>
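This is happening because the file you're trying to use is not encoded as cp1252, which is what Python used here because open() was called without an encoding (on Windows it falls back to the system default). Which encoding the file actually uses, you'll have to figure out. You have to specify that encoding when you open the file. For example: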
file = open(filename, encoding="utf8")
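If you are not sure which encoding the file uses, one way to find out is to try a few candidates in order. The sketch below is only an illustration: `read_text` and its candidate list are hypothetical names, and the temporary file is a stand-in for kkn.txt written as UTF-8 with a byte order mark. That is a guess about your file, although byte 0x8d at position 21 is what the first words of your text would produce in exactly that encoding. The "utf-8-sig" codec is the standard Python codec that reads UTF-8 and drops a leading byte order mark if there is one.

```python
import os
import tempfile

# First words of the input file from the question.
SAMPLE = "ताजो स्वास आनी चकचकीत दांत"


def read_text(path, encodings=("utf-8-sig", "utf-16")):
    """Hypothetical helper: return (text, encoding) for the first
    candidate encoding that decodes the whole file without error."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            continue
    raise ValueError("none of %r could decode %s" % (encodings, path))


# Stand-in for kkn.txt, written as UTF-8 with a byte order mark (assumption).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(SAMPLE)

# Decoding the raw bytes as cp1252 fails, as in the traceback.
with open(path, "rb") as f:
    raw = f.read()
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Reading with an explicit encoding succeeds.
text, used = read_text(path)
os.remove(path)

print(cp1252_error)
print(used, text.split())
```

Once the text is read this way, it can be passed to nltk.word_tokenize as in the question. If none of the candidates work, the editor that saved the file usually shows the encoding in its "Save As" dialog.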