Unicode Tagging of an input file in Python NLTK
I am reposting a question I asked earlier, this time with the code I tried. I am working on a tagging program in Python with NLTK.
My input file is Konkani (an Indian language) text containing several lines. I guess I need to encode the input file. Kindly help.
My code, with an input file of several sentences:
inputfile - ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात. दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा. आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
code-
import nltk
file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))
print(s)
It gives this error in the output -
>>>
Traceback (most recent call last):
  File "g:/nltk/inputkonkanisentence.py", line 4, in <module>
    t=file.read();
  File "c:\python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>>
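This is happening because the file you're trying to use is not encoded in cp1252, which is the default that open() picked up on your system. Which encoding the file actually uses is something you'll have to figure out. You then have to specify that encoding when you open the file. For example: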
file = open(filename, encoding="utf8")
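If the encoding isn't known up front, one option is to try a few likely candidates and keep the first one that decodes cleanly. The following is only a rough sketch under assumptions: `read_text` and its candidate list are hypothetical names made up for illustration, the temporary `kkn.txt` stands in for the real input file, and the choice of `utf-8-sig` (UTF-8 with an optional byte-order mark) is a guess. That guess appears consistent with the traceback, since a UTF-8 file with a BOM that starts with the text shown in the question would put byte 0x8d at position 21, but the real file may differ.

```python
import os
import tempfile


def read_text(path, candidates=("utf-8-sig", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes the file.

    Hypothetical helper: the candidate list is an assumption, not a rule.
    """
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of %r could decode %s" % (candidates, path))


# First few words of the sample input from the question.
sample = "ताजो स्वास आनी चकचकीत दांत"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kkn.txt")

    # Stand-in for the real file: UTF-8 with a BOM (an assumption).
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(sample)

    # Reading it as cp1252 reproduces the kind of failure in the traceback.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading it with an explicit, matching encoding works.
    text, used = read_text(path)

print(cp1252_failed, used, text == sample)
```

Once the text is read correctly, it can be passed to nltk.word_tokenize and nltk.pos_tag as in the question. Keep in mind that a candidate decoding without an error only shows that the bytes are valid in that encoding, not that the guess is right, so it is worth checking the decoded text by eye.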