Unicode Tagging of an input file in Python NLTK


I am reposting a question I asked earlier, this time with the code I tried. I am working on a Python NLTK tagging program.

My input file is Konkani (an Indian language) text containing several lines. I guess I need to encode the input file. Kindly help.

My code and the input file of several sentences are below.

inputfile - ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात. दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा. आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात. 

Code -

import nltk

file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))

print(s)

It gives this error in the output -

>>>
Traceback (most recent call last):
  File "g:/nltk/inputkonkanisentence.py", line 4, in <module>
    t=file.read();
  File "c:\python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>>

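This is happening because the file you're trying to use is not in the cp1252 encoding, which is what open() defaulted to on your system (note cp1252.py in the traceback). Which encoding the file actually uses, you'll have to figure out. You have to specify that encoding when you open the file. For example: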

file = open(filename, encoding="utf8")
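If you are not sure which encoding the file is in, one option is to try a few likely candidates in order and keep the first one that decodes cleanly. The following is only a minimal sketch: it assumes the file is either UTF-8 or UTF-16 (common choices for Devanagari text, but an assumption here), and read_text is a hypothetical helper name, not part of NLTK or the standard library. The demonstration writes its own temporary sample file rather than touching kkn.txt.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8-sig", "utf-16")):
    """Return (text, encoding) for the first candidate encoding that decodes the file.

    The candidate list is an assumption; adjust it to the encodings
    your files are likely to use.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            # This candidate could not decode the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode " + path)


# Demonstration with a temporary UTF-8 file holding a short Konkani sample.
sample = "ताजो स्वास आनी चकचकीत दांत"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

text, used = read_text(demo_path)
os.remove(demo_path)
print(used, text == sample)
```

"utf-8-sig" is tried first because it reads plain UTF-8 as well as UTF-8 with a byte order mark. A successful decode does not prove the guess is right, so it is still worth looking at the returned text before passing it to nltk.word_tokenize.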

