python - Preserving contractions with TextBlob ngrams


Is there a way to tell TextBlob not to split contractions like let's into let & 's when creating ngrams? I know they are technically 2 separate words, but I'd like to maintain them as one.

It looks like you've got 2 options here:

- Change the pattern of the tokenizer that TextBlob uses.
- Post-process the tokens.

The latter is easier, but slower.

Changing the pattern

TextBlob accepts NLTK tokenizers, and I'm more familiar with those, so we're going to use that. NLTK's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":

>>> import nltk
>>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("let's check this out.")
['let', "'", 's', 'check', 'this', 'out', '.']

Before the disjunction is \w+, which indicates word characters. After the disjunction is [^\w\s]+, which matches anything that's not a word character or whitespace, that is, punctuation.

If you want to include ' in words, to get "let's" as one token, you can add that character to the word character portion of the disjunction:

>>> import nltk
>>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("let's check this out.")
["let's", 'check', 'this', 'out', '.']
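To see how that pattern carries through to ngrams, here is a minimal sketch that uses only the standard library's re module rather than NLTK or TextBlob, so it runs without any installs. The helper names tokenize and ngrams are made up for illustration and are not part of either library.

```python
import re

# Same pattern as above: word characters plus apostrophes, or runs of punctuation.
CONTRACTION_PATTERN = re.compile(r"[\w']+|[^\w\s]+")


def tokenize(text):
    """Split text into tokens, keeping contractions such as "let's" whole."""
    return CONTRACTION_PATTERN.findall(text)


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("let's check this out.")
bigrams = ngrams(tokens, n=2)
print(tokens)
print(bigrams)
```

The contraction stays in a single slot of each ngram because it was never split in the first place.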

Post-processing

The regex approach isn't perfect, though. I suspect that TextBlob's built-in tokenizer might be a bit better than what we could hack together with a regex. If you strictly want to take contractions as 1 token, I recommend post-processing TextBlob's output.

>>> tokens = ["let", "'s", "check", "this", "out", "."]
>>> def postproc(toks):
...     toks_out = []
...     while len(toks) > 1:
...         bigram = toks[:2]
...         if bigram[1][0] == "'":
...             toks_out.append("".join(bigram))
...             toks = toks[2:]
...         else:
...             toks_out.append(bigram[0])
...             toks = toks[1:]
...     toks_out.extend(toks)
...     return toks_out
...
>>> postproc(tokens)
["let's", 'check', 'this', 'out', '.']

So this fixes exactly what you want fixed, but the whole post-processing step will add to the run time of your code.
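If the extra run time matters, one possible way to trim it is to merge in a single pass instead of re-slicing the list on every step. This is only a sketch, and merge_contractions is a name I made up. Note that it is not identical to the function above in edge cases: several apostrophe tokens in a row all get attached to the same preceding word.

```python
def merge_contractions(tokens):
    """Attach each token that starts with an apostrophe to the token before it."""
    merged = []
    for tok in tokens:
        if merged and tok.startswith("'"):
            # Glue "'s", "'re", etc. onto the previous token.
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


tokens = ["let", "'s", "check", "this", "out", "."]
print(merge_contractions(tokens))
```

Because it appends to one output list and never copies the input, the cost grows linearly with the number of tokens.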

