python - Preserving contractions with TextBlob ngrams

Is there a way to tell TextBlob not to split contractions like let's into let & 's when creating ngrams? I know they are technically 2 separate words, but I'd like to maintain them as one.

It looks like you've got 2 options here:

Change the tokenizer used in TextBlob.
Post-process the tokens.

The latter is easier, but slower.

Changing the pattern

TextBlob accepts NLTK tokenizers, and since I'm more familiar with those, we're going to use that. NLTK's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":

>>> import nltk
>>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("let's check this out.")
['let', "'", 's', 'check', 'this', 'out', '.']

Before the disjunction is \w+, which indicates word characters. After the disjunction is [^\w\s]+, which matches anything that's not a word character or whitespace, that is, punctuation.
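To see what each side of the disjunction contributes, here is a small sketch that runs the two alternatives separately and then together. It uses the standard library's re.findall rather than NLTK, on the assumption that it finds the same matches for this pattern; the variable names are just for illustration.

```python
import re

text = "let's check this out."

# Left side of the disjunction only: runs of word characters.
words_only = re.findall(r"\w+", text)

# Right side only: runs of characters that are neither word
# characters nor whitespace (punctuation, here).
punct_only = re.findall(r"[^\w\s]+", text)

# Both sides together, matched left to right through the string.
combined = re.findall(r"\w+|[^\w\s]+", text)

print(words_only)
print(punct_only)
print(combined)
```

The apostrophe only ever matches the right side, which is why "let's" comes apart into three tokens.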

If you want to include ' in words, to get "let's", you can just add that character to the word-character portion of the disjunction:

>>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("let's check this out.")
["let's", 'check', 'this', 'out', '.']
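Since the question is about ngrams, here is a rough sketch of how the modified pattern carries through to bigrams. It is standard library only: tokenize and ngrams are hypothetical helpers written for this example, not TextBlob or NLTK functions, and re.findall stands in for the NLTK tokenizer.

```python
import re

# Same pattern as above, with ' added to the word-character class.
PATTERN = re.compile(r"[\w']+|[^\w\s]+")

def tokenize(text):
    """Split text into word tokens (apostrophes included) and punctuation."""
    return PATTERN.findall(text)

def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("let's check this out.")
bigrams = ngrams(tokens, 2)
print(bigrams)

# Caveat: single quotes used as quotation marks get glued to words too.
quoted = tokenize("she said 'hi'")
print(quoted)
```

The first bigram comes out as ("let's", 'check'), so the contraction survives. The last line shows the trade-off: 'hi' keeps its surrounding quote marks as part of the token.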

Post-processing

The regex approach isn't perfect, though. I suspect that TextBlob's built-in tokenizer might be a bit better than what we could hack together with a regex. If you strictly want to take contractions as 1 token, I'd recommend simply post-processing TextBlob's output.

>>> tokens = ["let", "'s", "check", "this", "out", "."]
>>> def postproc(toks):
...     toks_out = []
...     while len(toks) > 1:
...         bigram = toks[:2]
...         if bigram[1][0] == "'":
...             toks_out.append("".join(bigram))
...             toks = toks[2:]
...         else:
...             toks_out.append(bigram[0])
...             toks = toks[1:]
...     toks_out.extend(toks)
...     return toks_out
...
>>> postproc(tokens)
["let's", 'check', 'this', 'out', '.']

So that fixes exactly what you want fixed, but the whole post-processing step adds run time to your code.
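If the extra run time matters, part of it comes from re-slicing the token list on every step. A single-pass variant avoids that. This is only a sketch: merge_contractions is a hypothetical name, and it assumes, like the version above, that the second half of a contraction is any token starting with an apostrophe.

```python
def merge_contractions(tokens):
    """Glue tokens such as "'s" or "'ll" onto the token before them."""
    merged = []
    for tok in tokens:
        # Only merge when there is a previous token to attach to.
        if tok.startswith("'") and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

tokens = ["let", "'s", "check", "this", "out", "."]
result = merge_contractions(tokens)
print(result)
```

One difference in behaviour: several apostrophe tokens in a row are all attached to the same preceding word here, while the pairwise version merges only the first of them.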
