python - Preserving contractions with textblob ngrams
Is there a way to tell TextBlob not to split contractions like let's into let & 's when creating ngrams? I know they are technically 2 separate words, but I'd like to maintain them as one.
It looks like you've got 2 options here:
- Change the tokenizer used in TextBlob.
- Post-process the tokens.
The latter is easier, but slower.
Changing the pattern
TextBlob accepts NLTK tokenizers, and since I'm more familiar with those, we're going to use that. NLTK's WordPunctTokenizer is a RegexpTokenizer with the pattern "\\w+|[^\\w\\s]+":
>>> import nltk
>>> nltk.tokenize.RegexpTokenizer("\\w+|[^\\w\\s]+").tokenize("let's check this out.")
['let', "'", 's', 'check', 'this', 'out', '.']
Before the disjunction is \w+, which indicates word characters. After the disjunction is [^\w\s]+, which matches anything that's not a word character or whitespace, that is, punctuation.
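To see the two branches of that disjunction in isolation, here is a minimal sketch using only the standard library's re module. It assumes that re.findall with the same pattern approximates what the tokenizer does for this input; tokenize and classify are hypothetical helper names, not part of NLTK or TextBlob.

```python
import re

# Same pattern as above: a run of word characters, OR a run of characters
# that are neither word characters nor whitespace.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Hypothetical helper: return all matches of PATTERN, left to right."""
    return PATTERN.findall(text)


def classify(token):
    """Hypothetical helper: say which branch of the disjunction a token matches."""
    return "word" if re.fullmatch(r"\w+", token) else "punct"


tokens = tokenize("let's check this out.")
labels = [(tok, classify(tok)) for tok in tokens]

for tok, label in labels:
    print(f"{tok!r}: {label}")
```

The apostrophe falls into the second branch, which is why it ends up as its own token between let and s.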
If you want to include ' in words, like "let's", you can add that character to the word character portion of the disjunction:
>>> nltk.tokenize.RegexpTokenizer("[\\w']+|[^\\w\\s]+").tokenize("let's check this out.")
["let's", 'check', 'this', 'out', '.']
Post-processing
The regex approach isn't perfect, though. I suspect TextBlob's built-in tokenizer might be a bit better than what we can hack together with a regex. If you strictly want to take contractions as 1 token, I'd recommend post-processing TextBlob's output.
>>> tokens = ["let", "'s", "check", "this", "out", "."]
>>> def postproc(toks):
...     toks_out = []
...     while len(toks) > 1:
...         bigram = toks[:2]
...         # If the second token starts with an apostrophe, glue the pair together.
...         if bigram[1][0] == "'":
...             toks_out.append("".join(bigram))
...             toks = toks[2:]
...         else:
...             toks_out.append(bigram[0])
...             toks = toks[1:]
...     # Keep whatever single token is left over.
...     toks_out.extend(toks)
...     return toks_out
...
>>> postproc(tokens)
["let's", 'check', 'this', 'out', '.']
So that gets it fixed the way you want it fixed, but the whole post-processing step will add run time to your code.
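Since the question was about ngrams, here is a rough sketch of how the merged tokens could be turned into bigrams without going back through TextBlob. Both merge_contractions and ngrams are hypothetical helpers written for illustration, not TextBlob or NLTK functions. Note that merge_contractions is a single-pass variant, not the postproc function above: it avoids the repeated list slicing, and it also glues consecutive apostrophe tokens onto the same word, which postproc does not do.

```python
def merge_contractions(tokens):
    """Hypothetical helper: attach tokens starting with ' to the previous token."""
    merged = []
    for tok in tokens:
        if tok.startswith("'") and merged:
            # e.g. "let" followed by "'s" becomes "let's"
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


def ngrams(tokens, n=2):
    """Hypothetical helper: return consecutive n-token windows as tuples."""
    if n <= 0:
        return []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["let", "'s", "check", "this", "out", "."]
merged = merge_contractions(tokens)
bigrams = ngrams(merged, n=2)

for gram in bigrams:
    print(gram)
```

Because each token is visited once, the extra cost of this step should stay proportional to the number of tokens, though it is still additional work on top of tokenizing.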