Java StreamTokenizer splits email address at @ sign
I am trying to parse a document containing email addresses, but the StreamTokenizer splits each e-mail address into two separate parts.
I already set the @ sign as an ordinaryChar and the space as the only whitespace:
StreamTokenizer tokeziner = new StreamTokenizer(freader);
tokeziner.ordinaryChar('@');
tokeziner.whitespaceChars(' ', ' ');
Still, all e-mail addresses are split up.
A line to parse looks like the following:
"student 6 name6 lastname6 del6@uni.at competition speech university of innsbruck".
The tokenizer splits del6@uni.at into "del6" and "uni.at".
Is there a way to tell the tokenizer not to split at @ signs?
So here is why it worked the way it did:
StreamTokenizer regards its input much as a programming language tokenizer would. That is, it breaks the input into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters, etc.
So in fact it does rather sophisticated tokenizing: recognizing comments, quoted strings and numbers. Note that in a programming language, you can have a statement like a = a+b;. A simple tokenizer that merely breaks the text at whitespace would break this into a, = and a+b;. StreamTokenizer would break it into a, =, a, +, b and ;, and would also give you the "type" of each of these tokens, so that your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
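To make the a = a+b; comparison concrete, here is a minimal sketch that runs the same text through a plain whitespace split and through a StreamTokenizer with its default settings. The class and method names (StatementTokens, splitOnWhitespace, tokenize) are made up for illustration only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class, not part of any library.
class StatementTokens {

    // Naive approach: break the text on runs of whitespace only.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // StreamTokenizer with its default syntax table: letters form words,
    // while characters such as '=', '+' and ';' come back as
    // single-character tokens whose value is stored in ttype.
    static List<String> tokenize(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(tokenizer.nval));
                } else {
                    tokens.add(String.valueOf((char) tokenizer.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("a = a+b;")); // [a, =, a+b;]
        System.out.println(tokenize("a = a+b;"));          // [a, =, a, +, b, ;]
    }
}
```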
It wasn't recognizing the @ as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at sval.
A StreamTokenizer would recognize your line as:
The word student
The number 6.0
The word name6
The word lastname6
The word del6
The character @
The word uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck
(This is the actual output of a little demo I wrote, tokenizing your example line and printing each token by type.)
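The demo code itself is not reproduced here. A minimal sketch of what such a demo could look like follows, assuming a default-configured StreamTokenizer reading the example line from a StringReader; the names DefaultTokenizerDemo and describeTokens are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a sketch, not the original demo.
class DefaultTokenizerDemo {

    // Tokenizes the input with the default syntax table and describes
    // each token according to the type reported in ttype.
    static List<String> describeTokens(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        List<String> descriptions = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        descriptions.add("The word " + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        descriptions.add("The number " + tokenizer.nval);
                        break;
                    default:
                        // Ordinary characters: ttype holds the character itself.
                        descriptions.add("The character " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return descriptions;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at competition "
                + "speech university of innsbruck";
        describeTokens(line).forEach(System.out::println);
    }
}
```

Checking ttype first, and only then reading sval or nval, is what makes the @ visible: for a single-character token sval is not set, so code that looks only at sval never sees it.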
In fact, by telling it that @ was an "ordinary character", you were telling it to take the @ as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.
(My emphasis.)
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('@','@'), it would have kept the whole e-mail address together. My little demo with that call added gives:
The word student
The number 6.0
The word name6
The word lastname6
The word del6@uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck
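As a rough sketch of that variant (again with hypothetical names, WordCharsDemo and describeTokens, and assuming the line is read from a StringReader), the only change from a default-configured tokenizer is the single wordChars('@', '@') call:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a sketch, not the original demo.
class WordCharsDemo {

    static List<String> describeTokens(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        // Treat '@' as a word constituent so that it no longer ends a word.
        tokenizer.wordChars('@', '@');
        List<String> descriptions = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        descriptions.add("The word " + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        descriptions.add("The number " + tokenizer.nval);
                        break;
                    default:
                        descriptions.add("The character " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return descriptions;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at competition "
                + "speech university of innsbruck";
        describeTokens(line).forEach(System.out::println);
    }
}
```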
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise, your options depend on the shape of your data. If it is line-based (each line is a separate record, and there may be a different number of tokens on each line), you would typically read the lines one by one from a reader and split them using String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.
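Both alternatives can be sketched briefly. The class and method names below (SimpleSplitting, readRecords, readTokens) are hypothetical, and the input comes from a StringReader only to keep the example self-contained; in practice it would be whatever reader you already have.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical illustration class, not part of any library.
class SimpleSplitting {

    // Line-based data: each non-empty line is one record,
    // split into fields on runs of whitespace.
    static List<List<String>> readRecords(Reader source) {
        List<List<String>> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        reader.lines()
              .map(String::trim)
              .filter(line -> !line.isEmpty())
              .forEach(line -> records.add(Arrays.asList(line.split("\\s+"))));
        return records;
    }

    // Whitespace-delimited chain of tokens: Scanner splits on
    // whitespace by default, regardless of line boundaries.
    static List<String> readTokens(Reader source) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at competition "
                + "speech university of innsbruck";
        System.out.println(readRecords(new StringReader(line)).get(0).get(4)); // del6@uni.at
        System.out.println(readTokens(new StringReader(line)).get(4));         // del6@uni.at
    }
}
```

Neither approach gives @ any special meaning, so the e-mail address stays in one piece without further configuration.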