Java StreamTokenizer splits email address at @ sign
I am trying to parse a document containing email addresses, but the StreamTokenizer splits each e-mail address into two separate parts.
I already set the @ sign as an ordinaryChar, and space as the only whitespace:
StreamTokenizer tokeziner = new StreamTokenizer(freader);
tokeziner.ordinaryChar('@');
tokeziner.whitespaceChars(' ', ' ');
Still, the e-mail addresses are split up.
A line to parse looks like the following:
"student 6 name6 lastname6 del6@uni.at competition speech university of innsbruck".
The tokenizer splits del6@uni.at into "del6" and "uni.at".
Is there a way to tell the tokenizer not to split at @ signs?
So here is why it worked the way it did:
StreamTokenizer regards its input much like a programming language tokenizer. That is, it breaks the input up into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters, etc.
So in fact it does rather sophisticated tokenizing: recognizing comments, quoted strings, and numbers. Note that in a programming language you can have a string like "a = a+b;". A simple tokenizer that merely breaks the text at whitespace would break this into "a", "=", and "a+b;". StreamTokenizer, however, would break it into "a", "=", "a", "+", "b", and ";", and it will also give you the "type" of each of these tokens, so your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
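To make the "a = a+b;" example concrete, here is a minimal sketch using a StreamTokenizer with its default syntax table. The class name ExpressionTokens and the helper tokenize() are made up for illustration, and the sketch ignores quoted strings, for which the text lives in sval rather than in ttype.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original post.
class ExpressionTokens {
    // Tokenizes the input with a default-configured StreamTokenizer and
    // returns the text of each token (words, numbers, single characters).
    static List<String> tokenize(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(st.sval);                      // word text is in sval
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(st.nval));      // number value is in nval
                } else {
                    tokens.add(String.valueOf((char) st.ttype)); // ordinary char is ttype itself
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a = a+b;")); // prints [a, =, a, +, b, ;]
    }
}
```

The operators come back as separate single-character tokens even though no whitespace separates them from the identifiers.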
It wasn't recognizing the @ as whitespace. In fact, it was parsing it and returning it as a token. But its value was in the ttype field, and you were probably just looking at sval.
A StreamTokenizer would recognize your line as:
The word student
The number 6.0
The word name6
The word lastname6
The word del6
The character @
The word uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck
(This is the actual output of a little demo I wrote, tokenizing your example line and printing each token by type.)
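The demo itself is not shown in the post, so the following is only a sketch of what such a program might look like, using the same settings as the question. TokenTypeDemo and describe() are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of a "print each token by type" demo.
class TokenTypeDemo {
    static List<String> describe(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        // Same configuration as in the question.
        st.ordinaryChar('@');
        st.whitespaceChars(' ', ' ');
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);
                        break;
                    default:
                        // For an ordinary character, sval is null and the
                        // character itself is stored in ttype.
                        out.add("character " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at "
                + "competition speech university of innsbruck";
        describe(line).forEach(System.out::println);
    }
}
```

Code that only reads sval on every token would see null for the @ token, which makes the address look as if it had simply been cut in two.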
In fact, by telling it that @ was an "ordinary character", you were telling it to take the @ as its own token (which it does anyway by default). The ordinaryChar() documentation tells you that this method:
Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.
(My emphasis).
In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('@', '@'), it would have kept the whole e-mail together. My little demo with that added gives:
The word student
The number 6.0
The word name6
The word lastname6
The word del6@uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck
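A minimal sketch of the wordChars() fix, assuming the rest of the syntax table is left at its defaults. EmailWordDemo and tokens() are illustrative names only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class showing '@' registered as a word character.
class EmailWordDemo {
    static List<String> tokens(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.wordChars('@', '@'); // '@' now continues a word instead of ending it
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval));
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at "
                + "competition speech university of innsbruck";
        System.out.println(tokens(line));
    }
}
```

One caveat with the default settings: number parsing is still enabled, so an address that starts with a digit would likely be returned as a number followed by a word rather than as a single token.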
If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise, your options depend on the shape of your data. If it is line-based (each line is a separate record, and there may be a different number of tokens on each line), you would typically read the lines one by one from a reader and then split them using String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.
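Both alternatives can be sketched briefly. The class and method names below are hypothetical, and the sketch assumes tokens are separated by runs of whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper class illustrating the two alternatives.
class SplitAlternatives {
    // Line-based data: split one already-read line on runs of whitespace.
    static String[] splitLine(String line) {
        return line.trim().split("\\s+");
    }

    // Whitespace-delimited stream of tokens: let Scanner hand them out.
    static List<String> scanTokens(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                tokens.add(sc.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at "
                + "competition speech university of innsbruck";
        System.out.println(splitLine(line)[4]);      // prints del6@uni.at
        System.out.println(scanTokens(line).get(4)); // prints del6@uni.at
    }
}
```

Neither approach assigns a type to a token, so "6" stays the string "6" and any conversion to a number is left to the caller.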