Java StreamTokenizer splits email address at @ sign


I am trying to parse a document containing email addresses, but the StreamTokenizer splits each e-mail address into two separate parts.

I already set the @ sign as an ordinaryChar and space as the only whitespace:

StreamTokenizer tokenizer = new StreamTokenizer(freader);
tokenizer.ordinaryChar('@');
tokenizer.whitespaceChars(' ', ' ');

Still, the e-mail addresses are split up.

A line to parse looks like the following:

"student 6 name6 lastname6 del6@uni.at  competition speech university of innsbruck". 

The tokenizer splits del6@uni.at into "del6" and "uni.at".

Is there a way to tell the tokenizer not to split at @ signs?

So here is why it worked the way it did:

StreamTokenizer treats its input much like a programming language tokenizer would. That is, it breaks the input into tokens that are "words", "numbers", "quoted strings", "comments", and so on, based on the syntax the programmer sets up for it. The programmer tells it which characters are word characters, plain characters, comment characters, etc.

So in fact it does rather sophisticated tokenizing: recognizing comments, quoted strings, numbers. Note that in a programming language, you can have a statement like a = a+b;. A simple tokenizer that merely breaks the text at whitespace would break this into a, = and a+b;. StreamTokenizer would break it into a, =, a, +, b, and ;, and would also give you the "type" of each of these tokens, so that your "language" parser can distinguish identifiers from operators. StreamTokenizer's types are rather basic, but this behavior is the key to understanding what happened in your case.
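To make the a = a+b; comparison concrete, here is a minimal sketch that tokenizes the same text both ways. The class and method names (SplitVersusStreamTokenizer, bySplit, byStreamTokenizer) are made up for illustration, and the StreamTokenizer is left with its default syntax table.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the original question or answer.
class SplitVersusStreamTokenizer {

    // Naive approach: cut the text at single spaces only.
    static List<String> bySplit(String text) {
        return Arrays.asList(text.split(" "));
    }

    // StreamTokenizer with its default syntax table.
    static List<String> byStreamTokenizer(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add(String.valueOf(tokenizer.nval));
                } else {
                    // For an ordinary character, ttype holds the character itself.
                    tokens.add(String.valueOf((char) tokenizer.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(bySplit("a = a+b;"));           // [a, =, a+b;]
        System.out.println(byStreamTokenizer("a = a+b;")); // [a, =, a, +, b, ;]
    }
}
```

The operators come back as separate single-character tokens, which is the same thing that happens to the @ in an e-mail address.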

It wasn't treating the @ as whitespace. In fact, it was parsing it and returning it as a token of its own. But its value was in the ttype field, and you were probably only looking at sval.

A StreamTokenizer would recognize your line as:

The word student
The number 6.0
The word name6
The word lastname6
The word del6
The character @
The word uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck

(This is the actual output of a little demo I wrote, tokenizing your example line and printing each token by type.)
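The demo itself is not reproduced here. A minimal sketch of what such a demo might look like follows, using the same setup as in the question. The names TokenTypeDemo and describe are hypothetical, and the input is read from a StringReader rather than a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of a "print each token by type" demo.
class TokenTypeDemo {

    // Tokenizes the line and describes each token according to its ttype.
    static List<String> describe(String line) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        tokenizer.ordinaryChar('@');           // same calls as in the question
        tokenizer.whitespaceChars(' ', ' ');

        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add("The word " + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("The number " + tokenizer.nval);
                        break;
                    default:
                        // For an ordinary character, ttype is the character
                        // itself and sval is null.
                        tokens.add("The character " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at competition speech university of innsbruck";
        describe(line).forEach(System.out::println);
    }
}
```

Code that only ever reads sval never sees the @ token, which makes it look as if the character had been swallowed as a separator.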

In fact, by telling it that @ is an "ordinary character", you were telling it to take the @ as its own token (which it does anyway by default). The ordinaryChar() documentation says that this method:

Specifies that the character argument is "ordinary" in this tokenizer. It removes any special significance the character has as a comment character, word component, string delimiter, white space, or number character. When such a character is encountered by the parser, the parser treats it as a single-character token and sets ttype field to the character value.

(my emphasis).

In fact, if you had instead passed it to wordChars(), as in tokenizer.wordChars('@','@'), it would have kept the whole e-mail address together. My little demo with that call added gives:

The word student
The number 6.0
The word name6
The word lastname6
The word del6@uni.at
The word competition
The word speech
The word university
The word of
The word innsbruck
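As a sketch of the wordChars() variant, the following collects only the word tokens of a line, with @ registered as a word character. EmailWordDemo and words are illustrative names, not taken from the original demo.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing wordChars('@', '@') in isolation.
class EmailWordDemo {

    // Returns the word tokens of the line, treating '@' as part of a word.
    static List<String> words(String line) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(line));
        tokenizer.wordChars('@', '@');   // '@' no longer ends a word

        List<String> words = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    words.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [name6, del6@uni.at, speech]
        System.out.println(words("name6 del6@uni.at speech"));
    }
}
```

The dot in uni.at stays inside the word because the default number parsing already treats '.' as a character that can continue a word once one has started.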

If you need a programming-language-like tokenizer, StreamTokenizer may work for you. Otherwise your options depend on the shape of your data. If it is line-based (each line is a separate record, and there may be a different number of tokens on each line), you would typically read the lines one by one from a reader and then split them using String.split(). If it is just a whitespace-delimited chain of tokens, Scanner might suit you better.
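Both alternatives can be sketched in a few lines. This is only an illustration under the assumption that tokens are separated by runs of whitespace; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical example of the two non-StreamTokenizer options.
class WhitespaceSplitDemo {

    // Line-based data: split one record at runs of whitespace.
    static List<String> splitLine(String line) {
        return Arrays.asList(line.trim().split("\\s+"));
    }

    // Whitespace-delimited stream of tokens: let Scanner walk through it.
    static List<String> scanTokens(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "student 6 name6 lastname6 del6@uni.at  competition speech";
        System.out.println(splitLine(line));
        System.out.println(scanTokens(line));
    }
}
```

In both cases del6@uni.at comes back as a single token, because only whitespace acts as a separator.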

