parsing - Can I force ANTLR4 to read expected tokens instead of letting it guess what kind of token it may be?

February 15, 2014

I am trying to write a simple ANTLR4 grammar for parsing SRT subtitle files. I thought it would be an easy, introductory task, but I guess I must be missing some point. First things first, the grammar:

    grammar srt;

    file    :   subtitle (NL NL subtitle)* EOF;

    subtitle:   SUBNO NL
                TSTAMP ' --> ' TSTAMP NL
                LINE (NL LINE)*;

    TSTAMP  :   I99 ':' I59 ':' I59 ',' I999;
    SUBNO   :   D09+;
    NL      :   '\r'? '\n';
    LINE    :   ~('\r'|'\n')+;

    fragment I999   :   D09 D09 D09;
    fragment I99    :   D09 D09;
    fragment I59    :   D05 D09;
    fragment D09    :   [0-9];
    fragment D05    :   [0-5];

And here is the beginning of the SRT file where the problem starts:

    1
    00:00:20,000 --> 00:00:26,000

The error is:

    line 2:0 mismatched input '00:00:20,000 --> 00:00:26,000' expecting TSTAMP

So it looks like the second line was matched by the lexer rule LINE (as the longest token that could be matched), while I expected it to match the rule TSTAMP (which is why TSTAMP is defined before LINE in the grammar). My ANTLR4 knowledge is too weak at this point to tweak the grammar so that the lexer tries to match only a subset of tokens depending on the current position in the parser rule. What I intend to achieve is to match TSTAMP, not LINE, since TSTAMP is in fact the expected input. Maybe I could do the trick with lexer modes, but I can hardly believe it couldn't be written in a simpler way. Can it?

As Corona suggested, the trick was to defer the decision about what a line is from the lexer to the parser, and that was the clue. I modified the grammar a bit more, and now it parses the subtitles smoothly:

    grammar srt;

    file    :   subtitle (NL NL subtitle)* EOF;

    subtitle:   SUBNO NL
                TSTAMP ' --> ' TSTAMP NL
                lines;

    lines   :   line (NL line)*;
    line    :   (LINECHAR | SUBNO | TSTAMP)*;

    TSTAMP  :   I99 ':' I59 ':' I59 ',' I999;
    SUBNO   :   D09+;
    NL      :   '\r'? '\n';
    LINECHAR:   ~[\r\n];

    fragment I999   :   D09 D09 D09?;
    fragment I99    :   D09 D09;
    fragment I59    :   D05 D09;
    fragment D09    :   [0-9];
    fragment D05    :   [0-5];

Your definition of the token LINE subsumes everything:

    LINE    :   ~('\r'|'\n')+;

Every TSTAMP is also a valid LINE, and LINE can match longer lexemes. As you can see, ANTLR prefers the longest match.

To make the grammar work, transfer the decision about what a line is from the lexer to the parser:

    subtitle:   SUBNO NL
                TSTAMP ' --> ' TSTAMP NL
                line*;

    line:   (LINECHAR | TSTAMP | SUBNO)* NL?;

    ...

    LINECHAR    :   ~('\r'|'\n' ) ; //remove '+'

As you can see, a line may contain LINECHARs, TSTAMPs and SUBNOs.
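The longest-match behavior described above, and the effect of replacing LINE with a single-character token, can be illustrated with a small simulation. This is only a rough Python sketch, not ANTLR itself: the `tokenize` helper, the regex translations of the lexer rules, and the token name `ARROW` (standing in for the implicit ' --> ' literal) are assumptions made for illustration.

```python
import re

def tokenize(text, rules):
    """Simulate longest-match lexing: at each position try every rule and
    keep the longest match; on a tie the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so earlier rules win ties.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Regex approximations of the lexer rules (assumed translations).
ORIGINAL = [
    ("ARROW", r" --> "),  # hypothetical name for the implicit literal
    ("TSTAMP", r"[0-9][0-9]:[0-5][0-9]:[0-5][0-9],[0-9][0-9][0-9]"),
    ("SUBNO", r"[0-9]+"),
    ("NL", r"\r?\n"),
    ("LINE", r"[^\r\n]+"),
]

# Same rules, but the greedy LINE is replaced by a one-character LINECHAR.
FIXED = ORIGINAL[:-1] + [("LINECHAR", r"[^\r\n]")]

sample = "1\n00:00:20,000 --> 00:00:26,000\n"

original_tokens = tokenize(sample, ORIGINAL)
fixed_tokens = tokenize(sample, FIXED)

print([name for name, _ in original_tokens])
# ['SUBNO', 'NL', 'LINE', 'NL']
print([name for name, _ in fixed_tokens])
# ['SUBNO', 'NL', 'TSTAMP', 'ARROW', 'TSTAMP', 'NL']
```

With the original rules the whole second line becomes one LINE token, because it is longer than the TSTAMP match at the same position. With LINECHAR, no single token can swallow the line, so the timestamps surface as TSTAMP tokens and the parser decides what counts as a text line.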
