parsing - Can I force ANTLR4 to read expected tokens instead of letting it guess what kind of token it may be?
I am trying to write a simple ANTLR4 grammar for parsing SRT subtitle files. I thought it would be an easy, introductory task, but I guess I must be missing some point. First things first, the grammar:
    grammar srt;

    file     : subtitle (NL NL subtitle)* EOF;
    subtitle : SUBNO NL TSTAMP ' --> ' TSTAMP NL LINE (NL LINE)*;

    TSTAMP : I99 ':' I59 ':' I59 ',' I999;
    SUBNO  : D09+;
    NL     : '\r'? '\n';
    LINE   : ~('\r'|'\n')+;

    fragment I999 : D09 D09 D09;
    fragment I99  : D09 D09;
    fragment I59  : D05 D09;
    fragment D09  : [0-9];
    fragment D05  : [0-5];
And here's the beginning of the SRT file where the problem starts:

    1
    00:00:20,000 --> 00:00:26,000

The error is:

    line 2:0 mismatched input '00:00:20,000 --> 00:00:26,000' expecting TSTAMP
So it looks like the second line was matched by the lexer rule LINE (as the longest token has been matched), but I would expect it to match the rule TSTAMP (and that's why it's defined before the LINE rule in the grammar). My ANTLR4 knowledge is too weak at this point to tweak the grammar in such a way that the lexer would try to match only a subset of tokens depending on the current position in the parser rule. What I intend to achieve is to match TSTAMP, not LINE, since TSTAMP is in fact the expected input. Maybe I could do some trick with lexer modes, but I can hardly believe it couldn't be written in a simpler way. Can it?
As corona suggested, the trick was to defer the decision about the line rule to the parser, and that was the clue. I modified the grammar a bit more, and now it parses subtitles smoothly:
    grammar srt;

    file     : subtitle (NL NL subtitle)* EOF;
    subtitle : SUBNO NL TSTAMP ' --> ' TSTAMP NL lines;
    lines    : line (NL line)*;
    line     : (LINECHAR | SUBNO | TSTAMP)*;

    TSTAMP   : I99 ':' I59 ':' I59 ',' I999;
    SUBNO    : D09+;
    NL       : '\r'? '\n';
    LINECHAR : ~[\r\n];

    fragment I999 : D09 D09 D09?;
    fragment I99  : D09 D09;
    fragment I59  : D05 D09;
    fragment D09  : [0-9];
    fragment D05  : [0-5];
Your definition of the token LINE subsumes everything:

    LINE : ~('\r'|'\n')+;

Each TSTAMP is also a LINE, but LINE can match longer lexemes. And as you can see, ANTLR prefers the longest match.
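To see why the longest match hurts here, a rough simulation may help. The Python sketch below is not ANTLR itself: it approximates the lexer rules with regular expressions (the `ARROW` name for the `' --> '` literal, `RULES`, `next_token` and `tokenize` are all made up for this illustration) and picks the rule with the longest match, breaking ties in favour of the rule listed first. Rule names are written in upper case, as ANTLR requires for lexer rules.

```python
import re

# Regex approximations of the lexer rules from the question, in grammar order.
# ARROW stands for the ' --> ' literal; the name is hypothetical.
RULES = [
    ("ARROW", re.compile(r" --> ")),
    ("TSTAMP", re.compile(r"[0-9]{2}:[0-5][0-9]:[0-5][0-9],[0-9]{3}")),
    ("SUBNO", re.compile(r"[0-9]+")),
    ("NL", re.compile(r"\r?\n")),
    ("LINE", re.compile(r"[^\r\n]+")),
]


def next_token(text, pos, rules=RULES):
    """Return (name, lexeme, end) for the longest match at pos.

    On equal length the earlier rule wins, because a later rule only
    replaces the current best when its match is strictly longer.
    """
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and m.end() > pos and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best


def tokenize(text, rules=RULES):
    """Split text into (rule name, lexeme) pairs using next_token."""
    pos, tokens = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos, rules)
        tokens.append((name, lexeme))
    return tokens


if __name__ == "__main__":
    for token in tokenize("1\n00:00:20,000 --> 00:00:26,000\n"):
        print(token)
```

In this model the first line still comes out as SUBNO, because SUBNO and LINE match the same single character and SUBNO is listed first. On the second line, LINE matches the whole 29 characters while TSTAMP matches only 12, so LINE wins regardless of rule order, which is the situation the error message reports.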
To make your grammar work, transfer the decision about LINE from the lexer to the parser:
    subtitle : SUBNO NL TSTAMP ' --> ' TSTAMP NL line*;
    line     : (LINECHAR | TSTAMP | SUBNO)* NL?;
    ...
    LINECHAR : ~('\r'|'\n') ; // remove '+'
You can see that a line may contain LINECHARs, TSTAMPs, and SUBNOs.
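The effect of shrinking the catch-all token to a single character can be sketched the same way. This is again only a regex-based approximation of longest-match tokenization, not ANTLR's real lexer; `ARROW`, `RULES` and `tokenize` are names invented for the example.

```python
import re

# Same approximation as for the original grammar, but the greedy LINE token
# is replaced by LINECHAR, which matches exactly one non-newline character.
RULES = [
    ("ARROW", re.compile(r" --> ")),  # hypothetical name for the ' --> ' literal
    ("TSTAMP", re.compile(r"[0-9]{2}:[0-5][0-9]:[0-5][0-9],[0-9]{3}")),
    ("SUBNO", re.compile(r"[0-9]+")),
    ("NL", re.compile(r"\r?\n")),
    ("LINECHAR", re.compile(r"[^\r\n]")),
]


def tokenize(text, rules=RULES):
    """Longest match wins; on equal length the earlier rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[2]):
                best = (name, m.group(), m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best[:2])
        pos = best[2]
    return tokens


if __name__ == "__main__":
    print(tokenize("00:00:20,000 --> 00:00:26,000\n"))
    print(tokenize("Hello 42"))
```

Since LINECHAR never matches more than one character, TSTAMP and SUBNO now win wherever they apply, and the timestamp line is split into TSTAMP, the arrow literal, TSTAMP and NL. The price is that ordinary subtitle text arrives as a mix of LINECHAR, SUBNO and TSTAMP tokens, which is why the parser rule for a line has to accept all three.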