ANTLR lexers in depth
Preliminaries
These simple examples demonstrate some lexer basics for ANTLR. They assume that ANTLR 3 is installed and in the CLASSPATH; likewise, the current directory should also be in the CLASSPATH:
export CLASSPATH=".:/usr/local/antlr/lib/antlr-3.0.jar"
export CLASSPATH="$CLASSPATH:/usr/local/antlr/lib/antlr-2.7.7.jar"
export CLASSPATH="$CLASSPATH:/usr/local/antlr/lib/antlr-runtime-3.0.jar"
export CLASSPATH="$CLASSPATH:/usr/local/antlr/lib/stringtemplate-3.0.jar"
Each example uses a Simple grammar in a file called Simple.g.
The grammar can be compiled using:
java org.antlr.Tool Simple.g
An executable can then be produced with:
javac Simple.java SimpleLexer.java SimpleParser.java
This requires a Simple.java file containing a main method:
import org.antlr.runtime.*;

public class Simple {
    public static void main(String[] args) throws Exception {
        ANTLRInputStream input = new ANTLRInputStream(System.in);
        SimpleLexer lexer = new SimpleLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SimpleParser parser = new SimpleParser(tokens);
        System.out.println("\n"); // blank line before printing other output
        parser.thing();
    }
}
Note that the start rule in these examples is thing.
The compiled executable can be run using:
java Simple
Input is read from the standard input. Hit Control-D twice to signal the end of the input.
Unrecognized tokens
This grammar demonstrates how ANTLR responds to input not recognized by the lexer:
grammar Simple;
// lexer rules
FOO: 'foo' { System.out.println("foo"); };
BAR: 'bar' { System.out.println("bar"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
Invalid input
For input:
hello
ANTLR prints:
line 1:0 no viable alternative at character 'h'
line 1:1 no viable alternative at character 'e'
line 1:2 no viable alternative at character 'l'
line 1:3 no viable alternative at character 'l'
line 1:4 no viable alternative at character 'o'
done
Note that the lexer emits warnings for unrecognized characters and then moves on. Ultimately no tokens are recognized, and no tokens are passed on to the parser other than the EOF token.
Valid input
For input:
foobar
Prints:
foo
bar
done
Mixed input (valid/invalid)
For input:
*foo+bar=
Prints:
line 1:0 no viable alternative at character '*'
foo
line 1:4 no viable alternative at character '+'
bar
line 1:8 no viable alternative at character '='
done
Here we clearly see the unrecognized characters being skipped over by the lexer; the parser only sees the valid tokens.
The tokens section
Basics
This example uses the tokens section to define FOO, instead of the lexer rule used in the previous example:
grammar Simple;
tokens {
FOO = 'foo';
}
// lexer rules
BAR: 'bar' { System.out.println("bar"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
Invalid input
Input:
hello
Output:
line 1:0 no viable alternative at character 'h'
line 1:1 no viable alternative at character 'e'
line 1:2 no viable alternative at character 'l'
line 1:3 no viable alternative at character 'l'
line 1:4 no viable alternative at character 'o'
done
Note that this example behaves identically to the one above.
Valid input
Input:
foobar
Output:
bar
done
Note that tokens declared in the tokens section can’t have actions attached to them; this is why nothing is printed when FOO is recognized.
Mixed input
Input:
*foo+bar=
Output:
line 1:0 no viable alternative at character '*'
line 1:4 no viable alternative at character '+'
bar
line 1:8 no viable alternative at character '='
done
Precedence
This grammar explores whether tokens declared in the tokens section take precedence over those defined in normal lexer rules; note how both FOO and BAR now compete, trying to match foo:
grammar Simple;
tokens {
FOO = 'foo';
}
// lexer rules
BAR: 'foo' { System.out.println("bar"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
Upon compilation, ANTLR prints:
warning(208): Simple.g:8:1: The following token definitions are unreachable: BAR
Note that the relative ordering within the file has no effect; upon changing the order:
grammar Simple;
// lexer rules
BAR: 'foo' { System.out.println("bar"); };
tokens {
FOO = 'foo';
}
// parser rules
thing: .* EOF { System.out.println("done"); };
ANTLR still emits:
warning(208): Simple.g:8:1: The following token definitions are unreachable: BAR
Now for a slightly more subtle example; there is no longer a literal duplication of the two tokens, but BAR will still try to match the same thing as FOO:
grammar Simple;
tokens {
FOO = 'foo';
}
// lexer rules
BAR: 'a'..'z' 'a'..'z' 'a'..'z' { System.out.println("bar"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
This compiles without warning.
For input:
foo
ANTLR prints:
done
And for input:
bar
It prints:
bar
done
This demonstrates that tokens defined in the tokens section take precedence over those defined in normal lexer rules.
Tokens defined implicitly in the parser
In this section we show that tokens defined implicitly in the parser (using string literals) are exactly equivalent to tokens defined in the tokens section:
grammar Simple;
// lexer rules
BAR: 'bar' { System.out.println("bar"); };
// parser rules
thing: ('foo' | BAR)* EOF { System.out.println("done"); };
Behaves as follows:
# input
foo
# output
done
# input
bar
# output
bar
done
# input
foobar
# output
bar
done
And changing the grammar slightly in order to compare precedence (note that lexer rule BAR and the implicitly-defined token 'foo' will now try to match the same thing):
grammar Simple;
// lexer rules
BAR: 'a'..'z' 'a'..'z' 'a'..'z' { System.out.println("bar"); };
// parser rules
thing: ('foo' | BAR)* EOF { System.out.println("done"); };
Behaves as follows:
# input
foo
# output
done
# input
bar
# output
bar
done
# input
foobar
# output
bar
done
Note that the behaviour of implicitly-defined tokens is identical to that from the previous section, where tokens were explicitly defined in the tokens section.
Precedence among lexer rules
Literal tokens
This grammar demonstrates that when matching input that could match multiple lexer rules, ANTLR will choose the longest match. This is clearest in the case of literal tokens (see below for an example using non-literal tokens):
grammar Simple;
// lexer rules
SHORT: 'aaa' { System.out.println("short"); };
LONG: 'aaaa' { System.out.println("long"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
Behaviour is as follows:
# input
aaa
# output
short
done
# input
aaaa
# output
long
done
# input
aaaaaaa
# output
long
short
done
Of special interest is the last example; note how ANTLR chooses the longest match over the shorter possibility even though the SHORT lexer rule appears first in the grammar.
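The longest-match behaviour can be sketched outside ANTLR as a simple "maximal munch" loop. This is an illustrative Java sketch under my own assumptions, not ANTLR's generated code; the regex patterns simply stand in for the SHORT and LONG rules above:

```java
import java.util.*;
import java.util.regex.*;

public class MaximalMunch {
    // Try every rule at the current position and keep the longest match.
    public static List<String> tokenize(String input, Map<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> e : rules.entrySet()) {
                Matcher m = e.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches win, so ties go to the earlier rule.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = e.getKey();
                }
            }
            if (bestName == null) { pos++; continue; } // skip unrecognized character
            tokens.add(bestName);
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("SHORT", Pattern.compile("aaa"));
        rules.put("LONG", Pattern.compile("aaaa"));
        System.out.println(tokenize("aaaaaaa", rules)); // [LONG, SHORT]
    }
}
```

For "aaaaaaa" (seven characters) the loop picks LONG first even though SHORT is declared first, then SHORT for the remaining three characters, mirroring the lexer behaviour above.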
Non-literal tokens
Non-literal tokens are different to literal tokens in that the order in which the lexer rules are specified is critical, as shown in this example grammar:
grammar Simple;
// lexer rules
FOO: 'a'..'z'+ { System.out.println("foo"); };
BAR: 'a'..'f'+ { System.out.println("bar"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
Here ANTLR warns: warning(208): Simple.g:5:1: The following token definitions are unreachable: BAR.
Changing the order:
grammar Simple;
// lexer rules
BAR: 'a'..'f'+ { System.out.println("bar"); };
FOO: 'a'..'z'+ { System.out.println("foo"); };
// parser rules
thing: .* EOF { System.out.println("done"); };
ANTLR will now preferentially match BAR, if possible and all else being equal:
# input
aaa
# output
bar
done
# input
zzz
# output
foo
done
# input
aaaaaaaazzz
# output
foo
done
Note how when the input can match FOO or BAR, ANTLR prefers BAR because it appears first; but in the case where ANTLR could match the input as two tokens (BAR followed by FOO) or as a single, longer token (just FOO), it opts for the greedy match.
So ordering is critical for non-literal lexer rules, but in general this ordering is only relevant during the analysis phase; the following example rules should make this clear:
LETTERS: 'a'..'z'+;
LETTERS_SUBSET: 'a'..'f'+; // unreachable according to analysis
NUMBERS_SUBSET: '0'..'5'+; // reachable according to analysis
NUMBERS: '0'..'9'+;
If we take the two number-matching rules, note that the ordering was important during the analysis phase. But when the lexer is actually run the ordering is overridden by ANTLR’s preference for greedy matches:
# inputs
012345 # could match either rule
# but ANTLR will choose NUMBERS_SUBSET
# because it appears first in the lexer
0123456 # could match NUMBERS_SUBSET, followed by NUMBERS
# alternatively, could just match one NUMBERS token
# ANTLR chooses the greedy match (one NUMBERS token)
The only case in which strict ordering is preserved at both analysis and runtime regardless of length is when filtering is turned on (see the section on filtering below).
Prediction
ANTLR builds predictive lexers; that is, they attempt to use a small amount of "lookahead" to see what kind of token they are scanning without actually scanning the entire token. By default the lookahead in lexers is one character. ANTLR will automatically increase this in order to disambiguate cases like this:
# when ANTLR sees an "f" it must look ahead 7 more characters
# (until it sees the "d" or the "t")
# only then can it predict either FOOD or FOOT
FOOD: 'fooooood...';
FOOT: 'foooooot...';
Once lookahead has determined the type of token to be expected, ANTLR races forward and tries to scan the entire thing, raising an exception if the token didn’t match the prediction. The following simple grammar illustrates this:
grammar Simple;
FOO: BAR ':' BAZ {System.out.println("FOO");};
fragment BAR: 'bar' {System.out.println("bar");};
fragment BAZ: 'baz' {System.out.println("baz");};
EVERYTHING_ELSE: . {System.out.println("EVERYTHING_ELSE");};
thing: .* EOF {System.out.println("done");};
And here is the behaviour:
# input
bar:.
# output
bar
line 1:4 mismatched character '.' expecting 'b'
done
Here the lexer sees bar, sees a :, and assumes that this is a FOO token. It therefore expects a b character after the colon but instead finds a period, and so throws an exception.
# input
bar.
# output
bar
line 1:3 mismatched character '.' expecting ':'
done
Note that with even less input (before the lexer has even seen the colon), the lexer has already decided that this is a FOO token and so expects to see a colon after the r; when it finds a period instead it throws an exception.
# input
baz:.
# output
line 1:2 mismatched character 'z' expecting 'r'
EVERYTHING_ELSE
done
Now we try a different input before the colon to see whether the lexer will identify this as EVERYTHING_ELSE or not; evidently when it sees the ba it has already predicted that this will be a FOO token, and so fails. Note that even though it throws the exception, it continues on and scans one more character. If we chop one more letter off the input, note that it doesn’t continue scanning that extra character:
# input
baz:
# output
line 1:2 mismatched character 'z' expecting 'r'
done
Likewise, if we chop off yet another character, the lexer still expects to find a FOO token:
# input
baz
# output
line 1:2 mismatched character 'z' expecting 'r'
done
The same is true if we chop off another character:
# input
ba
# output
line 1:2 mismatched character '<EOF>' expecting 'r'
done
Only when we replace the second character with something other than a do we get an exception-free recognition:
# input
b.
# output
EVERYTHING_ELSE
EVERYTHING_ELSE
done
Likewise, a lone b or any other string that doesn’t start with ba is recognized without any problems:
# input
b
# output
EVERYTHING_ELSE
done
# input
foo
# output
EVERYTHING_ELSE
EVERYTHING_ELSE
EVERYTHING_ELSE
done
But note that embedding a ba anywhere in the input is enough to trigger the exceptions again:
# input
fooba
# output
EVERYTHING_ELSE
EVERYTHING_ELSE
EVERYTHING_ELSE
line 1:5 mismatched character '<EOF>' expecting 'r'
done
I believe ba will trigger the exception while b will not because ANTLR uses one character of lookahead in order to make predictions. When scanning through the input and looking ahead to a lone b character, ANTLR doesn’t yet know enough to predict whether the next token will be FOO or EVERYTHING_ELSE. On the other hand, when scanning a b character and looking ahead to find an a, ANTLR assumes that a FOO token has been found.
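My reading of this behaviour can be written down as a toy Java model. This is an assumption-laden illustration of k = 1 prediction, not ANTLR's generated code: one character of lookahead decides the prediction, and once committed there is no falling back to EVERYTHING_ELSE:

```java
// Toy model of k=1 prediction for the grammar above:
//   FOO: 'bar' ':' 'baz';   EVERYTHING_ELSE: . ;
public class Predict {
    // Decide what the token starting at pos would be.
    public static String next(String in, int pos) {
        char c = in.charAt(pos);
        // One character of lookahead: 'b' followed by 'a' predicts FOO.
        boolean lookaheadIsA = pos + 1 < in.length() && in.charAt(pos + 1) == 'a';
        if (c == 'b' && lookaheadIsA) {
            // Committed to FOO; any later mismatch is an error, not a retry.
            return in.startsWith("bar:baz", pos) ? "FOO" : "mismatch error";
        }
        return "EVERYTHING_ELSE"; // consume a single character
    }

    public static void main(String[] args) {
        System.out.println(next("bar:baz", 0)); // FOO
        System.out.println(next("baz:", 0));    // mismatch error
        System.out.println(next("b.", 0));      // EVERYTHING_ELSE
        System.out.println(next("b", 0));       // EVERYTHING_ELSE (no 'a' lookahead)
    }
}
```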
This prediction behaviour makes ANTLR lexers exceedingly difficult to get right. There is no "backtracking" like you might find in a hand-coded recursive descent parser. If you want the example grammar to handle strings like fooba without raising an exception then the only practical way that I know of to do so is to use the filter option:
grammar Simple;
options {filter=true;}
FOO: BAR ':' BAZ {System.out.println("FOO");};
fragment BAR: 'bar' {System.out.println("BAR");};
fragment BAZ: 'baz' {System.out.println("BAZ");};
EVERYTHING_ELSE: . {System.out.println("EVERYTHING_ELSE");};
thing: .* EOF {System.out.println("done");};
This grammar will handle input strings like ....foobaaaaa....bar:baz... and correctly identify the bar:baz embedded in the middle of the string, without raising exceptions for the incomplete substrings elsewhere in the input.
Filtering
As already mentioned above, filtering mode changes the precedence and matching behaviour of the generated lexer in the following ways:
- ANTLR synthesizes a Tokens rule that tries the lexer rules one at a time, in the exact order that they appear in the grammar.
- Greedy matching does not apply; the first match found wins regardless of length.
- If no rules match, the lexer skips a character (throwing it away) and tries again, restarting from the top of the rules list.
- This behaviour is equivalent to having backtracking turned on (failing options do not result in an exception message being emitted; the lexer merely rewinds, ready to try the next option) with a fixed lookahead of k = 1, combined with a strict ordering of the alternatives.
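These rules can be sketched as a simple loop. Again this is only an illustrative Java sketch of the description above, under my own assumptions (ANTLR's generated filter code is more involved); the regex patterns stand in for lexer rules:

```java
import java.util.*;
import java.util.regex.*;

public class FilterMode {
    // Try the rules in declaration order; first match wins regardless of
    // length; skip one character silently when nothing matches.
    public static List<String> tokenize(String input, LinkedHashMap<String, Pattern> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> e : rules.entrySet()) {
                Matcher m = e.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    tokens.add(e.getKey());
                    pos = m.end();
                    matched = true;
                    break; // first rule in order wins, even if a later rule matches more
                }
            }
            if (!matched) pos++; // no rule matched: discard one character and retry
        }
        return tokens;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("SHORT", Pattern.compile("aaa"));
        rules.put("LONG", Pattern.compile("aaaa"));
        // Unlike the non-filtering lexer, SHORT wins here because it is tried first:
        System.out.println(tokenize("aaaa", rules)); // [SHORT]
    }
}
```

Note the contrast with the maximal-munch behaviour of a normal lexer: for "aaaa" this loop emits SHORT and then silently discards the leftover a.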
Limitations
When filter is set to true, @after and action blocks have no effect. This means that when filter is set to true you can only generate a lexer, not a combined lexer/parser (this is confirmed in this post to the antlr-interest mailing list by Terence Parr):
grammar Simple;
options {filter=true;}
FOO: 'foo' {System.out.println("FOO");};
EVERYTHING_ELSE: . {System.out.println("EVERYTHING_ELSE");};
thing
@init { System.out.println("init"); } // printed regardless of filter setting
@after { System.out.println("after"); } // not printed when filter = true
: .* EOF {System.out.println("done");}; // not printed when filter = true
Results:
# output
init
FOO
EVERYTHING_ELSE
EVERYTHING_ELSE
EVERYTHING_ELSE
Results with filter = true commented out:
# output
init
FOO
EVERYTHING_ELSE
EVERYTHING_ELSE
EVERYTHING_ELSE
done
after
Note that in non-filtering mode, the action block is executed, along with the @init and @after blocks. In filtering mode, only the @init block is executed. Due to this limitation there isn’t really any sense in trying to write a parser when filtering is turned on, even in a case like this example grammar where the catch-all rule guarantees that everything will be tokenized. See "Patches to ANTLR 3.0" for a patch that I submitted that addresses this limitation.
Ambiguity
grammar Simple;
BAR: .+ { System.out.println("BAR"); }; // The following alternatives are unreachable: 1
FOO: 'foo' { System.out.println("FOO"); }; // The following token definitions are unreachable: FOO
thing: .* EOF;
Feeding input such as "abc" to this lexer yields:
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:2 required (...)+ loop did not match anything at character 'c'
Likewise, input of "foo" yields:
line 1:0 required (...)+ loop did not match anything at character 'f'
line 1:1 required (...)+ loop did not match anything at character 'o'
line 1:2 required (...)+ loop did not match anything at character 'o'
To consider why this is the case, we look at the generated lexer:
- The mTokens method will always predict BAR and never predict FOO (in keeping with the "unreachable" warning for FOO).
- The mBAR method will throw an EarlyExitException for the first character scanned, no matter what the character is; this is once again in keeping with the "unreachable" alternative warning within BAR.
- The parser will catch the exception and keep asking the lexer for tokens, resulting in two more exceptions before the input is exhausted.
Both warnings go away if the BAR rule is rewritten as follows:
BAR: '\u0000'..'\uFFFE'+;
ANTLRWorks shows .+ in its syntax diagram as exactly equivalent to '\u0000'..'\uFFFE' but do not let that fool you; the answer lies on page 101 of the ANTLR book, where it describes how .+ and .* default to non-greedy behaviour:
ANTLR considers them idioms for "Match any symbol until you see what lies beyond the subrule."
In other words:
# this rule:
BAR: .+;
# is really equivalent to:
BAR: (options { greedy = false; }: '\u0000'..'\uFFFE')+;
# which is not the same as:
BAR: '\u0000'..'\uFFFE'+;
# or (written in a different way):
BAR: (options { greedy = true; }: '\u0000'..'\uFFFE')+;
Setting greedy = true on this kind of subrule doesn’t make sense because "what lies beyond the subrule" is not specified (in fact there is nothing beyond it). A counter-example will make this clear:
# here a non-greedy subrule (really the ".*") makes sense
# because the "what lies beyond" is clearly stated
# in this case, "what lies beyond" is the sequence '*/'
COMMENT: '/*' .* '*/';
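A rough analogy (my own, not from the ANTLR book) is Java's reluctant regex quantifiers, which likewise mean "match as little as possible until you see what lies beyond". The analogy is imperfect — the regex still consumes at least one character, whereas ANTLR's non-greedy (...)+ exits the loop as early as possible — but it shows why a reluctant match with nothing beyond it is rarely what you want:

```java
import java.util.regex.*;

public class Reluctant {
    public static void main(String[] args) {
        // Reluctant quantifier at the end of a pattern: matches the bare minimum.
        Matcher greedy = Pattern.compile(".+").matcher("abc");
        Matcher lazy = Pattern.compile(".+?").matcher("abc");
        greedy.lookingAt();
        lazy.lookingAt();
        System.out.println(greedy.group()); // abc
        System.out.println(lazy.group());   // a

        // With "what lies beyond" specified, the reluctant match is useful,
        // just like the COMMENT rule above:
        Matcher comment = Pattern.compile("/\\*.*?\\*/").matcher("/* a */ x /* b */");
        comment.find();
        System.out.println(comment.group()); // /* a */
    }
}
```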
In fact, it is never correct to have greedy = false on the right edge of a rule, and ANTLR will always warn about it:
FOO: (options { greedy = false; } : 'foo')+;
This explains why a rule like .+ is described as containing an unreachable alternative, although I’d still like someone wiser in the ways of ANTLR to confirm my hypothesis. Also, I am not sure why the warning for FOO goes away, because it is still unreachable (input "foo" is recognized as BAR).
The generated code is different in the following ways:
- The mBAR method uses matchRange('\u0000','\uFFFE') under the covers, instead of matchAny().
- The mBAR method uses lookahead to see if the input character is in the range 0000 to FFFE (it always is); this is different from the other lexer, which did not use lookahead and which by default assumes that no match is possible and never follows the matchAny() path.
With the rule order inverted ANTLR issues no warnings:
grammar Simple;
FOO: 'foo' { System.out.println("FOO"); };
BAR: .+ { System.out.println("BAR"); };
thing: .* EOF;
For this grammar:
# input
foo
# output
FOO
# input
abc
# output
line 1:0 required (...)+ loop did not match anything at character 'a'
line 1:1 required (...)+ loop did not match anything at character 'b'
line 1:2 required (...)+ loop did not match anything at character 'c'
Clearly, using the . (match any character) symbol in conjunction with + doesn’t work in the current version of ANTLR. If we replace .+ with '\u0000'..'\uFFFE'+ then the behaviour is as expected:
# input
foo
# output
FOO
# input
abc
# output
BAR
When we change BAR to match a single character, ANTLR won’t issue any warnings even though the rule appears first (and an f could match BAR or potentially be the first character of FOO):
BAR: . { System.out.println("BAR"); };
FOO: 'foo' { System.out.println("FOO"); };
Behaviour:
# input
foo
# output (longest match wins)
FOO
# input
abc
# output
BAR
BAR
BAR
And for comparison (reverse order):
FOO: 'foo' { System.out.println("FOO"); };
BAR: . { System.out.println("BAR"); };
Behaviour:
# input
foo
# output
FOO
# input
abc
# output
BAR
BAR
BAR
The above examples again confirm that, all else being equal, the longest match will win regardless of lexer rule order. Now consider two rules that can only ever match the same length:
BAR: . . . { System.out.println("BAR"); };
FOO: 'foo' { System.out.println("FOO"); };
Behaviour:
# input
foo
# output
BAR
# input
abc
# output
BAR
Note how when the length of the two possible matches is equal, the first lexer rule (BAR in this case) wins, even if the other lexer rule may appear to be more specific.
On changing the order:
FOO: 'foo' { System.out.println("FOO"); };
BAR: . . . { System.out.println("BAR"); };
The behaviour is:
# input
foo
# output
FOO
# input
abc
# output
BAR
# input (first character could match FOO)
fbc
# output
BAR
# input (first two characters could match FOO)
foc
# output
BAR
It is interesting how in this example FOO is not predicted despite the presence of the two characters fo; this is different from the findings above, which showed prediction kicking in after seeing only two characters. I believe the reason prediction doesn’t occur here is that both FOO and BAR are known to be of the same length at analysis time, and so ANTLR won’t predict FOO until it has seen the whole thing. Witness this counter-example:
# input
fo
# output
line 1:0 no viable alternative at character 'f'
Here ANTLR doesn’t predict FOO, but nor can it recognize BAR, because three characters are required and only two are present.
Analysis problems
Here is another example of how ambiguity can cause problems. Consider the following examples:
// OK
ANYTHING : . ;
thing : ANYTHING* EOF;
// OK
ANYTHING : . ;
thing : element? EOF;
element : ANYTHING+ ;
// OK
ANYTHING : . ;
thing : element EOF;
element : ANYTHING* ;
// not OK
ANYTHING : . ;
thing : element* EOF;
element : ANYTHING+ ;
All is well in the first three cases. The last case causes ANTLR to complain "Decision can match input such as "ANYTHING" using multiple alternatives: 1, 2". In this case, given input like "abcdef", ANTLR doesn’t know whether to match it as a series of element objects or as a single element.
This is discussed on pages 283 and 285 of the ANTLR book, in the context of an ambiguity problem with semicolons (emphasis added):
The solution is to either make semicolons required or make them only statements. Semicolons should not be both statement terminators and statements as shown previously. Naturally, a good language designer would simply fix the language. With the grammar as is, though, ANTLR automatically resolves the nondeterminism greedily.
ANTLR generates a warning, but you can safely ignore it. At some point ANTLR will let you silence warnings for decisions that ANTLR properly resolves.
Empty lexer rules
Empty lexer rules are a "no-no"; that is, * and ? are bad in the lexer when they constitute an entire production, and can lead to hanging the ANTLRWorks debugger or a non-working lexer (source: this antlr-interest post by Terence Parr).