ANTLR4 学习

#Chapter 2 The Big Picture
##2.1 Let’s Get Meta!

  • interpreter
    if an application computes or “executes” sentences, we call that
    application an interpreter. Examples include calculators, configuration file
    readers, and Python interpreters.
  • translator
    If we’re converting sentences from one language to another, we call that application a translator. Examples include Java to C# converters and compilers.
  • parsers or syntax analyzers
    Programs that recognize (identify the various components in a phrase and differentiate it from other phrases) languages.
  • Two stages of parsing
    • lexical analysis or tokenizing (lexer)
      The process of grouping characters into words or symbols (tokens)
    • tokens
      Tokens consist of at least two pieces of information: the token type (identifying the lexical structure) and the text matched for that token by the lexer.
    • parser
      The second stage is the actual parser and feeds off of these tokens to recognize
      the sentence structure.
      By default, ANTLR-generated parsers build a data structure called a parse tree or syntax
      tree
      that records how the parser recognized the structure of the input sentence
      and its component phrases.

Example

We would specify phrase structure with a set of rules. Here is the grammar rule that corresponds to the first level of the assign subtree from the diagram”:
assign : ID '=' expr ';' ; // match an assignment statement like "sp = 100;"

##2.2 Implementing Parsers (之后有时间重新读一遍)

The ANTLR tool generates recursive-descent parsers from grammar rules.

A more general term for this kind of parsing is top-down parsing; recursive-descent parsers are just one kind of top-down parser implementation.

The stat symbol means calling method stat() for the parse tree. To get an idea of what recursive-descent parsers look like, here’s the (slightly cleaned up) method that ANTLR generates for rule assign:

1
2
3
4
5
6
7
// assign : ID '=' expr ';' ;
void assign() { // method generated from rule assign
match(ID); // compare ID to current input symbol then consume
match('=');
expr(); // match an expression by calling expr()
match(';');
}

For example, the stat rule that invokes assign
likely has a list of other kinds of statements.

1
2
3
4
5
6
/** Match any kind of statement starting at the current input position */
stat: assign // First alternative ('|' is alternative separator)
| ifstat // Second alternative
| whilestat
...
;

A parsing rule for stat looks like a switch.

1
2
3
4
5
6
7
8
9
void stat() {
switch ( «current input token» ) {
CASE ID : assign(); break;
CASE IF : ifstat(); break; // IF is token type for keyword 'if'
CASE WHILE : whilestat(); break;
...
default : «raise no viable alternative exception»
}
}

##2.3 You Can’t Put Too Much Water into a Nuclear Reactor (ambiguous phrase)
We have to provide unambiguous grammars.

  • duplicate is an obvious ambiguities
  • some more subtle ambiguities:
    1
    2
    3
    4
    5
    6
    stat: expr ';' // expression statement
    | ID '(' ')' ';' // function call statement
    ;
    expr: ID '(' ')'
    | INT
    ;
    Here are the two interpretations of input f(); starting in rule stat:

Ambiguities can occur in the lexer as well as the parser, but ANTLR resolves
them so the rules behave naturally

1
2
BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter

that input beginner would match only to rule ID. The lexer would not match beginner as BEGIN followed by an ID matching input ner.

##2.4 Building Language Applications Using Parse Trees