Lecture 9: Predictive Parsing

Top-Down Parsing

Top-down parsing is an approach in which, equivalently:

we start with the start symbol, and recursively expand it into the input string

we find a leftmost derivation for the input string

Ways to implement a top-down parser:

recursive-descent

lookup-table-driven

Top-down parsing is a relatively simple approach.  However, it does not work with all grammars.  In particular, a grammar that uses left recursion cannot be parsed top-down unless the left recursion is first eliminated.

General Model for Parsing

As we talk about parsing algorithms, it is important to be clear about how they work.  The parser uses the lexical analyzer to read tokens (terminal symbols) one at a time from the input program.  The parser's task is to organize the entire program (string of terminal symbols) into a parse tree according to the language's grammar.

As the parser works, the lexical analyzer maintains a logical "caret" positioned after all of the tokens that have already been read, and before the tokens that remain to be read:


Whenever necessary, the parser can request the next token.  The token is returned to the parser, and the caret moves one position forward.  So, when we talk about the "next token", we mean the one that is positioned immediately after the caret.  The parser can also ask the lexical analyzer to look ahead and return the next 1..k tokens following the caret (for some small integer k), without actually consuming them.  Lookahead is a very useful capability, since it may allow the parser to make an informed choice whenever it needs to find a grammar production to apply in its effort to parse the input.

We consider a special token, which we will denote '$', as representing the end of the input string.  When the parser requests a token and sees '$', it knows that the entire input program has been consumed.
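The model described above can be sketched in Python.  This is only an illustration under stated assumptions: the class name TokenStream and its methods peek and next are hypothetical, and each token is a single character to keep the sketch short.

```python
END = "$"  # special end-of-input token, as described above


class TokenStream:
    """Hypothetical lexer interface: a token list plus a caret position."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.caret = 0  # index of the next unread token

    def peek(self, k=1):
        """Return the k-th token after the caret without consuming anything."""
        i = self.caret + k - 1
        return self.tokens[i] if i < len(self.tokens) else END

    def next(self):
        """Return the token after the caret and move the caret forward."""
        tok = self.peek()
        if tok != END:
            self.caret += 1
        return tok


ts = TokenStream("a+b")
print(ts.peek())                                   # lookahead: caret stays put
print(ts.next(), ts.next(), ts.next(), ts.next())  # the last request sees END
```

Once the input is exhausted, every further request returns '$', so the parser never has to test for the end of the input separately.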

Recursive Descent

Recursive descent is an easy way to "hand code" a parser for a grammar that is amenable to top-down parsing.

In recursive-descent parsing, we write a function for each nonterminal symbol.  The purpose of the function implementing nonterminal symbol A is to expand an occurrence of the A nonterminal into the string of terminal symbols immediately following the parser's current caret position.

A typical function to expand the nonterminal symbol A in a recursive-descent parser looks like this [Dragon book, p. 219]:

function A {
    choose a production A -> s1 s2 s3 ... sn
    for each symbol s in s1..sn {
        if s is a terminal symbol {
            raise an error if the next token does not match s
            otherwise, consume the next token from the input program
        } else { /* s is a nonterminal */
            call the parser function for s
        }
    }
}

Parsing an entire program is initiated by calling the parser function representing the grammar's start symbol.

The most important issue in implementing each function is how the production should be chosen, since there may be multiple productions on a single nonterminal symbol.

One possibility is to make the choice nondeterministic; the parser tries all possible choices of production.  Any time the parser reaches a point where an error is raised (because the next input token didn't match the one expected), the parser "backtracks" to the most recent nondeterministic choice.  Nondeterminism and backtracking can be time-consuming, as the parser explores many dead ends in its attempt to find a leftmost derivation.

A better choice is to look ahead at the next tokens in the input program, and predict which production should be applied.  If the input token or tokens immediately following the caret uniquely identify a production any time a nonterminal is expanded, then the nondeterminism is unnecessary.
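As a minimal sketch of prediction in a recursive-descent parser, consider the hypothetical grammar S -> a S b | c, which is not part of the lecture and is chosen only because one token of lookahead obviously separates its two productions.  The names Parser, ParseError, peek, and expect are likewise assumptions made for illustration.

```python
class ParseError(Exception):
    pass


class Parser:
    """Predictive recursive-descent parser for the assumed grammar
           S -> a S b | c
    Tokens are single characters; '$' marks the end of the input."""

    def __init__(self, text):
        self.tokens = list(text) + ["$"]
        self.pos = 0  # caret: index of the next unread token

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, terminal):
        # Check the next token first, and consume it only if it matches.
        if self.peek() != terminal:
            raise ParseError(
                f"expected {terminal!r}, saw {self.peek()!r} at position {self.pos}")
        self.pos += 1

    def parse_S(self):
        look = self.peek()
        if look == "a":            # lookahead 'a' predicts S -> a S b
            self.expect("a")
            child = self.parse_S()
            self.expect("b")
            return ("S", "a", child, "b")
        if look == "c":            # lookahead 'c' predicts S -> c
            self.expect("c")
            return ("S", "c")
        raise ParseError(f"no production for S on {look!r} at position {self.pos}")

    def parse(self):
        tree = self.parse_S()
        self.expect("$")           # the whole input must be consumed
        return tree


print(Parser("aacbb").parse())
```

Because each production of S begins with a different terminal, the parser never has to backtrack: a mismatch is reported as an error immediately.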

To implement a predictive parser, we can analyze the grammar to compute the FIRST and FOLLOW sets for each nonterminal symbol.

FIRST sets

The FIRST set of any terminal symbol t is trivially { t }.

The FIRST set for nonterminal A, which we denote FIRST(A), represents the set of terminal symbols that can appear at the very beginning of the string of terminal symbols into which A can be expanded.  If A can expand to the empty string ε, then ε is in FIRST(A).

For example, consider the following grammar:

A -> a | B | C

B -> b | ε

C -> c e | d e

FIRST(A) includes the terminal symbols a, b, c, and d.  FIRST(A) also includes ε.

For any nonterminal A, FIRST(A) can be computed recursively as follows (Dragon book, p. 221):

For each production of the form A -> s1 s2 s3 ... sn:

If s1 is a terminal symbol, s1 is in FIRST(A)

If s1 is a nonterminal, then all non-ε symbols in FIRST(s1) are also in FIRST(A)

If for some i in 1..n, all symbols s1..si-1 are nonterminals which can expand to ε, then all non-ε symbols in FIRST(si) are in FIRST(A)

If all of s1..sn are nonterminals which can expand to ε (or the right-hand side is ε itself), then ε is in FIRST(A)
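The rules above can be sketched in Python for the example grammar.  Note one substitution: instead of literal recursion, this sketch repeats the rules until no FIRST set changes (a fixed-point iteration), which computes the same sets and also terminates on grammars whose nonterminals refer to each other cyclically.  The dictionary encoding of the grammar and the name compute_first are assumptions made for illustration.

```python
EPS = "ε"

# The example grammar from above.  Each right-hand side is a tuple of
# symbols; the empty tuple stands for ε.  Any symbol that is not a key
# of the dictionary is treated as a terminal.
GRAMMAR = {
    "A": [("a",), ("B",), ("C",)],
    "B": [("b",), ()],
    "C": [("c", "e"), ("d", "e")],
}


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                       # repeat until nothing new is added
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                before = len(first[nt])
                all_nullable = True
                for sym in rhs:
                    if sym not in grammar:           # terminal: FIRST is {sym}
                        first[nt].add(sym)
                        all_nullable = False
                        break
                    first[nt] |= first[sym] - {EPS}  # non-ε symbols only
                    if EPS not in first[sym]:        # sym cannot vanish: stop
                        all_nullable = False
                        break
                if all_nullable:                     # every symbol can vanish
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
for nt in GRAMMAR:
    print(f"FIRST({nt}) =", sorted(FIRST[nt]))
```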

How is the FIRST set useful?  Let's say we are implementing the A function in a recursive descent parser, and we are trying to choose between the following productions:

A -> s1 ...

A -> s2 ...

A -> s3 ...

If FIRST(s1), FIRST(s2), and FIRST(s3) are disjoint sets, then all we have to do is ask the lexical analyzer to look ahead at the next token.  Whichever of the three FIRST sets the token appears in identifies the correct production to apply.

FOLLOW sets

FOLLOW sets are also helpful in implementing a predictive parser in order to handle nonterminals that can (directly or indirectly) derive the empty string ε.

The idea is that, for any nonterminal A, FOLLOW(A) contains the set of terminal symbols which can immediately follow a string of terminal symbols generated by expanding an occurrence of A.  If it is possible that no further terminal symbols will appear after the expansion of A, then the special "end of input" token $ is in FOLLOW(A).

So, when we are trying to choose a production for a nonterminal A, if A can derive ε, and the next token to be read is in FOLLOW(A), then we should consider applying whatever production allows A to derive ε.

Example: same grammar as above (assume A is the start symbol):

A -> a | B | C

B -> b | ε

C -> c e | d e

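FOLLOW(A) is { $ }: A is the start symbol, and it does not appear on the right-hand side of any production.  Because B and C each appear at the very end of a production on A, FOLLOW(B) and FOLLOW(C) are also { $ }.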

Recursive procedure for computing FOLLOW sets for the nonterminal symbols in a grammar (Dragon book, p. 221):

If S is the start symbol, then $ is in FOLLOW(S).

If a production A -> α B β exists, then all symbols in FIRST(β) except ε are in FOLLOW(B).

If either

a production A -> α B exists, or

a production A -> α B β exists, and FIRST(β) contains ε

then all symbols in FOLLOW(A) are in FOLLOW(B)
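The three rules can be sketched in Python as follows.  The grammar used here is an assumption made for illustration: a small expression grammar over the tokens a, b, +, and * with left recursion already eliminated.  As a further substitution, the sketch iterates the rules until no set changes rather than recursing; the helper names first_of, compute_first, and compute_follow are hypothetical.

```python
EPS, END = "ε", "$"

# Assumed grammar for illustration.  () stands for ε; any symbol that is
# not a key of the dictionary is a terminal.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("a",), ("b",)],
}
START = "E"


def first_of(symbols, first):
    """FIRST of a sequence of grammar symbols (empty sequence gives {ε})."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, start, first):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                       # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue                 # FOLLOW is defined for nonterminals
                    beta_first = first_of(rhs[i + 1:], first)
                    new = beta_first - {EPS}     # rule 2: FIRST(β) without ε
                    if EPS in beta_first:        # rule 3: β is empty or can vanish
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, START, FIRST)
for nt in GRAMMAR:
    print(f"FIRST({nt}) = {sorted(FIRST[nt])}   FOLLOW({nt}) = {sorted(FOLLOW[nt])}")
```

Treating an empty β as having FIRST(β) = { ε } lets the two cases of the third rule share a single test.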

Example of FIRST and FOLLOW

Given the expression grammar from which we have eliminated left-recursion:


We can construct the FIRST and FOLLOW sets:


Generalization of FIRST sets

The FIRST set of any string of grammar symbols α = X1 X2 X3 ... Xn is defined as follows:

all non-ε symbols in FIRST(X1) are in FIRST(α)

if ε is in FIRST(X1), then all non-ε symbols in FIRST(X2) are in FIRST(α)

and so on: if ε is in each of FIRST(X1)..FIRST(Xi-1), then all non-ε symbols in FIRST(Xi) are in FIRST(α)

If all X1..Xn have ε in their FIRST sets, then ε is in FIRST(α)
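A short Python sketch of this definition follows.  The FIRST sets are written out by hand for the earlier example grammar (A -> a | B | C, B -> b | ε, C -> c e | d e), and the function name first_of_string is an assumption made for illustration.

```python
EPS = "ε"

# FIRST sets of the nonterminals in the earlier example grammar,
# written out by hand.
FIRST = {
    "A": {"a", "b", "c", "d", EPS},
    "B": {"b", EPS},
    "C": {"c", "d"},
}


def first_of_string(symbols, first):
    """FIRST(X1 X2 ... Xn) for a sequence of grammar symbols."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})   # a terminal t has FIRST(t) = {t}
        result |= sym_first - {EPS}         # contribute the non-ε symbols
        if EPS not in sym_first:            # this symbol cannot vanish: stop
            return result
    result.add(EPS)                         # every symbol can derive ε
    return result


print(sorted(first_of_string(["B", "C"], FIRST)))
print(sorted(first_of_string(["B", "A"], FIRST)))
print(sorted(first_of_string(["B", "e"], FIRST)))
```

In the first call, B can derive ε, so the symbols of FIRST(C) are included; C cannot derive ε, so ε is not in the result.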

Table-driven predictive (top-down) parsing

This approach is commonly used by parser generators.  The idea is that the parser generator analyzes the grammar (computing the FIRST and FOLLOW sets), and builds a parse table.  The parse table indicates, each time a production needs to be chosen, which production should be applied, based on the next (lookahead) token that will be returned by the lexical analyzer.

This approach works for LL(1) grammars.  An LL(1) grammar has the property that a single token of lookahead always suffices to identify the correct production any time we need to expand a nonterminal.

Basic idea: the table has a row for each nonterminal symbol, and a column for each terminal symbol (token type), including the end-of-input token $.

When we need to choose a production to expand some nonterminal symbol A, we consult A's row in the table, and ask the lexical analyzer for the next token t to be returned.  Whatever production appears in the table entry for t's column is the one the parser will use.

Constructing a predictive parsing table (Dragon book, p. 224):

NOTE: there is a mistake in the first printing of the book, affecting step 1 below.  See the errata.

Denote the entry in A's row and t's column as M[A, t]

For each production A -> α

1. For each terminal symbol t in FIRST(α), add A -> α to M[A, t].

2. If FIRST(α) contains ε, then for each symbol b in FOLLOW(A), add A -> α to M[A, b].  If FIRST(α) contains ε and $ is in FOLLOW(A), then add A -> α to M[A, $].

What is going on here?

The first step is straightforward: if we're trying to expand an A and we see a terminal symbol in the FIRST set of α, where α is the right-hand side of one of A's productions, we apply that production.

The second step is a bit more subtle.  If α, the symbols on the right-hand side of a production on A, can either directly or indirectly derive the empty string ε, then if we see any symbol in A's FOLLOW set, we should apply the production, assuming that it will expand to the empty string.

Any table entries not set to a production indicate error states: if we're trying to expand a particular nonterminal, and we see a terminal symbol for which no production is specified, then the input cannot be parsed, and the parser signals an error.

Another possibility is that when we build the table, we'll try to add more than one production to a single entry.  This means that the grammar is not LL(1): a single token of lookahead is not sufficient to uniquely identify the correct production to apply.
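The construction can be sketched in Python.  The inputs are assumptions made for illustration: a small expression grammar over the tokens a, b, +, and *, together with its FIRST and FOLLOW sets written out by hand.  The names build_table and first_of_string are hypothetical.  Each table entry is a list, so an entry with more than one production exposes a grammar that is not LL(1).

```python
EPS, END = "ε", "$"

# Assumed inputs.  () stands for ε.
PRODUCTIONS = [
    ("E",  ("T", "E'")),
    ("E'", ("+", "T", "E'")),
    ("E'", ()),
    ("T",  ("F", "T'")),
    ("T'", ("*", "F", "T'")),
    ("T'", ()),
    ("F",  ("a",)),
    ("F",  ("b",)),
]
FIRST = {"E": {"a", "b"}, "E'": {"+", EPS}, "T": {"a", "b"},
         "T'": {"*", EPS}, "F": {"a", "b"}}
FOLLOW = {"E": {END}, "E'": {END}, "T": {"+", END},
          "T'": {"+", END}, "F": {"*", "+", END}}


def first_of_string(symbols):
    """FIRST of a right-hand side; terminals have FIRST(t) = {t}."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def build_table(productions):
    table = {}   # (nonterminal, terminal) -> list of right-hand sides
    for lhs, rhs in productions:
        rhs_first = first_of_string(rhs)
        targets = rhs_first - {EPS}          # step 1: terminals in FIRST(α)
        if EPS in rhs_first:
            targets |= FOLLOW[lhs]           # step 2: FOLLOW(A), which may hold $
        for t in targets:
            table.setdefault((lhs, t), []).append(rhs)
    return table


TABLE = build_table(PRODUCTIONS)
conflicts = {key: rhss for key, rhss in TABLE.items() if len(rhss) > 1}
print("LL(1)" if not conflicts else f"not LL(1): {conflicts}")
```

Missing keys play the role of the error entries: looking up a (nonterminal, terminal) pair that is absent from the dictionary means the input cannot be parsed.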

Example

Our expression grammar from last time:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → a | b
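Its FIRST and FOLLOW sets:

FIRST(E) = FIRST(T) = FIRST(F) = { a, b }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { $ }
FOLLOW(T) = FOLLOW(T') = { +, $ }
FOLLOW(F) = { +, *, $ }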
We use this information to construct the following parse table:

     a            b            +              *              $
E    E → T E'     E → T E'     -              -              -
E'   -            -            E' → + T E'    -              E' → ε
T    T → F T'     T → F T'     -              -              -
T'   -            -            T' → ε         T' → * F T'    T' → ε
F    F → a        F → b        -              -              -

(A dash marks an empty entry, i.e. an error state.)

Because no table entry contains more than one production, we know that the grammar is LL(1): predictive parsing can process any input string without backtracking, accepting the strings in the language and signaling an error for all others.
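To show how such a table might drive a parser, here is a minimal Python sketch.  It uses the usual stack-based driver loop, which these notes do not spell out, so treat it as one possible realization rather than a prescribed one.  The table is written out by hand for the expression grammar over a, b, +, and *; the function name parse and the single-character tokens are assumptions made for illustration.

```python
END = "$"

# (nonterminal, lookahead token) -> right-hand side; [] stands for ε.
# Pairs that are absent from the dictionary are error entries.
TABLE = {
    ("E", "a"): ["T", "E'"],       ("E", "b"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", END): [],
    ("T", "a"): ["F", "T'"],       ("T", "b"): ["F", "T'"],
    ("T'", "+"): [],               ("T'", "*"): ["*", "F", "T'"],
    ("T'", END): [],
    ("F", "a"): ["a"],             ("F", "b"): ["b"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def parse(text, start="E"):
    """Return the productions applied, in leftmost-derivation order."""
    tokens = list(text) + [END]
    pos = 0                      # caret: index of the next unread token
    stack = [END, start]         # symbols still to be matched, top at the end
    derivation = []
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no production for {top} on {look!r} at position {pos}")
            derivation.append((top, rhs))
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
        elif top == look:
            pos += 1                      # terminal matches: consume the token
        else:
            raise SyntaxError(f"expected {top!r}, saw {look!r} at position {pos}")
    return derivation


for lhs, rhs in parse("a+b*a"):
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```

The stack replaces the call stack of a recursive-descent parser: pushing a right-hand side in reverse corresponds to calling the parser functions for its symbols from left to right.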