9 Top Down Parser – Pre-processing
Rajeswari Sridhar
In this module, the role of a Parser in the context of a compiler is discussed. Types of Parsers and the preprocessing that is necessary for a Top down parser are dealt in detail in this module. Pre-processing steps of eliminating left recursion and left factoring is dealt with algorithm and example.
9.1 .Role of the Parser
Parser is typically integrated with the lexical phase of the compiler. This is done to avoid multiple passes of the compiler, anyway a compiler will have more than one pass. Integration of a parser with the lexical phase is depicted in figure 9.1 (a) and an elaborate representation is given in figure 9.1 (b). From figure 9.1(a) it can be understood that the scanner issues tokens to the parser and the parser validates the tokens and converts it to an intermediate representation (IR).
Figure 9.1 (b) gives more clarity on the interaction between the lexer and the parser. The lexer scans the input code, tokenizes and issues a token to the parser, whenever a “GetNextToken” request is issued by the parser. The lexer records lexical errors during the scanning process while the parser records any syntax errors. The semantic phase is one more component of the front end of the compiler which will check for semantic errors. All these phases interact with the symbol table.
9.2 Brief discussion on Grammar
Before we actually discuss the types of parsers, a brief discussion on the Grammar is necessary. A grammar is used to denote the sentence structure of a particular language. A Grammar can be of 4 types, type 0, type 1, type 2 and type 3. Type 0 grammar defines a superset of a class of languages while Type 3 grammar defines a smaller set of language. For compiler, type 2 grammar otherwise called Context Free Grammar is used. A Context Free Grammar (CFG) is defined formally as a four tuple, (V, T, P, S) where, V is the set of Variables / Non-terminals, T is the set terminals that constitute the string, P is the set of Productions that has a LHS and RHS where the LHS can be replaced by RHS to derive a string and S is the special symbol called as the start symbol and is a subset of V. A string is said to belong to a grammar if and only if it is derived from the start symbol.
Equation (9.6), apply left most derivation to derive the string “id*id+id”. The process of derivation is depicted as a tree representation and is called as a derivation tree where the LHS non-terminal is the parent and the symbols on the RHS of the production are the children.
The grammar construct is used to derive strings. Thus, given any input string, the grammar construct is used to start deriving the string by applying one derivation after another. This process is called as top-down derivation. On the other hand, given a string, if the symbols are combined and if it yields the start symbol the process is called bottom-up derivation. In both events, if the string is derived from the start symbol or if the combination yields the start symbol of the grammar the string is said to be belonging to the grammar.
This process of derivation is used by the parsers to validate a string or a sequence of strings.
9.3 Types of Parsers
The functionality of the parser is to verify whether a sequence of tokens belong to a correct sentence structure. The validation is based on designing rules that has to be followed for constructing a sentence. The rules are specified using any one type of grammar. Using this grammar, when an input sentence is given, the parser compares whether the input sentence belongs to the grammar.
There are various types of parsers. The following is one of the classifications of parsers:
- niversal Parsers
– Cocke- Younger-Kasami (CYK)
– Earley Parser
- Top-Down Parsers
- Bottom Up Parsers
9.3.1 Universal Parsers
The Universal parsers CYK and Earley can parse any type of grammar given to it. They are typically used in natural language processing to validate the syntax of natural language sentences using a predefined grammar for Natural language. However, these parsers are not efficient for a compiler, where the sentence structure is based on Context Free Grammar.
9.3.2 Top Down Parsers
The top down parser is a name that is derived based on the construction of the derivation tree to validate a string for a particular grammar. If the input string is derived from the start symbol of the grammar then we conclude that the string belongs to the Grammar. For example, consider the Grammar
The string has to be taken by looking at the leaves of the derivation tree from left to right. Thus the string formed in “cabd” and this does not match with the input string. So, we try an alternate application of the production for “A” and this results in the string “cad”.
Thus in top-down parsing, we try all possible productions and if it doesn‟t match, we backtrack and try an alternate production to derive the string.
As backtracking is a typical characteristic of top-down parsing, these parsers have to handle this while validating sentences and these parsers are called as recursive descent parsers. There is a variation of recursive descent parsers which is Predictive parsers which avoids backtracking. One such type of predictive parsers is the LL parsers, where the first “L” stands for input being scanned from left to right and the second “L” stands for left-most derivation.
9.3.3 Bottom Up Parsers
As already discussed bottom up parsing, starts from the string and combines the strings to generate the start symbol in a bottom up fashion. Consider the same string “cad” for the grammar given in (9.7).
As the start symbol “S” could be reached this string belongs to the grammar. LR parsers are bottom up parsers where the “L” indicates input being scanned from left to right and the “R” indicates that we are applying right most derivation.
9.3.4 Property used by parsers
Top Down and Bottom up parsers parse the string based on a viable-prefix property. The property states that before the string is fully processed, if there is an error, the parser will identify it and recover from it. This property is based on identifying the possible prefix of all strings that belong to any context free grammar. All programming language constructs are defined using context free grammar. For example consider the following grammar with “stmt” as the start symbol, {stmt, E} are Non-terminals, {if, then, else, a, b} being terminals and productions defined as follows:
In this grammar, the prefix “a” is common for all the productions, hence while deriving a string one would not know which productions to substitute. This involves lot of backtracking and therefore the LL(1) parsers should consider the action of parsing after left factoring the grammar.
9.4.1 Elimination of Left Recursion
This is the first step of the pre-processing that is necessary for a grammar to be parsed by LL (1) grammar. As already discussed, if A à Aa is a production, this is referred to as A-production and this needs to be removed from the grammar. The algorithm for removing left recursion is given in algorithm 9.1
The first step arranges all the non-terminals in some order starting from the start symbol. There are two loops to pair every non-terminal with every other non-terminal. The logic behind the elimination algorithm is to substitute any non-terminal that starts with another non-terminal thus verifying the occurrence of an indirect left recursion. After forming new productions, immediate left recursion is eliminated. Immediate left recursion is eliminated using the following procedure. Consider the grammar to have the following A-productions and non-A productions
Left recursion from this grammar is eliminated by converting to a right-recursive grammar by using the following procedure