19 PARSER GENERATOR
Rajeswari Sridhar
In this module we will learn to use the parser generator to parse a given string. We will learn the YACC parser tool.
19.1 Parsers
The LL(k) and LR(k) parsers discussed in the previous modules uses the grammar definition of a programming language and verifies whether an input string belongs to a grammar by constructing a parsing table. The construction of the parsing table is a tedious process as can be understood from the previous modules and hence we check whether it is possible for a tool to do the parsing action. We have already discussed the LEX tool to do the action of tokenizing. We also know that there is enough interaction between the lexical phase and the syntactic phase, where the lexer issues a token based on a request from the parser. Due to the interaction between the lexer and parser we could use a tool for the parser and integrate it with the lexer. The following are some of the tools available for parsing:
- ANTLR (ANother Tool for Language Recognition) tool generates LL(k) parsers
- YACC (Yet Another Compiler Compiler) generates LALR(1) parsers
- Bison – It is an improved version of YACC
As bottom up parsers are preferred, YACC is a preferred tool for parsing.
19.2 YACC Tool
The overall working of the YACC tool is shown in figure 19.1. The YACC is a LALR parser and hence avoids shift/reduce conflicts to a greater extent. The input program is a YACC specification program and is named with a “.y” extension. This program consists of the grammar rules of a programming language and the corresponding action that needs to be taken. This “.y” file is compiled using a YACC compiler. The output of the YACC compiler is a file named “y.tab.c” which can be compiled using a C compiler to generate an output file. The “.y” file in-turn would call the “.l” LEX specifications file to tokenize the given input.
19.2.1 Structure of a YACC program
We can observe from figure 19.1 that YACC provides a general tool for describing the input to a computer program. The YACC file specifies the structures of the input, along with code to be called to recognize each structure. YACC converts the specification into a subroutine that handles the input process handled by this subroutine.
The input subroutine produced by YACC calls a user-supplied routine to return the next basic input item. Thus, the user can specify his inp ut in terms of individual characters or in terms of higher level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.
As in the case of a LEX program, the YACC is also a rule-based programming language. The YACC specification consists of the three parts: YACC declarations, translation rules and User- defined auxiliary procedures. The individual sections are delimited with a “%%”. The following is a skeleton of a YACC specification file. yacc declarations, and C declarations in %{ %} %%
translation rules
%%
user-defined auxiliary procedures
The declarations section consists of variables declarations that could be used in the program. The declaration section may be empty also. The next section is the translation rules section where we have grammar productions along with the actions. The productions correspond to the rules and the action corresponds to things that need to be done if the rule matches. The following is a sample of the translation rules
production1 | { semantic action1 | } | |
production2 | { semantic action2 | } | |
… | |||
productionn | { semantic actionn } | à (19.1) |
In YACC the productions are defined to accommodate multiple RHS for a single LHS non- terminal by separating the individual productions with the “|” symbol. However, the action that has to match the RHS is given immediately after the RHS of the production. The following is an example.
Nonterminal : tokens/nonterminals { action } | |
| tokens/nonterminals { action } | |
… | |
; | à (19.2) |
From equation (19.2) we can find that the tokens can be either single character or a sequence of characters. If the tokens are single characters then they can be used directly within productions, e.g. ‘+’. On the other hand, if the tokens have more number of characters then they are given a name and these named tokens must be declared first in the declaration part as given in equation (19.3).
%token TokenName | à (19.3) |
In equation (19.1), semantic actions may refer to values of the synthesized attributes of terminals and non-terminals in a production:
X : Y1 Y2 Y3 … Yn { action } à (19.4)
In (19.4) to associate a value to the LHS and RHS, the action part of the translation rule is used. To associate the values for X and Yi we could use “$$”to refer to the LHS value of the attribute of X while, “$i” is used to refer to the value of the attribute of Yi in the RHS of the production.
Consider the following example of translation rule for the production,
factor à (expr) | à (19.5) |
The production of (19.5) is represented as follows in the production section:
factor : ‘(’ expr ‘)’ { $$=$2; } | à (19.6) |
From (19.6) we can observe that the production of (19.5) is converted, where the LHS variable remains as it is, the “à” is replaced with “:”. The RHS has two symbols ‘(‘ and ‘)’, which are stated within quotes to indicate it is a constant. Now, the action section indicates, “$$” to specify this corresponds to the LHS of the production “factor” and “$2” indicates that the value of the RHS variable is “2”. This is shown in figure 19.2
19.2.2 YACC program entry point
yyparse() is the entry point of the YACC program. This function is called once from main() and this function repeatedly calls yylex() until parsing is completed. If a syntax error is encountered yyparse() calls a yyerror() function which is user specified. The yyparse() function returns 0 if all of the input was processed without any error and returns 1 if there is syntax error.
Example:
int main() { return yyparse(); }
19.2.3 YACC declarations
The YACC specification file contains declarations which have the information about tokens. The tokens are given names and is declared using ‘%token’. In this section, single-character tokens need not be declared. In addition, any name not declared as a token is assumed to be a non-terminal. The start symbol of a grammar is declared using ‘%start’ which is optional. The precedence and associativity associated with the operators are also stated in this section. The information that needs to be copied into the y.tab.c file is also stated, but they are enclosed in a pair of “%”.
19.2.4. YACC rules
The grammar productions are mapped as YACC rules. Consider the following grammar productions
A ® B1 B2 … Bm
A ® C1 C2 … Cn
A ® D1 D2 … Dk | à (19.7) |
The productions in (19.7) is mapped into the following YACC rule A ® B1 B2 … Bm
| C1 C2 … Cn
| D1 D2 … Dk ;
The multiple productions of A are merged into one rule with each production being separated from the other using the ‘|’ symbol. Each symbol that is available at the RHS of the production can have arbitrary actions in terms of C code, statements, actions, etc. For example, the grammar of (19.7) can have the following as the YACC rule.
A : B1 { printf(“after B1 \n”); x = 0; } B2 { x++; } B3
In this example, B1 is to print something and initializes a variable ‘x’, B2 increments ‘x’ and B3 could do something with it.
19.2.5 YACC Conflicts
For a YACC parser, the parser considers a left recursive grammar more efficient than right-recursive one. Conflicts arise when there is more than one way to proceed with parsing. The two types of conflicts that the YACC has to handle are shift-reduce and reduce-reduce. The YACC parser handles the shift-reduce conflict in favor of shift while the reduce/reduce conflict is handled by reducing it with the first rule listed. However, the YACC parser avoids conflicts by specifying the operator precedence and associativity. The YACC parser also restructures the grammar to avoid the conflict. The YACC uses the y.output to identify reasons for the conflict and takes corrective measures to avoid the conflicts.
The following example discusses how the YACC parser associative precedence and associativity to avoid conflicts.
Binary operators: %left, %right, %nonassoc:
%left ‘+’ ‘-‘
%left ‘*’ ‘/’
%right ‘^‘
Unary operators: %prec
We can change the precedence of a rule to be that of the token specified. Consider the following:
%left ‘+’ ‘-‘
%left ‘*’ ‘/‘
Expr: expr ‘+’ expr
| ‘–’ expr %prec ‘*’
| …
The precedence indicates that for this grammar the unary ‘-’ has higher precedence where “prec” is a keyword.
YACC follows a general approach to avoid conflicts and is listed below:
1. Use “yacc – v” to generate the file y.output.
- Examine y.output to find parser states with conflicts.
- For each such state, examine the items to figure why the conflict is occurring.
- Transform the grammar to eliminate the conflict.
Steps 1 to 4 could be iterated as many times to eliminate and resolve conflict. Table 19.1 summarizes some reasons for conflict and the method to resolve them.
Table 19.1 Conflicts and solution
Reason for conflict | Possible grammar transformation |
Ambiguity with operators in expressions | Specify associativity, precedence |
Error action | Remove or eliminate offending error action |
Semantic action | Remove the offending semantic action |
Insufficient look-ahead | Expand the non-terminal involved |
Other | Start all over |
19.2.6 Error handling
Errors are handled by the YACC parser similar to the manner in which errors are handled in LR parsers. The “token” ‘error’ is reserved for error handling. This token can be used in rules or it could suggest places where errors might be detected and recovery can occur. Consider the following example where it indicates the error handling situation for an “if- then” grammar construct.
Example: stmt : IF ‘(‘ expr ‘)’ stmt
| IF ‘(‘ error ‘)’ stmt
| FOR …
| …
The grammar production is defined for an error situation in the translation rules section. When an error occurs, the parser pops its stack until it enters a state where the token ‘error’ is legal and then behaves as if it saw the token ‘error’, performs the action encountered and resets the look-ahead token to the token that caused the error. If no ‘error’ rules are specified then the parser halts the processing.
The YACC Parser remains in error state until three tokens are correctly read in and shifted. The parser prevents cascaded error messages. If an error is detected while parser in error state the parser ensures no error message is given and the input token causing the error is deleted. The following functions are provided in YACC for error handling in addition to the user-defined procedure yyerror()
· yyerrok : – To force the parser to believe that an error has been fully recovered
• yyclearin: – To clear the token that caused the error
The YACC parser tries to place the error tokens closer to the start symbol of the grammar to allow recovery without discarding all input. The YACC parser also places error tokens near terminal symbols to help permit a small amount of input to be discarded in the event of an error. Error messages are also issued by the YACC parser on finding an error. The parser calls a function void yye rror(char *s) , where * s points to an error message which the user has provided and prints out the error messages. YACC provides more informative error messages using the “int yychar” which indicates the token number that causes the error by keeping track of line numbers, as well as any additional desired information.
19.3 Integration of LEX and YACC
The LEX tool is typically used as a tokenizer and the YACC is a parser. To parse a programming language construct, the input needs to be tokenized. So, the YACC specification program works with a LEX file. The user specifications are written in the LEX file for tokenizing with a “.l” extension and is compiled by a LEX compiler to generate lex.yy.c. The user’s YACC specification is compiled to generate a y.tab.c and y.tab.h. The three files, y.tab.c, y.tab.h, lex.yy.c are fed to a C compiler to generate the C output file and is shown in figure 19.3
.
Summary:
In this module we discussed the features of YACC and the need for YACC to perform parsing. The tool is more convenient to use than starting to code the parser from scratch. The subsequent module will discuss the next phase of the compiler, Semantic phase.