3 Lexical Phase – Regular Expressions

Lexical Phase – Introduction

 

The objective of this module is to get a clear understanding of the lexical phase and to learn how to define patterns using regular expressions.

 

This module discusses the functionalities of the lexical phase of the compiler: why we need a lexical phase, what it does, and how to do it efficiently.

 

3.1 Functions of the Lexical Analyser

 

The main job of the lexical phase is scanning and lexical analysis. Scanning is the first part; it removes comments and compacts consecutive whitespace characters into a single whitespace. The second part, lexical analysis, is more complex: it produces tokens. The input to this phase is the high-level language text file, and the output is a sequence of tokens. After this phase, the input text file no longer exists as such; it has been split into tokens to be passed on to the next phase of the compiler. In the process of splitting the input into tokens, the lexical analyser performs the following functions:

 

•         Identifies language keywords and standard identifiers

•         Handles include files and macros

•         Counts line numbers

•         Removes whitespace

•         Reports illegal symbols

•         Creates the symbol table

 

The lexical analyser does not have to be an individual phase, but having a separate phase simplifies the design and improves efficiency and portability. With these goals in mind, and in order to perform these functions efficiently, the lexical analyser is not assigned a pass of its own; rather, the lexical and syntax analysers together function as one pass, as shown in Figure 3.1.

 

 

From Figure 3.1, it can be understood that the lexer and the parser are combined into one pass: the parser issues a “Nexttoken()” command and the lexer returns a token to the parser. This grouping of the phases helps in achieving the following:

 

•         Simplifies the syntax analysis in terms of language definition

•         Enhances modularity and supports portability. It helps in using the same analyser-parser combination for various high-level languages across platforms as the analysis phase is target independent.

•         Reusability

•         Efficiency

 

3.2 Definitions

 

Let us look at some basic definitions pertaining to the lexical phase of the compiler. A token is a group of characters having a collective meaning. A lexeme is a particular instance of a token. For example, consider the variable “pi”. A token can be thought of as a name given to a lexeme: for this variable, the token is “identifier” and the corresponding lexeme is “pi”. Similarly, for the string “if”, the lexeme is “if” and the token is “keyword”. Lexemes are typically described by defining a pattern. The set of rules describing how a token can be formed is defined as its pattern. So, defining patterns for every possible lexeme is one of the main jobs of this phase of the compiler. For example, for the token identifier, the pattern is ([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*. This is a regular expression. The operators “[” and “]” indicate that any one character in the given range may be used. The operator “|” is the union operator: it indicates either this or that. The “*” is the Kleene closure operator, which indicates zero or more occurrences of the symbols it applies to. The implicit operator here is the concatenation operator “.”, which indicates a sequence of symbols. We will discuss more on defining regular expressions in the subsequent sections.
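As a small sketch (an illustration only, not part of the compiler itself), the identifier pattern above can be tried out directly with Python's re module:

```python
import re

# The identifier pattern from the text: a letter followed by
# zero or more letters or digits.
identifier = re.compile(r"([a-z]|[A-Z])([a-z]|[A-Z]|[0-9])*")

# fullmatch() succeeds only if the whole string fits the pattern.
print(bool(identifier.fullmatch("pi")))   # a valid identifier
print(bool(identifier.fullmatch("2pi")))  # invalid: starts with a digit
```

The same pattern is often written more compactly as [A-Za-z][A-Za-z0-9]*; the two forms describe the same language.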

3.2.1 Issues

 

Tokenizing is the primary job of this phase of the compiler. Hence, the primary issue is how to identify tokens. Typically, tokens are identified by defining patterns, either as regular expressions or as automata. The next issue is how the lexical phase will recognise the tokens using a token specification; this indirectly poses ways of implementing the nexttoken() routine. In order to solve these issues, the first two phases of the compiler are integrated.

 

3.2.2    Lexical Analysis Problem

 

Formally stated: given a set of token descriptions, each a token name and a regular expression defining the pattern for its lexemes, the lexical phase accepts an input string and partitions it into tokens (class, value). In choosing a lexeme, there could be two matching prefixes belonging to two different tokens. For example, consider the operator “**” for exponentiation. Will the compiler treat this as a multiplication operator on seeing the first “*” and log the second “*” as an error? No: it treats it as an exponentiation operator, because it tries to match the longest matching prefix. Thus, ambiguity encountered in this phase is typically resolved by choosing the longest matching token; between two equal-length tokens, the first one is selected. Table 3.1 lists some examples of tokens and their corresponding example lexemes.
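The longest-match rule can be sketched in a few lines. The token names and patterns below are illustrative, not a fixed standard:

```python
# A minimal longest-match (maximal munch) scanner sketch.
# Each entry is (token name, literal pattern); on a length tie,
# the pattern listed first wins.
TOKENS = [("exp_op", "**"), ("mult_op", "*"), ("assign_op", "=")]

def next_token(text, pos):
    # Try every pattern at position pos and keep the longest match.
    best = None
    for name, pat in TOKENS:
        if text.startswith(pat, pos) and (best is None or len(pat) > len(best[1])):
            best = (name, pat)
    return best

print(next_token("C**2", 1))  # ('exp_op', '**'), not ('mult_op', '*')
```

On seeing the first “*” of “C**2”, the scanner does not commit to mult_op; it keeps the longer match “**” and emits a single exponentiation token.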

 

Table 3.1 A few examples of tokens and their description.

3.2.3    Token Attributes

 

A token is represented along with its attribute. A typical attribute is a pointer to the symbol-table entry in which the information about the token is populated. Consider the following example

 

E = M * C ** 2        (3.1)

 

For the statement in (3.1), the lexemes are “E”, “=”, “M”, “*”, “C”, “**”, “2”.

 

The tokens and their attributes are listed below:

 

<id, pointer to symbol-table entry for E>

<assign_op,> – no attribute, since it is a keyword indicating the operator

<id, pointer to symbol-table entry for M>

<multi_op,> – no attribute, since it is a keyword indicating the operator

<id, pointer to symbol-table entry for C>

<exp_op,> – no attribute, since it is a keyword indicating the operator

<num, integer value 2>
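The token stream above can be produced mechanically. The sketch below is a hedged illustration: it models the symbol table as a plain dict mapping names to entry indices, and the token names mirror the listing above (with mult_op for “*”):

```python
import re

# Illustrative token specification; exp_op is listed before mult_op
# so that the alternation tries "**" before "*".
spec = [("id", r"[A-Za-z][A-Za-z0-9]*"), ("num", r"[0-9]+"),
        ("exp_op", r"\*\*"), ("mult_op", r"\*"), ("assign_op", r"=")]
master = re.compile("|".join(f"(?P<{n}>{p})" for n, p in spec))

symtab = {}   # stand-in for the symbol table: name -> entry index
tokens = []
for m in master.finditer("E=M*C**2"):
    kind, lexeme = m.lastgroup, m.group()
    if kind == "id":
        attr = symtab.setdefault(lexeme, len(symtab))  # symbol-table entry
    elif kind == "num":
        attr = int(lexeme)
    else:
        attr = None  # operators carry no attribute
    tokens.append((kind, attr))

print(tokens)
```

Note that Python's re alternation picks the first alternative that matches, so ordering exp_op before mult_op is what makes “**” come out as one token here.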

 

3.2.4 Patterns: The lexical phase decides the token based on defined patterns. Let us look at how to define patterns using regular expressions. Table 3.2 gives the basic idea about the basic patterns of a regular expression.

 

Table 3.2 Basic regular expression

Regular expressions are convenient for specifying patterns such as identifiers and comments, but constructing automata may be easier. In this context, it is to be understood that the languages generated by automata and by regular expressions both refer to the same class of languages, called regular languages. It is often easier to define an automaton than to state the equivalent regular expression. There is also an option that combines both: specify a partial automaton with regular expressions on the edges. This does not require us to specify all the possible states, only different actions at different states. As far as automata are concerned, a deterministic finite automaton (DFA) is faster for string matching. However, constructing a DFA directly is difficult. Therefore, we create a non-deterministic finite automaton (NFA) from every regular expression using an algorithm, merge all the automata using epsilon moves (like the | construction), then construct a DFA from the NFA using another algorithm, and finally use a procedure to minimise the automaton, starting with separate accepting states.

 

3.2.7 Automata

 

Automata here typically refer to deterministic ones. A DFA is a five-tuple (Q, Σ, δ, q0, F), where Q is a finite set of states, q0 ∈ Q is the start state, F ⊆ Q is the set of final states, and δ is a mapping from Q × Σ to Q. The transition function states that for every state, on every input symbol, there is exactly one transition. Extending δ to strings in the usual way, we define the language of the automaton as

L = {w | δ(q0, w) ∈ F}.

 

The interpretation of this definition is that the string “w” should take you from the start state q0 to one of the final states. Every string has exactly one path, and so a DFA is fast for string matching. An NFA, on the other hand, is similar to a DFA but gives some flexibility. It is a five-tuple (Q, Σ, δ, q0, F), where q0 ∈ Q, F ⊆ Q, and δ is a mapping from Q × Σ to 2^Q. The transition function here indicates that for every state, on every input symbol, there can be zero or more transitions.
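A DFA's single-path property makes its simulation a simple loop. The sketch below (an illustration with assumed state names) recognises the identifier language from Section 3.2: state 0 is the start state, state 1 is accepting, and any undefined move rejects:

```python
import string

# delta maps (state, symbol class) -> next state; exactly one
# transition per defined pair, as the DFA definition requires.
delta = {(0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1}

def classify(ch):
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    return "other"

def accepts(w, start=0, finals={1}):
    # A single pass over the input decides membership:
    # w is accepted iff it ends in a final state.
    q = start
    for ch in w:
        key = (q, classify(ch))
        if key not in delta:
            return False  # no transition defined: reject
        q = delta[key]
    return q in finals

print(accepts("pi"), accepts("2pi"))  # True False
```

Grouping the alphabet into symbol classes ("letter", "digit") is a common implementation shortcut; formally, δ is still total over Q × Σ.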

 

We define the language of the NFA as

 

L = {w | δ(q0, w) = R and R ∩ F ≠ ∅}.

 

Since multiple paths exist because of the multiple possible transitions, an NFA takes more time for string matching. An ε-NFA is the same as an NFA but with more flexibility, in that it allows a change of state without consuming any input symbol.

 

The definition of an ε-NFA differs from that of an NFA only in the transition function, which is defined as a mapping δ from Q × (Σ ∪ {ε}) to 2^Q. Hence, it is slower than an NFA for string matching.
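An ε-NFA can be simulated by tracking the set of currently possible states and closing it under ε-moves after every step. The toy automaton below is an assumed example (it accepts strings of a's ending in a single b), not one from the text:

```python
# δ: (state, symbol) -> set of next states; "" denotes an ε-move.
delta = {
    (0, ""): {1},     # ε-move from the start state
    (1, "a"): {1},
    (1, "b"): {2},
}
start, finals = 0, {2}

def eps_closure(states):
    # Repeatedly add every state reachable on ε until no change.
    closure = set(states)
    changed = True
    while changed:
        changed = False
        for q in list(closure):
            for r in delta.get((q, ""), set()):
                if r not in closure:
                    closure.add(r)
                    changed = True
    return closure

def accepts(w):
    current = eps_closure({start})
    for ch in w:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = eps_closure(moved)
    # Accept iff some reached state is final, i.e. R ∩ F ≠ ∅.
    return bool(current & finals)

print(accepts("aab"), accepts("aa"))  # True False
```

This set-of-states bookkeeping is exactly what makes NFA and ε-NFA simulation slower than running a DFA, and it is also the idea behind the subset construction that converts an NFA into a DFA.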

 

Summary: This module focused on constructing regular expressions and regular definitions as a way of defining patterns for tokens, which are used by the lexical phase of the compiler.