6 Lexical Phase -RE to DFA
The objective of this module is to construct a minimized DFA from a regular expression. A NFA is typically easier to construct but string matching with a NFA is slower. Hence, we go in for a DFA representation. However, if the regular expression is converted to a DFA using the Thompson’s subset construction algorithm, which was discussed in modules 4 and 5, the resultant DFA will have more states than actually necessary. Therefore for string matching, this DFA would take more than the optimized DFA that is directly constructed for the language. So, in this module, we will discuss the construction of a minimized DFA from a regular expression.
6.1 Minimized DFA Construction
A DFA can be constructed for a language using a direct representation as a directed graph, or as a procedure that converts a regular expression to a DFA. The conversion from RE to DFA can be done using the following procedure
- Thompson’s subset construction algorithm – results in a non-minimized DFA and hence could be minimized using Table filling algorithm
- Syntax tree procedure – Results in a minimized DFA
In this module, the syntax tree procedure that converts a regular expression to a minimized DFA is discussed.
6.2 Syntax Tree Procedure
The following algorithm, gives the steps to be followed in converting the regular expression (RE) to a DFA.
SyntaxTreeAlgorithm
Input: Regular Expression
Output: a Minimized DFA
- Augment the regular expression r with a special symbol # which is used as an end marker and any transition over # in a DFA will be an accepting. Hence, the new expression is r #.
- Construct a syntax tree for r#
- Traverse the tree to construct functions nullable( ), firstpos( ), lastpos ( ), and followpos( ) Based on the functions firstpos( ), lastpos ( ), and followpos( ) and using the tree, the minimized DFA is constructed Let us discuss each of these functions in detail before going ahead with the implementation of the algorithm:
6.2.1 Syntax Tree construction
Based on the precedence of the regular expression operators “*”, “+” and “.” a syntax tree is constructed. The kleene closure operator * has the highest precedence and will have just one child. The next precedence is assigned to the concatenation operator “.” and this will have two children. The least precedence is given to the union operator “+” and this will also have two children. The input symbols and the special symbol “#” will be the leaf nodes of this expression.
If “a” and “b” are input symbols and they are connected using the “+” operator then the construction of the syntax tree is given in figure 6.1 (a). The syntax tree will be a typical binary tree.
Using this method and the precedence of operators, a syntax tree for the regular expression r# is constructed. All the leaf nodes are labeled with integers from 1 to n, which would be used as the information to construct the DFA later.
6.2.2. Nullable( )
After constructing the syntax tree, for every symbol in the syntax tree, the function nullable() is defined. The leaf nodes are first assigned nullable based on whether the sub-tree under them can generate an empty string ε. Hence, if a leaf node is labeled with “ε” then the nullable information of that node is set to “True” and for all other input alphabet “a” such that “a” is not an operator, the nullable information of that node is set to “False”. For the regular expression operators, nullable information is assigned. They form the interior node part of the syntax tree. The concatenation operator and the union operator cannot generate empty string and hence their nullable information depends on the nullable status of its children. The concatenation operator is considered as “and” operation while the union operator is considered as “or” operation for determining the nullable information of the interior node. If c1 and c2 are the children of these interior nodes, then the following relationship is used to calculate the nullable information of the interior nodes “+” and “.”
nullable (+) = nullable(c1) or nullable(c2) | (6.1) |
nullable (.) = nullable (c1) and nullable(c2) | (6.2) |
Here the operators “or” and “and” indicates the logical operators. Hence, nullable(+) will be set to “True” if any of its children is nullable and nullable(.) will be set to “True” only if both of its children are nullable. Since, the kleene closure operator can generate ε as a string, the nullable of the * node is set to “True”
nullabe(*) = True | (6.3) |
6.2.3 firstpos()
The function firstpos(n) is defined as the set of positions that can match the first symbol of a string generated by the sub-tree at node “n”. It is defined for all the nodes and depends on the firstpos() values of the leaf nodes.
To start with, for any input symbol “a”, the firstpos(a) is the integer assigned to it. For the node “ε”, firstpos( ) is defined as empty. After assigning the firstpos( ) for all the leaf nodes, firstpos() of all interior nodes are computed. To compute this, nullable information of all interior and leaf nodes are necessary. For a node labeled with the Kleene closure operator, *, and child c1 the firstpos() is defined as
firstpos(*) = firstpos(c1) | (6.4) |
For interior nodes labeled with “+” and “.”with children c1 and c2, firstpos() is defined as follows:
firstpos(+) = firstpos(c1) U firstpos(c2) | (6.5) |
firstpos(.) = if (nullable(c1) == true) then
return (firstpos(c1) È firstpos(c2))
else
return firstpos(c1) | (6.6) |
This function is computed for all interior and leaf nodes and the result is stored in a list.
6.2.4 lastpos()
lastpos() is also computed for all the nodes of the syntax tree. lastpos(n) is defined as the set of positions that can match the last symbol of a string generated be the subtree at node n.
To start with, for any input symbol “a”, the lastpos(a) is also the integer assigned to it similar to the firstpos(). For the node “ε”, lastpos( ) is defined as empty, since ε which is defined as the suffix or prefix of any string is not shown with the string during representation. After assigning the lastpos( ) for all the leaf nodes, lastpos() of all interior nodes are computed. To compute this, nullable information of all interior and leaf nodes is necessary. For a node labeled with the Kleene closure operator, *, and child c1 the lastpos() is defined as
lastpos(*) = lastpos(c1) | (6.7) |
For interior nodes labeled “+” and “.” and with children c1 and c2, lastpos() is defined as follows:
lastpos(+) = lastpos(c1) U lastpos(c2) | (6.8) |
lastpos(.) = if (nullable(c2) == true) then | |
return (lastpos(c1) È lastpos(c2)) | |
else | |
return lastpos(c2) | (6.9) |
The difference between the firstpos() and the lastpos() in only for the sub-tree that involves the concatenation operator and for all other nodes, firspos() and lastpos() are same. This function is computed for all interior and leaf nodes and the result is stored in a list.
The summary of all the functions are listed in Table 6.1
based on the Kleene closure operator. In line 3, every element present in lastpos() of the left child of the concatenation node is considered and the followpos() value is updated according to the relationship given in line 4 of the algorithm. Similarly, line 6 of the algorithm, considers all the elements present in the lastpos of the * node and determines their followpos() based on the relation given in line 7 of the algorithm. As line 7 involves followpos() of * node, the followpos() of the concatenation operator node is first determined which is then used to identify the followpos() of the Kleene closure operator node.
6.2.6 DFA Construction
The construction algorithm is given in Algorithm 6.2. Initially, the firstpos (root) is considered. The integers representing in this firstpos( ) is considered as one state of the DFA. This is the starting state of the DFA. Line 1 and 2 of the algorithm, indicates, this procedure, where the variable Dstates is used to remember the states of the DFA. Line 3 initializes a while loop which generates new states based on the transition from the initial state. The start state is marked in line 4 to indicate that the state has been considered. Line 5 initiates for loop to define transitions from the start state on all input symbols. The new state will have state numbers which is determined as a union of the state numbers of the followpos() of the state numbers in the initiating state. The procedure is repeated till there are no more new states can be generated.
- s0 := firstpos(root) where root is the root of the syntax tree
- Dstates := {s0} and is unmarked
- while there is an unmarked state T in Dstates do
- mark T
- for each input symbol a Î S do
- let U be the set of positions that are in followpos(p)
- for some position p in T,
- such that the symbol at position p is ‘a’
- if U is not empty and not in Dstates then add U as an unmarked state to Dstates
end if
Dtran[T,a] := U
end do
end do
Algorithm 6.2 – DFA Construction
Example 6.1 : Construct a minimized DFA for the regular expression (a|b)*abb
In example 6.1, the regular expression is suffixed with # to form (a|b)*abb#. According to the construction algorithm, given in figures 6.1 (a –c), the syntax tree is constructed and is shown in figure 6.2
and lastpos() and is indicated in figure 6.3. Consider, the union operator node of the children “1” and “2”. The firstpos( | ) and the laspos( | ) is given by the union of the firstpos and lastpos of its children. Hence, it is given by {1,2}. The interior node, “.”, which is a parent of * and (3), the firstpos( . ) is computed as union of the firstpos() of its children since c1 being the * node is nullable. The computation of firstpos() of all the other interior nodes is computed as firstpos() of its left child c1 since, c1 is not nullable for all other interior nodes. On the other hand, the lastpos(.) will be computed based on c2. As the right child of all interior node is not nullable, the lastpos, is computed as just the lastpos (c2) only.
Then using this information, followpos() is computed. At the star node, the followpos(1) and followpos(2) will be the set {1,2} using line 7 of algorithm 6.1. Consider the parent of the * node and we add {3} to the followpos(1) and followpos(2) using line 4 of algorithm 6.1. The followpos() of all the nodes is tabulated in Table 6.2
Table 6.2 – Followpos() of figure 6.3
Node | Followpos(n) | |
1 | {1, 2, 3} | |
2 | {1,2,3} | |
3 | {4} | |
4 | {5} | |
5 | {6} | |
6 | Φ |
After computing the followpos(), the DFA is constructed by using root’s firstpos() as the start state. The root’s firstpos() is {1,2,3} and hence, this set is the start state. In this set, “1” and “3” corresponds to the input symbol “a”. Hence, to define the edge from this state on “a”, we need to consider the followpos(1) and followpos(3) which is {1,2,3} and {4} respectively. So, we define an edge from {1,2,3} to {1,2,3,4} on the input “a”. Similarly, the state “2” corresponds to input “b”. Hence, followpos(2) is to be considered for the edge on “b” from {1,2,3} which is the same state itself. Now, we have a new state {1,2,3,4}. From this state, “1” and “3” corresponds to “a” and hence, there is a self loop on this state on the input symbol “a”. For “b”, nodes 2, 4 corresponds and hence the union of their followpos() is {1,2,3,5} which is again a new state. The process is repeated and the resultant DFA is shown in figure 6.4. The final state corresponds to the states in which the number corresponding to # is available. In this example, the state number is represented by “6”. In the states of the DFA, “6” is contained in the set representing {1,2,3,6} and thus it corresponds to the final state of the DFA. Thus we end up with 4 states in the DFA as against 5 states which is the resultant of Thompson’s construction algorithm.