4 Finite automata –DFA
The primary objective of this module is to define patterns using Deterministic Finite Automata (DFA) and Non-deterministic Finite automata (NFA). As a first step, this module also discusses the procedure to convert a regular expression to ε-NFA.
4.1 Transition Diagram
A transition diagram is a stylized flowchart or a directed graph that depicts the actions that take place when a lexical analyser is called by the parser to get the next token. It has been already discussed in the modules 1 and 3. Lexical analyser and the syntax analyser works together to carry out the analysis phase of the compiler. The parser issues a “gettoken” command while a lexer “gives” a token to the parser. This token is given by matching a longest matching pattern, where the pattern is specified by the Finite Automata or Regular expression. Figure 4.1 shows simple automata for the pattern “ > =”.
In figure 4.1, there is a designated special state called start state which is numbered as “0”. The other states are “1”, “2”, and “3”. In these states, “2” and “3” have been indicated with a double circle, which indicates a final or accepting state. Any path that takes from the start state to the final state is considered as a valid path and the sequence of edges constituting these edges qualify for the strings of the automata. In figure 4.1, the valid strings are “>=” and “>”. For the string “>”, other values that follow “>” is ignored and a valid prefix is chosen. At state 2, the lexer returns, <relop, GE>, indicating that the lexeme is “>=” which is “Greater than or Equal to” (GE) and the corresponding token associated is “relop”, indicating “relational operator”. The diagram of figure 4.1 corresponds to a finite automata (FA) and can be stated formally as ({0, 1, 2, 3}, {>, <, =”, δ, 0, {2,3}), where the transition function is indicated in figure 4.1. Figure 4.2 specifies a FA for identifiers in the Pascal language, where “letter” indicates any alphabet from A-Z or a-z and “digit” indicates any number between 0 and 9 and it returns the token “identifier” and the actual string encountered as “lexeme”.
4.1.1 Deterministic Finite Automata
As can be seen from figures 4.1 and 4.2 the automata are deterministic as there is at most one transition on an input symbol from every state. Formally stating, a FA, by default is a Deterministic one and a five tuple representation (Q, ∑ , δ, q0, F), q0 belongs to Q and F is a subset of Q, δ is a mapping from Q x ∑ to Q. Every string has exactly one path and hence it is faster for string matching. In a DFA, no state has an e-transition that is the DFA cannot change state without consuming any input symbol. A DFA accepts an input string x if and only if there is some path in the transition graph from start state to some accepting state.
Let us consider one more example. Let us construct a DFA M which can accept the strings which begin with ‘a’ or ‘b’, or begin with ‘c’ and contain at most one ‘a’.
Constructing the DFA involves, drawing the directed graph which is nothing but the transition diagram. The string can begin with a, b, c. Hence, have an edge, from state ‘0’ on, a, b, c to final states 2, 1. If the string begins with b, c, then there is no restriction on the sequence of characters that follow. Hence, transition on a, b, c from state ‘1’ can stay in state ‘1’ itself. On the other hand, from state 2, there should be exactly one edge on ‘a’, while b, c from the states 2, 3, can stay there itself. For this representation, a regular expression would be easier to write. The regular expression corresponding to this is given below
(a+b)(a+b+c)* + c(b+c)*a(b+c)* | (4.1) |
However, for string matching and a better understanding, a graphical representation of the automata is preferred. Hence, we convert one representation to the other. The details would be discussed in subsequent sections after the discussion of NFA and ε-NFA.
4.1.2 Non-deterministic finite Automata
The Non-deterministic finite automaton (NFA) is similar to deterministic finite automata but it gives more flexibility. It is a five tuple representation (Q, ∑ , δ, q0, F), q0 belongs to Q and F is a subset of Q and δ is a mapping from Q x ∑ to 2Q . An example NFA is given below:
From figure 4.4 it can be seen, that from state 0, there are two edges on the input ‘a’ and ‘b’. Hence, to check whether a string that begins with “a” belongs to this NFA, multiple paths have to be verified and even if anyone path includes final state, it can be understood that the string belongs to the automata. Due to the necessity to check multiple possible paths, a NFA is slower in string matching.
In an NFA, there can be more than one transition defined from a single state for an input symbol. However, in a DFA there can be at most one transition from a single state on an input symbol.
4.1.3 Non-Deterministic Finite automata with ε
This representation is same as NFA but is still more flexible to change state without consuming any input symbol. The difference between this NFA and NFA without ϵ, is that the transition function δ is a mapping from Q x ∑ U {ε} to 2Q
Hence, to check whether a string belongs to this NFA, we need to find the ε -Closure(r) if r is a state reached in one of the paths. ϵ -Closure(r) is defined as the set of states including ‘r’ that can be reached from state ‘r’ on ϵ transitions only. Since, there is a necessity to compute ε -Closure, this automata is still slower than NFA for string matching. An example of a ε -NFA is given in Figure 4.5. From figure 4.5, it can be seen that from state 1 there is a transition on ϵ. Furthermore it can be stated that ε-Closure(1) = {1, 3, 2} and ε–Closure(3) = {3, 2} as an example for ε-Closure.
4.2 Regular Expression and Finite Automata
A NFA accepts an input string ‘x’ if and only if there is some path in the transition graph from start state to some accepting state. A path can be represented by a sequence of state transitions called “moves”. As already discussed in this module, for some patterns defining a regular expression is easier and for some other an automaton is easier to construct. However, depending on the application, one may prefer a regular expression or an automata. Hence, procedures need to be defined to convert one representation to the other. As we have already discussed, a DFA is quicker for string matching and so is a regular expression. On the other hand, constructing a DFA or regular expression is difficult while constructing a NFA or ε-NFA is easier.
An application would prefer a DFA for its representation while the input could be available as a regular expression. A regular expression could be converted to a DFA using any one of the following procedures
1. Convert to an ε-NFA, convert this to a NFA and convert the NFA to DFA using the subset construction algorithm
- Convert to DFA based on the syntax tree procedure.
If the Thompson’s subset construction algorithm as indicated in (1) is preferred, then the constructed DFA need to be minimized. However, the procedure followed by (2) gives a minimized DFA.
4.3 Algorithm to convert regular expression to ϵ-NFA
The algorithm is based on the assumption that for every regular expression ‘r’ there exists an automaton. The first consideration in constructing the automata from regular expression is in prioritizing the operators. Kleene Closure operator “*” has the highest precedence followed by the concatenation operator “.” followed by the union operator “+” or “|”.
Figures 4.6 a, b and c give the automata for the simple regular expression without involving any regular expression operators.
Figure 4.6 a, is an automata corresponding to the regular expression “a”. The regular expression “a” corresponds to only one string “a”. This is given as an automaton with two states and one edge labeled “a” that goes from start to final state. Figure 4.6 b is an automata corresponding to the regular expression “ε”. The difference between figures 4.6 a and 4.6b is just that the edge is labeled ‘ε’ instead of ‘a’. Figure 4.6c corresponds to automata for the Φ string. There is no edge from the start to the final state and hence no path exists from start to the final state. Thus, this automaton doesn’t accept any string. The constraint that is posed in the construction is that there is exactly one final state.
A regular expression is formed using the base expression as an alphabet and connecting them using regular expression operators. Base expression is what has been discussed in Figure 4.6 (a-c). A regular expression can be formed using anyone of the operators, +, ., * which indicates union, concatenation or kleene closure. If R1, R2 are base regular expressions, another regular expression could be formed in one of the following ways:
R3 | = R1+R2 | (4.2) |
R4 | = R1 . R2 | (4.3) |
R5 | = R1* | (4.4) |
Equation (4.2) indicates a new regular expression R3 formed by performing union of two regular expressions. The construction of the ϵ-NFA is given in figure 4.7
Let M1 and M2 be the automata corresponding to R1 and R2 respectively. A new automaton with ε transitions is constructed by adding two new states, one start and one final state. ε transitions are defined from the new start state to the existing automata M1, M2’s start state. ε transitions are also defined from the final states of M1 and M2 to the new final state. Thus, if a string belongs to the expression R1, then it follows the path through M1 by suffixing and prefixing the string with ε. On the other hand if a string corresponds to regular expression R2, then it follows the path of M2. Thus this new NFA accepts strings belonging to R1 and R2, which is the implication of the union operator. To construct automata for other expressions, a-NFA Equation 4.3 indicates the concatenation operator, where regular expression R4 corresponds to the strings of R1 followed by R2. The corresponding automata is given in figure 4.8
Summary:
This module discussed the concepts of DFA, NFA and ε-NFA and their ways of defining patterns for string matching. However, conversion from regular expression to automata may be required, and this module focused on the first step in converting regular expression to ϵ-NFA.