CS 340 - Lecture 3

Regular expressions, DFAs, and NFAs are equally expressive

It is possible to convert freely between regular expressions, deterministic finite automata, and nondeterministic finite automata

Given one, we can convert it to any of the other forms

Converting regular expression to NFA

In general, any regular expression X can be converted to an equivalent NFA called NFAX containing a single start state and a single accepting state

Case 1: sequence of symbols

output is sequence of states with transitions accepting those symbols

e.g., the regular expression abba yields the NFA

e.g., the regular expression ε yields the NFA
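Case 1 can be sketched in a few lines of Python. The NFA class, the names sequence and accepts, and the choice to represent ε as two states joined by a single ε-edge are all assumptions made for this illustration, not part of the lecture's notation.

```python
import itertools
from dataclasses import dataclass

EPS = None                   # label for an ε-transition
_fresh = itertools.count()   # hands out unused state numbers

@dataclass
class NFA:
    start: int
    accept: int
    edges: list              # (source, symbol or EPS, target) triples

def sequence(symbols: str) -> NFA:
    """Case 1: a chain of states, one transition per symbol."""
    start = next(_fresh)
    edges, last = [], start
    # Assumption: the empty sequence (ε) becomes one ε-edge between two states.
    for label in (symbols or [EPS]):
        nxt = next(_fresh)
        edges.append((last, label, nxt))
        last = nxt
    return NFA(start, last, edges)

def accepts(nfa: NFA, text: str) -> bool:
    """Simulate the NFA by tracking every state it could be in."""
    def closure(states):
        # Add everything reachable through ε-edges alone.
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for src, label, dst in nfa.edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        step = {dst for src, label, dst in nfa.edges
                if src in current and label == ch}
        current = closure(step)
    return nfa.accept in current

abba = sequence("abba")
print(len(abba.edges), accepts(abba, "abba"), accepts(abba, "abb"))  # 4 True False
```

The accepts helper is only there to check the result; it is not part of the construction itself.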

Case 2: disjunction

if A and B are regular expressions whose equivalent NFAs are NFAA and NFAB, then we can construct an NFA called NFAA|B that accepts the language generated by A|B as follows:

create start and accepting states of NFAA|B

use ε-transitions to connect the start state of NFAA|B to the start states of NFAA and NFAB

change start states of NFAA and NFAB so that they are no longer start states

use ε-transitions to connect accepting states of NFAA and NFAB to the accepting state of NFAA|B

change accepting states of NFAA and NFAB so that they are no longer accepting states

e.g., the NFA recognizing the language generated by abba|bab:
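The disjunction steps can be sketched as follows. This is a minimal illustration, assuming an NFA is stored as a start state, an accepting state, and a list of edges; the names NFA, sequence, union, and accepts are hypothetical.

```python
import itertools
from dataclasses import dataclass

EPS = None                   # label for an ε-transition
_fresh = itertools.count()   # hands out unused state numbers

@dataclass
class NFA:
    start: int
    accept: int
    edges: list              # (source, symbol or EPS, target) triples

def sequence(symbols: str) -> NFA:
    """Case 1 helper: a chain of states, one transition per symbol."""
    start = next(_fresh)
    edges, last = [], start
    for sym in symbols:
        nxt = next(_fresh)
        edges.append((last, sym, nxt))
        last = nxt
    return NFA(start, last, edges)

def union(a: NFA, b: NFA) -> NFA:
    """Case 2: NFA for A|B built from the NFAs for A and B."""
    start, accept = next(_fresh), next(_fresh)
    edges = a.edges + b.edges + [
        (start, EPS, a.start), (start, EPS, b.start),      # enter either operand
        (a.accept, EPS, accept), (b.accept, EPS, accept),  # leave either operand
    ]
    # Only the new start/accept are recorded, so the operands' old
    # start and accepting states lose their special status.
    return NFA(start, accept, edges)

def accepts(nfa: NFA, text: str) -> bool:
    """Simulate the NFA by tracking every state it could be in."""
    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for src, label, dst in nfa.edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current

either = union(sequence("abba"), sequence("bab"))
print(accepts(either, "abba"), accepts(either, "bab"), accepts(either, "ab"))
# True True False
```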

Case 3: repetition

If A is a regular expression whose equivalent NFA is NFAA, then we can construct an NFA called NFAA* which accepts the language generated by A* as follows:

create start and accepting states of NFAA*

create an ε-transition from start state to accepting state of NFAA*

create an ε-transition from accepting state to start state of NFAA*

create ε-transition from start state of NFAA* to start state of NFAA

change start state of NFAA so that it is not a start state

create ε-transitions from accepting state of NFAA to accepting state of NFAA*

change accepting state of NFAA so that it is not an accepting state

e.g., construct NFA that recognizes language generated by (abba|bab)*
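A sketch of the repetition construction, under the same assumed representation (start state, accepting state, edge list; all names hypothetical). To keep the block short it applies the construction to the single sequence bab rather than to the full (abba|bab)*.

```python
import itertools
from dataclasses import dataclass

EPS = None                   # label for an ε-transition
_fresh = itertools.count()   # hands out unused state numbers

@dataclass
class NFA:
    start: int
    accept: int
    edges: list              # (source, symbol or EPS, target) triples

def sequence(symbols: str) -> NFA:
    """Case 1 helper: a chain of states, one transition per symbol."""
    start = next(_fresh)
    edges, last = [], start
    for sym in symbols:
        nxt = next(_fresh)
        edges.append((last, sym, nxt))
        last = nxt
    return NFA(start, last, edges)

def star(a: NFA) -> NFA:
    """Case 3: NFA for A* built from the NFA for A."""
    start, accept = next(_fresh), next(_fresh)
    edges = a.edges + [
        (start, EPS, accept),    # zero repetitions
        (accept, EPS, start),    # go around again after one pass
        (start, EPS, a.start),   # enter the operand
        (a.accept, EPS, accept), # leave the operand
    ]
    return NFA(start, accept, edges)

def accepts(nfa: NFA, text: str) -> bool:
    """Simulate the NFA by tracking every state it could be in."""
    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for src, label, dst in nfa.edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current

repeated = star(sequence("bab"))
print(accepts(repeated, ""), accepts(repeated, "babbab"), accepts(repeated, "ba"))
# True True False
```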

Case 4: concatenation

if A and B are regular expressions whose equivalent NFAs are NFAA and NFAB, then we can construct an NFA called NFAAB that accepts the language generated by AB as follows:

create start and accepting states of NFAAB

create ε-transition from start state of NFAAB to start state of NFAA

change start state of NFAA so it is not a start state

create ε-transition from accepting state of NFAA to start state of NFAB

change accepting state of NFAA so it is not an accepting state

change start state of NFAB so it is not a start state

create ε-transition from accepting state of NFAB to accepting state of NFAAB

change accepting state of NFAB so it is not an accepting state

e.g.: construct NFA that recognizes (a|b)c

first part: (a|b)

second part: c

overall NFA: (a|b)c
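The (a|b)c example can be assembled in code by composing the disjunction and concatenation steps. This is a sketch under an assumed representation (start state, accepting state, edge list); NFA, sequence, union, concat, and accepts are hypothetical names.

```python
import itertools
from dataclasses import dataclass

EPS = None                   # label for an ε-transition
_fresh = itertools.count()   # hands out unused state numbers

@dataclass
class NFA:
    start: int
    accept: int
    edges: list              # (source, symbol or EPS, target) triples

def sequence(symbols: str) -> NFA:
    """Case 1 helper: a chain of states, one transition per symbol."""
    start = next(_fresh)
    edges, last = [], start
    for sym in symbols:
        nxt = next(_fresh)
        edges.append((last, sym, nxt))
        last = nxt
    return NFA(start, last, edges)

def union(a: NFA, b: NFA) -> NFA:
    """Case 2: NFA for A|B."""
    start, accept = next(_fresh), next(_fresh)
    return NFA(start, accept, a.edges + b.edges + [
        (start, EPS, a.start), (start, EPS, b.start),
        (a.accept, EPS, accept), (b.accept, EPS, accept)])

def concat(a: NFA, b: NFA) -> NFA:
    """Case 4: NFA for AB."""
    start, accept = next(_fresh), next(_fresh)
    return NFA(start, accept, a.edges + b.edges + [
        (start, EPS, a.start),     # new start leads into A
        (a.accept, EPS, b.start),  # finishing A leads into B
        (b.accept, EPS, accept)])  # finishing B leads to the new accept

def accepts(nfa: NFA, text: str) -> bool:
    """Simulate the NFA by tracking every state it could be in."""
    def closure(states):
        todo, seen = list(states), set(states)
        while todo:
            s = todo.pop()
            for src, label, dst in nfa.edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current

first = union(sequence("a"), sequence("b"))   # first part: (a|b)
second = sequence("c")                        # second part: c
overall = concat(first, second)               # overall NFA: (a|b)c
print(accepts(overall, "ac"), accepts(overall, "bc"), accepts(overall, "c"))
# True True False
```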

Converting an NFA to a DFA

Here is a sketch of the algorithm to convert an NFA into a DFA:

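Rule: in an NFA, if state q has an ε-transition to state r, then whenever the NFA is in q it may also be in r without consuming any input. For the purposes of the conversion, every state reachable from q through ε-transitions alone (including q itself) is treated as equivalent to q. Note that this follows the direction of the ε-transitions: r is reachable from q, but q is not necessarily reachable from r.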
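The set of states equivalent to a given state in this sense is usually called its ε-closure. A minimal sketch, assuming ε-transitions are stored in a dict from a state to the set of states it has ε-edges to (the name epsilon_closure and the sample edges are hypothetical):

```python
def epsilon_closure(states, eps_edges):
    """All states reachable from `states` using ε-transitions only."""
    closure = set(states)
    todo = list(states)
    while todo:
        s = todo.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return frozenset(closure)

# Hypothetical NFA fragment: 0 -ε-> 1, 0 -ε-> 2, 1 -ε-> 3
eps_edges = {0: {1, 2}, 1: {3}}
print(sorted(epsilon_closure({0}, eps_edges)))  # [0, 1, 2, 3]
print(sorted(epsilon_closure({1}, eps_edges)))  # [1, 3]
```

The second call shows the direction mattering: state 0 is not in the closure of state 1.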

Define a table mapping sets of NFA states to corresponding DFA states.

-- this function converts an NFA to a DFA
function Convert_NFA_To_DFA() {
    work list := new empty queue
    Start := set of NFA states equivalent to NFA start state
    enqueue Start on to work list
    while (work list is not empty) {
        dequeue a set of NFA states S from the work list
        if (S has not been processed yet) {
            mark S as processed
            D := Map_NFA_States_To_DFA_State(S)
            for each symbol Y in alphabet {
                T := set of NFA states reachable from any state in S on Y, plus all states equivalent to those states
                E := Map_NFA_States_To_DFA_State(T)
                create DFA transition from D to E on symbol Y
                enqueue T on to the work list
            }
        }
    }
    mark the first DFA state created as the DFA start state
}

-- this function returns the DFA state corresponding to a set of NFA states, creating the DFA state if necessary
function Map_NFA_States_To_DFA_State(U) {
    if (table contains entry for U) {
        return the DFA state in table corresponding to U
    }
    create new DFA state F in table corresponding to U
    if (U contains an NFA accepting state) {
        make F an accepting state
    }
    return F
}

Example: convert the NFA produced by translating the regular expression (aa|ab)* into a DFA.

Input NFA:

Output DFA:

Table of NFA state sets to DFA states:

NFA state set      DFA state
{0, 1, 4, 7}       0
{2, 5}             1
{0, 1, 3, 4, 7}    2
{0, 1, 4, 6, 7}    3
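The conversion can be sketched in Python and checked against the table. The NFA below is an assumption: the input figure is not reproduced here, so the state numbering was chosen to be consistent with the table (state 0 is the start, state 7 is accepting, 1-2-3 spells aa, 4-5-6 spells ab). The sketch also departs from the pseudocode in one respect: it skips empty target sets instead of creating an explicit dead state, which is why the table has only four rows.

```python
from collections import deque

def epsilon_closure(states, eps):
    """All NFA states reachable from `states` via ε-transitions only."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return frozenset(closure)

def nfa_to_dfa(start, accepting, eps, delta, alphabet):
    table = {}            # frozenset of NFA states -> DFA state number
    transitions = {}      # (DFA state, symbol) -> DFA state
    dfa_accepting = set()

    def dfa_state(nfa_states):
        # Look up the DFA state for this set, creating it if necessary.
        if nfa_states not in table:
            table[nfa_states] = len(table)
            if nfa_states & accepting:
                dfa_accepting.add(table[nfa_states])
        return table[nfa_states]

    work, processed = deque([epsilon_closure({start}, eps)]), set()
    while work:
        s = work.popleft()
        if s in processed:
            continue
        processed.add(s)
        d = dfa_state(s)
        for sym in alphabet:
            moved = {t for q in s for t in delta.get((q, sym), ())}
            target = epsilon_closure(moved, eps)
            if not target:
                continue  # deviation from the pseudocode: no dead state
            transitions[(d, sym)] = dfa_state(target)
            work.append(target)
    return table, transitions, dfa_accepting

# Assumed NFA for (aa|ab)*, numbered to agree with the table.
eps = {0: {1, 4, 7}, 3: {0}, 6: {0}}
delta = {(1, "a"): {2}, (2, "a"): {3}, (4, "a"): {5}, (5, "b"): {6}}
table, transitions, dfa_accepting = nfa_to_dfa(0, {7}, eps, delta, "ab")
for nfa_states, d in table.items():
    print(sorted(nfa_states), d)
```

Under these assumptions, DFA states 0, 2, and 3 come out accepting because their NFA state sets contain state 7.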

Converting a DFA to a regular expression

The algorithm to convert a DFA to a regular expression is left as an exercise for the reader :-)

The equivalence cycle

Given the existence of all 3 algorithms (regexp -> NFA, NFA -> DFA, DFA -> regexp), any one of the three forms can be converted into any other by following the cycle, so regular expressions, DFAs, and NFAs are equivalent.

Why all of this matters

Ok, you say, we have 3 equivalent formalisms for describing regular languages.  What's the big deal?

It turns out that the algorithms described in this lecture have great practical importance in the implementation of programming languages.

The rules describing how to form the tokens of a programming language are, for nearly all well-known programming languages, a regular language.  For example, in C, an identifier token must begin with a letter or an underscore, after which we can have any sequence of letters, digits, or underscore characters.

Let us consider an extended regular expression syntax where

[A-Z] means any capital letter

[a-z] means any lower case letter

[0-9] means any digit

So, a regular expression describing C identifiers is

([A-Z]|[a-z]|_)([A-Z]|[a-z]|[0-9]|_)*
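As a quick check, the identifier rule can be tried out with Python's re module. This sketch writes the expression with merged character classes and allows a leading underscore, which C permits; the name is_c_identifier is hypothetical, and keywords such as while are not excluded.

```python
import re

# Letter or underscore first, then any mix of letters, digits, underscores.
C_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_c_identifier(text: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return C_IDENTIFIER.fullmatch(text) is not None

print(is_c_identifier("count_2"), is_c_identifier("_tmp"), is_c_identifier("2count"))
# True True False
```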

Similar regular expressions can be constructed for other kinds of tokens, such as numeric literals, string literals, etc. Regular expressions are a very convenient format for specifying the lexical structure of a language.

Once we have defined a regular expression for each kind of token, we can combine all of the regular expressions using disjunction into a single regular expression that generates all of the tokens of the language.
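Combining per-token expressions with disjunction can be illustrated with Python's re module. The token kinds and patterns below are invented for the example and do not describe any real language. Note also that Python's re engine is a backtracking matcher rather than a DFA, so this shows only the specification side of the idea.

```python
import re

# Hypothetical token kinds, each with its own regular expression.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP",     r"[-+*/=]"),
    ("SKIP",   r"[ \t]+"),
]

# One big disjunction; named groups record which alternative matched.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":          # drop whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("x1 = y + 42"))
# [('IDENT', 'x1'), ('OP', '='), ('IDENT', 'y'), ('OP', '+'), ('NUMBER', '42')]
```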

An implementation of a programming language must take a source program as input and translate it into executable form.  The first phase of this process, performed by a component called the scanner, takes the sequence of characters in the source program and turns them into a sequence of tokens.

We can simplify this part of the programming language implementation using a tool called a scanner generator.  A scanner generator allows the language designer to specify the legal tokens of the language using regular expressions.  Then, the scanner generator translates the regular expressions into an NFA, which is further translated into a DFA.  DFAs have the nice characteristic that they can be implemented as a table-driven state machine which can process any sequence of characters quickly while using a finite amount of memory.  The scanner generator creates source code for a table-driven DFA which recognizes the language described by the original regular expressions.
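To give a feel for what a table-driven DFA looks like, here is a hand-written sketch for C-style identifiers. It is not the code that flex or any other scanner generator produces; the table layout, the character classes, and the function names are all assumptions made for the illustration.

```python
def char_class(ch: str) -> str:
    """Collapse the alphabet into a few classes to keep the table small."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"

# Transition table: (state, character class) -> next state.
# A missing entry means the DFA rejects.
TABLE = {
    (0, "letter"): 1,   # first character: letter or underscore
    (1, "letter"): 1,   # later characters: letters, underscores ...
    (1, "digit"): 1,    # ... or digits
}
ACCEPTING = {1}

def matches(text: str) -> bool:
    state = 0
    for ch in text:     # one table lookup per input character
        state = TABLE.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

print(matches("count_2"), matches("2count"))  # True False
```

The loop does constant work per character and stores nothing beyond the current state, which is the property that makes DFAs attractive for scanners.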

One popular scanner generator which creates C/C++ source code for fast scanners is flex:

http://flex.sourceforge.net