Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Jamie Zawinski

In a previous job I did a fair amount of mentoring. One piece I would use any excuse to bring out was a discussion of the theory behind regular expressions. It wasn’t just a digression - I found regular expressions much easier to read once I understood how they worked.

But before we get to regular expressions we need to do some automata theory.

## Deterministic Finite Automata

Deterministic Finite Automata are a simple model of computation. A DFA is so called because it is comprised of a finite number of states, with deterministic transitions between them.

A DFA begins in a designated start state, and attempts to process some input (often a sequence of characters). If the DFA is in a designated final state upon reaching the end of the input, the DFA is said to match the input.

Here is an example DFA: The start state is represented by the arrow at the left hand side (state A), and the final state is represented by a double circle (state F). This DFA has two paths. The bottom path (A, D, E, F) matches the string 101 only. The top path (A, B, C, F) matches the string 00, followed by zero or more 1s, followed by a 0.

DFAs can also be represented as state tables:

State 0 1
A B D
B C
C F C
D E
E F

In order to be deterministic a DFA may only have a single state transition for a given input in any particular state. If we relax this requirement, we get a Nondeterministic Finite Automaton.

## Nondeterministic Finite Automata

An NFA is an extension of a DFA, with the following two additional rules:

• The requirement for unique state transitions for a given input is removed.
• There may be state transitions which match an empty input (hereafter denoted 𝜀).

It follows from the definition that any DFA is also an NFA - it simply has no nondeterministic states or 𝜀 moves.

Here is an example NFA: This NFA matches any of the following:

• The string 00 (A, C, F, I, J)
• The string 11 (A, C, G, I, J)
• A 0 followed by at least two 1s (A, B, E, J)
• The string 10 followed by zero or more 1s (A, D, H, J)
• The string 0001 [C Montgomery Burns!] (A, C, F, I, E, J)
• The string 1101 (A, C, G, I, E, J)
• The string 000 followed by zero or more 1s (A, C, F, I, H, J)
• The string 110 followed by zero or more 1s (A, C, G, I, H, J)

It turns out that these changes do not increase power of the system - for any NFA there exists an equivalent DFA. This can be found by constructing a DFA where each state represents the set of possible states that can be reached in the NFA.

As an example, we shall apply this process to the above NFA: From the start state A, we can read both 0 and 1. After reading a 0, we will be in either of states B or F, so the set {B, F} forms a state in the DFA. From BF, we can read a 0 and end up in either of states I or J. J is a final state, so the state IJ is also a final state.

Starting from A and reading a 1 will take us to the new state DG, and from there a 1 will take us to IJ, which we’ve seen before. From B or F, reading a 1 will take us from B to B reflexively, or from B to E. From DG, reading a 0 will take us to H or J, which forms a second final state. From BE, reading a 1 will take us either to B reflexively, from B to E, or from E to J, forming the new final state BEJ. Since BE is a subset of BEJ, reading a 1 at BEJ forms a reflexive transition.

HJ also has a reflexive transition when reading a 1, due to the reflexive transition on H. From IJ we can read a 0 and end up in the final state EHJ. From those reading a 1 leaves us in HJ, which we’ve seen before. This covers all the transitions in the original NFA, so we’re done. Verifying that this DFA matches the same inputs as the NFA is left as an exercise for the reader.

“This is all well and good” I hear you say, “but what does this have to do with Regular Expressions”. We’re getting to that.

## Regular Expressions

Regular Expressions are a way of representing strings that are members of a Regular Language (a Regular Language being the set of all strings matched by a given Regular Expression).

The key limitation of Regular Language is that Regular Expressions cannot count - it is not possible to represent (for instance) balanced parentheses in a Regular Language.

There are 5 elementary Regular Expressions - two simple expressions and three compound expressions (parentheses can also be used to represent precedence).

1. 𝜀, matching nothing.
2. a, matching a single character.
3. Concatenation, AB, representing Regular Expression A followed by Regular Expression B.
4. Union, A|B or sometimes A+B, representing either Regular Expression A or Regular Expression B.
5. The Kleene Star, A*, representing Regular Expression A 0 or more times.

This is enough to build any Regular Expression. Other operators provided by most programming languages are just syntactic sugar. For example:

• Character groups [A-Z] are just abbreviated unions.
• Optional matches, A?, are equivalent to A|𝜀.
• At least one matches, A+, are equivalent to AA*.
• Fixed count matches, such as A{n}, are equivalent to A repeated n times.

It turns out that the five elementary Regular Expressions can all be represented as NFAs. The simple expressions consist of two states joined by either 𝜀 or the character in question. The compound expressions are more interesting. The ovals containing Expression represent other Regular Expressions, with their start and final states connected via 𝜀 transitions to states in the compound expression.

As an example we’ll use the Regular Expression `((011|10)1*)|(00|11)(01*|𝜀)`. You should be able to convince yourself that this Regular Expression matches the same inputs as the NFA above. But we can also algorithmically compile it into a combination of the five elementary expressions: The Regular Expression is a union of two parts. The first (the top line in the diagram) contains a further union, concatenated with a Kleene Star expression. The second union contains a concatenation of two unions.

Since this is an NFA, it can be compiled into a DFA: Since this is not the same as our first compilation result, but matches the same input, we’ve also shown that two different NFAs can be equivalent.

When you compile a Regular Expression (such as with Java’s `Pattern.compile`), a process like this is what occurs. And since the DFA can be represented as a state table, once compiled, they are an efficient way to match a string. This also explains why Regular Expressions cannot count - they are equivalent to a Finite State Machine and are incapable of unbounded counting.

I find this to be a good base for understanding Regular Expressions. Having this understanding of why Regular Expressions are the way they are, their limitations and why they have those limitations, makes it easier to understand when they are appropriate to use and how to use them effectively.