Writing a Regular Expression Engine in Scala - Part 1

Having explained the theory behind Regular Expressions (see here), let’s implement a Regular Expression engine in Scala.

The code can be found on Github in repo [1].

Why Scala

Of all the programming languages I have tried, I think Scala is my favourite. It favours immutable objects, supports functional styles of programming, and has an expressive type system which reduces the difference between the set of compilable programs and the set of working programs.

Scala has several features which will be useful when writing the Regular Expression engine, which will be pointed out as we go.

Building an NFA

We begin by writing a class to represent an NFA.

NFA Representation

Recall that an NFA consists of a set of states, and a set of transitions between them. An obvious implementation might represent a state as a Map[Character, Set[State]], with a flag to mark final states.

The problem with this approach is that it incompatible with immutable objects. Once a state has been created, it cannot be modified - we can only create a new state. This means that when a transition is added to a state, we must recursively modify all the states which point to said state. If the NFA contains cycles, this process repeats forever. The solution to this is to have the NFA itself own the transitions.

Thus an NFA will consist of a start State, a set of final states, and a map of State -> Character -> Set[State]. There’s a slight complication: we also need to be able to match the empty string, 𝜀.

In Scala, this can be represented succinctly.

sealed trait Matched

case object Epsilon extends Matched
case class MatchedCharacter(c: Char) extends Matched

case class Nfa(startState: State, finalStates: Set[State], transitions: Map[State, Map[Matched, Set[State]]]) {
}

Constructing an NFA

This is a good start, but constructing NFAs like this would be tedious, illegible, and error prone. To mitigate this we can use a companion object with some static factories for the five base NFAs (a companion object is a singleton object with the same name as a class).

object Nfa {

}

We start by writing a factory method for the single character case.

val EMPTY: Nfa = character(Epsilon)

def character(c: Matched): Nfa = {
  val startState = State()
  val finalState = State()

  Nfa(
    startState,
    Set(finalState),
    Map(
      startState -> Map(c -> Set(finalState))
    )
  )
}

This describes the single character NFA as an NFA with a start state, a single final state, and a transition from one to the other. As a small optimisation, since Epsilon is a singleton, we can cache a reference to the empty Nfa. This is one of the benefits of immutable state.

This leaves the three compound cases: concatenation, union, and the Kleene star.

type TransitionMap = Map[State, Map[Matched, Set[State]]]

def concatenation(n1: Nfa, n2: Nfa): Nfa = {
  val startState = new State()
  val joinState = new State()
  val finalState = new State()

  Nfa(
    startState,
    Set(finalState),
    Map(
      startState -> Map(Epsilon -> Set(n1.startState)),
      joinState -> Map(Epsilon -> Set(n2.startState))
    ).asInstanceOf[TransitionMap]
      ++ n1.finalStates.map(s => (s, Map(Epsilon.asInstanceOf[Matched] -> Set(joinState))))
      ++ n2.finalStates.map(s => (s, Map(Epsilon.asInstanceOf[Matched] -> Set(finalState))))
      ++ n1.transitions
      ++ n2.transitions
  )
}

def union(n1: Nfa, n2: Nfa): Nfa = {
  val startState = new State()
  val finalState = new State()

  Nfa(
    startState,
    Set(finalState),
    Map(
      startState -> Map(Epsilon -> Set(n1.startState, n2.startState))
    ).asInstanceOf[TransitionMap]
      ++ n1.finalStates.map(s => (s, Map(Epsilon.asInstanceOf[Matched] -> Set(finalState))))
      ++ n2.finalStates.map(s => (s, Map(Epsilon.asInstanceOf[Matched] -> Set(finalState))))
      ++ n1.transitions
      ++ n2.transitions
  )
}

def star(n: Nfa): Nfa = {
  val startState = new State()
  val finalState = new State()
  val startLoop = new State()
  val endLoop = new State()

  Nfa(
    startState,
    Set(finalState),
    Map(
      startState -> Map(Epsilon -> Set(startLoop, finalState)),
      startLoop -> Map(Epsilon -> Set(n.startState)),
      endLoop -> Map(Epsilon -> Set(startLoop, finalState))
    ).asInstanceOf[TransitionMap]
      ++ n.finalStates.map(s => (s, Map(Epsilon.asInstanceOf[Matched] -> Set(endLoop))))
      ++ n.transitions
  )
}

These work pretty much as you’d expect. We create the required additional states, and add the required 𝜀 transitions to the transitions of the existing NFAs. Irritatingly we need typecasts to stop the compiler inferring that the transition maps only use Epsilon as a key, which prevents the combination of the maps.

This gives us all the tools we need to construct an NFA that is equivalent to any Regular Expression.

Instance Methods

While functionally this is all we need, it’s still pretty clunky. We can use a Scala feature to help with this.

Scala doesn’t have operators. It has types that have methods with names like +, - and ::. We can add methods with these sorts of names to our own types.

While operators can be overused (the Dispatch library was a classic example of this), in cases where we are trying to code externally defined operators with Scala code they can be quite helpful.

In this case, the union and Kleene star have standard notation (| and * respectively). We can add methods with these names to the Nfa class.

def |(other: Nfa): Nfa = Nfa.union(this, other)
def * : Nfa = Nfa.star(this)

val x: Nfa = n|m
val y: Nfa = n*

(Yes, I’m aware concatenation is missing. I’m getting to that.)

Concatenation

One approach to concatenation is to allow instances of the Nfa type to be constructed from Strings. A String is a sequence of characters, which we can take to be the Nfa that matches the sequence. This also generalises nicely to the empty string, which can be taken as the Nfa matching 𝜀.

implicit def string2Nfa(s: String): Nfa = {
  if (s.isEmpty) Nfa.EMPTY
  else s.map(c => Nfa.character(MatchedCharacter(c))).reduce(Nfa.concatenation)
}

This also allows us to demonstrate another Scala feature: implicits. If we have an instance of a type A, we required a type B, and we have an implicit method a2B of type A => B, the compiler will allow us to use instances of type A as if they were instances of type B.

Used improperly, implicits can make code confusing by making it difficult to see where a conversion is happening. Determining if this is one of those cases is left as an exercise for the reader.

We can now write code like this:

val x: Nfa = "abc"|"xyz"*

There’s just one more trick we need.

Application

It turns out Scala has a trick we can use to implement concatenation as an instance method.

Scala allows any type that implements an apply method to be called as a function.

def apply(other: Nfa): Nfa = Nfa.concatenation(this, other)

Which means the following is working Scala code:

val nfa: Nfa = ("011"|"10")("1"*)|("00"|"11")("0"("1"*)|"")

It may have a few more parentheses than one would use to write it in the standard library, but it’s otherwise a pretty natural way of writing it.

Conclusion

We’ve implemented the beginning of a regular expression engine, by creating an NFA class and a concise way of instantiating it. In Part 2, we’ll look at implementing compilation into a DFA.