Writing a Regular Expression Engine in Scala - Part 2

In part 1 we implemented an NFA class in Scala. In this part, we will implement compilation into a DFA, and the ability to use the DFA to match strings.

The code can be found on Github in repo [1].

Compiling an NFA into a DFA

Recall that the compilation algorithm works as follows:

From the start state, find all the states reachable by consuming some input. Create a new state in the DFA representing this set.
Repeat this process for each state created, until no new sets are generated.

DFA Representation

Recall that a DFA is deterministic; it may not contain 𝜀 transitions or multiple choices when consuming a particular character in a particular state. This means that the Dfa class is simpler than its NFA counterpart.

type TransitionMap = Map[State, Map[Char, State]]

case class Dfa(startState: State, finalStates: Set[State], transitions: TransitionMap) {}

Implementing compilation

Since we have a recursive algorithm, we will implement compilation recursively. The compile method itself takes no arguments. Internally, it has two key helper functions:

findReachableStates, which given a set of states in the NFA finds all the states reachable under an 𝜀 transition
constructDfa, which takes a set of NFA states and adds them into a given DFA.

compile then calls constructDfa with the start state of the NFA, a DFA consisting of a single start state, and a mapping that marks the two states as equivalent.

val nfaStartStates = Set(startState)
val dfaStartState = new State(nfaStartStates.map(_.label))

constructDfa(
  nfaStartStates,
  Dfa(dfaStartState, Set(), Map()),
  Map(nfaStartStates -> dfaStartState))

findReachableStates

The findReachableStates helper function is straightforward. For each state, we find all the transitions, and then subsequently find all the 𝜀 transitions, and flatten the result. This can be represented concisely as a flatMap operation. We then remove any states which have already been tried, and then recurse.

This can be implemented tail recursively. Tail recursion (recursion as the final operation of a recursive function) is equivalent to a while loop, and this is an optimisation the Scala compiler can perform.

@tailrec
def findReachableStates(states: Set[State], visited: Set[State] = Set(), reachableStates: Set[State] = Set()): Set[State] = {
  if (states.isEmpty) reachableStates
  else {
    val reachable = states.flatMap(s => transitions.getOrElse(s, Map()).getOrElse(Epsilon, Set())).diff(visited)

    findReachableStates(reachable, visited ++ states, states ++ reachable ++ reachableStates)
  }
}

constructDfa

Having worked out the reachable states, we combine all their non-𝜀 transitions together, and for each target set of states, create a new state, throwing out any transitions for which we already have transitions from a previous iteration. We use Scala’s pattern matching to concisely destructure the map pairings.

val nfaTransitions: Map[MatchedCharacter, Set[State]] = reachableStates
  .map(s => transitions.getOrElse(s, Map()).filter({
    case (matched, _) => matched != Epsilon
  }).asInstanceOf[Map[MatchedCharacter, Set[State]]])
  .fold(Map())((map1: Map[MatchedCharacter, Set[State]], map2: Map[MatchedCharacter, Set[State]]) => {
    val keys1 = map1.keySet
    val keys2 = map2.keySet

    val intersection = keys1.intersect(keys2)
    val only1 = keys1.diff(keys2)
    val only2 = keys2.diff(keys1)

    (Map()
      ++ only1.map(key => (key, map1(key)))
      ++ only2.map(key => (key, map2(key)))
      ++ intersection.map(key => (key, map1(key) ++ map2(key))))
  })

val newDfaStates = nfaTransitions.map({
  case (_, states) => (states, new State(states.map(_.label)))
}) -- stateMap.keys

val newTransitions: Map[Char, State] = nfaTransitions.map({
  case (matched, states) => (matched.c, newDfaStates.getOrElse(states, stateMap(states)))
})

Then we can create a new DFA. This is a copy of the current DFA, with the new transitions added and the set of final states updated (if a final state is reachable via 𝜀 transitions from the current set of states, they are a final state in the DFA). Because everything is immutable, a shallow copy is sufficient.

If there were no new states produced, we’re done. Otherwise, we need to repeat this procedure for the newly generated states, before combining all of the transition maps together. Note that while this may produce several instances of the same state, this doesn’t actually matter as they all work as keys in the transition map regardless.

if (newDfaStates.isEmpty) updatedDfa
else newDfaStates
  .map({
    case (states, _) => constructDfa(
      states,
      updatedDfa,
      stateMap ++ newDfaStates
    )
  })
  .fold(dfa)((dfa1, dfa2) => {
    dfa1.copy(
      finalStates = dfa1.finalStates ++ dfa2.finalStates,
      transitions = {
        val intersection = dfa1.transitions.keySet.intersect(dfa2.transitions.keySet)

        val intersectionMap = intersection.map(state => {
          state -> (dfa1.transitions(state) ++ dfa2.transitions(state))
        })

        dfa1.transitions ++ dfa2.transitions ++ intersectionMap
      }
    )
  })

We now have a Dfa instance which corresponds to the Nfa. Now we can implement matching.

Implementing matching

Given a Dfa and a String to match, we want to return a sequence of all the possible matches. The matching algorithm will be greedy; in the event that the DFA may match the string in multiple ways, we want to return the longest match. Once we’ve found a match, the next search should commence from the end of the previous match.

This basically reduces down to a number of cases:

If the input is empty, return a sequence containing the last match if it exists, or the empty sequence if it does not
If there is input
- If the input matches and the DFA is in a final state, the the result becomes a possible match (we do not add it to the sequence yet as we are being greedy). If it is not a final state, we simply append the character to the current match.
- If the input does not match, we append the last known match to the sequence (if it exists) and backtrack to the end of the last known match before proceeding from the start state.

This algorithm maps nicely to Scala code:

case class Dfa(startState: State, finalStates: Set[State], transitions: TransitionMap) {
  def matches(s: String): Stream[String] = {
    def helper(s: Vector[Char], state: State, lastResult: Option[String], acc: Vector[Char], next: Vector[Char]): Stream[String] = s match {
      // If there is no more input, return the last result if we have one
      case Vector() if lastResult.isDefined => Stream(lastResult.get)
      case Vector() => Stream()
      case h +: t => transitions.getOrElse(state, Map()).get(h) match {
        // If the next state is final, record the result
        case Some(nextState) if finalStates.contains(nextState) => helper(t, nextState, Some((acc :+ h).mkString("")), acc :+ h, t)
        // If the next state is not final, build the accumulator
        case Some(nextState) => helper(t, nextState, lastResult, acc :+ h, next)
        // If there is no next state, return a Stream containing the last result if we have one
        case None if lastResult.isDefined => lastResult.get #:: helper(next, startState, None, Vector(), Vector())
        case None => helper(t, startState, None, Vector(), Vector())
      }
    }

    helper(s.toVector, startState, None, Vector(), Vector())
  }
}

This is a recursive function with an accumulator for the current match.

We have elected to use a Stream as the sequence type. This is a lazy collection; its elements are not computed until they are needed. This means that we need not pay the cost of finding all the matches if we’re only interested in the first one.

Conclusion

We’ve implemented a simple Regular Expression engine in Scala. While lacking in most of the conveniences of a modern Regular Expression engine, we are still able to create a DFA for any regular language.