In part 1 we implemented an NFA class in Scala. In this part, we will implement compilation into a DFA, and the ability to use the DFA to match strings.
The code can be found on Github in repo [1].
Compiling an NFA into a DFA
Recall that the compilation algorithm works as follows:
- From the start state, find all the states reachable by consuming some input. Create a new state in the DFA representing this set.
- Repeat this process for each state created, until no new sets are generated.
DFA Representation
Recall that a DFA is deterministic; it may not contain 𝜀 transitions or multiple choices when consuming a particular character in a particular state. This means that the Dfa
class is simpler than its NFA counterpart.
Implementing compilation
Since we have a recursive algorithm, we will implement compilation recursively. The compile
method itself takes no arguments. Internally, it has two key helper functions:
findReachableStates
, which given a set of states in the NFA finds all the states reachable under an 𝜀 transitionconstructDfa
, which takes a set of NFA states and adds them into a given DFA.
compile
then calls constructDfa
with the start state of the NFA, a DFA consisting of a single start state, and a mapping that marks the two states as equivalent.
findReachableStates
The findReachableStates
helper function is straightforward. For each state, we find all the transitions, and then subsequently find all the 𝜀 transitions, and flatten the result. This can be represented concisely as a flatMap
operation. We then remove any states which have already been tried, and then recurse.
This can be implemented tail recursively. Tail recursion (recursion as the final operation of a recursive function) is equivalent to a while loop, and this is an optimisation the Scala compiler can perform.
constructDfa
Having worked out the reachable states, we combine all their non-𝜀 transitions together, and for each target set of states, create a new state, throwing out any transitions for which we already have transitions from a previous iteration. We use Scala’s pattern matching to concisely destructure the map pairings.
Then we can create a new DFA. This is a copy of the current DFA, with the new transitions added and the set of final states updated (if a final state is reachable via 𝜀 transitions from the current set of states, they are a final state in the DFA). Because everything is immutable, a shallow copy is sufficient.
If there were no new states produced, we’re done. Otherwise, we need to repeat this procedure for the newly generated states, before combining all of the transition maps together. Note that while this may produce several instances of the same state, this doesn’t actually matter as they all work as keys in the transition map regardless.
We now have a Dfa
instance which corresponds to the Nfa
. Now we can implement matching.
Implementing matching
Given a Dfa
and a String
to match, we want to return a sequence of all the possible matches. The matching algorithm will be greedy; in the event that the DFA may match the string in multiple ways, we want to return the longest match. Once we’ve found a match, the next search should commence from the end of the previous match.
This basically reduces down to a number of cases:
- If the input is empty, return a sequence containing the last match if it exists, or the empty sequence if it does not
- If there is input
-
- If the input matches and the DFA is in a final state, the the result becomes a possible match (we do not add it to the sequence yet as we are being greedy). If it is not a final state, we simply append the character to the current match.
-
- If the input does not match, we append the last known match to the sequence (if it exists) and backtrack to the end of the last known match before proceeding from the start state.
This algorithm maps nicely to Scala code:
This is a recursive function with an accumulator for the current match.
We have elected to use a Stream
as the sequence type. This is a lazy collection; its elements are not computed until they are needed. This means that we need not pay the cost of finding all the matches if we’re only interested in the first one.
Conclusion
We’ve implemented a simple Regular Expression engine in Scala. While lacking in most of the conveniences of a modern Regular Expression engine, we are still able to create a DFA for any regular language.