Matching Regular Expressions by Computing Their Derivatives

:: racket, tutorials

By: Mike Delmonaco

Regular expressions allow us to describe patterns in text. They are very useful and show up all over the place in programming, but matching regular expressions can be difficult. One well-known technique for matching regular expressions is converting the regular expression to a finite state machine. This is pretty elegant, but can get complicated and messy.

An alternative technique, which is the subject of this blog post, involves something called a Brzozowski derivative. This technique can be used to compute the derivative of a generalized regular expression.

1  The Derivative of a Regular Expression?

The derivative of a regular expression re with respect to a character (or string) c is another regular expression which matches strings that, when appended to c, are matches of re. Here are a few examples:
  • The derivative of #rx"abc" with respect to #\a is #rx"bc"

  • The derivative of #rx"(ab)|(xy)" with respect to #\a is #rx"b". The second alternative is eliminated

  • The derivative of #rx"(abc)*" with respect to #\a is #rx"bc(abc)*"

  • The derivative of #rx"b" with respect to #\a is a regular expression that has no matches ever.

We take the derivative of a regular expression with respect to a string by repeatedly taking the derivative of the regular expression with respect to each character.

That’s the idea. If the derivative of the regular expression with respect to the target string is a regular expression which matches the empty string, we have a match.

We will represent regular expressions with racket datums:

  rx   =   character
    |   'empty
    |   'null
    |   (list 'or  rx rx)
    |   (list 'and  rx rx)
    |   (list 'not  rx)
    |   (list 'seq  rx rx)
    |   (list '*  rx)
  • character matches the character character

  • 'empty matches the empty string

  • 'null doesn’t match any strings

  • (list 'or a b) matches strings that match a or b

  • (list 'and a b) matches strings that match a and b. This is a feature that most regular expression implementations do not support.

  • (list 'not a) matches strings that do not match a. This is a feature that most regular expression implementations do not support.

  • (list 'seq a b) matches strings that can be partitioned into a substring that matches a followed by a substring that matches b. This is the concatenation of two regular expressions.

  • (list '* a) matches strings that match zero or more concatenations of a with itself

First, we must define a helper function that determines whether a regular expression matches the empty string. If it does, we return 'empty. Otherwise, we return 'null.

> (define (v re)
    (match re
      ['null 'null]
      ['empty 'empty]
      [(? char?) 'null]
      [`(* ,a) 'empty]
      [`(,(or 'and 'seq) ,a ,b) (if (equal? (v a) 'null) 'null (v b))]
      [`(or ,a ,b) (if (equal? (v a) 'empty) 'empty (v b))]
      [`(not ,a) (if (equal? (v a) 'null) 'empty 'null)]))
> (v 'empty)

'empty

> (v 'null)

'null

> (v #\a)

'null

> (v '(and empty #\a))

'null

> (v '(or empty #\a))

'empty

> (v '(* #\a))

'empty

Now, we’re ready to implement the derivative:

> (define (d/dc re c)
    (match re
      [(? char? c^) (if (eq? c c^) 'empty 'null)]
      [(or 'empty 'null) 'null]
      [`(* ,a) `(seq ,(d/dc a c) (* ,a))]
      [`(seq ,a ,b)
       `(or (seq ,(d/dc a c) ,b)
            (seq ,(v a) ,(d/dc b c)))]
      [`(and ,a ,b)
       `(and ,(d/dc a c) ,(d/dc b c))]
      [`(or ,a ,b)
       `(or ,(d/dc a c) ,(d/dc b c))]
      [`(not ,a) `(not ,(d/dc a c))]))
> (d/dc #\a #\a)

'empty

> (d/dc #\a #\b)

'null

> (d/dc `(seq #\a #\b) #\a)

'(or (seq empty #\b) (seq null null))

> (d/dc `(seq #\a #\b) #\z)

'(or (seq null #\b) (seq null null))

> (d/dc '(or (seq #\a #\b) (seq #\a #\c)) #\a)

'(or (or (seq empty #\b) (seq null null)) (or (seq empty #\c) (seq null null)))

> (d/dc '(or (seq #\a #\b) (seq #\a #\c)) #\z)

'(or (or (seq null #\b) (seq null null)) (or (seq null #\c) (seq null null)))

> (d/dc '(and (seq #\a #\b) (seq #\a #\c)) #\a)

'(and (or (seq empty #\b) (seq null null))

      (or (seq empty #\c) (seq null null)))

> (d/dc '(and (seq #\a #\b) (seq #\a #\c)) #\z)

'(and (or (seq null #\b) (seq null null)) (or (seq null #\c) (seq null null)))

> (d/dc '(* #\a) #\a)

'(seq empty (* #\a))

> (d/dc '(* (seq #\a #\b)) #\a)

'(seq (or (seq empty #\b) (seq null null)) (* (seq #\a #\b)))

For the character case, we check if the target character matches the regular expression character. If it does, then the only thing that can follow the target character in a string is the empty string, so that’s the derivative. If it doesn’t, there are no strings that can follow the character, so we return null. Remember, the result of a derivative is a regular expression that matches strings that, when appended after the target character, match the original regular expression. If the derivative is 'null, then the match failed because there are no strings that can follow the character and produce a match. In other words, the target character is not the first character of any matches of the regular expression.

For the 'empty expression, we return 'null because the empty expression does not match a character. The empty expression only matches the empty string.

For the 'null expression, we return 'null because the null expression does not match any strings.

For star expressions, we concatenate the sub-expression’s derivative with the star expression. Concatenating the original star expression gives us the repetition behavior.

For sequences, there are two cases. The intuitive case is the first one, `(seq ,(d/dc a c) ,b). This is the derivative of the first regular expression concatenated with the second expression. However, if the first regular expression matches the empty string, the derivative of the sequence could also be (d/dc b c). We implement this using our v helper and properties of regular expressions.

If (v a) returns 'empty, the second case becomes `(seq empty ,(d/dc b c)), which is equivalent to (d/dc b c). If (v a) is 'null, we get `(seq null ,(d/dc b c)), which is equivalent to 'null. So we only ever "try" the derivative of b when a matches the empty string.

We want the derivative to match all strings matched by either of these two cases, so we use 'or.

The other cases are just straightforward recursive cases, threading the derivative through the sub-expressions.

The first two examples show the simple character cases. The following examples seem complicated, but if we use the properties of regular expressions, we can simplify the results. For example, (d/dc `(seq #\a #\b)  #\a) evaluates to '(or (seq empty #\b) (seq null null)), which is equivalent to #\b. Try to simplify the other outputs and make sure the results are what you expect.

1.1  Detour: A Simplifier

If you simplified those by hand, you’d probably agree that this is a computer’s job. So let’s make a function to do it for us! This will make it easier to analyze our derivatives and it’s a fun exercise.

We’ll start by writing the rewrite rules:

> (define (simplify-step re)
    (match re
      [(or 'empty 'null (? char?)) re]
      [`(* null) 'empty]
      [`(* empty) 'empty]
      [`(seq null ,b) 'null]
      [`(seq ,a null) 'null]
      [`(seq empty ,b) b]
      [`(seq ,a empty) a]
      [`(and null ,b) 'null]
      [`(and ,a null) 'null]
      [`(and ,(? char? a) ,(? char? b)) (if (eq? a b) a 'null)]
      [`(and ,a ,a) a]
      [`(or null ,b) b]
      [`(or ,a null) a]
      [`(or ,a ,a) a]
      [`(not (not ,a)) a]
      [_ re]))

These are some simple rewrite rules for simplifying regular expressions. For example, the rule `(and null ,b)  ~> 'null makes sense because a string has to match both 'null and some regular expression b to match `(and null ,b), but it won’t match 'null, so the whole thing is equivalent to 'null. Make sure all of the rules make sense to you.

Now we’ll implement the recursive part:

> (define (simplify-step* re)
    (let ([re^ (simplify-step re)])
      (if (equal? re re^)
          re^
          (simplify-step* re^))))
> (define (simplify re)
    (match re
      [(or 'empty 'null (? char?)) re]
      [`(* ,a) (simplify-step* `(* ,(simplify a)))]
      [`(seq ,a ,b) (simplify-step* `(seq ,(simplify a) ,(simplify b)))]
      [`(and ,a ,b) (simplify-step* `(and ,(simplify a) ,(simplify b)))]
      [`(or ,a ,b) (simplify-step* `(or ,(simplify a) ,(simplify b)))]
      [`(not ,a) (simplify-step* `(not ,(simplify a)))]))

simplify-step* repeatedly applies simplify-step until the result is unchanged. This is guaranteed to terminate because our rewrite rules either produce a sub-expression, an atomic expression, or the unchanged input expression. Atomic expressions are returned unchanged, so eventually, we’ll either produce an expression with no rewrite rule (gets returned unchanged) or an atomic and then terminate.

simplify simplifies the expression bottom-up, simplifying sub-expressions first and then repeatedly applying our rewrite rules on the input expression with simplified children.

This is by no means an exhaustive, rigorous simplifier, but it gets the job done for our purposes.

Finally, we can simplify those derivatives:

> (simplify (d/dc #\a #\a))

'empty

> (simplify (d/dc #\a #\b))

'null

> (simplify (d/dc `(seq #\a #\b) #\a))

#\b

> (simplify (d/dc `(seq #\a #\b) #\z))

'null

> (simplify (d/dc '(or (seq #\a #\b) (seq #\a #\c)) #\a))

'(or #\b #\c)

> (simplify (d/dc '(or (seq #\a #\b) (seq #\a #\c)) #\z))

'null

> (simplify (d/dc '(and (seq #\a #\b) (seq #\a #\c)) #\a))

'null

> (simplify (d/dc '(and (seq #\a #\b) (seq #\a #\c)) #\z))

'null

> (simplify (d/dc '(* #\a) #\a))

'(* #\a)

> (simplify (d/dc '(* (seq #\a #\b)) #\a))

'(seq #\b (* (seq #\a #\b)))

Beautiful!

2  Putting it All Together

Now that we can differentiate a regular expression with respect to a character, we can differentiate with respect to a string by sequencing character derivatives.

> (define (d/ds re s)
    (for/fold ([re re])
              ([c (in-string s)])
      (d/dc re c)))
> (simplify (d/ds '(seq #\a #\b) "ab"))

'empty

> (simplify (d/ds '(seq (seq #\a #\b) #\c) "ab"))

#\c

> (simplify (d/ds '(* #\a) ""))

'(* #\a)

> (simplify (d/ds '(* #\a) "a"))

'(* #\a)

> (simplify (d/ds '(* #\a) "aaaaa"))

'(* #\a)

> (simplify (d/ds '(* (seq #\a #\b)) ""))

'(* (seq #\a #\b))

> (simplify (d/ds '(* (seq #\a #\b)) "a"))

'(seq #\b (* (seq #\a #\b)))

> (simplify (d/ds '(* (seq #\a #\b)) "ab"))

'(* (seq #\a #\b))

> (simplify (d/ds '(* (seq #\a #\b)) "ababababa"))

'(seq #\b (* (seq #\a #\b)))

With this, we can finally determine if a string matches a regular expression:

> (define (our-regexp-match re s)
    (match (v (d/ds re s))
      ['empty #t]
      ['null #f]))
> (our-regexp-match #\a "a")

#t

> (our-regexp-match #\a "b")

#f

> (our-regexp-match '(seq #\a #\b) "ab")

#t

> (our-regexp-match '(seq #\a #\b) "az")

#f

> (our-regexp-match '(seq #\a #\b) "abc")

#f

> (our-regexp-match '(or (seq #\a #\b) (seq #\a #\c)) "ab")

#t

> (our-regexp-match '(or (seq #\a #\b) (seq #\a #\c)) "ac")

#t

> (our-regexp-match '(or (seq #\a #\b) (seq #\a #\c)) "az")

#f

> (our-regexp-match '(* #\a) "")

#t

> (our-regexp-match '(* #\a) "a")

#t

> (our-regexp-match '(* #\a) "aaaaaaaaaa")

#t

> (our-regexp-match '(* (seq #\a #\b)) "")

#t

> (our-regexp-match '(* (seq #\a #\b)) "a")

#f

> (our-regexp-match '(* (seq #\a #\b)) "ab")

#t

> (our-regexp-match '(* (seq #\a #\b)) "aba")

#f

> (our-regexp-match '(* (seq #\a #\b)) "abab")

#t

This method is cool and simple, but does not generalize well to features like capture groups. Regardless, I think it is a very interesting idea and a fun exercise.