Collapsing Regexp alternations

Today I took a detour from my normal work to investigate an issue we had where we were performing a Denial of Service (DoS) attack against ourselves with a particular RegExp pattern – npm's url-regex package provided it.

That package combines the list of top-level domains (TLDs) from the tlds package into one big list of alternations like (?:blog|coffee|com|co|net|org). Alternations can be deadly in RegExp patterns because each one represents another branch that the search has to take and to which it might backtrack.

The idea I had was to rearrange these domains first by popularity and then to collapse them into an optimized description like this instead: (?:(?:c(?:o(?:m|ffee)?))|org|net|blog). Did you see it? By grouping by each letter we eliminate most of the branches on most iterations.

In short, this experiment took a long time to perform and the results rejected my hypothesis. In conclusion, the built-in RegExp engines probably already perform the same optimization I made, other than the first-letter groups may be rearranged (which may not make much of an impact on performance).

Premise

My idea was to take the list of the million most popular URLs and also a big text document with few URLs in it and then run them through the different variations of the RegExp pattern, time the results, then compare. Simple enough.

Steps

Test harness

Let's start by building our timing function – it's going to perform the measurements.

/**
 * Runs a function and returns the runtime
 *
 * @param {Function} f thunk to run
 * @returns {Number} runtime of f (in ms)
 */
const t = f => {
  const tic = process.hrtime();
  f();
  const [ s, ns ] = process.hrtime(tic);

  return s * 1e3 + ns / 1e6;
}

We should be able to pass any thunk into there and it will run the function and report the time it took.

Next we want to find all of the URLs in a document. We'll use RegExp.prototype.exec() to do this and we'll make sure we set the g flag on our RegExp pattern so that we can continue searching after the previous find.

Since we'll be trying different patterns I zoomed out one level and also wrapped it with t() – making some throw-away code here so nothing robust is necessary. We're going to create a global variable to hold the matches. We need this if we want to keep the loop simple as that variable will be the exit condition.

let match;

/**
 * Times how long it takes to find all matches
 * to the given pattern in the given document
 *
 * @param {string} document the input document
 * @param {RegExp} pattern specifies what to find
 * @returns {Number} time to find all matches (in ms)
 */
const runner = document => pattern => t( () => {
  // reset our loop and pattern (pattern is mutable)
  match = true;
  pattern.lastIndex = 0;

  while ( match ) {
    match = pattern.exec( document );
  }
} );

That does it for our test harnesses. We'll run them over an array of document and patterns…

const times = [ docA, docB ].map( doc => {
  const run = runner( doc );
  return [ patternA, patternB, patternC ].map( run );
} );

With documents as large as I'm using I'm not bothering to run them multiple times in a row. I think that all locality concerns will be nullified by the amount of data flowing through. We should also be safe from loop optimizations since both match and pattern are mutating on every iteration – the compiler/engine cannot safely skip iterations.

Building the patterns

The url-regex package provides a function which generates its url. It's easy enough to generate the one with the TLDs in it.

const url = require( 'url-regex' );
const base = url( { strict: false } );

For the next one where I rearrange .com to the front I manually copied this RegExp into my editor, made some edits, then pasted it back into Node.

The last pattern though was much harder as I had to generate it algorithmically and ran into bumps (should have done a better web search before hacking into it).

For this phase of the project I'm going to pretend that tlds returns [ 'ca', 'cat', 'cot', 'car', 'catastrophe', 'catastrophic' ] instead of the full list since it's easier to explore.

Without further ago let's get coding!

We want to build a prefix-tree of them. A quick web search revealed the trie-prefix-tree package which did all the work for us.

const trie = require( 'trie-prefix-tree' )
let tlds = trie( require( 'tlds' ) ).tree()

tlds === {
  c: {
    a: {
      $: 1,
      t: {
        $: 1,
        a: { t: { a: { s: { t: { r: { o: { p: { h: {
          e: { $: 1 },
          i: { c: { $: 1 } }
        }
      },
      r: { $: 1 }
    },
    o: { t: { $: 1 } }
  }
}

What a mess! But we can clean it up. Immediately we note two patterns: when we've reached a leaf node which represents the end of a word there's a { $: 1 } value to indicate it; even if there's only one leaf node on a subtree the entire subtree gets built letter-by-letter still. Let's combine those first so we can bring back some sanity to the output.

/**
 * Combines stretches of letters with no branches
 *
 * @param {Object} o output from `trie(…).tree()`
 * @returns {Object} augmented and collapsed tree
 */
const collapse = o => {
  Object.keys( o ).forEach( key => {
    // start by collapsing the subtree beneath us
    collapse( o[ key ] );

    const subKeys = Object.keys( o[ key ] );

    // our only child is a terminator
    // we can move it up immediately and remove the branch
    if ( subKeys.length === 1  && subKeys[ 0 ] === '$' ) {
      o[ key ] = o[ key ].$;
    }

    // our only child is another letter or substring
    // we can move it up
    // and rename the parent into the combination
    const newKey = subKeys.length === 1 && subKeys[ 0 ] !== '$'
      ? `${ key }${ subKeys[ 0 ] }`
      : key;

    if ( newKey !== key ) {
      o[ newKey ] = o[ key ][ subKeys[ 0 ] ];
      delete o[ key ];
    }
  } );
}

collapse( tlds )

After this new more digestible output.

{
  c: {
    a: {
      $: 1,
      t: { $: 1, astroph: { e: 1, ic: 1 } },
      r: { 1 }
    },
    ot: { 1 }
  }
}

The structure of our final RegExp is clear – we just have to code it up. By hand we'll plan out our destination. We want to use non-capture groups for each group. We should also have no need for backtracking since we're manually organizing this.

Further, we need to be careful with our ordering because the RegExp will match the first pattern. That is, if we attempt to match (?:co|com) then we'll never match com since co is a valid match of that string too. Going longest-first is the easy way out but as long as the alternations don't conflict we're safe.

(?:c(?:a(?:t(?:astroph(?:e|ic))?|r)?)|ot))

It looks confusing but if we add our indentation back in it will look about the same as the data structure above and be clearer.

(?:
  c(?:
    (?:
      a(?:
        t(?:
          astroph(?:
            e
            |
            ic
          )
        )?
        |
        r
      )?
    )
    |
    ot
  )
)

I'll spare you the details because they are ugly and I know there's a better way to write it than I did, but you can still read the source. It's not even correct; it leaves extra non-capture groups when they aren't needed, though I'm sure that the RegExp engine will ignore those. It's too bad it didn't perform better!

2 thoughts on “Collapsing Regexp alternations

  1. Also I should note that the character length for the compiled pattern was longer than the simple list of TLD alternations – it was about 11,000 characters instead of just 10,000; however, it compressed better and the result was around 5 KB instead of 6 KB.

    That is, even if it didn’t speed up the parsing speed it may save on some download size.

Leave a Reply

%d bloggers like this:
search previous next tag category expand menu location phone mail time cart zoom edit close