Trie Regular Expressions in Java

A trie regular expression lets you match expressions in a list more or less equally quickly however long the list. Some regular expression engines, such as that in Perl 5.10+, will automatically convert an alternation into a trie for optimal efficiency. Java's regex engine does not. So I've written this library to provide this functionality for Java. It provides a little additional functionality as well -- normalizing whitespace into \s++ if you wish, for example, and marking word boundaries. So, for example, if you wish to match {an,angry,aardvark} it will give you the expression

(?i:\ba(?>n(?>gry)?+|ardvark)\b)

Note that the expression returned will always be in its own group -- set of parentheses -- so you can compose it into a larger regular expression as you would a character.

I have used dfh.trie.Trie to compile regular expressions matching lists of hundreds of thousands of words. It takes a while to compile such an expression but it matches as quickly as an expression matching a single word.

An Example

The class below compiles a variety of regular expressions with varying options.

import static dfh.trie.Trie.AUTO_BOUNDARY;
import static dfh.trie.Trie.BACKTRACKING;
import static dfh.trie.Trie.CASEINSENSITIVE;
import static dfh.trie.Trie.CONDENSE;
import static dfh.trie.Trie.NO_CHAR_CLASSES;
import static dfh.trie.Trie.PRESERVE_WHITESPACE;
import static dfh.trie.Trie.trie;

import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import dfh.trie.Trie;

class Sampler {

	/*
	 * some words to look for
	 */
	static String[] words = { 
		"cat", 
		"cot", 
		"dog",           // NOTE
		"Dig",           // case
		"elephant",
		"eeeeeeeeek",    // long repeating sequence
		"elevator shaft" // interior whitespace
	};

	/*
	 * a string to look in
	 */
	static String sentence = "The categorical cottager dogged the elephant "
			+ "dig's elevator   shaft while shouting 'eeeeeeeeeeeeeeek!'";

	public static void main(String[] args) throws IllegalArgumentException,
			IllegalAccessException {
		// we try some options
		test(0);
		test(AUTO_BOUNDARY);
		test(CONDENSE);
		test(BACKTRACKING);
		test(CASEINSENSITIVE);
		test(PRESERVE_WHITESPACE);
		test(NO_CHAR_CLASSES);
		test(AUTO_BOUNDARY | CONDENSE | CASEINSENSITIVE);
	}

	static void test(int options) throws IllegalArgumentException,
			IllegalAccessException {
		String re = trie(words, options);
		Pattern p = Pattern.compile(re);
		Matcher m = p.matcher(sentence);
		describeOptions(options);
		System.out.println(re);
		while (m.find())
			System.out.printf("  %s%n", m.group());
		System.out.println();
	}

	static void describeOptions(int options) throws IllegalAccessException {
		boolean first = true;
		for (Field f : Trie.class.getFields()) {
			if (f.getType().equals(int.class)) {
				int flag = f.getInt(null);
				if ((options & flag) != flag)
					continue;
				if (first)
					first = false;
				else
					System.out.print(" | ");
				System.out.print(f.getName());
			}
		}
		if (!first)
			System.out.println();
	}

}

The output of this is

(?>e(?>le(?>vator\s++shaf|phan)t|eeeeeeeek)|dog|c[ao]t|Dig)
  cat
  cot
  dog
  elephant
  elevator   shaft
  eeeeeeeeek

AUTO_BOUNDARY
(?>\b(?>e(?>le(?>vator\s++shaf|phan)t|eeeeeeeek)|dog|c[ao]t|Dig)\b)
  elephant
  elevator   shaft

CONDENSEdfh.trie.Trie will look for repeating patterns of up to 16 characters. 
Trie.trie(new String[]{"foofoofoofoo"}, Trie.CONDENSE), for instance, will give you (?>(?:foo){4}).
Trie will only replace the repeating sequence with the condensed version if this will produce a smaller
regular expression, so Trie.trie(new String[]{"foofoofoo"}, Trie.CONDENSE) gives you (?>foofoofoo)
rather than (?>(?:foo){3}) -- the latter expression is actually one character longer than the former.
(?>e(?>le(?>vator\s++shaf|phan)t|e{8}k)|dog|c[ao]t|Dig)
  cat
  cot
  dog
  elephant
  elevator   shaft
  eeeeeeeeek

BACKTRACKING
(?:e(?:le(?:vator\s+shaf|phan)t|eeeeeeeek)|dog|c[ao]t|Dig)
  cat
  cot
  dog
  elephant
  elevator   shaft
  eeeeeeeeek

CASEINSENSITIVE
(?i:(?>e(?>le(?>vator\s++shaf|phan)t|eeeeeeeek)|d[io]g|c[ao]t))
  cat
  cot
  dog
  elephant
  dig
  elevator   shaft
  eeeeeeeeek

PRESERVE_WHITESPACE
(?>e(?>le(?>vator shaf|phan)t|eeeeeeeek)|dog|c[ao]t|Dig)
  cat
  cot
  dog
  elephant
  eeeeeeeeek

NO_CHAR_CLASSES
(?>e(?>le(?>vator\s++shaf|phan)t|eeeeeeeek)|dog|c(?>o|a)t|Dig)
  cat
  cot
  dog
  elephant
  elevator   shaft
  eeeeeeeeek

CASEINSENSITIVE | AUTO_BOUNDARY | CONDENSE
(?i:\b(?>e(?>le(?>vator\s++shaf|phan)t|e{8}k)|d[io]g|c[ao]t)\b)
  elephant
  dig
  elevator   shaft

The Unusual Character Trick

Suppose you want to compile a regular expression out of a list of regular expressions. Well, this isn't really what dfh.trie.Trie was designed for. It isn't Regexp::Assemble. However, there is one little trick you can use to do something a little similar. Say you want to match c[aeiou]t|d[aeiou]g -- that is, you want to match particular sequences of vowels and consonants where the vowels can be any vowel but the consonants must occur in a particular order. What you do is you pick some character that isn't otherwise used (and isn't a regex meta-character due to be escaped), x , say, or , and replace all the vowels in your words with this character. You then compile the regex the ordinary way, and then do a search-and-replace, replacing the unusual character with an atomic regular expression -- you can attach a quantifier to it and the entire expression will be quantified -- matching the sub-pattern you're interested in. In this case:

String re = Trie.trie(new String[] { "c€t", "d€g" },
    Trie.CASEINSENSITIVE | Trie.NO_CHAR_CLASSES).replaceAll("€", "[aeiou]");

producing

(?i:(?>d[aeiou]g|c[aeiou]t))

It's necessary if you're using this trick to use the NO_CHAR_CLASSES option to prevent constructing something like (?>c[[aeiou]x]t) . With NO_CHAR_CLASSES this will be (?>c(?>[aeiou]|x)t) .

trieify

I find I often want a regular expression to match a smallish number of phrases and it would be nice to have a command line utility that would whip it up for me. Such a utility is included in the tarball downloadable on the download page. It's heart is the class trieify in the util/ directory. This class uses dfh.cli for command line parsing, and being a Java program it's a bit of a pain to convert into a command line utility. For this purpose I put a shell script called trieify in ~/bin/ . I use Linux. If all of this sounds mysterious to you, it probably isn't worth your trouble.

#!/bin/sh

java -jar ~/jars/trieify.jar "$@"

You'll have to fix the path to point to the correct jar, of course. Here's trieify 's usage information:

USAGE: trieify [options] <phrase>*

  condense word lists into trie regex

    --case-sensitive -I                preserve case distinctions
    --no-boundary -n                   do not assume word boundaries in input
    --no-character-classes -C          use (?>a|b|c) rather than [abc]
    --no-condense -D                   if set, long repeating sequences will be
                                       left alone; aaaaaaaaaa -/-> a{10}
    --backtracking -b                  whether to drop the possessive suffix +
                                       and possessive group prefix (?>

    --whitespace-characters -w         treat whitespace characters the same as
                                       others
    --space-and-tab -s                 treat whitespace characters as
                                       signifying [ \u00a0\t]
    --space-only -y                    treat whitespace characters as
                                       signifying [ \u00a0]

    --file -f                  <file>  take word list from file; repeatable
    --output -o                <file>  write output to file

    --perl-safe -p                     avoid constructs unavailable in Perl 5.8
    --java-safe -j                     double all \ characters so trie can be
                                       pasted into a Java string

    --help -? -h                       print usage information

trieify provides command line access to class dfh.trie.Trie, letting you condense
word lists into a trie regex. The word list will consist of any command line 
arguments or the lines read in from files specified by the --file argument.

Note that whitespace receives special treatment unless the --whitespace option is
provided. Whitespace will be trimmed off the ends of all phrases and interior 
whitespace will be represented by some possessive character class sequence, 
\s++ by default. Thus 'foo bar' will be converted into an expression that matches
'foo', some amount of whitespace, and 'bar'.

Note that the argument '--' will block option parsing, so if you want to convert
'-foo' and '-bar' into (?i:-(?>foo|bar)\b) you provide trieify the arguments 
'-- -foo -bar'. The single character options can be bundled, so

	trieify -InC foo bar 

is the same as

	trieify -I -n -C foo bar