Picture of a grammar tree as produced by MatchToDot. If you don't see a picture, your browser can't show in-line svg graphics. Sorry.

Grammar

The dfh.grammar library is a pattern matching library for Java inspired by the enhanced regular expression facilities provided by Perl 5.10+ and Perl 6. The basic functionality in the Perl 5.10+ regex engine is awesome in itself, but the syntactic sugar provide by Regexp::Grammar is another order of magnitude better still, and Perl 6 is still more fabulous. I have long loved these character pattern matching facilities and craved the same for Java, the language which is my bread and butter. This is not to knock Java. Java's character class syntax is fabulous. I love being able to combine character classes like sets. But due to double escaping Java regular expressions of any size are unreadable, and there's no way to make them recursive. Compositionality may be achieved to some extent by adding patterns together like strings, but I wanted more. Since I prefer coding to Googling for code, I decided to write my own improvement on Java regexes. This library is the product of this labor of obsession.

The Basic Idea

The image above was produced by taking the grammar
<ROOT> = ~<b> <g> <r>?+ <a> <m>{2} <a> <r> ~<b>

   <a> = "a"
   <b> = / $ | ^ /x
   <g> = /g/i
   <m> = "m"
   <r> = "r"
and applying it to the word Grammar. Specifically, I combined the two using the MatchToDot application included with the dfh.grammar distribution, passing that through GraphViz and rescaling the image a bit with Inkscape. As you can see, a grammar looks a lot like a BNF grammar. It consists of a list of named rules. Rule names are sequences of characters in the \w class enclosed in angle brackets. Each line in the grammar consists of a rule name, an equals sign, and a rule definition.Or a blank line or a comment beginning with #. Rule definitions may also end in a comment. The rule definition consists of one or more elements optionally modified in various ways. A leading ~ or ! marks a forward-looking zero-width assertion.Variable width backwards assertions are indicated by ~- or !-; and for symmetry one may use ~+ and !+ to indicate forward assertions. Though the backwards assertions native to dfh.grammar are variable width, unlike those of java.lang.regex.Pattern, they do have limitations that don't pertain to forward assertions. See syntax. Of course, one may define a terminal rule that provides a java.util.regex.Pattern zero-width assertion, though the grammar will not be able to recognize it as such. Trailing suffixes familiar from Perl-style regular expressions -- *? , {1,2} , etc. -- indicate repetitions. Constituents may be grouped with square brackets -- [ <a> <b> ]+ . Alternates are marked with | . And so forth.
The atomic constituents of rules are literals, marked with single or double quotes, and regular expressions.Or arbitrary extensions of the dfh.grammar.Rule, as is explained elsewhere.
Among the rules is a special rule called <ROOT> . That is to say, we call it <ROOT> in the documentation, but in the grammar it may have any name; it is just the first rule listed. For the grammar to match a character sequence the root rule must match. Constituent rules are invoked as needed by other rules.
The net result is an object that matches much like a java.util.regex.Pattern . For example, the grammar above is identical to the regex
  \b(?i:g)\br?+am{2}ar\b

How it works

This is roughly the algorithm in a nutshell: The grammar compiles into what I gather from poking about in Wikipedia is a recursive descent parser.The dfh.grammar formalism bears a resemblance to parsing expression grammar, though its rules, aside from the first, are not ordered, and its grammars may be ambiguous. It uses a caching algorithm similar to that of packrat parsers, but it only stores the success or failure of matching for non-terminal rules. For terminal rules, the entire match is cached. This accelerates parsing without require exponential memory. Basically this is a finite state automaton that consumes symbols as it moves over a symbol sequence. This automaton examines symbols in two streams: the character sequence you give it and a reversed version of this sequence for lookbehinds. To accelerate matching, terminal matches are cached. Also, all rules cache offsets at which they fail to match so they never attempt to match at such offsets twice. Functionally identical rules share caches, so, for example, a literal sub-rule in one rule will not attempt to match where an identical literal from another rule has already been tried.

Why?

So if grammars are basically regexes, why not use regexes? Well, for simple patterns you probably do want regexes. You can type them in faster and they will match faster. You want a grammar when the pattern is complex, recursive, or requires arbitrary tests not expressible in the standard regular expression syntax. The advantages of a grammar are

Examples

The examples below are just that: examples. For a more complete survey of how grammars can be instantiated, manipulated, and used, see the code in the test directory, t/ , or the examples directory, examples/ .

Matching

As with java.util.regex.Matcher , the basic methods are match , find , and lookingAt , where the first must match the entire character sequence, the second, some prefix of the sequence, and the third, anywhere.
		String[] rules = {                  # dfh.grammar.Grammar has several constructors. 
		Generally in this documentation I show the constructor that takes an array of strings as its parameter. 
		For longer grammars one should write the grammar in an external file and pass the constructor a file object 
		or a reader. One can also construct a grammar from a single string, which will be extremely useful when Java 
		finally supports multi-line strings.
				"<ROOT> = <c> <d>",
				"<c> = <a>++",
				"<d> = <a> <b>",
				"<a> = /a/",
				"<b> = /b/",
		};
		Grammar g = new Grammar(rules);
		String s = "aab";
		Matcher m = g.lookingAt(s);
		Match n = m.match();
		System.out.println(n);
		System.out.println(n.group());
produces
(<c> <d>: 0, 3 [(<a>+: 0, 1 [(/a/: 0, 1)]), (<a> <b>: 1, 3 [(/a/: 1, 2), (/b/: 2, 3)])])
aab
The stringification of dfh.grammar.Match itself has the grammar
<stringication> = "(" <rule_id> ": " <offsets> [ "[" <submatches>"]" ]? ")"
   <submatches> = <stringification> [ ", " <stringification> ]*

Iterating over Matches

As with java.util.regex.Matcher , a dfh.grammar.Matcher behaves as an iterator over matches.
		String[] rules = {
				"<ROOT> = <a>+ <a>+",
				"<a> = /a/",
		};
		Grammar g = new Grammar(rules);
		String s = "aaa";
		Options opt = new Options();
		opt.allowOverlap(true);
		Matcher m = g.find(s, opt);
		Match n;
		while ((n = m.match()) != null)
			System.out.println(n);
produces
(<a>+ <a>+: 0, 3 [(<a>+: 0, 2 [(/a/: 0, 1), (/a/: 1, 2)]), (<a>+: 2, 3 [(/a/: 2, 3)])])
(<a>+ <a>+: 0, 3 [(<a>+: 0, 1 [(/a/: 0, 1)]), (<a>+: 1, 3 [(/a/: 1, 2), (/a/: 2, 3)])])
(<a>+ <a>+: 0, 2 [(<a>+: 0, 1 [(/a/: 0, 1)]), (<a>+: 1, 2 [(/a/: 1, 2)])])
(<a>+ <a>+: 1, 3 [(<a>+: 1, 2 [(/a/: 1, 2)]), (<a>+: 2, 3 [(/a/: 2, 3)])])
Note that the matcher in this case is iterating over all possible ways the grammar could match the given string, not just the non-overlapping matches. This is because of this bit:
		Options opt = new Options();
		opt.allowOverlap(true);
		Matcher m = g.find(s, opt);
This is the general way one modifies the behavior of a dfh.grammar.Matcher , other than by changing the method of dfh.grammar.Grammar that produced it. You instantiate a dfh.grammar.Options object and then change its properties with its accessors.For the sake of brevity I did not use the JavaBean spec for accessors. Once a matcher has been created, it is independent of the Options object used in its creation. You may change this however you like and it will not affect the behavior of the Matcher .

Applying Arbitrary Conditions to Rules

One of the more powerful features of grammars is that you can apply arbitrary conditions to rules, validating sub-portions of a match mid-match.This is equivalent to the conditional expression (?(?{!condition()})(?!)) in Perl -- condition in this case is a bit of code that, if false, will cause the match to fail.
                String[] rules = { 
                                "<ROOT> = <year> '/' <month> '/' <day> (is_date)",
                                "<year> = /\\b\\d{4}/",
                                "<month> = /\\d{1,2}/",
                                "<day> = /\\d{1,2}/",
                };
                Grammar g = new Grammar(rules);
                g.defineCondition("is_date", new Condition() {
                        @Override
                        public boolean passes(Match n, Matcher m, CharSequence s) {
                                int year = parse(s, n.children()[0]);
                                int month = parse(s, n.children()[2]);
                                int day = parse(s, n.children()[4]);
                                Calendar c = Calendar.getInstance();
                                c.set(Calendar.YEAR, year);
                                c.set(Calendar.MONTH, month);
                                c.set(Calendar.DATE, day);
                                return c.get(Calendar.YEAR) == year && c.get(Calendar.MONTH) == month && c.get(Calendar.DATE) == day;
                        }

                        private int parse(CharSequence s, Match n) {
                                return Integer.parseInt(s.subSequence(n.start(), n.end()).toString());
                        }
                });
                String s = "2012/13/41 2001/1/1";
                Options opt = new Options();
                opt.allowOverlap(true);
                Matcher m = g.find(s, opt);
                Match n;
                while ((n = m.match()) != null)
                        System.out.println(n.group());
produces
2001/1/1
This is obviously a somewhat silly example as it is practically no different from post-filtering the results without the condition, but it does illustrate the process. Within the grammar rules you only name the condition. After grammar compilation you assign to this name an extension of the abstract class dfh.grammar.Condition . In complex grammars conditions can cause spurious matches to fail quickly, accelerating the discovery of real matches. Also, they enforce a declarative style which makes the logic of the grammar easier to follow.