Syntax

A grammar is specified via a formalism similar to Backus-Naur Form, though the point of these grammars is not to parse BNF or to express context-free grammars but to create useful character sequence matchers. Consequently, no claims are made as to the purity of grammars as CFGs or any other theoretical construct. Basically, a grammar is a collection of matching rules, where every rule specifies a pattern of terminal expressions -- character sequences -- or non-terminal rules.

General Rules

Here is the grammar understood by dfh.grammar expressed in its own formalism:
                grammar  = [ <padding> ]* [ <rule> [ <padding> ]* ]+

                   rule .= /^/m <rule_name> <eq> <rule_body>
              rule_body  = [ <unbracketed_sequence> | <unbracketed_alternation> ] [ <s> <condition> ]? <rule_end>
unbracketed_alternation  = <element> [ <s> "|" <s> <element> ]+
   unbracketed_sequence  = !<backref> <sequence_element> [ <s> !<up_level_br> <sequence_element> ]*
            alternation  = "[" [ <tags> ]? <s> <element> [ <s> "|" <s> <element> ]+ <s> "]"
              assertion  = <assertion_type> <s> <element>
         assertion_type  = <short_a> | <long_a>
            cnd_element  = <identifier> | <condition>
              condition  = "(" <space>? <cnd_element> [ <cnd_separator> <cnd_element> ]*+ <space>? ")"
                    dot .= "." !<dot>
                element  = <sequence> | <alternation> | <repetition> | <label> | <regex> | <literal> | <up_level_br>
                 long_a  = [ <not> <s> ]? <preposition>
             repetition  = <element> <repetition_suffix>
               rule_end  = [ <s> <comment> ]+ | <space>? <nl>
                      s  = [ <space> | <space>? "\\" <rule_end> <space>? ]?
               sequence  = "[" [ <tags> ]?+ <s> !<br> <sequence_element> [ <s> <sequence_element> ]* <s> "]"
       sequence_element  = <element> | <br> | <zero_width> | <dot>
             zero_width  = <barrier> | <assertion>
                backref  = <br> | <up_level_br>
                padding  = <blank_line> | <comment>
      repetition_suffix  = [ <repetition_symbol> | <limits> ] [ <repetition_modifier> ]?
                barrier  = ":"{1,2}
             blank_line .= /^/m <nl>
          cnd_separator  = <space> | <space>? /[&|^]/ <space>?
                comment .= /^/m? /#.*/ <nl>
                  label  = "<" <identifier> ">"
                 limits  = "{" / \d++ | \d++,\d*+ | ,\d++ /x "}"
                literal  = <single_quoted> | <double_quoted>
                    not  = "not" | "!"
                  regex  = /\/ (?: \\ [^\n\r\f] | [^\n\r\f\/] )++ \//x <regex_modifiers>
    repetition_modifier  = "+" | "?"
      repetition_symbol  = "*" | "+" | "?"
              rule_name  = "<"? <identifier> 1
                   tags  = "{" <identifier> [ "," <identifier> ]* "}"
            up_level_br  = <br> "^"
                     br  = /\d++/
          double_quoted  = /" (?: \\ [^\n\r\f] | [^\n\r\f"] )++ "/x
                     eq  = /[.:]?=/
             identifier  = /\w++/
                         # newline according to http://en.wikipedia.org/wiki/Newline, omitting QNX pre-POSIX
                      nl  = /\r\n|\n\r|\n|\r/
            preposition  = "before" | "after"
        regex_modifiers  = /[rimsdux]++/
                short_a  = /[~!][+-]?/
          single_quoted  = /' (?: \\ [^\n\r\f] | [^\n\r\f'] )++ '/x
                  space  = /[\s&&[^\n\r\f]]++/

In short, a grammar expects a set of rules, each of which may define a terminal character pattern or a set of constituent rules. The matching is top-down: the grammar attempts to match a master rule, which generally requires matching constituent rules.

Rule

A rule is composed of two parts: a label and a body. The label must match the pattern /<\w+>|\w+/ , that is, a sequence of word characters, optionally surrounded by angle brackets. The label is separated from the body by one of

= ordinary rule
.= optional whitespace between constituents
:= whitespace required between constituents
That is
foo  = 'a' 'b'
matches only ab,
foo .= 'a' 'b'
matches ab, a b, a  b (any amount of whitespace, including newlines), etc., and
foo := 'a' 'b'
will match only a b, a  b, and so forth. Note that one can reproduce the effect of .= and := in an ordinary rule via explicit whitespace handling rules. In fact, this is likely to be more efficient. The advantage of the two special whitespace variants is purely in rule syntax clarity. See below for more comments regarding whitespace.
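The correspondence can be sketched with plain Java regular expressions (an illustration of the semantics only, not of how dfh.grammar is implemented):

```java
import java.util.regex.Pattern;

// foo  = 'a' 'b'  -- constituents must be adjacent
boolean plain    = Pattern.matches("ab",     "ab");   // true
// foo .= 'a' 'b'  -- whitespace between constituents is optional
boolean optional = Pattern.matches("a\\s*b", "a  b"); // true
// foo := 'a' 'b'  -- whitespace between constituents is required
boolean required = Pattern.matches("a\\s+b", "ab");   // false: no space present
```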

The rule body consists of a list of constituent rules and may optionally end in a post-match condition and/or comment. If a constituent rule is a labeled rule, it is referred to by its label inside angle brackets. Some examples:

<r1> = 'a'
 r2  = <r1> 'b'
<r3> = /\b regex \b/ix
<r4> = "alt"          \ # a multi-line rule
       |              \
       "ernation"       # a comment
<r5> = 'c' (condition)

Except in quoting expressions -- a sequence inside a pair of angle brackets, single quotes, double quotes, or forward slashes -- whitespace is ignored, so you can format your rules however you wish. However, you must in effect escape newlines (see the 4th rule above).

Root

<label> = 'some' 'rule'

Every grammar has a master, or root, rule. This is the rule the grammar tries to match against a character sequence. This rule may have constituent rules and its constituents may have constituents, but these other rules are only invoked as needed in the course of matching the root rule. The root rule is a rule like any other: a label followed by a definition. It is marked in the grammar only by being the first rule listed, though formerly it was required that it bear the label “ROOT” and most of the examples in this documentation still follow that convention.

Comments

The grammar parser ignores blank lines. The comment character is # . If this character appears outside of some quoting construction -- double or single quotes or the forward slashes bracketing a regular expression -- everything after it is ignored.

		
# this and the previous line are ignored
rule = 'a' # and this is ignored as well

Usual Bits

Sequence

rule = 'a' 'b'

As in Java regular expressions.

Alternation

rule = 'a' | 'b'

As in Java regular expressions.

Repetition

rule = 'a'* 'b'? 'c'+ 'd'*? 'e'?? 'f'+? 'g'*+ 'h'?+ 'i'++ 'j'{2} 'k'{1,5}? # etc

As in Java regular expressions. The curly bracket notation accepts {,n} as a synonym for {0,n} .
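These suffixes have the same semantics as Java's greedy, reluctant ( ? ), and possessive ( + ) quantifiers, so plain java.util.regex can be used to illustrate them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Matcher greedy = Pattern.compile("a+").matcher("aaa");
greedy.find();                    // greedy: consumes as much as possible ("aaa")

Matcher reluctant = Pattern.compile("a+?").matcher("aaa");
reluctant.find();                 // reluctant: consumes as little as possible ("a")

// possessive: consumes as much as possible and never backtracks,
// so the trailing 'a' below can never match
boolean possessive = Pattern.matches("a*+a", "aaa"); // false
```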

Group

rule1 = [ 'a' | 'b' ]{3,}+ # the "{3,}+" is just to give the group syntax a purpose
rule2 = [ 'a'* 'b' ]{1}+   # a non-backtracking group

Unlike in Java and Perl 5, but following the convention of Perl 6, the grouping brackets are not () but [] . Round brackets are reserved for conditions. Note that dfh.grammar lacks its own character class syntax -- for that you must use Java native regular expressions. Also, there are no subvarieties of grouping expressions as there are in Java and Perl. Every group is a capturing group inasmuch as every match retains the entire match tree. There are (currently) no in-line match modifiers; for this you must rely on native regular expressions. For the equivalent of (?>...) you must use the repetition suffix {1}+ . Perhaps at some future date I will add a shorthand notation for this.
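For reference, the {1}+ idiom does the same work as the atomic group (?>...) in a native Java regular expression:

```java
import java.util.regex.Pattern;

// grammar analogue: rule = [ 'a'* ]{1}+ 'a'
boolean atomic       = Pattern.matches("(?>a*)a", "aaa"); // false: a* keeps all three a's
// grammar analogue: rule = [ 'a'* ] 'a'
boolean backtracking = Pattern.matches("(a*)a",   "aaa"); // true: a* gives one back
```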

Back Reference

rule = /["'"]/ /\w++/ 1 

Note: no group construct is required, but the index is relative to the elements of the same sequence. Backreferences of this sort are reversible, so you can have, for instance

foo = [ "'" | '"' ] <bar> 1
bar = not after <foo> <quux>

This presents the limitation that you can't use branching constructions such as foo = /['"]/ [ 'a' | 1 ] . If this is something you need and you're not worried about lookbehinds -- reversible rules -- then you can use the up-level form:

foo = [ "'" | '"' ] <bar> [ 1^ | 'shmoo' ]

Now the reference necessarily refers to the main sequence of the rule. If this still isn't enough -- you need to refer to a sub-sequence or to a match in another rule -- you will need to resort to a condition, though this will cost you some efficiency.

Terminal Rules

Terminal rules have no constituent rules. They either match against a character sequence at some offset or they don't. If a grammar has no terminal rules, it never makes contact with the underlying character sequence and hence can never match. This state of affairs will be discovered during compilation and a dfh.grammar.GrammarException will be thrown, aborting the program. The exception will carry an explanatory message such as "cycle found in rules: <b>, <a>".
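A sketch of how such a check might work (hypothetical code, not dfh.grammar's actual implementation): treat the rules as a graph of references and ask whether a rule can ever reach a terminal.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CycleCheck {
    // A rule is viable if some chain of references out of it reaches a rule
    // with no constituents (a terminal). Revisiting a rule on the current
    // path means we have found a cycle.
    static boolean reachesTerminal(String rule, Map<String, List<String>> refs,
            Set<String> path) {
        List<String> deps = refs.getOrDefault(rule, List.of());
        if (deps.isEmpty()) return true;   // terminal rule: matches characters
        if (!path.add(rule)) return false; // cycle detected
        for (String d : deps)
            if (reachesTerminal(d, refs, new HashSet<>(path))) return true;
        return false;
    }
}

// <a> = <b>; <b> = <a> -- all cycle, no terminals, so the grammar is rejected
boolean viable = CycleCheck.reachesTerminal("a",
        Map.of("a", List.of("b"), "b", List.of("a")), new HashSet<>());
```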

During matching, the matcher will cache some information about attempted matches by various rules at various offsets in the character sequence. For non-terminal rules all that is cached is the success or failure of the match. This is to prevent the cache from consuming too much memory -- a particular non-terminal rule can potentially match exponentially many ways at a given offset. Because the number of possible terminal matches is linear in the length of the sequence, the whole match object is stored for terminal matches, accelerating the match process.
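The idea can be sketched as follows (hypothetical code, not the library's own): results are keyed on a (rule, offset) pair, so repeated attempts at the same position are answered from the cache.

```java
import java.util.HashMap;
import java.util.Map;

class MatchCache {
    private final Map<Long, Boolean> cache = new HashMap<>();
    int realAttempts = 0; // counts uncached match attempts

    // did this (hypothetical) literal rule match at this offset?
    boolean matches(int ruleId, int offset, CharSequence s, String literal) {
        long key = ((long) ruleId << 32) | offset; // pack rule id and offset
        Boolean cached = cache.get(key);
        if (cached != null) return cached;         // answered from the cache
        realAttempts++;
        boolean ok = offset + literal.length() <= s.length()
                && s.toString().startsWith(literal, offset);
        cache.put(key, ok);
        return ok;
    }
}

MatchCache cache = new MatchCache();
boolean first  = cache.matches(1, 0, "abc", "ab"); // real attempt
boolean second = cache.matches(1, 0, "abc", "ab"); // served from cache
```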

Regular Expressions

rule = /^ foo \b/imx 

Any Java regular expression (aside from multi-line expressions with the x modifier -- everything must be on one line) can be placed inside forward slashes and used in a rule. Modifiers can either be included inside the slashes like so

rule = /(?i:foo)/
or appended after the last slash as in Perl.
rule = /foo/i
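The same two placements exist in native Java regular expressions, where the trailing-modifier form corresponds to compile-time flags:

```java
import java.util.regex.Pattern;

// modifier inside the expression
boolean inline  = Pattern.matches("(?i:foo)", "FOO");
// modifier as a compile flag, the analogue of the trailing /foo/i form
boolean flagged = Pattern.compile("foo", Pattern.CASE_INSENSITIVE)
        .matcher("FOO").matches();
```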

Regex modifiers

Regular expressions can take all the modifiers understood by java.util.Pattern plus one more:

i Pattern.CASE_INSENSITIVE
m Pattern.MULTILINE
s Pattern.DOTALL
x Pattern.COMMENTS
d Pattern.UNIX_LINES
u Pattern.UNICODE_CASE
r the expression is reversible -- if it matches a sequence, it will also match the reverse of that sequence -- or reversed, and so is usable in a lookbehind assertion. In the latter case, where the expression is already reversed, it can be used only in backwards assertions.

This last modifier cannot be placed inside the slashes but must be appended after.

Why regular expressions in grammars

Since a dfh.grammar.Grammar duplicates almost every capability of a Java regular expression, you might wonder why regular expressions are needed at all. Several reasons:

  1. dfh.grammar lacks anchor constructs -- \b , ^ , $ , \A , \Z and so forth.
  2. It lacks character classes, which are particularly nice in Java.
  3. It lacks modifiers equivalent to Pattern.CASE_INSENSITIVE or Pattern.MULTILINE.
  4. If you can make do with the more impoverished pattern capturing ability of regular expressions, they're much faster, so the more you can rely on them rather than grammars the faster your code will be.
  5. If you have pre-existing regular expressions, it is convenient to be able to plug them into grammars unchanged.

Literals

rule1 = '"foo"'
rule2 = "'foo'" 

String literals are sequences of characters enclosed in double or single quotes. Every character in the literal must be identical to the corresponding character in the sequence matched against for a literal rule to match.

Escape sequences

All the escape sequences that work in Java string literals work in dfh.grammar string literals.

External Rule Definition

You may wish to build grammars compositionally by using one to define a rule in another, to plug your own extensions of dfh.grammar.Rule in, or to build large or variable regular expressions programmatically and then plug them into a static grammar. There are two ways to do this.

Pre-loading Rules

You may load a mapping from rule labels to rules while compiling the grammar:

		Map<String, Rule> map = new HashMap<String, Rule>(1);
		map.put("r", whateverRule);
		String[] rules = {
		//
		"ROOT = ~- <r> 'bar'",//
		};
		Grammar g = new Grammar(rules, map);
It is necessary to pre-load rules if you wish to use them in lookbehind assertions, as in this example. The compiler needs to confirm that any rule in a lookbehind is reversible and it is only able to do this if it already has the rules on hand during compilation.

Deferred Rule Definition

If you do not plan to use the rules in lookbehinds, you may compile a grammar with undefined rules and define them afterwards.

		String[] rules = {
		//
		"ROOT = <q> <text> <bar> 1",//
		};
		Grammar g = new Grammar(rules);
		g.defineRule("q", Pattern.compile("[\"']")); // deferred regular expression
		g.defineRule("text", whateverRule);          // deferred arbitrary rule
		g.defineRule("bar","quux");                  // deferred string literal 
As you can see from this example, deferred rule definition is somewhat more convenient than pre-loaded rules. For one thing, it is more concise and hence easier to read. For another, there are convenience methods for defining deferred rules when they are to match literal character sequences or regular expressions.

Naturally, if you attempt to match a grammar with undefined rules against a character sequence a dfh.grammar.GrammarException will be thrown.

Conditions

		String[] rules = {
		//
		"ROOT = /\\b\\d++\\b/ (less_than_100)",//
		};
		Grammar g = new Grammar(rules);
		g.defineCondition("less_than_100", new Condition() {
			public boolean passes(Match m, Matcher n, CharSequence s) {
				int i = Integer.parseInt(s.subSequence(m.start(), m.end())
						.toString());
				return i < 100;
			}
		});
		String s = "99 100 1000";
		Matcher m = g.find(s);
		Match n;
		while ((n = m.match()) != null)
			System.out.println(s.substring(n.start(), n.end()));

Conditions are arbitrary tests that can be applied after a rule has matched. They are perhaps the most powerful feature grammars have beyond those they share with regular expressions. A condition is equivalent to the conditional expression (?(?{!condition()})(?!)) in Perl -- condition in this case being a bit of code that, if false, will cause the match to fail. The Perl expression has the advantage that it can be put anywhere and the disadvantage that it's hard to see at a glance what it's doing. Also, in the Perl case one has to do a bit of work to gather up information about the current matching state that one might want as arguments to condition. Regexp::Grammars improves the situation for Perl a good deal. The example above produces

99

Caution should be taken when using conditions. In particular, they should be purely functional: a condition that passes at a particular offset in a character sequence should always pass at that offset. Do not make conditions dependent on unrelated state such as the weather at matching time, the phase of the moon, or the contents of a dynamic database. To do otherwise would invalidate the results cached when matches at this offset are attempted and possibly lead to false negatives when matching.

Also, you may not realize how frequently a condition needs to be tested. Matching recursive patterns is exponentially complex. The grammar is able to store whether the conditionalized rule ever matches at a particular offset but not all the ways it can match. Hence conditions are a good way to slow down a match. If you find a condition slows down the match unacceptably, you may need to move condition testing to a post-processing stage. This has the disadvantage that it is less declarative and may require iterating over a great many matches in post-processing -- you probably will want to iterate over all possible matches, not merely all non-overlapping matches. On the other hand, you will most likely need to apply the test less often than you would in the midst of matching.

Methods on the Condition Object

A condition object by itself does nothing useful -- it doesn't know when it should pass or fail so it always passes. To make such an object useful, you must override one of its passes methods.

passes(Match, Matcher, CharSequence)

This method gives you the most contextual information, so it allows you to write the most powerful conditions. You can consider any previous match and its properties, though not subsequent matches; you'll have to use post-match filtering for that. Also, if you override one of the other methods there will be a longer call chain during condition testing, so overriding this method will give you some efficiency gains. However, if you might be dealing with backwards lookbehinds you'll need to take care to handle order reversal and so forth. If you can get away with it, it is probably best to override one of the methods with a more pared-down signature so you don't need to worry about such things.

passes(Match, CharSequence)

This method will give you the entire match tree rooted at this match and the character sequence to work with. You still have to handle character reversal yourself.

passes(CharSequence)

This simplest method is most likely what you want to deal with. It gives you just the sequence of characters matched in proper order.

Numerical Conditions

Quite often conditions express some numerical relation. This requires parsing the matched character sequence as the right sort of number and then applying numerical tests. To facilitate this, there are two numerical subclasses of dfh.grammar.Condition that will do the number parsing for you, leaving you to implement the numerical test. These classes are:

Logical Conditions

Quite often you will want to apply some boolean combination of conditions to a match -- a & b or a & !(b | c) , for example. You may do so, using parentheses for grouping and the logical operators & (“and”), | (“or”), and ^ (“exclusive or”). Blank space is a synonym for & , so a b is interpreted identically to a & b . Logical conditions short circuit, so it's best to test your cheap conditions before your expensive ones.

The “exclusive or” ^

The exclusive or operator for logical conditions is better thought of as an exists-one operator: it is true if one and only one of its operands is true. When it has only two operands, this is equivalent to logical exclusive or (⊕). If it has more than two, it is not. If p, q, and r are all true, p ⊕ q ⊕ r is true: p ⊕ q ⊕ r = (p ⊕ q) ⊕ r = false ⊕ true = true. But p ^ q ^ r, where ^ is the operator used in dfh.grammar 's conditions, is false, because three of the operands are true, not one. If we add parentheses, however, the two are again equivalent: (p ^ q) ^ r = false ^ true = true. The key thing to remember is that p ⊕ q ⊕ r = (p ⊕ q) ⊕ r but p ^ q ^ r ≠ (p ^ q) ^ r. In other words, ⊕ is an ordinary binary operator but ^ is an n-ary operator masquerading as an infix binary operator. A sequence of operands joined by ^ is evaluated jointly, not in pairs.
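The difference is easy to check in plain Java, where ^ is the ordinary pairwise XOR; existsOne below is a hypothetical helper implementing the n-ary semantics described above:

```java
class Logic {
    // true if exactly one operand is true -- the n-ary semantics of
    // dfh.grammar's ^ (hypothetical helper, not part of the library)
    static boolean existsOne(boolean... operands) {
        int trues = 0;
        for (boolean b : operands)
            if (b) trues++;
        return trues == 1;
    }
}

boolean p = true, q = true, r = true;
boolean pairwise = p ^ q ^ r;            // true: Java's ^ associates in pairs
boolean nAry = Logic.existsOne(p, q, r); // false: three operands are true, not one
```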

Why do things this way? Isn't this confusing? Well, I figure an exists-one operator is much more useful in the case where one has more than one operand. Logical exclusive or then tells you when the number of true operands is odd, which is seldom what you want. The exists-one operator tells you whether there is a unique true condition, which is more frequently what you want to know. And in any case, most often you just have two operands anyway, and when you don't you can always add in parentheses to restore the logical xor semantics. Couldn't this be done by adding a separate n-ary operator convention? Yes. I guess I value concision over consistency.

Assertions

Assertions are in effect special conditions that may appear anywhere in a rule. They do not move the character matching offset but will cause the rule or match to fail if some condition is not met.

Zero-Width Matches

Zero-width matches are rules that test whether the characters adjacent to the current offset meet some condition. You may define a great many of these with ordinary regular expressions:

String[] rules = {
        //
        "  ROOT = <matt> <s> <joe> <s> <is_sue>",//
        "   joe = /\\bjoe\\b/i",//
        "is_sue = /(?=sue)/i",//   looks ahead
        "is_mat = /(?<=matt)/i",// looks behind
        "     s = /\\s++/",//
};

The problem with doing things this way, though, is that

  1. you are limited to terminal rules
  2. your backwards assertions must be fixed-width

So dfh.grammar provides its own assertion syntax, with which you can define assertions of variable width in either direction.

Lookaheads

String[] rules = {
        //
        "ROOT = <a> /\\s++/ <b> /\\s++/ <c> /\\s++/ <d>",//
        "   a = ~          '1' /\\b\\d++/",//     the number must begin with 1
        "   b = ~+         '2' /\\b\\d++/",//     synonym; the '+' here simply means "forward"
        " b_2 = before     '2' /\\b\\d++/",//     or you can use plain English
        "   c = !          '3'     /\\b\\d++/",// the number must not begin with 3
        "   d = !+         '4'     /\\b\\d++/",//
        " d_2 = not before '4' /\\b\\d++/",//     plain English negation
        " d_3 = notbefore  '4' /\\b\\d++/",//
        " d_4 = !before    '4' /\\b\\d++/",//
        " d_5 = ! before   '4' /\\b\\d++/",//
};

The examples above are silly but they serve to illustrate the syntax. If you prefer concision, there are the single-character variants ~ and ! . To these you may add a directional suffix, + , which means "before". This suffix is provided to make the forward assertion syntax fully parallel to the backwards assertion syntax (see below), but it may be omitted.

If you prefer clarity, you can use a variety of natural language, or abbreviated natural language, expressions: before , not before , and the variants substituting ! for not and/or omitting the space.

The element immediately to the right of the assertion expression is the one that must match or fail to match.

Lookbehinds

String[] rules = {
        //
        "ROOT = <a> /\\s++/ <b>",//
        "   a = /\\b\\d++/ ~-        '1'",// the number must end in 1
        " a_2 = /\\b\\d++/ after     '1'",// plain English version
        "   b = /\\b\\d++/ !-        '2'",// the number must not end in 2
        " b_2 = /\\b\\d++/ not after '2'",// plain English version
        " b_3 = /\\b\\d++/ notafter  '2'",// and other variants
        " b_4 = /\\b\\d++/ ! after   '2'",//
        " b_5 = /\\b\\d++/ !after    '2'",//
};

Lookbehinds are parallel in syntax to lookaheads except that, for obvious reasons, there is no single-character variant. The single-character directional suffix is - and the natural language variant of this is after .

Note that you do not yourself write backwards the rule an assertion modifies; the grammar compiler does this for you. So for example

		String[] rules = {
				//
				"ROOT = !- [ <fred> /\\s++/r ] <fred>",//
				"fred = 'fred'",//
		};
		Grammar g = new Grammar(rules);
		String s = "fred fred bob";
		Matcher m = g.find(s);
		Match n;
		while ((n = m.match()) != null)
			System.out.println(n);

produces

(!-[ <fred> /\s++/r ] <fred>: 0, 4 [(!-[ <fred> /\s++/r ]: 0, 0), ("fred": 0, 4)])

That is, it only finds the first fred, and the stringification of the match keeps the description facing forwards.

Note the r modifier on the regular expression. A regular expression can only be used in a backwards assertion if it carries this modifier, which simply marks it as reversible -- if it matches a sequence, it also matches the reverse of that sequence. In this case, the expression matches a block of characters drawn from a single character class, and the reverse of such a block is another such block, so it's reversible. This restriction on the regular expressions usable in backwards assertions doesn't really reduce their power, inasmuch as the rest of the machinery of matching can be provided by dfh.grammar itself.

How It Works

Backwards assertions work by matching against a dfh.grammar.ReversedCharSequence , a wrapper class that translates character indices between the base and reversed sequences. So character indices count back from the end and index 0 refers to the last character. To this reversed character sequence it applies a rule identical to the one defined in your grammar but with all sequences reversed, so 'cat' covertly becomes 'tac' . This reversal of the sequences occurs at compilation time and it works on a copy of the reversed rule tree, so these same rules can be applied in a forward direction elsewhere. Regular expressions, because of their branching, back references, directional assertions, and so forth, are non-trivial to reverse. Also, one could plug in quite vast regular expressions which would take some time to reverse during compilation. Furthermore, backreferences make some regular expressions virtually impossible to handle by reversing the sequence.
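The wrapper idea can be sketched in a few lines (a simplified stand-in; the real dfh.grammar.ReversedCharSequence also translates indices back to the base sequence, among other things):

```java
// Present a CharSequence back to front without copying it
class Reversed implements CharSequence {
    private final CharSequence base;

    Reversed(CharSequence base) { this.base = base; }

    public int length() { return base.length(); }

    public char charAt(int i) { return base.charAt(base.length() - 1 - i); }

    public CharSequence subSequence(int start, int end) {
        return new Reversed(base.subSequence(length() - end, length() - start));
    }

    public String toString() {
        return new StringBuilder(base).reverse().toString();
    }
}

CharSequence r = new Reversed("fred bob");
// index 0 of the reversed sequence is the last character of the base
char last = r.charAt(0); // 'b'
```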

The r modifier on a regular expression marks it as reversible or reversed -- already written to be used on a reversed sequence. Only palindromic regular expressions may be used both inside and outside backwards assertions.

Backtracking barriers may produce confusing results if used in a backwards assertion. Consider the rules

ROOT = ~- <a> 'bar'
   a = 'b'+ : 'bc'

Looking at rule <a> it appears that it will always fail to match. First it matches a sequence of one or more bs. Then it puts up a barrier to any backtracking. Then it attempts to match another b. But it already matched all the bs it could find, so it must fail! But matching from right to left, as the backwards assertion does, means it can grab a b before hitting the barrier. In fact, in this case the grammar will match bbcbar.

Backtracking Barriers

A backtracking barrier is a sort of meta-assertion. It says, "the matching engine will not have to revisit this decision or anything that precedes it." It serves a function similar to the (?> ... ) in Perl-compatible regular expressions or possessive quantification. It makes matching more tractable by eliminating at one stroke some number of permutations in the matching engine's search space.

:

The single-colon barrier indicates that the rule containing it fails if it cannot complete the match after the barrier without revisiting decisions made before the barrier. If you know that the first few elements in a sequence will match only one way if at all, you can follow them with the single-colon barrier.

::

The double-colon barrier indicates that if the portion of the rule after the barrier fails to match then the entire grammar fails to match at the current offset in the character sequence.

Named Captures

<ROOT> = [{a}'a']++
<ROOT> = [{foo} 'a' /\s++/ 'b']++ [{bar} 'c'] [{baz,quux} 'd' | 'e']

Some regular expression engines, such as those of Python, Perl, and the PCRE library, provide a syntax for naming captured groups. It began with Python, where the syntax was (?P<name>pattern) . This construction assigns a name to whatever pattern matches, so later one can fetch it for whatever purpose. Other languages then adopted similar syntax for this purpose. Perl, for instance, provides /(?<name>pattern)/ . Well, for dfh.grammar I preferred to use the angle brackets, as in BNF, to label rules, so you can assign a label to a group in dfh.grammar using curly brackets instead, as illustrated above.

Unlike these various regular expression engines, dfh.grammar always captures and returns the entire match tree, so all the matches are there regardless of whether you've named them or bracketed them. The names merely serve to facilitate finding the bits matched by particular rules. Furthermore, every labeled rule always tags its match with its label, so the curly-bracket notation is chiefly useful for what would otherwise be anonymous rules. Also, as illustrated in the second example above, you may provide more than one name inside the curly brackets; names have the same restrictions as rule labels, and the name separator inside the brackets is the comma. This allows you to categorize matches in multiple ways -- as both dates and numbers, say.
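For comparison, the Perl-style named groups mentioned above look like this in native Java regular expressions:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher("2011-06");
boolean found = m.matches();
String year  = m.group("year");  // "2011"
String month = m.group("month"); // "06"
```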

Special Whitespace Handling

.= and :=

As described above, you can specify how to handle whitespace between tokens by varying the assignment symbol after the rule name:

foo  = 'a' 'b' # ordinary rule;       matches 'ab'
bar .= 'a' 'b' # optional whitespace; matches 'ab' or 'a b'
baz := 'a' 'b' # required whitespace; matches 'a b'

This is a convenience feature that facilitates writing clear, uncluttered rules. Ordinary rules will be more efficient.

Note that the whitespace magic extends into a rule's sub-rules. The rule

foo .= 'a' [ 'b' | 'c' 'd' ]

matches ab, a b, acd, a cd, ac d, or a c d.

The Dot

Funny things go on in these special whitespace rules when you add a quantifier.

Grammar g = new Grammar("foo := 'a' 'b'+");
for (String s : new String[] { "ab", "a b", "abb", "a bb", "ab b", "a b b" }) {
    System.out.print(s + ' ');
    System.out.println(g.matches(s).match() != null);
}
ab false
a b true
abb false
a bb true
ab b false
a b b false 

The grammar treats the quantified token as a “word” of variable length rather than as a variable length sequence of words. If you want to force the sequence interpretation, there is a special symbol which represents some quantity of whitespace: the dot.

Grammar g = new Grammar("foo := 'a' [ . 'b' ]+");
for (String s : new String[] { "ab", "a b", "abb", "a bb", "ab b", "a b b" }) {
    System.out.print(s + ' ');
    System.out.println(g.matches(s).match() != null);
}
ab false
a b true
abb false
a bb false
ab b false
a b b true 

The dot represents some quantity of space. With .= it allows space to be present:

Grammar g = new Grammar("foo .= 'a' [ . 'b' ]+");
for (String s : new String[] { "ab", "a b", "abb", "a bb", "ab b", "a b b" }) {
    System.out.print(s + ' ');
    System.out.println(g.matches(s).match() != null);
}
ab true
a b true
abb true
a bb true
ab b true
a b b true
versus
Grammar g = new Grammar("foo .= 'a' 'b'+");
for (String s : new String[] { "ab", "a b", "abb", "a bb", "ab b", "a b b" }) {
    System.out.print(s + ' ');
    System.out.println(g.matches(s).match() != null);
}
ab true
a b true
abb true
a bb true
ab b false
a b b false
With := , as illustrated above, it requires space to be present.

The dot isn't precisely equivalent to /\s*/ or /\s+/ . For one thing, it can be backtracked into. The rule

foo = 'a' . ' b'
can match, whereas
foo = 'a' /\s*/ ' b'
cannot. The regular expression in the latter consumes all available whitespace and cannot give any up, so ' b' , which has a leading space, can never match.
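In native Java regex terms, a terminal regular expression inside a grammar behaves like a possessive sub-pattern -- it will not give characters back -- which is easy to demonstrate:

```java
import java.util.regex.Pattern;

// analogue of foo = 'a' /\s*/ ' b' -- the \s*+ keeps the space for itself
boolean possessive   = Pattern.matches("a\\s*+ b", "a b"); // false
// an ordinary \s* can backtrack and surrender the space to ' b'
boolean backtracking = Pattern.matches("a\\s* b",  "a b"); // true
```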

For another thing, the dot does not need to consume any space; it need only be adjacent to whitespace or to an end of the sequence matched against.

foo = 'a' . ' b'
can match against a b, where there is a single space between the characters.
foo := . 'a' .
can match against a, even though it contains no whitespace.

Note that in these examples the dot is used in ordinary rules. In this case, it is equivalent to the expression /\s/r* .

Options

Matcher m = g.find(s);                                               // use all defaults
...
m = g.find(s, new Options().allowOverlap(true).longestMatch(false)); // change two parameters
...
Options opt = new Options().allowOverlap(true).longestMatch(false);  // options are reusable
m = g.find(s2, opt);
...
m = g.find(s3, opt);

With java.util.regex.Matcher matching, one can modify the matching behavior by altering various properties of the matcher object separately. With dfh.grammar matching, one must pass all the desired options in as a collection when creating the matcher. This improves efficiency and ensures the immutability of the matcher's state, aside from its position in the sequence of matches over which it is iterating. The parameter collection is an instance of dfh.grammar.Options . Among other things, it specifies which characters should be regarded as the first and last in the sub-sequence to match against, whether overlapping matches should be considered, and whether debugging output should be produced.

start

Options opt = new Options().start(10); // default: 0
System.out.println(opt.start());       // 10

If you wish the match to begin at some offset other than 0 in the character sequence, set the start. As with regular expressions, this will not affect such things as word boundary matching. If the string is "abc", there is no word boundary at offset 1 even if this is where you start matching.

end

Options opt = new Options().end(10); // default: length of sequence
System.out.println(opt.end());       // 10

end is the counterpart of start: if you do not wish any matches to include indices beyond a particular offset, you must set the end.

allow overlap

Options opt = new Options().allowOverlap(true); // default: false
System.out.println(opt.allowOverlap());         // true

By default a grammar will only return one match per offset tested against and no two matches returned will contain the same character. Suppose, however, that you wish to see all ways a grammar can match a string (possibly a very large number!). In this case, you set allowOverlap to true .
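For comparison, java.util.regex has no overlap option, but the effect can be simulated by restarting the search one character past the start of each match; a sketch (names mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {
    // Collect overlapping matches by re-running find() from one character
    // past the start of each previous match.
    static List<String> overlapping(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int from = 0;
        while (m.find(from)) {
            hits.add(m.group());
            from = m.start() + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(overlapping("aa", "aaa")); // [aa, aa]
    }
}
```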

keep rightmost match

Options opt = new Options().keepRightmost(true); // default: false
System.out.println(opt.keepRightmost());         // true

If a match fails you might still want to know, for the sake of debugging or error message generation, how far it got before it failed. This is what keepRightmost does for you. Since recording rightmost matches incurs some overhead, the default value of this option is false .

longest match

Options opt = new Options().longestMatch(true); // default: true
System.out.println(opt.longestMatch());         // true

Suppose your grammar contains the alternation

foo = 'cat' | 'cat dog'

Because cat is listed first, with longest matching turned off the first match returned when a find is done against the string camel cat dog fish will be cat, even though cat dog will also match at the same offset. If you are not allowing overlapping matches (see above), you will never see the cat dog match. If this is not what you want, leave longestMatch set to true. This causes the matcher to iterate over all possible matches at a given offset, retaining only those of maximum length. If you allow overlapping matches, the matcher will return these matches in the order of discovery. Otherwise, it will return the first maximum-length match found.

In fact, in my experience I usually want only the longest matches, and receiving just the first match produces unexpected results. The default value of longestMatch is therefore true: you have to turn longest matching off, not on.
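Ordered alternation behaves the same way in java.util.regex, which offers no longest-match option at all; a quick illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    // Like dfh.grammar with longestMatch off, java.util.regex takes the
    // first alternative that permits a match, not the longest one.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("cat|cat dog", "camel cat dog fish")); // cat
    }
}
```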

match all

Options opt = new Options().matchAll(true); // no default
System.out.println(opt.matchAll());         // true

The matchAll option is just a shortcut for setting longestMatch to false and allowOverlap to true with just one boolean.

maximum recursion depth

Options opt = new Options().maxRecursionDepth(10); // default: 3
System.out.println(opt.maxRecursionDepth());       // 10

Recursion and dfh.grammar 's top-down matching are an uneasy mix. Instead of looking at a character and figuring out what rules it could match, the grammar constructs hypothetical match trees and sees whether they fit the underlying sequence. Or rather, it constructs the left branches of trees. If it can match the leftmost leaf of the tree, it tries adding the next branch. This works fine until you have a recursive grammar that allows the possibility of infinite left-branching, something like

foo = <foo> '0' | '1'

This grammar is a silly recursive equivalent of /10*+/. Because of the way dfh.grammar tests alternates -- first alternate first -- it will attempt first to construct an infinitely deep left-branching tree, never actually trying the '1' alternate and beginning matching. In this case the solution would be to write a less silly grammar, but sometimes you simply have to write a left-branching rule. To cope with this situation dfh.grammar has a maximum recursion depth. When the grammar is compiled, it checks to see whether there are any problematic cycles of this sort. If it finds left-branching recursion, then every time it picks an alternate in the problematic rule it first checks to see whether it's in a repeating loop. (This problem can arise even if the recursive alternate isn't first. Suppose, for example, that the first n alternates never match and the n + 1th alternate is recursive.) If it finds that it is, it increments the recursion count. If it finds that it cannot proceed without incrementing the count past the maximum recursion depth limit, it abandons the alternate and tries the next.
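The depth-bounding idea can be sketched with a toy recursive-descent matcher for foo = <foo> '0' | '1'. This is my own illustration of the technique, not dfh.grammar's actual algorithm:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DepthBoundDemo {
    // All offsets at which foo = <foo> '0' | '1' can end, starting at 'at'.
    // Without the depth check the recursive alternate would recurse forever;
    // the bound forces the '1' alternate to be tried eventually.
    static List<Integer> foo(String s, int at, int depth, int maxDepth) {
        List<Integer> ends = new ArrayList<>();
        if (depth < maxDepth)                          // recursion bound
            for (int e : foo(s, at, depth + 1, maxDepth))
                if (e < s.length() && s.charAt(e) == '0')
                    ends.add(e + 1);                   // <foo> '0'
        if (at < s.length() && s.charAt(at) == '1')
            ends.add(at + 1);                          // '1'
        return ends;
    }

    public static void main(String[] args) {
        // longest match of foo against "100" starting at offset 0
        System.out.println(Collections.max(foo("100", 0, 0, 3))); // 3
    }
}
```

Note that a depth of 3 suffices here only because "100" needs just two levels of left recursion; real grammars may need a larger bound.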

study

Options opt = new Options().study(true); // default: false
System.out.println(opt.study());         // false

One way to accelerate matching is to “study” the character sequence. Studying consists of finding all the positions at which terminal rules match. This can be done with quick string-indexing methods, and it culls the number of offsets at which a match need even be attempted. However, it is generally only useful if one is using the find method for matching. The matches and lookingAt methods both consider only a single offset, so most of the work done studying is wasted.
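The gist of studying can be sketched as a pre-pass that records every offset at which each literal terminal occurs, so that only offsets which could begin a match are attempted. This is an illustration of the idea only, not dfh.grammar's implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StudyDemo {
    // Map each literal terminal to every offset where it occurs in the input.
    static Map<String, List<Integer>> study(String input, String... literals) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (String lit : literals) {
            List<Integer> offsets = new ArrayList<>();
            for (int i = input.indexOf(lit); i >= 0; i = input.indexOf(lit, i + 1))
                offsets.add(i);
            index.put(lit, offsets);
        }
        return index;
    }

    public static void main(String[] args) {
        // "cat" occurs at offsets 0 and 8; "dog" at offset 4
        System.out.println(study("cat dog cat", "cat", "dog"));
    }
}
```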

indexer

Options opt = new Options().indexer(new CharacterIndexer('z')); // default: null
System.out.println(opt.indexer());                              // CharacterIndexer(z)

Another way to accelerate matching is to use a dfh.grammar.Indexer . This is an object that chooses character offsets which might be the start offset of a match. This is like studying but instead of caching all terminal matches and determining possible start offsets, indexing just finds the start offsets. If you do both (not generally wise), studying will cache all terminal matches but indexing will generate the start offsets tried.

There are three indexers to choose from unless you write your own:

CharacterIndexer
Any offset holding the character parameter will be tried.
StringIndexer
Any offset will be tried that begins the character sequence represented by the string parameter.
PatternIndexer
A regular expression is matched against the character sequence. Wherever it matches -- overlapping matches will be tried -- the grammar will be applied.

The only indexer which is more or less guaranteed to accelerate matching is the CharacterIndexer, but it is also the least likely to be applicable to most problems. The PatternIndexer is by far the most flexible but incurs such overhead on its own that it is unlikely to provide much acceleration except where the grammar is complicated and seldom matches. See the IndexerBenchmarks class in examples/.
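The effect of a PatternIndexer can be sketched as a cheap regex scan that collects candidate start offsets, including overlapping ones, before the grammar is applied (illustration only; names mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexerDemo {
    // Every offset where the pattern matches (overlapping matches included)
    // becomes a candidate start offset for the grammar.
    static List<Integer> candidateStarts(String input, String regex) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int from = 0;
        while (m.find(from)) {
            starts.add(m.start());
            from = m.start() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(candidateStarts("abcab", "ab")); // [0, 3]
    }
}
```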

fat memory

Options opt = new Options().fatMemory(true); // default: false
System.out.println(opt.fatMemory());         // true

Another way to accelerate matching is to use a quicker match-caching mechanism that wastes more memory. If you do nothing, the grammar will choose its caching mechanism based on the length of the sequence being matched against. If the sequence is long (see below), a hash-map-based cache will be used; otherwise an array-based cache is used. If you set fatMemory to true, the array-based cache will be used regardless of the sequence length.

lean memory

Options opt = new Options().leanMemory(true); // default: false
System.out.println(opt.leanMemory());         // true

If you are matching against a particularly long character sequence, or you have a particularly complicated grammar, you may be concerned that you will exhaust the heap. You can reduce memory use by choosing a lean match cache. Whereas the fatter caches are array and hash based, the lean cache is a tree map. The array cache allocates storage space for every offset immediately. The hash cache doubles its memory allotment periodically as it fills up. The tree cache only grabs additional memory gradually as it fills up.

long string length

Options opt = new Options().longStringLength(10000); // default: 100000
System.out.println(opt.longStringLength());          // 10000

If you don't set either fatMemory or leanMemory to true, the match cache will be chosen based on the length of the sequence matched against. Specifically, if the length is less than longStringLength , an array cache will be used. Otherwise it will be a hash-based cache.
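Taken together, the three options above amount to a selection rule something like the following sketch. The precedence between the flags when both are set is my guess, and the method is purely illustrative:

```java
public class CacheChoice {
    // Which match cache a grammar might choose, per the fatMemory,
    // leanMemory, and longStringLength options (hypothetical logic).
    static String cacheFor(int length, boolean fat, boolean lean, int longStringLength) {
        if (lean) return "tree";                      // leanMemory: tree-map cache
        if (fat || length < longStringLength)
            return "array";                           // fatMemory or short sequence
        return "hash";                                // long sequence
    }

    public static void main(String[] args) {
        System.out.println(cacheFor(50, false, false, 100000));     // array
        System.out.println(cacheFor(200000, false, false, 100000)); // hash
        System.out.println(cacheFor(200000, false, true, 100000));  // tree
    }
}
```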

log

Options opt = new Options().log(System.err); // default: null
opt.trace().println(10);                     // 10

The log option is a java.io.PrintStream to which log messages will be printed if it is set to a non-null value. In fact, setting the log to some print stream causes a boolean debug option to be set to true, which causes a number of debugging code blocks to be executed in the course of matching. Logging a match thus incurs a good bit of overhead, but given the exponential complexity of matching it is often the best way to debug a grammar.