TreePath: a Generic Tree Querying Library

TreePath is like XPath but more general. It can be adapted to any tree data structure. As in XPath, a tree path expression consists of a number of steps -- <step1> <step2> <step3> . Each step selects a set of candidates nodes in the context of the candidates from the preceding step, the initial context being some node, usually the root, offered as an inital parameter to the expression. Optionally, a step has predicates which filter the candidate set. The nodes that remain after the final step are those selected by the expression. So, for example, /foo would select the foo child nodes under the root node and /foo/bar[@length = 1] would select all the bar nodes that are children of foo nodes under the root, but these bar nodes must have a length attribute with a value of 1. The sections below provide more examples and explain the mechanism and usage.

Examples

//foo
All foo nodes.
  a
 / \
b  foo
|  / |
c d  e
     |
    foo
//~a|e~
All nodes whose tags contain an a or e .
  a
 / \
b  foo
|  / |
c d  e
     |
    foo
/>foo/preceding::*
All nodes preceding the closest foo nodes under the root.
  a
 / \
b  foo
|  / |
c d  e
     |
    foo
/leaf::* or //*[@leaf]
All leaves.
  a
 / \
b  foo
|  / |
c d  e
     |
    foo

Mechanism

In order to adapt dfh.treepath to a particular variety of tree, you need to extend the Forester class. A Forester then makes Path objects for you by compiling tree path expressions. If you give a Path the some node of a tree, it will select the nodes from the tree corresponding to your expression.

Forester<Node> f = new MyForester();
Path<Node> p = f.path('//foo/>bar[@quux xor @corge]');
List<Node> l = f.select(root);

Foresters

Foresters know about your trees. In particular, they know how to find the children and parent of a node and whether a particular string matches its “tag” (or tags). Optionally, they may also know what sort of nodes you'd rather ignore and what, if anything, constitutes a unique identifier. So, for example, if your nodes are of class Node and have the methods Node Node.parent() and java.util.List<Node> Node.children() , the following class will give you a Forester that can handle your trees.

class MyForester extends Forester<Node> {

    @Override
    protected List<Node> children(Node n, Index<Node> i) {
        return n.children();
    }

    @Override
    protected boolean hasTag(Node n, String tag) {
        return n.tag().equals(tag);
    }

    @Override
    protected boolean matchesTag(Node n, Pattern p) {
        return p.matcher(n.tag()).find();
    }

    @Override
    protected Node parent(Node n, Index<Node> i) {
        return n.parent();
    }
			
}

If your nodes don't know their parents, you have to override another method:

    @Override
    protected Index<Node> index(Node root) {
        return new ParentIndex<Node>(root, this);
    }

ParentIndex will walk the tree from root , recording the parent-child relationship.

If your nodes don't have tags, you can still use tree path expressions, but you'll have to use the wildcard (*) in place of a tag name or pattern.

Paths

Foresters are Path factories: they compile path expressions into Path objects.

Path<Node> p = f.path("id(foo)//bar/*[@att = 1.5][not @quux]")

Once you have a path object -- you should reuse them when possible -- you call its select method, perhaps with an Index You need to provide an index if the forester by itself cannot generate the index from whatever node you give it. In particular, if your nodes don't know their parent and you use a ParentIndex, this index assumes whatever node you give it initially is the root. If this is false, and in particularly if it is a damaging error, then you had best generate the true index from the root and use this. Of course, you will gain some efficiency by reusing indexes as well. . If you do not provide an index, one will be generated.

List<Node> list1 = p.select(n1), list2 = p.select(n2, index);

Indices

An index is an object that caches information about a tree for rapid retrieval. Minimally it caches an appropriate forester and the root node. It may also cache the mapping from child to parent and unique ids to nodes. The minimal index is cheap to construct, so there is little lost in constructing a new one with every selection. If you are doing many matches against the same tree, you can get a single index for it initially using your forester.

Index<Node> i = f.index(root);

NodeTests

Node tests classify nodes into types -- nodes with a particular tag or whose tag matches a particular pattern, zero-width nodes, text nodes, whitespace nodes, etc. A node test has only the node itself to examine, no selection context (other than the tree index). They are used to interpret the steps, but the chief way one will interact with them while using path expressions is in defining categories of nodes to ignore.

NodeTest<Node> t = new NodeTest() {
    @Override
    public boolean passes(Node n, Index<Node> i) {
        return n.someTestOrOther();
    }
}
Forester<Node> f = new MyForester(t); // now all the nodes that pass someTestOrOther are invisible
	

Formalism

The following sections explain the theory of tree path expressions (mostly).

Grammar

The following dfh.grammar grammar specification defines the set of valid tree path expressions.

       treepath = <path> [ <s> "|" <s> <path> ]*

        and_cnd = <condition> [ <s> /&|(?<!\/)\band\b(?!\/)/ <s> <condition> ]+ (and_precedence)
            arg = <treepath> | <literal> | <num> | <attribute> | <attribute_test> | <condition>
           args = "(" <s> <arg> [ <s> "," <s> <arg> ]* <s> ")"
      attribute = <aname> [ <args> ]?
 attribute_test = <attribute> <s> <cmp> <s> <value> | <value> <s> <cmp> <s> <attribute>
      condition = <term> | <not_cnd> | <or_cnd> | <and_cnd> | <xor_cnd> | <group>
     first_step = [{segment} <separator>?+ <step> ]
          group = "(" <s> <condition> <s> ")"
        not_cnd = /!|(?<!\/)\bnot\b(?!\/)/ <s> <condition> (not_precedence)
         or_cnd = <condition> [ <s> /\|{2}|(?<!\/)\bor\b(?!\/)/ <s> <condition> ]+
           path = <first_step> [ <subsequent_step> ]*+
      predicate = "[" <s> [ <signed_int> | <treepath> | <attribute_test> | <condition> ] <s> "]"
           step = [ <full> | <abbreviated> ] [ <predicate> ]*+
subsequent_step = [{segment} <separator> <step> ]
           term = <attribute> | <attribute_test> | <treepath>
          value = <literal> | <num> | <attribute>
        xor_cnd = <condition> [ <s> /^|(?<!\/)\bxor\b(?!\/)/ <s> <condition> ]+ (xor_precedence)
    abbreviated = !-[ "//" | "/>" ] [ "." | ".." | <id> ]
           full = [ <axis> ]?+ <forward>
            num = <signed_int> | <float>
           axis = !-[ "//" | "/>" ] <axis_name> "::"
          float = /[+-]?+/ <int>?+ /\.\d++/ [ /e[+-]?+/i <int> ]?+
        forward = <wildcard> | <specific> | <pattern>
             id = "id(" /(?:[^)\\]|\\.)++/ ")"
        literal = <squote> | <dquote>
     signed_int = /[+-]?+/ <int>
          aname = /@(?:[\p{L}_$]|\\.)(?:[\p{L}_$\p{N}]|[-:](?=[\p{L}_\p{N}])|\\.)*+/
      axis_name = /(?>s(?>ibling(?>-or-self)?+|elf)|p(?>receding(?>-sibling)?+|arent)|leaf|following(?>-sibling)?+|descendant(?>-or-self)?+|child|ancestor(?>-or-self)?+)/
            cmp = /[<>=]=?|!=/
         dquote = /"(?:[^"]|\\.)*+"/
            int = /\b(?:0|[1-9][0-9]*+)\b/
        pattern = /~(?:[^~\\]|\\.)++~/
              s = /\s*+/
      separator = /\/[\/>]?/
       specific = /[\p{L}_](?:[\p{L}\p{N}_]|[-:](?=[\p{L}_\p{N}])|\\.)*+/
         squote = /'(?:[^']|\\.)*+'/
       wildcard = "*"

This grammar contains a few conditions; e.g., (not_precedence) . These conditions all govern the precedence of logical operations, ensuring that only valid parse trees for these operations are returned. The class that implements this grammar, including the conditions, is dfh.treepath.PathGrammar .

Steps

Every tree path expression is a sequence of steps. Each step begins with a collection of nodes and chooses a new set, taking each node in the context set and selecting a new set and filtering it. For example, a step might take a particular context node, select all its children, and then filter these down to those with a particular tag or those with a particular index -- a single node. Some examples:

/b
Of the children of the context node, take those with the b tag.
//b
Of all the descendants of the context node, and the context node itself, take those with the b tag.
//b[@foo = 'bar']
Of all the descendants of the context node, take those with the b tag, then filter these down to those whose @foo property has the value "bar".
//b[@index > 3][@foo = 'bar']
Of all the descendants of the context node, take those with the b tag, then drop the first 3, then filter the remainder down to those whose @foo property has the value "bar".

A step consists of a separator -- / , // , or /> --, a node type -- e.g., b --, and zero or more predicates.

Separators

none
Occurring only before the initial step, the null separator indicates that path is relative. The candidate nodes are chosen from the appropriate axis defined relative to the context node.
/
For non-initial steps, the definition is the same as for the null separator. For the initial step, it indicates that the context node is the root.
//
For the initial step, the candidate node axis is descendant-or-self , otherwise it is descendant .
/>
This separator indicates that the candidate nodes are those nodes of the specified node type which are closest to the context node, meaning the path from the context node to them doesn't contain any nodes of the specified type. In the XML document <root><a><b><a/></b></a></root> , the expression />a would select the a element immediately under the root but not the nested element -- the latter element is screened from the root by the first.

This separator is unique to tree path expressions. I have included it because I often find myself wanting to define this sort of path for a tree.

The latter two separator types cannot be followed by an explicit axis specification. The last cannot be followed by the wildcard node type.

Node Types

Node type expressions select an initial candidate set of nodes. There are three varieties, all defined relative to a node's “tag” or tags. A node's tag, in turn, is simply a string type label. Unlike in XPath, tree path expressions do not assume a node has a single tag, or any tag, though in the latter case expressions will be less efficient and one cannot use the /> separator.

Literal
/a
//foo
/>bar
corge
...

Literal node type expressions require that a selected node have precisely the type specified.

Pattern
/~a~
//~foo~
/>~bar~
~corge~
...

Pattern node type expressions require only that the pattern -- the bit between the tildes -- match some part of one of the node's tags. The pattern must compile into a Java regular expression or an error will be thrown when the path is compiled.

Wildcard
/*
//*
*
		
	

The wildcard node type expression matches every node.

Axes

Each step has an implicit or explicit axis -- a set of candidate nodes chosen by their graphical relationship to a context node. The default axis is child:: -- all the children of the context node. If the separator preceding the current step is // , the understood axis is descendant:: (or descendant-or-self:: , if it's the first step in the path) . For a thorough understanding of axes try Googling "axis xpath". For the most part, the concepts are the same. There are some differences in the axes available in tree path expressions, however. Most importantly, tree path expressions provide a leaf:: axis, which is the list of nodes at or under the context node that have no children.

See dfh.treepath.PathGrammar.Axis for the full list of axes.

Predicates

Predicates are filters which take a candidate set and eliminate nodes that fail to pass some test. A step may have zero or more predicates. A predicate is a function from a node and the other nodes it was selected with given a particular context node to the values true or false . If it evaluates to true , the node is retained, otherwise it is discarded. As in XPath, predicate expressions are always enclosed in square brackets.

Index

/a[0]
//a[1]
/>a[2]
a[3]
...

Indices are the simplest predicate. They simply pick the node with the appropriate index from the collection being evaluated. NOTE: the index is zero-based, unlike in XPath. Why? Because I find one always uses these path expressions while programming in a language that uses zero-based indices, and switching back and forth is confusing, particularly since, unlike in SQL, the formalism *looks* the same in the expression language and the programming language. In Java, a[0] picks the first element in the a array, so one expects a[0] to be the expression that picks the first a node immediately under the context node. Also note that //a/b[0] is going to pick b nodes that are the first node of that type immediately under some a node -- that is, it will not pick the first such node. The collection selected from is that picked by one of the context nodes selected by the preceding step, not the collection of all nodes selected by all the context nodes together.

One may use negative indices in tree path expressions. If you add a negative index to the size of the collection indexed you get the corresponding positive index, so -1 is the index of the last node in the collection, -2, the index of the second to last node, and so forth. For example in

  a
 /|\
c d e

we have the following correspondences:

  a
 /|\
c d e

/a/*[0] or /a/*[-3]


  a
 /|\
c d e

/a/*[1] or /a/*[-2]


  a
 /|\
c d e

/a/*[2] or /a/*[-1]

If you overshoot the end of the collection in either direction you simply get no nodes.

Path

/a[foo]
//a[//bar]
/>a[sibling::quux | ancestor::corge]
a[/>nef[2]]
...

Evaluates to true if the collection of nodes it selects is not empty. For the XML document <root><a/></root> , the path /.[a] will select the root node but /.[b] will select nothing.

Attribute

/a[@foo]
//a[@bar(1)]
/>a[@quux(//corge, @true)]
a[@nef('a', 3, -1.5)]
...

Attributes are described in more detail below, but briefly, they are functions from candidate nodes, and perhaps other parameters, to values. The following values all evaluate to false:

Everything else evaluates to true .

Attribute Test

/a[@foo = @bar]
//a[@bar(1) < 2]
/>a[@quux(//corge, @true) > 'mim']
a[@nef('a', 3, -1.5) == @pick(//twee, 0)]
...

Attribute tests allow you to override the default boolean interpretation of attributes.And note that there must be an attribute on one or the other side of the operator in a test. //a[1 < 2] and //a[foo = bar] and the like will not parse as tree path expressions. However, one can convert anything into an attribute with the @echo attribute, so this in the end is no limitation. In general there is little value in predicates like [@echo(1) < 2] that evaluate the same way for all nodes, but @echo lets you construct them. Attribute tests always involve a pair of expressions separated by an operator. The available operators are

=
Requires equality but not necessarily identity. Objects and strings use equals(Object) ; collections use equality of cardinality; numbers use equality of value.
==
Identity. For numbers, this is the same as equality, but objects require object identity. Collections require identical members and iteration order.
!=
The negation of = .
<
Sort order comparison. Evaluates to true if the left argument sorts before the right. Collections are compared by cardinality.
>
Like < but requiring the opposite order.
<=
Like < but also returning true if the two arguments cannot be sorted.
>=
Like > but also returning true if the two arguments cannot be sorted.

Logical Expression

/a[@foo or @bar]
//a[not />corge[3]]
/>a[@quux != 3 ^ @nef]
a[not @foo and (//bar or @chuzzle > 1)]
...

All the other predicate types aside from index predicatesOne can use the @index attribute to duplicate the functionality of index predicates with attributes. //a[0] is equivalent to //a[@index = 0]. can be combined with boolean operators.

not or !
Takes a single argument and reverses its value. Highest precedence.
and or &
Takes joins two or more values. Returns true if all values are true. Second highest precedence.
or or ||The double bar boolean operator is distinct from the single bar alternation operator. //a[b | c] will parse to a different path from //a[b || c]. In this case, the two paths are logically equivalent -- both select a nodes that have either b or c nodes as children, though the first collects all such nodes into one pool whereas the other collects them into two and returns true if either pool is non-empty. But consider //a[b | c & d] versus //a[b || c & d]. The first will be equivalent to //a[(b | c) & d] whereas the second will be equivalent to //a[b || (c & d)].
Takes joins two or more values. Returns true if at least one value is true. Lowest precedence.
xor or ^
Takes joins two or more values. Returns true if one and only one value is true. Third highest precedence.
( ... )
Groups arguments to make the order of operations explicit or change the precedence of operations.

I have provided both concise and long versions of all the operators to suit varying tastes, but I tend to use the long operators myself to prevent confusion.

Attributes

In tree path attributes are functions that return some property of a node in a selection context. For example, there might be an attribute @tag that returns the tag value of a node, or @position that returns its index in the candidate list it is found inThis is not to say that internally candidate sets are always instances of java.util.List. Rather, they are collections that maintain the order of node discovery. The index of an item in such a collection is just the number of times one must iterate over the candidate collection to find it, counting the first iteration as iteration 0. .

Attributes are implemented as methods in some Forester with a particular signature and bearing the @Attribute anotation. For example, here is the implementation of the @echo attribute.

	@Attribute(description = "converts anything into an attribute")
	protected final Object echo(N n, Collection<N> c, Index<N> i, Object o) {
		return o;
	}

The first three parameters in the method signature are required. The rest can be whatever you want them to be, though they must correspond to one of the types understood by path expressions. Here is the relevant line from the grammar:

            arg = <treepath> | <literal> | <num> | <attribute> | <attribute_test> | <condition>

Tree path expressions correspond to a java.util.Collection of nodes; literals -- single- or double-quoted strings --, to String ; nums to Integer , Double , or just Number ; attributes to arbitrary objects; and attribute tests and conditions, to Boolean . You can use primitive types in lieu of Boolean and so forth. You may also use variadic methods. For example, here is the implementation of the @log attribute:

	@Attribute(description = "records parameters to a log stream")
	protected final Boolean log(N n, Collection<N> c, Index<N> i, Object... msg) {
		for (Object o : msg)
			loggingStream.println(o);
		return Boolean.TRUE;
	}

Attributes are discovered by reflection, which is complicated code to write and inefficient but gives us incredible flexibility. All attributes in a forester's inheritance tree will be discovered aside from private attributes, and these won't be found even in the terminal class. I could handle accessibility modifiers differently, and perhaps I will in the future, but for now this is how things work. The reflective code gathers attributes the first time an instance of the class is created. This reduces the reflection overhead. Some checking of method signatures is done at this time, but it's best to create unit tests for any attributes you define. Type erasure limits the extensiveness of the signature checking ()as does my laziness).

The base Forester class provides a small number of attributes to all extensions:

@true
The true value.
@false
The false value.
@null
The null value.
@log
An attribute that always returns true and prints each of its parameters on its own line in a debugging stream, System.Err , by default.
@uid
An expression uniquely identifying the node by its position in the tree. /0/1 is the @uid of the second child of the first child of the root node.
@id
A unique string id for the node, if any.
@this
The node itself.
@root
Whether the node is the root of its tree.
@leaf
Whether the nodes is a leaf, meaning it has no children.
@pick
Picks an item from a collection. @pick(/*, 0) will return the first child of the root node.
@size
The size of a collection. If //foo selects 3 nodes, then @size(//foo) will return 3.
@index
The index of the node in the collection it was selected with. //foo/bar[@index = 1] will select the same nodes as //foo/bar[1] .
@echo
The value of its parameter. //foo/bar[@echo(1)] will return 1. This attribute is chiefly useful as a wrapper to convert anything into an attribute that can be used in an attribute test.
@tsize
The number of nodesThis will be the count of non-ignorable nodes. in the tree rooted at the context node. //*[@leaf & @tsize = 1] is equivalent to //*[@leaf] , while //*[@leaf & @tsize = 2] will always return an empty collection.
@width
The number of leaves under the current node. //*[@leaf & @width = 1] is equivalent to //*[@leaf] , while //*[@leaf & @width = 2] will always return an empty collection.
@depth
The number of steps between the current node and the root. //*[@root & @depth = 0] is equivalent to //*[@root] (which is equivalent to the more efficient expression /. ) , while //*[@root & @depth > 0] will always return an empty collection.
@height
The length of the longest path from the current node to a leaf, where the @height of a leaf is 1. The height of a in
  a
 / \
b  foo
|  / |
c d  e
     |
    foo
is 4.

The FunctionalForester class is a library of additional attributes, many of them copied from XPath. A forester that extends FunctionalForester , MatchPath , for instance, gets access to all these attributes.

@Attribute Parameters

An attribute is created by annotating a method in a Forester with the dfh.treepath.Attribute annotation, optionally providing a name and/or a description.

		
		@Attribute
String foo(Node n, Collection<Node> c, Index<Node> i) { return "foo"; }

@Attribute("short")
String reallyReallyLongMethodName(Node n, Collection<Node> c, Index<Node> i, int j) { return "short" + j; }

@Attribute(value = "longer", description = "has a longer name than short")
String alsoLong(Node n, Collection<Node> c, Index<Node> i, String... bar) {
    StringBuilder b = new StringBuilder();
    for (String s: bar)
        b.append(s);
    return b.toString(); 
}

@Attribute(description = "a roundabout way of putting a number into an expression")
Number num(Node n, Collection<Node> c, Index<Node> i, Number num) { return num; }
value

The value parameter is useful to avoid namespace collisions, use Java keywords like this as attribute names, and dodge Java naming conventions. If no value parameter is given the attribute name will be the method name.

description

The description provides a bit of metadata about the attribute. This is visible in the code and javadocs, where it is largely redundant, but it is also exposed through the Forester.attributes() method. The latter returns an alphabetized map from the names of attributes available to the forester to two-member arrays of meta-data: the attribute description and an abbreviated description of the corresponding method signature. The main method of dfh.treepath.MatchPath uses this method to summarize all the attributes available to that class:

	public static void main(String[] args) {
		int length = 0;
		for (String s : MatchPath.standard().attributes().keySet())
			length = Math.max(length, s.length());
		String format = "@%-" + length + "s : %s%n%" + length + "s    %s%n";
		for (Entry<String, String[]> e : MatchPath.standard().attributes()
				.entrySet()) {
			System.out.printf(format, e.getKey(), e.getValue()[0], "",
					e.getValue()[1]);
		}
	}
returns
@assertion       : whether node is an assertion
                   public boolean dfh.treepath.MatchPath.assertion(...)
@depth           : the number of steps between node and root
                   protected int dfh.treepath.Forester.depth(...)
@echo            : converts anything into an attribute
                   protected final Object dfh.treepath.Forester.echo(...,Object)
@explicit        : whether the rule generating this node is explicitly defined
                   public boolean dfh.treepath.MatchPath.explicit(...)
@false           : the false value
                   protected final Boolean dfh.treepath.Forester.False(...)
@group           : group matched
                   public String dfh.treepath.MatchPath.group(...)
@height          : the longest path between node and a leaf; 1 if node is a leaf
                   protected int dfh.treepath.Forester.height(...)
@id              : the nodes string id, if any
                   protected String dfh.treepath.Forester.id(...)
@index           : the index of the context node in the context collection
                   protected int dfh.treepath.Forester.index(...)
@label           : rule generating match
                   public String dfh.treepath.MatchPath.label(...)
@leaf            : whether the context node is a leaf
                   protected boolean dfh.treepath.Forester.isLeaf(...)
@length          : length of group matched
                   public int dfh.treepath.MatchPath.length(...)
@log             : records parameters to a log stream
                   protected final Boolean dfh.treepath.Forester.log(...,Object[])
@m:abs           : absolute value
                   protected Number dfh.treepath.FunctionalForester.abs(...,Number)
@m:ceil          : round up
                   protected Number dfh.treepath.FunctionalForester.ceil(...,Number)
@m:floor         : round down
                   protected Number dfh.treepath.FunctionalForester.floor(...,Number)
@m:int           : integral portion
                   protected Integer dfh.treepath.FunctionalForester.integralPortion(...,Number)
@m:max           : maximum value
                   protected Number dfh.treepath.FunctionalForester.max(...,Number[])
@m:min           : minimum value
                   protected Number dfh.treepath.FunctionalForester.min(...,Number[])
@m:prod          : product of values
                   protected Double dfh.treepath.FunctionalForester.product(...,Number[])
@m:round         : round to nearest whole number
                   protected Number dfh.treepath.FunctionalForester.round(...,Number)
@m:sum           : sum of values
                   protected Double dfh.treepath.FunctionalForester.sum(...,Number[])
@null            : the null value
                   protected final Object dfh.treepath.Forester.Null(...)
@pick            : picks a node from a collection
                   protected final Object dfh.treepath.Forester.pick(...,java.util.Collection,int)
@root            : whether the node is root
                   protected boolean dfh.treepath.Forester.isRoot(...)
@s:cmp           : compare string order
                   protected Integer dfh.treepath.FunctionalForester.compare(...,String,String)
@s:concat        : string concatenating items
                   protected String dfh.treepath.FunctionalForester.concatenate(...,Object[])
@s:contains      : whether the string contains a particular infix
                   protected boolean dfh.treepath.FunctionalForester.contains(...,String,String)
@s:ends-with     : whether the string parameter has a particular suffix
                   protected boolean dfh.treepath.FunctionalForester.endsWith(...,String,String)
@s:find          : look for pattern in string
                   protected boolean dfh.treepath.FunctionalForester.find(...,String,String)
@s:index         : the index of an infix in the string
                   protected int dfh.treepath.FunctionalForester.index(...,String,String)
@s:join          : concatenate items with separator
                   protected String dfh.treepath.FunctionalForester.join(...,Object,Object[])
@s:lc            : lowercase
                   protected String dfh.treepath.FunctionalForester.lowercase(...,String)
@s:len           : string length
                   protected Integer dfh.treepath.FunctionalForester.length(...,String)
@s:looking-at    : match prefix
                   protected boolean dfh.treepath.FunctionalForester.lookingAt(...,String,String)
@s:matches       : whether the string parameter matches a pattern
                   protected boolean dfh.treepath.FunctionalForester.matches(...,String,String)
@s:nspace        : normalize whitespace
                   protected String dfh.treepath.FunctionalForester.normalizeWhitespace(...,String)
@s:replace       : replace all occurrences of infix
                   protected String dfh.treepath.FunctionalForester.replace(...,String,String,String)
@s:replace-all   : replace all occurrences of pattern
                   protected String dfh.treepath.FunctionalForester.replaceAll(...,String,String,String)
@s:replace-first : replace first occurrence of pattern
                   protected String dfh.treepath.FunctionalForester.replaceFirst(...,String,String,String)
@s:starts-with   : whether the string parameters as a particular prefix
                   protected boolean dfh.treepath.FunctionalForester.startsWith(...,String,String)
@s:substr        : select substring
                   protected String dfh.treepath.FunctionalForester.substring(...,String,Integer[])
@s:trim          : trim whitespace
                   protected String dfh.treepath.FunctionalForester.trim(...,String)
@s:uc            : uppercase
                   protected String dfh.treepath.FunctionalForester.uppercase(...,String)
@s:ucfirst       : capitalize only first letter
                   protected String dfh.treepath.FunctionalForester.ucFirst(...,String)
@size            : the size of the node collection
                   protected int dfh.treepath.Forester.size(...,java.util.Collection)
@this            : the context node
                   protected final Object dfh.treepath.Forester.This(...)
@true            : the true value
                   protected final Boolean dfh.treepath.Forester.True(...)
@tsize           : the size in nodes of the tree rooted at node
                   protected int dfh.treepath.Forester.tsize(...)
@u:def           : whether the parameter value is non-null
                   protected final Boolean dfh.treepath.FunctionalForester.defined(...,Object)
@u:millis        : current time in milliseconds
                   protected Long dfh.treepath.FunctionalForester.millis(...)
@uid             : unique id of context node representing its position in its tree
                   protected final String dfh.treepath.Forester.uid(...)
@width           : the number of leaves under node; 1 if node is a leaf
                   protected int dfh.treepath.Forester.width(...)
@zero            : whether node is zero-width
                   public boolean dfh.treepath.MatchPath.zero(...) 

The ellipsis represents arguments you do not provide in tree path expressions. A final array represents a varargs argument.

Rationale

The rationale behind the attributes() method and the description parameter of @Attribute that feeds it is just that attributes are not easily discovered in javadocs or code, particularly when they may be added ad hoc, but they are particularly important to tree path expressions. So I made attributes self-documenting.

Ad Hoc Attributes

If you wish to define a particular attribute you can add it to an anonymous inner class:

Forester<N> f = new MyForester() {
    @Attribute
    boolean foo(N n, Collection<N> c, Index<N> i) {
        return bar(n) > quux(n);
    }
};
Path<N> p = f.path(//*[@foo]);

This is useful mostly for rapid exploratory hacks. Once you find something that works, you can move your ad hoc attributes into your forester class or an extension.

Mixins

The typical way to provide attributes to an extension of tree path is to implement them in a class extending Forester . This isn't done for any principled reason arising from the logic of tree path expressions, however, but because Java's implementation of inheritance and reflection make this easy. It would be more natural to put attributes into libraries and load them into a Forester class as needed. This mixin structure and Java are an uncomfortable fit, but I have implemented it as an experimental improvement of TreePath.

A library of attributes to be mixed in orthogonally to the inheritance hierarchy must extend the class AttributeLibrary , must be public, and must have a public zero-argument constructor. For example:

public class TestLibrary extends AttributeLibrary<Node> {
    public TestLibrary() {
        super();
    }

    @Attribute
    String foo(Node e, Collection<Node> c, Index<Node> i) {
        return "foo";
    }
}

To load the attributes provided by this library in a class, you call Forester.mixin(Class<? extends AttributeLibrary<N>>...) . You can load libraries in at any time, but if you want the new attributes to persist through serialization it is recommended that you add them in by overriding the init method like so:

@Override
protected void init() {
    if (attributes == null) {
        super.init();
        mixin(Library1.class, Library2.class);
    }
}

Note that this method can be used only to supplement the attributes obtained from the class hierarchy. If MyForester provides attribute @foo and we try to mix an identically named method in from a library, the native @foo will win out.

Dependencies, Thread Safety, and Serialization

The only dependency of this library is my dfh.grammar library, which has no dependencies.

All the classes in this library are thread safe.

All the classes in this library aside from Index are serializable, though you have to take care if you use mixins.

Why

I use my grammar library, dfh.grammar , a lot. Generally when I match something with a grammar I'm not satisfied simply to know a match is possible, what was matched, or where the match starts and ends: I want the entire parse tree, and given the tree, I want to find particular parts and do things with them. If I'm parsing a date, I want to find the year part and the month part and so on. If I'm parsing an email address, I want to separate the user name from the domain and I want to know the subdomains. For many purposes the methods provided by the match object itself suffice, but often enough I found myself writing the same sort of picky code that basically drilled down a particular path in a particular sort of parse tree. If this were XML I'd use XPath. If it were HTML I'd use CSS (or perhaps XPath after converting it to XHTML). So I thought I'd make something like XPath for my parse trees. I didn't want it to be precisely the same as XML, however, because my parse trees weren't precisely like XML trees. The notion of “attribute” wasn't native to the problem, nor were namespaces. While I was thinking about this it occurred to me that I could make a generic library that could handle any sort of tree, adapting the attribute idea to represent callbacks.

Having started along this path, I figured I would add features to the expression language that I often wanted: an easy way to obtain leaves or nodes near to a particular node but not screened from it by other nodes of the same type -- what the />foo expression represents. I found I could make it trivial to adapt the library to a new sort of tree and trivial to add attributes, both of which were very nice. So here we are.

Future Plans

I may add mathematical expressions to the things that can occur in predicates.

If I find the time I might implement the tree path specification in other languages, starting with Perl. I find this useful. I'd like someone else to as well. If I make it available to someone who prefers to work in another language, particularly one that facilitates code sharing as Perl does, then I increase the odds that someone other than myself will use it.