Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-02-25T17:49:55+0000


The XPath functions we use most

There's a more complete list, with examples, at http://www.w3schools.com/xsl/xsl_functions.asp.

A few useful XPath features

Variables
Variable names begin with a dollar sign, and you can create them as needed. See the discussion of the for construction, below.
The dot: .
In XPath the dot represents the current node, whatever it is. For example, //age[. = 10] finds all of the <age> elements in the document and then filters them according to the predicate. The dot within the predicate means to take each <age> element in turn (make it the current node) and test whether it is equal to the value 10.
for $i in (sequence) return …
The for construction can be used for iteration. for $i in (1, 3, 5) return (//sp)[$i] will return the first, third, and fifth <sp> elements in the document. Variable names begin with a dollar sign, so this XPath expression creates a variable $i, sets it to each of the values in the parenthesized sequence in turn, and then uses that value as a numerical predicate to retrieve the corresponding <sp> element. The name of the variable is arbitrary, except that it must begin with the dollar sign.
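
As a small illustration of the for construction that does not depend on any particular document, here are two expressions that can be evaluated one at a time in an XPath 2.0 processor. The numbers are invented for the example, and each comment shows the result we would expect.

```xpath
(: multiply each item in the sequence by 2 :)
for $i in (1, 3, 5) return $i * 2
(: result: 2, 6, 10 :)

(: the variable name is arbitrary; here it is $n instead of $i :)
for $n in (1 to 4) return $n * $n
(: result: 1, 4, 9, 16 :)
```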

General-purpose functions

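distinct-values((arg*))
Removes duplicates from a sequence of values. Like the numerical functions below, it takes a single argument that is a sequence, so a sequence written out as a list needs its own set of parentheses.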
reverse((arg*))
Reverses the order of the items in a sequence. Handy for counting backwards; the XPath expression (1 to 10) yields ten numbers in order but (10 to 1) yields an empty sequence because XPath can’t count backwards. You can overcome this limitation with reverse(1 to 10). (Alternatively you could use for $i in (1 to 10) return 11 - $i.)
name(arg?)
Returns the name (GI) of the node. //*/name() will find all elements in the document and instead of returning them (tags, contents, and all), it will return just their names.
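
The following sketch shows distinct-values() and reverse() applied to literal sequences (invented for illustration), so the expressions can be tried without a document. Evaluate each expression separately.

```xpath
distinct-values((1, 2, 2, 3, 1))
(: result: 1, 2, 3; the order of the returned values is implementation-dependent :)

reverse(1 to 5)
(: result: 5, 4, 3, 2, 1 :)

count(5 to 1)
(: result: 0, because (5 to 1) is an empty sequence :)
```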

Casting

number(arg), string(arg)
Convert the argument to the specified type. If you can’t be sure in advance that the conversion will succeed (what does it mean to convert a string of letters to a number?), note that number() returns NaN (“not a number”) rather than raising an error, and look up the remaining details in Kay.
xs:string(arg), xs:integer(arg), xs:double(arg), xs:decimal(arg), xs:float(arg)
Cast to specific datatypes. May generate an error if the input value isn’t castable as the target datatype. There are other datatypes that can also be used here.
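
To make the difference between the two styles of conversion concrete, here is a brief sketch using literal strings (invented for the example). It assumes that the xs prefix is bound to the XML Schema namespace, as it typically is in XPath 2.0 tools. Evaluate each expression separately.

```xpath
number('42') + 1
(: result: 43 :)

number('abc')
(: result: NaN; number() does not raise an error :)

xs:integer('42') + 1
(: result: 43 :)

'abc' castable as xs:integer
(: result: false; xs:integer('abc') itself would raise an error :)
```

The castable as test is one way to check a value before casting it.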

Strings

concat(string+), string-join((string+),string)
concat() joins the strings as is. string-join() lets the user specify a sequence of strings to join (the first argument) and a separator string to insert between the items.
normalize-space(string)
Converts all white space to space characters, compresses sequences of spaces into a single space, and strips leading and trailing spaces.
upper-case(string), lower-case(string)
Changes case of string. Useful for case-insensitive searching, sorting, comparing, etc. We never use upper-case() ourselves.
string-length(string)
Returns the length of the string in characters. Often used as a path step, e.g., //sp/string-length(.). The preceding XPath finds all of the <sp> elements and then returns the length of each in turn (the dot refers to the current context node, that is, to each individual <sp> as you loop through them). You can't use string-length(//sp) because the string-length() function accepts at most one string as its argument, and //sp is likely to return multiple nodes.
contains(string1, string2), starts-with(string1, string2), ends-with(string1, string2)
Tests whether the first string has the property specified by the second, that is, whether the first contains, starts with, or ends with the second. Useful for filtering; //sentence[ends-with(., '?')] finds all <sentence> elements and keeps only the ones that end with a question mark. The question mark in quotation marks is the second string. The dot (not in quotation marks) is an XPath way of representing the current node, whatever it is. For each <sentence> retrieved by //sentence, then, within the predicate the dot is treated as the value of that particular (current) <sentence> node.
translate(string1, string2, string3)
Takes string1 and replaces every instance of a character in it from string2 with the corresponding character from string3. translate('string','ti','pa') will change string into sprang. Can only do one-to-one replacements; see also replace(). Can be used for deletion by making string3 shorter than string2; //p/translate(., 'aeiou', '') will strip all the vowels from each <p> by replacing them with nothing.
substring-before(string1, string2), substring-after(string1, string2)
Returns the part of string1 before (or after) the first occurrence of string2. Useful for breaking apart certain structures, e.g., the area code for a ten-digit US telephone number in normal 123-456-7890 format is telephone/substring-before(.,'-').
matches(string, regex)
Tests whether the regex (regular expression pattern) occurs in the string. We cover regex later; for now, one type of regex is a plain string, so (with oversimplification) matches(string1, string2) is equivalent to contains(string1, string2). The real power of matches() will become clearer once we get to regex.
replace(string, regex, regex-replace)
The translate() function, above, can only replace single characters with single characters. The replace() function can match regex patterns and perform more complicated replacements. Stay tuned.
tokenize(string, regex)
Breaks a string into parts by dividing at the regex. Handy for processing IDREFS attributes; ask for details.
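
The string functions above can all be tried on literal strings, without a document. In the sketch below the sample strings are invented for illustration; each line is a separate expression, and the comment after it shows the result we would expect from an XPath 2.0 processor.

```xpath
concat('to', 'be')                             (: 'tobe' :)
string-join(('to', 'be', 'or', 'not'), ' ')    (: 'to be or not' :)
normalize-space('  to   be  ')                 (: 'to be' :)
lower-case('Hamlet')                           (: 'hamlet' :)
string-length('Hamlet')                        (: 6 :)
contains('Hamlet', 'aml')                      (: true :)
starts-with('Hamlet', 'Ham')                   (: true :)
ends-with('To be?', '?')                       (: true :)
translate('string', 'ti', 'pa')                (: 'sprang' :)
translate('Ophelia', 'aeiou', '')              (: 'Ophl'; the capital O is not in the list :)
substring-before('123-456-7890', '-')          (: '123' :)
substring-after('123-456-7890', '-')           (: '456-7890' :)
matches('Hamlet', 'aml')                       (: true :)
replace('123-456-7890', '-', '.')              (: '123.456.7890' :)
tokenize('h1 h2 h3', ' ')                      (: 'h1', 'h2', 'h3' :)
```

Note that translate() is case-sensitive, which is why lower-case() is often applied first.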

Numbers

count((arg*)), avg((arg*)), max((arg*)), min((arg*)), sum((arg*))
Count, average (mean), largest value, smallest value, and total of all values. The arguments have to make sense; trying to run sum() over letters will generate an error. Note the double parentheses; these functions take a single value that is a sequence, not a set of values. sum(1, 2, 3) will generate an error because it lists three values. sum((1, 2, 3)) yields 6 because there is just a single argument, a sequence of three values.
ceiling(num), floor(num), round(num)
These take a single number as their argument and round it up, down, or to the closest integer.
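
A quick way to get a feel for these functions is to run them over literal sequences. The values below are invented for the example; each line is a separate expression, followed by the result we would expect.

```xpath
count(('a', 'b', 'c'))   (: 3 :)
sum((1, 2, 3))           (: 6 :)
avg((1, 2, 3, 4))        (: 2.5 :)
max((3, 9, 4))           (: 9 :)
min((3, 9, 4))           (: 3 :)
ceiling(2.1)             (: 3 :)
floor(2.9)               (: 2 :)
round(2.5)               (: 3 :)
round(-2.5)              (: -2; a value exactly halfway is rounded toward positive infinity :)
```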

Boolean

not(arg)
Inverts the truth value of the argument. Usefully wrapped around other functions, e.g., //p[not(q)] returns all <p> elements that do not contain a <q> child element.
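
The not() function first reduces its argument to true or false and then inverts it. The following sketch, with literal values invented for illustration, shows how a few common kinds of argument behave; each line is a separate expression.

```xpath
not(1 = 2)        (: true :)
not('hamlet')     (: false; a non-empty string counts as true :)
not('')           (: true; an empty string counts as false :)
not(())           (: true; an empty sequence counts as false :)
```

The last line is why //p[not(q)] works: for a <p> with no <q> children, q is an empty sequence.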

Context

position()
Returns the position of the current node within the sequence that is being processed. Useful for filtering, e.g., (//sp)[position() < 6] retrieves the first five <sp> elements in the document. Note that nothing goes inside the parentheses.
last()
Returns the position of the last item in the sequence that is being processed, so it is most often used as a positional predicate. (//p)[1] returns the first <p> element in the document. (//p)[last()] returns the last. Note that nothing goes inside the parentheses.
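
Positional predicates work on any sequence, not only on nodes, so position() and last() can be tried on a literal sequence. The four strings below are invented for the example; each line is a separate expression.

```xpath
('a', 'b', 'c', 'd')[1]                         (: 'a' :)
('a', 'b', 'c', 'd')[last()]                    (: 'd' :)
('a', 'b', 'c', 'd')[position() lt 3]           (: 'a', 'b' :)
('a', 'b', 'c', 'd')[position() = last() - 1]   (: 'c' :)
```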

Comparison

XPath supports two types of comparison: value comparison and general comparison.

Value comparison

The value comparison operators are eq, ne, gt, ge, lt, and le.

Value comparison can be used only to compare exactly one item to exactly one other item. For example, to create a predicate that will filter <sp> elements to keep only those where the value of the associated @who attribute is equal to the string hamlet, we can write:

//sp[@who eq 'hamlet']

Since each <sp> has exactly one @who attribute and since we are comparing it to a single string, the test will return True or False for each <sp> in the document. Because an empty sequence (technically no items) is also allowed on either side, the test will also work when an <sp> element has no @who attribute: the comparison then yields an empty sequence, which the predicate treats as False. It is, however, an error if either side of the comparison contains a sequence of more than one item.

Value comparison is often used for numerical values. To keep all of the speeches (<sp> elements) with more than 8 line (<l>) descendants, we can write:

//sp[count(descendant::l) gt 8]

In the preceding example, the output of the count() function is a single item, an integer, and it is being compared to another single item, the integer value 8.
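
The rules for value comparison can be checked with literal values, without a document. In this sketch the values are invented for illustration; each line is a separate expression.

```xpath
'hamlet' eq 'hamlet'    (: true :)
9 gt 8                  (: true :)
() eq 'hamlet'          (: empty sequence, which a predicate treats as false :)
(1, 2) eq 1             (: error: more than one item on the left :)
```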

General comparison

The general comparison operators are =, !=, >, >=, <, and <=.

While value comparison operators can compare only one thing on the left to one thing on the right, general comparison operators can have one or more items on either side of the comparison (also zero items, since the empty sequence is also allowed). For example:

//sp[@who = ('hamlet', 'ophelia')]

will retain all <sp> elements where the @who attribute is equal to either hamlet or ophelia. This makes general comparison a convenient alternative to a complex predicate like:

//sp[@who eq 'hamlet' or @who eq 'ophelia']

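In comparisons with exactly one item on either side of the comparison operator, value comparison and general comparison give the same result as long as the two items have comparable types. They differ for untyped values, such as the content of an attribute in a document without a schema: value comparison treats the untyped value as a string, so comparing it to a number with eq raises a type error, whereas = converts it to a number first.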
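
General comparison succeeds as soon as any item on the left and any item on the right satisfy the test. The following sketch uses literal sequences (invented for the example) to show this; each line is a separate expression.

```xpath
'hamlet' = ('hamlet', 'ophelia')     (: true :)
'horatio' = ('hamlet', 'ophelia')    (: false :)
(1, 2, 3) = (3, 4)                   (: true; 3 appears on both sides :)
(1, 2) > (0, 5)                      (: true; for example, 1 is greater than 0 :)
() = ('hamlet', 'ophelia')           (: false; there is nothing on the left to compare :)
```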

One possibly surprising feature of general comparison is the way it behaves with negation. Consider:

//sp[@who != ('hamlet', 'ophelia')]

This does not find all speeches by anyone other than Hamlet or Ophelia! A general comparison succeeds if any pairing of an item on the left with an item on the right satisfies it, so this predicate keeps every speech whose @who attribute is not equal to at least one of the individual items in the sequence on the right. This means that it finds every speech that has a @who attribute, since the ones by Hamlet are not by Ophelia (the test succeeds because @who is not equal to ophelia in situations where it is equal to hamlet) and vice versa.

So how do you find all speeches by anyone other than Hamlet or Ophelia? Try:

//sp[not(@who = ('hamlet', 'ophelia'))]

The preceding predicate says that we want to keep all speeches where it is not the case that the @who attribute is equal to either hamlet or ophelia.
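
The difference between the two forms of negation can be seen with literal strings standing in for the value of @who. The names below are invented stand-ins; each line is a separate expression.

```xpath
'hamlet' != ('hamlet', 'ophelia')         (: true; 'hamlet' is not equal to 'ophelia' :)
'horatio' != ('hamlet', 'ophelia')        (: true :)
not('hamlet' = ('hamlet', 'ophelia'))     (: false :)
not('horatio' = ('hamlet', 'ophelia'))    (: true :)
() != ('hamlet', 'ophelia')               (: false :)
not(() = ('hamlet', 'ophelia'))           (: true :)
```

The last two lines correspond to an <sp> with no @who attribute: the != form drops it, while the not() form keeps it.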

Summary of comparison operators

Description                                 Value   General
Equal to                                    eq      =
Not equal to                                ne      !=
Greater than                                gt      > (&gt;)
Greater than or equal to (not less than)    ge      >= (&gt;=)
Less than                                   lt      < (&lt;)
Less than or equal to (not greater than)    le      <= (&lt;=)