Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-10-12T15:34:01+0000


The XPath functions we use most

For more complete lists, with examples, see the references listed in the XPath section of our main course page (but note that the current version of XPath is 3.1 and some of those references go only up to version 2.0). Items inside the function parentheses are placeholders and should be replaced with real values, as in the examples provided. For example, sequence means that you should supply a sequence of items; num means that the argument has to be some type of number (or something that XPath can treat as a number); item means any item, etc. Where we use a regex repetition indicator, it has the same meaning as it does in regex; e.g., tokenize(string, regex?) means that the function requires a first argument, which must be a string, and has an optional second argument, which must be a regular expression.

Typographic convention: In this document and in all course materials from now on, when we refer to an attribute by name we’ll prefix a @ character to it. For example, if we want to write about an attribute with the name who, we’ll spell it @who. This is a common convention in XPath documentation, similar to the way we write angle brackets around the name of an element, so that an element with the name speech is written <speech>. The at-sign is not part of the attribute name (attribute names cannot begin with a literal at-sign), so a <speech> element with a @who attribute that has the value Hamlet will be spelled <speech who="Hamlet"> in the XML.

A few useful XPath features

Variables

Variable names begin with a dollar sign, and you can create them as needed. See the discussion of the for construction, below.

The dot: .

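In XPath the dot represents the current context item, whatever it is. For example, //age[. = 10] finds all of the <age> elements in the document and then filters them according to the predicate. The dot within the predicate means to take each <age> element in turn (make it the current context item) and test whether its value is equal to the number 10. (In a document without a schema we use the general comparison operator = here, or write number(.) eq 10, because the value comparison operator eq treats untyped element content as a string and raises a type error when asked to compare a string to a number.) See below, under Comparison, for information about how to compare things in XPath.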

for $i in (sequence) return …

The for construction can be used for iteration. for $i in (1, 3, 5) return (//sp)[$i] will return the first, third, and fifth <sp> elements in the document. Variable names begin with a dollar sign, so this XPath expression creates a variable $i, sets it to each of the values in the parenthesized sequence in turn, and then uses that value as a numerical predicate to select the corresponding <sp> element. The name of the variable is arbitrary, except that it must have a leading dollar sign (there are other constraints on variable names, e.g., variable names cannot begin with digits, but you don’t need to memorize them because <oXygen/> will alert you).

You could, alternatively, get the same results with (//sp)[position() = (1, 3, 5)]. This tests each <sp> element in the document in turn to see whether it is the first, third, or fifth in the sequence of all <sp> elements in the document. See the answer key for XPath assignment #2 (once it’s posted) for an explanation of why the parentheses around //sp are required. See the discussion of Comparison, below, for why we have to use an equal sign, and not the eq operator, here.
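
The for construction can also be tried without any input document. The following is a minimal sketch that uses made-up literal numbers instead of <sp> elements; it binds $i to each value in turn and returns one result per value.

```xpath
(: bind $i to 1, 2, and 3 in turn and return its square each time :)
for $i in (1, 2, 3) return $i * $i
(: result: 1, 4, 9 :)
```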

The simple map (!) and arrow (=>) operators

!

The simple map operator (spelled !) means do the thing on the right once for each item on the left. For example, //act ! count(descendant::sp) says Find all <act> elements (in a play, for example) and, for each one, count the number of <sp> (speech) elements it contains. We use the descendant:: axis because if the act is divided into scenes, the speeches will be children of the scenes, rather than of the acts.

=>

The arrow operator (spelled =>) means apply the function on the right to the entire sequence (all at once) on the left. For example, //speaker => distinct-values() means find all of the <speaker> elements in the document and use a sequence of those elements as input to the distinct-values() function to obtain a deduplicated sequence of speaker names.
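
The difference between the two operators is easiest to see side by side. This sketch uses an invented sequence of three literal strings; evaluate each expression separately.

```xpath
(: simple map: string-length() runs once for each item on the left :)
('a', 'bb', 'ccc') ! string-length(.)
(: result: 1, 2, 3 :)

(: arrow: the whole sequence becomes the first argument of string-join() :)
('a', 'bb', 'ccc') => string-join('-')
(: result: "a-bb-ccc" :)
```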

General-purpose functions

distinct-values(sequence)

Removes duplicates from a sequence of values. distinct-values(//speaker) returns a deduplicated sequence of values, one for each distinct speaker name that appears in a <speaker> element in the document. The order of the items returned by the distinct-values() function is not defined, which is a technical way of saying that it isn’t predictable. If you need the values in a particular order, such as alphabetical, you need to sort them yourself.
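
One way to do the sorting, assuming an XPath 3.1 processor, is the sort() function, which with a single argument sorts in ascending order. The sketch below shows it first on invented literal strings and then on the <speaker> elements used throughout this page; evaluate each expression separately.

```xpath
(: deduplicate, then sort the surviving values :)
distinct-values(('b', 'a', 'b')) => sort()
(: result: "a", "b" :)

(: the same pipeline applied to a document :)
//speaker => distinct-values() => sort()
```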

reverse(sequence)

Reverses the order of the items in a sequence, which, among other things, is handy for counting backwards. The XPath expression 1 to 10 returns a sequence of ten integer values (1 through 10) in order, but 10 to 1 returns an empty sequence (note: it does not raise an error) because XPath can’t count backwards. You can overcome this limitation with reverse(1 to 10). (Alternatively you could use for $i in (1 to 10) return 11 - $i, but you shouldn’t because it’s harder to understand.)

Note that reverse() reverses a sequence and in XPath a string is not considered a sequence of characters. This means that reverse("xml") returns "xml" (not "lmx") because it reverses a one-item sequence whose only member is the string "xml", and reversing a one-item sequence gives you the same one-item sequence. If you want to reverse the characters in a string, you have to explode the string into a sequence of individual characters explicitly, reverse the sequence, and then stitch them back together into a new string, and there are functions that will let you do that.
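
As one possible sketch of that explode, reverse, and reassemble sequence, the standard functions string-to-codepoints() and codepoints-to-string() can do the conversions; other approaches would work as well.

```xpath
(: string -> sequence of code points -> reversed sequence -> string :)
'xml' => string-to-codepoints() => reverse() => codepoints-to-string()
(: result: "lmx" :)
```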

name(node?)

Returns the name (the technical term is generic identifier, abbreviated GI) of the node supplied as its argument. If no argument is supplied, it uses the current context node. If the argument is not a node, the function raises an error. //* ! name() will find all elements in the document and instead of returning them (displaying tags, attributes, contents, and all), it will return just their names. During exploratory document analysis we often combine name() with distinct-values() to find all of the distinct element types in our document: //* ! name() => distinct-values().

Casting

Casting means changing the datatype of an item.

number(arg), string(arg)

Convert the argument to the specified type of value. If you can’t be sure in advance that the conversion will succeed (what does it mean to convert a string of letters to a number?), look up the details in Kay.
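
A few literal examples (the input values are invented for illustration) show the common cases, including what number() does when the conversion cannot succeed: it returns the special value NaN ("not a number") instead of raising an error. Evaluate each expression separately.

```xpath
number('42')       (: result: 42, as an xs:double :)
number('forty')    (: result: NaN :)
string(42)         (: result: "42" :)
```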

xs:string(arg), xs:integer(arg), xs:double(arg), xs:decimal(arg)

Cast to specific datatypes. These functions will raise an error if the input value isn’t castable as the target datatype. There are other datatypes that can also be used here, but strings and numbers (integer, decimal, and double are all numeric datatypes) are the ones we use most. (There is also an xs:float numeric datatype, but you should use xs:double instead; see Kay p. 208.)
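
If you are not sure whether a cast will succeed, one hedge is to test first with the castable as operator, which returns true or false instead of raising an error. The values below are invented; evaluate each expression separately.

```xpath
xs:integer('42')                  (: result: 42 :)
'42' castable as xs:integer       (: result: true :)
'forty' castable as xs:integer    (: result: false :)
```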

Strings

concat(string, string+), string-join((string+), string?)

concat() joins the strings as is. It requires at least two arguments, so if you give it only one, it will raise an error. string-join() takes a sequence of strings to join (the first argument) and an optional separator string to insert between the items (the second argument). This is a handy way to create, for example, a comma-separated list: string-join(//speaker, ', ') will concatenate all of the <speaker> elements in a document (the first argument to the function is a sequence of <speaker> elements) into a list with a comma and a space between values (specified as the second argument to the function), and it’s smart enough not to add a trailing comma after the last item. If no separator is specified, string-join() just concatenates the strings, like concat(). The items to be concatenated can be either literal strings or items that can be cast as (that is, treated as) strings (such as the <speaker> elements in the example above). If the first argument to string-join() is an empty sequence it isn’t an error, but you’ll get a zero-length string back as a result.
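
The following sketch contrasts the two functions on invented literal strings. Note the extra pair of parentheses in string-join(), which wrap the strings into a single sequence argument. Evaluate each expression separately.

```xpath
concat('a', 'b', 'c')                 (: result: "abc" :)
string-join(('a', 'b', 'c'), ', ')    (: result: "a, b, c" :)
string-join(('a', 'b', 'c'))          (: result: "abc" :)
```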

normalize-space(string)

Converts all white space to space characters, compresses sequences of spaces into a single space, strips leading and trailing spaces. As with the functions above, the argument can be either a string or something that can be cast as a string.
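
For instance, with an invented string that has extra spaces at both ends and between words:

```xpath
normalize-space('   to   be  or not   ')
(: result: "to be or not" :)
```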

upper-case(string), lower-case(string)

Changes case of string. Useful for case-insensitive searching, sorting, comparing, etc.

string-length(string)

Returns the length of the string in characters. For example, //sp ! string-length(.) finds all of the <sp> elements and then returns the length of each in turn (the dot refers to the current context node, that is, to each individual <sp> as you loop through them). You can't use string-length(//sp) because the argument to string-length() must be a single item (or an empty sequence), and //sp is likely to return multiple nodes (although it will work if there is only one <sp> element in the document).

contains(string, string), starts-with(string, string), ends-with(string, string), contains-token(string, string)

Tests whether the first string has the relationship with the second specified by the function name, that is, whether the first contains, starts with, ends with, or contains as a separate (whitespace-separated) word token the second. Useful for filtering; //sentence[ends-with(., '?')] finds all <sentence> elements and keeps only the ones that end with a question mark. The question mark in quotation marks is the second string. The dot (not in quotation marks) is an XPath way of representing the current node, whatever it is. For each <sentence> selected by //sentence, then, within the predicate the dot is treated as the value of that particular (current) <sentence> node.

translate(string, string, string)

Takes the first string and replaces every instance of a character in it from second string with the corresponding character from the third string. For example, translate('string','ti','pa') will change string into sprang; it replaces every t with a p and every i with an a. Can only do one-to-one replacements; see also replace(). Can be used for deletion by making the third string shorter than the second; //p ! translate(., 'aeiou', '') will strip all the vowels from each <p> by replacing them with nothing.

substring-before(string, string), substring-after(string, string)

Returns the part of the first string before (or after) the first occurrence of the second string. Useful for breaking apart certain structures, e.g., the area code for a ten-digit US telephone number in normal 123-456-7890 format is $telephone ! substring-before(.,'-').
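
Applied to a literal telephone number (an invented value standing in for $telephone), the two functions behave as sketched below; chaining them is one way to pull out the middle group. Evaluate each expression separately.

```xpath
substring-before('123-456-7890', '-')    (: result: "123" :)
substring-after('123-456-7890', '-')     (: result: "456-7890" :)

(: everything after the first hyphen, then everything before the next one :)
'123-456-7890' => substring-after('-') => substring-before('-')
(: result: "456" :)
```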

There is also a substring(string, integer, integer?) function, which returns a substring of the first argument. The second argument is the start position and the third argument is the length of the substring (if missing, the substring extends to the end of the first argument). For example, to capitalize the first letter of a string you can use:

$input ! concat(substring(upper-case(.), 1, 1), substring(., 2))

matches(string, regex)

Tests whether the regular expression pattern occurs in the string. One type of regex is a plain string, so (with oversimplification) matches(string, string) is equivalent to contains(string, string). The real power of matches() in comparison to contains() is that it can match regex patterns, and not just strings.
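
A short sketch on an invented literal string shows both the contains()-like behavior and the pattern matching. The last line uses the optional third argument of matches(), a flags string, where 'i' requests case-insensitive matching. Evaluate each expression separately.

```xpath
matches('Hamlet', 'aml')          (: result: true, plain substring :)
matches('Hamlet', '^H.*t$')       (: result: true, starts with H and ends with t :)
matches('Hamlet', '^ham')         (: result: false, matching is case-sensitive :)
matches('Hamlet', '^ham', 'i')    (: result: true :)
```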

replace(string, regex, regex-replace)

The translate() function, above, can only replace single characters with single characters. The replace() function can match regex patterns and perform more complicated replacements, including replacements that use capture groups (as we practice in our regex unit).
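
As an illustration of capture groups, this sketch rearranges an invented ISO-style date; $1, $2, and $3 in the replacement refer to the first, second, and third parenthesized groups in the pattern.

```xpath
(: reorder year-month-day as month/day/year :)
replace('2022-10-12', '(\d+)-(\d+)-(\d+)', '$2/$3/$1')
(: result: "10/12/2022" :)
```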

tokenize(string, regex?)

Breaks a string into parts by dividing at the regex. If the second argument (the regex) is missing, it tokenizes on sequences of whitespace characters by default.
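
The sketch below, on invented strings, shows both forms. One detail worth noticing is that the one-argument form ignores leading and trailing whitespace, while an explicit regex that matches at the very start of the string produces a zero-length first token. Evaluate each expression separately.

```xpath
tokenize('a,b,,c', ',')           (: result: "a", "b", "", "c" :)
tokenize('  to be   or not  ')    (: result: "to", "be", "or", "not" :)
tokenize('  to be', '\s+')        (: result: "", "to", "be" :)
```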

Numbers

count(sequence), avg(sequence), max(sequence), min(sequence), sum(sequence)

Count, average (arithmetic mean), largest value, smallest value, and total of all values. The arguments have to make sense; trying to run avg() or sum() over strings will raise an error. You can count any sequence, and max() and min() will work with strings (letters later in the alphabet are larger than letters earlier in the alphabet), but in practice we use everything except count() primarily with numbers.
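
With invented literal values the five functions behave as follows; evaluate each expression separately.

```xpath
count((10, 20, 30, 40))    (: result: 4 :)
sum((10, 20, 30, 40))      (: result: 100 :)
avg((10, 20, 30, 40))      (: result: 25 :)
max((10, 20, 30, 40))      (: result: 40 :)
min(('pear', 'apple'))     (: result: "apple" :)
```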

ceiling(num), floor(num), round(num)

Each of these takes a single numeric argument and rounds it up (ceiling()), down (floor()), or to the nearest integer (round()).
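
The halfway cases are the ones most likely to surprise: round() resolves a tie toward positive infinity, so 2.5 rounds to 3 but -2.5 rounds to -2. The values below are invented; evaluate each expression separately.

```xpath
ceiling(2.1)    (: result: 3 :)
floor(2.9)      (: result: 2 :)
round(2.4)      (: result: 2 :)
round(2.5)      (: result: 3 :)
round(-2.5)     (: result: -2 :)
```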

Boolean

not(arg)

Inverts the truth value of the argument. Usefully wrapped around other XPath expressions. For example, since //p[q] returns all <p> elements that have any <q> child element, //p[not(q)] returns all <p> elements that do not have any <q> child element.

Context

position()

Returns the position of the node in its context sequence. Useful for filtering, e.g., (//sp)[position() < 6] selects the first five <sp> elements in the document. Nothing is allowed inside the parentheses, but they are required anyway because functions always require parentheses.

last()

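Returns the number of items in the context sequence, which is also the position of the last item, so it is most often used as a positional predicate. (//p)[1] returns the first <p> element in the document. (//p)[last()] returns the last. Nothing is allowed inside the parentheses.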
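
Both context functions can be tried on a literal sequence as well as on nodes. In this sketch the numbers are invented, and the last expression combines the two functions to pick the next-to-last item. Evaluate each expression separately.

```xpath
(10, 20, 30, 40)[position() gt 2]             (: result: 30, 40 :)
(10, 20, 30, 40)[last()]                      (: result: 40 :)
(10, 20, 30, 40)[position() eq last() - 1]    (: result: 30 :)
```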

Comparison

XPath supports two types of comparison: value comparison and general comparison.

Value comparison

The value comparison operators are eq, ne, gt, ge, lt, and le (see the summary table at the end of this section).

Value comparison can be used only to compare exactly one item to exactly one other item and they must be of the same (or promotable) datatypes. For example, to create a predicate that will filter <sp> elements to keep only those where the value of the associated @who attribute is equal to the string "Hamlet", we can write:

//sp[@who eq 'Hamlet']

Each <sp> has exactly one @who attribute and we are comparing it to a single string, so the test will return True or False for each <sp> in the document. Should an <sp> element not have a @who attribute the speech will not be selected, but it also does not raise an error because the predicate means find all <sp> and keep only those that have a @who attribute with a value that is equal to "Hamlet". Both <sp> elements that have a @who attribute equal to something other than "Hamlet" and those that have no @who attribute at all fail the test and are not included in the result, but not having a @who attribute is a syntactically correct way to fail the test, and not an error.

Value comparison is often used for numerical values. To keep all of the speeches (<sp> elements) with more than 8 line (<l>) descendants, we can write:

//sp[count(descendant::l) gt 8]

In the preceding example, the output of the count() function is a single item, an integer, and it is being compared to another single item, the integer value 8.
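
The one-item restriction can be seen with literal values (invented here for illustration). An empty operand, like a missing @who attribute, produces an empty result rather than an error, while an operand with more than one item raises a type error. Evaluate each expression separately.

```xpath
3 gt 2         (: result: true :)
() eq 1        (: result: empty sequence, not an error :)
(1, 2) eq 1    (: type error: more than one item on the left :)
```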

General comparison

The general comparison operators are =, !=, >, >=, <, and <= (see the summary table at the end of this section).

While value comparison operators can compare only one thing on the left to one thing on the right, general comparison operators can have one or more items on either side of the comparison (also zero items, since the empty sequence is also allowed). For example:

//sp[@who = ('Hamlet', 'Ophelia')]

will select all <sp> elements where the @who attribute is equal to either Hamlet or Ophelia. This makes general comparison a convenient alternative to a complex predicate like:

//sp[@who eq 'Hamlet' or @who eq 'Ophelia']

In comparisons with exactly one item on either side of the comparison operator, as long as the datatypes are compatible, value comparison and general comparison are effectively equivalent.

One possibly surprising feature of general comparison is the way it behaves with negation. Consider:

//sp[@who != ('Hamlet', 'Ophelia')]

This does not find all speeches by anyone other than Hamlet or Ophelia! It finds all speeches where the @who attribute is not equal to some individual item in the sequence on the right. This means that it finds every speech that has a @who attribute, since the ones by Hamlet are not by Ophelia (the test always succeeds because @who is not equal to Ophelia in situations where it is equal to Hamlet, and vice versa). This is probably not what you want.

So how do you find all speeches by anyone other than Hamlet or Ophelia? Try:

//sp[not(@who = ('Hamlet', 'Ophelia'))]

The preceding predicate says that we want to keep all speeches where it is not the case that the @who attribute is equal to either Hamlet or Ophelia.
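
The difference between the two forms of negation can be checked with literal strings standing in for the value of @who; Horatio is an invented third speaker used only for this sketch. Evaluate each expression separately.

```xpath
'Hamlet' != ('Hamlet', 'Ophelia')         (: result: true, because 'Hamlet' != 'Ophelia' :)
not('Hamlet' = ('Hamlet', 'Ophelia'))     (: result: false :)

'Horatio' != ('Hamlet', 'Ophelia')        (: result: true :)
not('Horatio' = ('Hamlet', 'Ophelia'))    (: result: true :)
```

Only the not(… = …) form distinguishes Hamlet from Horatio, which is why it is the one to use for excluding a list of values.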

Summary of comparison operators

Description                                 Value   General
Equal to                                    eq      =
Not equal to                                ne      !=
Greater than                                gt      > (&gt;)
Greater than or equal to (not less than)    ge      >= (&gt;=)
Less than                                   lt      < (&lt;)
Less than or equal to (not greater than)    le      <= (&lt;=)