Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-10-12T15:34:01+0000
For more complete lists, with examples, see the references listed in the XPath section of our main course page (but note that
the current version of XPath is 3.1 and some of those references go only up to version
2.0). Items inside the function parentheses are placeholders and should be replaced with
real values, as in the examples provided. For example,
sequence
means that you should supply a sequence of
items; num
means that the argument has to be some
type of number (or something that XPath can treat as a number);
item
means any item, etc. Where we use a regex
repetition indicator, it has the same meaning as it does in regex; e.g.,
tokenize(string, regex?)
means that the function
requires a first argument, which must be a string, and has an optional second argument,
which must be a regular expression.
Typographic convention: In this document and in all course materials
from now on when we refer to an attribute by name we’ll prefix a
@
character to it. For example, if we want to
write about an attribute with the name who
, we’ll spell it
@who
. This is a common convention in XPath
documentation, similar to the way when we talk about an element with the name
speech
we write angle brackets before and after it, that is,
<speech>
. Attribute names cannot begin with
a literal at-sign, so a <speech>
element
with a @who
attribute that has the value
Hamlet
will be spelled
<speech who="Hamlet">
in the XML.
Variable names begin with a dollar sign, and you can create them as needed.
See the discussion of the for
construction, below.
.
In XPath the dot represents the current context item, whatever it is. For
example, //age[. eq 10]
finds all of
the <age>
elements in the document
and then filters them according to the predicate. The dot within the
predicate means to take each <age>
element in turn (make it the current context item) and test whether it is
equal to the numerical value 10. See below, under Comparison, for
information about how to compare things in XPath.
for $i in (sequence) return …
The for
construction can be used for
iteration.
for $i in (1, 3, 5) return (//sp)[$i]
will return the first, third, and fifth
<sp>
elements in the document.
Variable names begin with a dollar sign, so this XPath expression creates a
variable $i
, sets it to each of the
values in the parenthesized sequence in turn, and then uses that value as a
numerical predicate to select the corresponding
<sp>
element. The name of the
variable is arbitrary, except that it must have a leading dollar sign (there
are other constraints on variable names, e.g., variable names cannot begin
with digits, but you don’t need to memorize them because <oXygen/>
will alert you).
You could, alternatively, get the same results with
(//sp)[position() = (1, 3, 5)]
. This
tests each <sp>
element in the
document in turn to see whether it is the first, third, or fifth in the
sequence of all <sp>
elements in the
document. See the answer key for XPath assignment #2 (once it’s posted) for
an explanation of why the parentheses around
//sp
are required. See the discussion
of Comparison, below, for why we have to use an equal sign, and not the
eq
operator, here.
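As a minimal sketch of the two approaches, the following expressions use literal number sequences instead of <sp> elements, so they can be tried without a document. Each line is a separate expression to evaluate on its own; the sample values are made up for illustration.

```xpath
for $i in (1, 3, 5) return $i * 10          (: 10, 30, 50 :)
for $i in (1, 3, 5) return (11 to 20)[$i]   (: 11, 13, 15 :)
(11 to 20)[position() = (1, 3, 5)]          (: 11, 13, 15 :)
```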
Simple map (!) and arrow (=>) operators
!
The simple map operator (spelled !) means “do the thing on the right once for each item on the left”. For example, //act ! count(descendant::sp) says “Find all <act> elements (in a play, for example) and, for each one, count the number of <sp> (speech) elements it contains”. We use the descendant:: axis because if the act is divided into scenes, the speeches will be children of the scenes, rather than of the acts.
=>
The arrow operator (spelled =>) means “apply the function on the right to the entire sequence (all at once) on the left”. For example, //speaker => distinct-values() means “find all of the <speaker> elements in the document and use a sequence of those elements as input into the distinct-values() function to obtain a deduplicated sequence of speaker names”.
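The difference between the two operators may be easier to see with literal sequences. The following is an illustrative sketch with invented values; each line is a separate expression, and the comments describe what an XPath 3.1 processor should return.

```xpath
(1, 2, 3) ! (. * 2)                                (: 2, 4, 6; the multiplication runs once per item :)
(1, 2, 3) => sum()                                 (: 6; sum() receives the whole sequence at once :)
('to', 'be') ! upper-case(.)                       (: "TO", "BE" :)
('to', 'be') => upper-case()                       (: type error; upper-case() accepts at most one string :)
('to', 'be') ! upper-case(.) => string-join(' ')   (: "TO BE" :)
```

In the last expression the simple map operator is evaluated before the arrow operator, so string-join() receives the two upper-cased strings as a single sequence.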
distinct-values(sequence)
Removes duplicates from a sequence of values.
distinct-values(//speaker)
returns a
deduplicated sequence of string values, one for each distinct
<speaker>
element in the document.
The order of the items returned by the distinct-values() function is not defined, which is a technical term that means that it isn’t predictable. If you need the values sorted in a particular way, such as alphabetically, you need to sort them yourself.
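One possible way to sort the deduplicated values is the sort() function, which is available in XPath 3.1 but is not otherwise covered in this list, so treat this as a sketch rather than the recommended course method. The literal strings are invented; each line is a separate expression.

```xpath
('b', 'a', 'b', 'c') => distinct-values() => count()   (: 3 :)
('b', 'a', 'b', 'c') => distinct-values() => sort()    (: "a", "b", "c" :)
//speaker => distinct-values() => sort()               (: distinct speaker names in alphabetical order :)
```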
reverse(sequence)
Reverses the order of the items in a sequence, which, among other things, is
handy for counting backwards. The XPath expression
1 to 10
returns a sequence of ten
integer values (1 through 10) in order, but
10 to 1
returns an empty sequence
(note: it does not raise an error) because XPath can’t count backwards. You
can overcome this limitation with
reverse(1 to 10)
. (Alternatively you
could use
for $i in (1 to 10) return 11 - $i
, but
you shouldn’t because it’s harder to understand.)
Note that reverse()
reverses a
sequence and in XPath a string is not considered a sequence of
characters. This means that
reverse("xml")
returns
"xml"
(not
"lmx"
) because it reverses a one-item
sequence whose only member is the string
"xml"
, and reversing a one-item
sequence gives you the same one-item sequence. If you want to reverse the
characters in a string, you have to explode the string into a sequence of
individual characters explicitly, reverse the sequence, and then stitch them
back together into a new string, and there are functions that will let you
do that.
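One way to do this, offered as a sketch, is to go through Unicode code points with string-to-codepoints() and codepoints-to-string(). These functions are not described elsewhere in this list, and other approaches are possible. Each line is a separate expression.

```xpath
reverse(1 to 5)                                                          (: 5, 4, 3, 2, 1 :)
reverse('xml')                                                           (: "xml" :)
'xml' => string-to-codepoints() => reverse() => codepoints-to-string()   (: "lmx" :)
```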
name(node?)
Returns the name (the technical term is generic identifier,
abbreviated GI) of the node supplied as its argument. If no argument is
supplied, it uses the current context node. If the argument is not a node,
the function raises an error.
//* ! name()
will find all elements in
the document and instead of returning them (displaying tags, attributes,
contents, and all), it will return just their names. During exploratory
document analysis we often combine
name()
with
distinct-values()
to find all of the
distinct element types in our document:
//* ! name() => distinct-values()
.
Casting means changing the datatype of an item.
number(arg)
,
string(arg)
Convert the argument to the specified type of value. If you can’t be sure in advance that the conversion will succeed (what does it mean to convert a string of letters to a number?), look up the details in Kay.
xs:string(arg)
,
xs:integer(arg)
,
xs:double(arg)
,
xs:decimal(arg)
Cast to specific datatypes. These functions will raise an error if the input
value isn’t castable as the target datatype. There are other datatypes that
can also be used here, but strings and numbers (integer, decimal, and double
are all numeric datatypes) are the ones we use most. (There is also an
xs:float
numeric datatype, but you
should use xs:double
instead; see Kay
p. 208.)
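The difference between the conversion functions and the xs: casts shows up when the conversion fails. The sketch below uses invented values and assumes the xs: prefix is already bound to the XML Schema namespace, as it normally is in <oXygen/>. Each line is a separate expression.

```xpath
number('42')                      (: 42, as an xs:double :)
number('Hamlet')                  (: NaN; number() does not raise an error :)
string(42)                        (: "42" :)
xs:integer('42') + 1              (: 43 :)
xs:integer('Hamlet')              (: raises an error; the string is not castable :)
'Hamlet' castable as xs:integer   (: false; a way to test before casting :)
```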
concat(string, string+)
,
string-join((string+), string?)
concat()
joins the strings as is. It
requires at least two strings, so if you give it only one, it will raise an
error. string-join()
lets the user
specify a sequence of strings to join (the first argument) and a separator
string to insert between the items. This is a handy way to create, for
example, a comma-separated list:
string-join(//speaker, ', ')
will
concatenate all of the <speaker>
elements in a document (the first argument to the function is a sequence of
<speaker>
elements) into a list with
a comma and a space between values (specified as the second argument to the
function), and it’s smart enough not to add a trailing comma after the last
item. If no separator is specified,
string-join()
just concatenates the
strings, like concat()
. The items to be
concatenated can be either literal strings or items that can be cast as
(that is, treated as) strings (such as the
<speaker>
elements in the example
above). If the first argument to string-join() is an empty sequence it isn’t an error, but you’ll get an empty (zero-length) string back as a result.
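A few illustrative calls with literal strings (invented for this sketch) show how the two functions behave. Each line is a separate expression.

```xpath
concat('to', ' ', 'be')                        (: "to be" :)
string-join(('to', 'be', 'or', 'not'), ', ')   (: "to, be, or, not" :)
string-join(('to', 'be'))                      (: "tobe" :)
string-join((), ', ')                          (: "", a zero-length string :)
```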
normalize-space(string)
Converts all white space to space characters, compresses sequences of spaces into a single space, strips leading and trailing spaces. As with the functions above, the argument can be either a string or something that can be cast as a string.
upper-case(string)
,
lower-case(string)
Changes case of string. Useful for case-insensitive searching, sorting, comparing, etc.
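As a small sketch of the case functions and of normalize-space(), using invented strings; each line is a separate expression.

```xpath
upper-case('Hamlet')                           (: "HAMLET" :)
lower-case('Hamlet') eq lower-case('HAMLET')   (: true; a case-insensitive comparison :)
normalize-space('  to   be  ')                 (: "to be" :)
```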
string-length(string)
Returns the length of the string in characters. For example,
//sp ! string-length(.)
finds all of
the <sp>
elements and then returns
the length of each in turn (the dot refers to the current context node, that
is, to each individual <sp>
as you
loop through them). You can’t use string-length(//sp) because the string-length() function accepts only a single item as its argument, and //sp is likely to return multiple nodes (although it will work if there is only one <sp> element in the document).
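The same behavior can be seen with a literal sequence of strings in place of //sp. This is an illustrative sketch with invented values; each line is a separate expression.

```xpath
('to', 'be', 'question') ! string-length(.)   (: 2, 2, 8 :)
string-length('question')                     (: 8 :)
string-length(('to', 'be', 'question'))       (: type error; more than one item supplied :)
```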
contains(string, string)
,
starts-with(string, string)
,
ends-with(string, string)
,
contains-token(string, string)
Tests whether the first string has the relationship with the second specified
by the function name, that is, whether the first contains, starts with, ends
with, or contains as a separate (whitespace-separated) word token the
second. Useful for filtering;
//sentence[ends-with(., '?')]
finds all
<sentence>
elements and keeps only
the ones that end with a question mark. The question mark in quotation marks
is the second string. The dot (not in quotation marks) is an XPath way of
representing the current node, whatever it is. For each
<sentence>
selected by
//sentence
, then, within the predicate
the dot is treated as the value of that particular (current)
<sentence>
node.
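The four tests can be tried on literal strings, as in this sketch (the strings are invented; each line is a separate expression).

```xpath
contains('Hamlet', 'ml')                    (: true :)
starts-with('Hamlet', 'Ham')                (: true :)
ends-with('To be?', '?')                    (: true :)
contains-token('red green blue', 'green')   (: true :)
contains-token('red green blue', 'gre')     (: false; "gre" is not a whole token :)
```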
translate(string, string, string)
Takes the first string and replaces every character in it that occurs in the second string with the corresponding character from the third string. For example, translate('string','ti','pa') will change string into sprang; it replaces every t with a p and every i with an a. It can only do one-to-one replacements; see also replace(). It can be used for deletion by making the third string shorter than the second; //p ! translate(., 'aeiou', '') will strip all the vowels from each <p> by replacing them with nothing.
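A short sketch of both uses, replacement and deletion, on literal strings (the second and third inputs are invented for illustration). Each line is a separate expression.

```xpath
translate('string', 'ti', 'pa')    (: "sprang" :)
translate('Hamlet', 'aeiou', '')   (: "Hmlt" :)
translate('a-b_c', '-_', ' ')      (: "a bc"; "-" becomes a space, "_" has no counterpart and is deleted :)
```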
substring-before(string, string)
,
substring-after(string, string)
Returns the part of the first string before (or after) the first occurrence
of the second string. Useful for breaking apart certain structures, e.g.,
the area code for a ten-digit US telephone number in normal
123-456-7890
format is
$telephone ! substring-before(.,'-')
.
There is also a
substring(string, integer, integer?)
function, which returns a substring of the first argument. The second
argument is the start position and the third argument is the length of the
substring (if missing, the substring extends to the end of the first
argument). For example, to capitalize the first letter of a string you can
use:
$input ! concat(substring(upper-case(.), 1, 1), substring(., 2))
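The substring functions can be tried on literal strings, as in the following sketch. The variables $telephone and $input from the examples above are replaced here by literal values; each line is a separate expression.

```xpath
substring-before('123-456-7890', '-')                                (: "123" :)
substring-after('123-456-7890', '-')                                 (: "456-7890" :)
substring('Hamlet', 2, 3)                                            (: "aml" :)
substring('Hamlet', 4)                                               (: "let" :)
'hamlet' ! concat(substring(upper-case(.), 1, 1), substring(., 2))   (: "Hamlet" :)
```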
matches(string, regex)
Tests whether the regular expression pattern occurs in the string. One type
of regex is a plain string, so (with oversimplification)
matches(string, string)
is equivalent
to contains(string, string)
. The real
power of matches()
in comparison to
contains()
is that it can match regex
patterns, and not just strings.
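A brief sketch contrasting plain-string and pattern matching, using invented inputs; each line is a separate expression.

```xpath
matches('Hamlet', 'ml')       (: true; same result as contains('Hamlet', 'ml') :)
matches('Hamlet', '^H.*t$')   (: true :)
matches('Hamlet', '^ml')      (: false; the pattern is anchored to the start :)
matches('Hamlet', '[0-9]')    (: false :)
```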
replace(string, regex, regex-replace)
The translate()
function, above, can
only replace single characters with single characters. The
replace()
function can match regex
patterns and perform more complicated replacements, including replacements
that use capture groups (as we practice in our regex unit).
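The following sketch shows a simple pattern replacement and one that reorders capture groups. The input strings are invented; each line is a separate expression. In the replacement string, $1, $2, and so on refer to the parenthesized groups in the pattern.

```xpath
replace('Hamlet', '[aeiou]', '*')                              (: "H*ml*t" :)
replace('2022-10-12', '(\d{4})-(\d{2})-(\d{2})', '$2/$3/$1')   (: "10/12/2022" :)
replace('to   be', '\s+', ' ')                                 (: "to be" :)
```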
tokenize(string, regex?)
Breaks a string into parts by dividing at the regex. If the second argument (the regex) is missing, it tokenizes on sequences of whitespace characters by default.
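As an illustration with invented strings, the sketch below tokenizes on an explicit regex and on the whitespace default. The one-argument form requires XPath 3.1. Each line is a separate expression.

```xpath
tokenize('to be, or not to be', ',\s*')          (: "to be", "or not to be" :)
tokenize('  to   be  ')                          (: "to", "be" :)
tokenize('to be or not to be', ' ') => count()   (: 6 :)
```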
count(sequence)
,
avg(sequence)
,
max(sequence)
,
min(sequence)
,
sum(sequence)
Count, average (arithmetic mean), largest value, smallest value, and total of
all values. The arguments have to make sense; trying to run
avg()
or
sum()
over strings will raise an error.
You can count any sequence, and max()
and min()
will work with strings
(letters later in the alphabet are larger
than letters earlier in the
alphabet), but in practice we use everything except
count()
primarily with
numbers.
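A minimal sketch of the aggregate functions over literal sequences (values invented for illustration); each line is a separate expression.

```xpath
count((3, 1, 2))       (: 3 :)
sum((3, 1, 2))         (: 6 :)
avg((3, 1, 2))         (: 2 :)
max((3, 1, 2))         (: 3 :)
min(('b', 'a', 'c'))   (: "a" :)
count(())              (: 0 :)
sum(('a', 'b'))        (: raises an error; strings cannot be summed :)
```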
ceiling(num)
,
floor(num)
,
round(num)
Each takes a numeric argument and rounds it up to the next integer, down to the previous integer, or to the closest integer, respectively.
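The rounding functions can be compared on a few literal values, as in this sketch; each line is a separate expression.

```xpath
ceiling(2.1)   (: 3 :)
floor(2.9)     (: 2 :)
round(2.4)     (: 2 :)
round(2.5)     (: 3 :)
round(-2.5)    (: -2; halves round toward positive infinity :)
```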
not(arg)
Inverts the truth value of the argument. Usefully wrapped around other XPath
expressions. For example, since //p[q]
returns all <p>
elements that have
any <q>
child element,
//p[not(q)]
returns all
<p>
elements that do not
have any <q>
child element.
position()
Returns the position of the node in its context sequence. Useful for
filtering, e.g.,
(//sp)[position() < 6]
selects the
first five <sp>
elements in the
document. Nothing is allowed inside the parentheses, but they are required
anyway because functions always require parentheses.
last()
Returns the position of the last item in the context sequence, and is typically used as a positional predicate. (//p)[1] returns the first <p> element in the document. (//p)[last()] returns the last. Nothing is allowed inside the parentheses.
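Both position() and last() can be tried on a literal sequence, without a document. The values below are invented for this sketch; each line is a separate expression.

```xpath
(10, 20, 30, 40)[position() lt 3]   (: 10, 20 :)
(10, 20, 30, 40)[1]                 (: 10 :)
(10, 20, 30, 40)[last()]            (: 40 :)
(10, 20, 30, 40)[last() - 1]        (: 30 :)
(10, 20, 30, 40) ! position()       (: 1, 2, 3, 4 :)
```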
XPath supports two types of comparison: value comparison and general comparison.
The value comparison operators are:
eq
equal to
ne
not equal to
gt
greater than
ge
greater than or equal to (not less than)
lt
less than
le
less than or equal to (not greater than)
Value comparison can be used only to compare exactly one item to exactly one other
item and they must be of the same (or promotable) datatypes. For example, to create
a predicate that will filter <sp>
elements
to keep only those where the value of the associated
@who
attribute is equal to the string
"Hamlet"
, we can write:
//sp[@who eq 'Hamlet']
Each <sp>
has exactly one
@who
attribute and we are comparing it to a
single string, so the test will return True or False for each
<sp>
in the document. Should an
<sp>
element not have a
@who
attribute the speech will not be selected,
but it also does not raise an error because the predicate means “find all <sp> and keep only those that have a @who attribute with a value that is equal to "Hamlet"”. Both
<sp>
elements that have a
@who
attribute equal to something other than
"Hamlet"
and those that have no
@who
attribute at all fail the test and are not
included in the result, but not having a @who
attribute is a syntactically correct way to fail the test, and not an error.
Value comparison is often used for numerical values. To keep all of the speeches
(<sp>
elements) with more than 8 line
(<l>
) descendants, we can write:
//sp[count(descendant::l) gt 8]
In the preceding example, the output of the
count()
function is a single item, an integer,
and it is being compared to another single item, the integer value 8.
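The one-item restriction on value comparison can be seen with literal values. This is a sketch with invented operands; each line is a separate expression.

```xpath
3 gt 2                 (: true :)
'Hamlet' eq 'Hamlet'   (: true :)
() eq 'Hamlet'         (: empty sequence, which a predicate treats as false :)
(1, 2) eq 1            (: type error; more than one item on the left :)
```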
The general comparison operators are:
=
equal to
!=
not equal to
>
greater than (may also be written &gt;)
>=
greater than or equal to (not less than; may also be written &gt;=)
<
less than (may also be written &lt;)
<=
less than or equal to (not greater than; may also be written &lt;=)
While value comparison operators can compare only one thing on the left to one thing on the right, general comparison operators can have one or more items on either side of the comparison (also zero items, since the empty sequence is also allowed). For example:
//sp[@who = ('Hamlet', 'Ophelia')]
will select all <sp>
elements where the
@who
attribute is equal to either
Hamlet
or Ophelia
. This makes general comparison a convenient
alternative to a complex predicate like:
//sp[@who eq 'Hamlet' or @who eq 'Ophelia']
In comparisons with exactly one item on either side of the comparison operator, as long as the datatypes are compatible, value comparison and general comparison are effectively equivalent.
One possibly surprising feature of general comparison is the way it behaves with negation. Consider:
//sp[@who != ('Hamlet', 'Ophelia')]
This does not find all speeches by anyone other than Hamlet or Ophelia! It
finds all speeches where the @who
attribute is
not equal to some individual item in the sequence on the right. This means that it finds every speech that has a @who attribute, since the ones by Hamlet are not by Ophelia (the test always succeeds because @who is not equal to Ophelia in situations where it is equal to Hamlet, and vice versa). This is probably not what you want.
So how do you find all speeches by anyone other than Hamlet or Ophelia? Try:
//sp[not(@who = ('Hamlet', 'Ophelia'))]
The preceding predicate says that we want to keep all speeches where it is not the
case that the @who
attribute is equal to either
Hamlet
or Ophelia
.
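The behavior of general comparison with negation can be checked with literal strings standing in for the @who attribute. This is an illustrative sketch (the name Horatio is invented for the example); each line is a separate expression.

```xpath
'Hamlet'  =  ('Hamlet', 'Ophelia')       (: true :)
'Hamlet'  != ('Hamlet', 'Ophelia')       (: true, because 'Hamlet' differs from 'Ophelia' :)
'Horatio' != ('Hamlet', 'Ophelia')       (: true :)
not('Hamlet'  = ('Hamlet', 'Ophelia'))   (: false :)
not('Horatio' = ('Hamlet', 'Ophelia'))   (: true :)
() != ('Hamlet', 'Ophelia')              (: false; there is no item on the left to compare :)
```

A general comparison is true if any pair of items, one from each side, satisfies the operator, which is why only the not() form excludes both names.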
Description | Value | General |
---|---|---|
Equal to | eq | = |
Not equal to | ne | != |
Greater than | gt | > (&gt;) |
Greater than or equal to (not less than) | ge | >= (&gt;=) |
Less than | lt | < (&lt;) |
Less than or equal to (not greater than) | le | <= (&lt;=) |