Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:48+0000
As we discussed in our general introduction to XML (What is XML and why should humanities scholars care?), there are two principal sets of reasons why digital humanists use XML to model their texts:
XPath is a language for selecting parts of an XML document for subsequent
processing. As such, the main thing an XPath expression does is allow the
user to describe, in a formal way that a computer can process easily, certain parts
of a document (e.g., “all of the paragraphs”, “all of the first paragraphs of
a section, unless the section is part of an appendix”, etc.). In addition to
defining specific parts of a document, XPath can also manipulate the data it finds
(see the discussion of functions below), but the main thing it does is serve as a
helper language, or ancillary technology, to identify parts of a document
that will then be manipulated by another language. The principal XML-related
languages that employ XPath to find information in XML documents are XSLT
(eXtensible Stylesheet Language Transformations) and XQuery (XML Query
language). We’ll learn about using XSLT and XQuery to manipulate XML documents
later, but before you can do something with information in an XML document you have
to be able to find it, and that’s what XPath does.
This document provides some basic information about how to use XPath to describe, find, and navigate to information inside an XML document. For the most part you won’t yet be doing anything with the information you find, but once you’ve learned how to find it, you’ll employ that knowledge in subsequent lessons about XSLT and XQuery to interrogate and manipulate your source documents. The introduction you’re reading now is not a complete description of XPath, but it will get you started, and you can then find information about additional XPath resources in Michael Kay’s book.
The discussion of XPath components (path expressions, axes, predicates, functions)
below depends on two key concepts, nodes and sequences. A
node is a piece of information in the XML tree, such as an element,
an attribute, or a string of text. The XHTML paragraph that you are reading now is a single <p> element node that contains a mixture of text nodes (strings of plain text), <code> element nodes (used for snippets of XML, such as the element names <q> and <code>, which are highlighted typographically by my style sheet), <q> element nodes (identifying quoted text, which has quotation marks inserted automatically during rendering), etc. In this particular example, each of the element nodes (<code>, <q>, etc.) within the <p> node happens to contain, in turn, just a single text node, but elements can also contain just other element nodes, just text, a mixture of elements and text, or nothing at all. The <p> node, in turn, is contained within a section (<div>) node, etc. This can be illustrated with the following partial tree diagram (element nodes are depicted as yellow ovals and text nodes as blue rectangles):
This partial tree corresponds to text that might read:
… the node names “q” and “code”, which are …
This, in turn, appears in the markup serialization as:
<p> … the node names <q>q</q> and <q>code</q>, which are … </p>
These three perspectives (tree, rendering, XML serialization) all represent the same XML document. What XML (and, therefore, XPath) cares about is the tree. In XPath terms (look at the tree):
- The <p> node contains just five child nodes. In order, these are a text node (“the node names”), the first <q> node, another text node (“and”), the second <q> node, and a third text node (“, which are”).
- Each <q> node contains a single text node, “q” for the first and “code” for the second.
- The <p> node is the parent of its five child nodes (text, the first <q>, more text, the second <q>, and still more text) and each <q> node is the parent of one text node.
- Nodes that have the same parent, such as the five children of the <p> node, are called siblings of one another. The tree diagram has been formatted to display siblings on the same level as one another, and the three text nodes directly under the <p> node plus the two <q> nodes are siblings. The two text nodes contained by the two <q> nodes are descendants of the <p> node, but not children. Although they are on the same level as each other, they aren’t siblings because they don’t share a parent. What we read on the web page as continuous text is actually a mixture of text nodes and element nodes, and the element nodes, in turn, contain text nodes. The text may all appear continuous during rendering, but, as the tree shows, it lives at different levels of the hierarchy.
- The children of the <p> node, which are siblings of one another, occur in a particular order. This is why XML can be described as representing an ordered hierarchy of content objects.
The group of nodes that an XPath expression returns is a sequence, which is a technical term for an ordered collection of items that permits duplicates. A sequence is not the same thing as a set because, according to the formal definition, the members of a set are unordered and cannot contain duplicates. The students enrolled in this course constitute a set insofar as there are no duplicates (nobody can be enrolled more than once) and they have no inherent order (one can organize them by height or alphabetically or in many other ways, but doing that doesn’t change the identity of the set).
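The parent, child, and sibling relationships described in the list above can be checked with a short Python sketch. It assumes Python’s standard xml.dom.minidom module, which (like XPath) treats text as nodes in the tree; the variable names are invented for illustration, and the fragment is the sample markup from above without the ellipses.

```python
from xml.dom import minidom

# The sample fragment from the text, serialized without the ellipses.
doc = minidom.parseString(
    "<p>the node names <q>q</q> and <q>code</q>, which are</p>"
)
p = doc.documentElement

# childNodes lists text nodes and element nodes alike, in document order.
kinds = ["text" if n.nodeType == n.TEXT_NODE else n.nodeName for n in p.childNodes]

# Each <q> element is the parent of exactly one text node.
q_elements = [n for n in p.childNodes if n.nodeType == n.ELEMENT_NODE]
q_texts = [q.firstChild.data for q in q_elements]

# All five children report <p> as their parent, so they are siblings.
all_share_parent = all(n.parentNode is p for n in p.childNodes)

# The text nodes inside the two <q> elements have different parents,
# so they are not siblings even though they sit at the same depth.
different_parents = (
    q_elements[0].firstChild.parentNode is not q_elements[1].firstChild.parentNode
)
```

Here kinds holds the five children of <p> in order (text, q, text, q, text), which is the same ordered list of siblings that the tree diagram shows.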
XPath has four principal interrelated components, as follows (the examples are things one can do with each component, but they don’t illustrate how to do those things, about which see below):
Component | Purpose | Examples |
---|---|---|
path expression | Describe the location of some nodes in a tree. | |
axis | Describe the direction in which one looks in the tree. An axis is part of a path expression. | |
predicate | Filter the results of a path expression. | |
function | Do something with the information retrieved from the document instead of just returning it as received. | |
Each of these components is described in more detail below. (There are a few other types of XPath expressions that we don’t discuss here. For example, 1 + 2 is an XPath expression that describes a sequence of one item, the integer 3.)
Path expressions are used to navigate from a current location (called the context node) to other nodes in the tree. By default, specifying the name of a node type in a path expression says to look for it among the children of the current context node. New steps in a path expression are indicated with slash characters (note: not back-slashes), and the context node changes with each step. This will be clearer if we walk step-by-step through some examples. Let’s assume below that we’re dealing with a prose document that consists of chapters, marked up as <chapter> elements, each of which contains one or more paragraphs, marked up as <paragraph> elements. The paragraphs, in turn, contain a mixture of plain text and quotations, marked up as <quote> elements.
- quote means “collect all the <quote> child elements of the current context node.” If one launches this path expression from within a <paragraph> element, it retrieves all of the <quote> elements immediately inside that <paragraph> element, ignoring any others in the document. This means that it ignores <quote> elements outside the <paragraph> element context node, and it also ignores <quote> elements that are inside other <quote> elements in the <paragraph> element, since those more deeply-nested <quote> elements are not immediate children of the <paragraph> (they are children of children).
- chapter/paragraph/quote means “starting from the current context node, find all of the <chapter> elements that are its immediate children, then all of the <paragraph> elements that are children of those <chapter> elements, and then all of the <quote> elements that are children of those <paragraph> elements.”
Only the items returned by the last step in the path are added to the sequence to be returned by the path expression. In the preceding example, the system traverses <chapter> and <paragraph> elements on its way to find <quote> elements, but only the <quote> elements themselves are part of the value of the path expression, that is, of the sequence that the expression returns. The expression visits the other elements in passing, but it does not collect them.
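The two path expressions above can be tried out in a short Python sketch. It assumes Python’s standard xml.etree.ElementTree module, which supports only a small subset of XPath but enough for plain child steps; the sample novel and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document: one quotation is nested inside another.
novel = ET.fromstring(
    "<novel>"
    "<chapter>"
    "<paragraph>He said <quote>hello <quote>inner</quote></quote> and left.</paragraph>"
    "<paragraph><quote>Second</quote> paragraph.</paragraph>"
    "</chapter>"
    "<chapter><paragraph>No quotations here.</paragraph></chapter>"
    "</novel>"
)

# Context node = the first <paragraph>; "quote" looks only at its children,
# so the nested <quote> (a child of a child) is not returned.
first_paragraph = novel.find("chapter/paragraph")
direct_quotes = first_paragraph.findall("quote")

# Context node = <novel>; three steps, but only the last step is collected.
collected = novel.findall("chapter/paragraph/quote")
collected_tags = [element.tag for element in collected]
collected_texts = [element.text for element in collected]
```

The <chapter> and <paragraph> elements are visited along the way, but collected contains only <quote> elements, and the nested quotation is absent from both results because it is not a child of a <paragraph>.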
Slashes indicate stages in the path and the context node changes at each stage. Initially the context is wherever one starts (I’ll explain how that’s determined when we talk about how XSLT and XQuery use XPath), so in this example we begin by finding all of the <chapter> elements that are children of whatever element we’re in. Once we reach the first slash, the context node changes to the sequence of <chapter> elements that we just retrieved at the first step, so we’re now looking for <paragraph> elements that are children of those <chapter> elements. Another slash changes the context node yet again, this time to the sequence of all <paragraph> elements retrieved earlier, and we are now looking for <quote> elements that are children of those <paragraph> elements. Each step in the path is really defining a sequence of context nodes for the next step, and it then sets each one in turn as the new context node as it moves along the path.
By default, the steps in a path expression are the names of element nodes. It is also possible to address other types of nodes directly, such as attributes and text nodes. This means that, for example, if all paragraphs are tagged with an attribute value describing their language (e.g., <paragraph language="…">...</paragraph>), one could find all of the language information on paragraphs by navigating to the paragraphs and then not to any element within them, but to the @language attribute instead. Assuming paragraphs are inside chapters, which are inside a root <novel> element, that path expression might look like /novel/chapter/paragraph/@language. See below for an explanation of the leading slash. As is also explained below, in XPath a leading at-sign (@) identifies an attribute, and we’ll use one from now on when we talk about attributes in XPath, but the attribute name in the actual XML is written without the at-sign.
As stated, this path would not retrieve the paragraphs in a particular language; it would retrieve the @language attributes, the values of which are the names of the languages. It is, of course, possible to retrieve all paragraphs (<paragraph> elements) only if they are in English (for example) instead, but that isn’t what this particular path expression does; this path expression retrieves the @language attribute nodes themselves.
By default a step in an XPath looks for an element that is a child of the current context node. As was noted above, it is possible to specify other types of nodes than elements, and it is also possible to look for nodes that are not just children, but also, for example, parents or siblings. XPath is capable of navigating from any context to any other location in the tree.
The direction in which XPath looks at each step in a path is determined by an axis, and by default we look for element nodes on the child axis. The most important directional axes in XPath are:
- child: All nodes contained directly by the current context node.
- descendant: All nodes contained by the current context node, recursively, that is, all the way down the tree. In other words, the descendants of a node are its children, its children’s children, etc.
- parent: The node that contains the current context node. Within the social metaphor of the XML family, children have only one parent. The only node that does not have a parent is the node at the very top of the tree (above the root element), called the document node.
- ancestor: The parent of the current context node, that parent’s parent, etc., all the way up to the document node.
- preceding-sibling: All nodes that share a parent with the context node and precede it in document order. In the list you’re reading now, the preceding siblings of the current list item element are the other elements that precede it and have the same parent, which means the other list items that precede it in this list, but not those that follow it and not those that may precede it elsewhere in the document (since they have different parents).
- preceding: All nodes that precede the current context node in document order. This includes both preceding siblings and preceding nodes that are not siblings. Note that preceding must be understood in terms of nodes in a tree, rather than tags in a serialization. For this reason, ancestors are not preceding; although they begin before the current context (their start tag precedes it), the node itself doesn’t precede the current context because it is still open. That is, the start tag precedes the current context, but the element contains it, rather than preceding it, and XPath cares about elements, not tags.
- following-sibling: All nodes that share a parent with the context node and follow it in document order. The mirror image of the preceding-sibling axis.
- following: All nodes that follow the current context node in document order, including both following siblings and following nodes that are not siblings. The mirror image of the preceding axis.
These eight axes fully describe looking in any direction from the current context node (there is also a self axis, which stays at the current context node, and a few others that also aren’t used much). There is no sibling axis; if you want all siblings, regardless of direction, there are a couple of ways to express that, but there is no way to do so with just a single axis.
The axes can be categorized by direction (up, down, left, right) and distance (short, long), as follows:
Axis | Direction | Distance |
---|---|---|
child | down | short |
descendant | down | long |
parent | up | short |
ancestor | up | long |
preceding-sibling | left | short |
preceding | left | long |
following-sibling | right | short |
following | right | long |
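To make the directions in the table concrete, here is a minimal Python sketch that imitates three of the axes by walking pointers in a DOM tree. It assumes the standard xml.dom.minidom module; the helper functions and the sample list are hypothetical and only approximate what an XPath processor does.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<list><item>one</item><item>two</item><item>three</item><item>four</item></list>"
)
# The third <item> plays the role of the context node.
context = doc.getElementsByTagName("item")[2]

def walk(node, step):
    """Follow one pointer (e.g. 'parentNode') repeatedly, nearest node first."""
    found = []
    node = getattr(node, step)
    while node is not None:
        found.append(node)
        node = getattr(node, step)
    return found

# left / short distance: same parent, earlier in document order
preceding_siblings = [n.firstChild.data for n in walk(context, "previousSibling")]
# right / short distance: same parent, later in document order
following_siblings = [n.firstChild.data for n in walk(context, "nextSibling")]
# up / long distance: parent, parent's parent, ..., ending at the document node
ancestors = [n.nodeName for n in walk(context, "parentNode")]
```

Note that the last ancestor is the document node (which minidom names #document), not the root <list> element.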
The division of the tree into these eight directional axes is illustrated by the following example:
[Image courtesy of Syd Bauman, Northeastern University]
In the preceding image, intended to reflect the tree view of an XML document, the shaded diamond in the middle represents the current location, that is, the context node. The axes used to reach the other nodes are as follows:
Axes | Depiction | Nodes |
---|---|---|
child | Dark green edges | The three nodes immediately below the current location |
descendant | Dashed green line | The three child nodes mentioned above, plus the seven nodes below them, all the way down (their children and their children’s children) |
parent | Magenta edges | The node immediately above the current location |
ancestor | Magenta dashed line | The parent plus its parent, and its parent’s parent |
preceding-sibling | Dark red edges | The two nodes to the left of the current location that have the same parent |
preceding | Dark red dashed line | The preceding-sibling nodes plus the six other nodes that are entirely to the left of the current location |
following-sibling | Blue edges | The node to the right of the current location that has the same parent |
following | Blue dashed line | The following-sibling node plus the nine other nodes that are entirely to the right of the current location |
A step in a path expression actually contains not just the name of an element type (or other node specifier; one can specify things other than elements), but also an axis. We often don’t think about the axis because when no axis is specified explicitly, a default child axis is assumed, but the child axis is present, even if only implicitly, when no explicit axis is specified.
An axis is specified by taking its name followed by a double colon and prepending it to the element name (or other path step). For example, a path paragraph looks for <paragraph> elements on the child axis, while preceding-sibling::paragraph looks instead for <paragraph> elements that are preceding siblings. This means that paragraph as a step in a path by itself is short-hand for child::paragraph. Usually nobody specifies the child axis, since it’s implicit when it isn’t stated.
In addition to specifying the name of a specific element, one can look for any and all elements on an axis by using an asterisk (*). For example, the path paragraph/* means “find all the child <paragraph> elements of the current context and then find all of the child elements of those <paragraph> elements, regardless of element type.” The asterisk can be used on other axes, as well, so that preceding-sibling::* means “starting at the current context node, find all preceding sibling elements, regardless of element type.”
The notation single dot (.) refers to the current context, and is equivalent to self::*, that is, all of the nodes on the self axis, which is the one current context element, whatever it is. The notation double dot (..) refers to the one parent node, whatever it is, and is equivalent to parent::*.
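The asterisk, single dot, and double dot can be sketched in Python as follows. This assumes the standard xml.etree.ElementTree module, whose limited XPath subset happens to include all three notations; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

novel = ET.fromstring(
    "<novel>"
    "<chapter><title>One</title><paragraph>First.</paragraph>"
    "<paragraph>Second.</paragraph></chapter>"
    "<chapter><title>Two</title><paragraph>Third.</paragraph></chapter>"
    "</novel>"
)

# chapter/* : all child elements of each <chapter>, whatever their type.
children_of_chapters = [element.tag for element in novel.findall("chapter/*")]

# ./chapter : the dot is the current context node, so this is the same as "chapter".
explicit_context = novel.findall("./chapter")
implicit_context = novel.findall("chapter")

# chapter/title/.. : go down to each <title>, then back up to its parent.
parents_of_titles = [element.tag for element in novel.findall("chapter/title/..")]
```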
A slash (/) normally indicates a step in a path expression, telling the system to look for whatever follows with reference to the current context. This means that, for example, paragraph/quote means “find all of the <paragraph> elements that are children of the current context and then (slash = new step in the path) all of the <quote> elements that are children of each of those <paragraph> elements.” A slash at the very beginning of a path expression, though, has a special meaning: it means “start at the document node, at the top of the tree.” Thus, /paragraph means “find all of the <paragraph> elements that are immediate children of the document node,” a query that will succeed only if the root element of the document (the one that contains all other elements) happens to be a <paragraph> (and therefore immediately under the document node).
A double slash (//) is shorthand for the descendant axis (strictly speaking, it abbreviates /descendant-or-self::node()/, but for the examples here the effect is the same), so that chapter//quote would first find all of the <chapter> elements that are children of the current context and then find all of the <quote> elements anywhere within them, at any depth (children, children’s children, etc.). When used at the beginning of a path expression, e.g., //paragraph//quote, the double slash means that the path starts from the document node, at the top of the tree, and looks on the descendant axis. The preceding XPath expression therefore means “starting from the document node, find all descendant <paragraph> elements (= all <paragraph> elements anywhere in the document), and then find all <quote> elements anywhere inside those <paragraph> elements.” This is one way to find all <quote> elements anywhere inside <paragraph> elements at any depth, while ignoring <quote> elements that are not inside <paragraph> elements.
Attributes are not children and are not located on the child axis. Instead, they are located on their own attribute axis. The attribute axis can be specified as attribute::, but it is usually abbreviated as an at sign (@). For example, the path expression paragraph/@language, which is short for child::paragraph/attribute::language, starts at the current context, finds all of the <paragraph> elements on the child axis, and then finds the @language attribute on each <paragraph> element. If a <paragraph> element doesn’t happen to contain a @language attribute, nothing is added to the sequence for that particular <paragraph>. Curiously, although attributes are not children (they are not located on the child axis), they do have parents, which are the elements to which they’re attached. This means that in the preceding example, although the @language attribute is not a child of the <paragraph> element (because attributes by definition are not children, they are located on the attribute axis, rather than the child axis), the <paragraph> element is nonetheless a parent of the attribute, and is found on the parent axis when the current context node is the attribute node itself. One can specify all of the attributes of the particular context node (which must be an element for this to make sense, since only elements can have attributes) with @* (short for attribute::*), so that paragraph/@* navigates to all of the <paragraph> elements that are children of the current context node and then to all of the attributes of any type that are associated with each of them.
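Here is a rough Python equivalent of paragraph/@language and paragraph/@*. It assumes the standard xml.etree.ElementTree module, which cannot return attribute nodes from a path, so the sketch navigates to the elements and then reads their attributes directly; the attribute values and the @n attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

chapter = ET.fromstring(
    "<chapter>"
    '<paragraph language="english" n="1">Hello.</paragraph>'
    "<paragraph>No attributes here.</paragraph>"
    '<paragraph language="french" n="3">Bonjour.</paragraph>'
    "</chapter>"
)
paragraphs = chapter.findall("paragraph")

# Roughly paragraph/@language: a paragraph without the attribute adds nothing.
languages = [p.get("language") for p in paragraphs if p.get("language") is not None]

# Roughly paragraph/@*: every attribute of every child <paragraph>.
all_attributes = [(name, value) for p in paragraphs for name, value in p.attrib.items()]

# Attributes are not children: iterating over an element yields child elements only.
children_of_first = list(paragraphs[0])
```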
In addition to specifying elements and attributes by name, one can specify text nodes as text(), so that, for example, paragraph/text() navigates first to the <paragraph> elements that are children of the current context node and then to all of the text nodes that are their immediate children. Similarly, one can use the shorthand notation node() to refer to all types of nodes together. For example, paragraph/node() first finds all of the <paragraph> elements that are children of the current context and then all of the nodes of any type that are children of those <paragraph> nodes. Remember, though, that since no axis is specified explicitly before node(), the child axis is implied. This means that node() refers to elements and text nodes, but not attribute nodes, because attribute nodes are not found on the child axis.
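The distinction between text(), node(), and attributes can be imitated in Python. This sketch assumes the standard xml.dom.minidom module (ElementTree’s path subset has no text() or node() test); the sample paragraph is invented.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<paragraph language="english">He said <quote>hello</quote> and left.</paragraph>'
)
paragraph = doc.documentElement

# Roughly paragraph/node(): every child node, elements and text alike.
child_nodes = list(paragraph.childNodes)

# Roughly paragraph/text(): only the text nodes that are immediate children,
# so the text inside <quote> is not included.
text_children = [n.data for n in child_nodes if n.nodeType == n.TEXT_NODE]

# The attribute is reachable, but not through the list of children.
attribute_names = list(paragraph.attributes.keys())
```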
Predicates are used to filter the results of path expressions. The sequences that are returned by path expressions have an inherent and stable order, which is called document order. In XPath, document order is defined as depth first, which means that when the system has to return nodes in order, it looks down before it looks right, it never looks up (except to resume where it left off), and it never looks left. Here’s an example:
Suppose we use the path expression p//* to find all of the elements of any type (thus the asterisk) anywhere (thus the double slash, which means descendant axis) inside a <p> element (that is a child of the current context). The preceding example shows one such <p> element with all of its descendant elements numbered in document order. Their type is not specified because this particular path expression is looking at all descendant elements, without checking their type.
Because XPath document order is depth first, the processor looks down and to the left and finds the first element to add to the sequence to be returned, which is #1. But what should the second element in the sequence be? In a depth first system, like XPath, before the processor looks for the siblings of #1, it looks to see whether #1 has any children, and if so, it goes there first, so the next element it retrieves is #2. Since #2, in turn, has children, the system then gets #3. Because #3 doesn’t have children, the system then looks to the right, where it finds #4. At that point it has hit a dead end, with no children and no following siblings. It therefore backs up to the most recent place where it turned down, which is #2. Since the system has already visited the children of #2 (#3 and #4) and #2 doesn’t have any following siblings, it backs up again, this time to #1. It has already visited its children (it has only one child, #2), so it looks to its following siblings and finds #5. Before it continues scanning other siblings, though, it notices that #5 has a child, #6, so it heads there next, etc., traversing the tree according to the numbering above.
The procedure for a depth-first traversal of the tree can be illustrated by the following flow chart:
If you’re not familiar with flow charts, the conventions used here are:
“Succeed” arrow. If you fail because there are no children (either none at all or none that you haven’t already visited), you follow the “Fail” arrow instead and try to get a sibling.
If you follow the full sequence (in either the flow chart above or the numbered node diagram above that), you’ll see that the algorithm is that you collect nodes as you visit them and add them to the sequence you’re collecting, but you don’t add any node more than once. The charts show the order for visiting nodes in a depth-first traversal:
“Get a new child?” If that attempted visit succeeds, you add the node to the sequence you’re building and it becomes the new context node, so now try to visit its children. This is shown in the flow chart as looping on success (that is, if you find a child, you then look at its children).
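The depth-first traversal described above can be written as a short recursive function. The following Python sketch assumes the standard xml.etree.ElementTree module and reconstructs only the part of the numbered tree that the text spells out (elements #1 through #6, here named n1 through n6); the element names and the function name are made up.

```python
import xml.etree.ElementTree as ET

# n1 has one child, n2; n2 has children n3 and n4; n5 follows n1 and has child n6.
p = ET.fromstring("<p><n1><n2><n3/><n4/></n2></n1><n5><n6/></n5></p>")

def document_order(element):
    """Yield descendants depth first: look down before looking right."""
    for child in element:
        yield child                        # collect the node when first visited
        yield from document_order(child)   # then its children, before its siblings

visited = [element.tag for element in document_order(p)]

# ElementTree's own descendant search returns the same order.
built_in = [element.tag for element in p.findall(".//*")]
```

Backing up after a dead end happens implicitly here: when a recursive call runs out of children, control returns to the loop over the parent’s remaining children.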
In general it’s best to think of XML in terms of nodes on a tree (element, attribute, and text nodes), and not as a stream of characters with tags thrown in, but some users find the tag perspective helpful when considering document order, especially in the case of the preceding and following axes. From the perspective of tags:
- Elements on the preceding axis (preceding::*) are those with end tags that precede the start tag of the current context element. If the end tag of an element precedes the start tag of the current context element, it means that the entire other element must precede the current context.
- Elements on the following axis (following::*) are those with start tags that follow the end tag of the current context element. If the start tag of an element follows the end tag of the current context element, it means that the entire other element must follow the current context.
I’ve spent a lot of time discussing depth-first order because it can be used to filter a sequence by position. Suppose you want to format the first paragraph of each chapter specially, perhaps with a drop cap, or by suppressing the indentation that you apply to all other paragraphs. One way to do this is to mark up that paragraph differently from the others, but that’s a fragile solution, since if you decide to rearrange the text and the paragraph is no longer first, you have to change the markup in addition to moving it. On the web you can apply special formatting to the first paragraph using Cascading Style Sheets (CSS), but not all publication is on the web. In XPath, though, you can specify the first paragraph of each chapter by using a path expression like //chapter/paragraph[1]. This says:
“First start at the document node, at the very top of the document, and find all descendant <chapter> elements, that is, all <chapter> elements anywhere in the document. Then, for each of them, find all of its child <paragraph> elements and select only the first one.” There’s nothing magic about “first,” although it’s typically the most useful in real projects. If what you care about is the third paragraph of each chapter, //chapter/paragraph[3] will retrieve that. Two other important details about numerical predicates are that:
- The path expression //chapter/paragraph[last()] will retrieve the last paragraph of each chapter, without your having to tell it how many paragraphs there are.
- On axes that look up (ancestor) or left (preceding-sibling, preceding), the first node is the one closest to the current context node, etc., as if one were traversing a depth-first sequence backwards. You can find illustrations of numbered traversal on different axes on pp. 609–12 of Michael Kay’s book.
Predicates are expressed by putting them in square brackets after the step in the path expression to which they apply, and it doesn’t have to be the last step. For example, //chapter[1]/paragraph[2] finds every <chapter> element anywhere in the document that is the first <chapter> child of its parent (in a document where all of the chapters are siblings, that is just the first chapter), and it then gets all of the <paragraph> elements that are children of that <chapter> and keeps just the second of them.
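Numerical predicates are among the few predicates that Python’s standard xml.etree.ElementTree module understands, so the expressions above can be sketched almost literally. The sample document is made up, and the paths start with a dot because ElementTree searches from an element rather than from the document node.

```python
import xml.etree.ElementTree as ET

novel = ET.fromstring(
    "<novel>"
    "<chapter><paragraph>A</paragraph><paragraph>B</paragraph>"
    "<paragraph>C</paragraph></chapter>"
    "<chapter><paragraph>D</paragraph></chapter>"
    "</novel>"
)

def texts(path):
    """Run a path from the root element and return the text of each result."""
    return [element.text for element in novel.findall(path)]

first_of_each = texts(".//chapter/paragraph[1]")      # first paragraph of every chapter
third_of_each = texts(".//chapter/paragraph[3]")      # only chapters with 3+ paragraphs
last_of_each = texts(".//chapter/paragraph[last()]")  # no need to know the count
second_of_first = texts(".//chapter[1]/paragraph[2]") # predicate on a non-final step
```

The second chapter contributes nothing to third_of_each because it has no third paragraph; a predicate that matches nothing simply adds nothing to the sequence.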
Any expression in square brackets that filters a step in a path expression is a predicate. Numerical predicates are the easiest to understand, since they test simply for the location of an element in a sequence returned by a depth-first traversal of the tree. More complex predicates use functions, described in the next section.
Functions operate on the information returned by a path expression or another function. For example, the path expression chapter/paragraph finds all of the <chapter> children of the current context and uses them to find all of their <paragraph> children. If you don’t need the actual paragraphs, and you just want to count them, you can use the count() function, so that chapter/count(paragraph) means that once you’ve made it to the chapter, you should return not the paragraphs themselves, but just a count of them. This XPath expression will return a sequence of number values, giving the count of the number of paragraphs in each chapter (that is, a count of the number of <paragraph> elements inside each <chapter> element).
Note that this expression is different from count(chapter/paragraph). The latter expression returns only one number because it defers counting until it has retrieved all of the paragraphs inside all of the chapter elements that are children of the current context. The first expression, on the other hand, counts separately inside each chapter. The difference is that the two use the count() function at different steps in the path expression. There are two steps (find the chapters, and then for each chapter find the paragraphs), and one can count at either point.
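The difference between counting at the last step and counting the whole result can be imitated in Python. This is only an analogy: the standard xml.etree.ElementTree module has no count() function, so the sketch uses Python’s len() at the corresponding points, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

novel = ET.fromstring(
    "<novel>"
    "<chapter><paragraph>A</paragraph><paragraph>B</paragraph>"
    "<paragraph>C</paragraph></chapter>"
    "<chapter><paragraph>D</paragraph></chapter>"
    "</novel>"
)

# Like chapter/count(paragraph): count inside each chapter, one number per chapter.
per_chapter = [len(chapter.findall("paragraph")) for chapter in novel.findall("chapter")]

# Like count(chapter/paragraph): collect all paragraphs first, then count once.
overall = len(novel.findall("chapter/paragraph"))
```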
XPath has a little more than one hundred functions, but in practical projects you’ll rarely need more than a couple of dozen, which you’ll learn quickly as you start using them. Don’t try to memorize them all, but do read over the full list periodically, just to remind yourself of what’s available, so that you can look it up as needed. There are organized lists of all of the XPath functions at https://www.w3schools.com/xml/xsl_functions.asp and detailed discussion with examples in Michael Kay’s book.
Functions can be nested. For example, there is a string-manipulation function to convert all text to lower case and a different function to normalize the white space (spaces, tabs, new lines, etc.) in text (the rule converts all white space to plain space characters, reduces all sequences of white-space characters to single spaces, and removes all leading and trailing white space). If you want to retrieve a set of values and perform both of these functions, you can nest them: normalize-space(lower-case(.)). This means take the current context node (represented by the dot), convert any text in it to lower case, and then take the output of the lower-case() function and normalize the white space in it.
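Nesting works the same way in most programming languages: the inner function runs first and its output becomes the input of the outer one. The Python sketch below defines two hypothetical stand-ins for the XPath functions; they are approximations (for example, Python’s split() recognizes a few more white-space characters than XML does), not reimplementations.

```python
def lower_case(text):
    """Stand-in for XPath lower-case(): convert the text to lower case."""
    return text.lower()

def normalize_space(text):
    """Stand-in for XPath normalize-space(): trim both ends and
    collapse every run of white space to a single space."""
    return " ".join(text.split())

raw = "  The  Quick\n\tBrown Fox "

# Inner function first, then the outer one, as in normalize-space(lower-case(.)).
result = normalize_space(lower_case(raw))
```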
You can use functions in predicates to filter expressions. For example, if you want to retrieve all of the chapters that consist of just a single paragraph (perhaps as part of proof-reading; if they consist of a single paragraph, perhaps they shouldn’t have been independent chapters in the first place), you can do that with //chapter[count(paragraph) eq 1]. This says “first find all of the <chapter> elements and then filter them by saving only the ones where the number of <paragraph> elements they contain is equal to 1.” Note that the <paragraph> elements in question are on the child axis because that’s what’s implied whenever no axis is specified.
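A predicate that uses a function is a filter with a test inside it, which a Python list comprehension expresses directly. The sketch assumes the standard xml.etree.ElementTree module, which cannot evaluate count() in a predicate, so the test is written in Python instead; the @n attribute on the sample chapters is hypothetical and serves only to identify the results.

```python
import xml.etree.ElementTree as ET

novel = ET.fromstring(
    "<novel>"
    '<chapter n="1"><paragraph>A</paragraph><paragraph>B</paragraph></chapter>'
    '<chapter n="2"><paragraph>C</paragraph></chapter>'
    '<chapter n="3"><paragraph>D</paragraph><paragraph>E</paragraph></chapter>'
    "</novel>"
)

# Like //chapter[count(paragraph) eq 1]: find all chapters, keep those
# with exactly one <paragraph> child (child axis, since none is stated).
single_paragraph_chapters = [
    chapter for chapter in novel.findall(".//chapter")
    if len(chapter.findall("paragraph")) == 1
]
chapter_numbers = [chapter.get("n") for chapter in single_paragraph_chapters]
```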
You can also apply sequential predicates. Suppose you want to find all first paragraphs of chapters that contain more than a hundred characters. XPath provides a string-length() function that returns the length of text by counting characters. When it does this, it operates on the string value of the element, which is the concatenation of all textual characters anywhere inside it, no matter how deeply they may be nested. In other words, if a paragraph contains a mixture of plain text and, say, <quote> elements, the string-length() function, when applied to that paragraph, will count equally the textual characters directly inside the <paragraph> element and those inside the <quote> elements that may be inside the <paragraph> element. The XPath to specify all first paragraphs of chapters only if they contain more than a hundred characters is //chapter/paragraph[1][string-length(.) gt 100]. This says “find all of the <chapter> elements anywhere in the document, then find their child <paragraph> elements and select only the first ones. Then filter those by selecting only the ones whose string length is greater than 100, that is, that contain more than 100 characters.” The dot in the string-length() function here refers to the current context node, which became a <paragraph> element at the step of the path expression that specified paragraph.
Note that retrieving the first paragraphs of all chapters only if they contain more than 100 characters is not the same as retrieving, for all chapters, the first paragraphs that contain more than 100 characters. The first of these tasks will return nothing for chapters where the first paragraph fails to contain more than 100 characters. The second will return nothing for a chapter only if none of its paragraphs contains more than 100 characters, and you could write it as //chapter/paragraph[string-length() gt 100][1]. The way this expression operates is that it finds all chapters in the document, and then, for each chapter, it finds all of its paragraph children. It filters those paragraph children by keeping only the ones longer than 100 characters, and it then keeps only the first of the paragraphs that survive that filtering. The two expressions, //chapter/paragraph[1][string-length(.) gt 100] and //chapter/paragraph[string-length() gt 100][1], return different things because the predicates are applied in order, from left to right.
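The effect of predicate order can be reproduced in Python by applying the two filters in the two possible orders. This is a sketch under stated assumptions: it uses the standard xml.etree.ElementTree module, a hypothetical string_length() helper that measures the concatenated text of an element, and made-up paragraphs identified by an invented @id attribute.

```python
import xml.etree.ElementTree as ET

filler = "x" * 60
novel = ET.fromstring(
    "<novel>"
    '<chapter><paragraph id="1.1">Short.</paragraph>'
    f'<paragraph id="1.2">{filler}<quote>{filler}</quote></paragraph></chapter>'
    f'<chapter><paragraph id="2.1">{filler}{filler}</paragraph>'
    f'<paragraph id="2.2">{filler}{filler}</paragraph></chapter>'
    "</novel>"
)

def string_length(element):
    """Length of the string value: all text inside the element, at any depth."""
    return len("".join(element.itertext()))

first_then_long = []   # like //chapter/paragraph[1][string-length(.) gt 100]
long_then_first = []   # like //chapter/paragraph[string-length() gt 100][1]
for chapter in novel.findall(".//chapter"):
    paragraphs = chapter.findall("paragraph")
    # Keep the first paragraph, then test its length.
    first_then_long += [p.get("id") for p in paragraphs[:1] if string_length(p) > 100]
    # Keep the long paragraphs, then take the first survivor.
    long_then_first += [p.get("id") for p in paragraphs if string_length(p) > 100][:1]
```

Paragraph 1.2 counts as long only because the text inside its <quote> is included in the string value, and it appears only in the second result because it is not the first paragraph of its chapter.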
The preceding survey of XPath has introduced a lot of new terms. For review purposes, the ones you should remember (or, at least, recognize when you see them again) are:
Term | Definition |
---|---|
axis | Path direction and scope, e.g., ancestor, preceding-sibling. |
depth-first order | See document order, below. |
document node | The node that serves as the parent of the top-level, or root, element. The document node is the only node of any type on an XML tree that does not have a parent node. |
document order | XPath traverses the tree in depth-first order, which means that it visits nodes in order and looks at a node’s children before it looks at its following siblings. |
function | Operation that can be performed on the result of a path expression, e.g., counting the number of nodes and returning just the count instead of the nodes themselves. |
node | Part of an XML document. The most important types of nodes are element, attribute, and text(). |
path expression | The way to reach the nodes you care about. Path expression may have multiple steps, separated by slash characters. |
predicate | A filter applied to the results of a path expression, specified in square brackets. |
root element | The element that contains the entire document. The root element is actually the child of the document node. |
sequence | An ordered collection of pieces of information. One example of a sequence is the nodes singled out from the tree in document order by an XPath expression. |
See also the table of axes, above. The shorthand axis notation is:
Symbol | Meaning | Expanded version |
---|---|---|
. | current context node | self::* (for elements) |
.. | parent element | parent::* |
// | descendant axis | descendant:: . At the beginning of a path expression, it means that the path starts at the document node. |
@ | attribute axis | attribute:: |
Slash (/) indicates a step in a path expression. At the beginning of a path expression, it represents the document node.