Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-02-07T15:53:06+0000


What can XPath do for me?


Introduction

As we discussed in our general introduction to XML (What is XML and why should humanities scholars care?), there are two principal sets of reasons why digital humanists use XML to model their texts:

  1. XML is a formal model designed to represent an ordered hierarchy, and to the extent that human documents are logically ordered and hierarchical, they can be formalized and represented easily as XML documents.
  2. Computers can operate very quickly and efficiently on trees (ordered hierarchies), much more quickly and efficiently than they can on non-hierarchical text. This means that if we can model the documents we need to study as trees, we can manage and manipulate large amounts of data in a shorter time, and using fewer computer resources.

XPath is a language for selecting parts of an XML document for subsequent processing. As such, the main thing an XPath expression does is allow the user to describe, in a formal way that a computer can process easily, certain parts of a document (e.g., all of the paragraphs, all of the first paragraphs of a section, unless the section is part of an appendix, etc.). In addition to defining specific parts of a document, XPath can also manipulate the data it finds (see the discussion of functions below), but the main thing it does is serve as a helper language, or ancillary technology, to identify parts of a document that will then be manipulated by another language. The principal XML-related languages that employ XPath to find information in XML documents are XSLT (eXtensible Stylesheet Language Transformations) and XQuery (XML Query language). We’ll learn about using XSLT and XQuery to manipulate XML documents later, but before you can do something with information in an XML document you have to be able to find it, and that’s what XPath does.

This document provides some basic information about how to use XPath to describe, find, and navigate to information inside an XML document. For the most part you won’t yet be doing anything with the information you find, but once you’ve learned how to find it, you’ll employ that knowledge in subsequent lessons about XSLT and XQuery to interrogate and manipulate your source documents. The introduction you’re reading now is not a complete description of XPath, but it will get you started, and you can then find information about additional XPath resources in Michael Kay’s book.

Node

The discussion of XPath components (path expressions, axes, predicates, functions) below depends on two key concepts, nodes and sequences. A node is a piece of information in the XML tree, such as an element, an attribute, or a string of text. The XHTML paragraph that you are reading now is a single <p> element node that contains a mixture of text nodes (strings of plain text), <code> element nodes (used for snippets of XML, such as the element names <q> and <code>, which are highlighted typographically by my style sheet), <q> element nodes (identifying quoted text, which has quotation marks inserted automatically during rendering), etc. In this particular example, each of the element nodes (<code>, <q>, etc.) within the <p> node happens to contain, in turn, just a single text node, but elements can also contain just other element nodes, just text, a mixture of elements and text, or nothing at all. The <p> node, in turn, is contained within a section (<div>) node, etc. This can be illustrated with the following partial tree diagram (element nodes are depicted as yellow ovals and text nodes as blue rectangles):

[Tree diagram: a p element node whose children are, in order, a text node (“… the node names”), a q element node containing the text node “q”, a text node (“and”), a q element node containing the text node “code”, and a text node (“, which are …”)]

This partial tree corresponds to the following text as it appears in the paragraph above (without the ellipsis points, which I’ve added):

… the node names q and code, which are …

This, in turn, appears in the markup serialization as:

<p>… the node names <q>q</q> and <q>code</q>, which are …</p>

These three perspectives (tree, rendering, XML serialization) all represent the same XML document. What XML (and, therefore, XPath) cares about is the tree. In XPath terms (look at the tree):

Sequence

The group of nodes that an XPath expression returns is a sequence, which is a technical term for an ordered collection of items that permits duplicates. A sequence is not the same thing as a set because, according to the formal definition, the members of a set are unordered and cannot contain duplicates. The students enrolled in this course constitute a set insofar as there are no duplicates (nobody can be enrolled more than once) and they have no inherent order (one can organize them by height or alphabetically or in many other ways, but they remain the same people).
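The difference between a sequence and a set can be sketched in Python, where a list behaves roughly like a sequence (ordered, duplicates allowed) and a set behaves like a mathematical set. This is only an analogy, and the values below are made up for illustration.

```python
# Hypothetical values standing in for the items of an XPath sequence.
sequence = ["q", "code", "q"]

# A list, like a sequence, keeps its order and its duplicates.
print(sequence)        # ['q', 'code', 'q']
print(len(sequence))   # 3

# A set discards the duplicate and has no inherent order.
as_set = set(sequence)
print(len(as_set))     # 2
```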

XPath components

XPath has four principal interrelated components, as follows (the examples are things one can do with each component, but they don’t illustrate how to do those things, about which see below):

Component Purpose Examples
path expression Describe the location of some nodes in a tree.
  1. Find all <paragraph> elements in the tree.
  2. Given a <chapter> element in the tree, find all <footnote> elements inside it.
  3. Given a <chapter> element in the tree, find all <footnote> elements inside it that contain a <definition> element. This last expression requires a predicate, about which see below.
axis Describe the direction in which one looks in the tree. An axis is part of a path expression.
  1. From a particular location in the tree, find all preceding <footnote> elements. Because we’re looking for preceding footnotes, the direction searched is backwards or left.
  2. From a particular <paragraph> element in the tree, find the <title> element of the <chapter> element that contains it. The direction searched is first upward in the tree (to the containing <chapter> element) and then downward (to the <title> element contained by that <chapter> element).
predicate Filter the results of a path expression.
  1. Find the first <paragraph> element in each <chapter> element. XPath does this by finding all <paragraph> elements in each <chapter> element and then filtering out the ones that are not the first in their cohort.
  2. Find all of the <paragraph> elements that contain <illustration> elements, ignoring the ones that don’t. You aren’t trying to retrieve the <illustration> elements themselves; you’re using them to filter the set of all <paragraph> elements according to whether or not they contain <illustration> elements.
function Do something with the information retrieved from the document instead of just returning it as received.
  1. Retrieve all of the <paragraph> elements in a <chapter> element (so far this is just a path expression) but instead of returning the actual elements, return just a count of how many there are. This uses the count() function.
  2. Retrieve a bunch of nodes that contain textual items (such as from a list) and concatenate their contents into a single string, inserting a comma and space after each one except the last. This uses the string-join() function.

Each of these components is described in more detail below. (There are a few other types of XPath expressions that we don’t discuss here. For example, 1 + 2 is an XPath expression that describes a sequence of one item, the integer 3.)

Paths

Path expressions are used to navigate from a current location (called the context node) to other nodes in the tree. By default, specifying the name of a node type in a path expression says to look for it among the children of the current context node. New steps in a path expression are indicated with slash characters (note: not back-slashes), and the context node changes with each step. This will be clearer if we walk step-by-step through some examples. Let’s assume below that we’re dealing with a prose document that consists of chapters, marked up as <chapter> elements, each of which contains one or more paragraphs, marked up as <paragraph> elements. The paragraphs, in turn, contain a mixture of plain text and quotations, marked up as <quote> elements.
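As a rough illustration, the three-step path chapter/paragraph/quote over such a document can be mimicked in Python. This is a sketch under assumptions: the sample document (including its <book> wrapper) is invented, and the standard-library xml.etree.ElementTree module supports only a small subset of XPath, so it stands in here for a real XPath processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; only the element names come from the text.
xml = """
<book>
  <chapter>
    <paragraph>Plain text and <quote>first quote</quote>.</paragraph>
    <paragraph>No quotation here.</paragraph>
  </chapter>
  <chapter>
    <paragraph><quote>second quote</quote> and <quote>third quote</quote></paragraph>
  </chapter>
</book>
"""
book = ET.fromstring(xml)   # the <book> element serves as the context node

# One step at a time: each step supplies the context nodes for the next.
chapters = book.findall("chapter")
paragraphs = [p for c in chapters for p in c.findall("paragraph")]
quotes = [q for p in paragraphs for q in p.findall("quote")]

# The same path written as a single expression.
same_quotes = book.findall("chapter/paragraph/quote")

# Only the nodes found by the last step are returned.
print([q.text for q in quotes])   # ['first quote', 'second quote', 'third quote']
```

The chapters and paragraphs are visited along the way, but only the quotes end up in the result.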

Only the items returned by the last step in the path are added to the sequence to be returned by the path expression. In the preceding example, the system traverses <chapter> and <paragraph> elements on its way to find <quote> elements, but only the <quote> elements themselves are part of the value of the path expression, that is, of the sequence that the expression returns. The expression visits the other elements in passing, but it does not collect them.

Slashes indicate stages in the path and the context node changes at each stage. Initially the context is wherever one starts (I’ll explain how that’s determined when we talk about how XSLT and XQuery use XPath), so in this example we begin by finding all of the <chapter> elements that are children of whatever element we’re in. Once we reach the first slash, the context node changes to the sequence of <chapter> elements that we just retrieved at the first step, so we’re now looking for <paragraph> elements that are children of those <chapter> elements. Another slash changes the context node yet again, this time to the sequence of all <paragraph> elements retrieved earlier, and we are now looking for <quote> elements that are children of those <paragraph> elements. Each step in the path is really defining a sequence of context nodes for the next step, and it then sets each one in turn as the new context node as it moves along the path.

By default, the steps in a path expression are the names of element nodes. It is also possible to address other types of nodes directly, such as attributes and text nodes. This means that, for example, if all paragraphs are tagged with an attribute value describing their language (e.g., <paragraph language="...">...</paragraph>), one could find all of the language labels on paragraphs by navigating to the paragraphs and then not to any element within them, but to the @language attribute instead. (As explained below, in XPath a leading at-sign identifies an attribute, and we’ll use one from now on when we talk about attributes in XPath, but the attribute name in the actual XML is written without the at-sign.) Note that as stated, this path wouldn’t retrieve the paragraphs in a particular language; it would retrieve the values of the @language attribute, that is, the names of the languages. It is, of course, possible to retrieve all paragraphs only if they are in English, but that isn’t what this particular path expression does.
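The distinction between retrieving the language labels and retrieving the paragraphs in a given language can be sketched as follows. The sample document and its attribute values are assumptions made for illustration, and Python’s ElementTree is used as an approximate stand-in for an XPath processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter; the attribute values are invented.
chapter = ET.fromstring(
    "<chapter>"
    "<paragraph language='english'>Hello</paragraph>"
    "<paragraph language='french'>Bonjour</paragraph>"
    "<paragraph>No language label</paragraph>"
    "</chapter>"
)

# Rough counterpart of paragraph/@language: the attribute values themselves.
# A paragraph without the attribute contributes nothing to the result.
languages = [
    p.get("language")
    for p in chapter.findall("paragraph")
    if p.get("language") is not None
]
print(languages)   # ['english', 'french']

# A different request: the paragraphs whose language is English.
english = chapter.findall("paragraph[@language='english']")
print([p.text for p in english])   # ['Hello']
```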

Axes

By default a step in an XPath looks for an element that is a child of the current context node. As was noted above, it is possible to specify other types of nodes than elements, and it is also possible to look for nodes that are not just children, but also, for example, parents or siblings. XPath is capable of navigating from any context to any other location in the tree.

The direction in which XPath looks at each step in a path is determined by an axis, and by default we look for element nodes on the child axis. The most important directional axes in XPath are:

These eight axes fully describe looking in any direction from the current context node (there is also a self axis, which stays at the current context node, and a few others that also aren’t used much). There is no sibling axis; if you want all siblings, regardless of direction, there are a couple of ways to express that, but there is no way to do so with just a single axis.

The axes can be categorized by direction (up, down, left, right) and distance (short, long), as follows:

Axis Direction Distance
child down short
descendant down long
parent up short
ancestor up long
preceding-sibling left short
preceding left long
following-sibling right short
following right long

The division of the tree into these eight directional axes is illustrated by the following example:

[Illustration of XPath axes]

[Image courtesy of Syd Bauman, Northeastern University]

In the preceding image, intended to reflect the tree view of an XML document, the shaded diamond in the middle represents the current location. The axes used to reach the other nodes are as follows:

Axes Depiction Nodes
child Dark green edges The three nodes immediately below the current location
descendant Dashed green line The three child nodes mentioned above, plus the seven nodes below them, all the way down (their children and their children’s children)
parent Magenta edges The node immediately above the current location
ancestor Magenta dashed line The parent plus its parent, and its parent’s parent
preceding-sibling Dark red edges The two nodes to the left of the current location that have the same parent
preceding Dark red dashed line The preceding-sibling nodes plus the six other nodes that are entirely to the left of the current location
following-sibling Blue edges The node to the right of the current location that has the same parent
following Blue dashed line The following-sibling node plus the nine other nodes that are entirely to the right of the current location

A step in a path expression actually contains not just the name of an element type (or other node specifier; one can specify things other than elements), but also an axis. We often don’t think about the axis because when no axis is specified explicitly, a default child axis is assumed, but the child axis is present, even if only implicitly, when no explicit axis is specified.

An axis is specified by taking its name followed by a double colon and prepending it to the element name (or other path step). For example, a path paragraph looks for <paragraph> elements on the child axis, while preceding-sibling::paragraph looks instead for <paragraph> elements that are preceding siblings. This means that paragraph as a step in a path by itself is short-hand for child::paragraph. Usually nobody specifies the child axis, since it’s implicit when it isn’t stated.
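To make the eight directional axes concrete, here is a minimal sketch that computes them by hand for one context node. ElementTree does not implement most axes, so the sketch builds them from a parent map. The sample document, the id attributes, and the helper names (ids, ancestors, axes) are all hypothetical, and results are listed in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the ids only make the output easy to read.
root = ET.fromstring(
    "<chapter id='c'>"
    "<title id='t'/>"
    "<paragraph id='p1'><quote id='q1'/></paragraph>"
    "<paragraph id='p2'><quote id='q2'/></paragraph>"
    "<paragraph id='p3'/>"
    "</chapter>"
)
parent_of = {child: parent for parent in root.iter() for child in parent}
doc_order = list(root.iter())   # every element, in document order

def ids(elements):
    return [e.get("id") for e in elements]

def ancestors(node):
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result

def axes(context):
    """Return the ids of the elements on each directional axis."""
    parent = parent_of.get(context)
    siblings = list(parent) if parent is not None else [context]
    here = siblings.index(context)
    pos = doc_order.index(context)
    up = ancestors(context)
    down = list(context.iter())[1:]   # iter() yields the context itself first
    return {
        "child": ids(list(context)),
        "descendant": ids(down),
        "parent": ids(up[:1]),
        "ancestor": ids(up),
        "preceding-sibling": ids(siblings[:here]),
        "following-sibling": ids(siblings[here + 1:]),
        "preceding": ids(e for e in doc_order[:pos] if e not in up),
        "following": ids(e for e in doc_order[pos + 1:] if e not in down),
    }

result = axes(root.find("paragraph[2]"))   # context: the second paragraph
print(result["preceding-sibling"])   # ['t', 'p1']
print(result["preceding"])           # ['t', 'p1', 'q1']
```

Note that the preceding axis excludes ancestors and the following axis excludes descendants, which is why the containing chapter does not show up as preceding.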

In addition to specifying the name of a specific element, one can look for any and all elements on an axis by using an asterisk (*). For example, the path paragraph/* means find all the child <paragraph> elements of the current context and then find all of the child elements of those <paragraph> elements, regardless of element type. The asterisk can be used on other axes, as well, so that preceding-sibling::* means starting at the current context node, find all preceding sibling elements, regardless of element type.
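The wildcard step paragraph/* can be tried out with ElementTree, which happens to support the asterisk on the child axis. The sample document is invented, and <emphasis> is a hypothetical element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter with child elements of several types.
chapter = ET.fromstring(
    "<chapter>"
    "<paragraph>Text <quote>a</quote> and <emphasis>b</emphasis></paragraph>"
    "<paragraph><illustration/></paragraph>"
    "</chapter>"
)

# paragraph/*: all child elements of the child paragraphs, whatever their type.
children = chapter.findall("paragraph/*")
names = [e.tag for e in children]
print(names)   # ['quote', 'emphasis', 'illustration']
```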

The notation single dot (.) refers to the current context node, whatever it is, and is shorthand for self::node(); when the context node is an element, this selects the same node as self::*. The notation double dot (..) refers to the one parent node, whatever it is, and is shorthand for parent::node().

A slash (/) normally indicates a step in a path expression, telling the system to look for whatever follows with reference to the current context. This means that, for example, paragraph/quote means find all of the <paragraph> elements that are children of the current context and then (slash = new step in the path) all of the <quote> elements that are children of each of those <paragraph> elements. A slash at the very beginning of a path expression, though, has a special meaning: it means start at the document node, at the top of the tree. Thus, /paragraph means find all of the <paragraph> elements that are immediate children of the document node, a query that will succeed only if the root element of the document (the one that contains all other elements) happens to be a <paragraph> (and therefore immediately under the document node).

A double slash (//) is, in effect, shorthand for the descendant axis (strictly speaking, it abbreviates /descendant-or-self::node()/, which is why chapter//quote behaves like chapter/descendant::quote), so that chapter//quote would first find all of the <chapter> elements that are children of the current context and then find all of the <quote> elements anywhere within them, at any depth (children, children’s children, etc.). When used at the beginning of a path expression, e.g., //paragraph//quote, the double slash means that the path starts from the document node, at the top of the tree. The preceding XPath expression therefore means starting from the document node, find all descendant <paragraph> elements (= all <paragraph> elements anywhere in the document), and then find all <quote> elements anywhere inside those <paragraph> elements. This is one way to find all <quote> elements anywhere inside <paragraph> elements at any depth, while ignoring <quote> elements that are not inside <paragraph> elements.
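The difference between a single and a double slash can be sketched with ElementTree. Two assumptions apply: the sample document (with its <epigraph> and <note> elements) is invented, and ElementTree has no document node, so a path that would begin with // in XPath is written here with a leading dot, relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with quotes at several depths.
book = ET.fromstring(
    "<book>"
    "<epigraph><quote>outside any paragraph</quote></epigraph>"
    "<chapter>"
    "<paragraph>Text <quote>direct child</quote></paragraph>"
    "<paragraph><note><quote>nested deeper</quote></note></paragraph>"
    "</chapter>"
    "</book>"
)

# Single slashes: each step looks only at children.
child_quotes = [q.text for q in book.findall("chapter/paragraph/quote")]

# chapter//quote: quotes at any depth inside each child chapter.
deep_quotes = [q.text for q in book.findall("chapter//quote")]

# Counterpart of //paragraph//quote, written relative to the root element.
in_paragraphs = [q.text for q in book.findall(".//paragraph//quote")]

# Counterpart of //quote: every quote, inside a paragraph or not.
all_quotes = [q.text for q in book.findall(".//quote")]

print(child_quotes)   # ['direct child']
print(deep_quotes)    # ['direct child', 'nested deeper']
print(all_quotes)     # ['outside any paragraph', 'direct child', 'nested deeper']
```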

Attributes are not children and are not located on the child axis. Instead, they are located on their own attribute axis, which can be abbreviated with an at sign (@). The path expression paragraph/@language starts at the current context, finds all of the <paragraph> elements on the child axis, and then finds the @language attribute on each <paragraph> element. If a <paragraph> element doesn’t happen to have a @language attribute, nothing is added to the sequence for that particular <paragraph>. One can alternatively specify @language as attribute::language. Curiously, although attributes are not children (they are not located on the child axis), they do have parents, which are the elements to which they’re attached. This means that in the preceding example, although the @language attribute is not a child of the <paragraph> element (because attributes by definition are not children, they are located on the attribute axis, rather than the child axis), the <paragraph> element is nonetheless a parent of the attribute, and is found on the parent axis when the current context node is the attribute node itself. One can specify all of the attributes of the particular context node (which must be an element for this to make sense, since only elements can have attributes) with @*, so that paragraph/@* navigates to all of the <paragraph> elements that are children of the current context node and then to all of the attributes associated with each of them.

In addition to specifying elements and attributes by name, one can specify text nodes as text(), so that, for example, paragraph/text() navigates first to the <paragraph> elements that are children of the current context node and then to all of the text nodes that are their immediate children. Similarly, one can use the shorthand notation node() to refer to all types of nodes together. For example, paragraph/node() first finds all of the <paragraph> elements that are children of the current context and then all of the nodes of any type that are children of those <paragraph> nodes. Remember, though, that since no axis is specified explicitly before node(), the child axis is implied. This means that node() refers to elements and text nodes, but not attribute nodes, because attribute nodes are not found on the child axis.
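The difference between text(), node(), and attributes can be illustrated with Python’s xml.dom.minidom, which (unlike ElementTree) represents text as separate nodes. This is an approximation rather than XPath itself, and the sample paragraph is invented.

```python
from xml.dom import minidom

# Hypothetical paragraph with an attribute, text, and a child element.
doc = minidom.parseString(
    '<paragraph language="english">Some <quote>quoted</quote> text</paragraph>'
)
paragraph = doc.documentElement

# Rough counterpart of node() on the child axis: children of any type.
child_nodes = [n.nodeName for n in paragraph.childNodes]

# Rough counterpart of text() on the child axis: only the text-node children.
text_children = [
    n.data for n in paragraph.childNodes if n.nodeType == n.TEXT_NODE
]

# Attributes are stored separately and are not among the children.
attribute_names = list(paragraph.attributes.keys())

print(child_nodes)       # ['#text', 'quote', '#text']
print(text_children)     # ['Some ', ' text']
print(attribute_names)   # ['language']
```

The text inside <quote> is not among the paragraph’s own text children, because it is a child of the <quote> element instead.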

Predicates

Predicates are used to filter the results of path expressions. The sequences that are returned by path expressions have an inherent and stable order, which is called document order. In XPath, document order is defined as depth first, which means that when the system has to return nodes in order, it looks down before it looks right, it never looks up (except to resume where it left off), and it never looks left. Here’s an example:

[Tree diagram: a p element with sixteen descendant elements numbered in document order. The children of p are #1, #5, #8, #9, and #14. #1 contains #2, which contains #3 and #4; #5 contains #6, which contains #7; #8 has no children; #9 contains #10 (which contains #11 and #12) and #13; #14 contains #15, which contains #16.]

Suppose we use the path expression p//* to find all of the elements of any type (thus the asterisk) anywhere (thus the double slash, which means descendant axis) inside a <p> element (that is a child of the current context). The preceding example shows one such <p> element with all of its descendant elements numbered in document order. Their type is not specified because this particular path expression is looking at all descendant elements, without checking their type.

Because XPath document order is depth first, the processor looks down and to the left and finds the first element to add to the sequence to be returned, which is #1. But what should the second element in the sequence be? In a depth first system, like XPath, before the processor looks for the siblings of #1, it looks to see whether #1 has any children, and if so, it goes there first, so the next element it retrieves is #2. Since #2, in turn, has children, the system then gets #3. Because #3 doesn’t have children, the system then looks to the right, where it finds #4. At that point it has hit a dead end, with no children and no following siblings. It therefore backs up to the most recent place where it turned down, which is #2. Since the system has already visited the children of #2 (#3 and #4) and #2 doesn’t have any following siblings, it backs up again, this time to #1. It has already visited its children (it has only one child, #2), so it looks to its following siblings and finds #5. Before it continues scanning other siblings, though, it notices that #5 has a child, #6, so it heads there next, etc., traversing the tree according to the numbering above.

The procedure for a depth-first traversal of the tree can be illustrated by the following flow chart:

[Flow chart of a depth-first traversal: Start leads to the decision Get new child?; on success that decision repeats, and on failure it leads to Get following sibling?; Get following sibling? leads back to Get new child? on success and on to Go up to parent? on failure; Go up to parent? leads back to Get new child? on success and to End on failure.]

If you’re not familiar with flow charts, the conventions used here are:

If you follow the full sequence (in either the flow chart above or the numbered node diagram above that), you’ll see that the algorithm is that you collect nodes as you visit them and add them to the sequence you’re collecting, but you don’t add any node more than once. The charts show the order for visiting nodes in a depth-first traversal:

  1. Try to visit any children of the current context node that you haven’t visited yet, starting with the leftmost. On the flow chart, this is the decision step labeled Get new child? If that attempted visit succeeds, you add the node to the sequence you’re building and it becomes the new context node, so now try to visit its children. This is shown in the flow chart as looping on success (that is, if you find a child, you then look at its children).
  2. If your attempt to find an unvisited child fails, look right to see whether there are any following siblings. If there are, visit the closest one, which becomes the new current context node, and then start looking at its children, following the steps of this procedure.
  3. If there are no unvisited children (step #1 in the flow chart fails) and no following siblings (step #2 fails), try to back up the tree to the parent to see whether it has any unvisited children. If so, visit them, following this procedure. If not (that is, if step #1 fails after you’ve backed up), check for siblings of the parent (step #2). Whenever checking for siblings fails, keep backing up. If you back all the way up to the document node (which doesn’t have a parent) and there’s nobody left to visit (that is, when step #3 fails), you’re done.
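The procedure above can be sketched as a short recursive function. The tree below is an assumption: it is built so that its elements, given the hypothetical names n1 through n16, should come out in numerical order if the traversal is depth first. Recursion takes care of the backing-up steps, since returning from a call resumes the loop over the parent’s remaining children.

```python
import xml.etree.ElementTree as ET

# Assumed tree; the names n1..n16 are hypothetical labels chosen so that
# a depth-first walk visits them in numerical order.
p = ET.fromstring(
    "<p>"
    "<n1><n2><n3/><n4/></n2></n1>"
    "<n5><n6><n7/></n6></n5>"
    "<n8/>"
    "<n9><n10><n11/><n12/></n10><n13/></n9>"
    "<n14><n15><n16/></n15></n14>"
    "</p>"
)

def depth_first(node, visited):
    """Collect descendant element names, children before following siblings."""
    for child in node:                # leftmost unvisited child first
        visited.append(child.tag)     # collect the node when it is visited
        depth_first(child, visited)   # look down before looking right
    return visited

order = depth_first(p, [])
print(order[:5])   # ['n1', 'n2', 'n3', 'n4', 'n5']
```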

In general it’s best to think of XML in terms of nodes on a tree (element, attribute, and text nodes), and not as a stream of characters with tags thrown in, but some users find the tag perspective helpful when considering document order, especially in the case of the preceding and following axes. From the perspective of tags:

I’ve spent a lot of time discussing depth-first order because it can be used to filter a sequence by position. Suppose you want to format the first paragraph of each chapter specially, perhaps with a drop cap, or by suppressing the indentation that you apply to all other paragraphs. One way to do this is to mark up that paragraph differently from the others, but that’s a fragile solution, since if you decide to rearrange the text and the paragraph is no longer first, you have to change the markup in addition to moving it. On the web you can apply special formatting to the first paragraph using Cascading Style Sheets (CSS), but not all publication is on the web. In XPath, though, you can specify the first paragraph of each chapter by using a path expression like //chapter/paragraph[1]. This says: First start at the document node, at the very top of the document, and find all descendant <chapter> elements, that is, all <chapter> elements anywhere in the document. Then, for each of them, find all of its child <paragraph> elements and select only the first one. There’s nothing magic about first, although it’s typically the most useful in real projects. If what you care about is the third paragraph of each chapter, //chapter/paragraph[3] will retrieve that. Two other important details about numerical predicates are that:

Predicates are expressed by putting them in square brackets after the step in the path expression to which they apply, and it doesn’t have to be the last step. For example, //chapter[1]/paragraph[2] finds all of the <chapter> elements anywhere in the document and keeps each one that is the first <chapter> child of its parent (in a document where all of the chapters are siblings, that is just the first chapter), and it then gets all of the <paragraph> elements that are children of each such <chapter> and keeps just the second of them.
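Numerical predicates are among the few predicates that ElementTree supports, so they can be tried directly. The sample document is invented, and the leading dot replaces the leading slash because ElementTree paths are relative to an element rather than to a document node.

```python
import xml.etree.ElementTree as ET

# Hypothetical book: three paragraphs in chapter 1, two in chapter 2.
book = ET.fromstring(
    "<book>"
    "<chapter>"
    "<paragraph>1.1</paragraph><paragraph>1.2</paragraph><paragraph>1.3</paragraph>"
    "</chapter>"
    "<chapter>"
    "<paragraph>2.1</paragraph><paragraph>2.2</paragraph>"
    "</chapter>"
    "</book>"
)

# Counterpart of //chapter/paragraph[1]: the first paragraph of each chapter.
first = [p.text for p in book.findall(".//chapter/paragraph[1]")]

# Counterpart of //chapter/paragraph[3]: chapters without a third
# paragraph simply contribute nothing.
third = [p.text for p in book.findall(".//chapter/paragraph[3]")]

# Counterpart of //chapter[1]/paragraph[2]: predicates on two steps.
second_of_first = [p.text for p in book.findall(".//chapter[1]/paragraph[2]")]

print(first)             # ['1.1', '2.1']
print(third)             # ['1.3']
print(second_of_first)   # ['1.2']
```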

Any expression in square brackets that filters a step in a path expression is a predicate. Numerical predicates are the easiest to understand, since they test simply for the location of an element in a sequence according to what gets returned by a depth-first traversal of the tree. More complex predicates use functions, described in the next section.

Functions

Functions operate on the information returned by a path expression or another function. For example, the path expression chapter/paragraph finds all of the <chapter> children of the current context and uses them to find all of their <paragraph> children. If you don’t need the actual paragraphs, and you just want to count them, you can use the count() function, so that chapter/count(paragraph) means that once you’ve made it to the chapter, you should return not the paragraphs themselves, but just a count of them. This XPath expression will return a sequence of number values, giving the count of the number of paragraphs in each chapter (that is, a count of the number of <paragraph> elements inside each <chapter> element). Note that this expression is different from count(chapter/paragraph). The latter expression returns only one number because it defers counting until it has retrieved all of the paragraphs inside all of the chapter elements that are children of the current context. The first expression, on the other hand, counts separately inside each chapter. The difference is that the two use the count() function at different steps in the path expression. There are two steps (find the chapters, and then for each chapter find the paragraphs), and one can count at either point.
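The difference between counting at the last step and counting the whole result can be imitated in Python. ElementTree has no count() function, so the built-in len() stands in for it here, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical book: three paragraphs in one chapter, two in the other.
book = ET.fromstring(
    "<book>"
    "<chapter><paragraph/><paragraph/><paragraph/></chapter>"
    "<chapter><paragraph/><paragraph/></chapter>"
    "</book>"
)

# Like chapter/count(paragraph): one number per chapter.
per_chapter = [len(c.findall("paragraph")) for c in book.findall("chapter")]

# Like count(chapter/paragraph): one number for all chapters together.
total = len(book.findall("chapter/paragraph"))

print(per_chapter)   # [3, 2]
print(total)         # 5
```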

XPath has a little more than one hundred functions, but in practical projects you’ll rarely need more than a couple of dozen, which you’ll learn quickly as you start using them. Don’t try to memorize them all, but do read over the full list periodically, just to remind yourself of what’s available, so that you can look it up as needed. There are organized lists of all of the XPath functions at http://www.w3schools.com/xpath/xpath_functions.asp and detailed discussion with examples in Michael Kay’s book.

Functions can be nested. For example, there is a string-manipulation function to convert all text to lower case and a different function to normalize the white space (spaces, tabs, new lines, etc.) in text (the rule converts all white space to plain space characters, reduces all sequences of white-space characters to single spaces, and removes all leading and trailing white space). If you want to retrieve a set of values and perform both of these functions, you can nest them: normalize-space(lower-case(.)). This means take the current context node (represented by the dot), convert any text in it to lower case, and then take the output of the lower-case() function and normalize the white space in it.
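Nesting works the same way as function composition in most programming languages. The two Python helpers below are hypothetical stand-ins that only approximate lower-case() and normalize-space() (Python’s split() recognizes a few more white-space characters than XPath does), but they show how the inner function’s output becomes the outer function’s input.

```python
def lower_case(text):
    """Approximate stand-in for the XPath lower-case() function."""
    return text.lower()

def normalize_space(text):
    """Approximate stand-in for normalize-space(): trim the ends and
    collapse each run of white space into a single space."""
    return " ".join(text.split())

raw = "  The  QUICK\n\tbrown   Fox  "

# The inner function runs first; its result is passed to the outer one.
result = normalize_space(lower_case(raw))
print(result)   # the quick brown fox
```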

You can use functions in predicates to filter expressions. For example, if you want to retrieve all of the chapters that consist of just a single paragraph (perhaps as part of proof-reading; if they consist of a single paragraph, perhaps they shouldn’t have been independent chapters in the first place), you can do that with //chapter[count(paragraph) eq 1]. This says first find all of the <chapter> elements and then filter them by saving only the ones where the number of <paragraph> elements they contain is equal to 1. Note that the <paragraph> elements in question are on the child axis because that’s what’s implied whenever no axis is specified.
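A rough Python counterpart of //chapter[count(paragraph) eq 1] follows. ElementTree cannot evaluate function predicates, so the filter is written as an ordinary condition. The sample document, its id attributes, and the <section> element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical book; chapter 'three' has one child paragraph and one
# paragraph nested more deeply inside a section.
book = ET.fromstring(
    "<book>"
    "<chapter id='one'><paragraph/><paragraph/></chapter>"
    "<chapter id='two'><paragraph/></chapter>"
    "<chapter id='three'><section><paragraph/></section><paragraph/></chapter>"
    "</book>"
)

# Keep only the chapters that have exactly one child paragraph.
single = [
    c.get("id")
    for c in book.iter("chapter")
    if len(c.findall("paragraph")) == 1
]
print(single)   # ['two', 'three']
```

Chapter three passes the filter because only paragraphs on the child axis are counted; the one inside the section is a descendant but not a child.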

You can also apply sequential predicates. Suppose you want to find all first paragraphs of chapters that contain more than a hundred characters. XPath provides a string-length() function that returns the length of text by counting characters. When it does this, it operates on the string value of the element, which is the concatenation of all textual characters anywhere inside it, no matter how deeply they may be nested. In other words, if a paragraph contains a mixture of plain text and, say, <quote> elements, the string-length() function, when applied to that paragraph, will count equally the textual characters directly inside the <paragraph> element and those inside the <quote> elements that may be inside the <paragraph> element. The XPath to specify all first paragraphs of chapters that contain more than a hundred characters is //chapter/paragraph[1][string-length(.) gt 100]. This says find all of the <chapter> elements anywhere in the document, then find their child <paragraph> elements and select only the first ones. Then filter those by selecting only the ones whose string length is greater than 100, that is, that contain more than 100 characters. The dot in the string-length() function here refers to the current context node, which became a <paragraph> element at the step of the path expression that specified paragraph.
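The two filters can be applied one after the other in Python as well. In this sketch the positional filter uses ElementTree’s own [1] predicate, and the length filter is an ordinary condition on the string value, computed by a hypothetical helper named string_value. The sample document and its id attributes are invented.

```python
import xml.etree.ElementTree as ET

long_text = "word " * 15   # 75 characters

# Hypothetical book: paragraph a1 is long only if nested text is counted,
# b1 is short, and b2 is long but is not a first paragraph.
book = ET.fromstring(
    "<book>"
    f"<chapter><paragraph id='a1'>{long_text}<quote>{long_text}</quote></paragraph></chapter>"
    "<chapter>"
    "<paragraph id='b1'>Short.</paragraph>"
    f"<paragraph id='b2'>{long_text}{long_text}</paragraph>"
    "</chapter>"
    "</book>"
)

def string_value(element):
    """All text inside the element, at any depth, joined together."""
    return "".join(element.itertext())

# First filter: the first paragraph of each chapter.
first_paragraphs = book.findall(".//chapter/paragraph[1]")

# Second filter: keep those whose string value exceeds 100 characters.
long_first = [p.get("id") for p in first_paragraphs if len(string_value(p)) > 100]

print([p.get("id") for p in first_paragraphs])   # ['a1', 'b1']
print(long_first)                                # ['a1']
```

Paragraph a1 qualifies only because the text inside its <quote> is part of its string value; the text directly inside the paragraph is just 75 characters long.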

Review of terms and symbols

The preceding survey of XPath has introduced a lot of new terms. For review purposes, the ones you should remember (or, at least, recognize when you see them again) are:

Term Definition
axis Path direction and scope, e.g., ancestor, preceding-sibling.
depth-first order See document order, below.
document node The node that serves as the parent of the top-level, or root, element. The document node is the only node of any type on an XML tree that does not have a parent node.
document order XPath traverses the tree in depth-first order, which means that it visits nodes in order and looks at a node’s children before it looks at its following siblings.
function Operation that can be performed on the result of a path expression, e.g., counting the number of nodes and returning just the count instead of the nodes themselves.
node Part of an XML document. The most important types of nodes are element, attribute, and text().
path expression The way to reach the nodes you care about. A path expression may have multiple steps, separated by slash characters.
predicate A filter applied to the results of a path expression, specified in square brackets.
root element The element that contains the entire document. The root element is actually the child of the document node.
sequence An ordered collection of pieces of information. One example of a sequence is the nodes singled out from the tree in document order by an XPath expression.

See also the table of axes, above. The shorthand axis notation is:

Symbol Meaning Expanded version
. current context node self::node() (the same node as self::* when the context node is an element)
.. parent node parent::node()
// descendant axis /descendant-or-self::node()/ (in effect, descendant::). At the beginning of a path expression, it means that the path starts at the document node.
@ attribute axis attribute::

Slash (/) indicates a step in a path expression. At the beginning of a path expression, it represents the document node.