Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-10-20T23:51:54+0000


XPath assignment #2 answers

The task and possible solutions

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a markdown file and upload it to Canvas as an attachment. Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Most (not all) speeches in Hamlet contain mostly metrical line (<l>) and anonymous block (<ab>) elements (an anonymous block is the TEI element that the tagger used to represent a non-metrical speech line). Speeches also typically contain <speaker> elements, and may also contain stage directions (<stage>). We have deliberately left out at least one other type of subelement found in speeches. Based on this understanding:
    1. What XPath would find all of the speeches that do not contain any metrical lines as immediate children? How many are there?

      //sp[not(l)]

      For this answer, we use the predicate [not(l)] to check whether there are <l> elements on the child axis of all of the speeches we find. We keep only speeches that do not contain any line elements, throwing away any speeches that do contain those elements. There are 451 speech elements without line children. Some of you specified //sp[not(child::l)]. This is syntactically valid XPath, but it isn’t consistent with Best Practice; most coders wouldn’t mention the child axis explicitly because it’s the default.

    2. What XPath would find all of the speeches that do not contain either any metrical lines (<l>) or any anonymous blocks (<ab>)? How many are there? What do they contain instead? (As in #2d in XPath assignment #1, you haven’t yet learned the XPath to return a list of the types of elements they do contain, but if you find them all, you can scan the brief list that <oXygen/> returns, click on each one to see it in context, and see what’s going on.)

      //sp[not(l|ab)] or //sp[not(l or ab)] or //sp[not(l)][not(ab)] or //sp[not(l) and not(ab)] or //sp except (//sp[l]|//sp[ab])

      These path expressions all return speeches that do not have either <l> or <ab> elements as immediate children. There are seven of these elements. When you navigate to these speeches in the list that <oXygen/> returns, you find that the other child elements of speeches that we concealed from you are line groups, or <lg>. You can get those child elements by adding an asterisk as the next path step, meaning find all of the child elements of these <sp> elements, whatever they might be: //sp[not(l|ab)]/* or //sp[not(l or ab)]/* or //sp[not(l)][not(ab)]/* or //sp[not(l) and not(ab)]/*. You can get the element names, instead of the elements themselves, by adding the name() function as a path step after the asterisk: //sp[not(l|ab)]/*/name() or //sp[not(l or ab)]/*/name() or //sp[not(l)][not(ab)]/*/name() or //sp[not(l) and not(ab)]/*/name(). You can get rid of the duplicates by wrapping the entire expression in the distinct-values() function, e.g., distinct-values(//sp[not(l|ab)]/*/name()), etc.

      //sp[not(l|ab)] uses the union operator. l|ab matches the union of all lines and anonymous blocks, that is, everything that is either a line or an anonymous block (and some people refer to the union connector as the or connector). The path //sp collects all of the speeches, and the predicate then checks on the child axis (the default, since no axis is specified) and retains only those speeches that do not have either lines or anonymous blocks as immediate children.

      //sp[not(l or ab)] uses the or operator. Without the not() function, it would keep all <sp> elements that have either <l> or <ab> children. The not() function inverts the test, so it keeps only the <sp> elements that have neither type of child. Some people find the two senses of or confusing: the union operator (|) finds elements that are either one thing or another, while the keyword or is used in complex predicates to filter a sequence according to one condition or another.

      //sp[not(l)][not(ab)] uses two predicates. It first collects all of the speeches, after which the first predicate keeps only those that don’t have line elements as children. The second predicate then filters those still further, keeping only the ones that also don’t have anonymous blocks as immediate children. You can reverse the order of the predicates in this case (although there are other situations where changing the order of predicates leads to different results).

      //sp[not(l) and not(ab)] uses a compound predicate. It collects all of the speeches and then filters them all at once, keeping only those that both do not have any lines as children and do not have any anonymous blocks as children. The order of the two parts on either side of the and operator doesn’t matter.

      //sp except (//sp[l]|//sp[ab]) uses except to specify the difference of sets of nodes. See Kay pp. 628–31 for discussion. Our pattern says return all <sp> elements, but exclude from that return the union of all <sp> elements that have <l> children and all <sp> elements that have <ab> children.

      All five of these XPaths yield the same sequence of seven elements, and in all cases that use the and, or, or union (|) operators, you may change the order of the parts without changing the results. For example not(l or ab) returns the same nodes as not(ab or l).
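The equivalence of these five formulations can be checked outside <oXygen/> with a small sketch. It is not part of the assignment, and it rests on several assumptions: it uses Python's standard xml.etree.ElementTree module, which supports only a small subset of XPath (no not(), no except), so each filter is emulated in plain Python; the SAMPLE document is a hypothetical miniature, not the Bad Hamlet file; and has_child() is a helper name invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, NOT the real Bad Hamlet file.
SAMPLE = """
<play>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>A</l><l>B</l></sp>
  <sp who="Ophelia"><speaker>Oph.</speaker><ab>C</ab></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><lg><l>D</l><l>E</l></lg></sp>
  <sp who="Horatio"><speaker>Hor.</speaker><l>F</l><ab>G</ab></sp>
</play>
"""
root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # all <sp> elements in document order, like //sp


def has_child(sp, tag):
    """Hypothetical helper: True when sp has a child element named tag, like the predicate [tag]."""
    return sp.find(tag) is not None


# //sp[not(l or ab)]
not_or = [sp for sp in speeches if not (has_child(sp, "l") or has_child(sp, "ab"))]

# //sp[not(l) and not(ab)]
and_nots = [sp for sp in speeches if not has_child(sp, "l") and not has_child(sp, "ab")]

# //sp[not(l)][not(ab)]: two filters applied one after the other
without_l = [sp for sp in speeches if not has_child(sp, "l")]
chained = [sp for sp in without_l if not has_child(sp, "ab")]

# //sp except (//sp[l]|//sp[ab]): remove the union from the full sequence
excluded = [sp for sp in speeches if has_child(sp, "l")]
excluded += [sp for sp in speeches if has_child(sp, "ab")]
except_form = [sp for sp in speeches if all(sp is not e for e in excluded)]

# distinct-values(//sp[not(l|ab)]/*/name()): child element names without duplicates
child_names = []
for sp in not_or:
    for child in sp:
        if child.tag not in child_names:
            child_names.append(child.tag)

print(len(not_or), child_names)  # prints: 1 ['speaker', 'lg']
```

In this sample only the third speech survives every version of the filter, and its children are a <speaker> and an <lg>, which mirrors the kind of result described above for the real file.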

  2. Explain why the following four XPath expressions return different results, and describe in prose what each of them does return, and why:
    1. //sp[@who="Hamlet"]/l[1]
    2. /descendant::sp[@who="Hamlet"]/l[1]
    3. (//sp[@who="Hamlet"]/l)[1]
    4. (/descendant::sp[@who="Hamlet"]/l)[1]

    These four XPath expressions return different results because of the scoping effect of parentheses, that is, because the parentheses determine the sequence of items to which the system applies the predicate [1] to select only the first item. The expressions also differ in how they specify the descendant axis, but here the shorthand // functions the same way as the long form /descendant::, so that difference does not change the results: 2a and 2b return the same nodes, as do 2c and 2d. The predicate in expressions 2a and 2b filters elements in a different context than in expressions 2c and 2d because the parentheses alter the current context. Here are the details:

    2a (//sp[@who="Hamlet"]/l[1]) returns the first line of every speech by Hamlet. XPath expressions operate one step at a time, and the sequence returned by each step becomes the sequence of context nodes from which the next step proceeds. This means that when the first step returns a sequence of all of Hamlet’s speeches, each speech, one by one, becomes the context for returning the lines in a single speech, which are then filtered by the predicate to keep only the first line. The entire expression returns, then, a sequence of all of the first lines (<l> elements) of all of Hamlet’s speeches.

    Not all of Hamlet’s speeches contain <l> child elements, yet when we ask for the first <l> child element of each speech we don’t raise any errors. This fact illustrates another perhaps surprising feature of XPath: asking for something that does not exist is not an error, so if a speech has no <l> children we get no result (an empty sequence) for that speech. When we ask for //sp[@who="Hamlet"]/l[1] we get 159 first child <l> elements, but when we ask for //sp[@who="Hamlet"], we get 357 speeches. This means that 198 speeches don’t have a first <l> child element, and that means that they don’t have any <l> child elements. We can verify that with //sp[@who="Hamlet"][not(l)], which asks explicitly for all speeches by Hamlet that contain no <l> child elements, and which returns 198 results. We can find out what types of elements those speeches do have with:

    //sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]
    ! name()
    => distinct-values()

    This finds all speeches by Hamlet (//sp[@who="Hamlet"]) and filters them to keep only the ones that do not have any <l> children (//sp[@who="Hamlet"][not(l)]). We then find all of the child elements that those speeches do have (//sp[@who="Hamlet"][not(l)]/*) and filter those children to exclude the speaker name and any stage directions (//sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]), since those don’t hold spoken text, and what we’re looking for is the elements other than <l> that hold spoken text. We then use the name() function to get the name of each of those elements, using the ! simple map operator (see the solution to our XPath exercise #1 for an explanation). Since we don’t need the names of the element types to be repeated, we employ the distinct-values() function to remove the duplicates, using the => arrow operator (see the solution to our XPath exercise #1 for an explanation of this, too). This tells us that speeches by Hamlet that don’t have any <l> children have, instead, <ab> (anonymous block, which in this play is used for non-metrical lines) and <lg> (line group) child elements.

    2b (/descendant::sp[@who="Hamlet"]/l[1]) starts by looking for <sp> on the descendant axis from the document node (the top of the tree), and then finds all of the child line elements of those speeches. As in 2a, the predicate [1] modifies the immediately preceding step of the expression. This means that the expression returns the first line of each of Hamlet's speeches, the same result as 2a.

    2c ((//sp[@who="Hamlet"]/l)[1]) also begins by finding all of the speeches by Hamlet, and then all of the child line elements of those speeches. At that point the full expression is wrapped in parentheses, making the current context a single, flattened sequence of all line elements in Hamlet’s speeches. The predicate [1] then applies to that entire flattened sequence, so it filters the result by asking for the very first line in the entire sequence of all lines spoken by Hamlet. As a result 2c returns only one line.

    2d ((/descendant::sp[@who="Hamlet"]/l)[1]), like the others, starts by looking for <sp> elements on the descendant axis from the document node and filtering them to keep only the ones by Hamlet. As in 2c, the numerical predicate is applied to the entire expression, and not to each individual speech, since the predicate applies to everything wrapped in parentheses, that is, to all of Hamlet’s lines as a single sequence. For that reason, it keeps only the first line that is spoken by Hamlet and returns only one line, the same result as 2c.

    The point is that a predicate applies to the current context, which is defined as the immediately preceding sequence. In 2a and 2b, it’s each sequence of lines in each speech, separately, so the predicate applies once for every speech, and returns the first line of every speech. In 2c and 2d, the parentheses cause the entire preceding expression to function as a single sequence, so the predicate applies only once, to the continuous sequence of all lines spoken by Hamlet, and therefore returns only one line, the first line spoken by Hamlet in the entire play.
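The difference between the two predicate scopes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used in the course: it relies on the limited XPath subset in Python's standard xml.etree.ElementTree module, which supports the positional predicate [1] on a step but not parenthesized expressions, so the parenthesized form is emulated by taking the first item of the flattened list. The SAMPLE document is a hypothetical miniature, not the Bad Hamlet file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, NOT the real Bad Hamlet file.
SAMPLE = """
<play>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>A</l><l>B</l></sp>
  <sp who="Ophelia"><speaker>Oph.</speaker><l>C</l></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><ab>D</ab></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>E</l><l>F</l></sp>
</play>
"""
root = ET.fromstring(SAMPLE)

# Like 2a and 2b: [1] is part of the last step, so it is applied
# separately to the <l> children of each speech by Hamlet.
first_of_each = root.findall(".//sp[@who='Hamlet']/l[1]")

# Like 2c and 2d: collect every <l> child of every speech by Hamlet
# into one flat sequence first, then keep only the first item.
all_lines = root.findall(".//sp[@who='Hamlet']/l")
first_overall = all_lines[:1]

print([line.text for line in first_of_each])  # prints: ['A', 'E']
print([line.text for line in first_overall])  # prints: ['A']
```

The speech that contains only an <ab> contributes nothing to either result and raises no error, which matches the behavior described for speeches that have no <l> children.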


What does // mean?

Most of the time we can pretend it means descendant::

Most of the time you don’t need to think about the details of how // doesn’t really mean descendant axis, even though it seems to behave as if it did. But if you run into one of the places where // doesn’t behave the same way as referring to the descendant axis explicitly, here’s why.

In most places // functions the same way as descendant::, so we often think of it as shorthand for it, just as @ is shorthand for attribute::, that is, for referring to nodes on the attribute axis. But // isn’t exactly synonymous with descendant::; what it actually means is descendant-or-self::node()/. Here’s why that matters, and why that perhaps confusing path step is useful.

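The descendant-or-self:: axis means what you think it means: it looks for specified nodes on the descendant axis, but it also looks at the current context node, and not only at its descendants. The expressions //sp and /descendant::sp return the same results because //sp expands to /descendant-or-self::node()/sp, which finds the <sp> children of the document node and of every node below it. Every <sp> element in the document is a child of one of those nodes, so this reaches exactly the same elements as looking for <sp> on the descendant axis from the document node.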

Why is it useful that // doesn’t mean the same thing as descendant::?

If you want to find all @who attributes in the document, you can do that with //@who. But if you try to write /descendant::@who you’ll raise an error because you’re trying to look on two axes at once. Attributes are never on the descendant axis because they are never on any of the directional axes (parent, child, etc.); they are only on the attribute axis. But because //@who really means /descendant-or-self::node()/@who, it looks at the document node and at all of the nodes on its descendant axis (none of which are attributes themselves) and then looks for @who attributes on the attribute axis from those nodes. Since it is looking at all descendant nodes of the document node, it winds up finding all @who attributes, no matter what node they belong to.
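The two-stage behavior of //@who (visit every node, then look on the attribute axis from each one) can be sketched in Python. This is a rough emulation under stated assumptions: Python's standard xml.etree.ElementTree module cannot select attributes with a path expression, so the attribute step is written as an ordinary lookup, and the tree it builds starts at the root element rather than at a separate document node. The SAMPLE document is a hypothetical miniature, not the Bad Hamlet file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, NOT the real Bad Hamlet file.
SAMPLE = """
<play>
  <sp who="Hamlet"><l>A</l></sp>
  <sp who="Ophelia"><l>B</l><stage who="Hamlet">Exit</stage></sp>
</play>
"""
root = ET.fromstring(SAMPLE)

# Stage 1: root.iter() yields the root element itself and then every element
# below it, in document order (roughly descendant-or-self, elements only).
# Stage 2: from each of those elements, look for a who attribute.
who_values = [el.get("who") for el in root.iter() if "who" in el.attrib]

print(who_values)  # prints: ['Hamlet', 'Ophelia', 'Hamlet']
```

The walk in stage 1 never yields an attribute, only elements; the attributes appear only because stage 2 asks each element for them, and it finds them regardless of which element type carries them.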

What are the pitfalls of using // as a synonym for descendant::?

Two common types of errors come from treating // as a synonym for descendant::, as follows:

  1. Suppose you want to find all of the speeches (<sp>) that have child <l> elements. If you write //sp[l] you’ll get the right results: your path step starts at the document node, eventually finds its way to all of the <sp> elements in the document, and uses the predicate to test each one to see whether it has any <l> children. This works because every path step is on an axis and the child axis is the default when no other axis is specified explicitly, so //sp[l] is synonymous with //sp[child::l].

    We can’t, though, find all <sp> elements that have <l> descendants with //sp[//l]. If you try this, you’ll select all 1137 speeches in the play, both those that have <l> descendants and those that don’t. The reason is that the predicate begins with a slash and a slash always means start at the document node, so instead of looking for <l> descendants of the current context (each <sp> in turn), your predicate is reporting true if there are any <l> element descendants of the document node. Since there always are, the predicate always returns true, so the test always succeeds and we wind up not filtering anything.

    We can fix this in two ways:

    1. Specify the descendant axis for real with //sp[descendant::l]. This predicate expression does not begin with a slash, so it looks for <l> descendants only of the current context node, which is each <sp> in turn.

    2. Start the predicate with a dot, which in an XPath context means current context node: //sp[.//l]. The predicate expression no longer begins with a slash, so it doesn’t start from the document node; it starts from the current context node (represented by the dot) and looks down from there.

    In our own work we favor the first solution because we find it easier to understand, but the two will return the same result. They aren’t exactly equivalent for the reason described above (we can specify an attribute on the attribute axis right after a double slash), but in practice we almost never want to do that anyway.

  2. Because XPath expressions proceed step by step, where the sequence selected at each step becomes a sequence of context nodes for the next step, //sp//l[1] doesn’t select all <l> descendants of all <sp> elements and then return the first one. Because // really means /descendant-or-self::node()/, the final step l[1] looks on the child axis from each <sp>, and from each node below each <sp>, one context node at a time, and it applies the predicate to those individual sequences of sibling lines. What you are asking for, then, is every <l> that is the first <l> child of its parent inside a speech (one for each speech that has <l> children, and one for each line group that has them), instead of the first <l> element that is a descendant of a speech (one result for the entire play). Something similar is true of //sp/descendant::l[1]; here you are selecting the descendant lines of each speech, one speech at a time, and filtering to keep just the first of each of those speech-specific sequences, so you get one result per speech rather than one per parent.

    You can work around this limitation by using parentheses to fuse all of the subsequences into one long sequence before applying the predicate: (//sp//l)[1] or (//sp/descendant::l)[1] will both return the first <l> element in the play that is a descendant of an <sp> element.
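Both pitfalls above can be reproduced on a small document with a Python sketch. It is an illustration under stated assumptions rather than a definitive implementation: Python's standard xml.etree.ElementTree module does not support path expressions inside predicates or parenthesized expressions, so the predicates and the parentheses are emulated in plain Python. The SAMPLE document is a hypothetical miniature, not the Bad Hamlet file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, NOT the real Bad Hamlet file.
SAMPLE = """
<play>
  <sp who="Hamlet"><l>A</l><l>B</l></sp>
  <sp who="Ophelia"><ab>C</ab></sp>
  <sp who="Hamlet"><lg><l>D</l><l>E</l></lg><lg><l>F</l></lg></sp>
</play>
"""
root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # all <sp> elements in document order, like //sp

# Pitfall 1, like //sp[//l]: the test restarts from the top of the document,
# so it gives the same answer for every speech and filters nothing.
absolute = [sp for sp in speeches if root.find(".//l") is not None]

# The fix, like //sp[descendant::l] or //sp[.//l]: the test starts from
# each speech in turn and looks down only from there.
relative = [sp for sp in speeches if sp.find(".//l") is not None]

# Pitfall 2, like //sp/descendant::l[1]: the first descendant <l> of each
# speech, one speech at a time (find() returns the first match or None).
first_per_speech = [sp.find(".//l") for sp in speeches if sp.find(".//l") is not None]

# The fix, like (//sp/descendant::l)[1]: fuse the per-speech sequences into
# one long sequence first, then keep only its first item.
all_lines = [line for sp in speeches for line in sp.iter("l")]
first_overall = all_lines[:1]

print(len(absolute), len(relative))              # prints: 3 2
print([line.text for line in first_per_speech])  # prints: ['A', 'D']
print([line.text for line in first_overall])     # prints: ['A']
```

In this sample the absolute test keeps all three speeches, including the one that contains only an <ab>, while the relative test keeps only the two that actually contain lines.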