Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-03-03T20:22:01+0000


XPath assignment #2 answers

The task and possible solutions

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.

Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. As always, you are encouraged to ask questions in the #xpath channel in Slack, but because you want to make progress in learning to debug your own code, your questions should tell us what you tried, what you expected, exactly what you got instead (not just didn’t work or got an error), and what you think the source of the problem is. Sometimes writing that sort of request for advice will help you figure out what’s wrong on your own (see Rubber duck debugging), and even when it doesn’t, it will help us identify the difficult moments.

These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Most (not all) speeches in Hamlet contain mostly metrical line (<l>) and anonymous block (<ab>) elements (an anonymous block is the TEI element that the tagger used to represent a non-metrical speech line). Speeches also typically contain <speaker> elements, and may also contain stage directions (<stage>). We have deliberately left out at least one other type of subelement found in speeches. Based on this understanding:
    1. What XPath would find all of the speeches that do not contain any metrical lines as immediate children? How many are there?

      //sp[not(l)]

      For this answer, we use the predicate [not(l)] to check whether there are <l> elements on the child axis of all of the speeches we find. We keep only speeches that do not contain any line elements, throwing away any speeches that do contain those elements. There are 451 speech elements without line children. Some of you specified //sp[not(child::l)]. This is syntactically valid XPath, but it isn’t consistent with Best Practice; most coders wouldn’t mention the child axis explicitly because it’s the default.

    2. What XPath would find all of the speeches that do not contain either any metrical lines (<l>) or any anonymous blocks (<ab>)? How many are there? What do they contain instead?

      From anywhere in the document (because a path expression beginning with a slash will always start at the document node regardless of your current context location) you can use any of the following:

      • //sp[not(l|ab)]

      • //sp[not(l or ab)]

      • //sp[not(l)][not(ab)]

      • //sp[not(l) and not(ab)]

      • //sp except (//sp[l]|//sp[ab])

      These path expressions all return speeches that do not have either <l> or <ab> elements as immediate children. There are seven of these elements. When you navigate to these speeches in the list that <oXygen/> returns, you find that the other child elements of speeches that we concealed from you are line groups, or <lg>. You can get those child elements by adding an asterisk as the next path step, meaning find all of the child elements of these <sp> elements, whatever they might be: //sp[not(l or ab)]/* (similarly with the other options above). You can get the element names, instead of the elements themselves, by using the simple map operator to apply the name() function to the elements returned by the path expression: //sp[not(l or ab)]/* ! name(.) (likewise for the other expressions above). You can get rid of the duplicates by using the arrow operator to apply the distinct-values() function: //sp[not(l|ab)]/* ! name(.) => distinct-values() (likewise for the other expressions above).

      There are alternatives that don’t use the simple map operator or the arrow operator. For example, distinct-values(//sp[not(l|ab)]/*/name()) returns the same results as the last example above. We recommend using the simple map and arrow operators where appropriate because the distinctive operators and the left-to-right application make the code easier to understand.

      //sp[not(l|ab)] uses the union operator. l|ab matches the union of all lines and anonymous blocks, that is, everything that is either a line or an anonymous block (and some people refer to the union connector as the or connector). The path //sp collects all of the speeches, and the predicate then checks on the child axis (the default, since no axis is specified) and retains only those speeches that do not have either lines or anonymous blocks as immediate children.

      //sp[not(l or ab)] uses the or operator. Without the not() function, it would keep all <sp> elements that have either <l> or <ab> children. The not() function inverts the test, so it keeps only the <sp> elements that have neither type of child. Some people find the two senses of or confusing: the union operator (|) finds elements that are either one thing or another, while the keyword or is used in complex predicates to filter a sequence according to one condition or another.

      //sp[not(l)][not(ab)] uses two predicates. It first collects all of the speeches, after which the first predicate keeps only those that don’t have line elements as children. The second predicate then filters those still further, keeping only the ones that also don’t have anonymous blocks as immediate children. You can reverse the order of the predicates in this case (although there are other situations where changing the order of predicates leads to different results).

      //sp[not(l) and not(ab)] uses a compound predicate. It collects all of the speeches and then filters them all at once, keeping only those that both do not have any lines as children and do not have any anonymous blocks as children. The order of the two parts on either side of the and operator doesn’t matter.

      //sp except (//sp[l]|//sp[ab]) uses except to specify the difference of sets of nodes. See Kay pp. 628–31 for discussion. Our pattern says return all <sp> elements, but exclude from that return the union of all <sp> elements that have <l> children and all <sp> elements that have <ab> children.

      All five of these XPaths yield the same sequence of seven elements, and in all cases that use the and, or, or union (|) operators, you may change the order of the parts without changing the results. For example not(l or ab) returns the same nodes as not(ab or l).
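To see that the five formulations agree, it can help to replay them on a toy document. The sketch below is in Python and uses only the standard-library xml.etree.ElementTree module. Because that module implements only a small subset of XPath (it has no not(), or, and, |, or except), each predicate is emulated with ordinary Python. The sample XML, the helper has_child(), and the variable names are invented for illustration and are not taken from Bad Hamlet.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play (not Bad Hamlet): four speeches with different children.
SAMPLE = """
<play>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>A</l><l>B</l></sp>
  <sp who="Horatio"><speaker>Horatio</speaker><ab>C</ab></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><lg><l>D</l><l>E</l></lg></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>F</l><stage>Exit</stage></sp>
</play>
"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # plays the role of //sp


def has_child(sp, tag):
    """True if sp has at least one immediate child element named tag."""
    return sp.find(tag) is not None


# //sp[not(l|ab)]: the union of l and ab children is empty
union = [sp for sp in speeches if not (sp.findall("l") + sp.findall("ab"))]

# //sp[not(l or ab)]
either = [sp for sp in speeches if not (has_child(sp, "l") or has_child(sp, "ab"))]

# //sp[not(l)][not(ab)]: two filters applied one after the other
no_l = [sp for sp in speeches if not has_child(sp, "l")]
chained = [sp for sp in no_l if not has_child(sp, "ab")]

# //sp[not(l) and not(ab)]
both = [sp for sp in speeches if not has_child(sp, "l") and not has_child(sp, "ab")]

# //sp except (//sp[l]|//sp[ab]): set difference, kept in document order
excluded = root.findall(".//sp[l]") + root.findall(".//sp[ab]")
difference = [sp for sp in speeches if all(sp is not x for x in excluded)]

results = [union, either, chained, both, difference]
print(all(r == union for r in results))             # do all five agree?
print([child.tag for sp in union for child in sp])  # children of the surviving speeches
```

In this toy document the only speech that survives every version of the filter is the one whose spoken text sits inside a line group, which mirrors what the real expressions reveal about Bad Hamlet.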

  2. Explain why the following four XPath expressions return different results, and describe in prose what each of them does return, and why:
    1. //sp[@who="Hamlet"]/l[1]
    2. /descendant::sp[@who="Hamlet"]/l[1]
    3. (//sp[@who="Hamlet"]/l)[1]
    4. (/descendant::sp[@who="Hamlet"]/l)[1]

    These four XPath expressions fall into two pairs that return different results because of the scoping effect of parentheses, that is, how the parentheses affect the sequence of items within which the system then applies the predicate [1] to select only the first item in a sequence. The other difference among the expressions, the explicit specification of the descendant axis, has no effect on the results here: the shorthand // at the start of a path functions the same way as the long form /descendant::. The predicates in expressions 2a and 2b filter elements in a different context than those in expressions 2c and 2d because the parentheses alter the current context. Here are the details:

    2a (//sp[@who="Hamlet"]/l[1]) returns the first line of every speech by Hamlet. XPath expressions operate one step at a time, and the sequence returned by each step becomes the sequence of context nodes from which the next step proceeds. This means that when the first step returns a sequence of all of Hamlet’s speeches, each speech, one by one, becomes the context for returning the lines in that single speech, which are then filtered by the predicate to keep only the first line. The entire expression returns, then, a sequence of all of the first lines (<l> elements) of all of Hamlet’s speeches.

    Not all of Hamlet’s speeches contain <l> child elements, yet when we ask for the first <l> child element of each speech we don’t raise any errors. This fact illustrates another perhaps surprising feature of XPath: asking for something that does not exist is not an error, so if a speech has no <l> children we get no result (an empty sequence) for that speech. When we ask for //sp[@who="Hamlet"]/l[1] we get 159 first child <l> elements, but when we ask for //sp[@who="Hamlet"], we get 357 speeches. This means that 198 speeches don’t have a first <l> child element, and that means that they don’t have any <l> child elements. We can verify that with //sp[@who="Hamlet"][not(l)], which asks explicitly for all speeches by Hamlet that contain no <l> child elements, and which returns 198 results. We can find out what types of elements those speeches do have with:

    //sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]
    ! name(.)
    => distinct-values()

    This finds all speeches by Hamlet (//sp[@who="Hamlet"]) and filters them to keep only the ones that do not have any <l> children (//sp[@who="Hamlet"][not(l)]). We then find all of the child elements that those speeches do have (//sp[@who="Hamlet"][not(l)]/*) and filter those children to exclude the speaker name and any stage directions (//sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]), since those don’t hold spoken text, and what we’re looking for is the elements other than <l> that hold spoken text. We then use the name() function to get the name of each of those elements, using the simple map operator. Since we don’t need the names of the element types to be repeated, we employ the distinct-values() function to remove the duplicates, using the arrow operator. This tells us that speeches by Hamlet that don’t have any <l> children have, instead, <ab> (anonymous block, which in this play is used for non-metrical lines) and <lg> (line group) child elements.

    We used the self:: axis above inside a predicate to exclude <speaker> and <stage> elements from the results of a path step. We could, alternatively, use the except operator within the path step instead of the predicate. That version would look like:

    //sp[@who="Hamlet"][not(l)]/(* except (speaker|stage))
    ! name(.)
    => distinct-values()

    There is no reason to prefer one of these to the other, and you should use whichever one you find easiest to understand.

    2b (/descendant::sp[@who="Hamlet"]/l[1]) starts by looking for <sp> on the descendant axis from the document node (the top of the tree), and then finds all of the child line elements of those speeches. As in 2a, the predicate [1] modifies the immediately preceding step of the expression. This means that the expression returns the first line of each of Hamlet's speeches, the same result as 2a.

    2c ((//sp[@who="Hamlet"]/l)[1]) also begins by finding all of the speeches by Hamlet, and then all of the child line elements of those speeches. At that point the full expression is wrapped in parentheses, making the current context a single, flattened sequence of all line elements in Hamlet speeches. The predicate [1] then applies to that entire flattened sequence, so it filters the result by asking for the very first line in the entire sequence of all lines spoken by Hamlet. As a result 2c returns only one line.

    The way the parentheses operate here illustrates a perhaps surprising feature of sequences in XPath: sequences flatten their contents into a single sequence. For example, the sequence ( (1, 2, 3), (4, 5, 6) ) might look like a sequence of two items, each of which is itself a sequence of three integers. The way XPath works, though, is that this is equivalent to (1, 2, 3, 4, 5, 6) because nested sequences in XPath are automatically flattened. This is why wrapping all of the lines of all of Hamlet’s speeches in parentheses causes them to behave like a single sequence of lines, instead of a sequence of sequences of lines.

    XPath has a structure called an array that is similar to sequences except that it permits this type of nesting without flattening. We don’t introduce arrays in this course because they are not needed for most document processing, but if you do require array functionality for your project, the instructors will help you learn to use them.

    2d ((/descendant::sp[@who="Hamlet"]/l)[1]), like the others, starts by looking for <sp> elements on the descendant axis from the document node and filtering them to keep only the ones by Hamlet. As in 2c, the numerical predicate is applied to the entire expression, and not to each individual speech, since the predicate applies to everything wrapped in parentheses, that is, to all of Hamlet’s lines as a single sequence. For that reason, it keeps only the first line that is spoken by Hamlet and returns only one line, the same result as 2c.

    The point is that a predicate applies to the current context, which is defined as the immediately preceding sequence. In 2a and 2b, it’s each sequence of lines in each speech, separately, so the predicate applies once for every speech, and returns the first line of every speech. In 2c and 2d, the parentheses cause the entire preceding expression to function as a single, flattened sequence, so the predicate applies only once, to the continuous sequence of all lines spoken by Hamlet, and therefore returns only one line, the first line spoken by Hamlet in the entire play.
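The difference between the two pairs can be reproduced on a small scale. What follows is a minimal sketch in Python with the standard-library xml.etree.ElementTree module. Its limited XPath subset supports a positional predicate on a child step but not parenthesized expressions, so the parenthesized versions are emulated by slicing the Python list that findall() returns. The sample XML and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play (not Bad Hamlet).
SAMPLE = """
<play>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>A</l><l>B</l></sp>
  <sp who="Horatio"><speaker>Horatio</speaker><ab>C</ab></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><lg><l>D</l><l>E</l></lg></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>F</l></sp>
</play>
"""

root = ET.fromstring(SAMPLE)

# Like 2a and 2b: [1] is applied inside each of Hamlet's speeches separately.
first_per_speech = root.findall(".//sp[@who='Hamlet']/l[1]")

# Like 2c and 2d: gather every <l> child of Hamlet's speeches into one
# sequence first, then keep only the first item of that whole sequence.
all_lines = root.findall(".//sp[@who='Hamlet']/l")
first_overall = all_lines[:1]

print([line.text for line in first_per_speech])
print([line.text for line in first_overall])
```

The third speech in the sample has no <l> children at all, only an <lg>, so it contributes nothing to the first result and raises no error, which is the same behavior described above for Hamlet's speeches that lack <l> children.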


What does // mean?

Most of the time we can pretend it means descendant::

Most of the time you don’t need to think about the details of how // doesn’t really mean descendant axis, even though it seems to behave as if it did. But if you run into one of the places where // doesn’t behave the same way as referring to the descendant axis explicitly, here’s why.

In most places // functions the same way as descendant::, so we often think of it as shorthand for it, just as @ is shorthand for attribute::, that is, for referring to nodes on the attribute axis. But // isn’t exactly synonymous with descendant::; what it actually means is descendant-or-self::node()/. Here’s why that matters, and why that perhaps confusing path step is useful.

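The descendant-or-self:: axis means what you think it means: it looks for specified nodes on the descendant axis, but it also looks at the current context node, and not only at its descendants. The expressions //sp and /descendant::sp return the same results because every <sp> element that is a descendant of the document node is necessarily a child of some node on the descendant-or-self axis (either the document node itself or one of its descendants), so looking for <sp> children of all of those nodes finds exactly the <sp> descendants of the document node.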

Why is it useful that // doesn’t mean the same thing as descendant::?

If you want to find all @who attributes in the document, you can do that with //@who. But if you try to write /descendant::@who you’ll raise an error because you’re trying to look on two axes at once. Attributes aren’t descendants in the sense that they are never on the descendant axis because they are never on any of the directional axes (parent, child, etc.); they are only on the attribute axis. But because //@who really means /descendant-or-self::node()/@who, it looks for all nodes on the descendant-or-self axis (none of which are attributes themselves) and then looks for @who attributes on the attribute axis from those nodes. Since it is looking at all descendant nodes of the document node, it winds up looking at all @who attributes, no matter what node they belong to.
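The two-step reading of //@who (visit every node at or below the starting point, then look on the attribute axis of each one) can be mimicked as follows. This is a sketch in Python with the standard-library xml.etree.ElementTree module, which cannot return attribute nodes from a path expression, so a loop stands in for the XPath. The sample XML and the variable name who_values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play (not Bad Hamlet); @who appears on two element types.
SAMPLE = """
<play>
  <sp who="Hamlet"><l>A</l></sp>
  <sp who="Horatio"><ab>C</ab></sp>
  <stage who="Ghost">Enter</stage>
</play>
"""

root = ET.fromstring(SAMPLE)

# Step 1 (descendant-or-self): root.iter() yields the root element itself
# and then every element below it, in document order.
# Step 2 (@who): look at the attributes of each of those elements.
who_values = [el.get("who") for el in root.iter() if "who" in el.attrib]

print(who_values)
```

Note that the loop never treats an attribute as a descendant: it walks elements only and inspects the attributes of each one as a separate step, which is the division of labor between the two path steps.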

What are the pitfalls of using // as a synonym for descendant::?

There are two common types of errors that come from treating // as a synonym for descendant::, as follows:

  1. Suppose you want to find all of the speeches (<sp>) that have child <l> elements. If you write //sp[l] you’ll get the right results: your path step starts at the document node, eventually finds its way to all of the <sp> elements in the document, and uses the predicate to test each one to see whether it has any <l> children. This works because every path step is on an axis and the child axis is the default when no other axis is specified explicitly, so //sp[l] is synonymous with //sp[child::l].

    We can’t, though, find all <sp> elements that have <l> descendants with //sp[//l]. If you try this, you’ll select all 1137 speeches in the play, both those that have <l> descendants and those that don’t. The reason is that the predicate begins with a slash and a slash always means start at the document node, so instead of looking for <l> descendants of the current context (each <sp> in turn), your predicate is reporting true if there are any <l> element descendants of the document node. Since there always are, the predicate always returns true, so the test always succeeds and we wind up not filtering anything.

    We can fix this in two ways:

    1. Specify the descendant axis for real with //sp[descendant::l]. This predicate expression does not begin with a slash, so it looks for <l> descendants only of the current context node, which is each <sp> in turn.

    2. Start the predicate with a dot, which in an XPath context means current context node: //sp[.//l]. The predicate expression no longer begins with a slash, so it doesn’t start from the document node; it starts from the current context node (represented by the dot) and looks down from there.

    In our own work we favor the first solution because we find it easier to understand, but the two will return the same result. They aren’t exactly equivalent for the reason described above (we can specify an attribute on the attribute axis right after a double slash), but in practice we almost never want to do that anyway.

  2. Because XPath expressions proceed step by step, where the sequence selected at each step becomes a sequence of context nodes for the next step, //sp//l[1] doesn’t select all <l> descendants of all <sp> elements and then return the first one. What it does instead is work through the <sp> elements one at a time and apply the predicate separately within each of them (strictly speaking, because // expands to /descendant-or-self::node()/, the [1] is applied to the <l> children of each node at or below the speech). What you are asking for, then, is a first <l> for each speech (at least one result for every speech that contains lines), instead of the first <l> element that is a descendant of a speech (one result for the entire play). The same is true of //sp/descendant::l[1]; here, too, you are selecting the descendant lines of each speech, one speech at a time, and filtering to keep just the first of each of those speech-specific sequences.

    You can work around this limitation by using parentheses to fuse all of the subsequences into one long sequence before applying the predicate: (//sp//l)[1] or (//sp/descendant::l)[1] will both return the first <l> element in the play that is a descendant of an <sp> element.
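The first pitfall can be made concrete by spelling out what each predicate is actually testing. What follows is a minimal sketch in Python with the standard-library xml.etree.ElementTree module, which does not accept absolute paths inside predicates, so the tests are written out as ordinary Python. The sample XML and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play (not Bad Hamlet).
SAMPLE = """
<play>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>A</l><l>B</l></sp>
  <sp who="Horatio"><speaker>Horatio</speaker><ab>C</ab></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><lg><l>D</l><l>E</l></lg></sp>
  <sp who="Hamlet"><speaker>Hamlet</speaker><l>F</l></sp>
</play>
"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # plays the role of //sp

# Like //sp[//l]: the test ignores the speech and asks whether the document
# as a whole contains any <l>, so it gives the same answer for every speech.
document_has_l = root.find(".//l") is not None
unfiltered = [sp for sp in speeches if document_has_l]

# Like //sp[descendant::l] or //sp[.//l]: the test starts from each speech.
with_l_descendant = [sp for sp in speeches if sp.find(".//l") is not None]

# Like //sp[l]: only immediate children count.
with_l_child = [sp for sp in speeches if sp.find("l") is not None]

print(len(unfiltered), len(with_l_descendant), len(with_l_child))
```

The first test keeps every speech because its answer does not depend on which speech is being examined. The other two give different counts from each other because the speech whose lines sit inside an <lg> has <l> descendants but no <l> children.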