Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-03-10T16:47:38+0000


XPath assignment #2 answers

The task and possible solutions

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a markdown file and upload it to Canvas as an attachment. Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Most (not all) speeches in Hamlet contain mostly metrical line (<l>) and anonymous block (<ab>) elements (an anonymous block is the TEI element that the tagger used to represent a non-metrical speech line). Speeches also typically contain <speaker> elements, and may also contain stage directions (<stage>). We have deliberately left out at least one other type of subelement found in speeches. Based on this understanding:
    1. What XPath would find all of the speeches that do not contain any metrical lines as immediate children? How many are there?

      //sp[not(l)]

      For this answer, we use the predicate [not(l)] to check whether there are <l> elements on the child axis of each speech we find. We keep only speeches that do not contain any line elements, throwing away any speeches that do contain those elements. There are 451 speech elements without line children. Some of you specified //sp[not(child::l)]. This is syntactically valid XPath, but it isn’t consistent with best practice; most coders wouldn’t mention the child axis explicitly because it’s the default.

    2. What XPath would find all of the speeches that do not contain either any metrical lines (<l>) or any anonymous blocks (<ab>)? How many are there? What do they contain instead? (As in #2d in XPath assignment #1, you haven’t yet learned the XPath to return a list of the types of elements they do contain, but if you find them all, you can scan the brief list that <oXygen/> returns, click on each one to see it in context, and see what’s going on.)

      //sp[not(l|ab)]
      //sp[not(l or ab)]
      //sp[not(l)][not(ab)]
      //sp[not(l) and not(ab)]
      //sp except (//sp[l]|//sp[ab])

      These path expressions all return speeches that do not have either <l> or <ab> elements as immediate children. There are seven of these elements. When you navigate to these speeches in the list that <oXygen/> returns, you find that the other child elements of speeches that we concealed from you are line groups, or <lg>. You can get those child elements by adding an asterisk as the next path step, meaning find all of the child elements of these <sp> elements, whatever they might be: //sp[not(l|ab)]/* or //sp[not(l or ab)]/* or //sp[not(l)][not(ab)]/* or //sp[not(l) and not(ab)]/*. You can get the element names, instead of the elements themselves, by adding the name() function as a path step after the asterisk: //sp[not(l|ab)]/*/name() or //sp[not(l or ab)]/*/name() or //sp[not(l)][not(ab)]/*/name() or //sp[not(l) and not(ab)]/*/name(). You can get rid of the duplicates by wrapping the entire expression in the distinct-values() function, e.g., distinct-values(//sp[not(l|ab)]/*/name()), etc.

      The first version uses the union operator. l|ab matches the union of all lines and anonymous blocks, that is, everything that is either a line or an anonymous block (and some people refer to the union connector as the or connector). The path //sp collects all of the speeches, and the predicate then checks on the child axis (the default, since no axis is specified) and retains only those speeches that do not have either lines or anonymous blocks as immediate children.

      The second version uses the or operator. Without the not() function, it would keep all <sp> elements that have either <l> or <ab> children. The not() function inverts the test, so it keeps only the <sp> elements that have neither type of child. Some people find the two senses of or confusing: the union operator (|) finds elements that are either one thing or another, while the keyword or is used in complex predicates to filter a sequence according to one condition or another.

      The third version uses two predicates. It first collects all of the speeches, after which the first predicate keeps only those that don’t have line elements as children. The second predicate then filters those still further, keeping only the ones that also don’t have anonymous blocks as immediate children.

      The fourth version uses a compound predicate. It collects all of the speeches and then filters them all at once, keeping only those that both do not have any lines as children and do not have any anonymous blocks as children.

      The fifth version uses except to specify the difference of sets of nodes. See Kay pp. 628–31 for discussion. Our pattern says return all <sp> elements, but exclude from that return the union of all <sp> elements that have <l> children and all <sp> elements that have <ab> children.

      All five of these XPaths yield the same sequence of seven elements, and in all cases that use the and, or, or union (|) operators, you may change the order of the parts without changing the results. For example, not(l or ab) returns the same nodes as not(ab or l).
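The equivalence of the five expressions can also be checked outside <oXygen/>. What follows is a minimal sketch in Python, not part of the assignment: the miniature document and the helper names has() and who() are invented for illustration, and because the standard library's ElementTree module does not support not(), each predicate is spelled out as an ordinary Python filter that mimics it.

```python
import xml.etree.ElementTree as ET

# Invented miniature document: the element names follow the assignment,
# but the speeches are hypothetical and far smaller than Bad Hamlet.
TOY = """
<play>
  <sp who="A"><speaker>A</speaker><l>verse</l></sp>
  <sp who="B"><speaker>B</speaker><ab>prose</ab></sp>
  <sp who="C"><speaker>C</speaker><lg><l>song</l></lg></sp>
  <sp who="D"><speaker>D</speaker><l>verse</l><ab>prose</ab></sp>
</play>
"""
root = ET.fromstring(TOY)
speeches = root.findall(".//sp")  # like //sp


def has(sp, name):
    """True if sp has at least one immediate child element called name."""
    return sp.find(name) is not None


def who(seq):
    """Label each speech by its who attribute so the results are easy to read."""
    return [sp.get("who") for sp in seq]


# //sp[not(l)]: C qualifies because its <l> is a grandchild, not a child.
no_l = who(sp for sp in speeches if not has(sp, "l"))

with_l = [sp for sp in speeches if has(sp, "l")]    # //sp[l]
with_ab = [sp for sp in speeches if has(sp, "ab")]  # //sp[ab]

variants = {
    "//sp[not(l|ab)]": who(
        sp for sp in speeches if not [c for c in sp if c.tag in ("l", "ab")]
    ),
    "//sp[not(l or ab)]": who(
        sp for sp in speeches if not (has(sp, "l") or has(sp, "ab"))
    ),
    "//sp[not(l)][not(ab)]": who(
        sp for sp in (s for s in speeches if not has(s, "l")) if not has(sp, "ab")
    ),
    "//sp[not(l) and not(ab)]": who(
        sp for sp in speeches if not has(sp, "l") and not has(sp, "ab")
    ),
    "//sp except (//sp[l]|//sp[ab])": who(
        sp for sp in speeches if sp not in with_l and sp not in with_ab
    ),
}

# //sp[not(l|ab)]/*/name(): what the remaining speeches contain instead.
child_names = [
    c.tag for sp in speeches if not (has(sp, "l") or has(sp, "ab")) for c in sp
]

print(no_l)
for xpath, result in variants.items():
    print(xpath, result)
print(child_names)
```

In this toy document only speech C survives all five filters, and its children are a speaker and a line group, which parallels what the real file shows on a larger scale.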

  2. Explain why the following four XPath expressions return different results, and describe in prose what each of them does return, and why:
    1. //sp[@who="Hamlet"]/l[1]
    2. /descendant::sp[@who="Hamlet"]/l[1]
    3. (//sp[@who="Hamlet"]/l)[1]
    4. (/descendant::sp[@who="Hamlet"]/l)[1]

    These four XPath expressions return different results because of the scoping effect of parentheses, that is, how the parentheses affect the sequence of items within which the system then searches for the first item. The other visible difference among the expressions is how they specify the descendant axis, but that difference does not affect the results here: the shorthand // (strictly, an abbreviation for /descendant-or-self::node()/) finds the same <sp> elements as the long form /descendant::. The predicate in expressions 2a and 2b filters elements in a different context than in expressions 2c and 2d, because the parentheses alter the current context.

    2a finds all speeches by Hamlet, and then all of the child line elements of those speeches. The predicate [1] modifies the immediately preceding step of the expression, which means that it filters separately the sequence of lines from each speech by Hamlet. As a result, the predicate retains the first line element in the context of each speech, that is, the expression overall returns the first line of every speech by Hamlet.

    2b starts by looking for <sp> on the descendant axis, and then finds all of the child line elements of those speeches. As in 2a, the predicate [1] modifies the immediately preceding step of the expression. This means that the expression returns the first line of each of Hamlet’s speeches, the same result as 2a.

    2c also begins by finding all of the speeches by Hamlet, and then all of the child line elements of those speeches. At that point the full expression is wrapped in parentheses, making the current context a single sequence of all line elements in Hamlet speeches. The predicate [1] then applies to that entire sequence, so it filters the result by asking for the very first line in the entire sequence of all lines spoken by Hamlet. As a result 2c returns only one line.

    2d, like the others, starts by looking for <sp> on the descendant axis from the document node and filtering them to keep only the ones by Hamlet. As in 2c, the numerical predicate is applied to the entire expression, and not to each individual speech, since the predicate applies to everything wrapped in parentheses, that is, to all of Hamlet’s lines as a single sequence. For that reason, it keeps only the first line that is spoken by Hamlet and returns only one line, the same result as 2c.

    The point is that a predicate applies to the current context, which is defined as the immediately preceding step in the path. In 2a and 2b, it’s each sequence of lines in each speech, separately, so the predicate applies once for every speech, and returns the first line of every speech. In 2c and 2d, the parentheses cause the entire preceding expression to function as a single sequence, so the predicate applies only once, to the continuous sequence of all lines spoken by Hamlet, and therefore returns only one line, the first line spoken by Hamlet in the entire play.
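The difference between the two scopes can be reproduced on a small scale. The following is a sketch under stated assumptions rather than a replacement for running the expressions in <oXygen/>: the three-speech document is invented, and it relies on Python's standard ElementTree module, whose limited XPath support applies a positional predicate per parent, as in 2a. The parenthesized form of 2c is mimicked by collecting all of the lines first and then taking the first item of the resulting list.

```python
import xml.etree.ElementTree as ET

# Invented miniature document; the line texts are placeholders, not Shakespeare.
TOY = """
<play>
  <sp who="Hamlet"><l>H1 first</l><l>H1 second</l></sp>
  <sp who="Horatio"><l>Ho first</l></sp>
  <sp who="Hamlet"><l>H2 first</l><l>H2 second</l></sp>
</play>
"""
root = ET.fromstring(TOY)

# Like //sp[@who="Hamlet"]/l[1]: the [1] is evaluated inside each speech,
# so every speech by Hamlet contributes its own first line.
per_speech = [line.text for line in root.findall(".//sp[@who='Hamlet']/l[1]")]

# Like (//sp[@who="Hamlet"]/l)[1]: first gather all of Hamlet's lines into
# one sequence, then take the first item of that whole sequence.
all_hamlet_lines = root.findall(".//sp[@who='Hamlet']/l")
overall_first = all_hamlet_lines[0].text

print(per_speech)     # one line per Hamlet speech
print(overall_first)  # a single line
```

The first result has as many items as there are speeches by Hamlet, while the second is always a single line, however many speeches there are.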


New XPath 3.0 operators

XPath 3.0 introduced two new operators that can be helpful when writing complex XPath expressions that use functions. They provide alternative notations for operations that could be spelled in other ways, so they are not obligatory, but developers tend to find them more legible than the alternatives. We’ll use them in course materials, so you’ll want to become familiar with them whether you use them in your own code or not.

The arrow operator (=>)

The arrow operator, spelled =>, means use the sequence returned by the expression on the left as input to the function on the right. For example, //sp => count() means use the expression to the left (a sequence of all <sp> elements in the document) as input to the function on the right (count the sequence). This returns a single integer, representing the number of <sp> elements in the document.

The expression //sp => count() is equivalent to count(//sp), but because wrapping a function around an expression can make the code difficult to read (especially if you have functions nested inside functions that are themselves nested inside functions, etc.), many developers prefer the arrow operator because they regard it as more legible.
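For readers who know a general-purpose language, the arrow operator resembles a left-to-right pipeline. The sketch below is only an analogy in Python: the helper arrow() is a hypothetical name invented here, the lists stand in for XPath sequences, and len and set stand in for count() and distinct-values().

```python
from functools import reduce


def arrow(sequence, *functions):
    """Pass the whole sequence to the first function, its result to the next, and so on."""
    return reduce(lambda value, function: function(value), functions, sequence)


# Stand-in for the sequence returned by //sp (placeholder strings, not real nodes).
speeches = ["sp-1", "sp-2", "sp-3"]

nested = len(speeches)          # inside-out, like count(//sp)
chained = arrow(speeches, len)  # left-to-right, like //sp => count()

# A longer chain: deduplicate first, then count the distinct values.
names = ["l", "ab", "l"]
distinct_count = arrow(names, set, len)

print(nested, chained, distinct_count)
```

Both spellings compute the same value; the chained form simply lists the operations in the order in which they are performed.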

The simple map operator (!)

The simple map operator, spelled !, means use each item in the sequence on the left, in turn, as input to the function on the right. For example, //sp ! count(.) will return a sequence of integer values of 1 because it will count each speech individually, so there will be as many repetitions of 1 as there are speeches. That isn’t a useful thing to do, of course, but, for example, if you want to find all of the element names in your document, you can write //* ! name(.), which finds all elements and returns the name of each of them in turn. If you want to know the distinct names, that is, the element types, without repetition (there are more than 6700 elements in our Bad Hamlet file), you can write //* ! name(.) => distinct-values() (there are only 28 distinct element types). This first finds all elements, then gets the name of each of them, and then uses the distinct-values() function to deduplicate the list of element types. If you want the computer to count for you, you could add another arrow step: //* ! name(.) => distinct-values() => count(), which will return just a single integer value, 28. You could, of course, also write count(distinct-values(//*/name(.))), but writing the operations from left to right in the order in which they are performed, instead of from the inside out, is likely to be more legible.

The simple map operator is not always identical in meaning to a path step, even though //*/name(.) is identical in meaning to //* ! name(.). The difference is that a path step can only operate on nodes, and not on strings, numbers, or other non-node objects, so if you write (1,2,3)/(. * .) (trying to return a sequence of the squares of the integers 1, 2, and 3), you’ll raise an error because a sequence of integer values is not a sequence of nodes. But you can write (1,2,3) ! (. * .); it returns the sequence (1, 4, 9). If the sequence on the left is a sequence of nodes, the two notations are equivalent.
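The item-by-item behavior of the simple map operator is close to a list comprehension in Python. The following sketch is an analogy only, using an invented six-element document and the standard ElementTree module; root.iter() stands in for //*, and dict.fromkeys() stands in for distinct-values() (in XPath the order of the deduplicated values is implementation-dependent, whereas this stand-in keeps first appearances in order).

```python
import xml.etree.ElementTree as ET

# Invented miniature document, far smaller than Bad Hamlet.
TOY = "<play><sp><l>a</l><l>b</l></sp><sp><ab>c</ab></sp></play>"
root = ET.fromstring(TOY)

# Like //sp ! count(.): each speech is counted on its own, so every result is 1.
ones = [len([sp]) for sp in root.iter("sp")]

# Like //* ! name(.): one name per element, including the root element.
names = [element.tag for element in root.iter()]

# Like //* ! name(.) => distinct-values(): the whole list is deduplicated at once.
distinct = list(dict.fromkeys(names))

# Like //* ! name(.) => distinct-values() => count()
how_many = len(distinct)

# Like (1,2,3) ! (. * .): mapping works on numbers as well as on nodes.
squares = [n * n for n in (1, 2, 3)]

print(ones, names, distinct, how_many, squares)
```

The comprehension handles one item at a time, as the simple map operator does, while the calls to dict.fromkeys() and len() each receive the whole list, as the arrow operator would pass it.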

Comparison of the arrow operator and the simple map operator

The arrow operator, then, means the entire sequence to the left, as a single sequence, is the input to one instance of the operation on the right. The simple map operator, on the other hand, means that each individual item in the sequence to the left is, in turn, the input to a separate instance of the operation on the right. The arrow operator thus does one thing all at once to all of the input; the simple map operator does the same thing, repeatedly and individually, to each item in the input.

Confusingly, when you use the simple map operator you need to use a dot to represent the item being processed, but when you use the arrow operator, the sequence to the left is automatically the input to the function, so writing the dot will raise an error.