Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-03-03T15:57:01+0000


XPath assignment #4 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.

Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. As always, you are encouraged to ask questions in the #xpath channel in Slack, but because you want to make progress in learning to debug your own code, your questions should tell us what you tried, what you expected, exactly what you got instead (not just didn’t work or got an error), and what you think the source of the problem is. Sometimes writing that sort of request for advice will help you figure out what’s wrong on your own (see Rubber duck debugging), and even when it doesn’t, it will help us identify the difficult moments.

These tasks require the use of path expressions, predicates, and functions. References to Kay are to the Michael Kay book; there’s a link in our online course description to a PDF version accessible through the Pitt library system. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. What XPath will return a hyphen-separated list of all characters without duplicates? The resulting list will look something like:

    Claudius-Hamlet-Polonius ...

    Our solution uses string-join() (alternative solutions may also require distinct-values()). Note that there are several ways to identify the characters in this markup, including the <castList> element, the <speaker> elements, and the @who attribute on the <sp> element. Which should you use and why?

    The simplest solution is string-join(//role, '-') (or, with the arrow operator, //role => string-join('-')). This doesn’t use distinct-values(), since the values of <role> elements are already unique. In addition to uniqueness, it also separates characters who speak in unison. For example, if you use <speaker> elements you’ll find values like <speaker>Rosencrantz and Guildenstern</speaker>, and that isn’t a string you want in your list of characters. Furthermore, the <speaker> and @who values contain only speaking characters, so using those would miss non-speaking characters.
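
    Here is a small self-contained sketch of the two functions. It uses literal sequences of strings as a hypothetical stand-in for the values you would pull from the document (the names and their order are our own illustration, not taken from the file). Each expression is meant to be evaluated on its own in the XPath browser:

    ```xpath
    (: join a sequence of strings with a hyphen :)
    string-join(('Claudius', 'Hamlet', 'Polonius'), '-')
    (: returns Claudius-Hamlet-Polonius :)

    (: remove repeats first, then join; the order in which
       distinct-values() returns its items is implementation-dependent :)
    ('Hamlet', 'Claudius', 'Hamlet', 'Polonius') => distinct-values() => string-join('-')
    ```

    Because distinct-values() does not promise any particular order, a solution built on <speaker> or @who may list the characters differently from one built on <role>.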

  2. Most metrical lines (<l>) have an @xml:id attribute with a value like sha-ham101010, ending in a six-digit number. The first digit is the act, the next two the scene, and the last three the line in the scene. Some metrical lines are split across multiple speakers, and in that case the six-digit number in the @xml:id value is followed by I (initial part), M (middle part), or F (final part). In a few places there may be more than one middle part, and in those cases the M is followed by a one-digit number. For example, one of Hamlet’s lines is:

    <l xml:id="sha-ham502277M2" n="277">One.</l>

    which is the second middle part. What XPath will return the number of <l> elements that are middle parts? Our solution uses count() and contains().

    //l[contains(@xml:id, 'M')] => count()

    As we explained, any line that is a middle part will have the character M in its @xml:id. To test whether a line's attribute has this character, we use the contains() function inside a predicate. The contains() function checks whether its second argument exists within the first, that is, whether the capital M is anywhere in the value of @xml:id, which is a string.

    To find all of these lines, we search on the descendant axis, starting from the document node, with the // shorthand, and append the predicate [contains(@xml:id, 'M')]. We then use the arrow operator with the count() function to count them.
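
    If you want to be stricter than contains(), one possible alternative (our own suggestion here, not part of the assignment) is matches() with a regular expression that insists on the pattern described above: six digits, an M, and an optional single digit at the end of the value. The sketch below runs against three literal strings standing in for @xml:id values; the first two appear in the task description, and the third is made up to follow the same pattern:

    ```xpath
    (: keep only the strings that end in six digits, M, and an optional digit; then count them :)
    ('sha-ham502277M2', 'sha-ham101010', 'sha-ham101011I')[matches(., '\d{6}M\d?$')] => count()
    (: returns 1 :)
    ```

    Against the real document the corresponding expression would be //l[matches(@xml:id, '\d{6}M\d?$')] => count(). Whether it returns the same number as the contains() version depends on how regular the damaged markup is, which we have not assumed here.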

  3. Sometimes Rosencrantz speaks by himself and sometimes he speaks in unison with Guildenstern. What XPath finds all of the speeches by Rosencrantz, whether alone or together with Guildenstern? Our solution uses a single instance of contains().

    //sp[contains(@who, "Rosencrantz")]

    This solution is very similar to the solution to the previous question, except that it uses the function contains() to search for the presence of an entire string instead of just a single character. It is important to know that contains() is capable of doing both. This XPath finds all of the <sp> elements and filters them by determining whether the string Rosencrantz occurs anywhere in the @who attribute, indicating that Rosencrantz is speaking. Where Rosencrantz and Guildenstern speak together, the @who value is instead Rosencrantz Guildenstern, but that also contains the string Rosencrantz.

    The approach above will produce false positives if the play also has a character with a name like Rosencrantzenfeld because that also contains Rosencrantz as a substring. You can (= should) use the contains-token() function instead of contains() to avoid that peril.

    The contains-token() function matches a substring only if it is a separate word, that is, not a substring of another word. In that way the contains-token() function is a compact and legible way of performing the same logic as //sp[tokenize(@who) = 'Rosencrantz']. This longer, less legible version uses the tokenize() function to split the value of the @who attribute into word tokens on whitespace (whitespace is the default separator for the tokenize() function if none is specified explicitly) and it then tests whether any of the tokens is equal to the string Rosencrantz.

    The contains-token() function is new in XPath 3.1, so it is not included in Michael Kay’s book, which documents XPath and XSLT only through version 2.0.
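
    The difference among the three tests is easy to see on literal strings. In this sketch the three strings stand in for possible @who values; Rosencrantzenfeld is the hypothetical character mentioned above and does not occur in the play. Evaluate each expression separately:

    ```xpath
    (: substring test: the hypothetical Rosencrantzenfeld is a false positive :)
    ('Rosencrantz', 'Rosencrantz Guildenstern', 'Rosencrantzenfeld') ! contains(., 'Rosencrantz')
    (: returns true, true, true :)

    (: whole-token test :)
    ('Rosencrantz', 'Rosencrantz Guildenstern', 'Rosencrantzenfeld') ! contains-token(., 'Rosencrantz')
    (: returns true, true, false :)

    (: the longer equivalent: split on whitespace, then compare the tokens :)
    ('Rosencrantz', 'Rosencrantz Guildenstern', 'Rosencrantzenfeld') ! (tokenize(.) = 'Rosencrantz')
    (: returns true, true, false :)
    ```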

  4. The string-length() function can be used in two ways. You can wrap it around an argument, so that, for example, string-length('Hi, Mom!') will return 8, the length in character count of the string inside the quotation marks. It can also be used as part of a path expression, so that, for example, if the XPath //sp returns a sequence of all <sp> elements, //sp/string-length(.) returns a sequence of the lengths of all <sp> elements as measured by counting characters. This works by finding all of the <sp> elements and then (next path step) getting the string length of each one. Remember that the dot inside the parentheses refers to the current context node, which is the member of the sequence of <sp> nodes that is being processed at the moment. We need to use this subterfuge because string-length(//sp) generates an error. The problem is that the argument of string-length() must be a single item (or an empty sequence), and //sp returns more than one item. Putting the string-length() function on its own path step with a dot inside means that it applies once for every <sp> element, and that each time it applies, its argument is just a single item.

    Use this information to identify an XPath that finds the length of the longest speech. What length does it return? Our solution uses string-length() and max().

    max(//sp/string-length(.))

    We can read this from the inside out as first find all <sp> elements in the document; then, for each of them, count its length in characters; then, for the sequence of lengths (all of which are integers) return only the longest. Because we cannot find the string-length of more than one item at a time, we navigate to all of the <sp> elements in the play on the descendant axis with the // shorthand, and then take a step to use the string-length() function and use . to refer to the current <sp> for each one in turn. Wrapping the max() function around the XPath that produces this sequence of values will return the maximum value in that sequence, that is, the maximum length of any speech, which is 5248.

    We find this much easier to read when we write it using the simple map and arrow operators:

    //sp ! string-length(.) => max()
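
    You can watch the same pattern at work on literal strings, without the document. In this sketch the two strings are our own stand-ins for speeches; evaluate each expression separately:

    ```xpath
    (: one length per string, because string-length() is applied once per item :)
    ('Hi, Mom!', 'To be, or not to be') ! string-length(.)
    (: returns 8, 19 :)

    (: the largest of those lengths :)
    ('Hi, Mom!', 'To be, or not to be') ! string-length(.) => max()
    (: returns 19 :)

    (: handing the whole sequence to string-length() at once raises a type error (XPTY0004) :)
    string-length(('Hi, Mom!', 'To be, or not to be'))
    ```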

    By the way, this is a naive and textually incorrect way to measure the length of a speech. It includes the content of any embedded <speaker> or <stage> elements, which aren’t part of the spoken text, and it also includes any whitespace characters that might have been present because of indentation during pretty-printing. How might you measure the length of a speech in a more textually meaningful way, and how would you do that using XPath?
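
    One possible approach, offered as a sketch and not as the only answer, is to keep just the text nodes that are not inside <speaker> or <stage>, glue them together, and normalize the whitespace before measuring. This assumes that <speaker> and <stage> are the only non-spoken content inside an <sp>, which you should verify against the file:

    ```xpath
    (: for each speech: join the spoken text nodes, collapse runs of whitespace, measure; then take the maximum :)
    //sp ! string-length(
             normalize-space(
               string-join(.//text()[not(ancestor::speaker or ancestor::stage)])
             )
           ) => max()
    ```

    The one-argument form of string-join() joins its items with no separator, and normalize-space() trims leading and trailing whitespace and reduces each internal run of whitespace to a single space.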

  5. Optional, challenging question: Given the preceding solution, how can you use that XPath to retrieve the longest <sp> itself? No fair checking the length and then writing a separate XPath that looks for that number. Your answer must find the longest speech without your knowing how long it is. Our solution doesn’t require any additional functions beyond the ones used in #4, but it does use a complicated predicate.

    //sp[string-length(.) eq max(//sp ! string-length(.))]

    Since the value returned by the previous solution is just a number, we can use it in a predicate to compare against the length of any speech that we are looking at. We first find the speeches and then check one by one whether the length of any of them is equal to the maximum length of all speeches in the play. This works because even within a predicate the // starts its search from the root of the document, so the comparison is to the maximal length of all speeches.

    It is also possible to find the longest speech without building on #4. An expression that does that is: //sp[not(string-length(.) < //sp/string-length(.))]. This takes advantage of the fact that when we use the general comparison operator < (typed literally in the XPath browser window, but spelled with the character entity &lt; whenever the expression sits inside an XML attribute value, where a literal < would not be well formed), the comparison returns true if any item on the left side of the operator is less than any item on the right (see the discussion of General comparison at the bottom of our XPath functions we use most). The right side of our comparison here is a sequence of integers that represent the lengths of all speeches, and the item on the left is the integer length of the speech we’re looking at at the moment (we look at each one separately, because that’s how predicates work). The only speech that is not shorter than at least one of the sequence of all speeches in the play is the one that is itself the longest, so our test will pick out just that one. As you gain more experience with using general comparison operators to compare sequences to one another this type of logic will grow more intuitive.
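
    The logic of general comparison is easier to follow on a handful of numbers. In this sketch the integers are arbitrary stand-ins for speech lengths; evaluate each expression separately:

    ```xpath
    (: true, because 5 is less than at least one item on the right (9) :)
    5 < (3, 9)

    (: true, because 9 is less than none of the items on the right :)
    not(9 < (3, 9))

    (: keep only the items that are not less than any item in the sequence :)
    (3, 5, 9)[not(. < (3, 5, 9))]
    (: returns 9 :)
    ```

    If two items tie for the largest value, neither is less than anything in the sequence, so both are returned. The same is true of the eq max() version above.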

    We don’t recommend the following alternative because it requires an extra statement, but it’s worthwhile knowing about the XPath let statement:

    let $longest := max(//sp/string-length(.))
    return //sp[string-length(.) eq $longest]

    The let statement defines a variable, and in this case, we create a variable called $longest (variable names in XPath begin with a dollar sign) which we set equal to the length of the longest speech, which is the integer 5248. The binding operator, which binds the value of the expression on its right to the variable name on its left, is a colon followed by an equal sign (:=), and not just an equal sign (as in some other programming languages). The binding operator is sometimes called the walrus operator because it looks like the eyes and tusks of a walrus lying on its side, at least if you have a lively imagination.

    A let statement must be paired with a return statement, which normally uses the variable to compute the value of an XPath expression, so in this case the return statement returns a sequence of all <sp> elements with a string length equal to 5248. You can write the entire expression on one line if you prefer; we’ve broken it over two lines because we find it easier to read that way.
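
    A single let can bind more than one variable if you separate the bindings with commas, and a later binding can use an earlier one. The following sketch is independent of the document; the variable names and two of the three integers are our own placeholders:

    ```xpath
    let $lengths := (12, 5248, 300),      (: stand-ins for speech lengths :)
        $longest := max($lengths)         (: uses the variable bound just above :)
    return $lengths[. eq $longest]
    (: returns 5248 :)
    ```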

  6. Optional, very challenging question: What XPath produces a numbered list of all characters, without any duplicates, which should look something like:

    1. Claudius
    2. Hamlet
    3. Polonius
    4. ...

    There are several possible solutions, each of which raises issues that you may not have seen before. If you get an error message, try to figure out what it means and how to resolve it.

    One solution is //role ! concat(position(), ". ", .). We retrieve all <role> elements and then, for each one, return a concatenation of its position in the sequence of <role> elements (using the position() function to get the position of the current context node in the sequence selected by the preceding path step), a literal dot followed by a space, and then the <role> element itself. The concat() function automatically atomizes its arguments, which is to say that when we pass it a <role> element, it converts it to its atomic (string) value (that is, it throws away the markup and just gives us back the character content), so that we wind up with results like “1. Claudius”, which is what we want.

    There is, alternatively, a concatenation operator, spelled ||, that you can use instead of the concat() function. The expression with that operator would look like //role ! (position() || ". " || .).

    If we get the characters using <speaker> or @who values instead of <role>, we need to deduplicate them with distinct-values(), and the expression would be distinct-values(//speaker) ! concat(position(), ". ", .).
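
    The numbering can be tried out on a literal sequence, which behaves like the strings that distinct-values() returns. The three names below are our own stand-ins:

    ```xpath
    (: position() counts within the sequence to the left of the bang :)
    ('Claudius', 'Hamlet', 'Polonius') ! (position() || '. ' || .)
    (: returns "1. Claudius", "2. Hamlet", "3. Polonius" :)
    ```

    Note that the simple map operator is required once the items are strings and not nodes. If you write a slash in place of the bang after distinct-values(//speaker), you get a type error (XPTY0019), because the left side of a slash must be a sequence of nodes and distinct-values() returns atomic values.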

    Finally, instead of iterating over the roles or distinct <speaker> or @who values and returning their positions and string values, as we do above, we can iterate over the positions and return the same thing. The expression in that case would be

    for $i in (1 to count(//role)) 
    return concat($i, ". ", (//role)[$i])

    The for expression iterates over a sequence and does something once for each member of the sequence. The sequence over which it iterates is a sequence of integers from 1 through however many characters there are in the play (there are 37, which XPath determines by counting the number of <role> elements). We use the to operator, which we haven’t used before, as an instruction to generate the sequence of integers for us dynamically. If there were, say, only 5 characters in the play, the expression 1 to count(//role) would be equivalent to the sequence (1,2,3,4,5).

    Although for $i in (1 to count(//role)) is just setting the variable $i to a different integer value each time it loops, on each pass through the for loop (//role)[$i] will point to a different character in the play. On the first pass, when $i equals 1, (//role)[$i] means (//role)[1], so it points to the first character in the sequence returned by //role. On the second pass, the number value is 2 and (//role)[$i] means (//role)[2], and points to the second character. This is what lets us generate our numbered list; both the numbers and the pointers into the list of characters are incremented by one on each pass through the loop. You can read more about how this works in http://xsltbyexample.blogspot.com/2010/05/obtain-position-from-for-expression-in.html, which details specifically how and why you would use this approach. We find this the least intuitive of the options discussed here, so it isn’t the one we’d choose in Real Life. It’s nonetheless worth knowing how for expressions work, but because simple map and path expressions have an implicit for built into every step (since the step to the right of the slash or bang is applied once for each item in the sequence to the left), we use explicit for statements much less in XPath than we do in many other programming languages.
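
    To experiment with the for expression without the document, you can bind a literal sequence to a variable and index into it in the same way. In this sketch $names and its three values are our own placeholders for the sequence returned by //role:

    ```xpath
    let $names := ('Claudius', 'Hamlet', 'Polonius')
    return
      (: $i takes the values 1, 2, 3 in turn, and $names[$i] selects the item at that position :)
      for $i in 1 to count($names)
      return concat($i, '. ', $names[$i])
    (: returns "1. Claudius", "2. Hamlet", "3. Polonius" :)
    ```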

In many instances we can apply an operation to a sequence of nodes with either a slash or simple mapping. For example, the following two expressions are equivalent, and each returns a sequence of integers that gives the string length for each act in the play, in document order (that is, from Act 1 through Act 5, consecutively):

We recommend using the simple map operator where appropriate, such as in this situation, because it makes it easier to see when we are taking a path step and when we are applying a function to each member of a sequence.
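
As a sketch of such a pair, here is one operation written both ways and applied to the <sp> elements from the tasks above (we use <sp> only because we have already worked with it, so this is our own illustration). Evaluated separately, both expressions return one integer per <sp>, in document order:

```xpath
(: path step: string-length() is applied once for each <sp> :)
//sp/string-length(.)

(: simple map: the same operation with the bang :)
//sp ! string-length(.)
```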

There are, though, at least two important differences between the slash and the simple map operator in this context: