Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-03-16T15:46:02+0000


XPath assignment #4 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. What XPath will return a hyphen-separated list of all characters without duplicates? The resulting list will look something like:

    Claudius-Hamlet-Polonius ...

    Our solution uses string-join() (alternative solutions may also require distinct-values()). Note that there are several ways to identify the characters in this markup, including the <castList> element, the <speaker> elements, and the @who attribute on the <sp> element. Which should you use and why?

    The simplest solution is string-join(//role, '-') (or, with the arrow operator, //role => string-join('-')). This doesn’t use distinct-values(), since the values of <role> elements are already unique. Besides guaranteeing uniqueness, using <role> also separates characters who speak in unison. For example, if you use <speaker> elements you’ll find values like <speaker>Rosencrantz and Guildenstern</speaker>, and that isn’t a string you want in your list of characters. Furthermore, the <speaker> and @who values contain only speaking characters, so using those would miss non-speaking characters.
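The difference between the <role> and <speaker> approaches can be sketched outside <oXygen/> in Python. This is only an emulation of the two XPath expressions, not XPath itself, and the miniature document below is invented for illustration; it is not taken from bad-hamlet.xml.

```python
import xml.etree.ElementTree as ET

# Invented miniature document. The element names follow the assignment,
# but the content is an assumption, not an excerpt from bad-hamlet.xml.
SAMPLE = """
<TEI>
  <castList>
    <role>Claudius</role>
    <role>Rosencrantz</role>
    <role>Guildenstern</role>
    <role>Attendant</role>
  </castList>
  <sp who="Claudius"><speaker>Claudius</speaker></sp>
  <sp who="Rosencrantz Guildenstern"><speaker>Rosencrantz and Guildenstern</speaker></sp>
  <sp who="Claudius"><speaker>Claudius</speaker></sp>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# Rough analogue of string-join(//role, '-')
roles = [role.text for role in root.iter("role")]
role_list = "-".join(roles)

# Rough analogue of string-join(distinct-values(//speaker), '-').
# dict.fromkeys() drops duplicates and keeps first occurrences in order;
# XPath itself does not promise any particular order for distinct-values().
speakers = list(dict.fromkeys(s.text for s in root.iter("speaker")))
speaker_list = "-".join(speakers)

print(role_list)     # every cast member, including the silent Attendant
print(speaker_list)  # only speakers, with the unison pair fused into one string
```

In this sketch the <speaker> list loses the non-speaking Attendant and treats the unison speech as a single name, which are the two problems described above.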

  2. Most metrical lines (<l>) have an @xml:id attribute with a value like sha-ham101010, ending in a six-digit number. The first digit is the act, the next two the scene, and the last three the line in the scene. Some metrical lines are split across multiple speakers, and in that case the six-digit number in the @xml:id value is followed by I (initial part), M (middle part), or F (final part). In a few places there may be more than one middle part, and in those cases the M is followed by a one-digit number. For example, one of Hamlet’s lines is:

    <l xml:id="sha-ham502277M2" n="277">One.</l>

    which is the second middle part. What XPath will return the number of <l> elements that are middle parts? Our solution uses count() and contains().

    count(//l[contains(@xml:id, 'M')]) or, with the arrow operator, //l[contains(@xml:id, 'M')] => count()

    As we explained, any line that is a middle part will have the character M in its @xml:id. To test whether a line's attribute has this character, we use the contains() function inside a predicate. The contains() function checks whether its second argument exists within the first, that is, whether the capital M is anywhere in the value of @xml:id, which is a string.

    To find all of these lines, we search on the descendant axis, starting from the document node, with the // shorthand, and append the predicate [contains(@xml:id, 'M')]. Then, to count them, we wrap everything in the count() function.
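As a cross-check on the logic, the same filter-then-count idea can be sketched in Python. This emulates the XPath rather than running it; apart from sha-ham502277M2, which is quoted in the question, the @xml:id values and line contents below are invented.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample; only the M2 line comes from the assignment text.
SAMPLE = """
<sp who="Hamlet">
  <l xml:id="sha-ham502276" n="276">whole line</l>
  <l xml:id="sha-ham502277I" n="277">initial part</l>
  <l xml:id="sha-ham502277M1" n="277">first middle part</l>
  <l xml:id="sha-ham502277M2" n="277">One.</l>
  <l xml:id="sha-ham502277F" n="277">final part</l>
  <l n="278">line with no xml:id</l>
</sp>
"""

root = ET.fromstring(SAMPLE)

# Rough analogue of //l[contains(@xml:id, 'M')].
# A missing attribute is treated as an empty string, which never contains 'M'.
middle_parts = [l for l in root.iter("l") if "M" in l.get(XML_ID, "")]
middle_ids = [l.get(XML_ID) for l in middle_parts]

# Rough analogue of wrapping the expression in count()
middle_count = len(middle_parts)

print(middle_ids, middle_count)
```

The test is case-sensitive in both Python and XPath, so the lowercase m in ham does not produce a match.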

  3. Sometimes Rosencrantz speaks by himself and sometimes he speaks in unison with Guildenstern.

    1. What XPath finds all of the speeches by Rosencrantz, whether alone or together with Guildenstern? Our solution uses a single instance of contains().

      //sp[contains(@who, "Rosencrantz")]

      This solution is very similar to the solution to the previous question, except that it uses the function contains() to search for the presence of an entire string instead of just a single character. It is important to know that contains() is capable of doing both. This XPath finds all of the <sp> elements and filters them by determining whether the string Rosencrantz occurs anywhere in the @who attribute, indicating that Rosencrantz is speaking. Where Rosencrantz and Guildenstern speak together, the @who value is instead "Rosencrantz Guildenstern", but that will still be accepted by the filtering in our predicate, because that string contains the string Rosencrantz.

      Note that this approach will produce false positives if the play also has a character with a name like Rosencrantzenfeld, because that also contains Rosencrantz as a substring. We didn’t ask you how to deal with that hypothetical situation, but the answer is that you can use the tokenize() function to split the value of the @who attribute into strings, separating on white space. With this approach, a value of Rosencrantz Guildenstern will produce two strings and Rosencrantzenfeld would produce one. You could then test for equality instead of using contains(), so that substrings would not produce false positives. The expression would look like //sp[tokenize(@who, '\s+') = 'Rosencrantz']. This returns true if any item in the sequence on the left (the individual words in the @who value, after splitting on whitespace) is equal to any item in the sequence on the right (there is only one item, the string Rosencrantz). This means that Rosencrantz Guildenstern would succeed but Rosencrantzenfeld would fail.
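The substring problem and the tokenizing fix can be sketched in Python. The function names are hypothetical, the @who values are illustrative (Rosencrantzenfeld is the made-up character from the discussion above), and the code only approximates the XPath behavior.

```python
# Illustrative @who values, not a listing from bad-hamlet.xml.
who_values = [
    "Rosencrantz",
    "Rosencrantz Guildenstern",
    "Guildenstern",
    "Rosencrantzenfeld",
]


def contains_match(who: str, name: str) -> bool:
    """Rough analogue of contains(@who, name): a substring test."""
    return name in who


def token_match(who: str, name: str) -> bool:
    """Rough analogue of tokenize(@who, '\\s+') = name: a whole-word test."""
    # str.split() with no argument splits on runs of whitespace.
    return name in who.split()


by_contains = [w for w in who_values if contains_match(w, "Rosencrantz")]
by_tokens = [w for w in who_values if token_match(w, "Rosencrantz")]

print(by_contains)  # includes the false positive Rosencrantzenfeld
print(by_tokens)    # whole-word matches only
```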

    2. Can you think of an alternative solution that doesn't use any functions (just a predicate)?

      You could also have done //sp[@who = "Rosencrantz" or @who = "Rosencrantz Guildenstern"], but this solution is naive and assumes that no other possible values for @who could have also included Rosencrantz as a speaker. It would even fail if the value were Guildenstern Rosencrantz, since it is looking for the entire string. On the other hand, contains() is sometimes a bit perilous because it will match on a substring, and not just on a whole word. As we mention above, that's not a problem here because there isn’t a character named, say, Rosencrantzenfeld, but if there were, the contains() approach would match on him, too. That’s a silly example, of course, but it’s easy to forget about substrings and try to count occurrences of, say, the verb form go, only to have your results contaminated with good and everything else that has the letter sequence go in it.

      If you want to check for equality to either the string Rosencrantz or the string Rosencrantz Guildenstern, using general equals (=, not eq) with a sequence on the right is more idiomatic XPath than the or expression. As we note above, general equals checks whether any member of the sequence on the left matches any member of the sequence on the right. This means that the predicate [@who = ("Rosencrantz", "Rosencrantz Guildenstern")] is equivalent to the predicate [@who = "Rosencrantz" or @who = "Rosencrantz Guildenstern"]. This example highlights an important difference between general comparison and value comparison, which you can review at the bottom of our XPath functions we use most tutorial.
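The "any item on the left matches any item on the right" rule for general equals can be modeled with a short Python sketch. The helper general_equals is a hypothetical name introduced here, and Python lists and tuples stand in for XPath sequences.

```python
def general_equals(left, right) -> bool:
    """Model of XPath general comparison (=): true if ANY pair of items matches."""
    return any(l == r for l in left for r in right)


# The sequence on the right of @who = ("Rosencrantz", "Rosencrantz Guildenstern")
accepted = ("Rosencrantz", "Rosencrantz Guildenstern")

# Each @who attribute is modeled as a one-item sequence;
# an <sp> with no @who is modeled as an empty sequence.
alone = general_equals(["Rosencrantz"], accepted)
together = general_equals(["Rosencrantz Guildenstern"], accepted)
reversed_order = general_equals(["Guildenstern Rosencrantz"], accepted)
other_speaker = general_equals(["Hamlet"], accepted)
no_who = general_equals([], accepted)

print(alone, together, reversed_order, other_speaker, no_who)
```

The reversed value fails here for the same reason it fails in the or version: the comparison is against whole strings, not individual names.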

  4. The string-length() function can be used in two ways. You can wrap it around an argument, so that, for example, string-length('Hi, Mom!') will return 8, the length in character count of the string inside the quotation marks. It can also be used as part of a path expression, so that, for example, if the XPath //sp returns a sequence of all <sp> elements, //sp/string-length(.) returns a sequence of the lengths of all <sp> elements as measured by counting characters. This works by finding all of the <sp> elements and then (next path step) getting the string length of each one. Remember that the dot inside the parentheses refers to the current context node, which is the member of the sequence of <sp> nodes that is being processed at the moment. We need to use this subterfuge because string-length(//sp) generates an error. The problem is that string-length() accepts at most one item as its argument, and //sp returns more than one item. Putting the string-length() function on its own path step with a dot inside means that it is applied once for every <sp> element, and that each time it is applied, it receives just a single item.

    Use this information to identify an XPath that finds the length of the longest speech. What length does it return? Our solution uses string-length() and max().

    max(//sp/string-length(.)). We find this more legible when we write it using the simple mapping operator and the arrow operator: //sp ! string-length(.) => max()

    What this solution accomplishes is that it generates a sequence of the string-lengths of every speech in the play, and then finds the maximum of those values. Because we cannot find the string-length of more than one item at a time, we navigate to all of the <sp> elements in the play on the descendant axis with the // shorthand, and then take a step to use the string-length() function and use . to refer to the current <sp> for each one in turn. Wrapping the max() function around the XPath that produces this sequence of values will return the maximum value in that sequence, that is, the maximum length of any speech, which is 5248.
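The map-then-aggregate shape of this solution can be sketched in Python. The three speeches below are invented, and string_value is a hypothetical helper that approximates what XPath uses as the string value of an element (all of its descendant text, with the markup discarded).

```python
import xml.etree.ElementTree as ET

# Invented sample: each <sp> is written without internal whitespace
# so that the character counts are easy to verify by hand.
SAMPLE = """
<body>
<sp><speaker>A</speaker><l>One.</l></sp>
<sp><speaker>B</speaker><l>Two words.</l><l>Three more words.</l></sp>
<sp><speaker>C</speaker><l>Four.</l></sp>
</body>
"""

root = ET.fromstring(SAMPLE)


def string_value(element) -> str:
    """Approximate the XPath string value: concatenated descendant text."""
    return "".join(element.itertext())


# Rough analogue of //sp/string-length(.): one length per speech
lengths = [len(string_value(sp)) for sp in root.iter("sp")]

# Rough analogue of wrapping that sequence in max()
longest = max(lengths)

print(lengths, longest)
```

As in the XPath, the length of each speech includes the text of its <speaker> child, because that text is part of the string value of the <sp>.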

  5. Optional, challenging question: Given the preceding solution, how can you use that XPath to retrieve the longest <sp> itself? No fair checking the length and then writing a separate XPath that looks for that number. Your answer must find the longest speech without your knowing how long it is. Our solution doesn’t require any additional functions beyond the ones used in #4, but it does use a complicated predicate.

    //sp[string-length(.) eq max(//sp/string-length(.))]

    Since the previous solution, max(//sp/string-length(.)), is just a numerical value, we can use it in a predicate to compare against the length of whichever speech we are looking at. All we have to do is find the speeches and check, one by one, whether the length of each speech is equal to the maximum length of all speeches in the play. This works because, even inside a predicate, the expression inside our max() function (//sp/string-length(.)) uses // to start its search from the document node, not from the current <sp>.

    It is also possible to find the longest speech without specifying the length, that is, without building on #4. The expression that does that is //sp[not(string-length(.) &lt; //sp/string-length(.))]. This takes advantage of the fact that when we use the general comparison operator < (which we have to spell with the character entity &lt; because it’s inside an attribute value, where a literal < would not be well formed), the comparison returns true if any item on the left side of the operator is less than any item on the right (see the discussion of General comparison at the bottom of our XPath functions we use most). The right side of our comparison here is a sequence of integers that represent the lengths of all speeches, and the item on the left is the integer length of the speech we’re looking at at the moment (we look at each one separately, because that’s how predicates work). The only speech that is not shorter than at least one speech in the play is the one that is itself the longest, so our test will pick out just that one.
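Both ways of selecting the longest speech can be modeled in Python. In this sketch plain strings stand in for the string values of <sp> elements, the function names are hypothetical, and the sample data are invented.

```python
def longest_by_max(speeches):
    """Model of //sp[string-length(.) eq max(//sp/string-length(.))]."""
    longest = max(len(s) for s in speeches)
    return [s for s in speeches if len(s) == longest]


def longest_by_not_less(speeches):
    """Model of //sp[not(string-length(.) < //sp/string-length(.))]."""
    lengths = [len(s) for s in speeches]
    # General < is true if the current length is less than ANY length in the
    # sequence, so negating it keeps only speeches that nothing exceeds.
    return [s for s in speeches if not any(len(s) < n for n in lengths)]


# Invented stand-ins for speech text
speeches = ["One.", "Two words. Three more words.", "Four."]

print(longest_by_max(speeches))
print(longest_by_not_less(speeches))
```

In this model, if two speeches tie for the greatest length, both functions return both of them, so a single result depends on the longest speech being unique.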

  6. Optional, very challenging question: What XPath produces a numbered list of all characters, without any duplicates, which should look something like:

    1. Claudius
    2. Hamlet
    3. Polonius
    4. ...

    There are several possible solutions, each of which raises issues that you may not have seen before. If you get an error message, try to figure out what it means and how to resolve it.

    One solution is //role/concat(position(), ". ", .). We retrieve all <role> elements and then, for each one, return a concatenation of its position in the sequence of <role> elements (using the position() function to get the position of the current context node), a literal dot followed by a space, and then the <role> element itself. The concat() function automatically atomizes its arguments, which is to say that when we pass it a <role> element, it converts it to its atomic (string) value (that is, it throws away the markup and just gives us back the character content), so that we wind up with results like “1. Claudius”, which is what we want. The version that uses the simple mapping operator would look like //role ! concat(position(), ". ", .).

    If we get the characters using <speaker> or @who values instead of <role>, we need to deduplicate them with distinct-values(), and because distinct-values() returns strings, and not nodes in the tree, we can’t use a path step to get the position and value the way we did with <role> elements. But we can use the simple map operator (!) with either nodes or atomic values, so the expression would then be distinct-values(//speaker) ! concat(position(), ". ", .).

    Finally, instead of iterating over the roles or distinct values and returning their positions and string values, as we do above, we can iterate over the positions and return the same thing. The expression in that case would be for $i in (1 to count(//role)) return concat($i, ". ", (//role)[$i]). The for expression iterates over a sequence and does something once for each member of the sequence. The sequence over which it iterates is a sequence of integers from 1 through however many characters there are in the play (there are 37, which XPath determines by counting the number of <role> elements). We use the to operator, which we haven’t used before, as an instruction to generate the sequence of integers for us dynamically. If there were, say, only 5 characters in the play, the expression 1 to count(//role) would be equivalent to the sequence (1,2,3,4,5).

    Although for $i in (1 to count(//role)) is just setting the variable $i to a different integer value each time it loops, on each pass through the for loop, (//role)[$i] will point to a different character in the play. On the first pass, when $i equals 1, (//role)[$i] means (//role)[1], so it points to the first character in the sequence returned by //role. On the second pass, $i equals 2, so (//role)[$i] means (//role)[2] and points to the second character. This is what lets us generate our numbered list; both the numbers and the pointers into the list of characters are incremented by one on each pass through the loop.

    You can read more about how this works in http://xsltbyexample.blogspot.com/2010/05/obtain-position-from-for-expression-in.html, which details specifically how and why you would use this approach. As that page notes, this type of output can also be generated in XSLT, and the XSLT solution may be more intuitive than the pure XPath one.
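The two strategies for question 6, iterating over the items and asking for their positions versus iterating over the positions and looking up the items, can be compared in a short Python sketch. The three names come from the sample output in the question; the speaker list is invented, and the code models the XPath expressions rather than running them.

```python
# Stand-in for the string values of //role (names from the sample output above)
roles = ["Claudius", "Hamlet", "Polonius"]

# Model of //role ! concat(position(), ". ", .):
# walk the items and number them as we go (positions start at 1).
by_item = [f"{position}. {role}" for position, role in enumerate(roles, start=1)]

# Model of for $i in (1 to count(//role)) return concat($i, ". ", (//role)[$i]):
# walk the integers 1..n and use each one as an index into the sequence.
# XPath positions are 1-based and Python indexes are 0-based, hence i - 1.
by_index = [f"{i}. {roles[i - 1]}" for i in range(1, len(roles) + 1)]

# Model of distinct-values(//speaker) ! concat(position(), ". ", .),
# using an invented speaker sequence that contains a repeat.
speakers = ["Hamlet", "Claudius", "Hamlet"]
distinct = list(dict.fromkeys(speakers))
by_distinct = [f"{position}. {name}" for position, name in enumerate(distinct, start=1)]

print("\n".join(by_item))
```

The first two strategies produce the same list; the choice between them is mostly a matter of which reads more naturally.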