Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-10-27T23:12:16+0000


Test #4: XPath

Instructions

This test has two required parts plus an optional extra-credit section. The first part asks questions about XPath and the second asks you to create XPath expressions and use them to learn about the Bad Hamlet file that you’ve been using for practice. You’ll find that file at http://dh.obdurodon.org/bad-hamlet.xml.

Write your answers in a properly formatted markdown file with a filename that conforms to our usual filenaming conventions and upload it to Canvas. The test is open book and you can use any references you’d like, except that you cannot receive help from another person. Should you have any questions, please ask in the #xpath channel in our Slack workspace. We can’t give you the answer, but we’ll do whatever we can short of that to help.

Don’t forget to set the XPath version in the <oXygen/> XPath toolbar or XPath builder to 3.1. You may also want to revisit our “XPath functions we use most” tutorial.

Part 1: Questions about XPath

  1. What is the difference between absolute and relative path expressions? Give an example of each and explain how you might use them to explore Hamlet in <oXygen/>.

  2. What is the difference between a path step and a predicate? To answer this question, give an example of an XPath expression that contains both path steps and predicates (at least one of each), explain how they are distinguished by spelling, and explain what each one contributes to the overall meaning of the XPath expression.

  3. Explain the difference between general comparison and value comparison, and provide XPath expressions (applied to Hamlet) that illustrate that difference.

  4. Explain why //sp/l ! count(.) returns 2809 instances of the integer value 1 and //sp/l => count() returns just one integer, the value 2809. (Note that the second path step is a lower-case letter L, representing a metrical line; it is not the digit one.) That is, what is the difference between the simple map and arrow operators in these examples?

    Don’t be distracted by the dot in the first example and the absence of a dot in the second. If you omit the dot from the first example or include it in the second you’ll raise an error, but that’s just a quirk and not the point of the question. The point is: What does each of the operators, simple map vs arrow, mean that causes them to produce different results?

Part 2: Creating and using XPath expressions

The functions we used to answer the following questions include contains(), count(), distinct-values(), not(), sort(), string-join(). All of these are described in Michael Kay except sort() because it was introduced in XPath 3.1 and Mike’s book was written when 2.0 was the most recent version. The sort() function works the way you might expect: if you supply a sequence of items as its only argument, it returns them sorted into alphabetical order. There may be more than one correct answer to some of the questions.
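As a small illustration of how sort() orders its input, here is a sketch that uses literal sequences rather than anything from Hamlet. The sample values are invented for the illustration. Strings come back in alphabetical order, while numbers are compared as numbers rather than as text.

```xpath
(
  (: strings are returned in alphabetical order :)
  sort(("Moe", "Larry", "Curly")),
  (: integers are compared numerically, so 9 comes before 10 :)
  sort((10, 9, 100))
)
```

Evaluated as XPath 3.1, this returns the six items Curly, Larry, Moe, 9, 10, 100.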

  1. All speech (<sp>) elements should have @who attributes, but some (incorrectly) don’t. What XPath expression will select the speeches without @who attributes? What XPath expression will tell you how many there are? (The path expression you write for that last question has to return a single integer value. It’s about having XPath do the counting, and not just about how many there are.) (Sanity check: there are two such speeches.)

  2. There are scenes where Hamlet, the title character in the play, does not speak. What XPath expression will find those scenes? (Sanity check: there are 7 such scenes.)

  3. Hamlet doesn’t speak in the scenes described in the previous question, but other characters do. What XPath expression will produce an alphabetized and deduplicated sequence of all speaker names in each of those scenes (one list per scene), formatted as comma-separated lists? For example, if the only characters who speak in a scene are Larry, Moe, and Curly, and each of them speaks several times, your XPath expression should output Curly, Larry, Moe.

    Your response has to proceed step by step, and you need to give the intermediate steps, not just the final complete expression. We recommend running a sanity check after each step. The steps are:

    1. Start with the scenes where Hamlet doesn’t speak (the answer to the preceding question).

    2. Modify the XPath expression in the first step to select all speakers in each of those scenes.

    3. Modify the XPath expression in the second step to deduplicate those speaker sequences, so that you find all distinct speakers in each of those scenes instead of all instances of all speakers, with repetitions.

    4. Modify the XPath expression in the third step to sort the sequence of deduplicated speakers for each scene.

    5. Modify the XPath expression above to form the sorted sequence of unique speaker names for each scene into a comma-separated list.

      When you use the arrow operator for a function that takes one argument, like count() (the one argument is a sequence of items to count), the parentheses must be empty. For example, //sp => count() counts all of the speeches in the play. This is part of a general rule that after the arrow operator you never supply the first argument to the function explicitly. You can see this rule at work in Part 1, Question 4, above; it’s why you can’t write a dot inside count() after the arrow operator, and if you do, you’ll get an error about how the XPath processor thinks the dot is the second argument to the function (Cannot find a two-argument function …). The first argument after the arrow operator is automatically implied, so the first one you specify literally, between the parentheses, will be interpreted as the second argument.

      When you use the arrow operator for a function that takes more than one argument (such as translate()), you have to supply all arguments except the first, since only the first is automatic. For example, (//speaker)[1] => translate('r', 'X') selects the first <speaker> element in the entire play (the value is Bernardo) and passes it into the translate() function to replace all instances of r with X. The translate() function takes three arguments (see our “XPath functions we use most” tutorial for details), but after the arrow operator we specify only the second and third ones. The output of this XPath expression is BeXnaXdo.
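
      The rule described in the two notes above can be sketched with literal values, so that you can try it without the Hamlet file open. The string "Bernardo" stands in for the <speaker> element, and the letters a, b, c are made up for the illustration. Each arrow expression is paired with the equivalent ordinary function call, in which the first argument is written out explicitly.

      ```xpath
      (
        (: one-argument function: the parentheses after the arrow stay empty :)
        ("a", "b", "c") => count(),
        count(("a", "b", "c")),
        (: three-argument function: after the arrow, supply only the second and third :)
        "Bernardo" => translate('r', 'X'),
        translate("Bernardo", 'r', 'X')
      )
      ```

      Evaluated as XPath 3.1, this returns 3, 3, BeXnaXdo, BeXnaXdo, which shows that each pair means the same thing.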

  4. There is one scene where Hamlet not only doesn’t speak, but isn’t even mentioned. What XPath expression will find that scene?

Part 3: Optional extra-credit questions

  1. What’s your favorite XPath function that hasn’t been used anywhere in this test? Give an example of how you could use it to explore or process Hamlet and explain how it works in your example. The example doesn’t have to be genuinely useful, but it does have to work and your explanation has to be accurate.

  2. The XPath expression

    //l[@xml:id="sha-ham101014"]

    selects a line in the play that reads:

     I think I hear them. Stand, ho! Who is
        there? Enter HORATIO and MARCELLUS.

    If we run the expression:

    //l[@xml:id="sha-ham101014"]/string()

    it returns the string value of the <l> element, that is, all of the text inside it, at any level, without the markup. But if we run:

    //l[@xml:id="sha-ham101014"]/text()

    it returns the spoken text, but without the textual content of the <stage> child element. Why do these two XPath expressions return different results and what is each of them doing? (Hint: text() is not a function, even though it looks like one. See the discussion of KindTest in Kay, pp. 697–98.)

  3. We can measure the length of a metrical line (<l> element) in the play in terms of character count or word count, but to measure the length of the spoken line we have to exclude any embedded stage directions from what we’re counting. For the line in extra-credit question #2, above:

    • What XPath expression will return the length of the spoken line in character count, excluding the embedded stage direction?

    • When we’re measuring length as character count, we don’t want to count all of the whitespace characters introduced by pretty-printing. What XPath expression will return the length of the spoken line above in character count after removing extraneous whitespace, so that each word is separated from neighboring words by only a single space character and there are no extra leading or trailing whitespace characters?

    • What XPath expression will return the length of the spoken line in word count, excluding the embedded stage direction?