Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-11-01T17:03:01+0000


Test #4: XPath (Answer key)

Instructions

This test has two required parts plus an optional extra-credit section. The first part asks questions about XPath and the second asks you to create XPath expressions and use them to learn about the Bad Hamlet file that you’ve been using for practice. You’ll find that file at http://dh.obdurodon.org/bad-hamlet.xml.

Write your answers in a properly formatted markdown file with a filename that conforms to our usual filenaming conventions and upload it to Canvas. The test is open book and you can use any references you’d like, except that you cannot receive help from another person. Should you have any questions, please ask in the #xpath channel in our Slack workspace. We can’t give you the answer, but we’ll do whatever we can short of that to help.

Don’t forget to set the XPath version in the <oXygen/> XPath toolbar or XPath builder to 3.1. You may also want to revisit our “XPath functions we use most” tutorial.

Part 1: Questions about XPath

  1. Question: What is the difference between absolute and relative path expressions? Give an example of each and explain how you might use them to explore Hamlet in <oXygen/>.

    What they look like: Any path expression that begins with a slash character (single or double) is an absolute path expression. Any path expression that begins with anything else is a relative path expression.

    What they mean: Inside <oXygen/> the current context is the location of the text cursor. An absolute path expression ignores the current context; it always starts from the document node and therefore returns the same result regardless of the location of the cursor. A relative path is relative to (starts from) the current cursor location.

    How to use them: Suppose you want to select all of the scenes in Act 3 in the play. An absolute path that does this is //body/div[3]/div and it does the same thing regardless of where your cursor is located. To find just the scenes of Act 3 with a relative path you can click immediately inside the <div> for Act 3 (but not inside any of its children) and use the path expression div. This works because clicking inside the act makes it the current context node, and your path expression means select all <div> children (children because the child axis is the default) of the current context node.
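
    The same contrast can be tried without opening Hamlet. The sketch below is only an illustration under stated assumptions: it uses parse-xml() to build a tiny stand-in document (two acts holding two and three scenes, a structure invented for this example and not taken from the play), and it uses the simple map operator to make each act the context item in turn, which plays the role of moving the cursor.

```xpath
(: Hypothetical miniature play: 2 acts holding 2 and 3 scenes :)
let $play := parse-xml(
  '<body><div><div/><div/></div><div><div/><div/><div/></div></body>'
)
return
  (: each act <div> becomes the context item in turn :)
  $play/body/div ! (
    count(//body/div/div),  (: absolute: all scenes, whatever the context :)
    count(div)              (: relative: scenes of the context act only :)
  )
```

    The result is the sequence 5, 2, 5, 3. The absolute path reports 5 both times, while the relative path reports 2 for the first act and 3 for the second.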

  2. Question: What is the difference between a path step and a predicate? To answer this question, give an example of an XPath expression that contains both path steps and predicates (at least one of each), explain how they are distinguished by spelling, and explain what each one contributes to the overall meaning of the XPath expression.

    Example: The expression //l[not(parent::sp)]/.. is an absolute path expression with two path steps, separated by a slash, with a predicate applied to the first path step. It selects all parents of <l> elements that are not <sp> elements.

    Path steps: The first step starts from the document node and selects all <l> elements on the descendant axis, that is, anywhere in the document. The second path step starts at each of the nodes selected by the first path step (each of those, in turn, becomes the current context for the second path step) and selects its parent. A path step, then, uses the results of the preceding path step as the context nodes for where it begins.

    Predicate: The code inside the square brackets is a predicate that is applied to the first path step to filter the sequence it selects. The //l selects all lines; the predicate filters the sequence of all lines to keep only those that do not have a parent of type <sp>. The value of the expression up to that point is the nodes selected by the path step that survive the filtering. Those survivors become the context nodes for the next path step, which is how the entire expression selects the parents of <l> elements when those parents are not of type <sp>.
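
    To watch the steps and the predicate work on something small, here is a hedged sketch that runs the same expression against a stand-in document built with parse-xml(). The sample markup, including the <lg> wrapper, is an assumption made up for this illustration and says nothing about how Hamlet is actually tagged.

```xpath
(: Hypothetical sample: one line inside a speech, two inside a line group :)
parse-xml('<div><sp><l>a</l></sp><lg><l>b</l><l>c</l></lg></div>')
  //l[not(parent::sp)]  (: step 1 plus predicate: keeps lines b and c :)
  /..                   (: step 2: the parent of each surviving line :)
  ! name()              (: report element names instead of nodes :)
```

    This returns the single string lg. Two lines survive the predicate, but they share a parent, and a path expression returns each node only once.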

  3. Question: Explain the difference between general comparison and value comparison, and provide XPath expressions (applied to Hamlet) that illustrate that difference.

    Value comparison definition and example: Value comparison (operators eq ne gt ge lt le) compares one thing on the left to one thing on the right. The XPath expression //sp[speaker eq 'Hamlet'] uses value comparison to select the speeches that have a child <speaker> element with a string value equal to the string Hamlet.

    General comparison definition and example: General comparison (operators = != > >= < <=) compares sequences of any length on the left to sequences of any length on the right. The comparison as a whole succeeds if it succeeds for any pair of items in the two sequences. For example, //sp[speaker = 'Hamlet'] compares the sequence of all <speaker> children of each <sp> element (each speech happens to have exactly one speaker child) to all strings in the sequence on the right (there is exactly one string in this example). Because this general-comparison example has sequences of only one item on either side of the comparison operator, it behaves the same way as value comparison.

    Difference: The most important difference between general and value comparison arises when one or both of the things being compared is a sequence of more than one item. The XPath expression //sp[speaker = ('Hamlet', 'Ophelia')] selects all speeches by either Hamlet or Ophelia because the comparison succeeds when any item to the left (there is just one for each speech) is equal to any item to the right (there are two). If we try to use value comparison here with //sp[speaker eq ('Hamlet', 'Ophelia')] we raise an error (Required cardinality of second operand of 'eq' is zero or one; supplied value has cardinality more than one) because we are not comparing one thing to one thing. This use of general comparison is a concise and idiomatic improvement over the more verbose alternative //sp[speaker eq 'Hamlet' or speaker eq 'Ophelia'].
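
    The any-pair rule can be checked on literal strings, with no document involved. The following is a minimal sketch. The names are just sample data, and the whole block is one XPath expression that returns a sequence of five booleans.

```xpath
(
  'Hamlet' eq 'Hamlet',                              (: true: one item to one item :)
  'Hamlet' = 'Hamlet',                               (: true: same result for singletons :)
  'Ophelia' = ('Hamlet', 'Ophelia'),                 (: true: matches the second item :)
  ('Horatio', 'Ophelia') = ('Hamlet', 'Ophelia'),    (: true: one pair matches :)
  ('Horatio', 'Gertrude') = ('Hamlet', 'Ophelia')    (: false: no pair matches :)
)
```

    Replacing = with eq in any of the last three comparisons raises a cardinality error instead of returning a value, because at least one operand is a sequence of more than one item.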

  4. Question: Explain why //sp/l ! count(.) returns 2809 instances of the integer value 1 and //sp/l => count() returns just one integer, the value 2809. (Note that the second path step is a lower-case letter L, representing a metrical line; it is not the digit one.) That is, what is the difference between the simple map and arrow operators in these examples?

    Don’t be distracted by the dot in the first example and the absence of a dot in the second. If you omit the dot from the first example or include it in the second you’ll raise an error, but that’s just a quirk and not the point of the question. The point is: What does each of the operators, simple map vs arrow, mean that causes them to produce different output results?

    Simple map operator: //sp/l ! count(.) says to find all line children of speeches and for each line we find we count the line. Because we are doing the counting for each line separately, the count is always equal to one. There are 2809 results because there are 2809 <l> element children of speeches in the play. This simple map example is synonymous with //sp/l/count(.).

    Arrow operator: //sp/l => count() says to find all line children of speeches and count the resulting sequence, that is, count all the lines we find together. There are 2809 lines, so we get a single result equal to 2809. This arrow example is synonymous with count(//sp/l).

    Difference: The simple map operator does the thing on the right (in this case, it counts) once for each item on the left (so it counts each line individually). The arrow operator does the thing on the right once for the entire sequence on the left, so it counts all lines together.
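
    A small sequence of strings shows the same behavior without the 2809 lines. This is a sketch on made-up sample data. The third line uses upper-case(), which is not part of the question, only to show a case where per-item processing is what you want.

```xpath
(
  ('to', 'be', 'or', 'not') ! count(.),       (: once per item: 1, 1, 1, 1 :)
  ('to', 'be', 'or', 'not') => count(),       (: once for the whole sequence: 4 :)
  ('to', 'be', 'or', 'not') ! upper-case(.)   (: once per item: 'TO', 'BE', 'OR', 'NOT' :)
)
```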

Part 2: Creating and using XPath expressions

The functions we used to answer the following questions include contains(), count(), distinct-values(), not(), sort(), string-join(). All of these are described in Michael Kay except sort() because it was introduced in XPath 3.1 and Mike’s book was written when 2.0 was the most recent version. The sort() function works the way you might expect: if you supply a sequence of strings as its only argument, it returns them sorted into alphabetical order. There may be more than one correct answer to some of the questions.

  1. Question: All speech (<sp>) elements should have @who attributes, but some (incorrectly) don’t. What XPath expression will select the speeches without @who attributes? What XPath expression will tell you how many there are? (The path expression you write for that last question has to return a single integer value. It’s about having XPath do the counting, and not just about how many there are.) (Sanity check: there are two such speeches.)

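    Answer: //sp[not(@who)]. This finds all speeches and filters them to keep only the ones that do not have a @who attribute. To have XPath count them, wrap that expression in count(), as in count(//sp[not(@who)]), or pass it to count() with the arrow operator, as in //sp[not(@who)] => count(). Either expression returns the single integer 2.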
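
    Both parts of the question (selecting and counting) can be tried on a stand-in fragment. The sketch below assumes a made-up <div> with four empty speeches, two of which lack @who. The attribute values are placeholders, not real speaker references from the play.

```xpath
(: Hypothetical sample: four speeches, two of them missing @who :)
let $scene := parse-xml(
  '<div><sp who="a"/><sp/><sp who="b"/><sp/></div>'
)
return (
  count($scene//sp[not(@who)]),     (: nested function call :)
  $scene//sp[not(@who)] => count()  (: same count with the arrow operator :)
)
```

    Both items in the result are the integer 2.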

  2. Question: There are scenes where Hamlet, the title character in the play, does not speak. What XPath expression will find those scenes? (Sanity check: there are 7 such scenes.)

    Answer: //div/div[not(descendant::speaker = 'Hamlet')]
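
    The predicate uses not(... = ...) rather than !=, and the difference matters because general comparison succeeds if any pair of items satisfies it. The following sketch illustrates this on a hypothetical two-scene act built with parse-xml(). The @n attributes are an assumption added only to label the scenes in the output.

```xpath
(: Hypothetical two-scene act; @n is added here only to label the scenes :)
let $play := parse-xml('
  <body>
    <div>
      <div n="1">
        <sp><speaker>Hamlet</speaker></sp>
        <sp><speaker>Horatio</speaker></sp>
      </div>
      <div n="2">
        <sp><speaker>Ophelia</speaker></sp>
      </div>
    </div>
  </body>')
return (
  (: no speaker in the scene equals 'Hamlet' :)
  $play//div/div[not(descendant::speaker = 'Hamlet')]/@n ! string(),
  (: at least one speaker in the scene differs from 'Hamlet' :)
  $play//div/div[descendant::speaker != 'Hamlet']/@n ! string()
)
```

    The first expression returns only "2". The second returns "1" and "2", because scene 1 contains Horatio, who is not equal to Hamlet, even though Hamlet also speaks there.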

  3. Question: Hamlet doesn’t speak in the scenes described in the previous question, but other characters do. What XPath expression will produce an alphabetized and deduplicated sequence of all speaker names in each of those scenes (one list per scene), formatted as comma-separated lists? For example, if the only characters who speak in a scene are Larry, Moe, and Curly, and each of them speaks several times, your XPath expression should output Curly, Larry, Moe.

    Your response has to proceed step by step and you need to give the intermediate steps, and not just the final complete expression. We recommend running a sanity check after each step. The steps are:

    1. Question: Start with the scenes where Hamlet doesn’t speak (the answer to the preceding question).

      Answer: //div/div[not(descendant::speaker = 'Hamlet')]

    2. Question: Modify the XPath expression in the first step to select all speakers in each of those scenes.

      Answer (we write these and other long answers on multiple lines to make them easier to read, but XPath doesn’t care whether they are written as one line or split across several):

      //div/div[not(descendant::speaker = 'Hamlet')]
      /descendant::speaker

    3. Question: Modify the XPath expression in the second step to deduplicate those speaker sequences, so that you find all distinct speakers in each of those scenes instead of all instances of all speakers, with repetitions.

      Answer:

      //div/div[not(descendant::speaker = 'Hamlet')]
      /(descendant::speaker 
      => distinct-values())

      We need the parentheses to deduplicate the sequences of speakers for each scene separately. The parentheses begin after the step that selects the scenes you care about, and because XPath proceeds step by step, each of the selected scenes, one by one, serves as the current context for whatever follows. The parentheses let you apply everything inside them to each of those context nodes separately, that is, to select the speakers for just one scene at a time and deduplicate (and, eventually, sort, and string-join) them.

      If you use the arrow operator and fail to wrap parentheses around the speakers for each scene and their subsequent processing, you wind up processing all speakers of all of the selected scenes together, instead of separately for each scene. For this step, for example, you want to remove duplicate speakers within a scene, but not when the same person speaks in more than one of the selected scenes.

      This is one of the rare situations where it may be easier to read the expression if you use nesting instead of the arrow operator:

      //div/div[not(descendant::speaker = 'Hamlet')]
      /distinct-values(descendant::speaker)

    4. Question: Modify the XPath expression in the third step to sort the sequence of deduplicated speakers for each scene.

      Answer:

      //div/div[not(descendant::speaker = 'Hamlet')]
      /(descendant::speaker 
      => distinct-values()
      => sort())

      or

      //div/div[not(descendant::speaker = 'Hamlet')]
      /sort(distinct-values(descendant::speaker))

    5. Question: Modify the XPath expression above to form the sorted sequence of unique speaker names for each scene into a comma-separated list.

      When you use the arrow operator for a function that takes one argument, like count() (the one argument is a sequence of items to count), the parentheses must be empty. For example, //sp => count() counts all of the speeches in the play. This is part of a general rule that after the arrow operator you never supply the first argument to the function explicitly. You can see this rule at work in Part 1, Question 4, above; it’s why you can’t write a dot inside count() after the arrow operator, and if you do, you’ll get an error about how the XPath processor thinks the dot is the second argument to the function (Cannot find a two-argument function …). The first argument after the arrow operator is automatically implied, so the first one you specify literally, between the parentheses, will be interpreted as the second argument.

      When you use the arrow operator for a function that takes more than one argument (such as translate()), you have to supply all arguments except the first, since only the first is automatic. For example, (//speaker)[1] => translate('r', 'X') selects the first <speaker> element in the entire play (the value is Bernardo) and passes it into the translate() function to replace all instances of r with X. The translate() function takes three arguments (see our “XPath functions we use most” tutorial for details), but after the arrow operator we specify only the second and third ones. The output of this XPath expression is BeXnaXdo.

      Answer:

      //div/div[not(descendant::speaker = 'Hamlet')]
      /(descendant::speaker 
      => distinct-values()
      => sort()
      => string-join(', '))

      or

      //div/div[not(descendant::speaker = 'Hamlet')]
      /string-join(sort(distinct-values(descendant::speaker)), ', ')
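
    To see what the parentheses contribute, the whole pipeline can be run on a small stand-in document. The sketch below assumes a hypothetical act with two scenes whose speakers are borrowed from the Larry, Moe, and Curly example in the question. It runs the final expression once with the parentheses and once without them.

```xpath
(: Hypothetical act: scene 1 has Moe, Curly, Moe; scene 2 has Larry, Curly :)
let $play := parse-xml('
  <body>
    <div>
      <div>
        <sp><speaker>Moe</speaker></sp>
        <sp><speaker>Curly</speaker></sp>
        <sp><speaker>Moe</speaker></sp>
      </div>
      <div>
        <sp><speaker>Larry</speaker></sp>
        <sp><speaker>Curly</speaker></sp>
      </div>
    </div>
  </body>')
return (
  (: with parentheses: one list per scene :)
  $play//div/div/(descendant::speaker
    => distinct-values() => sort() => string-join(', ')),
  (: without parentheses: one list for all scenes together :)
  $play//div/div/descendant::speaker
    => distinct-values() => sort() => string-join(', ')
)
```

    The result is three strings: "Curly, Moe" and "Curly, Larry" from the parenthesized version (one per scene), followed by "Curly, Larry, Moe" from the version without parentheses, where Curly is deduplicated across scenes.
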

  4. Question: There is one scene where Hamlet not only doesn’t speak, but isn’t even mentioned. What XPath expression will find that scene?

    Answer: //div/div[not(contains(., 'Hamlet'))]
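
    This works because contains() is applied to the string value of each scene, which includes every speaker name and every spoken word inside it. Here is a minimal sketch on a hypothetical two-scene act. The lines of dialogue and the @n labels are invented for the illustration.

```xpath
(: Hypothetical two-scene act; @n only labels the scenes :)
let $play := parse-xml('
  <body>
    <div>
      <div n="1">
        <sp><speaker>Horatio</speaker><l>Where is Hamlet?</l></sp>
      </div>
      <div n="2">
        <sp><speaker>Ophelia</speaker><l>Good night.</l></sp>
      </div>
    </div>
  </body>')
return $play//div/div[not(contains(., 'Hamlet'))]/@n ! string()
```

    The result is "2". Hamlet does not speak in scene 1 either, but he is mentioned there, so that scene is filtered out.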

Part 3: Optional extra-credit questions

  1. Question: What’s your favorite XPath function that hasn’t been used anywhere in this test? Give an example of how you could use it to explore or process Hamlet and explain how it works in your example. The example doesn’t have to be genuinely useful, but it does have to work and your explanation has to be accurate.

  2. Question: The XPath expression

    //l[@xml:id="sha-ham101014"]

    selects a line in the play that reads:

    <l xml:id="sha-ham101014">I think I hear them. Stand, ho! Who is
        there? <stage>Enter HORATIO and MARCELLUS.</stage>
    </l>

    If we run the expression:

    //l[@xml:id="sha-ham101014"]/string()

    it returns the string value of the <l> element, that is, all of the text inside it, at any level, without the markup. But if we run:

    //l[@xml:id="sha-ham101014"]/text()

    it returns the spoken text, but without the textual content of the <stage> child element. Why do these two XPath expressions return different results and what is each of them doing? (Hint: text() is not a function, even though it looks like one. See the discussion of KindTest in Kay, pp. 697–98.)

    Answer: The XPath expression text() matches text nodes, so the second expression above finds the line in question and selects all of its text-node children. The words of the stage direction are in a text node that is a child of <stage>, so that text node is not a child of the line, which means that the expression doesn’t select it.

    The first expression above applies the string() function to the line, and that string() returns the string value of the element, which is a concatenation of all of its text-node descendants (not just children). That is, it effectively throws away any internal markup and selects all of the text anywhere inside. That means that it includes the content of the stage direction.
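
    The children-versus-descendants distinction can be confirmed on a simplified line. The sketch below uses a hypothetical <l> built with parse-xml(). Unlike the real line, it has no whitespace around the stage direction, which keeps the output easy to read.

```xpath
(: Hypothetical line with an embedded stage direction :)
let $line := parse-xml(
  '<l>Who is there?<stage>Enter HORATIO.</stage></l>'
)/l
return (
  $line/string(),          (: all descendant text: 'Who is there?Enter HORATIO.' :)
  $line/text() ! string(), (: text-node children only: 'Who is there?' :)
  count($line/text()),     (: 1 :)
  count($line//text())     (: 2: the second text node is inside <stage> :)
)
```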

  3. We can measure the length of a metrical line (<l> element) in the play in terms of character count or word count, but to measure the length of the spoken line we have to exclude any embedded stage directions from what we’re counting. For the line in extra-credit question #2, above:

    • Question: What XPath expression will return the length of the spoken line in character count, excluding the embedded stage direction?

      Answer:

      //l[@xml:id="sha-ham101014"]/text()
      => string-join()
      => string-length()

      The line contains two text nodes because there’s a whitespace-only text node between the </stage> and </l> end-tags. The string-length() function accepts only a single string as its argument, so we can’t apply it to the two text nodes at once; instead we merge them into one string before measuring the length. We can’t use concat() to merge them because, even though they’re two text nodes, they form one sequence and therefore constitute a single function argument, and concat() requires at least two arguments. For that reason we concatenate them using the one-argument version of string-join(), without a separator. That returns a single string, and we measure its character count with string-length().

      As an alternative we could measure the length of each text node and then add them up:

      //l[@xml:id="sha-ham101014"]/text()
      ! string-length()
      => sum()

    • Question: When we’re measuring length as character count, we don’t want to count all of the whitespace characters introduced by pretty-printing. What XPath expression will return the length of the spoken line above in character count after removing extraneous whitespace, so that each word is separated from neighboring words by only a single space character and there are no extra leading or trailing whitespace characters?

      Answer:

      //l[@xml:id="sha-ham101014"]/text()
      => string-join()
      => normalize-space()
      => string-length()

      After we concatenate the text nodes we use the normalize-space() function to strip leading and trailing whitespace and convert all internal whitespace sequences to single space characters.

    • Question: What XPath expression will return the length of the spoken line in word count, excluding the embedded stage direction?

      Answer:

      //l[@xml:id="sha-ham101014"]/text()
      => string-join()
      => tokenize()
      => count()

      The one-argument version of tokenize() splits a string into tokens (that is, words) on sequences of whitespace.
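
    The three measurements above can be compared side by side on literal strings. In this sketch, two made-up strings stand in for the two text-node children of a line: some spoken words with a doubled space and a trailing space, and a whitespace-only string like the one after the stage direction. The strings are assumptions chosen so the counts are easy to verify by hand; they are not the real text nodes from the play.

```xpath
(: Two strings standing in for the text-node children of a line.
   Hypothetical sample data: 22 characters and 2 characters. :)
let $nodes := ('I think  I hear them. ', '  ')
return (
  $nodes => string-join() => string-length(),                      (: 24 :)
  $nodes ! string-length() => sum(),                               (: 22 + 2 = 24 :)
  $nodes => string-join() => normalize-space(),                    (: 'I think I hear them.' :)
  $nodes => string-join() => normalize-space() => string-length(), (: 20 :)
  $nodes => string-join() => tokenize() => count()                 (: 5 :)
)
```

    The raw character count (24) includes the pretty-printing whitespace, the normalized count (20) does not, and the word count (5) is unaffected by how much whitespace separates the words.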