Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-03-26T16:30:55+0000


Test #4: XPath test answers

The task below uses Bad Hamlet to create an alphabetized list of all of the distinct words spoken by Hamlet in Act 5, Scene 1. The steps are ordered so that each builds on the preceding one, and if you get stuck on a step (required or extra-credit; for example, if you aren’t able to convert words to lower case or remove punctuation), it’s okay to skip it and proceed to the next step (but tell us that you’ve done that). Submit your answers in a markdown document, with backticks around the XPath expressions, specifying the XPath expression for each step (they are cumulative, so each will look like a modified version of the immediately preceding one). That is, your answer will look like a numbered list of XPath expressions, one expression for each of the steps below.

We suggest working in the XPath/XQuery Builder view (accessible from the <oXygen/> menus at Window → Show View) because the expression gets very long. (To run an expression in the Builder, click on the red sideways triangle at the top of the Builder panel. The Enter key just creates a new line; it doesn’t execute the expression.) Using the simple mapping operator (!) and the arrow operator (=>), where possible, will improve the legibility, but it is possible to complete these steps without those operators. You can split your expression into multiple lines in the Builder, which will also improve legibility.

Required steps

  1. Write an XPath expression that returns all of Hamlet’s lines (<l> elements) in Act 5, Scene 1. There are 43 of them. Our solution does not use any functions.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']//l

    We use the descendant, rather than child, axis because some of Hamlet’s lines are descendants, but not children, of his speeches. For example:

    <sp who="Hamlet">
        <speaker>Hamlet</speaker>
        <ab xml:id="sha-ham202409" n="409">Why,</ab>
        <lg type="song">
            <l xml:id="sha-ham202410" n="410">One fair daughter, and no more,</l>
            <l xml:id="sha-ham202411" n="411">The which he loved passing well.</l>
        </lg>
    </sp>
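
The difference between the two axes can also be checked outside of <oXygen/>. The following is a minimal Python sketch, not part of the original solution, that parses the fragment quoted above with the standard library's ElementTree and compares a child-axis lookup with a descendant-axis lookup. The variable names are illustrative assumptions, and ElementTree supports only a small subset of XPath.

```python
import xml.etree.ElementTree as ET

# The <sp> fragment quoted above, copied verbatim.
SPEECH = """<sp who="Hamlet">
    <speaker>Hamlet</speaker>
    <ab xml:id="sha-ham202409" n="409">Why,</ab>
    <lg type="song">
        <l xml:id="sha-ham202410" n="410">One fair daughter, and no more,</l>
        <l xml:id="sha-ham202411" n="411">The which he loved passing well.</l>
    </lg>
</sp>"""

sp = ET.fromstring(SPEECH)

# Child axis: only <l> elements directly under <sp>.
child_lines = sp.findall("l")

# Descendant axis: <l> elements at any depth under <sp>.
descendant_lines = sp.findall(".//l")

print(len(child_lines), len(descendant_lines))
```

In this fragment the child-axis lookup finds nothing, because both <l> elements sit inside the <lg> wrapper, while the descendant-axis lookup finds both.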
  2. Modify your answer above to provide both the <l> elements and the anonymous blocks (<ab> elements). There are 91 anonymous blocks, so the total number of lines of both types is 134. Our solution does not use any functions.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab)
  3. Modify your answer to the preceding step to break the text into a single list of all of the words in all of those lines. Our solution uses the tokenize() function. This word list will include words inside the four stage directions that are children of the line elements in this scene. For extra credit, exclude those from your result, so that you are returning only the words spoken by Hamlet.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+')

    Some of you used normalize-space() to convert all sequences of whitespace characters to a single space character and then tokenized on a space (tokenize(., ' ')), instead of tokenizing on \s+, which matches strings of one or more whitespace characters all at once. Either approach is fine; the only trick is that if you choose to tokenize on a single space character, you should normalize the whitespace first, since otherwise you’ll wind up with extra zero-length strings. The extra zero-length strings will turn out to be harmless as long as you get to the last extra-credit step and filter out zero-length strings, but if, for example, you were to count words at this point, you’d get a false result because the zero-length strings would be counted as if they were words.
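
The effect of the two tokenization strategies can be illustrated with a rough Python analogue (an assumption-laden sketch, not XPath): re.split stands in for tokenize(), and " ".join(text.split()) stands in for normalize-space(). The sample string is invented for the illustration.

```python
import re

# Invented sample with a double space, a newline, and a run of three spaces.
text = "To be,  or\n   not to be"

# Analogue of tokenize(., '\s+'): split on runs of whitespace.
on_runs = re.split(r"\s+", text)

# Analogue of tokenize(., ' ') without normalizing first:
# every extra space produces a zero-length string.
on_single_space = text.split(" ")

# Analogue of normalize-space(): collapse whitespace runs to one space.
normalized = " ".join(text.split())
normalized_then_single = normalized.split(" ")

print(on_runs)
print(on_single_space)
print(normalized_then_single)
```

Splitting on a single space without normalizing leaves zero-length strings (and an embedded newline) in the result, so a word count taken at that point would be inflated.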

    To remove words in the stage directions, add a path step to get only text node children of the lines and anonymous blocks:

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab)/text()
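
To see why selecting only the text node children drops the stage directions, here is a small Python sketch using ElementTree. The <l> element below is invented markup for illustration (it is not copied from Bad Hamlet), and the approach is an approximation: ElementTree has no text() step, so the text node children are reconstructed from the element's text and its children's tail values.

```python
import xml.etree.ElementTree as ET

# Hypothetical line containing a stage direction as a child element.
line = ET.fromstring(
    "<l>Hold off <stage>Leaping into the grave</stage> thy hand.</l>"
)

# String value of the whole element: includes the stage direction.
all_words = "".join(line.itertext()).split()

# Text node children only: the text before the first child, plus the
# text that follows each child element (its "tail").
direct_text = [line.text or ""] + [child.tail or "" for child in line]
spoken_words = " ".join(direct_text).split()

print(all_words)
print(spoken_words)
```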
  4. Modify your answer to return all of those words in lower case, that is, convert words with initial capital letters to all lower case. Our solution uses the lower-case() function.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+') 
    ! lower-case(.)

Extra-credit steps

  1. Modify your answer to strip all punctuation except hyphens and apostrophes. Our solution uses the replace() function.
    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+') 
    ! lower-case(.) 
    ! replace(., "[^-a-z']", '')

    The replace() function here uses a regex with a negated character class to match all characters that are not hyphens, lower-case letters, or apostrophes and replaces them with nothing, that is, removes them. Because our regex includes an apostrophe, we wrap the regex in quotation marks, rather than apostrophes. A hyphen normally has a special meaning inside a character class, where it indicates a range, and we use it with that meaning in the a-z part of our regex here. But since we also want to be able to use it in its literal meaning (that is, to match a literal hyphen character), we make it the first character in our negated character class (that is, the first character after the caret [^]). A hyphen inside a character class naturally has its literal meaning when it’s the first character in the class, since in that context it cannot represent a range because no character specification precedes it, and ranges by definition have to specify their start and end points. Alternatively, we could have put the hyphen elsewhere and escaped it by preceding it with a backslash.

    Some of you used \w instead of a-z in the negated character class. This produces the same result with this data, but could differ under other circumstances. For example, digits count as word characters, and are therefore matched by \w, but not by a-z. See Kay, pp. 921 and 924–25 for details.
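
The difference between a-z and \w in the negated character class can be tried out with Python's re module, as in the sketch below. This is only an analogue: Python's \w is not defined identically to the XPath one, but both treat digits as word characters. The function names and sample tokens are assumptions made for the illustration.

```python
import re

def strip_with_az(word):
    # Keep only hyphens, lower-case ASCII letters, and apostrophes.
    return re.sub(r"[^-a-z']", "", word)

def strip_with_w(word):
    # Keep hyphens, word characters (letters, digits, ...), and apostrophes.
    return re.sub(r"[^-\w']", "", word)

for token in ["o'er-leap!", "act5,"]:
    print(token, strip_with_az(token), strip_with_w(token))
```

The two patterns agree on a token that contains only letters, hyphens, apostrophes, and punctuation, and disagree as soon as a digit appears.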

  2. Modify your answer to remove duplicate words. Our solution uses the distinct-values() function.
    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+')
    ! lower-case(.) 
    ! replace(., "[^-a-z']", '') 
    => distinct-values()
  3. Modify your answer to sort the distinct words alphabetically. Our solution uses the sort() function.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+') 
    ! lower-case(.) 
    ! replace(., "[^-a-z']", '') 
    => distinct-values()
    => sort()
  4. The first word in our alphabetized list is just a blank, so remove it. Our solution uses a predicate (checking that the word is not equal to the empty string), but no additional functions.

    //body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+') 
    ! lower-case(.) 
    ! replace(., "[^-a-z']", '')[. ne '']
    => distinct-values()
    => sort()

    If we want to filter out the empty string at the very end, after sorting, instead of immediately after the replace() operation, we need to wrap everything that precedes the predicate in parentheses.

    (//body/div[5]/div[1]/sp[speaker eq 'Hamlet']/(descendant::l | descendant::ab) 
    ! tokenize(.,'\s+') 
    ! lower-case(.) 
    ! replace(., "[^-a-z']", '') 
    => distinct-values()
    => sort())[. ne '']
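
For readers who want to trace the whole pipeline step by step, the following Python sketch mirrors it on a few invented strings (they are illustrative input, not the output of the XPath expression against Bad Hamlet). The function name distinct_words is an assumption; each line of the function corresponds loosely to one stage of the expression above.

```python
import re

def distinct_words(lines):
    # tokenize(., '\s+'): split every line on runs of whitespace.
    tokens = [tok for line in lines for tok in re.split(r"\s+", line)]
    # lower-case(.) followed by replace(., "[^-a-z']", '').
    cleaned = [re.sub(r"[^-a-z']", "", tok.lower()) for tok in tokens]
    # [. ne ''] , distinct-values(), and sort(), in that order.
    return sorted({word for word in cleaned if word != ""})

# Invented sample: a stray "!" token and a leading space both
# produce zero-length strings that have to be filtered out.
sample = ["Alas, poor Yorick !", " I knew him, Horatio", "Poor Yorick"]
print(distinct_words(sample))
```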

    Why parentheses behave differently with the simple mapping and arrow operators

    You can remember that if you want to apply a predicate to the output of an arrow operation you always need to parenthesize everything before the predicate (although that is not the case with a simple mapping expression) and not worry about the reason, but in case you’re curious, here are the details.

    Parentheses and the arrow operator

    The reason we can write something like sort($things-to-sort)[1] but not $things-to-sort => sort()[1] (the second of these expressions raises an error) is that passing input into a function with the arrow operator is not exactly synonymous with writing the input inside the function parentheses. The difference is that the right side of the arrow operator is not the sequence created by applying the function to the input; it is the function itself. Predicates can be applied to sequences, including sequences output by functions, but not to functions themselves. We can use parentheses in the variant ($things-to-sort => sort())[1] to form the output sequence first and then apply the predicate to that sequence.

    Parentheses and the simple mapping operator

    Perhaps confusingly, that distinction does not apply with the simple mapping operator, so we can write a predicate after the right side of a simple mapping operation, without using any extra parentheses to group the results, and it will apply to filter the items in the output sequence. For that reason, an expression like:

    $strings-to-process ! replace(., "[^-a-z']", '')[. ne '']

    does not require parentheses around the expression before the predicate.

    The reason simple mapping operations behave differently than arrow operations in this respect is that the right side of the simple mapping operator is an expression, rather than a function. It may look like a function because expressions can contain functions, but we can see that it’s an expression because we can write, for example, (//sp) ! ./l to return all of the <l> children of each <sp> element in our document; note that the right side here is not (and does not include) a function. For that matter, we can write (//sp) ! "Hi, Mom!" to return the string Hi, Mom! once for each speech in the document. This type of operation is not possible with the arrow operator because the right side of the arrow operator must be a function.

    Although simple mapping operations that will use predicates on the right side do not require parentheses before the predicate (unlike arrow operations), they do allow parentheses, and there is a difference in meaning (although, confusingly, there is not always a difference in the final output). Consider:

    ("obdurodon", "steropodon") ! string-to-codepoints(.)

    This expression breaks up two strings into sequences of individual numerical values, one per character, producing:

    111 98 100 117 114 111 100 111 110 115 116 101 114 111 112 111 100 111 110

    This looks like one sequence of integers, but it’s really two, side by side, one for each input string. We can see that it’s two sequences if we apply a numerical predicate to the right side with:

    ("obdurodon", "steropodon") ! string-to-codepoints(.)[1]

    Because we apply the predicate to the output of exploding each of the two strings into numbers, one string at a time, we apply the predicate to each separately, and it keeps the first integer in each of the two sequences of integers. This expression, then, outputs two items, 111 115, which are the first numerical values output for each of the two input strings.

    If, though, we parenthesize the expression before applying the predicate, we form it into a single sequence before filtering, which means that we get only the first item in the single, long, fused sequence: 111. The expression that produces that is:

    (("obdurodon", "steropodon") ! string-to-codepoints(.))[1]
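
The same contrast can be reproduced in Python as a rough analogue, with ord() standing in for string-to-codepoints() and list slicing standing in for the [1] predicate. The variable names are illustrative assumptions.

```python
words = ["obdurodon", "steropodon"]

# Predicate applied per input string: keep the first codepoint of each.
first_of_each = [cp for w in words for cp in [ord(c) for c in w][:1]]

# Parenthesized version: fuse everything into one sequence first,
# then keep the first item of that single sequence.
fused = [ord(c) for w in words for c in w]
first_of_fused = fused[:1]

print(first_of_each)
print(first_of_fused)
```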

    The effect of the parentheses is easy to see with the numerical predicates in the example above. In the example in our actual task, which involves removing zero-length strings from a sequence of strings, there is also a difference in what happens internally depending on where we perform the filtering (each string individually or all of the strings together), but we get the same final output, which can obscure that something different is going on in the two cases. Here is a simplified example that illustrates how two different internal operations produce the same output in this type of situation. Consider:

    ("ab", "ae", "bd") => count()

    returns the integer value 3, which confirms that we have a sequence of three strings. If we want to strip the vowels out of each string and then keep only the ones that are not reduced to zero length, that operation will return ("b", "bd"). We can do that as follows:

    ("ab", "ae", "bd") ! replace(., '[aeiou]', '')[. ne '']

    This performs the replacements on each string, one by one, and then filters each one individually. But we could, alternatively, do:

    (("ab", "ae", "bd") ! replace(., '[aeiou]', ''))[. ne '']

    This also performs the replacements on each string, one by one, but the parentheses form the three outputs of the replacement operations into a single sequence that looks like ("b", "", "bd"), which it then filters. The result is the same in this case because filtering out empty strings has the same effect whether you do it on each string individually as a sort of entrance examination to get into the output sequence, or form the output sequence initially with all strings, including the empty one, and then filter out the empty one.
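
A short Python sketch of the two strategies, offered as an analogue rather than as XPath, makes the intermediate sequence visible. The helper strip_vowels and the variable names are assumptions introduced for the illustration.

```python
import re

strings = ["ab", "ae", "bd"]

def strip_vowels(s):
    return re.sub(r"[aeiou]", "", s)

# Filter each result individually, before it joins the output.
filtered_per_item = [r for s in strings for r in [strip_vowels(s)] if r != ""]

# Build the full intermediate sequence first, then filter it.
fused = [strip_vowels(s) for s in strings]
filtered_after = [r for r in fused if r != ""]

print(fused)
print(filtered_per_item, filtered_after)
```

The intermediate list contains the empty string only in the second strategy, but both strategies end with the same two items.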