Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-04-20T00:07:23+0000


Test 4: XPath answer guide

The task (a reminder)

For the XPath test, you will be using XPath to take a closer look at individual characters in Shakespeare’s Hamlet. We recommend working in the XPath/XQuery Builder view instead of the small XPath toolbar at the top of your <oXygen/> window because some of these expressions can get long, and it is easier to stay organized if you can see the entirety of your XPath expression as you are typing. Like all of our tests, this one is open-book, which means that you can consult notes, books, the Internet and other resources, except that you cannot receive any assistance from another person. Should you get stuck in a way that does not respond to your best rubber duck debugging efforts, feel free to post an inquiry in our Slack workspace and we’ll try to point you in the right direction. You should submit your answers (the full XPath expressions, not just the result of evaluating them) in a properly formatted markdown document. That includes surrounding XPath expressions in backticks.

Using the version of Hamlet that we have been using for all the previous XPath assignments (http://dh.obdurodon.org/bad-hamlet.xml), complete the following tasks. There are alternative good solutions for some of them, so you will not need to use all of the following functions, but some that we used include avg(), distinct-values(), matches(), normalize-space(), round(), sort(), string-join(), tokenize(), and translate(). Don’t guess at how these work; if you aren’t already familiar with them, look up the number of arguments they require and what each argument means in Michael Kay or an alternative reference.

Here are two details to keep in mind:

Required

Part 1

  1. Find all of Hamlet’s spoken lines (which may be represented by <l> or <ab> elements) in the play and select only those that begin with the word I (not just the letter I). The result should be a sequence of elements of type <l> or <ab>.

    Solution
    //(l | ab)[ancestor::sp[@who eq 'Hamlet']][starts-with(normalize-space(.), 'I ')]

    This expression returns a sequence of 66 <l> and <ab> elements. We start at the document node and search everywhere for both <l> and <ab> elements, using the union operator (|) inside parentheses in //(l | ab) to create a single sequence of all elements of those two types. We then filter those nodes to keep only the lines that have an ancestor <sp> element with a @who value equal to the string Hamlet. A second predicate keeps only the lines that start with I followed by a single space, which ensures that we return lines that begin with the word I and not merely the letter I. Because some lines in the XML begin with spurious space characters, we can't simply ask for lines that start with I followed by a space, since we would miss the lines that start with white space followed by I. To handle this white-space irregularity, we pass normalize-space(.) (the current line with leading, trailing, and repeated white space cleaned up) to starts-with() as its first argument and specify the string 'I ' (I followed by a space) as its second argument.
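    The white-space handling described above can be sketched outside XPath. The following is a rough Python analogue, not part of the expected answer: the helper names and the sample strings are invented for illustration, and " ".join(s.split()) is only an approximate stand-in for normalize-space(), since Python treats a few more characters as white space than XPath does.

```python
def normalize_space(s: str) -> str:
    """Rough stand-in for XPath normalize-space(): trim and collapse white space."""
    return " ".join(s.split())


def begins_with_word_i(line: str) -> bool:
    """True when the normalized line starts with 'I' followed by a space."""
    return normalize_space(line).startswith("I ")


# Invented sample lines, not taken from the play's XML.
samples = [
    "I am glad of it.",
    "  \n  I do not know.",   # leading white space, as pretty-printing might add
    "In faith, my lord.",    # begins with the letter I, not the word I
    "If it be so.",
]
flags = [begins_with_word_i(s) for s in samples]
print(flags)
```

    Testing for 'I ' rather than 'I' is what rejects the last two samples, and normalizing first is what accepts the second one.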

    Alternatively, you can check for lines that begin with the word I with:

    • [tokenize(.)[1] eq "I"]
    • [matches(., '^\s*I ')]

    tokenize(), by default, splits strings into word tokens on sequences of white-space characters, and the numerical predicate filters the tokens to keep only the first one for the current context (every spoken Hamlet line). We test if that token is equal to the string I, and if it is, the node to which it belongs is returned in the results list. matches() takes a string as its first argument and a regex pattern as its second argument. In this case, we use the matches() function to operate on the current context (each line spoken by Hamlet) and match the lines that start with zero or more spaces followed by I followed by a single space.
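    To compare the two alternatives, here is a hedged Python sketch that imitates them. str.split() stands in for single-argument tokenize() and re.search() for matches(); the function names and sample strings are invented for illustration. The last sample shows an edge case where the two tests could disagree: a line in which I is followed directly by a newline. Whether such a line occurs in the Hamlet file is not checked here.

```python
import re


def first_token_is_i(line: str) -> bool:
    """Imitates [tokenize(.)[1] eq 'I']: split on white space, test the first token."""
    tokens = line.split()
    return bool(tokens) and tokens[0] == "I"


def regex_begins_with_i(line: str) -> bool:
    """Imitates [matches(., '^\\s*I ')]: optional white space, then I, then a space."""
    return re.search(r"^\s*I ", line) is not None


# Invented sample lines, not taken from the play's XML.
samples = ["I am", "  I am", "In faith", "I\n  know"]
by_token = [first_token_is_i(s) for s in samples]
by_regex = [regex_begins_with_i(s) for s in samples]
print(by_token)
print(by_regex)
```

    The token-based test accepts the last sample because any white space ends a token, while the regex as written requires a literal space character after I.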

    In all of these methods we take advantage of the fact that although some lines of speech contain stage directions, the stage direction happens never to be first. This means that if I is at the beginning of one of these lines, it is spoken text, and not the beginning of a stage direction. A more robust approach (one that would work properly if the first word inside one of Hamlet’s lines was I, but it was inside a stage direction, and not part of the spoken content) would perform the operation below (in 2) to remove stage directions and only then check for lines that begin with I.

    In general, a lot of you used an expression like //sp/* to locate all the spoken lines in the play. This ends up working for Hamlet’s spoken lines because all of his lines are children of <sp> elements, but some <l> elements, as we know from our XPath assignments, are grandchildren of <sp> elements instead of direct children. Because we do not know where in the hierarchy Hamlet’s spoken lines are, we need to account for any location in our expression. One good way to do this is to not think about <sp> elements and locate all the <l> and <ab> elements in the entire document with something like //(l | ab). From here, you can apply predicates to further filter down the resulting lines.

  2. The elements you select above all contain text nodes, which represent spoken text, but some also contain stage directions (see the example above). Extend your XPath expression above to return just the spoken text from each of the lines, without any accompanying stage directions. The result will be plain text, without any markup. (Hint: See above about selecting text nodes.)

    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [starts-with(normalize-space(.), 'I ')]
    /string-join(text(), ' ')

    This expression returns the same 66 lines as before, only now the results are strings instead of element nodes. We need the string-join() because there is one <l> spoken by Hamlet that contains two text nodes, and without the string-join() each would be considered a separate result:

    <l>I say, away! Go on; I'll follow thee.
      <stage>Exeunt Ghost and Hamlet.</stage>
    </l>

    There is a text node before the stage direction and one after the stage direction, but only the one before the stage direction contains text other than white space.
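    The two text nodes can be made visible with a small Python sketch using the standard-library ElementTree parser. The markup in the snippet is an assumption based on the description above (an <l> with one <stage> child), and the exact white space in the real file may differ. ElementTree keeps the text before the first child in .text and the text after each child in that child's .tail, which together correspond to what text() selects.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the line discussed above; white space in the real file may differ.
snippet = """<l>I say, away! Go on; I'll follow thee.
  <stage>Exeunt Ghost and Hamlet.</stage>
</l>"""

line = ET.fromstring(snippet)

# Collect the text-node children: the text before the first child element,
# then the text that follows each child element. Missing pieces are skipped.
candidates = [line.text] + [child.tail for child in line]
text_nodes = [t for t in candidates if t]

joined = " ".join(text_nodes)      # like string-join(text(), ' ')
spoken = " ".join(joined.split())  # like normalize-space(.)
print(len(text_nodes), repr(spoken))
```

    The second text node consists only of white space, so after joining and normalizing, a single string of spoken text remains and the stage direction is gone.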

  3. There is a lot of extra white-space that gets included with these lines because of pretty-printing. For example, where the XML contains:

    A little more than kin, and less than
      kind.

    the pretty-printing introduces a newline and some extra spaces for indentation. Extend your XPath expression above (the one that selects only the text nodes inside lines, but not stage directions) to remove this extra space, so that each spoken line of text will be continuous, with single space characters between words.

    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [starts-with(normalize-space(.), 'I ')]
    /string-join(text(), ' ') 
    ! normalize-space(.)

    Again, we return the same number of results, but this time the text does not contain any extra space beyond the standard single space between word tokens. We use the simple mapping operator (!) to apply normalize-space() to each string produced by string-join(), so that each resulting string contains only the white space that is to be expected within a sentence: a single space between words.

  4. Modify the preceding XPath expression to return the length in characters of each line spoken by Hamlet that begins with I (after ignoring stage directions and removing extra whitespace). The result will be a sequence of integers, each representing the character count of a single line of speech.

    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [starts-with(normalize-space(.), 'I ')]/
    string-join(text(), ' ')
    ! normalize-space(.)
    ! string-length(.)

    As before, we are still operating on the same number of lines. This time around, our modification uses the simple mapping operator to apply the string-length() function to the current context, that is, each of the 66 lines. Passing each line into string-length() returns the length of the string in characters of each line, so our results list is a sequence of integers.

  5. Write an XPath expression to compute the average length (in character count) of the lines spoken by Hamlet that begin with I. If the value is not already an integer, round it to the nearest integer value.

    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [starts-with(normalize-space(.), 'I ')]/
    string-join(text(), ' ')
    ! normalize-space(.) 
    ! string-length(.)
    => avg()
    => round()

    We have the string length of each line from the previous expression, and we can use these numbers to find the average length of a Hamlet line beginning with I. The XPath avg() function takes care of summing the sequence of integers and dividing by the number of lines, so all we have to do is supply the sequence of string lengths to the function. We use the arrow operator (=>) here instead of the simple mapping operator (!) because we are no longer operating on each one of the lines individually. Rather, we want to apply the function once, using the entire sequence of string lengths as the input. We get a decimal value of 38.28787878 (repeating), but we want an integer value for the average. The round() function takes a single argument, in this case the value returned by avg(), and rounds it to the closest integer value. We find that there is an average of 38 characters per Hamlet line beginning with the word I. If you didn't join the two text nodes of the line containing a <stage> element, the average will come out slightly lower, at 37.71641791, because the white-space-only text node becomes an additional item (of length zero after normalization), which increases the number of items by which the sum of string lengths is divided. Rounding to the nearest integer with the round() function, though, will yield the same value of 38.
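    The arithmetic above can be checked with a short Python sketch. The total of 2527 characters is not stated in this guide; it is inferred here from 38.2878… × 66. The helper xpath_round() is a hypothetical name for a function that imitates XPath round(), which rounds ties upward, whereas Python's built-in round() rounds ties to the nearest even integer. The two toy strings are invented.

```python
import math


def xpath_round(x: float) -> int:
    """Imitates XPath round(): nearest integer, with ties rounded upward."""
    return math.floor(x + 0.5)


# Per-item step (like the simple mapping operator !): one length per string.
strings = ["I am", "I do not know"]          # invented toy data
lengths = [len(s) for s in strings]

# Whole-sequence step (like the arrow operator =>): one result for the sequence.
toy_average = sum(lengths) / len(lengths)

TOTAL_CHARS = 2527                 # inferred from the averages quoted above
avg_joined = TOTAL_CHARS / 66      # one item per line
avg_unjoined = TOTAL_CHARS / 67    # one extra zero-length item

print(lengths, toy_average, xpath_round(toy_average))
print(avg_joined, avg_unjoined)
print(xpath_round(avg_joined), xpath_round(avg_unjoined))
```

    Both averages round to the same integer, which is why forgetting the string-join() changes the unrounded value but not the final answer.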

Part 2

Who speaks in the fifth act of the play?

  1. Return a sequence of all unique speakers in Act V.
    Solution
    //body/div[5]//speaker => distinct-values()

    We begin by locating all the acts in the play with //body/div. This returns the five different acts, but we only care about the fifth one. To return only the fifth act, we apply a numerical predicate to our previous expression that filters the acts to keep only the one we want. Our expression looks like //body/div[5]. From here, we want to locate all of the speakers that appear in Act V, but we do not know for sure where <speaker> elements are located in the hierarchy. Thus, we have to operate on the descendant axis to locate speakers with //body/div[5]//speaker. This expression returns a list of 257 speakers with duplicates. To get rid of the duplicate speakers, we apply the distinct-values() function and return a list of 13 distinct characters who speak in Act V of Hamlet.

  2. Modify the preceding expression to return an alphabetized sequence of all unique speakers.
    Solution
    //body/div[5]//speaker 
    => distinct-values() 
    => sort()

    The XPath function sort(), as its name suggests, sorts the input supplied to it. Because the input in our case is a list of strings, sort() will sort the supplied input alphabetically. Our expression returns a list of the same 13 distinct speakers from number one above, but now the speakers appear in alphabetical order.

  3. Modify the preceding expression to return a comma-separated list of the alphabetized unique speakers.
    Solution
    //body/div[5]//speaker 
    => distinct-values() 
    => sort() 
    => string-join(', ')

    Our last expression gave us a sequence of distinct speakers in alphabetical order. We want to join these speakers together in a single, comma-separated string. To do so, we apply the XPath function string-join() to the sequence of speakers. The function takes two arguments, the sequence of strings to join and the separator string to insert between them, but because the arrow operator supplies the sequence on its left as the first argument, we write only the separator inside the parentheses. We join with a comma followed by a space to get a comma-separated, alphabetized list of the 13 unique speakers.
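    The three steps of this pipeline (remove duplicates, sort, join) can be sketched in Python as a rough analogue. The speaker list below is an invented sample with repeats, not the real Act V data, and the variable names are illustrative only.

```python
# Invented sample of speaker strings with repeats; not the real Act V data.
speakers = ["Hamlet", "Horatio", "Hamlet", "Osric", "Horatio", "Hamlet"]

distinct = list(dict.fromkeys(speakers))  # like distinct-values(): drop repeats
alphabetized = sorted(distinct)           # like sort() applied to strings
joined = ", ".join(alphabetized)          # like string-join(', ')

print(len(speakers), len(distinct))
print(joined)
```

    As in the XPath version, each step consumes the whole sequence produced by the step before it, which is the situation the arrow operator is designed for.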

Bonus

  1. Return an alphabetized sequence of unique words that immediately follow the word I in all of Hamlet’s spoken lines that start with the word I.
    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [tokenize(.)[1] eq 'I']
    /tokenize(.)[2]
    => distinct-values()
    => sort()

    We start with our expression from Part 1 that uses the tokenize() function strategy to find all of Hamlet’s spoken lines that begin with the word I. We then tokenize each of these lines and filter with the numerical predicate to keep only the second token in each of the lines. Each of these second tokens will be the word that immediately follows the word I. The results list contains 66 items, and we can check this number against the number of lines that begin with the word I since they should be the same. In order to return a list of the distinct words, we use the arrow operator to apply the distinct-values() function to the sequence of 66 words. We then return a list of 32 unique words. To alphabetize this list, we use the sort() function like we did in the second task of Part 2.

  2. You’ll notice that some of those words have commas attached to them. Modify your expression above to strip the commas out and return an alphabetized sequence of unique words without any attached commas.
    Solution
    //(l | ab)
    [ancestor::sp[@who eq 'Hamlet']]
    [tokenize(.)[1] eq 'I']
    /tokenize(.)[2]
    ! translate(., '.,', '')
    => distinct-values()
    => sort()

    The word will. appears with a trailing period, which we overlooked when we wrote the original description of the task. The XPath expression above strips out both commas and periods.

    The translate() function in our expression takes three arguments: the input string, a string of characters to be replaced, and a string of replacement characters. Replacement works character by character: each character in the second argument is replaced by the character at the same position in the third, and a character with no counterpart in the third argument is removed. The input is the context item, which is, in turn, each of the words that immediately follow the word I. Because our third argument is the empty string, periods and commas are replaced with nothing, which removes them from the input words. We then apply distinct-values() and sort(), as we did in the previous bonus task, to return an alphabetized list of unique words. We strip out the unwanted punctuation before we apply distinct-values() so that words with punctuation attached, such as will., are not counted as separate word types. For that reason, this expression returns a list of 31 unique words instead of the 32 returned in the previous task.
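    The character-by-character behavior of translate() can be imitated in Python. This is a minimal sketch under the assumption that a per-character mapping table is a fair model of the XPath function; xpath_translate() is a hypothetical helper name, and the word list is invented rather than taken from the play.

```python
def xpath_translate(value: str, map_chars: str, trans_chars: str) -> str:
    """Imitates XPath translate(): map characters by position, drop unmatched ones."""
    table = {}
    for i, ch in enumerate(map_chars):
        if ord(ch) in table:
            continue  # the first occurrence of a character in map_chars wins
        # A character with no counterpart in trans_chars is deleted (None).
        table[ord(ch)] = trans_chars[i] if i < len(trans_chars) else None
    return value.translate(table)


# Invented sample of second tokens, some with punctuation attached.
words = ["will.", "will", "am,", "am", "do"]

before = sorted(set(words))
cleaned = [xpath_translate(w, ".,", "") for w in words]
after = sorted(set(cleaned))

print(before)
print(after)
print(xpath_translate("bar", "abc", "ABC"))
```

    Stripping punctuation before removing duplicates is what merges will. with will, so the distinct count drops, just as the full expression's count drops from 32 to 31.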