Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-03-18T16:20:27+0000


Test 4: XPath

For the XPath test, you will be using XPath to take a closer look at individual characters in Shakespeare’s Hamlet. We recommend working in the XPath/XQuery Builder view instead of the small XPath toolbar at the top of your <oXygen/> window because some of these expressions can get long, and it is easier to stay organized if you can see the entirety of your XPath expression as you are typing. Like all of our tests, this one is open-book, which means that you can consult notes, books, the Internet and other resources, except that you cannot receive any assistance from another person. Should you get stuck in a way that does not respond to your best rubber duck debugging efforts, feel free to post an inquiry in our Slack workspace and we’ll try to point you in the right direction. You should submit your answers (the full XPath expressions, not just the result of evaluating them) in a properly formatted markdown document. That includes surrounding XPath expressions in backticks.

Using the version of Hamlet that we have been using for all the previous XPath assignments (http://dh.obdurodon.org/bad-hamlet.xml), complete the following tasks. Some tasks have more than one good solution, so you will not need every function listed here, but the ones we used include avg(), distinct-values(), matches(), normalize-space(), round(), sort(), string-join(), tokenize(), and translate(). Don’t guess at how these work; if you aren’t already familiar with them, look up the number of arguments they require and what each argument means in Michael Kay or an alternative reference.
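As a rough illustration of the argument shapes involved, here is a sketch that calls a few of these functions on made-up literal strings rather than on the Hamlet file. The input values are invented for illustration and none of this is a solution to the tasks below. Each line is a separate expression to evaluate on its own, and sort() assumes an XPath 3.1 processor.

```xpath
(: matches(input, regex) returns a boolean :)
matches("Hamlet", "^Ham")                     (: true :)

(: tokenize(input, regex) splits a string wherever the regex matches :)
tokenize("to be or not", "\s+")               (: "to", "be", "or", "not" :)

(: translate(input, from, to): characters in "from" with no counterpart in "to" are deleted :)
translate("words, words, words", ",", "")     (: "words words words" :)

(: distinct-values(sequence) drops duplicates; the order of the result is implementation-dependent :)
distinct-values(("b", "a", "b"))

(: sort(sequence) orders the items, here alphabetically :)
sort(("b", "c", "a"))                         (: "a", "b", "c" :)

(: string-join(sequence, separator) glues the items into one string :)
string-join(("a", "b", "c"), ", ")            (: "a, b, c" :)
```

Trying a function on a short literal like this before applying it to a long path expression can make it easier to see which step is misbehaving.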

Here are two details to keep in mind:

Required

Part 1

  1. Find all of Hamlet’s spoken lines (which may be represented by <l> or <ab> elements) in the play and select only those that begin with the word I (not just the letter I). The result should be a sequence of elements of type <l> or <ab>.

  2. The elements you select above all contain text nodes, which represent spoken text, but some also contain stage directions (see the example above). Extend your XPath expression above to return just the spoken text from each of the lines, without any accompanying stage directions. The result will be plain text, without any markup. (Hint: See above about selecting text nodes.)

  3. Pretty-printing adds a lot of extra whitespace to these lines. For example, where the XML contains:

    A little more than kin, and less than
      kind.

    the pretty-printing introduces a newline and some extra spaces for indentation. Extend your XPath expression above (the one that selects only the text nodes inside lines, but not stage directions) to remove this extra space, so that each spoken line of text will be continuous, with single space characters between words.

  4. Modify the preceding XPath expression to return the length in characters of each line spoken by Hamlet that begins with I (after ignoring stage directions and removing extra whitespace). The result will be a sequence of integers, each representing the character count of a single line of speech.

  5. Write an XPath expression to compute the average length (in character count) of the lines spoken by Hamlet that begin with I. If the value is not already an integer, round it to the nearest integer value.
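The whitespace and counting steps in tasks 3 through 5 can be tried out in isolation first. The sketch below applies the relevant functions to a literal copy of the example text from task 3 and to an invented sequence of numbers, not to nodes selected from the play, so it shows only how the functions behave. Each line is a separate expression to evaluate on its own.

```xpath
(: normalize-space(string) trims both ends and collapses each run of whitespace to one space :)
normalize-space("A little more than kin, and less than
      kind.")                                 (: "A little more than kin, and less than kind." :)

(: string-length(string) counts characters :)
string-length("A little more than kin, and less than kind.")   (: 43 :)

(: avg(sequence) averages numbers; round(number) rounds to the nearest integer :)
round(avg((3, 4, 6)))                         (: 4 :)
```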

Part 2

Who speaks in the fifth act of the play?

  1. Return a sequence of all unique speakers in Act V.
  2. Modify the preceding expression to return an alphabetized sequence of all unique speakers.
  3. Modify the preceding expression to return a comma-separated list of the alphabetized unique speakers.

Bonus

  1. Return an alphabetized sequence of unique words that immediately follow the word I in all of Hamlet’s spoken lines that start with the word I.
  2. You’ll notice that some of those words have commas attached to them. Modify your expression above to remove the commas and return an alphabetized sequence of unique words without any attached commas.