Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-03-20T15:05:54+0000


Test #4: XPath

The task below uses Bad Hamlet to create an alphabetized list of all of the distinct words spoken by Hamlet in Act 5, Scene 1. The steps are ordered so that each builds on the preceding one. If you get stuck on a step, whether required or extra-credit (for example, if you aren’t able to convert words to lower case or remove punctuation), it’s okay to skip it and proceed to the next step, but tell us that you’ve done that. Submit your answers in a markdown document, with backticks around the XPath expressions, specifying the XPath expression for each step (they are cumulative, so each will look like a modified version of the immediately preceding one). That is, your answer will look like a numbered list of XPath expressions, one expression for each of the steps below.

We suggest working in the XPath/XQuery Builder view (accessible from the <oXygen/> menus at Window → Show View) because the expression gets very long. (To run an expression in the Builder, click on the red sideways triangle at the top of the Builder panel. The Enter key just creates a new line; it doesn’t execute the expression.) Using the simple mapping operator (!) and the arrow operator (=>) where possible will improve legibility, but it is possible to complete these steps without those operators. You can also split your expression into multiple lines in the Builder, which further improves legibility.
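
As a reminder of how the two operators read, here is a minimal sketch that uses literal strings instead of Bad Hamlet. The strings are invented for illustration, and the sketch assumes that the Builder is set to an XPath version that supports both operators (3.1). These are three separate expressions, so run them one at a time.

```xpath
(: Nested function calls read from the inside out :)
upper-case(normalize-space('  to be or not to be  '))

(: The arrow operator passes the value on its left as the first argument
   of the function on its right, so the same steps read left to right :)
'  to be or not to be  ' => normalize-space() => upper-case()

(: The simple mapping operator evaluates the right side once for each
   item on the left; this returns (2, 2, 2) :)
('To', 'be', 'or') ! string-length(.)
```

The first two expressions both return `TO BE OR NOT TO BE`. The practical difference between the operators is that `=>` hands over the whole sequence at once, while `!` works item by item. For example, `('To', 'Be') ! lower-case(.)` works, but `('To', 'Be') => lower-case()` raises an error because `lower-case()` accepts only a single string.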

Required steps

  1. Write an XPath expression that returns all of Hamlet’s lines (<l> elements) in Act 5, Scene 1. There are 43 of them. Our solution does not use any functions.
  2. Modify your answer above to provide both the <l> elements and the anonymous blocks (<ab> elements). There are 91 anonymous blocks, so the total number of lines of both types is 134. Our solution does not use any functions.
  3. Modify your answer to the preceding step to break the text into a single list of all of the words in all of those lines. Our solution uses the tokenize() function.

    This word list will include words inside the four stage directions that are children of the line elements in this scene. For extra credit, exclude those from your result, so that you are returning only the words spoken by Hamlet.

  4. Modify your answer to return all of those words in lower case, that is, convert words with initial capital letters to all lower case. Our solution uses the lower-case() function.

Extra-credit steps

  1. Modify your answer to strip all punctuation except hyphens and apostrophes. Our solution uses the replace() function.
  2. Modify your answer to remove duplicate words. Our solution uses the distinct-values() function.
  3. Modify your answer to sort the distinct words alphabetically. Our solution uses the sort() function.
  4. The first word in our alphabetized list is just a blank, so remove it. Our solution uses a predicate (checking whether the word is equal to the empty string), but no additional functions.
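
To see how the functions named above can be chained, here is a minimal sketch that runs on a single literal string rather than on Bad Hamlet. It is one possible arrangement and not necessarily the one our solution uses. The sample sentence and the regular expressions are assumptions made for illustration, and sort() requires XPath 3.1.

```xpath
(
  'To be, or not to be ? That is the question.'
  ! tokenize(., '\s+')            (: split the string on white space :)
  ! lower-case(.)                 (: lower-case each token :)
  ! replace(., "[^\w'\-]", '')    (: delete everything except word characters, apostrophes, and hyphens :)
  => distinct-values()            (: remove duplicates from the whole sequence :)
  => sort()                       (: sort the whole sequence alphabetically :)
)[. != '']                        (: drop the empty string left behind by the free-standing ? :)
```

This returns `be`, `is`, `not`, `or`, `question`, `that`, `the`, `to`. All of the `!` steps come before the `=>` steps because, in the XPath 3.1 grammar, a `!` cannot directly follow an arrow step without parentheses. The outer parentheses are what allow the predicate to apply to the sorted result as a whole.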