Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-04-18T13:55:54+0000


Test #7: XQuery solution

The task

For this test we’ll be using the version of Hamlet that is available in eXist. You can access it as doc('/db/apps/shakespeare/data/ham.xml')/tei:TEI

Remember that you’ll also need a namespace declaration:

declare namespace tei="http://www.tei-c.org/ns/1.0";

Don’t forget the trailing semicolon, which is required for XQuery declare statements.
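
To see why the namespace declaration matters, here is a minimal sketch that doesn’t depend on the Hamlet file. The $sample element and the <result> wrapper are our own illustrative inventions, not part of the test data. Because TEI elements are in the TEI namespace, a path step without the tei: prefix looks for elements in no namespace and finds nothing:

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: Hypothetical inline sample standing in for a speech from the real file :)
let $sample :=
  <sp xmlns="http://www.tei-c.org/ns/1.0">
    <speaker>Hamlet</speaker>
    <l>A little more than kin, and less than kind.</l>
  </sp>
return
  <result>
    <!-- no prefix: looks for a speaker element in no namespace -->
    <without-prefix>{count($sample/speaker)}</without-prefix>
    <!-- tei: prefix: looks for a speaker element in the TEI namespace -->
    <with-prefix>{count($sample/tei:speaker)}</with-prefix>
  </result>
```

This should report 0 for the unprefixed step and 1 for the prefixed one.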

Your goal is to produce a valid HTML document that contains a table of characters in Hamlet, the number of times they speak, and their first line, sorted in descending order of number of speeches. In case of ties in number of speeches, we’ve subsorted our output alphabetically. You can see our output (without the first lines) at http://dh.obdurodon.org/hamlet-speech.html.

A simple solution

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
<html>
  <head><title>XQuery test</title></head>
  <body>
    <h1>Characters in <cite>Hamlet</cite> and their speeches</h1>
    <table border="1">
      <tr>
          <th>Character</th>
          <th>Speeches</th>
          <th>First Line</th>
      </tr>
      {
        let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
        let $characters := distinct-values($hamlet//tei:speaker)
        for $character in $characters
        let $character_speeches := $hamlet//tei:sp[tei:speaker = $character]
        let $first_line := normalize-space(($character_speeches//tei:l)[1])
        let $count := count($character_speeches)
        order by $count descending, $character
        return
          <tr>
            <td>{$character}</td>
            <td>{$count}</td>
            <td>{$first_line}</td>
          </tr>
      }
    </table>
  </body>
</html>

There are some inconsistencies in the data, including the fact that some line (<l>) elements have extraneous whitespace before or after the text of the line. For example, the first line spoken by Hamlet reads:

<l xml:id="sha-ham102065" n="65"> A little more than kin, and less than kind. </l>

There is a space character before the actual content of the line and another at the end, although these space characters have no meaning and no logical function. To avoid carrying them forward into our report, we use the normalize-space() function, which strips leading and trailing whitespace and collapses internal runs of whitespace into single spaces. Because it operates on the string value of its argument, it also discards markup (in this case, the <l> start and end tags), so we don’t need to use the string() or data() functions to do that.
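
As a small illustration, the following sketch compares string() with normalize-space() on the line quoted above. The square brackets and the <result> wrapper are our own additions, there only to make the stray spaces visible:

```xquery
xquery version "3.0";
(: The line quoted above, constructed inline so that the sketch runs without the Hamlet file :)
let $line := <l xml:id="sha-ham102065" n="65"> A little more than kin, and less than kind. </l>
return
  <result>
    <!-- string() keeps the leading and trailing spaces -->
    <raw>[{string($line)}]</raw>
    <!-- normalize-space() takes the string value and trims it -->
    <normalized>[{normalize-space($line)}]</normalized>
  </result>
```

The <raw> value keeps a space just inside each bracket, while the <normalized> value does not. Neither value contains the <l> tags.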

An enhanced solution

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
<html>
  <head><title>XQuery test</title></head>
  <body>
    <h1>Characters in <cite>Hamlet</cite> and their speeches</h1>
    <table border="1">
      <tr>
          <th>Character</th>
          <th>Speeches</th>
          <th>First Line</th>
          <th>Acts/Scenes</th>
          <th>First appearance</th>
          <th>Last appearance</th>
          <th>Word count</th>
      </tr>
      {
        let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
        let $characters := distinct-values($hamlet//tei:speaker)
        for $character in $characters
        let $scenes := $hamlet//tei:div/tei:div[descendant::tei:speaker = $character]/tei:head
        let $speeches := $hamlet//tei:sp[tei:speaker = $character]
        let $first_line := ($speeches//tei:l)[1]
        let $count := count($speeches)
        let $first_scene := $scenes[1]
        let $last_scene := $scenes[last()]
        let $all_text := normalize-space(string-join($speeches,' '))
        let $words := tokenize($all_text,'\s+')
        let $word_count := count($words)
        order by $count descending, $character
        return
          <tr>
            <td>{$character}</td>
            <td>{$count}</td>
            <td>{normalize-space($first_line)}</td>
            <td>{string-join($scenes,'; ')}</td>
            <td>{string($first_scene)}</td>
            <td>{string($last_scene)}</td>
            <td>{$word_count}</td>
          </tr>
      }
    </table>
  </body>
</html>

Because we want to list the act-scene combinations in which each character speaks, we collect those in our $scenes variable. That variable contains a sequence of <head> elements taken from scene <div> elements, that is, <div> elements nested inside act <div> elements. Those <head> elements contain text like Act 1, Scene 2, so instead of having to count the acts and scenes ourselves and assemble these values, we can just retrieve them directly from the <head> elements.
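
The path that builds $scenes can be tried out on a miniature play. The $play element below is a hypothetical stand-in that only assumes the same act/scene nesting of <div> elements; its speeches and headings are illustrative, not copied from the real file:

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: Hypothetical miniature play: outer div = act, inner div = scene :)
let $play :=
  <body xmlns="http://www.tei-c.org/ns/1.0">
    <div>
      <head>Act 1</head>
      <div><head>Act 1, Scene 1</head><sp><speaker>Bernardo</speaker><l>Long live the king!</l></sp></div>
      <div><head>Act 1, Scene 2</head><sp><speaker>Hamlet</speaker><l>A little more than kin.</l></sp></div>
    </div>
    <div>
      <head>Act 2</head>
      <div><head>Act 2, Scene 2</head><sp><speaker>Hamlet</speaker><l>Well, God-a-mercy.</l></sp></div>
    </div>
  </body>
(: Scene divs are divs that are children of divs; keep those where Hamlet speaks :)
let $scenes := $play//tei:div/tei:div[descendant::tei:speaker = 'Hamlet']/tei:head
return string-join($scenes, '; ')
```

This returns the string Act 1, Scene 2; Act 2, Scene 2. The act-level <head> elements are not selected, because the last step takes the <head> only of a <div> that is itself the child of a <div>.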

We use three variables to create a word count. The $all_text variable string-joins all of a character’s speeches across a space character, turning them into one long string. It then uses normalize-space() to get rid of any extra whitespace characters, such as those at the beginnings and ends of lines of speech (see above) and those created by pretty-printing. For our $words variable we tokenize $all_text, which is now a single string with single spaces between the words, on whitespace, which divides it into a sequence of strings, each of which is a word. Finally, we count those words and save the value in the variable $word_count, which we can use when we output the table. Our decision to perform this counting in three separate steps is an example of using convenience variables; we could have done it all in one step by nesting the functions inside one another, but we found the code more legible (and therefore less error-prone) when we wrote the parts separately.
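
The three-step count and its nested one-step equivalent can be compared in a short sketch. The two strings in $speeches are hypothetical stand-ins for a character’s speeches, and $nested and $unnormalized are names we introduce only for this illustration:

```xquery
xquery version "3.0";
(: Hypothetical speech text with the kind of stray whitespace found in the real file :)
let $speeches := (' To be, or not   to be, ', ' that is the question. ')
(: Three steps with convenience variables, as in our solution :)
let $all_text := normalize-space(string-join($speeches, ' '))
let $words := tokenize($all_text, '\s+')
let $word_count := count($words)
(: The same computation nested into a single expression :)
let $nested := count(tokenize(normalize-space(string-join($speeches, ' ')), '\s+'))
(: Without normalize-space(), leading and trailing whitespace yields empty tokens :)
let $unnormalized := count(tokenize(string-join($speeches, ' '), '\s+'))
return ($word_count, $nested, $unnormalized)
```

This returns 10, 10, 12. The last value is too high by two because tokenize() produces a zero-length string when the separator occurs at the very start or end of the input, which is one reason to normalize before tokenizing.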

So how could we have made it better? We’ve assumed that <sp> elements contain the words a character speaks, and that’s true, but they also contain descendant elements that do not represent speech, specifically <stage> (stage directions) and <speaker> (the name of the speaker). An accurate solution would have excluded those from the word count.

<sp> elements have five types of descendant elements: <l>, <ab>, <lg>, <speaker>, and <stage>. Only <l> and <ab> contain actual speech, so our first thought was that we could just take another path step and retrieve text only from those two element types. The problem is that some of those elements may themselves contain stage directions, e.g.:

<l xml:id="sha-ham101018I" n="18"> Give you good night. <stage>Exit.</stage></l>

If we just take the string value of <l> and <ab> elements, we’ll wind up including the words inside those embedded stage directions, which we don’t want, since they aren’t spoken by the character. We can avoid this problem by taking advantage of text(), which in XPath matches text nodes. This means that an XPath expression like l/text() or text()[parent::l] finds only the text-node children of <l> elements. Since the text inside an embedded stage direction is a text-node child of the stage direction, rather than of the surrounding line, this type of path will exclude it, which is what we want. Here’s a revision; the only changes are the new $spoken_text variable and its use, in place of $speeches, in the definition of $all_text:

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
<html>
  <head><title>XQuery test</title></head>
  <body>
    <h1>Characters in <cite>Hamlet</cite> and their speeches</h1>
    <table border="1">
      <tr>
          <th>Character</th>
          <th>Speeches</th>
          <th>First Line</th>
          <th>Acts/Scenes</th>
          <th>First appearance</th>
          <th>Last appearance</th>
          <th>Word count</th>
      </tr>
      {
        let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
        let $characters := distinct-values($hamlet//tei:speaker)
        for $character in $characters
        let $scenes := $hamlet//tei:div/tei:div[descendant::tei:speaker = $character]/tei:head
        let $speeches := $hamlet//tei:sp[tei:speaker = $character]
        let $first_line := ($speeches//tei:l)[1]
        let $count := count($speeches)
        let $first_scene := $scenes[1]
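        let $last_scene := $scenes[last()]
        (: Changed: keep only text nodes whose parent is tei:l or tei:ab, which leaves out speaker names and stage directions :)
        let $spoken_text := $speeches/descendant::text()[parent::tei:l or parent::tei:ab]
        let $all_text := normalize-space(string-join($spoken_text,' '))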
        let $words := tokenize($all_text,'\s+')
        let $word_count := count($words)
        order by $count descending, $character
        return
          <tr>
            <td>{$character}</td>
            <td>{$count}</td>
            <td>{normalize-space($first_line)}</td>
            <td>{string-join($scenes,'; ')}</td>
            <td>{string($first_scene)}</td>
            <td>{string($last_scene)}</td>
            <td>{$word_count}</td>
          </tr>
      }
    </table>
  </body>
</html>
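
To see on a single line what the text() step changes, here is a sketch that uses the line with the embedded stage direction quoted earlier (without its xml:id). The <result> wrapper and its child element names are our own illustrative choices:

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: A line with an embedded stage direction, constructed inline :)
let $line :=
  <l xmlns="http://www.tei-c.org/ns/1.0" n="18"> Give you good night. <stage>Exit.</stage></l>
return
  <result>
    <!-- the string value of the line includes the stage direction -->
    <string-value>{normalize-space($line)}</string-value>
    <!-- text-node children of the line only -->
    <text-children>{normalize-space(string-join($line/text(), ' '))}</text-children>
    <!-- the same nodes, found from the text-node side -->
    <text-by-parent>{normalize-space(string-join($line//text()[parent::tei:l], ' '))}</text-by-parent>
  </result>
```

The first value is Give you good night. Exit. and the other two are both Give you good night. because the word Exit. belongs to a text node whose parent is <stage>, not <l>.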

Here we define $spoken_text by starting with $speeches (a sequence of <sp> elements), finding all of the descendant text() nodes of those <sp> elements, and keeping only the ones that are children of <l> or <ab> elements. We then use that value, instead of $speeches (which contains unwanted words from speaker names and stage directions), to construct $all_text.
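
The effect on the word count can be checked on one small speech. The <sp> below is a hypothetical example that we made up for illustration, and $everything and $spoken_only are our own names. We declare boundary-space preserve so that the whitespace between the child elements is kept, which is our assumption about how the pretty-printed file behaves:

```xquery
xquery version "3.0";
declare boundary-space preserve;
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: Hypothetical speech with a speaker name, two stage directions, a line, and an ab :)
let $speeches :=
  <sp xmlns="http://www.tei-c.org/ns/1.0">
    <speaker>Francisco</speaker>
    <stage>To Bernardo</stage>
    <l> Give you good night. <stage>Exit.</stage></l>
    <ab>Stand, ho!</ab>
  </sp>
(: Only text nodes that are children of tei:l or tei:ab count as spoken :)
let $spoken_text := $speeches/descendant::text()[parent::tei:l or parent::tei:ab]
(: Word count over the whole speech, as in the first enhanced solution :)
let $everything := count(tokenize(normalize-space(string-join($speeches, ' ')), '\s+'))
(: Word count over the spoken text only, as in the revision :)
let $spoken_only := count(tokenize(normalize-space(string-join($spoken_text, ' ')), '\s+'))
return ($everything, $spoken_only)
```

This returns 10, 6. The four extra words in the first count are the speaker name (Francisco) and the words of the two stage directions (To, Bernardo, Exit.).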