Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2018-04-22T01:31:32+0000


Test #7: XQuery solution

The task

For this test we’ll be using the version of Hamlet that is available in eXist. You can access it as doc('/db/apps/shakespeare/data/ham.xml')/tei:TEI

Remember that you’ll also need a namespace declaration:

declare namespace tei="http://www.tei-c.org/ns/1.0";

Don’t forget the trailing semicolon, which is required for XQuery declare statements.
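
Before building the table, it can help to confirm that the document path and the namespace declaration resolve. The following is only a minimal sanity-check sketch, not part of the required solution. The wrapper element names (<summary>, <acts>, <scenes>) are made up for illustration, and it assumes, as the solutions below do, that acts are the <div> children of <body> and scenes are the <div> children of acts. If the namespace declaration is missing or misspelled, both counts come back as 0.

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
(: Sanity check: count acts and scenes to confirm the path and namespace work :)
let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')/tei:TEI
let $acts := $hamlet//tei:body/tei:div
return
    (: <summary>, <acts>, and <scenes> are hypothetical wrapper names :)
    <summary>
        <acts>{count($acts)}</acts>
        <scenes>{count($acts/tei:div)}</scenes>
    </summary>
```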

Your goal is to produce a valid HTML document that contains a table listing each act, each scene, and an alphabetical list of the speakers (<speaker> elements) who appear in that scene.

A simple solution

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
<html>
  <head><title>XQuery Test Spring 2018</title></head>
  <body>
    <h1>Hamlet, King of Denmark</h1>
    <table border="1">
      <tr>
          <th>Act</th>
          <th>Scene</th>
          <th>Speakers</th>
      </tr>
      {
        let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
        let $hamlet_acts := $hamlet//tei:body/tei:div
        for $act in $hamlet_acts 
        let $act_head := $act/tei:head
        let $hamlet_scenes := $act/tei:div
            for $scene in $hamlet_scenes
            let $scene_head := $scene/tei:head
            let $scene_speaker := $scene/tei:sp/tei:speaker
            let $distinct_speakers := distinct-values($scene_speaker)
            let $sorted_speakers := 
                    for $distinct_speaker in $distinct_speakers 
                    order by $distinct_speaker
                    return $distinct_speaker
        return
          <tr>
            <td>{string($act_head)}</td>
            <td>{substring-after($scene_head, ', ')}</td>
            <td>{string-join($sorted_speakers, ', ')}</td>
          </tr>
      }
    </table>
  </body>
</html>

Note that the act number is not repeated in the Scene column. Each scene heading includes the act number, which is superfluous here, so we remove it with the substring-after() function. This function returns the part of its first argument that follows the first occurrence of its second argument. Here the second argument is ', ' (a comma followed by a space), so the function returns only the scene name.
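
To see substring-after() in isolation, here is a small sketch. The string 'Act 1, Scene 1' is an assumed heading value used only for illustration; check the actual <head> content in the file. The second call shows what happens when the separator never occurs: the function returns a zero-length string rather than raising an error.

```xquery
xquery version "3.0";
(: 'Act 1, Scene 1' is an assumed heading value, used only for illustration :)
let $scene_head := 'Act 1, Scene 1'
return (
    substring-after($scene_head, ', '),   (: "Scene 1" :)
    substring-after($scene_head, '; ')    (: "" because '; ' never occurs :)
)
```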

An enhanced solution

xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $ham as document-node() := doc('/db/apps/shakespeare/data/ham.xml');
declare variable $acts as element(tei:div)+ := $ham//tei:body/tei:div;
<html>
    <head><title>XQuery Test Spring 2018</title></head>
    <body>
        <h1>Hamlet, King of Denmark</h1>
        <table
            border="1">
            <tr>
                <th>Act</th>
                <th>Scene</th>
                <th>Speakers</th>
                <th>Speech Count</th>
                <th>Line Count</th>
                <th>First Line</th>
            </tr>
            {
                for $act in $acts
                let $scenes := $act/tei:div
                    for $scene at $pos in $scenes
                    let $sp_count := count($scene/tei:sp)
                    let $line_count := count($scene//(tei:l | tei:ab))
                    let $first_line := ($scene/tei:sp//(tei:l | tei:ab))[1]
                    let $first_line_spoke := $first_line/text()
                    let $speakers :=
                        for $speaker in distinct-values($scene//tei:speaker/tokenize(.,' and '))
                        order by $speaker
                        return $speaker
                return
                    <tr>
                        {
                            if ($pos eq 1) then
                                <td rowspan="{count($scenes)}">{string($act/tei:head)}</td>
                            else ()
                        }
                        <td>{substring-after($scene/tei:head, ', ')}</td>
                        <td>{string-join($speakers, ', ')}</td>
                        <td>{$sp_count}</td>
                        <td>{$line_count}</td>
                        <td>{$first_line_spoke}</td>
                    </tr>
            }
        </table>
    </body>
</html>

Our original table is lame because it repeats the same Act entry in every row for that act. The more graceful design is to use the HTML @rowspan attribute to create a single entry in the Act column for each act, which spans all of the scenes of that act. We did this by using an if expression inside the <tr>. For the first scene in each act, the row has six cells (Act, Scene, Speakers, Speech Count, Line Count, First Line), with a @rowspan attribute value on the first <td> equal to the number of scenes in that act, that is, the number of rows the Act cell has to span. For scenes other than the first, our <tr> contains only five <td> elements (Scene, Speakers, Speech Count, Line Count, First Line), since the first column of those rows is occupied by the Act cell that extends down from a higher row.
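
The @rowspan logic can be tried out without the play. The sketch below runs the same if expression over a tiny inline document. The <play>, <act>, and <scene> elements and their @n attributes are hypothetical stand-ins, not the TEI markup of the real file. Because $pos is the position of the scene within its own act, it restarts at 1 for every act.

```xquery
xquery version "3.0";
(: Hypothetical inline data, not the TEI markup of the real play :)
let $acts :=
    <play>
        <act n="Act 1"><scene n="Scene 1"/><scene n="Scene 2"/></act>
        <act n="Act 2"><scene n="Scene 1"/></act>
    </play>/act
return
    <table border="1">
        {
            for $act in $acts
            let $scenes := $act/scene
            for $scene at $pos in $scenes
            return
                <tr>
                    {
                        (: Only the first scene row of an act carries the Act cell :)
                        if ($pos eq 1) then
                            <td rowspan="{count($scenes)}">{string($act/@n)}</td>
                        else ()
                    }
                    <td>{string($scene/@n)}</td>
                </tr>
        }
    </table>
```

With this data the first row has two cells and a rowspan of 2, the second row has one cell, and the third row has two cells and a rowspan of 1.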

We retrieved the line count for each scene by using the count() function. Lines occur as either <l> or <ab> elements, and not necessarily as direct children of <sp> elements, so the count has to include both, wherever they occur inside the scene. Similarly, the first line could be either an <l> or an <ab>, so we allow for both. One challenge is that some of these elements may contain stage directions, e.g.:

<l xml:id="sha-ham101018I" n="18"> Give you good night. <stage>Exit.</stage></l>

If we just take the string value of <l> and <ab> elements, we’ll wind up including the words inside those embedded stage directions, which we don’t want, since they aren’t spoken by the character. We can avoid this error by taking advantage of text(), which in XPath matches text nodes. The path $first_line/text() selects only the text nodes that are children of the line itself. The text inside an embedded stage direction is a text-node child of the <stage> element, rather than of the surrounding line, so this path excludes it, which is what we want.
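
The difference between the string value and the text() children is easy to see on the sample line above. This sketch copies that line without its TEI namespace and @xml:id to stay self-contained. The wrapper element names are made up for illustration, and normalize-space() is added only to tidy the leading and trailing white space.

```xquery
xquery version "3.0";
(: The line is copied from the example above, minus its TEI namespace and xml:id :)
let $line := <l n="18"> Give you good night. <stage>Exit.</stage></l>
return
    (: <comparison>, <with-stage>, and <spoken-only> are hypothetical wrapper names :)
    <comparison>
        <with-stage>{normalize-space($line)}</with-stage>
        <spoken-only>{normalize-space(string-join($line/text(), ''))}</spoken-only>
    </comparison>
```

The first value is "Give you good night. Exit." and the second is "Give you good night." One caveat: text() keeps only the direct text-node children of the line, so any spoken words wrapped in some other child element would be dropped as well.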

We took a shortcut to break up conjoined <speaker> values, such as when Rosencrantz and Guildenstern speak together. After using XPath to ascertain that joint speeches always involve exactly two persons (never three or more), we tokenized the <speaker> elements on the string " and " before taking the distinct values of the speakers in the scene. This approach will break if there are three or more speakers, with comma separators, such as "Curly, Larry, and Moe", so a more robust solution would have to allow for that possibility as well. Unfortunately, we can’t just tokenize on white space because some individual characters have white space in their names, such as "First Player".
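
One possible way to handle comma-separated lists as well is to tokenize on a regular expression instead of a fixed string. The sketch below is an assumption about what a more robust version might look like, not the solution used above. The speaker labels are illustrative strings, and the <speakers> and <name> wrappers are made up. The pattern treats a comma (optionally followed by "and") or a free-standing " and " as a separator, and leaves other white space alone.

```xquery
xquery version "3.0";
(: Illustrative speaker labels; the three-person label does not occur in the play :)
let $labels := ('Rosencrantz and Guildenstern', 'First Player', 'Curly, Larry, and Moe')
for $label in $labels
return
    (: <speakers> and <name> are hypothetical wrapper names :)
    <speakers>{
        (: Separator: a comma with an optional following "and", or " and " by itself :)
        for $name in tokenize($label, '\s*,\s*(and\s+)?|\s+and\s+')
        return <name>{$name}</name>
    }</speakers>
```

This yields two names for the first label, one for "First Player", and three for the last label. In the enhanced solution, the same pattern could replace ' and ' in the tokenize() call.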