Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-03-30T14:33:16+0000


XQuery assignment #2: solution

The original assignment

Obdurodon contains the text of forty-two Shakespearean plays. In the TEI markup for these plays, speeches are <sp> elements and the speakers are <speaker> elements that are their first children. For example:

<sp who="Roderigo">
    <speaker>Roderigo</speaker>
    <l xml:id="sha-oth101040F" n="40">I would not follow him then.</l>
</sp>

There are 966 distinct speaker names (values of the <speaker> element) in the entire corpus, some of which show up in more than one play. Three of the names show up in more than ten plays: there is a character called Messenger in 22 plays, one called All in 20 plays, and one called Servant in 18. (These aren’t the same messenger or all or servant, of course!) Here are the details:

Assignment

Your assignment is to write XQuery that will query the collection of plays, find the three <speaker> element values that occur in more than ten plays, and return a list of those elements along with the names of the plays in which they occur. That is, you should write XQuery that generates the basic information in the preceding list.

Our list includes some enhancements that you can consider optional for this assignment, but if you get the basic solution, we hope you’ll try them:

Our solution used several let clauses to configure convenience variables. We used distinct-values() and a for clause to find and iterate over all of the distinct <speaker> names in the collection. We used the count() function and a where clause to find all of the names of speakers that occur in more than ten plays. Finally, we used both concat() and string-join() in our return clause to glue the pieces together. For the optional parts, we used an order by clause to sort the results in descending order of frequency, and to alphabetize the play titles we used an embedded FLWOR expression. That means that we set a variable equal to the result of a FLWOR expression, which is a powerful way of using FLWOR not just for the main program flow, but also to control subcomponents. We used the XPath translate() function to fix the apostrophe.
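As a small illustration of binding a variable to the result of a nested FLWOR, here is a minimal sketch that sorts a few titles and replaces straight apostrophes with curly ones. The titles are invented placeholders, not data from the corpus.

```xquery
xquery version "3.0";
(: Placeholder titles, invented for illustration only :)
let $titles := ("Beta's Tale", "Gamma", "Alpha's End")
(: $sorted holds whatever the nested FLWOR returns: a sorted sequence of strings :)
let $sorted :=
    for $title in $titles
    order by $title
    return translate($title, "'", "’")
return string-join($sorted, '; ')
```

This should return the single string Alpha’s End; Beta’s Tale; Gamma.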

This is not a difficult problem intellectually: you can paraphrase it easily as find all of the speaker names that show up in more than ten plays. It is, however, a complicated pattern to formalize, and because the XQuery version has to be explicit, it is longer than the prose formulation. We would suggest breaking the problem down into subcomponents like

Some of the preceding questions are easier than others, but getting all the way to a solution means getting them all correct and stitching them together effectively, which is why we recommend developing your XQuery step by step and testing it after every step, so that when it breaks, you’ll know which statement caused the problem. If you can’t get all the way to a solution, that’s okay, but you can’t just give up; what you need to do instead is submit your solutions to as many of the pieces as you can (both the ones we list above and any others that you recognize as important). While this particular assignment is a bit (okay, more than a bit) artificial, the techniques required are very common in Real Life: find information across documents and create a report based on an aggregated result.

Our solution to the main problem

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare variable $plays := collection('/db/apps/shakespeare/data');
declare variable $distinctSpeakers := distinct-values($plays//tei:speaker);
for $speaker in $distinctSpeakers
let $myPlays := $plays[.//tei:speaker = $speaker]
let $myTitles := (: a nested FLWOR expression for getting the titles in alphabetical order :)
    for $myTitle in $myPlays//tei:titleStmt/tei:title/translate(., "'", "’")
    order by $myTitle
    return $myTitle
let $countMyPlays := count($myPlays)
where $countMyPlays ge 10
order by $countMyPlays descending
return concat($speaker, ' appears in ', $countMyPlays, ' plays: ',
    string-join($myTitles, '; '))

(Comments in XQuery are surrounded by smiley faces, e.g., (: blah blah blah :). Comments are for the convenience of the programmer; the XQuery processor ignores them.)

We begin by setting some convenience variables, one for the collection of plays, and one for all of the distinct <speaker> values in the collection. (distinct-values() returns the string values, not the elements themselves.) You don’t have to use variables (you can write the whole XPath out each time you need it), but we find that extra variables result in more legible code and more self-explanatory program logic.
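To see what distinct-values() does on its own, here is a minimal sketch that runs without the collection. The <speaker> elements are built inline and are toy data for illustration.

```xquery
xquery version "3.0";
(: Toy data built inline; not taken from the Shakespeare corpus :)
let $speakers := (
    <speaker>Messenger</speaker>,
    <speaker>All</speaker>,
    <speaker>Messenger</speaker>
)
(: distinct-values() atomizes the elements and returns each value once :)
return distinct-values($speakers)
```

The result is two atomic values, Messenger and All, rather than three elements. The order in which distinct-values() returns them is implementation-dependent.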

We then iterate over the distinct <speaker> values in a for expression. For each <speaker> value in turn, we find the plays (the documents in the collection) that contain a <speaker> element with that value. In a nested FLWOR expression, we iterate over the <title> elements of all those plays, using translate() to replace straight apostrophes with curly ones. We use order by to put the titles in alphabetical order, and return them. Coming back to the main expression, we count the plays the current <speaker> value appears in and use a where clause to keep only the <speaker> values where that count is greater than or equal to 10, that is, where the name appears in 10 or more plays. (The assignment asks for names that occur in more than ten plays, which strictly would be gt 10; the two tests return the same names as long as no name occurs in exactly ten plays.)
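The filter-and-count step can be tried on a toy corpus. In this sketch the <play> wrappers, the titles, and the threshold of 3 are all invented for illustration; the real data uses TEI documents and the tei: namespace.

```xquery
xquery version "3.0";
(: Toy stand-ins for plays; element names and content are hypothetical :)
let $plays := (
    <play><title>Alpha</title><speaker>Messenger</speaker><speaker>King</speaker></play>,
    <play><title>Beta</title><speaker>Messenger</speaker></play>,
    <play><title>Gamma</title><speaker>King</speaker><speaker>Messenger</speaker></play>
)
for $speaker in distinct-values($plays//speaker)
(: the general comparison = is true if ANY <speaker> in the play matches :)
let $myPlays := $plays[.//speaker = $speaker]
let $countMyPlays := count($myPlays)
(: threshold lowered to 3 to suit the toy data :)
where $countMyPlays ge 3
return $speaker
```

Here Messenger occurs in all three toy plays and King in only two, so only Messenger is returned.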

We then use order by to sort the results in order of frequency by the count of plays. Since we want the most frequent character name first, we specify that the sort should be in descending order.
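A minimal sketch of a descending sort, using the three counts reported in the assignment. The <tally> elements are a hypothetical way of holding name-and-count pairs and are not part of our solution.

```xquery
xquery version "3.0";
(: Hypothetical <tally> elements holding the counts given in the assignment :)
let $tallies := (
    <tally name="Servant" plays="18"/>,
    <tally name="Messenger" plays="22"/>,
    <tally name="All" plays="20"/>
)
for $tally in $tallies
(: untyped attribute values sort as strings, so cast to get a numeric sort :)
order by xs:integer($tally/@plays) descending
return string($tally/@name)
```

This returns Messenger, All, Servant. In our solution no cast is needed, because count() already returns an integer.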

Finally, we return a string that begins with the name of the <speaker>, which we’re still holding on to from our for expression, followed by the literal text ' appears in ', the count of plays, and the literal text ' plays: '. After that, we use string-join() to join the titles of the plays with a semicolon followed by a space.
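The two gluing functions can be tried in isolation. This sketch uses placeholder values in place of the variables our solution computes.

```xquery
xquery version "3.0";
(: Placeholder values, invented for illustration :)
let $speaker := 'Messenger'
let $myTitles := ('Alpha', 'Beta', 'Gamma')
(: concat() glues a fixed number of pieces; string-join() glues a whole sequence :)
return concat($speaker, ' appears in ', count($myTitles), ' plays: ',
    string-join($myTitles, '; '))
```

This returns Messenger appears in 3 plays: Alpha; Beta; Gamma. Note that string-join() puts the separator only between items, so there is no trailing semicolon.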

Returning HTML

If we want to return the results as HTML, we can modify our XQuery as follows:

xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare default element namespace "http://www.w3.org/1999/xhtml";
declare variable $plays := collection('/db/apps/shakespeare/data');
declare variable $distinctSpeakers := distinct-values($plays//tei:speaker);
<html>
    <head><title>Speakers in plays</title></head>
    <body>{
        for $speaker in $distinctSpeakers
        let $myPlays := $plays[.//tei:speaker = $speaker]
        let $myTitles := (: a nested FLWOR expression for getting the titles in alphabetical order :)
            for $myTitle in $myPlays//tei:titleStmt/tei:title/translate(., "'", "’")
            order by $myTitle
            return $myTitle
        let $countMyPlays := count($myPlays)
        where $countMyPlays ge 10
        order by $countMyPlays descending
        return 
            (<h1>{concat($speaker,' (',$countMyPlays,')')}</h1>,
            <ol>{
                for $title in $myTitles
                return <li>{$title}</li>
            }</ol>)
    }</body>
</html>

Because we are outputting HTML and we want it to be in the HTML namespace, we declare a default element namespace on our third line, right after we declare the TEI namespace (which we bind to the tei: namespace prefix). The default namespace declaration means that everything not explicitly in a different namespace (in our document, everything without a tei: namespace prefix) will be in the HTML namespace. Since all elements in our XQuery that don’t have a tei: namespace prefix are part of the HTML we’re creating, we want them to be in the HTML namespace, and this declaration gives us the desired result. Unfortunately, XQuery doesn’t have the ability to declare separate default input and output namespaces, which XSLT does (using the @xpath-default-namespace attribute for the input namespace and a regular namespace declaration on the root <xsl:stylesheet> element for the output namespace). We work around that in this XQuery by declaring a default namespace (which will apply to both input and output) and then overriding it in the case of input by using the tei: namespace prefix explicitly.
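The effect of the default element namespace on both input and output can be seen in a small sketch. The TEI fragment is built inline for illustration, and the <p> wrapper is just a convenient container for the result.

```xquery
xquery version "3.0";
declare namespace tei="http://www.tei-c.org/ns/1.0";
declare default element namespace "http://www.w3.org/1999/xhtml";
(: Toy TEI fragment built inline for illustration :)
let $sp := <tei:sp><tei:speaker>Roderigo</tei:speaker></tei:sp>
return <p>{
    concat(
        count($sp/speaker),      (: unprefixed name is looked up in the XHTML namespace: 0 :)
        ' vs. ',
        count($sp/tei:speaker)   (: explicit tei: prefix finds the element: 1 :)
    )
}</p>
```

The result is a <p> element in the HTML namespace whose content is 0 vs. 1. The unprefixed path step finds nothing because the default namespace applies to path expressions as well as to constructed elements.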

We configure our convenience variables, as before, but because we want to return only one HTML document, we need to create that before the for expression that iterates over the speakers. Anything we put in the return clause of the for expression will be returned multiple times, once for every item in the sequence over which we’re looping. For example, if we return the <html> element from inside the for expression, we’ll wind up with one <html> element for each of the three speakers who appear in more than ten plays. What we want, though, is one HTML document that contains the information about all three speakers.
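The difference between building the wrapper once and building it inside the loop can be seen in a short sketch. It uses the three names from the assignment and a <ul> wrapper, which is chosen here only for illustration.

```xquery
xquery version "3.0";
(: The three names reported in the assignment :)
let $names := ('Messenger', 'All', 'Servant')
return (
    (: wrapper built once, with the loop inside it: one <ul> holding three <li> :)
    <ul>{ for $name in $names return <li>{$name}</li> }</ul>,
    (: wrapper built in the return clause: three separate <ul> elements :)
    for $name in $names return <ul><li>{$name}</li></ul>
)
```

The first expression yields a single list, which is the pattern we want for the one <html> element. The second yields three one-item lists.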

Within the <body> of the one <html> element that we create, we start our for loop, iterating over the 966 distinct speaker names, as before. What we’ve decided to return for each speaker this time, though, is two things: an <h1> that contains the speaker’s name and the number of plays in which that name appears, and then an ordered list (<ol>) that contains the play titles. Since a return clause takes only a single expression, we have to form the two HTML elements into a single sequence, which we do by wrapping parentheses around them and separating them with a comma. XQuery will recognize this as one thing (a sequence) that happens to contain two things (the two HTML elements).
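A minimal sketch of returning two elements per iteration as one sequence. The <p> content is a placeholder standing in for the list of titles.

```xquery
xquery version "3.0";
for $name in ('Messenger', 'All')
(: parentheses and a comma make one sequence out of the two elements :)
return (<h1>{$name}</h1>, <p>placeholder for the titles</p>)
```

This returns four elements: an <h1> and a <p> for each name. Without the parentheses, the comma would end the FLWOR expression, so the second element would no longer belong to the return clause.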

Note that we begin in XQuery mode, where everything we’ve typed is understood and interpreted as XQuery. Once we begin to return XML (in this case, HTML), though, we switch into XML mode, where everything we type is interpreted as XML to be returned. Where we need to switch back into XQuery mode, so that what we’ve typed will be interpreted, and not returned literally, we need to use curly braces. For example, inside the HTML <body> we’ve embedded XQuery, and if we didn’t surround the XQuery with curly braces, the XQuery code would be returned literally inside the <body> tags, instead of being interpreted, with the results of the interpretation returned. In this XQuery we change modes several times: we begin writing XQuery, we return HTML, inside the HTML <body> element we start interpreting XQuery again, inside the inner return clause we return a sequence of two HTML elements (<h1> and <ol>), and inside each of those HTML elements we switch back to XQuery to output the results inside the HTML tags. In each case, returning an XML element while you’re in XQuery mode puts you into XML mode, and using curly braces inside the XML while you’re in XML mode switches you into XQuery mode.
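The effect of the curly braces is easiest to see side by side. In this sketch the variable and its value are placeholders for illustration.

```xquery
xquery version "3.0";
(: Placeholder value for illustration :)
let $count := 22
return (
    <p>count: $count</p>,      (: no braces: the text "$count" is output literally :)
    <p>count: {$count}</p>     (: braces: the expression is evaluated, giving 22 :)
)
```

The first <p> contains the literal text count: $count, and the second contains count: 22.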