Digital humanities

Maintained by: David J. Birnbaum. Creative Commons BY-NC-SA 3.0 Unported License. Last modified: 2021-04-22T21:36:35+0000

XQuery assignment #1: answers

Use the collection of Shakespeare plays that have been uploaded to Obdurodon to do the following:

  1. The task

    Find all of the titles of all of the Shakespeare plays in the corpus. You’ll need to read our posting on the main course page on Obdurodon for information about how to address the collection of plays, and also about how to retrieve the full text of one of the plays so that you can look at it and see where the title is, which you’ll need to know in order to construct the XPath to retrieve it. The simplest answer is a single XPath expression. The output should look something like:

    <title xmlns="http://www.tei-c.org/ns/1.0">The Tragedy of Hamlet, Prince of Denmark</title>
    <title xmlns="http://www.tei-c.org/ns/1.0">The Tragedy of Macbeth</title>
    <title xmlns="http://www.tei-c.org/ns/1.0">The Tragedy of Romeo and Juliet</title>

    Our solution

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    collection('/db/apps/shakespeare/data')//tei:titleStmt/tei:title

    The collection('/db/apps/shakespeare/data') portion of the preceding XPath expression returns the document nodes of the plays. We then find their <titleStmt> descendants and the <title> child elements of those <titleStmt> elements. We specify that we want <title> elements only when they are children of <titleStmt> elements because the <title> element is used for more than just the titles of the plays. Restricting the context lets us exclude the <title> elements that are not titles of plays.

  2. The task

    Modify your XPath above to return just the text of the titles, without the tags. You can do that by using text() or data() or string() (which you might want to look up in Michael Kay). Your answer should look something like:

    The Tragedy of Hamlet, Prince of Denmark
    The Tragedy of Macbeth
    The Tragedy of Romeo and Juliet

    Our solution

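    declare namespace tei="http://www.tei-c.org/ns/1.0";
    data(collection('/db/apps/shakespeare/data')//tei:titleStmt/tei:title)

    Wrapping data() around the result extracts its typed data value, which in this case is its string value. You could also use text(), string(), or data() as a final path step:

    collection('/db/apps/shakespeare/data')//tei:titleStmt/tei:title/text()
    collection('/db/apps/shakespeare/data')//tei:titleStmt/tei:title/string()
    collection('/db/apps/shakespeare/data')//tei:titleStmt/tei:title/data()
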
    text() isn’t a function, even though it looks like one; it’s a node test that matches text nodes. For that reason, you can’t put anything inside the parentheses if you use text(); you just use it as a path step to say: once you have the right <title> element, retrieve any text nodes that are its children, i.e., that are directly inside it. Since we know that each <title> element here contains exactly one text node, this will get us what we want. In general, though, if you want the textual content of an element, it’s safer to use the string() function as a path step, because it also includes the text inside any descendant elements.

    You can’t wrap string() around the entire XPath expression because string() can take only a single item as its argument, and the XPath in this case returns forty-two items. (If you know that you’re going to return only one item, you can wrap string() around it, so string(collection('/db/apps/shakespeare/data')[1]//tei:titleStmt/tei:title) is fine.) The data() function doesn’t have that limitation, and can be used with any number of items. You also can include or omit the dot inside string() or data() and they will default to giving you the string value or the typed data value (which are the same thing in the case of strings) of the current context.
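
    To make the difference concrete, here is a minimal sketch that runs without the database. The two <title> elements are hypothetical stand-ins for the real results, constructed inline for illustration only:

    ```xquery
    (: Toy data: hypothetical stand-ins for the real <title> elements. :)
    let $titles := (
      <title>The Tragedy of Hamlet, Prince of Denmark</title>,
      <title>The Tragedy of Macbeth</title>
    )
    return (
      data($titles),       (: accepts a sequence of any length :)
      $titles/string(),    (: path step: string() runs once per <title> :)
      string($titles[1])   (: fine, because the argument is exactly one item :)
      (: string($titles) would raise a type error: two items, not one :)
    )
    ```

    Used as a path step, string() is evaluated separately for each <title>, which is why it never sees more than one item at a time.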

  3. The task

    Find the plays that contain more than 40 unique speakers and then return their <title> elements. You will need to use count() and distinct-values(). Find the collection, drill down to the <TEI> elements in the collection (you know there are 42 of them), then filter them based on whether or not they contain more than 40 distinct <speaker> elements. Once you’re getting the 14 plays that meet that description, you can add a path step to retrieve their <titleStmt>, and then the <title>.

    Our solution:

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    collection('/db/apps/shakespeare/data')//tei:TEI[count(distinct-values(.//tei:speaker)) gt 40]//tei:titleStmt/tei:title

    The collection('/db/apps/shakespeare/data')//tei:TEI portion of the XPath returns the 42 root <TEI> elements. The predicate then filters them according to whether the count of the distinct <speaker> element values in each of them is greater than 40, and it keeps only the ones where that test returns true. The dot before the double slash is crucial; without it, you’re counting all of the <speaker> elements in the entire database, instead of just in the play you’re looking at at that moment. Because this is an easy mistake to make, it’s safer to specify the descendant axis explicitly, as distinct-values(descendant::tei:speaker).
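
    The filtering logic can be tried on a small scale without the database. The sketch below is illustrative only: the <play>, <title>, and <speaker> elements are hypothetical toy data in no namespace, and the threshold is lowered from 40 to 1 so that two tiny plays are enough:

    ```xquery
    (: Toy data with hypothetical element names; the real plays use TEI elements. :)
    let $plays := (
      <play>
        <title>A</title>
        <speaker>X</speaker><speaker>Y</speaker><speaker>X</speaker>
      </play>,
      <play>
        <title>B</title>
        <speaker>X</speaker>
      </play>
    )
    (: Keep only plays with more than one distinct speaker value. :)
    return $plays[count(distinct-values(descendant::speaker)) gt 1]/title
    ```

    Play A has three <speaker> elements but only two distinct values (X and Y), so it passes the test; play B has one and is filtered out. The descendant:: axis is evaluated relative to each <play> in turn, which is the same job the leading dot does in .//speaker.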

    You could also solve this with a FLWOR expression:

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    for $play in collection('/db/apps/shakespeare/data')/tei:TEI
    where count(distinct-values($play//tei:speaker)) gt 40
    return $play//tei:titleStmt/tei:title

    The first line after the namespace declaration rounds up the plays and sets the variable $play to point to each <TEI> node in turn. This creates a sequence of 42 items and iterates over them. The next line uses a where clause to filter that sequence by checking how many distinct speakers each play has. Only those with more than 40 distinct speakers will make it to the last line, which returns the title (<title> child of a <titleStmt> descendant) of each surviving play. In the FLWOR expression you need to use $play before the double slash because that’s how you communicate the context across clauses. The dot won’t work here because each clause is a stand-alone part of the FLWOR expression, and it has no current context until you give it one.

    You can combine the pure XPath and the FLWOR strategies, using an XPath predicate to filter, but keeping the for and return parts of the FLWOR expression. That might look like:

    declare namespace tei="http://www.tei-c.org/ns/1.0";
    for $play in collection('/db/apps/shakespeare/data')/tei:TEI[count(distinct-values(descendant::tei:speaker)) gt 40]
    return $play//tei:titleStmt/tei:title

    By the way, you can split XPath expressions across lines, so if the long line becomes hard to manage, you could break it up, along the lines of:

    for $play in collection('/db/apps/shakespeare/data')
    /tei:TEI[count(distinct-values(.//tei:speaker)) gt 40]
    return $play//tei:titleStmt/tei:title

    You can indent the second line of that, if it makes it easier to read, but that isn’t required and it has no effect on the processing. (You can’t, of course, break a line in the middle of a word!)

    So which is better, an XPath predicate or a where expression?

    As far as XQuery is concerned, they do the same thing, so one answer is that you should use whichever you find more comfortable and convenient. In some complex cases you may not be able to use an XPath predicate; the syntax of where expressions is more flexible. As far as eXist-db (the particular XML database we’re using; there are others) is concerned, though, it is better at optimizing XPath predicates than where expressions. If you use a where expression and the execution is unacceptably slow, try rewriting it as an XPath predicate. For what it’s worth, we almost always use XPath predicates in our own XQuery.
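
    If you want to convince yourself that the two strategies select the same items, one possible check is to run both over the same input and compare the results. This is only a sketch: the toy <play> elements and the variable names $by-predicate and $by-where are hypothetical, and the threshold is 1 rather than 40:

    ```xquery
    (: Toy data with hypothetical element names, not the TEI corpus. :)
    let $plays := (
      <play><title>A</title><speaker>X</speaker><speaker>Y</speaker></play>,
      <play><title>B</title><speaker>X</speaker><speaker>X</speaker></play>
    )
    (: Strategy 1: filter with an XPath predicate. :)
    let $by-predicate := $plays[count(distinct-values(descendant::speaker)) gt 1]/title
    (: Strategy 2: filter with a where clause. :)
    let $by-where :=
      for $play in $plays
      where count(distinct-values($play/descendant::speaker)) gt 1
      return $play/title
    (: deep-equal() compares the two result sequences item by item. :)
    return deep-equal($by-predicate, $by-where)
    ```

    Both strategies keep only play A here, so the comparison should come out true. This says nothing about speed, which depends on how the database optimizes each form.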

  4. The task

    Modify your solution to the preceding question #3 to return just the text of the play title, without the <title> tags. You can take the same approach as you did for the transition from question #1 to question #2.

    Our solution

    Any of the strategies you used for #2 will work here, as well.
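
    As one possible version, adding string() as a final path step to the predicate-based solution for #3 might look like the sketch below. It assumes the same collection path as the earlier solutions and the standard TEI namespace:

    ```xquery
    declare namespace tei="http://www.tei-c.org/ns/1.0";
    (: Same filter as in #3; the final string() step returns text instead of <title> elements. :)
    collection('/db/apps/shakespeare/data')//tei:TEI[count(distinct-values(descendant::tei:speaker)) gt 40]//tei:titleStmt/tei:title/string()
    ```

    Wrapping the whole expression in data() instead would work equally well, since data() accepts any number of items.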