Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-02-15T14:53:33+0000


XPath assignment #2 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Most (not all) speeches in Hamlet contain mostly metrical line (<l>) and anonymous block (<ab>) elements (an anonymous block is the TEI element that the tagger used to represent a non-metrical speech line). Speeches also typically contain <speaker> elements, and may also contain stage directions (<stage>). We have deliberately left out at least one other type of subelement found in speeches. Based on this understanding:
    1. What XPath would find all of the speeches that do not contain any metrical lines as immediate children? How many are there?

      //sp[not(l)]

      For this answer, we use the predicate [not(l)] to check, for each speech we find, whether it has any <l> elements on the child axis. We keep only speeches that do not contain any line elements as children, throwing away any speeches that do. There are 451 speech elements without line children. Some of you specified //sp[not(child::l)]. This is syntactically valid XPath, but it isn’t consistent with best practice; most coders wouldn’t mention the child axis explicitly because it’s the default.

    2. What XPath would find all of the speeches that do not contain either any metrical lines (<l>) or any anonymous blocks (<ab>)? How many are there? What do they contain instead? (In the assignment we wrote: As in #2d in XPath assignment #1, you haven’t yet learned the XPath to return a list of the types of elements they do contain, but if you find them all, you can scan the brief list that <oXygen/> returns, click on each one to see it in context, and see what’s going on. But we actually covered this in class Wednesday; you can use the name() function in a path step to get the names of the elements, rather than the elements themselves.)

      //sp[not(l|ab)] or //sp[not(l or ab)] or //sp[not(l)][not(ab)] or //sp[not(l) and not(ab)] or //sp except (//sp[l]|//sp[ab])

      These path expressions all return speeches that do not have either <l> or <ab> elements as immediate children. There are seven of these elements. When you navigate to these speeches in the list that <oXygen/> returns, you find that the other child elements of speeches that we concealed from you are line groups, or <lg>. You can get those child elements by adding an asterisk as the next path step, meaning find all of the child elements of these <sp> elements, whatever they might be: //sp[not(l|ab)]/* or //sp[not(l or ab)]/* or //sp[not(l)][not(ab)]/* or //sp[not(l) and not(ab)]/*. You can get the element names, instead of the elements themselves, by adding the name() path step after the asterisk: //sp[not(l|ab)]/*/name() or //sp[not(l or ab)]/*/name() or //sp[not(l)][not(ab)]/*/name() or //sp[not(l) and not(ab)]/*/name(). You can get rid of the duplicates by wrapping the entire expression in the distinct-values() function, e.g., distinct-values(//sp[not(l|ab)]/*/name()), etc.

      The first version uses the union operator. l|ab matches the union of all lines and anonymous blocks, that is, everything that is either a line or an anonymous block (and some people refer to the union connector as the or connector). The path //sp collects all of the speeches, and the predicate then checks on the child axis (the default, since no axis is specified) and retains only those speeches that do not have either lines or anonymous blocks as immediate children.

      The second version uses the or operator. Without the not() function, it would keep all <sp> elements that have either <l> or <ab> children. The not() function inverts the test, so it keeps only the <sp> elements that have neither type of child. Some people find the two senses of or confusing: the union operator (|) finds elements that are either one thing or another, while the keyword or is used in complex predicates to filter a sequence according to one condition or another.

      The third version uses two predicates. It first collects all of the speeches, after which the first predicate keeps only those that don’t have line elements as children. The second predicate then filters those still further, keeping only the ones that also don’t have anonymous blocks as immediate children.

      The fourth version uses a compound predicate. It collects all of the speeches and then filters them all at once, keeping only those that both do not have any lines as children and do not have any anonymous blocks as children.

      The fifth version uses except to specify the difference of sets of nodes. See Kay pp. 628–31 for discussion. Our pattern says return all <sp> elements, but exclude from that return the union of all <sp> elements that have <l> children and all <sp> elements that have <ab> children.

      All five of these XPaths yield the same sequence of seven elements, and in all cases that use the and, or, or union (|) operators, you may change the order of the parts without changing the results. For example not(l or ab) returns the same nodes as not(ab or l).
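The equivalence of the five formulations can be checked outside <oXygen/> with a rough Python sketch. Everything in it is an assumption made for illustration: the four-speech sample document is invented (it is not bad-hamlet.xml), the helper has_child() is hypothetical, and because Python's standard xml.etree.ElementTree module does not implement not(), except, name(), or distinct-values(), those parts are imitated with ordinary Python rather than evaluated as XPath.

```python
import xml.etree.ElementTree as ET

# Invented miniature sample for illustration only; NOT the real bad-hamlet.xml.
SAMPLE = """
<body>
  <sp who="A"><speaker>A</speaker><l>verse one</l><l>verse two</l></sp>
  <sp who="B"><speaker>B</speaker><ab>prose line</ab></sp>
  <sp who="C"><speaker>C</speaker><lg><l>song line</l></lg></sp>
  <sp who="D"><speaker>D</speaker><stage>Exit</stage><lg><l>more song</l></lg></sp>
</body>
"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # all <sp> elements in document order, like //sp


def has_child(sp, tag):
    """Hypothetical helper: True if sp has an immediate child element named tag."""
    return sp.find(tag) is not None


# //sp[not(l|ab)] : the union of <l> and <ab> children is empty
v1 = [sp for sp in speeches if not (sp.findall("l") + sp.findall("ab"))]

# //sp[not(l or ab)] : negate "has an <l> child or has an <ab> child"
v2 = [sp for sp in speeches if not (has_child(sp, "l") or has_child(sp, "ab"))]

# //sp[not(l)][not(ab)] : two filters applied one after the other
v3 = [sp for sp in speeches if not has_child(sp, "l")]
v3 = [sp for sp in v3 if not has_child(sp, "ab")]

# //sp[not(l) and not(ab)] : one compound filter
v4 = [sp for sp in speeches if not has_child(sp, "l") and not has_child(sp, "ab")]

# //sp except (//sp[l]|//sp[ab]) : all speeches minus those with <l> or <ab> children
excluded = root.findall(".//sp[l]") + root.findall(".//sp[ab]")
v5 = [sp for sp in speeches if sp not in excluded]

# distinct-values(//sp[not(l|ab)]/*/name()) : child element names without duplicates
names = []
for sp in v1:
    for child in sp:
        if child.tag not in names:
            names.append(child.tag)

print([sp.get("who") for sp in v1])
print(names)
```

In this sample the speeches by C and D hold their lines inside <lg>, so their <l> elements are grandchildren rather than children, and all five filters keep exactly those two speeches.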

  2. Explain why the following four XPath expressions return different results, and describe in prose what each of them does return, and why:
    1. //sp[@who="Hamlet"]/l[1]
    2. /descendant::sp[@who="Hamlet"]/l[1]
    3. (//sp[@who="Hamlet"]/l)[1]
    4. (/descendant::sp[@who="Hamlet"]/l)[1]

    These four expressions differ in two ways: whether they use the shorthand // or the explicit descendant:: axis, and whether they wrap the path in parentheses before applying the predicate. Only the second difference changes the results here. The shorthand // abbreviates /descendant-or-self::node()/, and in these expressions it selects the same <sp> elements as /descendant::sp, so 2a and 2b return the same results as each other, as do 2c and 2d. The parentheses are what matter, because of their scoping effect: they determine the sequence of items within which the system then searches for the first item. The predicate in expressions 2a and 2b filters the lines of each speech separately, while in expressions 2c and 2d the parentheses make it filter a single sequence of all of the lines.

    2a finds all speeches by Hamlet, and then all of the child line elements of those speeches. The predicate [1] modifies the immediately preceding step of the expression, which means that it filters separately the sequence of lines from each speech by Hamlet. As a result, the predicate retains the first line element in the context of each speech, that is, the expression overall returns the first line of every speech by Hamlet.

    2b starts by looking for <sp> elements spoken by Hamlet on the descendant axis, and then finds all of the child line elements of those speeches. As in 2a, the predicate [1] modifies the immediately preceding step of the expression. This means that the expression returns the first line of each of Hamlet's speeches.

    2c also begins by finding all of the speeches by Hamlet, and then all of the child line elements of those speeches. At that point the full expression is wrapped in parentheses, making the current context a single sequence of all line elements in Hamlet speeches. The predicate [1] then applies to that entire sequence, so it filters the result by asking for the very first line in the entire sequence of all lines spoken by Hamlet. As a result 2c returns only one line.

    2d, like 2b, starts by looking for <sp> elements spoken by Hamlet on the descendant axis, and then finds all of the child line elements of those speeches. As in 2c, the predicate is then applied to the result of the entire expression, since it is wrapped in parentheses, and thus keeps only the first line spoken by Hamlet, returning only one line.

    The point is that a predicate applies to the current context, which is defined as the immediately preceding step in the path. In 2a and 2b, it’s each sequence of lines in each speech, so the predicate applies once for every speech, and returns the first line of every speech. In 2c and 2d, the parentheses cause the entire preceding expression to function as a single sequence, so the predicate applies only once, to the continuous sequence of all lines spoken by Hamlet, and therefore returns only one line, the first line spoken by Hamlet in the entire play.
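The difference between the unparenthesized and parenthesized forms can be sketched in Python. This is a minimal illustration under stated assumptions, not a substitute for running the expressions in <oXygen/>: the sample document and its line texts (h1, h2, and so on) are invented, and Python's standard xml.etree.ElementTree module supports only a small XPath subset, so the parenthesized form is imitated by slicing the full result list. The module has no descendant:: axis, so only the // spelling is shown.

```python
import xml.etree.ElementTree as ET

# Invented miniature sample for illustration only; NOT the real bad-hamlet.xml.
SAMPLE = """
<body>
  <sp who="Hamlet"><l>h1</l><l>h2</l></sp>
  <sp who="Horatio"><l>o1</l></sp>
  <sp who="Hamlet"><ab>prose</ab></sp>
  <sp who="Hamlet"><l>h3</l><l>h4</l></sp>
</body>
"""

root = ET.fromstring(SAMPLE)

# Like //sp[@who="Hamlet"]/l[1]: the position test is applied within each
# speech, so every Hamlet speech that has <l> children contributes its first one.
per_speech = root.findall(".//sp[@who='Hamlet']/l[1]")

# Like (//sp[@who="Hamlet"]/l)[1]: first gather every <l> child of every
# Hamlet speech into one sequence, then take the first item of that sequence.
all_lines = root.findall(".//sp[@who='Hamlet']/l")
whole_sequence = all_lines[:1]

print([l.text for l in per_speech])      # one line per qualifying speech
print([l.text for l in whole_sequence])  # a single line
```

The Hamlet speech that contains only an <ab> contributes nothing to either result, which mirrors how l[1] finds no match in a speech without line children.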