Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-02-12T14:53:50+0000


XPath assignment #3 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and functions. There may be more than one possible answer.

Notation: For ease in recognition, from now on when we refer in discussion to an attribute name, we’ll precede it with an at sign (@). In other words, when we write about the @id attribute in question #2, below, the name of the attribute is actually id (without an at sign).

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. What XPath expressions will find the last stage direction <stage> in the entire document? (Note: there should be only one!)

    One possible answer is (//stage)[last()]. This collects all of the stage directions in the entire document, forms them into one big sequence with parentheses, and then uses the predicate [last()] to keep only the last item in that sequence.

    Alternatively, you could have used //stage[not(following::stage)]. The last <stage> will not have any others following it. The predicate here makes use of the following:: axis, which selects every node that begins after the current context node ends, anywhere in the rest of the document. Note the difference between this axis and the following-sibling:: axis, which would only check for following <stage> elements within the same parent element. The long axes (preceding and following) are less efficient computationally than the others because they don’t take advantage of the tree structure, and tree traversal is more efficient than just walking through the document looking for elements. In a document of this small size you won’t notice a difference, but in a large production system you might want to avoid the long axes if there is an alternative. (For a real-world example where the efficiency of the long axes was a serious issue, see An XML user steps into, and escapes from, XPath quicksand.)

    You may have tried //stage[last()] and been surprised to get 218 answers. That can’t be correct, since there can be only one last stage direction in the entire document. The possibly unexpected result comes about because positional predicates applied to nodes don’t actually refer to the position in the entire sequence that matches the path expression. Rather, the positional predicate refers to the order of each node among its siblings, which isn’t the same as the order among all nodes matched by the XPath expression. //stage[last()] actually returns every <stage> that does not have a following-sibling <stage>, that is, that isn’t followed by another <stage> that has the same parent. Using the parentheses, in the first solution above, coerces all the <stage> nodes into behaving like a single sequence, irrespective of who counts as whose siblings. This was the point of the second question in XPath Assignment 2, and it’s a famous XPath gotcha. See the discussion in Michael Kay, pp. 542, 618, 702–03.

    The fine print: If you’ve had a good night’s sleep, you might want to read up on the exact meaning of // in Michael Kay at pp. 626–28, as well as the discussion of the difference between predicates in filter expressions and predicates in axis steps at p. 639. We use // as if it means “look anywhere in the document”, and it sort of does, but there are complications with numerical predicates. You don’t need to understand these details; you can just remember to use parentheses in situations like “find the first or last instance of a particular element in the document”.
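
    The per-parent behavior of [last()] can be reproduced outside <oXygen/> as well. The following is a minimal sketch in Python, assuming a tiny made-up document rather than the real Bad Hamlet file; the element contents are hypothetical. It relies on the limited XPath subset in Python's standard xml.etree.ElementTree module, which also evaluates [last()] relative to each parent, and it emulates the parenthesized form by collecting all matches first and then taking the final one.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, not the real Bad Hamlet file.
XML = """
<play>
  <scene>
    <stage>Enter Barnardo</stage>
    <sp><speaker>Barnardo</speaker><l>Who's there?</l><stage>Aside</stage></sp>
    <stage>Exit</stage>
  </scene>
  <scene>
    <stage>Enter Ghost</stage>
    <stage>Exit Ghost</stage>
  </scene>
</play>
"""
root = ET.fromstring(XML)

# Like //stage[last()]: the predicate is evaluated per parent, so every
# <stage> that is the last <stage> child of its own parent is kept.
per_parent = [s.text for s in root.findall(".//stage[last()]")]

# Like (//stage)[last()]: gather every <stage> in document order first,
# then take the final item of that single sequence.
overall = root.findall(".//stage")[-1].text

print(per_parent)  # three results, one per parent that contains a <stage>
print(overall)     # a single result
```

    In this toy document three different parents contain stage directions, so the first expression keeps three of them, while the second keeps only the one that comes last in the whole document.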

  2. What XPath expression will find the last member in the cast list at the beginning of the document and return the value of the @xml:id attribute that is associated with it?

    //castList/castItem[last()]/role/@xml:id

    After looking at the document, you can see that the basic path you want to follow is to find the last <castItem>, get its <role> child (there can be only one <role> per <castItem>), and then get the @xml:id of the <role>.

    You may have tried //role[last()]/@xml:id, but that returns the @xml:id values of all 37 <role> elements. The reason is that it’s returning the last <role> in each group of <role> siblings. It turns out that <role> elements in this document are only children, and don’t have siblings, so each one is the last (and only) <role> in its group, and you get all of them. What would work, though, is (//role)[last()]/@xml:id or /descendant::role[last()]/@xml:id, since both of those form all of the <role> elements into a single sequence before selecting the last of them. You need to put the parentheses in the right place, though; if you try (//role[last()])/@xml:id, you’ll get all 37, since you use your last() predicate first and then form the big sequence with the parentheses, which means that you get a sequence of all 37 last <role> elements.

    You may have thought that you could eliminate the <castList> from the path and just use //castItem[last()]/role/@xml:id, but that returns three results. If you click on them in the results box at the bottom of the <oXygen/> window, you’ll see where the two spurious ones come from.

    The XPaths discussed here return the @xml:id attribute node, and not its value, but the question asked for the value of the node, and that isn’t the same as the node itself. A fully correct answer would add an additional path step: //castList/castItem[last()]/role/@xml:id/string() or //castList/castItem[last()]/role/@xml:id/data(.). The difference between the attribute and its value is discussed as part of bonus question #2, below.
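
    One way that dropping <castList> from the path can produce extra matches is sketched below in Python. This is only an illustration under stated assumptions: the cast list here is invented, the <castGroup> nesting is a guess at the kind of structure that could cause the problem, and the real file may be organized differently. ElementTree cannot select attribute nodes with XPath, so the sketch selects the <role> elements and then reads the attribute value from each.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical cast list; the nesting in the real file may differ.
XML = """
<TEI>
  <castList>
    <castItem><role xml:id="Hamlet">Hamlet</role></castItem>
    <castGroup>
      <castItem><role xml:id="Rosencrantz">Rosencrantz</role></castItem>
      <castItem><role xml:id="Guildenstern">Guildenstern</role></castItem>
    </castGroup>
    <castItem><role xml:id="Fortinbras">Fortinbras</role></castItem>
  </castList>
</TEI>
"""
root = ET.fromstring(XML)

# Like //castList/castItem[last()]/role: only <castItem> children of
# <castList> compete for "last", so exactly one <role> is returned.
anchored = [r.get(XML_ID)
            for r in root.findall(".//castList/castItem[last()]/role")]

# Like //castItem[last()]/role: every parent that holds <castItem>
# children contributes its own last one.
unanchored = [r.get(XML_ID)
              for r in root.findall(".//castItem[last()]/role")]

print(anchored)
print(unanchored)
```

    In the toy data the anchored path yields one identifier, while the unanchored path also picks up the last <castItem> inside the nested group.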

  3. What XPath expression will find all <sp> elements with more than 8 line (<l>) subelements? You’ll need to use the count() function (Kay 733–34).

    //sp[count(descendant::l) gt 8] or //sp[count(.//l) gt 8]

    This expression finds all <sp> elements in the document and filters them by counting the number of <l> descendants they have and checking whether that count is greater than 8. We used the gt value comparison test for greater than; you could also use the > general comparison test, and you can spell that either with the raw > character or, when the XPath is embedded in an XML document such as an XSLT stylesheet, with the &gt; entity reference. In this context, where there is only one item on either side of the test (the count of lines to the left and the integer 8 to the right), there’s no difference between value comparison and general comparison. If either side were a sequence of more than one item, though, you would have to use general comparison (value comparison works only with exactly one item on each side), and it may not mean what you think. What would it mean to ask whether the count of lines was greater than the sequence (8, 10)? That turns out not to be an error; it has a meaning, which you can look up under general comparison in Michael Kay.

    If you tried //sp[count(//l) gt 8], without the leading dot in the predicate, you got every <sp> in the document, all 1137 of them. The reason is that without the dot, you look on the descendant axis from the document node, that is, from the very top of the document (remember that a path that begins with a double slash starts from the top of the document), so for each of the 1137 speeches in the document you test whether the total number of lines in the document is greater than 8. It always is, so the test always succeeds, and you wind up keeping all of the speeches. Using the dot inside the predicate tells the processor that it should look on the descendant axis from the current context node; remember that the dot in XPath represents the current context node.

    Note that the question asked for subelements, so the answer should look for descendants, and not just children. See bonus question #3, below, for discussion of the difference.
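
    The effect of the leading dot can be imitated in Python. This is a rough sketch, assuming an invented four-speech document and a threshold lowered from 8 to 2 so that the toy data can show both outcomes. ElementTree has no count() function, so len() over the matched elements stands in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical four-speech document; the threshold is lowered from 8 to 2.
XML = """
<play>
  <sp><speaker>Hamlet</speaker><l>a</l><l>b</l><l>c</l></sp>
  <sp><speaker>Horatio</speaker><l>d</l></sp>
  <sp><speaker>Ophelia</speaker><lg><l>e</l><l>f</l><l>g</l></lg></sp>
  <sp><speaker>Hamlet</speaker><l>h</l><lg><l>i</l><l>j</l></lg></sp>
</play>
"""
root = ET.fromstring(XML)
speeches = root.findall(".//sp")

# Like //sp[count(.//l) gt 2]: count the lines below each speech.
relative = [sp.findtext("speaker") for sp in speeches
            if len(sp.findall(".//l")) > 2]

# Like //sp[count(//l) gt 2]: the count starts from the top of the
# document, so it is the same number (all 10 lines) for every speech.
absolute = [sp.findtext("speaker") for sp in speeches
            if len(root.findall(".//l")) > 2]

print(relative)  # only the speeches that are long themselves
print(absolute)  # every speech, because the document has more than 2 lines
```

    The second list contains every speaker, including the one whose speech has a single line, which mirrors getting all 1137 speeches in the real document.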

  4. Building on your answer to the preceding question, what XPath expression will tell you how many line subelements each of those speeches actually has?

    //sp[count(.//l) gt 8]/count(.//l)

    The preceding answer returned a sequence of 94 <sp> elements. The additional path step in this answer applies to each of those in turn; it uses the dot to start from the current context node, the <sp> currently being processed, and finds and counts all of the <l> elements on its descendant axis.

  5. Building on your answers to the preceding two questions, what XPath expression will find the speakers of all speeches that have more than 8 line subelements? Once you’ve found the speeches that have more than 8 lines, you can find the speakers of those speeches by just adding another path step, but you’ll get some duplication, since a single person may have more than one long speech. Your answer to this question should get rid of the duplicates, and return just a list of names of speakers without duplication. You’ll need to use the distinct-values() function (Kay 749–50).

    distinct-values(//sp[count(.//l) gt 8]/speaker)

    Starting with the answer to #3, instead of adding count(.//l), as we did in #4, and getting a count of the lines, we add speaker and get the <speaker> child of each speech. Since there are 94 speeches, we get a sequence of 94 speakers. We get rid of the duplicates by wrapping that sequence in the distinct-values() function.
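
    The last two answers add one more step to the same filtered sequence of speeches. A minimal Python sketch of that idea follows, again assuming an invented document and a threshold of 2 instead of 8. The de-duplication here keeps first-seen order, which is a choice made for this sketch; in XPath the order of the values returned by distinct-values() is left to the implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the threshold is lowered from 8 to 2.
XML = """
<play>
  <sp><speaker>Hamlet</speaker><l>a</l><l>b</l><l>c</l></sp>
  <sp><speaker>Horatio</speaker><l>d</l></sp>
  <sp><speaker>Ophelia</speaker><lg><l>e</l><l>f</l><l>g</l></lg></sp>
  <sp><speaker>Hamlet</speaker><l>h</l><lg><l>i</l><l>j</l></lg></sp>
</play>
"""
root = ET.fromstring(XML)

# Like //sp[count(.//l) gt 2]: the shared starting point.
long_speeches = [sp for sp in root.findall(".//sp")
                 if len(sp.findall(".//l")) > 2]

# Like .../count(.//l): one count per long speech.
line_counts = [len(sp.findall(".//l")) for sp in long_speeches]

# Like .../speaker: one speaker per long speech, duplicates included.
speakers = [sp.findtext("speaker") for sp in long_speeches]

# Like distinct-values(...): drop the duplicates (first-seen order here).
distinct_speakers = list(dict.fromkeys(speakers))

print(line_counts)
print(speakers)
print(distinct_speakers)
```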

Optional bonus questions

  1. Question #1, above, asked you to provide an XPath that would find the last stage direction (<stage>) in the play. What XPath would find the last line (<l>) in the play? What XPath would find the last stage direction or line (that is, whichever of the last stage direction and last line comes last)? You’ll need to use the union operator (Kay 628–31).

    You can find the last line with /descendant::l[last()] or (//l)[last()]. You can find the last stage direction or line with (/descendant::l | /descendant::stage)[last()]. Reading from the inside out, we use /descendant::l to find all of the lines in the play and /descendant::stage to find all of the stage directions. We join those with the union operator (|) to create a sequence of all of the nodes returned by both of those paths, that is, all lines and all stage directions. We wrap that union in parentheses to form it into one long sequence and then use the last() function in a predicate to select the last item in that sequence in document order.

    Note that it’s best to find all of the lines and stage directions, combine them, and then take the last item in the combined sequence. There’s no benefit in getting the last line and the last stage direction separately, since you really only care about what’s last in the merged sequence.
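
    The merge-then-select idea can be sketched in Python as follows. ElementTree has no union operator, so this sketch substitutes a different technique: it walks the whole tree in document order and keeps the elements with either name, which gives the same merged sequence that the union would build. The document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, not the real Bad Hamlet file.
XML = """
<play>
  <stage>Enter Hamlet</stage>
  <sp><speaker>Hamlet</speaker><l>first line</l><l>second line</l></sp>
  <stage>Exeunt</stage>
</play>
"""
root = ET.fromstring(XML)

# Like (//l)[last()]: all lines first, then the final one.
last_line = root.findall(".//l")[-1]

# Like (/descendant::l | /descendant::stage)[last()]: root.iter() visits
# elements in document order, so filtering by name yields the merged
# sequence of lines and stage directions.
merged = [e for e in root.iter() if e.tag in ("l", "stage")]
last_of_either = merged[-1]

print(last_line.text)
print(last_of_either.tag, last_of_either.text)
```

    In this toy document the play ends with a stage direction, so the last item of the merged sequence is a <stage> even though the last line comes earlier.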

  2. Question #2, above, asked you to provide an XPath that would find the @xml:id associated with the last cast member in the cast list. What’s the difference between an XPath that returns the @xml:id attribute itself and an XPath that returns just the value of the @xml:id attribute? That is, what are the two XPath expressions and what object does each of them return? You’ll need to use the data() or string() function (Kay 741–43, 877–79).

    When your path ends with something like @xml:id, what you return is an attribute node. If you were copying that into a new XML document as part of an XSLT transformation, you would create an attribute on whatever element you had just created in the output XML document. If you extend the path as //castList/castItem[last()]/role/@xml:id/string() or //castList/castItem[last()]/role/@xml:id/data(.), though, you’ll get the value of the attribute, instead of the attribute node itself. If you copy the value into your output XML, you don’t get an attribute node; you just get the string value.

    In the <oXygen/> XPath debugger interface there isn’t much difference between retrieving the attribute node and retrieving its string value. In an XSLT transformation, though, the distinction matters: you don’t want to create an attribute in your output document when you meant to write a string value, or vice versa.
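
    The node-versus-value distinction is not specific to XPath. As a loose analogy only, Python's standard xml.dom.minidom module also separates the two: one call hands back an attribute node object, and another hands back the plain string. The single <role> element below is made up for the illustration.

```python
from xml.dom import minidom

# Namespace that the predefined xml: prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical one-element document.
doc = minidom.parseString('<role xml:id="Fortinbras">Fortinbras</role>')
role = doc.documentElement

# Roughly like a path ending in @xml:id: an attribute node object.
attr_node = role.getAttributeNodeNS(XML_NS, "id")

# Roughly like a path ending in @xml:id/string(): just the string.
attr_value = role.getAttributeNS(XML_NS, "id")

print(type(attr_node).__name__)
print(attr_value)
```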

  3. Question #3, above, asked you to provide an XPath that would find all of the speeches (<sp> elements) with more than 8 line (<l>) subelements. What are the XPaths to find speeches with more than 8 line child elements and speeches with more than 8 descendant line elements? How do those results differ? If there are descendant line elements that are not children of a speech, what is their parent? If you don’t know the types of their parent elements in advance, what XPath expression will tell you?

    The XPath //sp[count(descendant::l) gt 8] or //sp[count(.//l) gt 8] returns the sequence of all <sp> elements that have more than 8 <l> descendants. As we mention in the answer to regular question #3, above, there are 94 of them. To find just the children, but not other descendants, use //sp[count(l) gt 8], which returns 87 <sp> elements. (You could write //sp[count(./l) gt 8], but the dot and slash aren’t needed [= shouldn’t be used] here, since the child axis is the default axis.) Since the task was to find lines that are descendants of speeches but not their children, the most direct route might be //sp//l[not(parent::sp)]. This finds all speeches, and then all of their line descendants, and then uses a predicate to keep only the lines that don’t have a parent of type <sp>. You can retrieve the parents themselves (instead of retrieving the lines and just filtering them by their parents) with //sp//l/..[not(self::sp)]. This is the main reason for the existence of the self:: axis; this XPath can be read as: Find all speeches in the play, and then all of their line descendants, and then the parents of the lines as long as the parents are not of type <sp>. To get the type, you can add a path step that uses the name() function: //sp//l/..[not(self::sp)]/name(), and to remove the duplicates you can wrap the entire expression in distinct-values(): distinct-values(//sp//l/..[not(self::sp)]/name()). The answer is that lines that are not immediate children of speeches are children of line groups (<lg>).
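
    The child-versus-descendant comparison and the search for parent names can be imitated in Python. This is a sketch under assumptions: the document is invented, the threshold is 2 rather than 8, and because ElementTree elements do not record their parents, a parent lookup table (the hypothetical name parent_of) replaces the parent:: and self:: axes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the threshold is lowered from 8 to 2.
XML = """
<play>
  <sp><speaker>Hamlet</speaker><l>a</l><l>b</l><l>c</l></sp>
  <sp><speaker>Horatio</speaker><l>d</l></sp>
  <sp><speaker>Ophelia</speaker><lg><l>e</l><l>f</l><l>g</l></lg></sp>
  <sp><speaker>Hamlet</speaker><l>h</l><lg><l>i</l><l>j</l></lg></sp>
</play>
"""
root = ET.fromstring(XML)
speeches = root.findall(".//sp")

# Like //sp[count(.//l) gt 2] versus //sp[count(l) gt 2].
by_descendants = [sp for sp in speeches if len(sp.findall(".//l")) > 2]
by_children = [sp for sp in speeches if len(sp.findall("l")) > 2]

# Lookup table from each element to its parent (stand-in for parent::).
parent_of = {child: parent for parent in root.iter() for child in parent}

# Like //sp//l[not(parent::sp)]: line descendants that are not children.
nested_lines = [line for sp in speeches for line in sp.findall(".//l")
                if parent_of[line].tag != "sp"]

# Like distinct-values(//sp//l/..[not(self::sp)]/name()).
parent_names = sorted({parent_of[line].tag for line in nested_lines})

print(len(by_descendants), len(by_children))
print([line.text for line in nested_lines])
print(parent_names)
```

    With the toy data, counting descendants finds more speeches than counting children, and the only parent name reported for the nested lines is lg, which is the same shape of result as in the real document.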