Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-10-22T13:35:36+0000


XPath assignment #3 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and functions. There may be more than one possible answer.

Notation: For ease of recognition, from now on when we refer in discussion to an attribute name, we’ll precede it with an at sign (@). In other words, when we write about the @xml:id attribute in question #2, below, the name of the attribute is actually xml:id (without an at sign).

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. What XPath expressions will find the last stage direction <stage> in the entire document? (Note: there should be only one!)

    One possible answer is (//stage)[last()]. This collects all of the stage directions in the entire document, forms them into one big sequence with parentheses, and then uses the predicate [last()] to keep only the last item in that sequence.

    Alternatively, you could have used //stage[not(following::stage)]. The last <stage> will not have any others following it. The predicate here makes use of the following:: axis, which searches the rest of the tree following the current context. Note the difference between this axis and the following-sibling:: axis, which would only check for following <stage> elements within the same parent element. The long axes (preceding and following) are less efficient computationally than the others because they don’t take advantage of the tree structure, and tree traversal is more efficient than just walking through the document looking for elements. In a document this small you won’t notice a difference, but in a large production system you might want to avoid the long axes if there is an alternative. (For a real-world example where the efficiency of the long axes was a serious issue, see An XML user steps into, and escapes from, XPath quicksand.)

    You may have tried //stage[last()] and been surprised to get 218 answers, which cannot be correct since there cannot be 218 last stage directions in the entire play. Meanwhile, /descendant::stage[last()] correctly returns the single last stage direction in the entire play. For an explanation of why these two expressions behave differently, see the What does // mean? section of the posted solution to our XPath #2 exercise and the discussion in Michael Kay, pp. 542, 618, 702–03.

    The fine print: If you’ve had a good night’s sleep, you might want to read up on the exact meaning of // in Michael Kay at pp. 626–28, as well as the discussion of the difference between predicates in filter expressions and predicates in axis steps at 639. We use // as if it means look anywhere in the document, and it sort of does, but there are complications with numerical predicates. You don’t need to understand these details; you can just remember to use parentheses in situations like find the first or last instance of a particular element in the document.
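
The difference between //stage[last()] and (//stage)[last()] can be reproduced on a toy document. What follows is a minimal sketch in Python, not part of the assignment. It assumes Python's standard xml.etree.ElementTree module, which implements only a small subset of XPath 1.0, and an invented sample rather than the Bad Hamlet file. ElementTree has no parenthesized filter expression, so the "flatten first, then take the last" step is emulated by indexing the Python list.

```python
import xml.etree.ElementTree as ET

# Invented miniature document (an assumption for illustration, not Bad Hamlet).
SAMPLE = """
<play>
  <scene><stage>Enter A</stage><sp><stage>Aside</stage></sp></scene>
  <scene><stage>Enter B</stage><stage>Exeunt</stage></scene>
</play>
"""
root = ET.fromstring(SAMPLE)

# Like //stage[last()]: the predicate is applied per parent, so we get the
# last <stage> child of every element that has <stage> children.
last_per_parent = [s.text for s in root.findall(".//stage[last()]")]

# Like (//stage)[last()]: gather every <stage> in document order first,
# then keep only the final item of that single sequence.
last_in_document = root.findall(".//stage")[-1].text

print(last_per_parent)
print(last_in_document)
```

The first result has one entry for each parent that contains a <stage>, which is the same effect that produces 218 results in the real file; the second result is always a single item.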

  2. What XPath expression will find the last member in the cast list at the beginning of the document and select the @xml:id attribute that is associated with it?

    (//castItem)[last()]/role/@xml:id

    After looking at the document, you can see that the basic path you want to follow is to find the last <castItem>, get its <role> child (there can be only one <role> per <castItem>), and then get the @xml:id of the <role>.

    The path expression //castItem finds all <castItem> elements, but as was the case with the stage directions in the previous question, it effectively returns them in cohorts of siblings, so //castItem[last()] returns the last <castItem> in each cohort, and not the last one in the document. There are three such cohorts: the <castItem> children of <castList> and the <castItem> children of each of the two <castGroup> elements (Courtiers, Grave-diggers), which themselves are children of <castList>. Wrapping parentheses around //castItem at the beginning flattens all of the <castItem> elements into a single sequence, so that the predicate returns only one node, which is the one you want.
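
The three cohorts can be made visible with a small experiment. The sketch below is in Python and uses the standard xml.etree.ElementTree module (a limited XPath subset). The cast list, role names, and xml:id values are invented for illustration; only the shape (direct <castItem> children plus two <castGroup> elements) follows the description above. Python has no attribute nodes, so the helper returns the attribute value as a string.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace; ElementTree exposes
# xml:id under the expanded name below.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented cast list: <castItem> children of <castList> plus two
# <castGroup> cohorts, giving three cohorts of sibling <castItem>s.
SAMPLE = """
<castList>
  <castItem><role xml:id="A">Alpha</role></castItem>
  <castGroup>
    <castItem><role xml:id="B">Beta</role></castItem>
    <castItem><role xml:id="C">Gamma</role></castItem>
  </castGroup>
  <castItem><role xml:id="D">Delta</role></castItem>
  <castGroup>
    <castItem><role xml:id="E">Epsilon</role></castItem>
  </castGroup>
</castList>
"""
root = ET.fromstring(SAMPLE)


def role_id(cast_item):
    """Return the xml:id value of the <role> child of a <castItem>."""
    return cast_item.find("role").get(XML_ID)


# Like //castItem[last()]: one result per cohort of sibling <castItem>s.
per_cohort = [role_id(c) for c in root.findall(".//castItem[last()]")]

# Like (//castItem)[last()]/role/@xml:id: last item of the flattened sequence.
last_overall = role_id(root.findall(".//castItem")[-1])

print(per_cohort)
print(last_overall)
```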

  3. What XPath expression will find all <sp> elements with more than 8 line (<l>) subelements? You’ll need to use the count() function (Kay 733–34).

    //sp[count(descendant::l) gt 8] or //sp[count(.//l) gt 8]

    This expression finds all <sp> elements in the document and filters them by counting the number of <l> descendants they have and checking whether that count is greater than 8. We used the gt value comparison test for greater than; you could also use the > general comparison test, and you can spell that either with the raw > character or the &gt; character entity replacement. In this context, where there is only one item on either side of the test (the integer count of lines to the left and the integer 8 to the right) and the two are comparable (we’re comparing a number to a number), there’s no difference between value comparison and general comparison. If either side were a sequence of more than one item, though, you would have to use general comparison (value comparison works only with exactly one item on each side), and it may not mean what you think. What would it mean to ask whether the count of lines was greater than the sequence (8, 10)? That question turns out not to be an error; it has a meaning, which you can look up under general comparison in Michael Kay, and we also discuss it briefly at the end of our XPath functions we use most tutorial.
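
As a rough model of what general comparison does with sequences, the following Python sketch may help. It is an illustration only: the function name general_gt and the sample numbers are our own, and the sketch ignores the atomization and type-conversion rules that a real XPath processor applies.

```python
def general_gt(left, right):
    """Model the XPath general comparison `left > right` on sequences:
    true if ANY pairing of one item from each side satisfies gt."""
    return any(a > b for a in left for b in right)


line_count = 9  # hypothetical count of lines in one speech

# 9 > (8, 10) succeeds because 9 is greater than at least one item (8).
exceeds_some = general_gt([line_count], [8, 10])

# 7 > (8, 10) fails because 7 is greater than neither 8 nor 10.
exceeds_none = general_gt([7], [8, 10])

print(exceeds_some, exceeds_none)
```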

    If you tried //sp[count(//l) gt 8], without the leading dot in the predicate, you got every <sp> in the document, all 1137 of them. The reason is that without the dot, you look on the descendant axis from the document node, that is, from the very top of the document (remember that a path that begins with a slash—single or double—starts from the top of the document), so for each of the 1137 speeches in the document you test whether the total number of lines in the document is greater than 8. It always is, so the test always succeeds, and you wind up keeping all of the speeches. Using the dot inside the predicate tells the processor that it should look on the descendant axis from the current context node; remember that the dot in XPath represents the current context node. It’s easy to forget the dot, and since you’ll get a result, you may not even know that it isn’t the result you want. For that reason, we recommend spelling out descendant:: in any context where using // without a dot would have a different meaning.

    Note that the question asked for subelements, so the answer should look for descendants, and not just children. See bonus question #3, below, for discussion of the difference.
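
The effect of the missing dot can be imitated outside of XPath. The sketch below uses Python's standard xml.etree.ElementTree module on an invented three-speech document (the speakers A, B, and C and their line counts are assumptions for illustration). ElementTree has no count() function, so the counting is done with len() on the Python side.

```python
import xml.etree.ElementTree as ET


def speech(who, n, grouped=False):
    """Build one invented <sp> with n lines, optionally nested in an <lg>."""
    lines = "".join(f"<l>line {i}</l>" for i in range(n))
    body = f"<lg>{lines}</lg>" if grouped else lines
    return f"<sp><speaker>{who}</speaker>{body}</sp>"


# Three speeches with 2, 9, and 12 lines; the second nests its lines in <lg>.
root = ET.fromstring(
    "<play>" + speech("A", 2) + speech("B", 9, grouped=True) + speech("C", 12) + "</play>"
)
speeches = root.findall(".//sp")

# Like //sp[count(.//l) gt 8]: count the <l> descendants of each speech.
relative = [sp for sp in speeches if len(sp.findall(".//l")) > 8]

# Like //sp[count(//l) gt 8]: the count ignores the current speech and is
# taken over the whole document, so the test has the same outcome every time.
total_lines = len(root.findall(".//l"))
absolute = [sp for sp in speeches if total_lines > 8]

relative_speakers = [sp.findtext("speaker") for sp in relative]
absolute_speakers = [sp.findtext("speaker") for sp in absolute]
print(relative_speakers)
print(absolute_speakers)
```

Speech B is kept only because the count looks at descendants; its lines are children of <lg>, not of <sp>.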

  4. Building on your answer to the preceding question, what XPath expression will tell you how many line subelements each of those speeches actually has?

    //sp[count(descendant::l) gt 8]/count(descendant::l) (or you could use the simple map operator: //sp[count(descendant::l) gt 8] ! count(descendant::l))

    The preceding answer returned a sequence of 94 <sp> elements. The additional path step in this answer applies to each of those in turn; it starts from the current context node, the <sp> you’re looking at at the moment, and finds and counts all of the <l> elements on its descendant axis.

  5. Building on your answers to the preceding two questions, what XPath expression will find the speakers of all speeches that have more than 8 line subelements? Once you’ve found the speeches that have more than 8 lines, you can find the speakers of those speeches by just adding another path step, but you’ll get some duplication, since a single person may have more than one long speech. Your answer to this question should get rid of the duplicates, and return just a list of names of speakers without duplication. You’ll need to use the distinct-values() function (Kay 749–50).

    distinct-values(//sp[count(descendant::l) gt 8]/speaker) (or you could use the arrow operator: //sp[count(descendant::l) gt 8]/speaker => distinct-values())

    Starting with the answer to #3, instead of adding count(descendant::l) at the end, as we did in #4, and getting a count of the lines, we add speaker and get the <speaker> child of each speech. Since there are 94 speeches, we get a sequence of 94 speakers. We get rid of the duplicates by applying the distinct-values() function to that sequence.
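
The last two answers follow the same pattern: filter the speeches, then either count within each one or step to its speaker and remove duplicates. Here is a minimal Python sketch of that pattern using the standard xml.etree.ElementTree module. The speaker names and line counts are invented, and len() and dict.fromkeys() stand in for count() and distinct-values(), which ElementTree does not provide.

```python
import xml.etree.ElementTree as ET

# Invented sample: (speaker, number of lines) for five speeches.
SPEECHES = [("Ham.", 10), ("Hor.", 3), ("Ham.", 12), ("King.", 9), ("Ham.", 1)]
xml = "<play>" + "".join(
    f"<sp><speaker>{who}</speaker>" + "<l>x</l>" * n + "</sp>" for who, n in SPEECHES
) + "</play>"
root = ET.fromstring(xml)

# Like //sp[count(descendant::l) gt 8]: keep only the long speeches.
long_speeches = [sp for sp in root.iter("sp") if len(sp.findall(".//l")) > 8]

# Like //sp[...]/count(descendant::l): one count per long speech.
counts = [len(sp.findall(".//l")) for sp in long_speeches]

# Like //sp[...]/speaker: one speaker per long speech, duplicates included.
speakers = [sp.findtext("speaker") for sp in long_speeches]

# Like distinct-values(...): drop duplicates. dict.fromkeys keeps first-seen
# order, which is a Python choice; XPath leaves the order up to the processor.
distinct_speakers = list(dict.fromkeys(speakers))

print(counts)
print(speakers)
print(distinct_speakers)
```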

Optional bonus questions

  1. Question #1, above, asked you to provide an XPath that would find the last stage direction (<stage>) in the play. What XPath would find the last line (<l>) in the play? What XPath would find the last stage direction or line (that is, whichever of the last stage direction and last line comes last)? You’ll need to use the union operator (Kay 628–31).

    You can find the last line with /descendant::l[last()] or (//l)[last()]. Building on that, you can find the last stage direction or line with (/descendant::l | /descendant::stage)[last()]. Reading from the inside out, we use /descendant::l to find all of the lines in the play and /descendant::stage to find all of the stage directions. We join those with the union operator (|) to create a sequence of all of the nodes returned by both of those paths, that is, all lines and all stage directions. We wrap that union in parentheses to form it into one long sequence and then use the last() function in a predicate to select the last item in that sequence in document order. Note that the union operator doesn’t concatenate the sequences, which would put all of the lines before all of the stage directions; it maintains document order. You can verify this by changing the order of the line and stage-direction parts of the expression; you’ll get the same result.

    It’s best to find all of the lines and stage directions, combine them, and then take the last item in the combined sequence. There’s no benefit in getting the last line and the last stage direction separately, since you really only care about what’s last in the merged sequence.
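
To see why document order matters here, compare concatenation with a union on a toy document. The following is a sketch in Python with the standard xml.etree.ElementTree module. ElementTree has no union operator, so the helper function union() below is our own stand-in, and the sample text is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in which lines and stage directions are interleaved.
SAMPLE = """
<play>
  <stage>Enter</stage>
  <sp><l>first line</l><stage>Aside</stage><l>second line</l></sp>
  <stage>Exeunt</stage>
</play>
"""
root = ET.fromstring(SAMPLE)

lines = root.findall(".//l")
stages = root.findall(".//stage")

# Plain concatenation puts every line before every stage direction (or the
# reverse), so its last item depends on which operand is written first.
concat_a = [e.text for e in lines + stages]
concat_b = [e.text for e in stages + lines]


def union(tree_root, *node_lists):
    """Emulate the XPath union operator: walk the tree once (iter() visits
    elements in document order) and keep the nodes found by any operand."""
    wanted = {id(e) for nodes in node_lists for e in nodes}
    return [e for e in tree_root.iter() if id(e) in wanted]


union_a = [e.text for e in union(root, lines, stages)]
union_b = [e.text for e in union(root, stages, lines)]

print(concat_a[-1], "/", concat_b[-1])
print(union_a[-1], "/", union_b[-1])
```

The two concatenations end with different items, while the two unions are identical, which mirrors the observation that swapping the operands of | does not change the result.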

  2. Question #2, above, asked you to provide an XPath that would find the @xml:id associated with the last cast member in the cast list. What’s the difference between an XPath that returns the @xml:id attribute itself and an XPath that returns just the value of the @xml:id attribute? That is, what are the two XPath expressions and what object does each of them return? You’ll need to use the data() or string() function (Kay 741–43, 877–79).

    When your path ends with something like @xml:id, what you return is an attribute node. If you were copying that into a new XML document as part of an XSLT transformation, you would create an attribute on whatever element you had just created in the output XML document. If, though, you extend the path as (//castItem)[last()]/role/@xml:id/string() or (//castItem)[last()]/role/@xml:id/data() or their equivalents that use the simple mapping operator, you’ll get the value of the attribute, instead of the attribute node itself. If you write the value into your output XML, you don’t get an attribute node; you just get the string value.

    In the <oXygen/> XPath debugger interface there isn’t much visual difference between retrieving the attribute node and retrieving its string value. In an XSLT transformation, though, the difference matters: you don’t want to create an attribute in your output document when you meant to write a string value, or vice versa.

  3. Question #3, above, asked you to provide an XPath expression that would select all of the speeches (<sp> elements) with more than 8 line (<l>) subelements. What XPath expressions would select speeches with more than 8 line child elements and speeches with more than 8 descendant line elements? How do those results differ? If there are descendant line elements that are not children of a speech, what is their parent? If you don’t know the types of their parent elements in advance, what XPath expression will tell you?

    The XPath //sp[count(descendant::l) gt 8] or //sp[count(.//l) gt 8] returns the sequence of all <sp> elements that have more than 8 <l> descendants. As we mention in the answer to regular question #3, above, there are 94 of them. To find just the children, but not other descendants, use //sp[count(l) gt 8], which returns 87 <sp> elements. (You could write //sp[count(./l) gt 8], but the dot and slash aren’t needed, and shouldn’t be used, here, since the child axis is the default axis.) Since the task was to find lines that are descendants of speeches but that are not children of those speeches, the most direct route might be //sp//l[not(parent::sp)]. (As it turns out, you don’t need the //sp part of this path because all <l> elements happen to be descendants of <sp> elements, but in Real Life you might not always know that sort of detail.) This finds all speeches, and then all of their line descendants, and then uses a predicate to keep only the lines that don’t have a parent of type <sp>.

    You can retrieve the parents themselves (instead of retrieving the lines and just filtering them by their parents) with //sp//l/..[not(self::sp)]. This is the main reason for the existence of the self:: axis; this XPath can be read as: Find all speeches in the play, and then all of their line descendants, and then the parents of each of those lines, and filter them to keep only the ones where the parent is not of type <sp>. To get the element type (if they aren’t speeches, what are they?), you can add a path step or simple map operation that uses the name() function: //sp//l/..[not(self::sp)] ! name(), and to remove the duplicates you can wrap the entire expression in distinct-values() or use the arrow operator: //sp//l/..[not(self::sp)] ! name() => distinct-values() (the old-style version, without simple map or arrow, would be distinct-values(//sp//l/..[not(self::sp)]/name())). The answer is that all lines that are not immediate children of speeches are children of line groups (<lg>).

    You could, alternatively, use //sp//l[not(parent::sp)]/.., which, instead of finding the parents first and then using the self axis to filter out the ones that are of type <sp>, filters on the preceding path step to find only the lines that don’t have <sp> parents and then gets their parents. Whether you filter on the line step or the parent step is a matter of personal preference, and we recommend using the expression that corresponds most closely, step by step, to how you would explain what you were trying to do to your rubber duck.
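
The child/descendant distinction and the "filter the lines, then get their parents" strategy can be traced on a toy document. This is a sketch in Python with the standard xml.etree.ElementTree module. ElementTree supports neither the parent:: axis in predicates nor name(), so the lookup table parent_of and the use of .tag are our own substitutes, and the two-speech sample is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second speech keeps two of its lines inside an <lg>.
SAMPLE = """
<play>
  <sp><speaker>A</speaker><l>one</l><l>two</l></sp>
  <sp><speaker>B</speaker><l>three</l><lg><l>four</l><l>five</l></lg></sp>
</play>
"""
root = ET.fromstring(SAMPLE)

# Like count(l) versus count(descendant::l) for each speech.
child_counts = [len(sp.findall("l")) for sp in root.iter("sp")]
descendant_counts = [len(sp.findall(".//l")) for sp in root.iter("sp")]

# ElementTree elements do not know their parents, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Like //sp//l[not(parent::sp)]: lines inside speeches whose parent is not <sp>.
stray_lines = [
    line for line in root.findall(".//sp//l") if parent_of[line].tag != "sp"
]
stray_text = [line.text for line in stray_lines]

# Like .../.. ! name() => distinct-values(): step to the parents, take their
# names, and drop duplicates (sorted here only to make the output stable).
parent_names = sorted({parent_of[line].tag for line in stray_lines})

print(child_counts, descendant_counts)
print(stray_text, parent_names)
```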