Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2015-02-10T20:35:00+0000


XPath assignment #1 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Hamlet, like a typical Shakespearean tragedy, contains five acts, each of which contains scenes. Both acts and scenes are encoded as division (<div>) elements.
    1. How can XPath tell them apart?

      XPath is able to distinguish between the <div> elements that are acts and the <div> elements that are scenes because they occur at different levels of the document hierarchy. The <div> elements that are scenes are children of the <div> elements that are acts, while the <div> elements that are acts are children of the <body> element. This means that an XPath expression to find all of the immediate <div> children of the <body> element will retrieve only and exactly the five <div> elements that are acts. Finding the <div> children of those <div> elements would return the scenes.

    2. What XPath would find just the acts?

      //body/div or /TEI/text/body/div

      If we look at the document, we see that the acts and scenes occur within the <body> element, even though <div> elements are also found elsewhere in the document. To find the acts, we first want to navigate to the <body> element, which we can do directly with //body (this finds all the <body> elements in the entire document, but we know that there is only one) or by walking down the tree through all of the steps with /TEI/text/body. From there we just take another step to get down to the <div> children of the <body>: //body/div or /TEI/text/body/div. These two XPaths return exactly the same nodes, so you can use whichever you find easiest to read and understand. Most XPath developers would favor the shorter path.
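
As a rough illustration outside <oXygen/>, the two paths can be tried on a toy document with Python's standard xml.etree.ElementTree module. This is only a sketch under stated assumptions: the miniature document and its n values are invented, it omits the namespace that TEI files commonly declare, and ElementTree implements only a subset of XPath, with paths written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: two acts with scenes, plus a <div> in <front>.
SAMPLE = """
<TEI>
  <text>
    <front><div n="cast"/></front>
    <body>
      <div n="act1"><div n="1.1"/><div n="1.2"/></div>
      <div n="act2"><div n="2.1"/></div>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(SAMPLE)  # root is the <TEI> element

# ElementTree paths are relative to the element they are called on, so
# //body/div is written .//body/div and /TEI/text/body/div is ./text/body/div.
acts_short = root.findall(".//body/div")
acts_long = root.findall("./text/body/div")

print([d.get("n") for d in acts_short])  # ['act1', 'act2']
print(acts_short == acts_long)           # True: both paths reach the same nodes
```

The <div> inside <front> is not returned by either path, because it is not a child of <body>.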

    3. What XPath would find just the scenes?

      //body/div/div or /TEI/text/body/div/div

      All we do here is append one more child step, /div, to the path that found the acts in the previous question.

      Some XPath developers would prefer //div/div as the entire path, which finds all of the <div> elements (all acts and scenes) and then navigates to their children that are also <div> elements. An XPath expression doesn’t keep the intermediate stages in the path, which is to say that although it finds both acts and scenes at first, ultimately it keeps only <div> elements that are children of <div> elements, and therefore winds up keeping only the scenes.
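
The point about intermediate stages can be sketched in Python with the standard xml.etree.ElementTree module. The toy document below is hypothetical (invented n values, no TEI namespace), and ElementTree supports only part of XPath, so treat this as an approximation of what <oXygen/> would show.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play: two acts, three scenes, and one unrelated <div>.
SAMPLE = """
<TEI>
  <text>
    <front><div n="cast"/></front>
    <body>
      <div n="act1"><div n="1.1"/><div n="1.2"/></div>
      <div n="act2"><div n="2.1"/></div>
    </body>
  </text>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# //body/div/div: acts first, then one more child step down to the scenes.
scenes_long = root.findall(".//body/div/div")

# //div/div: every <div> is a candidate parent, but only <div> children survive.
scenes_short = root.findall(".//div/div")

# The intermediate step alone matches all six <div> elements.
all_divs = root.findall(".//div")

print(len(all_divs))                       # 6
print([d.get("n") for d in scenes_short])  # ['1.1', '1.2', '2.1']
print(scenes_long == scenes_short)         # True
```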

    4. What XPath would find just the scenes in Act III?

      //body/div[3]/div

      We start by finding the acts just as we did in part 2 (//body/div). This part of the XPath returns a sequence of <div> elements. Sequences have an order that in this case will be based on document order, or the order in which the <div> elements appear in the source document. This means that we can use a numerical predicate to indicate that we want the third item in the sequence of all of the <div> elements we found (//body/div[3]). The new context for the next path step is this one <div>, and we use it to find all of its child <div> elements, that is, all of the scenes of that act.

      Note that numerical (and other) predicates are used to filter sequences, and you can use a predicate at any step in a path expression, including intermediate ones. In this case we collect all of the acts and filter them to keep just one, and we then collect all of the scenes of the one act that we care about. Predicates give XPath tremendous power to navigate within a document or set of documents.
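
A numerical predicate at an intermediate step can be tried out in Python's standard xml.etree.ElementTree module, which supports positional predicates of this kind. The three-act document below is a hypothetical stand-in for the real file, with invented n values and no TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical <body> with three acts; only the third has two scenes.
SAMPLE = """
<body>
  <div n="act1"><div n="1.1"/></div>
  <div n="act2"><div n="2.1"/></div>
  <div n="act3"><div n="3.1"/><div n="3.2"/></div>
</body>
"""

body = ET.fromstring(SAMPLE)

# Starting at <body>, ./div plays the role of //body/div.
acts = body.findall("./div")
third_act = body.findall("./div[3]")
scenes_of_act3 = body.findall("./div[3]/div")

# XPath positions count from 1; Python list indexes count from 0.
print(third_act[0] is acts[2])               # True
print([d.get("n") for d in scenes_of_act3])  # ['3.1', '3.2']
```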

  2. Stage directions (<stage>) occur in a variety of contexts.
    1. What XPath would find all of the stage directions that are inside a metrical line (<l>), that is, between the starting <l> and the ending </l>? How many are there?

      There are two directions from which to approach this question. You could first find all of the lines, and then find all of the stage directions within them, or you could find all of the stage directions, and then use a predicate to weed out the ones that aren’t inside a line. Here are two possible solutions:

      //l//stage

      //stage[ancestor::l]

      In both cases there are 128 items. Notice that the first answer used the descendant:: axis (using the shorthand //), and the second answer used the ancestor:: axis. The question did not specify that the stage elements were immediately within lines (although they happen to be in the document), and so it is more correct to search as deeply or as high as possible to guarantee that you have all of the stage elements within lines. In contrast, //l/stage would have found stage directions only if they were immediate children of lines, and not if they were grandchildren or deeper, that is, not if they were inside something else that was the actual immediate child of the line. Similarly, for the second solution, we look at all of the ancestor elements, and not just the immediate parent.
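
The difference between //l//stage and //l/stage can be seen on a small invented example. The sketch below uses Python's standard xml.etree.ElementTree module; the <seg> wrapper and the n values are hypothetical, chosen only to place one stage direction deeper than a child of the line. ElementTree has no ancestor:: axis, so the second solution is emulated with a parent lookup table rather than run as XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical speech: stage "a" is a child of <l>, "b" is a grandchild of <l>,
# and "c" is a child of <sp>, outside any line.
SAMPLE = """
<sp>
  <l>To be <stage n="a">aside</stage></l>
  <l><seg><stage n="b">kneels</stage></seg></l>
  <stage n="c">Exit</stage>
</sp>
"""

root = ET.fromstring(SAMPLE)

deep = root.findall(".//l//stage")    # any depth below a line
shallow = root.findall(".//l/stage")  # immediate children of a line only

# ElementTree elements do not know their parents, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

def has_ancestor(elem, tag):
    """Walk upward from elem and report whether any ancestor has this tag."""
    node = parent_of.get(elem)
    while node is not None:
        if node.tag == tag:
            return True
        node = parent_of.get(node)
    return False

# Emulates //stage[ancestor::l]
via_ancestor = [s for s in root.iter("stage") if has_ancestor(s, "l")]

print([s.get("n") for s in deep])          # ['a', 'b']
print([s.get("n") for s in shallow])       # ['a']
print([s.get("n") for s in via_ancestor])  # ['a', 'b']
```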

    2. What XPath would find all of the stage directions that are directly inside a speech (<sp>), that is, inside a speech but not inside a line within a speech?

      //stage[parent::sp] or //sp/stage will work equally well. In the first case, you find all stage directions and then filter them to keep only those that have a speech as their immediate parent. In the second case you find all of the speeches and then get all of their children that are stage directions.

    3. What XPath would find all of the stage directions that are not directly inside a speech or a line? How many are there?

      //stage[not(parent::sp) and not(parent::l)]

      This returns 40 items. Note that it uses a compound predicate, which filters the sequence of all stage directions to retain only those that satisfy both of the conditions: they don’t have a speech parent and they also don’t have a line parent. As an alternative, you could also use two predicates, writing //stage[not(parent::sp)][not(parent::l)]. This version operates in three steps: it finds all of the stage directions, it keeps only the ones that don’t have a speech as their parent, and then it filters the ones that survived that first predicate to keep only the ones that also don’t have a line as their parent.
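
The compound predicate and the two chained predicates can be compared step by step in Python. Python's standard xml.etree.ElementTree module does not support not() or the parent:: axis in predicates, so the sketch below emulates both filters with ordinary list comprehensions over a hypothetical three-stage-direction document; it is an analogy to the XPath, not the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene: stage "a" sits in a <div>, "b" in an <sp>, "c" in an <l>.
SAMPLE = """
<div>
  <stage n="a">Enter HAMLET</stage>
  <sp>
    <stage n="b">Aside</stage>
    <l>A line <stage n="c">Rising</stage></l>
  </sp>
</div>
"""

root = ET.fromstring(SAMPLE)
parent_of = {child: parent for parent in root.iter() for child in parent}

def parent_tag(elem):
    """Return the element name of elem's parent."""
    return parent_of[elem].tag

stages = list(root.iter("stage"))

# Emulates //stage[not(parent::sp) and not(parent::l)]
compound = [s for s in stages if parent_tag(s) != "sp" and parent_tag(s) != "l"]

# Emulates //stage[not(parent::sp)][not(parent::l)]: filter, then filter again.
step1 = [s for s in stages if parent_tag(s) != "sp"]
chained = [s for s in step1 if parent_tag(s) != "l"]

print([s.get("n") for s in step1])  # ['a', 'c']
print(compound == chained)          # True
print(len(compound))                # 1, the analogue of count(...)
```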

      Most people responded with the path //div/stage, which happens to return the correct result, but it’s nonetheless the wrong answer because it depends on your external knowledge of the document hierarchy and contents. Had there been <stage> elements not occurring directly within an <sp> or <l> that weren’t immediately inside a <div>, //div/stage wouldn’t have found them. In general with XPath, you want to write an expression that finds everything it seeks not only in one particular document with its particular content, but also in any document where a case could legitimately occur that just happens to be absent here. In this XML version of this play, the taggers have used <l> only for metrical (iambic pentameter) lines of dialog, and where there is non-metrical speech, they’ve used the more generic TEI <ab> element (which stands, somewhat opaquely, for anonymous block). There could have been stage directions inside anonymous blocks, and it’s just an accident that there weren’t.

      An answer that happens to give the right results because of accidents about the data is called fragile or brittle because it can break as soon as a possible complication appears. An answer like the one we recommend here, which can survive more kinds of data, is called robust. Because a lot of coding in digital humanities is designed to be reused (for example, one might wish to use these XPath expressions to explore other plays that employ the same markup), you should favor a robust expression over a fragile one.
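
To make the contrast concrete, here is a small sketch in Python's standard xml.etree.ElementTree module. The toy document is hypothetical: it adds a stage direction inside an <ab>, which is the situation described above as possible but absent from the real file. The robust expression is emulated in Python because ElementTree does not support not() in predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene in which stage "c" sits inside an anonymous block <ab>.
SAMPLE = """
<body>
  <div>
    <stage n="a">Enter HAMLET</stage>
    <sp>
      <l>A line <stage n="b">Rising</stage></l>
      <ab>Some prose <stage n="c">Reads</stage></ab>
    </sp>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
parent_of = {child: parent for parent in root.iter() for child in parent}

# Fragile: //div/stage assumes every other stage direction is a child of <div>.
fragile = root.findall(".//div/stage")

# Robust: emulates //stage[not(parent::sp) and not(parent::l)]
robust = [s for s in root.iter("stage") if parent_of[s].tag not in ("sp", "l")]

print([s.get("n") for s in fragile])  # ['a']
print([s.get("n") for s in robust])   # ['a', 'c']
```

The fragile path silently misses the stage direction inside <ab>, while the filter that states the actual condition still finds it.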

    4. For the stage directions you identified in question 2, part 3, above, write an XPath expression that will return not the <stage> elements themselves, but their parent elements, whatever they might be. What are those parent elements? (You haven’t yet learned the XPath to return just the names of the parent elements [rather than the elements themselves], but you can locate them, click on each one in the list <oXygen/> returns, and look at it directly.)

      //stage[not(parent::sp) and not(parent::l)]/parent::*

      The asterisk is used to denote any element. Since elements will only have one direct parent, using the * on the parent:: axis returns just the one element that is the immediate parent of whatever context you were in previously. In this case we just appended an additional path step, /parent::*, to the end of the XPath solution to the preceding question. You could also use the shorthand expression .. in place of the parent::* path step, since .. means the parent element, whatever it may be, of the current context.
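
The extra parent step can also be sketched in Python with the standard xml.etree.ElementTree module. The toy document and its n values are hypothetical. ElementTree understands the .. shorthand but not the not() predicate, so the filtered version is emulated with a parent lookup table, and the unfiltered version is shown with .. for comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene: stages "a" and "d" are children of <div>, "b" of <sp>.
SAMPLE = """
<body>
  <div n="scene">
    <stage n="a">Enter HAMLET</stage>
    <sp><stage n="b">Aside</stage></sp>
    <stage n="d">Exeunt</stage>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
parent_of = {child: parent for parent in root.iter() for child in parent}

# Emulates //stage[not(parent::sp) and not(parent::l)]
kept = [s for s in root.iter("stage") if parent_of[s].tag not in ("sp", "l")]

# Emulates the final /parent::* step; dict.fromkeys drops repeated parents,
# much as XPath returns each node only once.
parents = list(dict.fromkeys(parent_of[s] for s in kept))
print([p.tag for p in parents])  # ['div']

# The .. shorthand in ElementTree, here applied to all stage directions.
print([p.tag for p in root.findall(".//stage/..")])  # ['div', 'sp']
```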