Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-02-24T21:39:17+0000


XPath assignment #1 answers

You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting save as) and open it in <oXygen/>.

Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.

Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. As always, you are encouraged to ask questions in the #xpath channel in Slack, but because you want to make progress in learning to debug your own code, your questions should tell us what you tried, what you expected, exactly what you got instead (not just “didn’t work” or “got an error”), and what you think the source of the problem is. Sometimes writing that sort of request for advice will help you figure out what’s wrong on your own (see Rubber duck debugging), and even when it doesn’t, it will help us identify the difficult moments.

These tasks require the use of path expressions, predicates, and the functions count() and not(), but they should not require any other XPath functions. There may be more than one possible answer.

Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):

  1. Hamlet, like a typical Shakespearean tragedy, contains five acts, each of which contains scenes. Both acts and scenes are encoded as division (<div>) elements.
    1. How can XPath tell them apart?

      XPath is able to distinguish between the <div> elements that are acts and the <div> elements that are scenes because they occur at different levels of the document hierarchy. The <div> elements that are scenes are children of the <div> elements that are acts, while the <div> elements that are acts are children of the <body> element. This means that an XPath expression to find all of the immediate <div> children of the <body> element will retrieve only and exactly the five <div> elements that are acts. Finding the <div> children of those <div> elements would return just the scenes.

    2. What XPath would find just the acts?

      You can click anywhere and then use //body/div or /TEI/text/body/div or /descendant::body/div. Because XPath expressions that begin with a slash start at the document node, that is the one location in the document that you can always reach from anywhere in a single step. If you click inside the document in a particular location, you can use a more specific path expression. For example, if you click just inside the <body> the acts are children of the current context, so a path expression of just div would find the five acts.

      If we look at the document, we see that the acts and scenes occur within the <body> element (that is, they are descendants of the <body> element), even though <div> elements are also found elsewhere in the document. To find the acts, we first want to navigate to the <body> element, which we can do directly with //body or /descendant::body (this finds all the <body> elements in the entire document, but we know that there is only one) or by walking down the tree through all of the steps with /TEI/text/body. From there we just take another step to get down to the <div> children of the <body>: //body/div or /TEI/text/body/div. These two XPath expressions return exactly the same nodes, so you can use whichever you find easiest to read and understand. Most XPath developers would favor the shorter path.

    3. What XPath would find just the scenes?

      //body/div/div or /TEI/text/body/div/div

      All we do here is append another step to the child axis of the acts that we found in the previous question by using /div.

      Some XPath developers would prefer //div/div as the entire path, which finds all of the <div> elements (all acts and scenes, plus any <div> elements outside the <body>) and then navigates to their children that are also <div> elements. An XPath expression doesn’t keep the intermediate stages in the path, which is to say that although it finds acts and scenes and other things at first, ultimately it keeps only <div> elements that are children of <div> elements, and therefore winds up keeping only the scenes.

      By the way, a bottom-up approach to finding the scenes might look like //div[parent::div]. This starts by selecting all <div> elements in the entire document and then applies a predicate that filters them to keep only the ones that have parent elements of type <div>.

    4. What XPath would find just the scenes in Act III?

      //body/div[3]/div

      We start by finding the acts just as we did in the earlier part of this question (//body/div). This part of the XPath returns a sequence of <div> elements. Sequences have an order that in this case will be based on document order, or the order in which the <div> elements appear in the source document. This means that we can use a numerical predicate to indicate that we want the third item in the sequence of all of the <div> elements we found (//body/div[3]). The new context for the next path step is this one <div>, and we use it to find all of its child <div> elements, that is, all of the scenes of that act.

      Note that numerical (and other) predicates are used to filter sequences, and you can use a predicate at any step in a path expression, including intermediate ones. In this case we collect all of the acts and filter them to keep just one, and we then collect all of the scenes of the one act that we care about. Predicates give XPath tremendous power to navigate within a document or set of documents.

  2. Stage directions (<stage>) occur in a variety of contexts.
    1. What XPath would find all of the stage directions that are inside a metrical line (<l>), that is, between the starting <l> and the ending </l>? How many are there?

      There are two directions from which to approach this question. You could first find all of the lines, and then find all of the stage directions within them, or you could find all of the stage directions, and then use a predicate to weed out the ones that aren’t inside a line. Here are two possible solutions:

      //l//stage

      //stage[ancestor::l]

      In both cases there are 128 items. Notice that the first answer used the descendant:: axis (using the shorthand //), and the second answer used the ancestor:: axis. The question did not specify that the stage elements were immediately within lines (although they happen to be in the document), and so it is more correct to search as deeply or as high as possible to guarantee that you have all of the stage elements within lines. In contrast, //l/stage would have found stage directions only if they were immediate children of lines, and not if they were grandchildren or deeper, that is, not if they were inside something else that was the actual immediate child of the line. Similarly, for the second solution, we look at all of the ancestor elements, and not just the immediate parent.

      If you tried to write out the full path as /TEI/text/body/div/div/sp/l/stage you found only 127 results, even though there are 128. The one you missed is a child of a line, but the line itself is a child of a line group, and not of a speech. The mistake with this approach involves making the unnecessary assumption that all lines will be children of speeches, and the more general take-away is that when writing XPath you want to avoid making any unnecessary assumptions. Since the question was only about lines and stage directions, it’s best to write a path expression that refers only to those element types. See the note below, under question 2c, about robust vs brittle solutions.

    2. What XPath would find all of the stage directions that are directly inside a speech (<sp>), that is, inside a speech but not inside a line within a speech?

      //stage[parent::sp] or //sp/stage will work equally well. In the first case, you find all stage directions and then filter them to keep only those that have a speech as their immediate parent. In the second case you find all of the speeches and then get all of their children that are stage directions.

    3. What XPath would find all of the stage directions that are not directly inside a speech or a line? How many are there?

      //stage[not(parent::sp) and not(parent::l)]

      This returns 40 items. Note that it uses a compound predicate, which filters the sequence of all stage directions to retain only those that satisfy both of the conditions: they don’t have a speech parent and they also don’t have a line parent. As an alternative, you could also use two predicates, writing //stage[not(parent::sp)][not(parent::l)]. This version operates in three steps: it finds all of the stage directions, it keeps only the ones that don’t have a speech as their parent, and then it filters the ones that survived that first predicate to keep only the ones that also don’t have a line as their parent.

      The path //div/stage happens to return the correct result, but it’s nonetheless the wrong answer because it depends on your external knowledge of the document hierarchy and contents. Had there been <stage> elements not occurring directly within an <sp> or <l> that weren’t immediately inside a <div>, //div/stage wouldn’t have found them. In general with XPath, you want to write an expression that will not only find all of the items it seeks for a particular document with particular content, but also not risk missing something that could occur but happens not to just by accident. In this XML version of this play, the taggers have used <l> only for metrical (iambic pentameter) lines of dialog, and where there is non-metrical speech, they’ve used the more generic TEI <ab> element (which stands, somewhat opaquely, for anonymous block). There could have been stage directions inside anonymous blocks, and it’s just an accident that there weren’t.

      An answer that happens to give the right results because of accidents about the data is called fragile or brittle because it can break as soon as a possible complication appears. An answer like the one we recommend here, which can survive more kinds of data, is called robust. Because a lot of coding in digital humanities is designed to be reused (for example, one might wish to use these XPath expressions to explore other plays that employ the same markup), you should favor a robust expression over a fragile one.

    4. For the stage directions you identified in #2c, above, write an XPath expression that will return not the <stage> elements themselves, but their parent elements, whatever they might be. What are those parent elements? (You haven’t yet learned the XPath to return just the names of the parent elements [rather than the elements themselves], but you can locate them, click on each one in the list <oXygen/> returns, and look at it directly.)

      //stage[not(parent::sp) and not(parent::l)]/parent::*

      The asterisk is used to denote any element. Since elements will only have one direct parent, using the * on the parent:: axis returns just the one element that is the immediate parent of whatever context you were in previously. In this case we just appended an additional path step, /parent::*, to the end of the XPath solution to the preceding question. You could also use the shorthand expression .. in place of the parent::* path step, since .. means the parent, whatever it may be, of the current context.

      You can ask XPath to tell you the names of those elements, instead of just selecting them and making you look at the tags to learn the names, with //stage[not(parent::sp) and not(parent::l)]/parent::* ! name(.) (using the XPath 3.1 simple map operator) or //stage[not(parent::sp) and not(parent::l)]/parent::*/name() (using the older notation). The new last step applies the XPath name() function to each of the context nodes, that is, the parent elements selected by the preceding path step. The XPath name() function returns the name of a node (e.g., element or attribute type), rather than the node itself, which makes it especially useful during exploratory document analysis.
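
The filtering logic in the answers to question 2 can also be traced outside <oXygen/>. The following is a minimal sketch in Python (standard library only), not part of the assignment. The sample markup is a hypothetical miniature scene, not an excerpt from Bad Hamlet, and the names parent_of and ancestors are helpers invented for this illustration. Python’s ElementTree supports only a small subset of XPath, so the ancestor:: and not() tests are written as ordinary Python conditions.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature scene (an assumption for illustration, not the real file).
SAMPLE = """
<body>
  <div>
    <div>
      <stage>Enter</stage>
      <sp>
        <stage>Aside</stage>
        <l>First line <stage>Rising</stage></l>
        <l>Second line <hi><stage>Kneels</stage></hi></l>
        <ab>Prose <stage>Reads</stage></ab>
      </sp>
      <lg><l>Song line <stage>Sings</stage></l></lg>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)  # root is the <body> element

# ElementTree elements do not record their parents, so build a lookup once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(elem):
    """Yield the parent, grandparent, and so on, up to the root."""
    while elem in parent_of:
        elem = parent_of[elem]
        yield elem


stages = list(root.iter("stage"))  # all <stage> elements, in document order

# Like //stage[ancestor::l]: inside a line at any depth.
in_line = [s.text for s in stages if any(a.tag == "l" for a in ancestors(s))]

# Like //l/stage: only immediate children of a line.
child_of_line = [s.text for s in root.findall(".//l/stage")]

# Like a fully spelled-out path: assumes every line is the child of a speech.
full_path = [s.text for s in root.findall("./div/div/sp/l/stage")]

# Like //stage[not(parent::sp) and not(parent::l)].
elsewhere = [s for s in stages if parent_of[s].tag not in ("sp", "l")]

# Like appending /parent::* ! name(.) to the previous expression.
parent_names = [parent_of[s].tag for s in elsewhere]

# Like //div/stage: finds only some of the stage directions in "elsewhere".
div_children = [s.text for s in root.findall(".//div/stage")]

print(in_line)        # ['Rising', 'Kneels', 'Sings']
print(child_of_line)  # ['Rising', 'Sings']
print(full_path)      # ['Rising']
print(parent_names)   # ['div', 'hi', 'ab']
print(div_children)   # ['Enter']
```

In this made-up sample the narrower expressions miss stage directions that the robust ones find, which is the difference between a brittle and a robust solution described under question 2c. In the real file the narrower paths happen to return almost the same results, and that is exactly why the difference is easy to overlook.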


Advanced details

You can skip the following part for now because it isn’t required to provide correct answers to the questions above. With that said, we encourage you at least to read through it and think about it, since it will help you write clearer, more legible XPath.

The simple map operator

In our answer to the last question, above, we showed how to use the name() function as either part of a simple map operation (with !) or a final path step (with a /). Each step in a path expression is separated from the preceding step by a slash (or double slash), and the step to the left defines the context nodes that serve as the starting point for the step to the right. For example, in //body/div, the first step selects all of the <body> elements in the document (there is only one), and each <body> element then serves, in turn, as the context for the next step, which finds all of the <div> element children of the current context node. When the last path step is a function, like name(), it also uses each item selected by the path step immediately before it as the context item, which in this case means the context in which the function is applied. For example, //body/*/name(.) has three steps: first find all of the <body> elements in the document, then find all of the element children of each of those <body> elements, and then compute the name of each of those child elements. If you run this expression against our text, it will return a sequence of five instances of the string div, since the only children of the one <body> element in our document are the five <div> elements that contain the five acts of the play.
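
The idea that each step supplies the context for the next one can be spelled out as a short Python sketch (standard library only). This is an illustration under stated assumptions, not how an XPath engine is implemented: the sample document is a hypothetical skeleton with five empty <div> elements standing in for the acts, and each path step is written as a separate Python statement.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton: one <body> with five <div> children, plus a <div>
# in the front matter that the path below should not reach.
SAMPLE = (
    "<TEI><text>"
    "<front><div/></front>"
    "<body><div/><div/><div/><div/><div/></body>"
    "</text></TEI>"
)
root = ET.fromstring(SAMPLE)

# Step 1, like //body: find every <body> element in the document.
bodies = root.findall(".//body")

# Step 2, like /*: each <body> in turn is the context for finding its children.
children = [child for body in bodies for child in body.findall("*")]

# Step 3, like /name(.): each child in turn is the context for the function.
names = [child.tag for child in children]

print(len(bodies))  # 1
print(names)        # ['div', 'div', 'div', 'div', 'div']

# Writing the first two steps as a single path selects the same elements.
assert children == root.findall(".//body/*")
```

The list comprehensions make the "once per context item" behavior explicit: the function in the last step runs five times, once for each element selected by the step before it.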

XPath provides an alternative notation, called the simple map operator and spelled as an exclamation point (!), for applying functions to a sequence of context items. This means that the following two expressions are equivalent:

//stage[not(parent::sp) and not(parent::l)]/parent::*/name(.)

and

//stage[not(parent::sp) and not(parent::l)]/parent::* ! name(.)

We prefer the exclamation point when we are applying a function to the context items because that strategy helps us see more quickly which path steps are navigation and which apply functions. But that’s just a personal preference and you can use whichever notation you find easier to understand. (There is a difference in functionality between the two notations, but it is not relevant in this particular example.)

The arrow operator

Both the simple map operator and the slash that introduces a function mean “do the thing to the right once for each item in the sequence to the left.” For example, if we use the simple map operator to get the names of five elements, the expression will return five strings, that is, five element names. The arrow operator, spelled =>, means “apply the function to the right once, using the entire sequence to the left as input into the function.” This means that a function to the right of the simple map operator is applied once to each context item to the left, while a function to the right of the arrow operator is applied only once, taking the entire sequence to the left as its input. Here is why that’s useful.

Suppose we are returning the names of all of the elements that can be parents of stage directions. If we run:

//stage/.. ! name(.)

against our document, we’ll get 218 strings because the expression will find all of the stage directions, use them to find all of their parents, and then return not the parent elements themselves, but just their names, one name per element. Suppose we want to find out what types of elements can be parents of stage directions. We could scroll through the 218 results, but that’s tedious and error-prone, and with a different source document there might be even more stage directions. This is the sort of task that computers perform more reliably than humans, though, so we can instead ask XPath to remove the duplicate values for us by applying the distinct-values() function, which takes a sequence of items (in this case strings, since they’re the names of elements) as input, removes any duplicates, and returns a deduplicated sequence. We can do that by wrapping the function around the entire expression, since the entire result of the expression (the 218 values) is the input to the deduplication process:

distinct-values(//stage/.. ! name(.))

This returns just three values: div, l, and sp, because those are the only element types that can be parents of stage directions in this play.

Wrapping a function around a long path expression can be difficult to read (and the difficulty increases if we want to nest several functions), and the arrow operator exists as a way to make long expressions with functions easier to read. In this case, we can rewrite our expression as:

//stage/.. ! name(.) => distinct-values()

We can read this from left to right: first find the parent elements of the stage directions, then get the names of those elements, and then remove the duplicate names. We find this easier to read than the version that wraps the distinct-values() function around the rest because we have to read the version with wrapping from the inside out, which doesn’t feel as natural as reading from left to right. With the notation that wraps the function around the rest, first we do the things inside the function parentheses and then we step outside the parentheses to apply the function to the results.
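
The difference between "once per item" and "once for the whole sequence" can be sketched in Python (standard library only). This is an analogy under stated assumptions rather than a description of any XPath processor: the sample markup is hypothetical, and name and distinct_values are small stand-in functions written for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment (an assumption for illustration, not the real file).
SAMPLE = """
<div>
  <stage>Enter</stage>
  <sp><stage>Aside</stage><l>Line <stage>Rises</stage></l></sp>
  <sp><l>Line <stage>Exit</stage></l></sp>
  <stage>Exeunt</stage>
</div>
"""
root = ET.fromstring(SAMPLE)


def name(elem):
    """Stand-in for XPath name(): takes one element, returns its name."""
    return elem.tag


def distinct_values(items):
    """Stand-in for distinct-values(): takes a whole sequence, drops repeats."""
    return list(dict.fromkeys(items))


# Like //stage/.. : the parents of the stage directions. A path expression
# returns each node only once, so a parent shared by two stage directions
# (here the outer <div>) appears a single time.
parent_of = {child: parent for parent in root.iter() for child in parent}
parents = list(dict.fromkeys(parent_of[s] for s in root.iter("stage")))

# Like "! name(.)": the function is called once for each item on the left.
names = [name(p) for p in parents]

# Like "=> distinct-values()": the function is called once, on the whole sequence.
piped = distinct_values(names)

# The wrapped notation computes the same thing, read from the inside out.
wrapped = distinct_values([name(p) for p in parents])

print(names)   # ['div', 'sp', 'l', 'l']
print(piped)   # ['div', 'sp', 'l']
```

Two details of the sketch are simplifications. Python keeps the first occurrence of each value in order, while XPath makes no promise about the order in which distinct-values() returns its results. The intermediate variables are there only to make each stage visible, and the XPath version needs none.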

The simple map operator normally requires a dot inside the parentheses to specify that the function is being applied to the current context item. Some functions know that the current context item is the input into the function by default, so omitting the dot for those functions won’t do any harm, but it can be difficult to predict which functions require the dot and which regard it as optional. The function to the right of the arrow operator, though, never includes the dot. Keeping the two syntactic expectations separate will become more natural as you gain experience.

Putting it all together: multi-line XPath expressions

The simple map and arrow operators make complex path expressions easier to read, and that’s especially the case if we write the expression across multiple lines, e.g.:

//stage/.. 
! name(.)
=> distinct-values()

Writing each step on its own line makes the stepwise process even easier to see because it now takes advantage of our intuitive understanding of both left to right and top to bottom.

We encourage you to become comfortable with the simple map and arrow operators because they’ll help you write code that is easier to understand and therefore less prone to error and easier to debug when you do make a mistake. This method also encourages you to construct your XPath expressions one step at a time, which is always a good idea because it lets you test each step, so that as soon as something breaks, you’ll know that the last thing you did is the locus of the mistake. With that said, using alternative notations (like the distinct-values(//stage/.. ! name(.)) example above) isn’t wrong, and you’ll see it in a lot of examples you’ll find on the Internet (including on our course pages) because the simple map and arrow operators are relatively new features in XPath, so any expressions written before their introduction wouldn’t have been able to use them.