Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-02-24T21:39:17+0000
You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately
damaged some of the markup in this edition to introduce some inconsistencies, but the
file is well-formed XML, which means that you can use XPath to explore it. You should
download this file to your computer (typically that means right-clicking on the link and
selecting “save as”) and open it in <oXygen/>.
Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.
Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the
best you can, and if you can’t get a working answer, give the answers you tried and
explain where they failed to get the results you wanted. As always, you are encouraged
to ask questions in the #xpath channel in Slack, but because you want to make
progress in learning to debug your own code, your questions should tell us what you
tried, what you expected, exactly what you got instead (not just “didn’t
work” or “got an error”), and what you think the source of the problem is.
Sometimes writing that sort of request for advice will help you figure out what’s
wrong on your own (see Rubber duck debugging), and even when it doesn’t, it will help us identify the
difficult moments.
These tasks require the use of path expressions, predicates, and the functions
count() and not(), but they should not require any other XPath
functions. There may be more than one possible answer.
Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):
Acts and scenes in this document are both tagged as division (<div>) elements. How can XPath tell them apart?
XPath is able to distinguish between the
<div>
elements that are acts and
the <div>
elements that are
scenes because they occur at different levels of the document hierarchy.
The <div>
elements that are
scenes are children of the <div>
elements that are acts, while the
<div>
elements that are acts are
children of the <body>
element.
This means that an XPath expression to find all of the immediate
<div>
children of the
<body>
element will retrieve
only and exactly the five <div>
elements that are acts. Finding the
<div>
children of those
<div>
elements would return just
the scenes.
What XPath would find just the acts?
You can click anywhere and then use
//body/div
or
/TEI/text/body/div
or
/descendant::body/div
. These expressions work no matter where you click because
XPath expressions that begin with a slash start at the document node,
which is the one location in the document that you can always reach from
anywhere in a single step. If you click inside the document in a
particular location, you can use a more specific path expression. For
example, if you click just inside the
<body>
element, the acts are children of
the current context, so a path expression of just
div
would find the five acts.
If we look at the document, we see that the acts and scenes occur within
the <body>
element (that is,
they are descendants of the
<body>
element), even though
<div>
elements are also found
elsewhere in the document. To find the acts, we first want to navigate
to the <body>
element, which we
can do directly with //body
or
/descendant::body
(this finds all
the <body>
elements in the
entire document, but we know that there is only one) or by walking down
the tree through all of the steps with
/TEI/text/body
. From there we just
take another step to get down to the
<div>
children of the
<body>
:
//body/div
or
/TEI/text/body/div
. These two XPath
expressions return exactly the same nodes, so you can use whichever you
find easiest to read and understand. Most XPath developers would favor
the shorter path.
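For readers who want to experiment outside <oXygen/>, here is a minimal sketch of the same two routes to the acts, written in Python with the standard xml.etree.ElementTree module. The miniature document and its n attribute values are invented for illustration (they are not taken from Bad Hamlet). ElementTree implements only a small subset of XPath and evaluates paths relative to an element, which is why both paths below begin with a dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature stand-in for the play, NOT the real Bad Hamlet file.
TOY = """
<TEI>
  <text>
    <front><div n="cast-list"/></front>
    <body>
      <div n="act1"><div n="act1-scene1"/><div n="act1-scene2"/></div>
      <div n="act2"><div n="act2-scene1"/></div>
    </body>
  </text>
</TEI>
"""
root = ET.fromstring(TOY)  # root is the <TEI> element

# ".//body/div" plays the role of //body/div: jump to <body>, then take its <div> children.
acts_short = [d.get("n") for d in root.findall(".//body/div")]

# "./text/body/div" plays the role of /TEI/text/body/div: walk down one step at a time.
acts_full = [d.get("n") for d in root.findall("./text/body/div")]

print(acts_short)
print(acts_full)
```

Both lists contain only the two toy acts. The <div> inside <front> is skipped because it is not a child of <body>, and the scenes are skipped because they are grandchildren rather than children.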
What XPath would find just the scenes?
//body/div/div
or
/TEI/text/body/div/div
All we do here is take one more step along the child axis, from the acts that
we found in the previous question to their <div> children, by appending
/div
.
Some XPath developers would prefer
//div/div
as the entire path, which
finds all of the <div>
elements
(all acts and scenes, plus any
<div>
elements outside the
<body>
) and then navigates to
their children that are also
<div>
elements. An XPath
expression doesn’t keep the intermediate stages in the path, which is to
say that although it finds acts and scenes and other things at first,
ultimately it keeps only <div>
elements that are children of
<div>
elements, and therefore
winds up keeping only the scenes.
By the way, a bottom-up approach to finding the scenes might look like
//div[parent::div]
. This starts by
selecting all <div>
elements in
the entire document and then applies a predicate that filters them to
keep only the ones that have parent elements of type
<div>
.
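The top-down and bottom-up strategies can be compared side by side in a small sketch. This one uses Python's standard xml.etree.ElementTree module on an invented toy document (the n values are hypothetical). ElementTree does not support the parent:: axis inside a predicate, so the bottom-up filter is emulated with an ordinary Python dictionary that maps each element to its parent; that dictionary is our own workaround, not an XPath feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, NOT the real Bad Hamlet file.
TOY = """
<TEI>
  <text>
    <front><div n="cast-list"/></front>
    <body>
      <div n="act1"><div n="act1-scene1"/><div n="act1-scene2"/></div>
      <div n="act2"><div n="act2-scene1"/></div>
    </body>
  </text>
</TEI>
"""
root = ET.fromstring(TOY)

# Top-down, like //div/div: find every <div>, then step to its <div> children.
top_down = [d.get("n") for d in root.findall(".//div/div")]

# Bottom-up, emulating //div[parent::div]: find every <div>, keep those whose parent is a <div>.
parent_of = {child: parent for parent in root.iter() for child in parent}
bottom_up = [d.get("n") for d in root.iter("div") if parent_of[d].tag == "div"]

print(top_down)
print(bottom_up)
```

In this toy document both approaches keep the three scenes and nothing else, even though the first step of each also passes through the acts and the <div> in <front>.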
What XPath would find just the scenes in Act III?
//body/div[3]/div
We start by finding the acts just as we did in part 1
(//body/div
). This part of the
XPath returns a sequence of
<div>
elements. Sequences have
an order that in this case will be based on document order,
or the order in which the <div>
elements appear in the source document. This means that we can use a
numerical predicate to indicate that we want the third item in the
sequence of all of the <div>
elements we found (//body/div[3]
).
The new context for the next path step is this one
<div>
, and we use it to find all
of its child <div>
elements,
that is, all of the scenes of that act.
Note that numerical (and other) predicates are used to filter sequences, and you can use a predicate at any step in a path expression, including intermediate ones. In this case we collect all of the acts and filter them to keep just one, and we then collect all of the scenes of the one act that we care about. Predicates give XPath tremendous power to navigate within a document or set of documents.
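As a rough illustration of a numerical predicate on an intermediate step, the sketch below runs the equivalent of //body/div[3]/div with Python's standard xml.etree.ElementTree module. The three-act toy document and its n values are assumptions made up for this example, and the leading dot is there only because ElementTree requires relative paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-act toy document, NOT the real Bad Hamlet file.
TOY = """
<TEI>
  <text>
    <body>
      <div n="act1"><div n="act1-scene1"/></div>
      <div n="act2"><div n="act2-scene1"/></div>
      <div n="act3"><div n="act3-scene1"/><div n="act3-scene2"/></div>
    </body>
  </text>
</TEI>
"""
root = ET.fromstring(TOY)

# Intermediate result: the predicate [3] keeps only the third <div> child of <body>.
third_act = [d.get("n") for d in root.findall(".//body/div[3]")]

# Full path: from that one act, step down to its <div> children (its scenes).
act3_scenes = [d.get("n") for d in root.findall(".//body/div[3]/div")]

print(third_act)
print(act3_scenes)
```

Running the path one step at a time, as here, is also a convenient way to check that the predicate selected the act you intended before you add the final step.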
Stage directions (<stage>) occur in a
variety of contexts. What XPath would find all of the stage directions that are inside a
metrical line (<l>
), that is,
between the starting <l>
and the
ending </l>
? How many are
there?
There are two directions from which to approach this question. You could first find all of the lines, and then find all of the stage directions within them, or you could find all of the stage directions, and then use a predicate to weed out the ones that aren’t inside a line. Here are two possible solutions:
//l//stage
//stage[ancestor::l]
In both cases there are 128 items. Notice that the first answer used the
descendant::
axis (using the
shorthand //
), and the second
answer used the ancestor::
axis.
The question did not specify that the stage elements were
immediately within lines (although they happen to be in the
document), and so it is more correct to search as deeply or as high as
possible to guarantee that you have all of the stage elements within
lines. In contrast, //l/stage
would
have found stage directions only if they were immediate children of
lines, and not if they were grandchildren or deeper, that is, not if
they were inside something else that was the actual immediate child of
the line. Similarly, for the second solution, we look at all of the
ancestor elements, and not just the immediate parent.
If you tried to write out the full path as
/TEI/text/body/div/div/sp/l/stage
you found only 127 results, even though there are 128. The one you
missed is a child of a line, but the line itself is a child of a line
group, and not of a speech. The mistake with this approach involves
making the unnecessary assumption that all lines will be children of
speeches, and the more general take-away is that when writing XPath you
want to avoid making any unnecessary assumptions. Since the question was
only about lines and stage directions, it’s best to write a path
expression that refers only to those element types. See the note below,
under question 2c, about robust vs brittle solutions.
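The difference between these three paths is easy to see on a toy example. The sketch below uses Python's standard xml.etree.ElementTree module and an invented fragment (not the real Bad Hamlet file) in which one line sits inside a line group and one stage direction is wrapped in a hypothetical <seg> element. Here the root of the toy document is <body>, so the fully spelled-out path starts at ./div rather than at /TEI/text/body/div.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, NOT the real Bad Hamlet file; the <seg> wrapper is invented.
TOY = """
<body>
  <div>
    <div>
      <sp>
        <l>Spoken line <stage>Aside</stage></l>
        <lg><l>Sung line <stage>Sings</stage></l></lg>
        <l>Another line <seg><stage>Kneels</stage></seg></l>
        <stage>Exit</stage>
      </sp>
    </div>
  </div>
</body>
"""
root = ET.fromstring(TOY)  # root is the <body> element

# Like //l//stage: stage directions at any depth inside any line.
anywhere_in_line = [s.text for s in root.findall(".//l//stage")]

# Like //l/stage: only stage directions that are immediate children of a line.
child_of_line = [s.text for s in root.findall(".//l/stage")]

# Like the fully spelled-out path: assumes every line is a child of a speech.
full_path = [s.text for s in root.findall("./div/div/sp/l/stage")]

print(anywhere_in_line)
print(child_of_line)
print(full_path)
```

Each added assumption loses results in the toy data: requiring an immediate child drops the wrapped stage direction, and requiring the line to be a child of a speech also drops the one inside the line group.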
What XPath would find all of the stage directions that are
directly inside a speech
(<sp>
), that is, inside a speech
but not inside a line within a speech?
//stage[parent::sp]
or
//sp/stage
will work equally well.
In the first case, you find all stage directions and then filter them to
keep only those that have a speech as their immediate parent. In the
second case you find all of the speeches and then get all of their
children that are stage directions.
What XPath would find all of the stage directions that are not directly inside a speech or a line? How many are there?
//stage[not(parent::sp) and not(parent::l)]
This returns 40 items. Note that it uses a compound predicate,
which filters the sequence of all stage directions to retain only those
that satisfy both of the conditions: they don’t have a speech parent and
they also don’t have a line parent. As an alternative, you could also
use two predicates, writing
//stage[not(parent::sp)][not(parent::l)]
This version operates in three steps: it finds all of the stage
directions, it keeps only the ones that don’t have a speech as their
parent, and then it filters the ones that survived that first predicate
to keep only the ones that also don’t have a line as their parent.
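To see that one compound predicate and two predicates in a row filter the same way, here is a small sketch in Python. ElementTree's XPath subset has neither the not() function nor the parent:: axis, so the predicates are emulated with list comprehensions and a parent lookup dictionary; the toy document is invented and the variable names are our own.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, NOT the real Bad Hamlet file.
TOY = """
<body>
  <div>
    <stage>Enter Ghost</stage>
    <sp>
      <stage>Aside</stage>
      <l>Stay, illusion! <stage>Exit Ghost</stage></l>
    </sp>
  </div>
</body>
"""
root = ET.fromstring(TOY)
parent_of = {child: parent for parent in root.iter() for child in parent}
all_stage = list(root.iter("stage"))

# Emulates //stage[not(parent::sp) and not(parent::l)]: one filter, two conditions.
compound = [s for s in all_stage
            if parent_of[s].tag != "sp" and parent_of[s].tag != "l"]

# Emulates //stage[not(parent::sp)][not(parent::l)]: filter, then filter the survivors.
survivors = [s for s in all_stage if parent_of[s].tag != "sp"]
chained = [s for s in survivors if parent_of[s].tag != "l"]

compound_text = [s.text for s in compound]
chained_text = [s.text for s in chained]
print(compound_text)
print(chained_text)
```

Both versions keep only the stage direction whose parent is the <div>, which matches the stepwise description above.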
The path //div/stage
happens to
return the correct result, but it’s nonetheless the wrong answer because
it depends on your external knowledge of the document hierarchy and
contents. Had there been <stage>
elements not occurring directly within an
<sp>
or
<l>
that weren’t immediately
inside a <div>
,
//div/stage
wouldn’t have found
them. In general with XPath, you want to write an expression that will
not only find all of the items it seeks for a particular document with
particular content, but also not risk missing something that could occur
but happens not to just by accident. In this XML version of this play,
the taggers have used <l>
only
for metrical (iambic pentameter) lines of dialog, and where there is
non-metrical speech, they’ve used the more generic TEI
<ab>
element (which stands,
somewhat opaquely, for “anonymous block”). There could have been
stage directions inside anonymous blocks, and it’s just an accident that
there weren’t.
An answer that happens to give the right results because of accidents about the data is called fragile or brittle because it can break as soon as a possible complication appears. An answer like the one we recommend here, which can survive more kinds of data, is called robust. Because a lot of coding in digital humanities is designed to be reused (for example, one might wish to use these XPath expressions to explore other plays that employ the same markup), you should favor a robust expression over a fragile one.
For the stage directions you identified in #2c, above, write an XPath
expression that will return not the
<stage>
elements themselves, but
their parent elements, whatever they might be. What are those parent
elements? (You haven’t yet learned the XPath to return just the names of
the parent elements [rather than the elements themselves], but you can
locate them, click on each one in the list <oXygen/> returns, and
look at it directly.)
//stage[not(parent::sp) and not(parent::l)]/parent::*
The asterisk is used to denote any element. Since elements will only have
one direct parent, using the *
on
the parent::
axis returns just the
one element that is the immediate parent of whatever context you were in
previously. In this case we just appended an additional path step,
/parent::*
, to the end of the XPath
solution to the preceding question. You could also use the shorthand
expression ..
in place of the
parent::*
path step, since
..
means the parent, whatever it
may be, of the current context.
You can ask XPath to tell you the names of those elements, instead of
just selecting them and making you look at the tags to learn the names,
with
//stage[not(parent::sp) and not(parent::l)]/parent::* ! name(.)
(using the XPath 3.1 simple map operator) or
//stage[not(parent::sp) and not(parent::l)]/parent::*/name()
(using the older notation). The new last step applies the XPath
name()
function to each of the
context nodes, that is, the parent elements selected by the preceding
path step. The XPath name()
function returns the name of a node (e.g., element or attribute type),
rather than the node itself, which makes it especially useful during
exploratory document analysis.
You can skip the following part for now because it isn’t required to provide correct answers to the questions above. With that said, we encourage you at least to read through it and think about it, since it will help you write clearer, more legible XPath.
In our answer to the last question, above, we showed how to use the
name()
function as either part of a simple
map operation (with !
) or a final path step
(with a /
). Each step in a path expression
is separated from the preceding step by a slash (or double slash), and the step
to the left defines the context nodes that serve as the starting point for the
step to the right. For example, in
//body/div
, the first step selects all of
the <body>
elements in the document
(there is only one), and each <body>
element then serves, in turn, as the context for the next step, which finds all
of the <div>
element children of the
current context node. When the last path step is a function, like
name()
, it also uses each item selected by
the path step immediately before it as the context item, which in this case
means the context in which the function is applied. For example,
//body/*/name(.)
has three steps: first
find all of the <body>
elements in the
document, then find all of the element children of each of those
<body>
elements, and then compute the
name of each of those child elements. If you run this expression against our
text, it will return a sequence of five instances of the string
div
, since the only children of the one
<body>
element in our document are the
five <div>
elements that contain the
five acts of the play.
XPath provides an alternative notation, called the simple map operator
and spelled as an exclamation point (!
),
for applying functions to a sequence of context items. This means that the
following two expressions are equivalent:
//stage[not(parent::sp) and not(parent::l)]/parent::*/name(.)
and
//stage[not(parent::sp) and not(parent::l)]/parent::* ! name(.)
We prefer the exclamation point when we are applying a function to the context items because that strategy helps us see more quickly which path steps are navigation and which apply functions. But that’s just a personal preference and you can use whichever notation you find easier to understand. (There is a difference in functionality between the two notations, but it is not relevant in this particular example.)
The simple map operator or the slash that introduces a function means “do the
thing to the right once for each item in the sequence to the left”. For
example, if we use the simple map operator to get the names of five elements,
the expression will return five strings, that is, five element names. The arrow
operator, spelled =>
, means “apply the
function to the right once, using the entire sequence to the left as input
into the function”. This means that a function to the right of the simple
map operator is applied once to each context item to the left, while a function
to the right of the arrow operator is applied only once, taking the entire
sequence to the left as its input. Here is why that’s useful.
Suppose we are returning the names of all of the elements that can be parents of stage directions. If we run:
//stage/.. ! name(.)
against our document we’ll return 218 strings because the expression will find
all of the stage directions, use them to find all of their parents, and then
return not the parent elements themselves, but just their names, one name per
element. Suppose we want to find out what types of elements can be
parents of stage directions. We could scroll through the 218 results, but that’s
tedious and error-prone, and with a different source document there might be
even more stage directions. This is the sort of task that computers perform more
reliably than humans, though, so we can instead ask XPath to remove the
duplicate values for us by applying the
distinct-values()
function, which takes a
sequence of items (in this case strings, since they’re the names of elements) as
input, removes any duplicates, and returns a deduplicated sequence. We can do
that by wrapping the function around the entire expression, since the entire
result of the expression (the 218 values) is the input to the deduplication
process:
distinct-values(//stage/.. ! name(.))
This returns just three values: div
,
l
, and
sp
, because those are the only element
types that can be parents of stage directions in this play.
Wrapping a function around a long path expression can be difficult to read (and the difficulty increases if we want to nest several functions), and the arrow operator exists as a way to make long expressions with functions easier to read. In this case, we can rewrite our expression as:
//stage/.. ! name(.) => distinct-values()
We can read this from left to right: first find the parent elements of the stage
directions, then get the names of those elements, and then remove the duplicate
names. We find this easier to read than the version that wraps the
distinct-values()
function around the rest
because we have to read the version with wrapping from the inside out, which
doesn’t feel as natural as reading from left to right. With the notation that
wraps the function around the rest, first we do the things inside the function
parentheses and then we step outside the parentheses to apply the function to
the results.
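If you know some Python, a loose analogy may help: the simple map operator behaves like a list comprehension (the function runs once per item), while the arrow operator behaves like passing the whole list to a function once. The sketch below is only an analogy under that assumption. Python has neither operator, the toy document is invented, and dict.fromkeys stands in for distinct-values().

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, NOT the real Bad Hamlet file.
TOY = """
<body>
  <div>
    <stage>Enter two sentinels</stage>
    <sp><stage>Aside</stage><l>A line <stage>Kneels</stage></l></sp>
    <sp><stage>Rising</stage></sp>
  </div>
</body>
"""
root = ET.fromstring(TOY)

# Like //stage/.. : the parent elements of the stage directions.
parents = root.findall(".//stage/..")

# Like "! name(.)": apply the function once per item, producing one name per parent.
names = [p.tag for p in parents]

# Like "=> distinct-values()": apply the function once to the entire sequence.
distinct_names = list(dict.fromkeys(names))

print(names)
print(sorted(distinct_names))
```

The per-item step yields four names for the four parent elements, and the whole-sequence step reduces them to the three distinct element types div, l, and sp.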
The simple map operator normally requires a dot inside the parentheses to specify that the function is being applied to the current context item. Some functions know that the current context item is the input into the function by default, so omitting the dot for those functions won’t do any harm, but it can be difficult to predict which functions require the dot and which regard it as optional. The function to the right of the arrow operator, though, never includes the dot. Keeping the two syntactic expectations separate will become more natural as you gain experience.
The simple map and arrow operators make complex path expressions easier to read, and that’s especially the case if we write the expression across multiple lines, e.g.:
//stage/..
! name(.)
=> distinct-values()
Writing each step on its own line makes the stepwise process even easier to see
because it now takes advantage of our intuitive understanding of both “left to
right” and “top to bottom”.
We encourage you to become comfortable with the simple map and arrow operators
because they’ll help you write code that is easier to understand and therefore
less prone to error and easier to debug when you do make a mistake. This method
also encourages you to construct your XPath expressions one step at a time,
which is always a good idea because it lets you test each step, so that as soon
as something breaks, you’ll know that the last thing you did is the locus of the
mistake. With that said, using alternative notations (like the
distinct-values(//stage/.. ! name(.))
example above) isn’t wrong, and you’ll see them in a lot of examples you’ll find
on the Internet (including on our course pages) because the simple map and arrow
operators are relatively new features in XPath, so any expressions written
before their introduction wouldn’t have been able to use them.