Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-02-27T20:50:19+0000
You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve
deliberately damaged some of the markup in this edition to introduce some
inconsistencies, but the file is well-formed XML, which means that you can use XPath
to explore it. You should download this file to your computer (typically that means
right-clicking on the link and selecting “save as”) and open it in <oXygen/>.
Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.
Some of these tasks are thought-provoking, and even difficult. If you get stuck, do
the best you can, and if you can’t get a working answer, give the answers you tried
and explain where they failed to get the results you wanted. As always, you are
encouraged to ask questions in the #xpath channel in Slack, but because you
want to make progress in learning to debug your own code, your questions should tell
us what you tried, what you expected, exactly what you got instead (not
just “didn’t work” or “got an error”), and what you think the source of
the problem is. Sometimes writing that sort of request for advice will help you
figure out what’s wrong on your own (see Rubber duck debugging), and even when it
doesn’t, it will help us identify the difficult moments.
These tasks require the use of path expressions, predicates, and the functions
count() and not(), but they should not require any other
XPath functions. There may be more than one possible answer.
Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):
Speeches (<sp>) typically contain metrical line (<l>) and anonymous block (<ab>)
elements (an anonymous block is the TEI element that the tagger used to represent
a non-metrical speech line). Speeches also typically contain <speaker> elements,
and may also contain stage directions (<stage>). We have deliberately left out at
least one other type of subelement found in speeches.
Based on this understanding: What XPath would find all of the speeches that do not contain any metrical lines as immediate children? How many are there?
//sp[not(l)]
For this answer, we use the predicate [not(l)] to check whether there
are <l> elements on the child axis of all of the speeches we find. We keep
only speeches that do not contain any line elements, throwing away any
speeches that do contain those elements. There are 451
speech elements without line children. Some of you specified
//sp[not(child::l)]. This is syntactically valid XPath, but it isn’t
consistent with Best Practice; most coders wouldn’t mention the child axis
explicitly because it’s the default.
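The filtering step described above can be imitated outside of <oXygen/>. The following is only a sketch in Python, using the standard library's ElementTree on a tiny made-up document (the SAMPLE string is hypothetical and is not the Bad Hamlet file, so the count it prints is not 451). It shows that the predicate looks only at the child axis: a speech whose lines sit inside a line group still counts as having no <l> children.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only.
SAMPLE = """<play>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>A1</l><l>A2</l></sp>
  <sp who="Horatio"><speaker>Hor.</speaker><ab>B1</ab></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><lg><l>C1</l></lg></sp>
</play>"""

root = ET.fromstring(SAMPLE)

# //sp : every <sp> element anywhere in the document
all_speeches = list(root.iter("sp"))

# [not(l)] : keep a speech only if it has no <l> on its child axis.
# find("l") inspects direct children only, like the default child axis.
no_line_children = [sp for sp in all_speeches if sp.find("l") is None]

# count(//sp[not(l)])
print(len(no_line_children))
```

The second and third speeches survive the filter in this sample, the third because its only <l> is a grandchild rather than a child.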
What XPath would find all of the speeches that do not contain either
any metrical lines (<l>) or any anonymous blocks (<ab>)?
How many are there? What do they contain instead?
From anywhere in the document (because a path expression beginning with a slash will always start at the document node regardless of your current context location) you can use any of the following:
//sp[not(l|ab)]
//sp[not(l or ab)]
//sp[not(l)][not(ab)]
//sp[not(l) and not(ab)]
//sp except (//sp[l]|//sp[ab])
These path expressions all return speeches that do not have either
<l> or <ab> elements as immediate children. There are seven of these
elements. When you navigate to these speeches in the list that <oXygen/>
returns, you find that the other child elements of speeches that we
concealed from you are line groups, or <lg>.
You can get those child elements by adding an asterisk as the next
path step, meaning “find all of the child elements of these <sp>
elements, whatever they might be”:
//sp[not(l or ab)]/*
(similarly with the other options above). You can get the element names,
instead of the elements themselves, by using the simple map operator
to apply the name() function to the elements returned by the path expression:
//sp[not(l or ab)]/* ! name(.)
(likewise for the other expressions above). You can get rid of the
duplicates by using the arrow operator to apply the distinct-values() function:
//sp[not(l|ab)]/* ! name(.) => distinct-values()
(likewise for the other expressions above).
There are alternatives that don’t use the simple map operator or
the arrow operator. For example,
distinct-values(//sp[not(l|ab)]/*/name())
returns the same results as the last example above. We recommend
using the simple map and arrow operators where appropriate
because the distinctive operators and the left-to-right
application make the code easier to understand.
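The left-to-right pipeline (select speeches, take their children, map each child to its name, remove duplicates) can be traced one stage at a time. This is a rough Python sketch using the standard library; the SAMPLE document is invented for illustration, and the variable names are our own. Note that XPath does not promise any particular order for distinct-values(), whereas this sketch happens to keep first-seen order.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only.
SAMPLE = """<play>
  <sp><speaker>A</speaker><l>x</l></sp>
  <sp><speaker>B</speaker><lg><l>y</l></lg></sp>
  <sp><speaker>C</speaker><lg><l>z</l></lg><stage>Exit</stage></sp>
</play>"""

root = ET.fromstring(SAMPLE)

# //sp[not(l or ab)] : speeches with neither <l> nor <ab> children
matches = [sp for sp in root.iter("sp")
           if sp.find("l") is None and sp.find("ab") is None]

# /* : all child elements of those speeches, whatever they might be
children = [child for sp in matches for child in sp]

# ! name(.) : map every element to its name
names = [child.tag for child in children]

# => distinct-values() : pass the whole sequence of names to one function
distinct = list(dict.fromkeys(names))

print(names)     # ['speaker', 'lg', 'speaker', 'lg', 'stage']
print(distinct)  # ['speaker', 'lg', 'stage']
```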
//sp[not(l|ab)] uses the union operator. l|ab matches the union of all
lines and anonymous blocks, that is, everything that is either a
line or an anonymous block (and some people refer to the union
connector as the “or” connector). The path //sp collects all of the
speeches, and the predicate then checks on the child axis (the
default, since no axis is specified) and retains only those speeches
that do not have either lines or anonymous blocks as immediate
children.
//sp[not(l or ab)] uses the or operator. Without the not() function, it
would keep all <sp> elements that have either <l> or <ab> children. The
not() function inverts the test, so it keeps only the <sp> elements that
have neither type of child. Some people find the two senses of
“or” confusing: the union operator (|) finds elements
that are either one thing or another, while the keyword
or is used in complex predicates to filter a sequence according to one
condition or another.
//sp[not(l)][not(ab)] uses two predicates. It first collects all of the
speeches, after which the first predicate keeps only those that don’t have
line elements as children. The second predicate then filters those still
further, keeping only the ones that also don’t have anonymous blocks as
immediate children. You can reverse the order of the predicates in
this case (although there are other situations where changing the
order of predicates leads to different results).
//sp[not(l) and not(ab)] uses a compound predicate. It collects all of the
speeches and then filters them all at once, keeping only those that both do
not have any lines as children and do not have any anonymous blocks as
children. The order of the two parts on either side of the and operator
doesn’t matter.
//sp except (//sp[l]|//sp[ab]) uses except to specify the
difference of sets of nodes. See Kay pp. 628–31 for
discussion. Our pattern says “return all <sp> elements, but
exclude from that return the union of all <sp> elements that have
<l> children and all <sp> elements that have <ab> children”.
All five of these XPaths yield the same sequence of seven elements,
and in all cases that use the and, or, or union (|) operators, you may
change the order of the parts without changing the results. For example,
not(l or ab) returns the same nodes as not(ab or l).
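To see why the five formulations agree, it can help to spell each one out as an explicit filter. The sketch below does that in Python with the standard library; the SAMPLE document, the @n attributes, and the helper names has() and ids() are all assumptions made for this illustration and are not part of the assignment.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; @n is added only so results are easy to read.
SAMPLE = """<play>
  <sp n="1"><speaker>A</speaker><l>x</l></sp>
  <sp n="2"><speaker>B</speaker><ab>y</ab></sp>
  <sp n="3"><speaker>C</speaker><lg><l>z</l></lg></sp>
  <sp n="4"><speaker>D</speaker><l>v</l><ab>w</ab></sp>
</play>"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # //sp


def has(sp, tag):
    """True if sp has at least one child element with this name."""
    return sp.find(tag) is not None


def ids(seq):
    return [sp.get("n") for sp in seq]


# //sp[not(l|ab)] : the union of <l> and <ab> children is empty
union = [sp for sp in speeches if not (sp.findall("l") + sp.findall("ab"))]

# //sp[not(l or ab)] : negate "has <l> or has <ab>"
either = [sp for sp in speeches if not (has(sp, "l") or has(sp, "ab"))]

# //sp[not(l)][not(ab)] : two predicates applied one after the other
step1 = [sp for sp in speeches if not has(sp, "l")]
chained = [sp for sp in step1 if not has(sp, "ab")]

# //sp[not(l) and not(ab)] : one compound predicate
both = [sp for sp in speeches if not has(sp, "l") and not has(sp, "ab")]

# //sp except (//sp[l]|//sp[ab]) : set difference on node identity
excluded = [sp for sp in speeches if has(sp, "l")] + \
           [sp for sp in speeches if has(sp, "ab")]
difference = [sp for sp in speeches if not any(sp is e for e in excluded)]

for result in (union, either, chained, both, difference):
    print(ids(result))  # ['3'] each time
```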
//sp[@who="Hamlet"]/l[1]
/descendant::sp[@who="Hamlet"]/l[1]
(//sp[@who="Hamlet"]/l)[1]
(/descendant::sp[@who="Hamlet"]/l)[1]
These four XPath expressions do not all return the same results because of
the scoping effect of parentheses, that is, how the
parentheses affect the sequence of items within which the system then
applies the predicate [1] to select only the first item in a sequence. The
other distinguishing factor of these expressions involves the specification
of the descendant axis. Here the shorthand // functions the same way as the
long form descendant::. The predicates in expressions 2a and 2b filter
elements in a different context than in expressions 2c and 2d because the
parentheses alter the current context.
Here are the details:
2a (//sp[@who="Hamlet"]/l[1]) returns
the first line of every speech by Hamlet. XPath expressions operate one step
at a time, and the sequence returned by each step becomes the sequence of
context nodes from which the next step proceeds. This means that when the
first step returns a sequence of all of Hamlet’s speeches, each speech, one
by one, becomes the context for returning the lines in that single speech,
which are then filtered by the predicate to keep only the first line. The
entire expression returns, then, a sequence of all of the first lines
(<l> elements) of all of Hamlet’s speeches.
Not all of Hamlet’s speeches contain <l> child elements, yet when we
ask for the first <l> child element of each speech we don’t raise any
errors. This fact illustrates another perhaps surprising feature of XPath:
asking for something that does not exist is not an error, so if a speech has
no <l> children we get no result (an empty sequence) for that speech. When
we ask for //sp[@who="Hamlet"]/l[1] we get 159 first child <l> elements, but
when we ask for //sp[@who="Hamlet"], we get 357 speeches. This means that
198 speeches don’t have a first <l> child element, and that
means that they don’t have any <l> child elements. We can
verify that with //sp[@who="Hamlet"][not(l)], which
asks explicitly for all speeches by Hamlet that contain no
<l> child elements, and which returns 198 results. We can find out what
types of elements those speeches do have with:
//sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]
! name(.)
=> distinct-values()
This finds all speeches by Hamlet (//sp[@who="Hamlet"]) and filters them
to keep only the ones that do not have any <l> children
(//sp[@who="Hamlet"][not(l)]). We then find all of the child elements that
those speeches do have (//sp[@who="Hamlet"][not(l)]/*) and
filter those children to exclude the speaker name and any stage
directions (//sp[@who="Hamlet"][not(l)]/*[not(self::speaker|self::stage)]),
since those don’t hold spoken text, and what we’re looking for is the
elements other than <l> that hold spoken text. We then use the
name() function to get the names of each of those elements, using the simple
map operator. Since we don’t need the names of the element types to be
repeated, we employ the distinct-values() function to
remove the duplicates, using the arrow operator. This tells us that
speeches by Hamlet that don’t have any <l> children have, instead,
<ab> (anonymous block, which in this play is used for non-metrical lines) and
<lg> (line group) child elements.
We used the self:: axis above inside a predicate to exclude <speaker> and
<stage> elements from the results of a path step. We
could, alternatively, use the except operator within the path
step instead of the predicate. That version would look like:
//sp[@who="Hamlet"][not(l)]/(* except (speaker|stage))
! name(.)
=> distinct-values()
There is no reason to prefer one of these to the other, and you should use whichever one you find easiest to understand.
2b (/descendant::sp[@who="Hamlet"]/l[1])
starts by looking for <sp> on the
descendant axis from the document node (the top of the tree), and then finds
all of the child line elements of those speeches. As in 2a, the predicate
[1] modifies the immediately
preceding step of the expression. This means that the expression returns the
first line of each of Hamlet’s speeches, the same result as 2a.
2c ((//sp[@who="Hamlet"]/l)[1]) also
begins by finding all of the speeches by Hamlet, and then all of the child
line elements of those speeches. At that point the full expression is
wrapped in parentheses, making the current context a single, flattened
sequence of all line elements in Hamlet speeches. The predicate
[1] then applies to that entire
flattened sequence, so it filters the result by asking for the very first
line in the entire sequence of all lines spoken by Hamlet. As a result 2c
returns only one line.
The way the parentheses operate here illustrates a perhaps surprising
feature of sequences in XPath: sequences flatten their contents into a
single sequence. For example, the sequence
( (1, 2, 3), (4, 5, 6) )
might look like a sequence of two items, each of which is itself a sequence of
three integers. The way XPath works, though, is that this is equivalent
to (1, 2, 3, 4, 5, 6) because
nested sequences in XPath are automatically flattened. This is why
wrapping all of the lines of all of Hamlet’s speeches in parentheses
causes them to behave like a single sequence of lines, instead of a
sequence of sequences of lines.
XPath has a structure called an array that is similar to sequences except that it permits this type of nesting without flattening. We don’t introduce arrays in this course because they are not needed for most document processing, but if you do require array functionality for your project, the instructors will help you learn to use them.
2d ((/descendant::sp[@who="Hamlet"]/l)[1]),
like the others, starts by looking for <sp> elements on the descendant axis
from the document node and filtering them to keep only the ones by Hamlet.
As in 2c, the numerical predicate is applied to the entire expression, and
not to each individual speech, since the predicate applies to everything
wrapped in parentheses, that is, to all of Hamlet’s lines as a single
sequence. For that reason, it keeps only the first line that is spoken by
Hamlet and returns only one line, the same result as 2c.
The point is that a predicate applies to the current context, which is defined as the immediately preceding sequence. In 2a and 2b, it’s each sequence of lines in each speech, separately, so the predicate applies once for every speech, and returns the first line of every speech. In 2c and 2d, the parentheses cause the entire preceding expression to function as a single, flattened sequence, so the predicate applies only once, to the continuous sequence of all lines spoken by Hamlet, and therefore returns only one line, the first line spoken by Hamlet in the entire play.
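The difference in predicate scope can be sketched in Python with the standard library. The SAMPLE document below is hypothetical (it is not Bad Hamlet, so the counts differ from 159 and 1), and the variable names are our own. The first computation picks a first line inside each speech, as in 2a and 2b; the second builds one flat list of lines and only then picks the first item, as in 2c and 2d.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only.
SAMPLE = """<play>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>A1</l><l>A2</l></sp>
  <sp who="Horatio"><speaker>Hor.</speaker><l>B1</l></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><ab>C1</ab></sp>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>D1</l><l>D2</l></sp>
</play>"""

root = ET.fromstring(SAMPLE)

# //sp[@who="Hamlet"]
hamlet = [sp for sp in root.iter("sp") if sp.get("who") == "Hamlet"]

# 2a / 2b: //sp[@who="Hamlet"]/l[1]
# [1] is evaluated once per speech; a speech with no <l> children
# simply contributes nothing (an empty sequence, not an error).
first_per_speech = [sp.findall("l")[0] for sp in hamlet if sp.findall("l")]

# 2c / 2d: (//sp[@who="Hamlet"]/l)[1]
# The parentheses produce one flat sequence of lines; [1] is applied once.
all_lines = [line for sp in hamlet for line in sp.findall("l")]
first_overall = all_lines[:1]

print([line.text for line in first_per_speech])  # ['A1', 'D1']
print([line.text for line in first_overall])     # ['A1']
```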
What does // mean? Is it the same as descendant::?
Most of the time you don’t need to think about the details of how
// doesn’t really mean “descendant axis”, even though it seems to behave as
if it did. But if you run into one of the places where // doesn’t behave
the same way as referring to the descendant axis explicitly, here’s why.
In most places // functions the same way as
descendant::, so we often think of it as
shorthand for it, just as @ is shorthand
for attribute::, that is, for referring to
nodes on the attribute axis. But // isn’t
exactly synonymous with descendant::; what
it actually means is /descendant-or-self::node()/. Here’s why
that matters, and why that perhaps confusing path step is useful.
The descendant-or-self:: axis means what you
think it means: it looks for specified nodes on the descendant axis, but it also
looks at the current context node, and not only at its descendants. The
expressions //sp and /descendant::sp return the same results
because:
/descendant::sp is straightforward.
It starts at the document node (because it begins with a slash) and then
looks at all descendants of the current context node (which we’ve just
established as the document node) and selects all descendants that are
elements of type <sp>.
//sp is short for /descendant-or-self::node()/sp. It
also starts at the document node because it also begins with a slash. It
then, as a first path step from the document node, looks at itself (the
document node) and at all of its descendant nodes. It doesn’t
select those nodes, though; that’s an intermediate path step, and from
each of those nodes it then looks at their children and selects them if
they are of type <sp>.
Why does it matter that // doesn’t mean the same thing as descendant::?
If you want to find all @who attributes in the
document, you can do that with //@who. But if
you try to write /descendant::@who you’ll raise
an error because you’re trying to look on two axes at once. Attributes aren’t
descendants in the sense that they are never on the descendant axis because they are
never on any of the directional axes (parent, child, etc.); they are only on the
attribute axis. But because //@who really means
/descendant-or-self::node()/@who, it looks for
all nodes on the descendant axis (none of which are attributes themselves) and then
looks for @who attributes on the attribute axis
from those nodes. Since it is looking at all descendant nodes of the document node,
it winds up looking at all @who attributes, no
matter what their parent.
Attributes are not children, which means that they are not located on the child
axis of anything (they are always and only on the attribute axis). But, perhaps
surprisingly, they do have parents: the element that hosts an attribute is
called its parent and you can navigate to it on the parent axis from an attribute.
For example, you can find all elements that have a @who attribute with
//@who/parent::* (or the shorthand version //@who/..).
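The two-step reading of //@who (visit every node, then look on the attribute axis from each one) can be approximated in Python. This is a loose sketch with the standard library: ElementTree's iter() visits only elements, starting at the root element, whereas descendant-or-self::node() also visits the document node, text nodes, and so on. Since only elements carry attributes, the outcome is the same for this purpose. The SAMPLE document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only.
SAMPLE = """<play>
  <sp who="Hamlet"><speaker>Ham.</speaker><l>A1</l></sp>
  <stage who="Ghost">Enter Ghost</stage>
  <sp><speaker>All</speaker><l>B1</l></sp>
</play>"""

root = ET.fromstring(SAMPLE)

# /descendant-or-self::node() : visit the root element and everything below it
visited = list(root.iter())

# /@who : from each visited element, look on its attribute axis
who_values = [el.get("who") for el in visited if "who" in el.attrib]

# //@who/.. : the elements that host those attributes (their parents)
hosts = [el.tag for el in visited if "who" in el.attrib]

print(who_values)  # ['Hamlet', 'Ghost']
print(hosts)       # ['sp', 'stage']
```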
What goes wrong when we treat // as a synonym for descendant::?
There are two common types of errors that come from treating
// as a synonym for descendant:::
Suppose you want to find all of the speeches
(<sp>) that have child <l> elements. If you write
//sp[l] you’ll get the right
results: your path starts at the document node, eventually finds its way
to all of the <sp> elements in
the document, and uses the predicate to test each one to see whether it
has any <l> children. This works
because every path step is on an axis and the child axis is the default
when no other axis is specified explicitly, so
//sp[l] is synonymous with //sp[child::l].
We can’t, though, find all <sp> elements that have <l>
descendants with //sp[//l]. If you
try this, you’ll select all 1137 speeches in the play, both those that
have <l> descendants and those
that don’t. The reason is that the predicate begins with a slash and a
slash always means “start at the document node”, so instead of
looking for <l> descendants of
the current context (each <sp> in turn), your predicate is reporting
true if there are any <l> element descendants of the
document node. Since there always are, the predicate always returns
true, so the test always succeeds
and we wind up not filtering anything.
We can fix this in two ways:
Specify the descendant axis for real with //sp[descendant::l]. This
predicate expression does not begin with a slash, so it looks
for <l> descendants only
of the current context node, which is each <sp> in turn.
Start the predicate with a dot, which in an XPath context means
“current context node”: //sp[.//l]. The predicate
expression no longer begins with a slash, so it doesn’t start
from the document node; it starts from the current context node
(represented by the dot) and looks down from there.
In our own work we favor the first solution because we find it easier to understand, but the two will return the same result. They aren’t exactly equivalent for the reason described above (we can specify an attribute on the attribute axis right after a double slash), but in practice we almost never want to do that anyway.
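The difference between a predicate that starts at the document node and one that starts at the current context node can be mimicked in Python. This is a sketch only, using the standard library on a hypothetical SAMPLE document with invented @n attributes. In the first filter the test never looks at the speech being tested, so it gives the same answer for every speech; in the second the test is evaluated from each speech in turn.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; @n is added only so results are easy to read.
SAMPLE = """<play>
  <sp n="1"><speaker>A</speaker><l>x</l></sp>
  <sp n="2"><speaker>B</speaker><ab>y</ab></sp>
  <sp n="3"><speaker>C</speaker><lg><l>z</l></lg></sp>
</play>"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # //sp

# //sp[//l] : the predicate starts from the top of the document,
# so its value does not depend on which speech is being tested.
document_has_l = root.find(".//l") is not None
unfiltered = [sp.get("n") for sp in speeches if document_has_l]

# //sp[descendant::l] or //sp[.//l] : the predicate starts from each speech.
filtered = [sp.get("n") for sp in speeches if sp.find(".//l") is not None]

print(unfiltered)  # ['1', '2', '3']
print(filtered)    # ['1', '3']
```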
Because XPath expressions proceed step by step, where the sequence
selected at each step becomes a sequence of context nodes for the next
step, //sp//l[1] doesn’t select all <l> descendants of all
<sp> elements and then return the first one. What it does instead is apply
the predicate separately in each context. Because // is short for
/descendant-or-self::node()/, the final step looks for <l> children of each
speech and of each element inside a speech, one parent at a time, and it
applies the predicate to those individual sequences of lines. What you are
asking for, then, is every <l> that is the first <l> child of its parent
within a speech (at least one result for every speech that contains lines),
instead of the first <l> element that is a descendant of
a speech (one result for the entire play). Something similar is true of
//sp/descendant::l[1]; here you are selecting the descendant lines of each
speech, one speech at a time, and filtering to keep just the first of each of
those speech-specific sequences.
You can work around this limitation by using parentheses to fuse all of
the subsequences into one long sequence before applying the predicate:
(//sp//l)[1] or (//sp/descendant::l)[1] will both
return the first <l> element in
the play that is a descendant of an <sp> element.
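A small experiment can make the scoping visible. The Python sketch below uses the standard library on a hypothetical SAMPLE document in which one speech holds its lines inside line groups. One subtlety worth noting: because // expands to /descendant-or-self::node()/, the predicate in //sp//l[1] is applied among the <l> children of each parent, while in //sp/descendant::l[1] it is applied among all <l> descendants of each speech, so the two can differ when lines are nested. All variable names here are our own.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, invented for illustration only.
SAMPLE = """<play>
  <sp n="1"><l>A1</l><l>A2</l></sp>
  <sp n="2"><lg><l>B1</l><l>B2</l></lg><lg><l>B3</l></lg></sp>
</play>"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))  # //sp

# //sp/descendant::l[1] : first <l> descendant of each speech
first_descendant = [next(sp.iter("l")).text
                    for sp in speeches if sp.find(".//l") is not None]

# //sp//l[1] : first <l> child of each parent at or below a speech
first_child_per_parent = []
for sp in speeches:
    for parent in sp.iter():         # the speech itself and every element below it
        lines = parent.findall("l")  # <l> children of this one parent
        if lines:
            first_child_per_parent.append(lines[0].text)

# (//sp//l)[1] : fuse everything into one sequence, then take the first item
all_lines = [line for sp in speeches for line in sp.iter("l")]
first_overall = [line.text for line in all_lines[:1]]

print(first_descendant)        # ['A1', 'B1']
print(first_child_per_parent)  # ['A1', 'B1', 'B3']
print(first_overall)           # ['A1']
```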