Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-03-01T17:54:35+0000
You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately
damaged some of the markup in this edition to introduce some inconsistencies, but the
file is well-formed XML, which means that you can use XPath to explore it. You should
download this file to your computer (typically that means right-clicking on the link and
selecting “save as”) and open it in <oXygen/>.
After you’ve completed your homework, save your answers to a file and upload it to CourseWeb as an attachment. (Please use an attachment! If you paste your answer into the text box, CourseWeb may munch the angle brackets.) Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. Sometimes doing that will help you figure out what’s wrong, and even when it doesn’t, it will help us identify the difficult moments. These tasks require the use of path expressions, predicates, and functions. There may be more than one possible answer.
Notation: For ease in recognition, from now on when we refer in discussion to an attribute name, we’ll precede it with an at sign (@). In other words, when we write about the @id attribute in question #2, below, the name of the attribute is actually id (without an at sign).
Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):
What XPath expressions will find the last stage direction
<stage>
in the entire document? (Note:
there should be only one!)
One possible answer is (//stage)[last()].
This collects all of the stage directions in the entire document, forms them
into one big sequence with parentheses, and then uses the predicate
[last()]
to keep only the last item in that
sequence.
Alternatively, you could have used
//stage[not(following::stage)]. The last
<stage>
will not have any others
following it. The predicate here makes use of the
following::
axis, which searches the rest
of the tree following the current context. Note the difference between this axis
and the following-sibling::
axis, which
would check for following <stage>
elements only within the same parent element.
The long axes (preceding and following) are less efficient computationally than the others because they don’t take advantage of the tree structure, and tree traversal is more efficient than just walking through the document looking for elements. In a document of this small size you won’t notice a difference, but in a large production system you might want to avoid the long axes if there is an alternative.
You may have tried //stage[last()] and been surprised to get 218 answers, which cannot be correct since there cannot be 218 last stage directions in the entire play. Meanwhile, /descendant::stage[last()] correctly returns the single last stage direction in the entire play. For an explanation of why these two expressions behave differently, see the “What does // mean?” section of the posted solution to our XPath #2 exercise and the discussion in Michael Kay, pp. 542, 618, 702–03.
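If it helps to see the per-parent behavior outside of <oXygen/>, here is a minimal sketch in Python. It assumes a made-up miniature play (not the real Bad Hamlet file) and uses the standard library’s xml.etree.ElementTree module, which supports only a small subset of XPath. That subset has no parenthesized expressions, so ordinary list indexing stands in for (//stage)[last()].

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
TOY = """
<play>
  <act>
    <stage>Enter A</stage>
    <sp><l>line 1</l></sp>
    <stage>Exit A</stage>
  </act>
  <act>
    <stage>Enter B</stage>
    <sp><l>line 2</l><stage>Aside</stage></sp>
    <stage>Exeunt</stage>
  </act>
</play>
"""
root = ET.fromstring(TOY)

# Like //stage[last()]: the predicate is evaluated separately for each
# parent, so we keep every <stage> that is the last <stage> child of
# its own parent.
last_per_parent = [s.text for s in root.findall(".//stage[last()]")]

# Like (//stage)[last()]: gather all stage directions into one sequence
# first, and only then take the last item.
last_in_document = root.findall(".//stage")[-1].text

print(last_per_parent)   # ['Exit A', 'Aside', 'Exeunt']
print(last_in_document)  # Exeunt
```

The toy document has three parents that contain stage directions, so the first expression returns three results, just as the real document returns 218.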
What XPath expression will find the last member in the cast list at the beginning
of the document and select the @xml:id
attribute that is associated with it?
(//castItem)[last()]/role/@xml:id
After looking at the document, you can see that the basic path you want to follow is to find the last <castItem>, get its <role> child (there can be only one <role> per <castItem>), and then get the @xml:id of the <role>.
The path expression //castItem
finds all
<castItem>
elements, but as was the case
with the stage directions in the previous question, it effectively returns them
in cohorts of siblings, so
//castItem[last()]
returns the last
<castItem>
in each cohort, and not the last
one in the document. There are three such cohorts: the
<castItem>
children of
<castList>
and the
<castItem>
children of each of the two
<castGroup>
elements (Courtiers,
Grave-diggers), which themselves are children
of <castList>
. Wrapping parentheses around
//castItem
at the beginning flattens all of
the <castItem>
elements into a single
sequence, so that the predicate returns only one node, which is the one you
want.
What XPath expression will find all <sp> elements with more than 8 line (<l>) subelements? You’ll need to use the count() function (Kay 733–34).
//sp[count(descendant::l) gt 8]
or
//sp[count(.//l) gt 8]
This expression finds all <sp> elements in the document and filters them by counting the number of <l> descendants they have and checking whether that count is greater than 8. We used the gt value comparison operator for “greater than”; you could also use the > general comparison operator, which you can spell either with the raw > character or with the &gt; character entity. In this context, where there is only one item on either side of the test (the integer count of lines to the left and the integer 8 to the right) and the two are comparable (we’re comparing a number to a number), there’s no difference between value comparison and general comparison.
If either side were a sequence of more than one item you would have to use general comparison (value comparison raises an error when either operand is a sequence of more than one item), and it may not mean what you think. What would it mean to ask whether the count of lines was greater than the sequence (8, 10)? That question turns out not to be an error; it has a meaning, which you can look up under “general comparison” in Michael Kay, and we also discuss it briefly at the end of our “XPath functions we use most” tutorial.
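As a rough model of that meaning: general comparison is existential, so it is true if any item on the left stands in the relation to any item on the right. The Python sketch below imitates only that rule for plain numbers. The function name general_gt is our own invention, and the sketch ignores the atomization and type conversion that a real XPath processor performs.

```python
def general_gt(left, right):
    """Imitate XPath `left > right` for sequences of plain numbers.

    True if at least one pair (a, b), with a drawn from left and b drawn
    from right, satisfies a > b. An empty sequence on either side gives False.
    """
    return any(a > b for a in left for b in right)


nine_vs_pair = general_gt([9], [8, 10])   # True: 9 > 8, even though 9 < 10
eight_vs_pair = general_gt([8], [8, 10])  # False: 8 exceeds neither item
empty_vs_eight = general_gt([], [8])      # False: there are no pairs to test

print(nine_vs_pair, eight_vs_pair, empty_vs_eight)
```

Under this rule, asking whether a count is greater than (8, 10) amounts to asking whether it is greater than 8.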
If you tried //sp[count(//l) gt 8], without the leading dot in the predicate, you got every <sp> in the document, all 1137 of them. See the “What does // mean?” section of our posted solution to XPath assignment #2 for an explanation.
Note that the question asked for subelements, so the answer should look for
descendants, and not just children. See bonus question #3, below, for discussion
of the difference.
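The effect of the missing dot can be sketched in Python with the standard library’s xml.etree.ElementTree module. The miniature play and the threshold of 2 (standing in for 8) are invented for illustration. The point is only that counting from the current speech and counting from the top of the document are different questions.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
TOY = """
<play>
  <sp><speaker>A</speaker><l>1</l></sp>
  <sp><speaker>B</speaker><l>2</l><lg><l>3</l><l>4</l></lg></sp>
  <sp><speaker>C</speaker><l>5</l><l>6</l></sp>
</play>
"""
root = ET.fromstring(TOY)
THRESHOLD = 2  # stands in for the 8 in the exercise

# Like //sp[count(.//l) gt 2]: count the lines inside each speech.
relative = [
    sp.findtext("speaker")
    for sp in root.iter("sp")
    if len(sp.findall(".//l")) > THRESHOLD
]

# Like //sp[count(//l) gt 2]: the count starts over at the document
# root every time, so it is the same number (6) for every speech.
absolute = [
    sp.findtext("speaker")
    for sp in root.iter("sp")
    if len(root.findall(".//l")) > THRESHOLD
]

print(relative)  # ['B']
print(absolute)  # ['A', 'B', 'C']
```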
Building on your answer to the preceding question, what XPath expression will tell you how many line subelements each of those speeches actually has?
//sp[count(descendant::l) gt 8] ! count(descendant::l)
The preceding answer returned a sequence of 94 <sp> elements. We then use the simple map operator (!) to apply the count() function to the <l> descendants of each of those speeches, one speech at a time, which returns a sequence of 94 integers.
Building on your answers to the preceding two questions, what XPath expression
will find the speakers of all speeches that have more than 8 line subelements?
Once you’ve found the speeches that have more than 8 lines, you can find the
speakers of those speeches by just adding another path step, but you’ll get some
duplication, since a single person may have more than one long speech. Your
answer to this question should get rid of the duplicates, and return just a list
of names of speakers without duplication. You’ll need to use the
distinct-values()
function (Kay
749–50).
//sp[count(descendant::l) gt 8]/speaker => distinct-values()
Starting with the answer to #3, instead of counting the descendant lines, as we did in #4, we add a speaker path step and get the <speaker> child of each speech. Since there are 94 speeches, we get a sequence of 94 speakers. We get rid of the duplicates by applying the distinct-values() function to that sequence.
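The same two steps (collect the speakers of the long speeches, then remove duplicates) can be sketched in Python. The miniature play, the speaker names, and the threshold of 2 are invented for illustration. Note one difference: the sketch keeps the first occurrence of each name in document order, while XPath leaves the order of the values returned by distinct-values() up to the implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
TOY = """
<play>
  <sp><speaker>Ham.</speaker><l/><l/><l/></sp>
  <sp><speaker>Hor.</speaker><l/></sp>
  <sp><speaker>Ham.</speaker><l/><l/><l/></sp>
  <sp><speaker>King</speaker><l/><l/><l/></sp>
</play>
"""
root = ET.fromstring(TOY)
THRESHOLD = 2  # stands in for the 8 in the exercise

# Like //sp[count(descendant::l) gt 2]/speaker: one speaker per long
# speech, duplicates included.
speakers = [
    sp.findtext("speaker")
    for sp in root.iter("sp")
    if len(sp.findall(".//l")) > THRESHOLD
]

# Like => distinct-values(): drop repeated names. dict.fromkeys keeps
# the first occurrence of each key in insertion order.
distinct = list(dict.fromkeys(speakers))

print(speakers)  # ['Ham.', 'Ham.', 'King']
print(distinct)  # ['Ham.', 'King']
```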
Question #1, above, asked you to provide an XPath that would find the last
stage direction (<stage>) in the play.
What XPath would find the last line
(<l>
) in the play? What XPath would find
the last stage direction or line (that is, whichever of the last stage
direction and last line comes last)? You’ll need to use the union
operator (Kay 628–31).
You can find the last stage direction or line with (/descendant::l | /descendant::stage)[last()].
Reading from the inside out, we use
/descendant::l
to find all lines in the
play and /descendant::stage
to find all
stage directions. We join those with the union operator
(|
) to create a sequence of all of the
nodes returned by both of those paths, that is, all lines and all stage
directions. We wrap that union in parentheses to form it into one long sequence
and then use the last()
function in a
predicate to select the last item in that sequence in document order. Note that
the union operator doesn’t concatenate the sequences, which would put all of the
lines before all of the stage directions; it maintains document order. You can
verify this by changing the order of the line and stage-direction parts of the
expression; you’ll get the same result.
Note that we don’t get the last line and last stage direction separately and then figure out which of those comes after the other. That would work, but it entails an unnecessary extra step and only makes the code harder to understand.
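The difference between a union in document order and a plain concatenation can be sketched in Python. The standard library’s xml.etree.ElementTree module has no union operator, so this sketch imitates one by walking the tree in document order and keeping the elements of either type. The helper name in_document_order and the miniature play are our own inventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
TOY = """
<play>
  <sp>
    <l>first line</l>
    <stage>Dies</stage>
    <l>last line</l>
  </sp>
  <stage>Exeunt</stage>
</play>
"""
root = ET.fromstring(TOY)


def in_document_order(root, tags):
    """Return all elements whose tag is in tags, in document order."""
    return [e for e in root.iter() if e.tag in tags]


# Imitating (//l | //stage)[last()] and (//stage | //l)[last()]:
# the order in which we name the two types makes no difference.
union_a = in_document_order(root, {"l", "stage"})[-1].text
union_b = in_document_order(root, {"stage", "l"})[-1].text

# Concatenation, by contrast, depends on which list comes first.
concat_a = (root.findall(".//l") + root.findall(".//stage"))[-1].text
concat_b = (root.findall(".//stage") + root.findall(".//l"))[-1].text

print(union_a, "|", union_b)    # Exeunt | Exeunt
print(concat_a, "|", concat_b)  # Exeunt | last line
```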
Question #2, above, asked you to provide an XPath that would find the
@xml:id
associated with the last cast
member in the cast list. What’s the difference between an XPath that returns the
@xml:id
attribute itself and an XPath that
returns just the value of the
@xml:id
attribute? That is, what are the
two XPath expressions and what object does each of them return? You’ll need to
use the data()
or
string()
function (Kay 741–43, 877–79).
When your path ends with something like
@xml:id
, what you return is an attribute
node. If you were copying that into a new XML document as part of an XSLT
transformation, you would create an attribute on whatever element you had just
created in the output XML document. If, though, you extend the path as
(//castItem)[last()]/role/@xml:id ! string(.)
you’ll get the value of the attribute, instead of the attribute node
itself. If you write the value into your output XML, you don’t get an attribute
node; you just get the string value.
In the <oXygen/> XPath debugger interface there isn’t much visual difference between retrieving the attribute node and retrieving its string value. In an XSLT transformation, though, the difference matters: you don’t want to create an attribute in your output document when you meant to write a string value, or vice versa.
Question #3, above, asked you to provide an XPath expression that would select
all of the speeches (<sp>
elements) with
more than 8 line (<l>
) subelements. What
XPath expressions would select speeches with more than 8 line child
elements (one XPath expression) and speeches with more than 8
descendant line elements (the expression you created for #3,
above)? How do those results differ? If there are descendant line elements that
are not children of a speech, what is their parent? If you don’t know the types
of their parent elements in advance, what XPath expression will tell you?
The XPath //sp[count(descendant::l) gt 8]
returns the sequence of all <sp>
elements that have more than 8 <l>
descendants. As we mention in the answer to regular question #3, above, there
are 94 of them. To find just the children, but not other descendants, use
//sp[count(l) gt 8]
, which returns 87
<sp>
elements. (You could write
//sp[count(./l) gt 8]
, but the dot and
slash aren’t needed [= shouldn’t be used] here, since the child axis is the
default axis.) Since the task was to find lines that are descendants of speeches
but that are not children of those speeches, the most direct route might
be //sp//l[not(parent::sp)]. (As it turns
out, you don’t need the //sp
part of this
path because all <l>
elements happen to be
descendants of <sp>
elements, but in Real
Life you might not always know that sort of detail). This finds all speeches,
and then all of their line descendants, and then uses a predicate to keep only
the lines that don’t have a parent of type
<sp>
.
You can retrieve the parents themselves (instead of retrieving the lines and just
filtering them by their parents) with
//sp//l/..[not(self::sp)]. This is the main reason for the existence of the self:: axis; this XPath can be read as: “Find all speeches in the play, and then all of their line descendants, and then the parents of each of those lines, and filter them to keep only the ones where the parent is not of type <sp>.” To get the element type (if they aren’t speeches, what are they?), you can use the name() function: //sp//l/..[not(self::sp)] ! name(.), and to remove the duplicates you can pass the result to distinct-values() with the arrow operator: //sp//l/..[not(self::sp)] ! name(.) => distinct-values().
The answer is that all lines that are not immediate children of speeches are
children of line groups (<lg>).
You could, alternatively, use //sp//l[not(parent::sp)]/.., which, instead of finding the parents first and then using the self axis to filter out the ones that are of type <sp>, filters on the preceding path step to find only the lines that don’t have <sp> parents and then gets their parents. Whether you filter on the line step or the parent step is a matter of personal preference, and we recommend using the expression that corresponds most closely, step by step, to how you would explain what you were trying to do to your rubber duck.
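For readers who want to experiment with the child and descendant distinction outside of <oXygen/>, here is a minimal Python sketch. It assumes a made-up miniature play and a threshold of 2 in place of 8. Elements in the standard library’s xml.etree.ElementTree module do not record their parents, so the sketch builds its own child-to-parent lookup (the name parent_of is ours).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
TOY = """
<play>
  <sp><speaker>A</speaker><l>1</l><l>2</l><l>3</l></sp>
  <sp><speaker>B</speaker><l>4</l><lg><l>5</l><l>6</l></lg></sp>
</play>
"""
root = ET.fromstring(TOY)
THRESHOLD = 2  # stands in for the 8 in the exercise

# Like //sp[count(l) gt 2]: only <l> children are counted.
by_children = [
    sp.findtext("speaker")
    for sp in root.iter("sp")
    if len(sp.findall("l")) > THRESHOLD
]

# Like //sp[count(descendant::l) gt 2]: <l> elements at any depth count.
by_descendants = [
    sp.findtext("speaker")
    for sp in root.iter("sp")
    if len(sp.findall(".//l")) > THRESHOLD
]

# Map every element to its parent so that we can look upward.
parent_of = {child: parent for parent in root.iter() for child in parent}

# Like //sp//l/..[not(self::sp)] ! name(.) => distinct-values():
# the distinct names of line parents that are not speeches.
other_parents = sorted(
    {
        parent_of[line].tag
        for sp in root.iter("sp")
        for line in sp.iter("l")
        if parent_of[line].tag != "sp"
    }
)

print(by_children)     # ['A']
print(by_descendants)  # ['A', 'B']
print(other_parents)   # ['lg']
```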