Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-03-03T15:57:01+0000
You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve deliberately damaged some of the markup in this edition to introduce some inconsistencies, but the file is well-formed XML, which means that you can use XPath to explore it. You should download this file to your computer (typically that means right-clicking on the link and selecting “save as”) and open it in <oXygen/>.
Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.
Some of these tasks are thought-provoking, and even difficult. If you get stuck, do the best you can, and if you can’t get a working answer, give the answers you tried and explain where they failed to get the results you wanted. As always, you are encouraged to ask questions in the #xpath channel in Slack, but because you want to make progress in learning to debug your own code, your questions should tell us what you tried, what you expected, exactly what you got instead (not just “didn’t work” or “got an error”), and what you think the source of the problem is. Sometimes writing that sort of request for advice will help you figure out what’s wrong on your own (see Rubber duck debugging), and even when it doesn’t, it will help us identify the difficult moments.
These tasks require the use of path expressions, predicates, and functions. References to Kay are to the Michael Kay book; there’s a link in our online course description to a PDF version accessible through the Pitt library system. There may be more than one possible answer.
Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following (give the full XPath expressions in your answers, and not just the results):
What XPath will return a hyphen-separated list of all characters without duplicates? The resulting list will look something like:
Claudius-Hamlet-Polonius ...
Our solution uses string-join() (alternative solutions may also require distinct-values()). Note that there are several ways to identify the characters in this markup, including the <castList> element, the <speaker> elements, and the @who attribute on the <sp> element. Which should you use and why?
The simplest solution is string-join(//role, '-') (or, with the arrow operator, //role => string-join('-')). This doesn’t use distinct-values(), since the values of <role> elements are already unique. Besides guaranteeing uniqueness, using <role> also keeps characters who speak in unison separate. For example, if you use <speaker> elements you’ll find values like <speaker>Rosencrantz and Guildenstern</speaker>, and that isn’t a string you want in your list of characters. Furthermore, the <speaker> and @who values cover only speaking characters, so using those would miss non-speaking characters.
Most metrical lines (<l>) have an @xml:id attribute with a value like sha-ham101010, ending in a six-digit number. The first digit is the act, the next two the scene, and the last three the line in the scene. Some metrical lines are split across multiple speakers, and in that case the six-digit number in the @xml:id value is followed by I (initial part), M (middle part), or F (final part). In a few places there may be more than one middle part, and in those cases the M is followed by a one-digit number. For example, one of Hamlet’s lines is:
<l xml:id="sha-ham502277M2" n="277">One.</l>
which is the second middle part. What XPath will return the number of <l> elements that are middle parts? Our solution uses count() and contains().
//l[contains(@xml:id, 'M')] => count()
As we explained, any line that is a middle part will have the character M in its @xml:id. To test whether a line’s attribute has this character, we use the contains() function inside a predicate. The contains() function checks whether its second argument occurs within the first, that is, whether the capital M is anywhere in the value of @xml:id, which is a string.
To find all of these lines, we search on the descendant axis, starting from the document node, with the // shorthand, and append the predicate [contains(@xml:id, 'M')]. We then use the arrow operator with the count() function to count them.
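To see the logic of this test outside XPath, here is a minimal Python sketch. Only sha-ham502277M2 comes from the text above; the other identifiers are invented for illustration, and Python’s in operator merely stands in for contains().

```python
# Only "sha-ham502277M2" is quoted in the assignment; the other values are invented.
xml_ids = [
    "sha-ham101010",    # an undivided line: no I, M, or F suffix
    "sha-ham502277I",   # initial part
    "sha-ham502277M1",  # first middle part
    "sha-ham502277M2",  # second middle part
    "sha-ham502277F",   # final part
]

# Analogue of //l[contains(@xml:id, 'M')] => count()
middle_parts = [value for value in xml_ids if "M" in value]
middle_count = len(middle_parts)

print(middle_count)
```

The comparison is case-sensitive, which is why the lower-case m in “ham” does not make every line look like a middle part.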
Sometimes Rosencrantz speaks by himself and sometimes he speaks in unison with Guildenstern. What XPath finds all of the speeches by Rosencrantz, whether alone or together with Guildenstern? Our solution uses a single instance of contains().
//sp[contains(@who, "Rosencrantz")]
This solution is very similar to the solution to the previous question, except that it uses contains() to search for an entire string instead of just a single character. It is important to know that contains() is capable of doing both. This XPath finds all of the <sp> elements and filters them by determining whether the string Rosencrantz occurs anywhere in the @who attribute, indicating that Rosencrantz is speaking. Where Rosencrantz and Guildenstern speak together, the @who value is instead "Rosencrantz Guildenstern", but that also contains the string Rosencrantz.
The approach above will produce false positives if the play also has a character with a name like Rosencrantzenfeld because that also contains Rosencrantz as a substring. You can (= should) use the contains-token() function instead of contains() to avoid that peril. The contains-token() function matches a substring only if it is a separate word, that is, not a substring of another word. In that way the contains-token() function is a compact and legible way of performing the same logic as //sp[tokenize(@who) = 'Rosencrantz']. This longer, less legible version uses the tokenize() function to split the value of the @who attribute into word tokens on whitespace (whitespace is the default separator for the tokenize() function if none is specified explicitly) and it then tests whether any of the tokens is equal to the string Rosencrantz.
The contains-token() function is new in XPath 3.1, so it is not included in Michael Kay’s book, which documents XPath and XSLT only through version 2.0.
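The difference between substring matching and token matching can be sketched in Python as follows. This is only an analogue of the XPath functions, not the functions themselves, and the @who values (including the imaginary Rosencrantzenfeld) are made up for the illustration.

```python
# Hypothetical @who values; "Rosencrantzenfeld" is invented to trigger the false positive.
who_values = [
    "Rosencrantz",
    "Rosencrantz Guildenstern",
    "Rosencrantzenfeld",
    "Hamlet",
]


def contains(value: str, needle: str) -> bool:
    """Substring test, analogous to XPath contains()."""
    return needle in value


def contains_token(value: str, token: str) -> bool:
    """Whitespace-token test, analogous to tokenize(@who) = 'token'."""
    return token in value.split()


substring_hits = [w for w in who_values if contains(w, "Rosencrantz")]
token_hits = [w for w in who_values if contains_token(w, "Rosencrantz")]

print(substring_hits)  # includes the unwanted "Rosencrantzenfeld"
print(token_hits)      # only values where Rosencrantz is a whole word
```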
The string-length() function can be used in two ways. You can wrap it around an argument, so that, for example, string-length('Hi, Mom!') will return 8, the length in character count of the string inside the quotation marks. It can also be used as part of a path expression, so that, for example, if the XPath //sp returns a sequence of all <sp> elements, //sp/string-length(.) returns a sequence of the lengths of all <sp> elements as measured by counting characters. This works by finding all of the <sp> elements and then (next path step) getting the string length of each one. Remember that the dot inside the parentheses refers to the current context node, which is the member of the sequence of <sp> nodes that is being processed at the moment. We need to use this subterfuge because string-length(//sp) generates an error. The problem is that the argument to string-length() must be a single item, and //sp returns more than one item. Putting the string-length() function on its own path step with a dot inside means that it applies once for every <sp> element, and that each time it applies, its argument is just a single item.
Use this information to identify an XPath that finds the length of the longest speech. What length does it return? Our solution uses string-length() and max().
max(//sp/string-length(.))
We can read this from the inside out as “first find all <sp> elements in the document; then, for each of them, count its length in characters; then, for the sequence of lengths (all of which are integers) return only the longest”. Because we cannot find the string-length of more than one item at a time, we navigate to all of the <sp> elements in the play on the descendant axis with the // shorthand, and then take a step to use the string-length() function and use . to refer to the current <sp> for each one in turn. Wrapping the max() function around the XPath that produces this sequence of values will return the maximum value in that sequence, that is, the maximum length of any speech, which is 5248.
We find this much easier to read when we write it using the simple map and arrow operators:
//sp ! string-length(.) => max()
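The same “one length per speech, then the maximum” pattern can be sketched in Python with the standard-library ElementTree module. The two-speech sample document and the helper name string_length are our own inventions for illustration; this is not the Bad Hamlet file, and the result has nothing to do with 5248.

```python
import xml.etree.ElementTree as ET

# A tiny made-up sample, written without indentation so that no
# pretty-printing whitespace is counted.
SAMPLE = (
    "<body>"
    "<sp><speaker>Hamlet</speaker><l>One.</l></sp>"
    "<sp><speaker>Osric</speaker><l>A hit, a very palpable hit.</l></sp>"
    "</body>"
)


def string_length(element: ET.Element) -> int:
    """Length of all text inside one element, like string-length(.)."""
    return len("".join(element.itertext()))


root = ET.fromstring(SAMPLE)
speeches = root.findall(".//sp")                   # like //sp
lengths = [string_length(sp) for sp in speeches]   # like //sp ! string-length(.)
longest = max(lengths)                             # like => max()

print(lengths, longest)
```

As in the XPath, the length function is applied to one speech at a time, and only the resulting sequence of integers is handed to max().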
By the way, this is a naive and textually incorrect way to measure the length of a speech. It includes the content of any embedded <speaker> or <stage> elements, which aren’t part of the spoken text, and it also includes any whitespace characters that might have been present because of indentation during pretty-printing. How might you measure the length of a speech in a more textually meaningful way, and how would you do that using XPath?
Optional, challenging question: Given the preceding solution, how can you use that XPath to retrieve the longest <sp> itself? No fair checking the length and then writing a separate XPath that looks for that number. Your answer must find the longest speech without your knowing how long it is. Our solution doesn’t require any additional functions beyond the ones used in #4, but it does use a complicated predicate.
//sp[string-length(.) eq max(//sp ! string-length(.))]
Since the value returned by the previous solution is just a number, we can use it in a predicate to compare against the length of any speech that we are looking at. We first find the speeches and then check one by one whether the length of any of them is equal to the maximum length of all speeches in the play. This works because even within a predicate the // starts its search from the root of the document, so the comparison is to the maximal length of all speeches.
It is also possible to find the longest speech without building on #4. An
expression that does that is:
//sp[not(string-length(.) < //sp/string-length(.))]
This takes advantage of the fact that when we use the general comparison operator < (which has to be spelled with the character entity &lt; when the expression is inside an attribute value, where a literal < would not be well formed), the comparison returns true if any item on the left side of the operator is less than any item on the right (see the discussion of General comparison at the bottom of our XPath functions we use most). The right side of our comparison here is a sequence of integers that represent the lengths of all speeches, and the item on the left is the integer length of the speech we’re looking at at the moment (we look at each one separately, because that’s how predicates work). The only speech that is not shorter than at least one of the sequence of all speeches in the play is the one that is itself the longest, so our test will pick out just that one. As you gain more experience with using general comparison operators to compare sequences to one another this type of logic will grow more intuitive.
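If the “any item on the left, any item on the right” rule feels slippery, a small Python sketch may help. The function name general_lt and the list of lengths are hypothetical; the sketch only imitates how the general comparison < treats sequences.

```python
# Made-up speech lengths; two speeches tie for longest on purpose.
lengths = [12, 40, 7, 40]


def general_lt(left, right) -> bool:
    """Imitates XPath general comparison: true if ANY left item < ANY right item."""
    return any(l < r for l in left for r in right)


# Analogue of //sp[not(string-length(.) < //sp/string-length(.))]:
# keep the (1-based) positions whose length is not less than any length.
longest_positions = [
    position
    for position, length in enumerate(lengths, start=1)
    if not general_lt([length], lengths)
]

print(longest_positions)
```

Note that in this sample both tied speeches survive the test, so an expression of this kind returns every speech that shares the maximum length, not necessarily a single one.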
We don’t recommend the following alternative because it requires an extra statement, but it’s worthwhile knowing about the XPath let statement:
let $longest := max(//sp/string-length(.))
return //sp[string-length(.) eq $longest]
The let statement defines a variable, and in this case, we create a variable called $longest (variable names in XPath begin with a dollar sign) which we set equal to the length of the longest speech, which is the integer 5248. The binding operator, which binds the value of the expression on its right to the variable name on its left, is a colon followed by an equal sign (:=), and not just an equal sign (as in some other programming languages). The binding operator is sometimes called the walrus operator because it looks like the eyes and tusks of a walrus lying on its side, at least if you have a lively imagination.
A let statement must be paired with a return statement, which normally uses the variable to compute the value of an XPath expression, so in this case the return statement returns a sequence of all <sp> elements with a string length equal to 5248. You can write the entire expression on one line if you prefer; we’ve broken it over two lines because we find it easier to read that way.
Optional, very challenging question: What XPath produces a numbered list of all characters, without any duplicates, which should look something like:
There are several possible solutions, each of which raises issues that you may not have seen before. If you get an error message, try to figure out what it means and how to resolve it.
One solution is //role ! concat(position(), ". " ,.). We retrieve all <role> elements and then, for each one, return a concatenation of its position in the sequence of <role> elements (using the position() function to get the position of the current context node in the sequence selected by the preceding path step), a literal dot followed by a space, and then the <role> element itself. The concat() function automatically atomizes its arguments, which is to say that when we pass it a <role> element, it converts it to its atomic (string) value (that is, it throws away the markup and just gives us back the character content), so that we wind up with results like “1. Claudius”, which is what we want.
There is, alternatively, a concatenation operator, spelled ||, that you can use instead of the concat() function. The expression with that operator would look like //role ! (position() || ". " || .).
If we get the characters using <speaker> or @who values instead of <role>, we need to deduplicate them with distinct-values(), and the expression would be distinct-values(//speaker) ! concat(position(), ". " ,.).
Finally, instead of iterating over the roles or distinct <speaker> or @who values and returning their positions and string values, as we do above, we can iterate over the positions and return the same thing. The expression in that case would be
for $i in (1 to count(//role))
return concat($i, ". ", (//role)[$i])
The for expression iterates over a sequence and does something once for each member of the sequence. The sequence over which it iterates is a sequence of integers from 1 through however many characters there are in the play (there are 37, which XPath determines by counting the number of <role> elements). We use the to operator, which we haven’t used before, as an instruction to generate the sequence of integers for us dynamically. If there were, say, only 5 characters in the play, the expression 1 to count(//role) would be equivalent to the sequence (1,2,3,4,5).
Although for $i in (1 to count(//role)) is just setting the variable $i to a different integer value each time it loops, on each pass through the for loop (//role)[$i] will point to a different character in the play. On the first pass, when $i equals 1, (//role)[$i] means (//role)[1], so it points to the first character in the sequence returned by //role. On the second pass, the number value is 2 and (//role)[$i] means (//role)[2], and points to the second character. This is what lets us generate our numbered list; both the numbers and the pointers into the list of characters are incremented by one on each pass through the loop. You can read more about how this works in http://xsltbyexample.blogspot.com/2010/05/obtain-position-from-for-expression-in.html, which details specifically how and why you would use this approach. We find this the least intuitive of the options discussed here, so it isn’t the one we’d choose in Real Life. It’s nonetheless worth knowing how for expressions work, but because simple map and path expressions have an implicit for built into every step (since the step to the right of the slash or bang is applied once for each item in the sequence to the left), we use explicit for statements much less in XPath than we do in many other programming languages.
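The two strategies, mapping over the items and asking for each one’s position versus looping over the positions and looking up each item, can be compared in a short Python sketch. The three-name cast list is a made-up stand-in for the sequence returned by //role.

```python
# A made-up, shortened cast list standing in for the result of //role.
roles = ["Claudius", "Hamlet", "Polonius"]

# Analogue of //role ! concat(position(), ". ", .):
# iterate over the items and ask for each one's position.
mapped = [f"{position}. {role}" for position, role in enumerate(roles, start=1)]

# Analogue of for $i in (1 to count(//role)) return concat($i, ". ", (//role)[$i]):
# iterate over the positions and look up the item at each one.
# XPath positions start at 1, while Python list indexes start at 0.
looped = [f"{i}. {roles[i - 1]}" for i in range(1, len(roles) + 1)]

print(mapped)
print(looped)
```

Both lists are identical; the first version simply lets the iteration keep track of the position for us, much as the simple map operator does.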
In many instances we can apply an operation to a sequence of nodes with either a slash or simple mapping. For example, the following two expressions are equivalent, and each returns a sequence of integers that gives the string length for each act in the play, in document order (that is, from Act 1 through Act 5, consecutively):
//body/div ! string-length(.)
//body/div/string-length(.)
We recommend using the simple map operator where appropriate, such as in this situation, because it makes it easier to see when we are taking a path step and when we are applying a function to each member of a sequence.
There are, though, at least two important differences between the slash and the simple map operator in this context:
With both operations the sequence to the left supplies the input items, one by one, for the operation on the right. The slash requires that the sequence to the left be a sequence of nodes, while the simple map operator allows any sequence. For example:
The expression distinct-values(//speaker) ! string-length(.) will return the string length (number of characters) for each distinct speaker name.
The expression distinct-values(//speaker)/string-length(.) will raise an error because distinct values are strings, and not nodes.
Meanwhile, both //speaker ! string-length(.) and //speaker/string-length(.) will work because //speaker, unlike distinct-values(//speaker), is a sequence of nodes.
The slash sorts the items on the left into document order and removes duplicates before it passes them along to the operation on the right, which the bang does not do. The following example is a bit contrived (more natural examples occur inside XSLT transformations, but not in stand-alone XPath expressions):
The expression
(//body/div => sort((), function($i) {string-length($i)})) ! head
sorts the acts from shortest to longest by string-length and returns the act labels. The results are, in order, the <head> children of acts 4, 2, 5, 1, 3. That is, Act 4 is the shortest act and Act 3 is the longest with respect to string length.
You can read about the sort() function at https://www.w3.org/TR/xpath-functions-31/#func-sort.
If we replace the simple map operator in the XPath expression above with the slash we re-sort the results into document order, so that the results are, in order, acts 1, 2, 3, 4, 5. This is probably not what we want or we wouldn’t have sorted them in the first place.
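The effect of the slash can be imitated in Python by re-sorting a sorted result by its original position. The act lengths below are invented; they were chosen only so that the shortest-to-longest order matches the 4, 2, 5, 1, 3 order reported above.

```python
# Acts in document order, paired with invented string lengths.
acts = [
    ("Act 1", 400),
    ("Act 2", 200),
    ("Act 3", 500),
    ("Act 4", 100),
    ("Act 5", 300),
]

# Where each act sits in the document.
document_position = {label: index for index, (label, _) in enumerate(acts)}

# Analogue of the simple map (!): the sorted order is kept as it is.
bang_result = [label for label, _ in sorted(acts, key=lambda act: act[1])]

# Analogue of the slash (/): duplicates are removed and the results are
# put back into document order, which undoes the sort.
slash_result = sorted(set(bang_result), key=document_position.get)

print(bang_result)
print(slash_result)
```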