Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2023-01-08T17:47:32+0000
You can find an XML (TEI) version of Shakespeare’s Hamlet at http://dh.obdurodon.org/bad-hamlet.xml. We’ve
deliberately damaged some of the markup in this edition to introduce some
inconsistencies, but the file is well-formed XML, which means that you can use XPath
to explore it. You should download this file to your computer (typically that means
right-clicking on the link and selecting “save as”) and open it in <oXygen/>.
Prepare your answers to the following questions in a markdown file and upload it to Canvas as an attachment. As always, code snippets (including XPath snippets) in markdown must be surrounded with backticks.
Some of these tasks are thought-provoking, and even difficult. If you get stuck, do
the best you can, and if you can’t get a working answer, give the answers you tried
and explain where they failed to get the results you wanted. As always, you are
encouraged to ask questions in the #xpath channel in Slack, but because you
want to make progress in learning to debug your own code, your questions should tell
us what you tried, what you expected, exactly what you got instead (not
just “didn’t work” or “got an error”), and what you think the source of
the problem is. Sometimes writing that sort of request for advice will help you
figure out what’s wrong on your own (see Rubber duck debugging), and even when
it doesn’t, it will help us identify the difficult moments.
These tasks require the use of path expressions, predicates, and functions. References to Kay are to the Michael Kay book; there’s a link in our online course description to a PDF version accessible through the Pitt library system. There may be more than one possible answer.
Using the Bad Hamlet document and the XPath browser window in <oXygen/>, construct XPath expressions that will do the following. Give the full XPath expressions in your answers, and not just the results:
1. What XPath will return a hyphen-separated list of all characters without duplicates? The resulting list will look something like:
Claudius-Hamlet-Polonius ...
Our solution uses string-join() (alternative solutions may also require
distinct-values()). Note that there are several ways to identify the characters
in this markup, including the <castList> element, the <speaker> elements, and
the @who attribute on the <sp> element. Which should you use and why?
2. Most metrical lines (<l>) have an @xml:id attribute with a value like
sha-ham101010, ending in a six-digit number. The first digit is the act, the
next two the scene, and the last three the line in the scene. Some metrical
lines are split across multiple speakers, and in that case the six-digit number
in the @xml:id value is followed by I (initial part), M (middle part), or F
(final part). In a few places there may be more than one middle part, and in
those cases the M is followed by a one-digit number. For example, one of
Hamlet’s lines is:
<l xml:id="sha-ham502277M2" n="277">One.</l>
which is the second middle part. What XPath will return the number of <l>
elements that are middle parts? Our solution uses count() and contains().
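To make the @xml:id layout described above concrete, here is a small sketch that pulls the parts out of the sample value with substring(). It is not a solution to the task. It assumes that every value starts with the seven-character prefix sha-ham, as in the examples shown here, and it uses a literal string so that it can be evaluated without the Hamlet file.

```xpath
(
  substring('sha-ham502277M2', 8, 1),   (: act: '5' :)
  substring('sha-ham502277M2', 9, 2),   (: scene: '02' :)
  substring('sha-ham502277M2', 11, 3),  (: line in the scene: '277' :)
  substring('sha-ham502277M2', 14)      (: part marker: 'M2' :)
)
```

The line number agrees with the @n value on the sample element. For a line that is not split, such as sha-ham101010, the last substring() returns an empty string because nothing follows the six digits.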
3. Sometimes Rosencrantz speaks by himself and sometimes he speaks in unison
with Guildenstern. What XPath finds all of the speeches by Rosencrantz,
whether alone or together with Guildenstern? Our solution uses a single
instance of contains().
4. The string-length() function can be used in two ways. You can wrap it around
an argument, so that, for example, string-length('Hi, Mom!') will return 8, the
length in character count of the string inside the quotation marks. It can also
be used as part of a path expression, so that, for example, if the XPath //sp
returns a sequence of all <sp> elements, //sp/string-length(.) returns a
sequence of the lengths of all <sp> elements as measured by counting
characters. This works by finding all of the <sp> elements and then (next path
step) getting the string length of each one. Remember that the dot inside the
parentheses refers to the current context node, which is the member of the
sequence of <sp> nodes that is being processed at the moment. We need to use
this subterfuge because string-length(//sp) generates an error. The problem is
that string-length() can take only a single item as its argument, and //sp
returns more than one item. Putting the string-length() function on its own
path step with a dot inside means that it applies once for every <sp> element,
and that each time it applies, it has just a single item as its argument.
In Real Life we are more likely to use the simple map operator (!) than a path
step to apply a function to a sequence of nodes. That is, we would write
//sp ! string-length(.) instead of //sp/string-length(.). The meaning is the
same and both versions are correct, idiomatic XPath, but we prefer the simple
map operator in this situation because it helps us recognize quickly that we’re
applying a function, and not just selecting nodes.
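The one-item-at-a-time behavior described above can be tried without the Hamlet file. The sketch below uses literal strings, which are our own illustration, and assumes that the XPath version is set to 3.0 or later, because the simple map operator is not available in earlier versions. Each expression is meant to be evaluated on its own.

```xpath
(: One string in, one number out: returns 8 :)
string-length('Hi, Mom!')

(: Two strings passed as a single argument: raises a type error :)
string-length(('Hi, Mom!', 'Hello'))

(: Simple map applies the function once per item: returns (8, 5) :)
('Hi, Mom!', 'Hello') ! string-length(.)
```

The second expression fails for the same reason that string-length(//sp) does: the function receives more than one item. The third one also shows a case where only the simple map operator is available, because the left side of a path step must be a sequence of nodes, so replacing ! with / here raises a type error.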
Use the information above to write an XPath expression that finds the length
of the longest speech. What length does it return? Our solution uses
string-length() and max().
5. Optional challenging question: Given the preceding solution, how can
you use that XPath to retrieve the longest <sp> itself? No fair checking the
length and then writing a separate XPath that looks for that number. Your
answer must find the longest speech without your knowing how long it is. Our
solution doesn’t require any additional functions beyond the ones used in
#4, but it does use a complicated predicate.
6. Optional more challenging question: What XPath produces a numbered list of all characters, without any duplicates, which should look something like:
1. Claudius
2. Hamlet
3. Polonius
...
There are several possible solutions, each of which raises issues that you may not have seen before. If you get an error message, try to figure out what it means and how to resolve it.
You should turn in your answers to the above questions in a markdown file, that is, a file with the extension .md.