Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2022-04-20T00:07:23+0000
For the XPath test, you will be using XPath to take a closer look at individual characters in Shakespeare’s Hamlet. We recommend working in the XPath/XQuery Builder view instead of the small XPath toolbar at the top of your <oXygen/> window because some of these expressions can get long, and it is easier to stay organized if you can see the entirety of your XPath expression as you are typing. Like all of our tests, this one is open-book, which means that you can consult notes, books, the Internet and other resources, except that you cannot receive any assistance from another person. Should you get stuck in a way that does not respond to your best rubber duck debugging efforts, feel free to post an inquiry in our Slack workspace and we’ll try to point you in the right direction. You should submit your answers (the full XPath expressions, not just the result of evaluating them) in a properly formatted markdown document. That includes surrounding XPath expressions in backticks.
Using the version of Hamlet that we have been using for all the previous XPath assignments (http://dh.obdurodon.org/bad-hamlet.xml), complete the following tasks. There are alternative good solutions for some of them, so you will not need to use all of the following functions, but some that we used include avg(), distinct-values(), matches(), normalize-space(), round(), sort(), string-join(), tokenize(), and translate(). Don’t guess at how these work; if you aren’t already familiar with them, look up the number of arguments they require and what each argument means in Michael Kay or an alternative reference.
Here are two details to keep in mind:
Some lines in the XML begin with spurious space characters. For example:
<l> I say, away! Go on; I'll follow thee.
<stage>Exeunt Ghost and Hamlet.</stage>
</l>
begins with a space even though the first word spoken is the word I.
Whether those spaces require special handling depends on how you approach
the tasks below, but whatever approach you take, you’ll want to verify that
these lines are being treated properly.
You may need to select the text node children of an element that contains mixed content. For example, the line above contains mixed content, in this case consisting of a single text node (that is, plain text) followed by a single <stage> element. Much as you can select the <stage> element children of all lines with //l/stage, you can select the text node children of all lines with //l/text(). The path step spelled text() is not a function (even though it looks like one with its trailing parentheses); it’s the way to say, in this path expression, “select the text nodes on the child axis of each of the nodes selected by the previous path step.”
The technical term for text() is node test: it tests for and selects text nodes. This is similar to the way that * in //l/* tests for and selects all element children of each <l> element (but not any children that are not element nodes, like text nodes or comment nodes) and stage in //l/stage tests for and selects element nodes that are of type <stage> (but not any children that are not element nodes or that are element nodes of other types). See Kay, p. 614.
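To make the difference between these node tests concrete, here is a minimal Python sketch that uses the standard library's xml.dom.minidom. The sample fragment is a hypothetical reconstruction modeled on the line discussed above, not a copy of the course file, and the helper names child_text_nodes and child_elements are invented for illustration. The three calls roughly mirror what text(), * and stage select on the child axis of one <l> element.

```python
from xml.dom import minidom

# Hypothetical fragment modeled on the example line (an assumption, not the real file)
SAMPLE = (
    "<l> I say, away! Go on; I'll follow thee.\n"
    "  <stage>Exeunt Ghost and Hamlet.</stage>\n"
    "</l>"
)

def child_text_nodes(element):
    """Counterpart of the text() node test on the child axis."""
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

def child_elements(element, name=None):
    """Counterpart of * (name=None) or of a name test such as stage."""
    return [
        n.tagName
        for n in element.childNodes
        if n.nodeType == n.ELEMENT_NODE and (name is None or n.tagName == name)
    ]

line = minidom.parseString(SAMPLE).documentElement
texts = child_text_nodes(line)                   # like l/text()
any_elements = child_elements(line)              # like l/*
stage_elements = child_elements(line, "stage")   # like l/stage

print(texts)
print(any_elements)
print(stage_elements)
```

In this sample the first text node carries the spoken words, including the leading space, and the second holds only the white space that follows the stage direction.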
Find all of Hamlet’s spoken lines (which may be represented by <l> or <ab> elements) in the play and select only those that begin with the word I (not just the letter I). The result should be a sequence of elements of type <l> or <ab>.
//(l | ab)[ancestor::sp[@who eq 'Hamlet']][starts-with(normalize-space(.), 'I ')]
This expression returns a sequence of 66 <l> and <ab> elements. We start at the document node and search everywhere for both <l> and <ab> elements, using the union operator (|) inside parentheses in //(l | ab) to create a sequence of all elements of those types. We then filter those nodes to select a subsequence that contains only the lines that meet our requirements, that is, those that have an ancestor <sp> element with a @who value equal to the string Hamlet. We then apply another predicate that keeps only the lines that start with I followed by a single space, which ensures that we return only lines that begin with the word I, not just the letter I. Because some lines in the XML begin with spurious space characters, we can’t merely ask for lines that start with I followed by a space, as we would miss the lines that start with a space followed by I. To handle this white-space irregularity, we apply normalize-space() to the context item (each line with Hamlet as the speaker) and pass the result to starts-with() as its first argument, with the string I followed by a space as its second argument.
Alternatively, you can check for lines that begin with the word I with either of the following predicates:
[tokenize(.)[1] eq "I"]
[matches(., '^\s*I ')]
tokenize(), when called with a single argument, splits a string into word tokens on sequences of white-space characters and ignores leading and trailing white space, so a spurious initial space does not produce an empty first token. The numerical predicate keeps only the first token for the context item (each line spoken by Hamlet). We test whether that token is equal to the string I, and if it is, the line to which it belongs is returned in the results.
matches() takes a string as its first argument and a regex pattern as its second argument. In this case, we use matches() on the context item (each line spoken by Hamlet) to match lines that start with zero or more white-space characters followed by I followed by a single space.
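The three tests described above can be compared side by side in a short Python sketch. It only approximates the XPath functions: normalize_space and tokenize are hand-written stand-ins (assumptions, not library calls), re.search plays the role of the unanchored matches(), and the sample strings are illustrative rather than taken from the file. Python's \s also matches more characters than XPath's, which does not matter for these ASCII samples.

```python
import re

XML_WS = " \t\r\n"  # the four characters XPath treats as white space

def normalize_space(s):
    """Approximate XPath normalize-space(): trim, then collapse runs of white space."""
    return re.sub(r"[ \t\r\n]+", " ", s.strip(XML_WS))

def tokenize(s):
    """Approximate one-argument XPath tokenize(): split on white space, ignoring the edges."""
    normalized = normalize_space(s)
    return normalized.split(" ") if normalized else []

def begins_with_word_i(s):
    """Return the verdicts of the three tests, in the order discussed above."""
    by_starts_with = normalize_space(s).startswith("I ")
    tokens = tokenize(s)
    by_tokenize = bool(tokens) and tokens[0] == "I"
    by_matches = re.search(r"^\s*I ", s) is not None
    return by_starts_with, by_tokenize, by_matches

samples = [
    " I say, away! Go on; I'll follow thee.",  # leading space, first word is I
    "I'll follow thee.",                        # begins with the letter I only
    "It is not so.",                            # begins with the letter I only
]
results = [begins_with_word_i(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), r)
```

All three tests accept the first sample and reject the other two, which is the behavior the task asks for.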
In all of these methods we take advantage of the fact that although some lines of speech contain stage directions, the stage direction happens never to be first. This means that if I is at the beginning of one of these lines, it is spoken text, and not the beginning of a stage direction. A more robust approach (one that would work properly if the first word inside one of Hamlet’s lines was I, but it was inside a stage direction, and not part of the spoken content) would perform the operation below (in 2) to remove stage directions and only then check for lines that begin with I.
In general, a lot of you used an expression like //sp/* to locate all the spoken lines in the play. This ends up working for Hamlet’s spoken lines because all of his lines are children of <sp> elements, but some <l> elements, as we know from our XPath assignments, are grandchildren of <sp> elements instead of direct children. Because we do not know where in the hierarchy Hamlet’s spoken lines are, we need to account for any location in our expression. One good way to do this is not to think about <sp> elements at first and instead locate all the <l> and <ab> elements in the entire document with something like //(l | ab). From here, you can apply predicates to filter the resulting lines further.
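The difference between looking only at children of <sp> and looking at any depth can be sketched in Python with xml.etree.ElementTree. The fragment below is entirely hypothetical: the <lg> wrapper, the speakers, and the line texts are assumptions chosen to create one <l> that is a grandchild of <sp>.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <lg> wrapper and all content are invented for illustration
SAMPLE = """
<body>
  <sp who="Hamlet"><l>First line.</l></sp>
  <sp who="Ophelia"><lg><l>Nested line.</l></lg></sp>
  <sp who="Hamlet"><ab>Prose passage.</ab></sp>
</body>
"""
root = ET.fromstring(SAMPLE)
LINE_TAGS = ("l", "ab")

# Rough counterpart of //sp/* restricted to l and ab: direct children only
children_only = [
    child.text for sp in root.iter("sp") for child in sp if child.tag in LINE_TAGS
]

# Rough counterpart of //(l | ab): matching elements at any depth
anywhere = [el.text for el in root.iter() if el.tag in LINE_TAGS]

print(children_only)
print(anywhere)
```

The child-only search misses the nested line, while the search at any depth finds all three.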
The elements you select above all contain text nodes, which represent spoken text, but some also contain stage directions (see the example above). Extend your XPath expression above to return just the spoken text from each of the lines, without any accompanying stage directions. The result will be plain text, without any markup. (Hint: See above about selecting text nodes.)
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[starts-with(normalize-space(.), 'I ')]
/string-join(text(), ' ')
This expression returns the same 66 lines as before, only now the result consists of strings instead of element nodes. We need the string-join() because there is one <l> spoken by Hamlet that contains two text nodes, and without the string-join() each would be considered a separate result:
<l> I say, away! Go on; I'll follow thee.
<stage>Exeunt Ghost and Hamlet.</stage>
</l>
There is a text node before the stage direction and one after the stage direction, but only the one before the stage direction contains text other than white space.
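The effect of joining the text nodes per line can be checked with a small Python sketch based on xml.dom.minidom. The fragment is hypothetical (the second line and the wrapping <sp> are assumptions), and text_children is an invented helper name.

```python
from xml.dom import minidom

# Hypothetical fragment: one line with a stage direction, one without
SAMPLE = (
    "<sp>"
    "<l> I say, away! Go on; I'll follow thee.\n"
    "  <stage>Exeunt Ghost and Hamlet.</stage>\n"
    "</l>"
    "<l>I will.</l>"
    "</sp>"
)

def text_children(element):
    """Return the string values of the text-node children of an element."""
    return [n.data for n in element.childNodes if n.nodeType == n.TEXT_NODE]

lines = minidom.parseString(SAMPLE).getElementsByTagName("l")

# Like //l/text(): every text node is its own result
separate = [t for line in lines for t in text_children(line)]

# Like //l/string-join(text(), ' '): exactly one string per line
joined = [" ".join(text_children(line)) for line in lines]

print(len(separate), len(joined))
```

Two lines yield three separate text nodes but only two joined strings, which is why the join keeps the result count equal to the line count.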
There is a lot of extra white-space that gets included with these lines because of pretty-printing. For example, where the XML contains:
A little more than kin, and less than
kind.
the pretty-printing introduces a newline and some extra spaces for indentation. Extend your XPath expression above (the one that selects only the text nodes inside lines, but not stage directions) to remove this extra space, so that each spoken line of text will be continuous, with single space characters between words.
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[starts-with(normalize-space(.), 'I ')]
/string-join(text(), ' ')
! normalize-space(.)
Again, we return the same number of items, but this time the resulting text does not contain any extra space beyond the standard single space between word tokens. We use the simple mapping operator to apply normalize-space() to each joined string so that each resulting string contains only the white space that is to be expected within a sentence: a single space between words.
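As a rough illustration of what normalize-space() does to a pretty-printed line, here is a Python stand-in. The function is a hand-written approximation, and the amount of indentation in the sample string is an assumption.

```python
import re

def normalize_space(s):
    """Approximate XPath normalize-space(): trim, then collapse runs of white space."""
    return re.sub(r"[ \t\r\n]+", " ", s.strip(" \t\r\n"))

# The newline and indentation stand in for pretty-printing (exact amounts are assumed)
pretty_printed = "A little more than kin, and less than\n                kind."
cleaned = normalize_space(pretty_printed)
print(cleaned)
```

The newline and the run of indentation collapse into the single space between the last two words.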
Modify the preceding XPath expression to return the length in characters of each line spoken by Hamlet that begins with I (after ignoring stage directions and removing extra whitespace). The result will be a sequence of integers, each representing the character count of a single line of speech.
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[starts-with(normalize-space(.), 'I ')]/
string-join(text(), ' ')
! normalize-space(.)
! string-length(.)
As before, we are still operating on the same number of lines. This time, our modification uses the simple mapping operator to apply the string-length() function to the context item, that is, each of the 66 strings. Passing each one into string-length() returns its length in characters, so our result is a sequence of integers.
Write an XPath expression to compute the average length (in character count) of the lines spoken by Hamlet that begin with I. If the value is not already an integer, round it to the nearest integer value.
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[starts-with(normalize-space(.), 'I ')]/
string-join(text(), ' ')
! normalize-space(.)
! string-length(.)
=> avg()
=> round()
We have the string length of each line from the previous expression, and we can use these numbers to find the average length of a Hamlet line beginning with I. The XPath avg() function takes care of summing the sequence of integers and dividing by the number of lines, so all we have to do is feed the sequence of string lengths to the function. We use the arrow operator here instead of the simple mapping operator because we are no longer operating on each of the lines individually. Rather, we want to apply the function once, using the entire sequence of string lengths as the input. We get a decimal value of 38.28787878 repeating, but we want an integer value for the average. The round() function takes a single argument, in this case the value returned by avg(), and rounds it to the closest integer value. We find that there is an average of 38 characters per Hamlet line beginning with the word I. If you didn’t join the two text nodes of the line containing a <stage> element, the average comes out slightly lower, at 37.71641791, because the additional item increases the number by which the sum of string lengths is divided. Rounding to the nearest integer with round(), though, yields the same value of 38.
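The two averages quoted above can be checked with a little arithmetic. The sketch below assumes a total of 2527 characters, a figure inferred from the stated averages (38.28787878 × 66 and 37.71641791 × 67 both come to 2527) rather than read from the file. The helper xpath_round is an invented name that mimics the way XPath round() resolves ties toward positive infinity, which differs from Python's built-in round().

```python
import math
from fractions import Fraction

def xpath_round(x):
    """Round to the nearest integer, with ties going toward positive infinity."""
    return math.floor(x + Fraction(1, 2))

TOTAL_CHARS = 2527  # inferred from the averages quoted above, not read from the file

avg_joined = Fraction(TOTAL_CHARS, 66)    # 66 strings: text nodes joined per line
avg_unjoined = Fraction(TOTAL_CHARS, 67)  # 67 strings: one extra from the split line

print(float(avg_joined), xpath_round(avg_joined))
print(float(avg_unjoined), xpath_round(avg_unjoined))
```

The extra item adds nothing to the total because the second text node is white space only, so only the divisor changes and both averages round to the same integer.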
Who speaks in the fifth act of the play?
//body/div[5]//speaker => distinct-values()
We begin by locating all the acts in the play with //body/div. This returns the five different acts, but we only care about the fifth one. To return only the fifth act, we apply a numerical predicate to our previous expression that filters the acts to keep only the one we want. Our expression looks like //body/div[5]. From here, we want to locate all of the speakers that appear in Act V, but we do not know for sure where <speaker> elements are located in the hierarchy. Thus, we have to operate on the descendant axis to locate speakers with //body/div[5]//speaker. This expression returns a list of 257 speakers with duplicates. To get rid of the duplicate speakers, we apply the distinct-values() function and return a list of 13 distinct characters who speak in Act V of Hamlet.
//body/div[5]//speaker
=> distinct-values()
=> sort()
The XPath function sort(), as its name suggests, sorts the input supplied to it. Because the input in our case is a list of strings, sort() will sort the supplied input alphabetically. Our expression returns a list of the same 13 distinct speakers from number one above, but now the speakers appear in alphabetical order.
//body/div[5]//speaker
=> distinct-values()
=> sort()
=> string-join(', ')
Our last expression gave us a sequence of distinct speakers in alphabetical order. We want to join these speakers together in a single, comma-separated list. To do so, we apply the XPath function string-join() to the sequence of speakers. Because the arrow operator supplies the sequence on its left as the first argument, we write only one argument explicitly: the separator string to insert between the strings. We join on a comma followed by a space to get a comma-separated, alphabetized list of 13 unique speakers.
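The three-step pipeline (remove duplicates, sort, join) has a close Python analogue, sketched here with a hypothetical list of speaker names rather than the real Act V data. Two caveats: XPath does not guarantee the order in which distinct-values() returns its results, whereas dict.fromkeys keeps first-seen order, and Python's sorted() compares by code point while XPath sort() uses the default collation.

```python
# Hypothetical speaker values with duplicates, standing in for //body/div[5]//speaker
speakers = ["Hamlet", "Horatio", "First Clown", "Hamlet", "Horatio", "Queen"]

distinct = list(dict.fromkeys(speakers))  # like distinct-values(): drop duplicates
ordered = sorted(distinct)                # like sort(): ascending order
joined = ", ".join(ordered)               # like string-join(', '): one string

print(joined)
```

Each step takes the whole sequence as its input, which is the same reason the XPath version chains the functions with the arrow operator.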
Find the distinct words that immediately follow the word I in all of Hamlet’s spoken lines that start with the word I, and return them in alphabetical order.
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[tokenize(.)[1] eq 'I']
/tokenize(.)[2]
=> distinct-values()
=> sort()
We start with our expression from Part 1 that uses the tokenize() strategy to find all of Hamlet’s spoken lines that begin with the word I. We then tokenize each of these lines and filter with the numerical predicate to keep only the second token in each line. Each of these second tokens is the word that immediately follows the word I. The result contains 66 items, and we can check this number against the number of lines that begin with the word I, since the two should be the same. To return the distinct words, we use the arrow operator to apply distinct-values() to the sequence of 66 words, which returns 32 unique words. To alphabetize this list, we use the sort() function, as we did in the second task of Part 2.
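The second-token strategy can be traced in Python on a few hypothetical lines. The tokenize helper is a hand-written approximation of the one-argument XPath function, and the sample strings are invented for illustration.

```python
import re

def tokenize(s):
    """Approximate one-argument XPath tokenize(): split on runs of white space."""
    return [t for t in re.split(r"[ \t\r\n]+", s) if t]

# Hypothetical lines standing in for Hamlet's speeches
lines = [" I say, away!", "I will.", "I say no more.", "If it be so."]

second_words = []
for line in lines:
    tokens = tokenize(line)
    # Keep the second token only when the first token is exactly the word I
    if tokens[:1] == ["I"] and len(tokens) > 1:
        second_words.append(tokens[1])

distinct_sorted = sorted(set(second_words))
print(second_words)
print(distinct_sorted)
```

In this sample, tokens keep any punctuation attached to them, so a word with and without a trailing comma counts as two distinct values.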
//(l | ab)
[ancestor::sp[@who eq 'Hamlet']]
[tokenize(.)[1] eq 'I']
/tokenize(.)[2]
! translate(., '.,', '')
=> distinct-values()
=> sort()
The word will. appears with a trailing period, which we overlooked when we wrote the original description of the task. The XPath expression above strips out both commas and periods.
The translate() function in our expression takes three arguments: the input string, a string of characters to be replaced, and a string of corresponding replacement characters. Any character in the second argument that has no counterpart in the third is deleted. We specify the input string as the context item, which is each of the words that immediately follow the word I. Because our third argument is the empty string, periods and commas are replaced with nothing, effectively removing them from the input words. We then apply distinct-values() and sort(), as we did in the previous bonus task, to return an alphabetized list of unique words. We strip out the unwanted punctuation before we apply distinct-values() so that input words with punctuation attached, such as will., are not counted as their own word type. As a result, the expression that strips out punctuation returns a list of 31 words instead of 32.
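Python's str.translate offers a close analogue to the character deletion performed by translate(., '.,', ''). The sketch below uses a hypothetical word list to show why the punctuation has to be removed before the duplicates are.

```python
# Hypothetical second words, some with punctuation attached
words = ["will.", "will", "am,", "am", "have"]

# The third argument of str.maketrans lists characters to delete,
# much like XPath translate() with an empty replacement string
strip_punct = str.maketrans("", "", ".,")
cleaned = [w.translate(strip_punct) for w in words]

before = sorted(set(words))    # punctuation still attached
after = sorted(set(cleaned))   # punctuation removed first

print(before)
print(after)
```

Five apparent word types shrink to three once the periods and commas are gone, the same kind of reduction as the drop from 32 to 31 described above.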