Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-11-14T21:03:49+0000


Test #4: XPath

The task

For the XPath test, you will be using XPath to take a closer look at individual characters, and specifically at Hamlet, in Shakespeare’s Hamlet. We recommend working in the XPath/XQuery Builder view instead of the small XPath toolbar at the top of your <oXygen/> window because some of these expressions can get long, and it is easier to stay organized if you can see the entirety of your XPath expression as you are typing. Feel free to consult other resources, such as notes and information published on the Internet, as you are completing the test, but you are not allowed to ask another person for help. Please submit your answers (the full XPath expressions, not just what is returned) in a well-formatted markdown document.

Using the version of Hamlet that we have been using for all the previous XPath assignments (Bad Hamlet), complete the following tasks. Some of them may have multiple solutions.

Required

  1. Write an XPath expression that selects all the lines (<l> or <ab> elements) in which Hamlet mentions Ophelia by name. That is, your expression should select the <l> or <ab> elements that are spoken by Hamlet and contain the string Ophelia.

    Solution
    //(l | ab)[ancestor::sp[speaker eq 'Hamlet'] and contains(.,'Ophelia')]

    We begin by collecting all the <l> and all the <ab> elements in the entire document since we know that spoken lines in the play can be of either of these types, and we don't know in advance whether Hamlet will mention Ophelia in one, the other, or both of these types of line. We use the union operator (|) in //(l | ab) to create a sequence of all of the elements of those types. We then filter those nodes to select a subsequence that contains only the lines that meet our requirements, that is, those that have an associated <speaker> value equal to the string Hamlet and contain the string Ophelia.

    To find the name of the speaker we look up the ancestor axis to find the <sp> ancestor of each line and then compare its <speaker> child element to the string value Hamlet. We take this approach instead of looking for preceding-sibling <speaker> elements because some lines in the play are inside line groups (<lg> elements), and the speaker of any of those lines is not a sibling of the lines because the <lg> creates an intervening level in the hierarchy. As it happens, none of the lines we care about are inside line groups, so you’ll get the correct result if you just look at preceding-sibling <speaker> elements, but that’s a brittle answer that makes an assumption that you don’t have to make. Once we’ve found the lines that have Hamlet as a speaker, our predicate further filters the lines to select only those that contain the string Ophelia. The task was to select <l> and <ab> elements, and not to select the <sp> elements that contain them, so although we look at the <sp> containers to do some of our filtering, our expression returns a sequence of <l> and <ab> elements. We chose to find the lines spoken by Hamlet by using the string value inside the associated <speaker> element, but you could, alternatively, use the @who attribute with an expression like ancestor::sp[@who eq 'Hamlet'] for the same purpose. The same option is available for any following expressions that require you to select lines after filtering by speaker.
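    The difference between the two axes can be tried out on a small fragment. What follows is a minimal sketch, assuming an XPath 3.1 processor (such as the one behind the <oXygen/> XPath/XQuery Builder); the XML passed to parse-xml() is made-up sample data, not text from Bad Hamlet.

    ```xpath
    (: Sample data (hypothetical): one speech with a line that is a child of <sp>
       and a second line that is nested inside a line group. :)
    let $doc := parse-xml('<sp>
        <speaker>Hamlet</speaker>
        <l>a line that is a child of sp</l>
        <lg>
          <l>a line inside a line group</l>
        </lg>
      </sp>')
    return (
      (: finds only the line that is a sibling of <speaker> :)
      count($doc//l[preceding-sibling::speaker eq 'Hamlet']),
      (: finds both lines, whatever their depth inside <sp> :)
      count($doc//l[ancestor::sp[speaker eq 'Hamlet']])
    )
    (: returns 1, 2 :)
    ```

    The preceding-sibling version silently misses the line inside <lg>, which is why we prefer the ancestor axis.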

    The contains() function takes as its first argument the location where you look for the string of interest (we refer to the place where you look as the haystack), and it takes as its second argument the string you are looking for inside the haystack (the needle). We want to look for the string Ophelia in the current context (represented in XPath by a dot), which is a line, and what is returned by the full XPath expression is a sequence of three <l> elements, which are the three lines in the play where Hamlet utters the word Ophelia.

    The solution above is brittle because it will yield a false result on a hypothetical line like:

    <ab>Shall I lie in your lap?
        <stage>Lying down at Ophelia's feet.</stage></ab>

    In the hypothetical line above, the <ab> contains the substring Ophelia inside a stage direction, but not inside spoken text. Since the spoken text is all inside text-node children of lines, we can avoid spurious matches on strings inside stage-direction children of lines by matching only inside text-node children of lines, and not inside the entire line. As it happens, there are no lines like the hypothetical one above, so if you applied contains() directly to <l> and <ab> elements, as in our solution above, you would get the correct results. But a more robust solution would apply contains() not to the line, but to its text-node children, as in:

    //(l | ab)[ancestor::sp[speaker eq 'Hamlet'] and text()[contains(., 'Ophelia')]]

    The new last part selects all of the text-node children of the current line and filters them to keep only those that contain the substring Ophelia.
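    To see how testing the whole line differs from testing only its text-node children, we can run both predicates against a small fragment. This is a sketch under stated assumptions: it needs an XPath 3.1 processor, and the XML passed to parse-xml() is made-up sample data built around the hypothetical line above, not an excerpt from Bad Hamlet.

    ```xpath
    (: Sample data (hypothetical): the first line mentions Ophelia only in a
       stage direction; the second mentions her in the spoken text. :)
    let $doc := parse-xml("<sp>
        <speaker>Hamlet</speaker>
        <ab>Shall I lie in your lap? <stage>Lying down at Ophelia's feet.</stage></ab>
        <l>The fair Ophelia!</l>
      </sp>")
    return (
      (: whole-line test: also matches the stage direction :)
      count($doc//(l | ab)[ancestor::sp[speaker eq 'Hamlet'] and contains(., 'Ophelia')]),
      (: text-node test: matches spoken text only :)
      count($doc//(l | ab)[ancestor::sp[speaker eq 'Hamlet'] and text()[contains(., 'Ophelia')]])
    )
    (: returns 2, 1 :)
    ```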

  2. What is Hamlet’s last line in the play? Your expression should select only the last line that Hamlet utters, and not his entire final speech. The last line could be either an <l> or an <ab> element, and your XPath expression should find it without knowing in advance whether it is of type <l> or of type <ab>.

    Solution
    (//(l | ab)[ancestor::sp[speaker eq 'Hamlet']])[last()]

    As in task 1, we collect all <l> and <ab> elements and filter them the same way to keep only those spoken by Hamlet. We wrap the entire expression we have so far in parentheses to flatten the lines into a single sequence of all the <l> and <ab> elements that represent lines spoken by Hamlet. We then filter that flattened sequence with a predicate containing the last() function. The last() function returns the number of items in the sequence being filtered, and a numeric predicate keeps the item at that position, so [last()] keeps only the final item, which is the last line spoken by Hamlet in the play. You need to wrap the entire sequence in parentheses before applying the positional predicate for reasons explained in our answer to XPath exercise 2.
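    The effect of the parentheses can be checked on a small fragment. The following is a minimal sketch, assuming an XPath 3.1 processor; the two speeches passed to parse-xml() are made-up sample data.

    ```xpath
    (: Sample data (hypothetical): two speeches with two lines each. :)
    let $doc := parse-xml('<play>
        <sp><l>a</l><l>b</l></sp>
        <sp><l>c</l><l>d</l></sp>
      </play>')
    return (
      (: without parentheses: the last <l> child of each parent, here b and d :)
      count($doc//l[last()]),
      (: with parentheses: the last <l> in the whole document, here d :)
      count(($doc//l)[last()])
    )
    (: returns 2, 1 :)
    ```

    Without the parentheses the predicate is evaluated separately for each parent, so we would get the last line of every speech rather than one line for the whole play.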

  3. How many times is the word madness uttered in Hamlet (by any character)? For this task you need to find and count all instances of madness together, regardless of whether it is spelled Madness or madness.

    Solutions
    //(l | ab)[contains(lower-case(.), 'madness')] => count()
    //(l | ab)[contains(., 'madness') or contains(., 'Madness')] => count()

    Both expressions collect all the <l> and <ab> elements, filter them so that only the ones containing madness or Madness are returned, and count how many of them there are. We begin the path the same way we did for tasks 1 and 2, by locating all the <l> and <ab> element nodes. Using a predicate with the contains() function, we then filter them to keep only the ones that contain the word madness. In order to find all instances of madness regardless of whether it starts with an uppercase M or a lowercase m, we can either lowercase the current context with the lower-case() function and then match against a lower-case needle, as in the first solution, or we can use two contains() functions, one for each spelling of madness, and connect them with or, as in the second solution. We then use the arrow operator and count() to count the number of instances we've found. We returned 22 lines that contain the word madness.

    This solution gets the correct result, and it gets full credit for this test, but it’s brittle in two ways that we would want to address in Real Life. First, some lines have embedded stage directions, and if the word madness happened to occur inside a stage direction but not in the speech of that line, we would incorrectly count it as an utterance. That situation happens not to arise, and perhaps it’s unlikely, but we didn’t have to assume it, so it would be better not to. Second, the task was to count the number of times madness was uttered, but we’re counting instead the number of lines that contain the word. The result will be the same only if madness is never uttered twice in the same line, and that happens to be the case, but that’s an accident and an unnecessary assumption. While finding the word madness inside a stage direction might be unlikely, it would not be surprising if someone talking about madness might mention the word twice in the same line.

    Since the only children of lines are text nodes and stage directions, excluding embedded stage directions from the words that are considered part of the utterance is the easier of the two tasks: //(l|ab)/text() selects not the entire line (including a possible embedded stage direction), but only its text-node children, that is only the spoken text.

    So far, so good. But counting the number of times madness occurs in each of those lines to ensure that if it occurs twice we’ll increment our count by 2 is harder. We can tokenize on whitespace to split the speech into words, filter those to keep the words equal to madness, and count the surviving tokens. But we can’t just tokenize on whitespace and check for equality to the string madness because, for example, when we tokenize the following line on whitespace:

    And draw you into madness? think of it:

    the token will come out as madness?, with the question mark, and that isn’t equal to the string madness without the question mark. The easiest way to deal with this is to use contains() instead of testing for equality, but contains() runs the risk of matching a substring, and when we look at an English-language word list we find madnesses and premadness. Should we count those? In Real Life we might be content to include them as examples of uttering madness, and in that case our expression could be:

    //(l|ab)/text() ! tokenize(.)[contains(lower-case(.), 'madness')] => count()

    Here we tokenize the text-node children of our lines on whitespace and then filter the individual word tokens, instead of the entire lines or entire text nodes, according to whether they contain the substring madness.
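    The reason for preferring contains() over an equality test here can be seen by tokenizing the line quoted above as a plain string. This is a minimal sketch, assuming an XPath 3.1 processor (the one-argument form of tokenize() is not available in earlier versions); the variable name $line is ours.

    ```xpath
    let $line := 'And draw you into madness? think of it:'
    return (
      (: equality fails because the token is 'madness?', with the question mark :)
      tokenize($line)[. eq 'madness'] => count(),
      (: contains() succeeds because it looks for a substring of the token :)
      tokenize($line)[contains(lower-case(.), 'madness')] => count()
    )
    (: returns 0, 1 :)
    ```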

    If we’re sufficiently obsessive to want not to match madnesses, we can insert a replace() operation to strip non-letters from our word tokens and then test for equality, instead of using contains(). The regular expression pattern \P{L}, below, matches anything that is not a letter (see https://www.regular-expressions.info/unicode.html):

    //(l|ab)/text()
    ! tokenize(.)
    ! replace(., '\P{L}', '')
    [lower-case(.) eq 'madness'] 
    => count()
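    To compare the substring test with the stricter equality test, we can apply both to a single string. This is a sketch, assuming an XPath 3.1 processor; the sentence in $text is invented for illustration and does not occur in the play.

    ```xpath
    (: Sample data (hypothetical): two real instances of the word, plus
       madnesses and premadness, which merely contain it. :)
    let $text := 'Madness! Such madnesses breed more madness, and premadness too.'
    return (
      (: substring test: counts Madness!, madnesses, madness, and premadness :)
      tokenize($text)[contains(lower-case(.), 'madness')] => count(),
      (: strip non-letters, then test for equality: counts only the two exact words :)
      tokenize($text) ! replace(., '\P{L}', '')[lower-case(.) eq 'madness'] => count()
    )
    (: returns 4, 2 :)
    ```

    Because each token is counted separately, a line that utters the word twice contributes 2 to the total, which is what the task asks for.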
    
  4. Which characters utter the word madness (regardless of capitalization, as in the question above) in the play? Your XPath expression should return a sequence of the names of the characters (in the <speaker> element of the speech where the word appears) without duplicates.

    Solutions
    //(l | ab)[contains(lower-case(.), 'madness')]/ancestor::sp/speaker => distinct-values()
    //(l | ab)[contains(., 'madness') or contains(., 'Madness')]/ancestor::sp/speaker => distinct-values()

    If we remove the count() function from task 3, we get a list of all the <l> and <ab> element nodes that contain the word madness (there are 22). Since what we want for this task is the character who uttered madness instead of the line in which it was uttered, we need to climb back up to the speech and then grab the <speaker> node that is always the first child of <sp>. There is no axis we can use to directly get to the <speaker> node because <l> and <ab> are sometimes children of <sp>, making them siblings of <speaker>, and sometimes they are grandchildren of <sp>. Because there is no one axis that accounts for both of these potential relationships between <speaker> nodes and <l> and <ab>, we have to add two steps to the path to return the desired speakers. We first use the ancestor axis to go back up the tree and locate all the <sp> nodes that are ancestors (in this case, either parents or grandparents) of the <l> and <ab> elements that contain the string madness. We then use the (default) child axis to grab the <speaker> node that is a child of <sp>. We now have a sequence of 18 character names, with duplicates. This number differs from the number of times madness is uttered in an <l> or <ab> because some characters utter the word madness more than once in a given <sp>. The last step is to pass the resulting sequence of 18 speaker names through the distinct-values() function to return a sequence of 7 characters without duplicates.
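    The way the counts shrink at each stage (lines, then speaker nodes, then distinct names) can be reproduced on a small fragment. This is a minimal sketch, assuming an XPath 3.1 processor; the speeches passed to parse-xml() are made-up sample data, and the attribution of lines to speakers is invented.

    ```xpath
    (: Sample data (hypothetical): the first speech has two matching lines,
       the second has a matching line inside a line group, and the third
       is a second speech by the same speaker as the first. :)
    let $doc := parse-xml('<play>
        <sp><speaker>Hamlet</speaker><l>madness once</l><l>madness twice</l></sp>
        <sp><speaker>Ophelia</speaker><lg><l>Madness in a line group</l></lg></sp>
        <sp><speaker>Hamlet</speaker><ab>prose madness</ab></sp>
      </play>'),
      $lines := $doc//(l | ab)[contains(lower-case(.), 'madness')]
    return (
      (: four matching lines :)
      count($lines),
      (: three <speaker> nodes: two lines share one speech, so its speaker is returned once :)
      count($lines/ancestor::sp/speaker),
      (: two distinct names :)
      count($lines/ancestor::sp/speaker => distinct-values())
    )
    (: returns 4, 3, 2 :)
    ```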

Bonus

  1. Alphabetize the list of characters who utter the word madness. We used the sort() function, which is new in XPath 3.1, so it isn’t in Michael Kay’s book (which was written when XPath 2.0 was the latest version), but it is described at some of the other resources listed in the XPath section of our main course page.

    Solutions
    //(l | ab)[contains(lower-case(.), 'madness')]/ancestor::sp/speaker => distinct-values() => sort()
    //(l | ab)[contains(., 'madness') or contains(., 'Madness')]/ancestor::sp/speaker => distinct-values() => sort()
    • Since the expression we already have written for the fourth required task returns a sequence of the 7 characters that utter the word madness, all we need to do to get the desired result for this first bonus task is sort them alphabetically. sort() is a new XPath 3.1 function that, when given a sequence of strings as input, will sort them alphabetically. Therefore, we can pass the 7 character names (that result from the XPath in required task 4) through the sort() function (we use the arrow operator to do this) and return a sequence of the same 7 names, but now in alphabetic order.
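    The chain of distinct-values() and sort() can be tried on a literal sequence of strings. This is a small sketch, assuming an XPath 3.1 processor; the names are placeholders and are not the actual result of the task.

    ```xpath
    (: Placeholder names (hypothetical), with one duplicate. :)
    ('Polonius', 'Hamlet', 'Ophelia', 'Hamlet') => distinct-values() => sort()
    (: returns "Hamlet", "Ophelia", "Polonius" :)
    ```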
  2. How many of Hamlet's sentences are questions? You can’t just count the lines (<l> or <ab> elements) that contain question marks because some lines may contain more than one question, e.g.:

    Ah, ha, boy! say'st thou so? art thou there, truepenny?

    We came up with four ways to approach this task: with tokenize() (two methods), with string-to-codepoints() (which is easiest to use if you combine it with codepoints-to-string()), and with analyze-string(). As we’ll see below, some of Hamlet’s lines contain quoted questions, that is, Hamlet is not asking a question; he is instead reporting a question that someone else asked. Because of that detail the four methods do not all yield the same result. Which answer should count as correct, then, depends on whether we want to include reported questions when we determine How many of Hamlet's sentences are questions?

    Solutions
    • The solutions below all begin with //(l | ab)[ancestor::sp[speaker eq 'Hamlet']], which collects all <l> and <ab> in the entire document, the spoken lines of the play, and filters them to keep only the ones that have an ancestor <sp> element with a <speaker> child equal to the string value Hamlet. The current context therefore becomes all lines in the entire play delivered in speeches by Hamlet. This is what we want because we are trying to determine how many of Hamlet's sentences, specifically, are questions. (As an alternative to the <speaker> element you could have used the @who attribute to identify Hamlet’s speeches.)

      Because a line in Hamlet does not equate to exactly one full sentence with a capital letter starting it out and a punctuation mark ending it, we cannot rely on the count of lines that have a question mark somewhere in them to be the count of Hamlet's questions. Some lines contain more than one question, which means there is more than one question mark, so if we only counted the lines, we would return a value less than the actual count of question marks, and therefore less than the true count of questions. To account for this, our solutions break the lines into smaller units using tokenize(), string-to-codepoints(), or analyze-string().

    //(l | ab)[ancestor::sp[speaker eq 'Hamlet']] ! tokenize(., '\s+')[ends-with(., '?')] => count()
    //(l | ab)[ancestor::sp[speaker eq 'Hamlet']] ! tokenize(., '\s+')[contains(., '?')] => count()
    //(l | ab)[ancestor::sp[speaker eq 'Hamlet']] ! string-to-codepoints(.) ! codepoints-to-string(.)[. eq '?'] => count()
    //(l | ab)[ancestor::sp[speaker eq 'Hamlet']] ! analyze-string(., '.')/*[. eq '?'] => count()
    1. When we approach this task with tokenize() and ends-with(), we return a count of 222 question marks. This expression first collects all the lines spoken by Hamlet and then, for each one, tokenizes it on whitespace into single word tokens. (Specifying '\s+' as the second argument of tokenize() splits on runs of one or more whitespace characters; the one-argument form, tokenize(.), also splits on whitespace, after first normalizing it.) Because we split only on whitespace, punctuation marks are left attached to the token they follow, and we can use this knowledge to grab all the tokens that end with a question mark. We do so by filtering the tokenized lines (word tokens) using the ends-with() function to return only the tokens that end with ?. Finally, we use the arrow operator to pass the sequence of tokens ending in question marks through the count() function.

    2. When we approach this task with tokenize() and contains(), we return a count of 224 question marks. We tokenize Hamlet's lines the same way we did in the first solution, but instead of using ends-with() to filter the tokens, we use the contains() function. The contains() function in the predicate looks for instances of question marks in the current context, which is each of the individual word tokens of Hamlet's lines. It returns a list of words that contain a question mark, and we can then use the arrow operator to pass these words through the count() function to return the number of Hamlet-uttered questions. This method selects two more tokens than the preceding method, a difference that we discuss below.

    3. When we approach this task with string-to-codepoints() (and codepoints-to-string()), we return a count of 224 question marks. This approach returns a count of the individual question mark characters, instead of the count of words that end with (first method) or contain (second method) question marks. When we use string-to-codepoints() on the current context, the lines spoken by Hamlet, we break all the lines into a sequence of codepoints, or integer values that correspond to the decimal form of the Unicode codepoint of each character in the lines. The resulting list items are not intelligible to a human, but the computer knows that each one of the integers represents a specific character. The character of interest in this task is a question mark, so we want to filter the results to keep only the numerical values that correspond to ?. To do so, we next use codepoints-to-string() to turn each of the integers into a one-character string and filter them to keep only the one-character strings that are equal to the string ?. What we return is a list of question marks, one on each result line. Using the arrow operator, we pass these question marks through the count() function to return the total number of question marks in any lines spoken by Hamlet.

    4. The roundabout method described above was the most natural way to explode a string into individual characters before XPath 3.0; we had to explode the string into numerical values and then convert those to characters because there was no way to explode a string directly into a sequence of one-character strings. The advent of analyze-string() in XPath 3.0 provides a more direct method.

      The analyze-string() function takes two arguments, the string being analyzed (here a line of speech from the play) and a regular expression. We specify a dot as our regular expression, and since a dot matches any single character except a newline, it has the effect of matching each character of the first argument individually. The results are returned as a specific structure with specific element names in a specific namespace, but we don’t need to know any of that for this task; we just need to filter the results to keep any individual character that is equal to a question mark. When we count those, we again find 224.
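    The two character-level approaches can be tried on a short literal string before applying them to whole lines. This is a minimal sketch, assuming an XPath 3.1 processor; the string 'so?' is just a convenient sample.

    ```xpath
    (
      (: the decimal Unicode codepoints of s, o, and ? :)
      string-to-codepoints('so?'),
      (: converting each codepoint back yields one-character strings :)
      string-to-codepoints('so?') ! codepoints-to-string(.),
      (: analyze-string() with '.' wraps each character in its own child element;
         string() extracts the character from each :)
      analyze-string('so?', '.')/* ! string(.)
    )
    (: returns 115, 111, 63, "s", "o", "?", "s", "o", "?" :)
    ```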

    It turns out that the first approach, using tokenize() and ends-with(), returns the result that is closest to what we want. This expression does not count the question in the following line as a question by Hamlet because the token lord?” does not end with a question mark:

    <ab xml:id="sha-ham501084" n="84">sweet lord! How dost thou, sweet lord?” This</ab>

    Here Hamlet quotes someone else asking a question, but does not actually ask a question himself. Using ends-with() to filter the whitespace-tokenized words ensures that the tokens we count (and thus the sentences we count by proxy) truly end with a question mark, and not a quotation mark, signifying that they are questions asked by Hamlet himself, and not quoted by Hamlet from the speech of others. We conclude that 222 of Hamlet's sentences are questions.
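    The difference among the four methods can be reproduced on just the two lines quoted in this task, treated as plain strings. This is a sketch, assuming an XPath 3.1 processor; the variable name $lines is ours, and the strings stand in for the <l> and <ab> elements that the real expressions select.

    ```xpath
    let $lines := (
      "Ah, ha, boy! say'st thou so? art thou there, truepenny?",
      "sweet lord! How dost thou, sweet lord?” This"
    )
    return (
      (: 1. tokens ending in ?: so? and truepenny?, but not lord?” :)
      $lines ! tokenize(., '\s+')[ends-with(., '?')] => count(),
      (: 2. tokens containing ?: the same two, plus lord?” :)
      $lines ! tokenize(., '\s+')[contains(., '?')] => count(),
      (: 3. individual ? characters, by way of codepoints :)
      $lines ! string-to-codepoints(.) ! codepoints-to-string(.)[. eq '?'] => count(),
      (: 4. individual ? characters, by way of analyze-string() :)
      $lines ! analyze-string(., '.')/*[. eq '?'] => count()
    )
    (: returns 2, 3, 3, 3 :)
    ```

    Only the first method leaves out the quoted question, which mirrors the difference between 222 and 224 in the full play.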

  3. Write an XPath expression that selects all the lines (<l> or <ab> elements) that contain hyphenated words, but excludes any line whose only hyphenated words are inside a <stage> element and not in the spoken text.

    As it happens, there is no line that contains both a hyphenated spoken word and a hyphenated word in an embedded stage direction, but you can’t be certain of that in advance, so your XPath expression should also match a hypothetical line like:

    <l>The king doth wake to-night<stage>Re-enter</stage></l>

    You want to match this line because although it contains a hyphenated word in a stage direction, that is not the only hyphen in the line; there is also a hyphen in the spoken word to-night.

    Solution
    //(l | ab)[text()[contains(., '-')]]

    The expression //(l | ab)[contains(., '-')] selects all the lines in Hamlet that contain hyphenated words (there are 118 of them). However, in some of these lines the only hyphenated word is inside a child <stage> element, and the spoken text contains none. To avoid matching on hyphenated words inside stage directions we filter on the text-node children of the line, instead of on the line itself. The spoken words are inside text-node children of the line, but words inside stage directions are grandchildren of lines, so we won’t be matching on those. This XPath expression selects a sequence of 112 lines that contain at least one hyphenated word as part of the spoken text of the line.
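    Both predicates can be compared on a small fragment that includes the hypothetical line from the task. This is a minimal sketch, assuming an XPath 3.1 processor; the XML passed to parse-xml() is made-up sample data, not an excerpt from Bad Hamlet.

    ```xpath
    (: Sample data (hypothetical): a hyphen in both speech and stage direction,
       a hyphen only in the stage direction, and no hyphen at all. :)
    let $doc := parse-xml('<sp>
        <l>The king doth wake to-night<stage>Re-enter</stage></l>
        <l>Look, my lord<stage>Re-enter Ghost</stage></l>
        <l>A plain line</l>
      </sp>')
    return (
      (: whole-line test: matches the first two lines :)
      count($doc//(l | ab)[contains(., '-')]),
      (: text-node test: matches only the first line :)
      count($doc//(l | ab)[text()[contains(., '-')]])
    )
    (: returns 2, 1 :)
    ```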