Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2018-03-19T23:14:51+0000


Test #4: XPath (answers)

Using Bad Hamlet, provide an XPath expression that retrieves:

  1. All speeches by Ophelia that contain Hamlet’s name. Requires, at least in our solution, contains(). (There are two such speeches.)

    //sp[@who = "Ophelia"][contains(.,"Hamlet")]

    We find all speeches, filter them to keep just the ones spoken by Ophelia, and then filter those to find just the ones that contain, anywhere inside, the string Hamlet. The predicates can be in either order, and you can also combine them into a single predicate.
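The claim that the predicates can go in either order, or be merged into one, can be checked with a small Python sketch. The sample XML below is invented for illustration (it is not the Bad Hamlet file, and the lines are not quotations), and text_of() is a helper name of our own. The merged XPath form would be //sp[@who = "Ophelia" and contains(., "Hamlet")].

```python
import xml.etree.ElementTree as ET

# Invented miniature document; the real play is far larger.
SAMPLE = """
<body>
  <sp who="Ophelia"><speaker>Ophelia</speaker><l>Lord Hamlet came to me.</l></sp>
  <sp who="Ophelia"><speaker>Ophelia</speaker><l>I do not know.</l></sp>
  <sp who="Polonius"><speaker>Polonius</speaker><l>Hamlet is a prince.</l></sp>
</body>
"""

root = ET.fromstring(SAMPLE)


def text_of(element):
    """Concatenate all descendant text, like the string value of an XPath node."""
    return "".join(element.itertext())


speeches = list(root.iter("sp"))

# Two chained filters, mirroring //sp[@who = "Ophelia"][contains(., "Hamlet")]
chained = [sp for sp in speeches if sp.get("who") == "Ophelia"]
chained = [sp for sp in chained if "Hamlet" in text_of(sp)]

# The same two filters in the opposite order
swapped = [sp for sp in speeches if "Hamlet" in text_of(sp)]
swapped = [sp for sp in swapped if sp.get("who") == "Ophelia"]

# One combined test, mirroring a single predicate joined with "and"
combined = [sp for sp in speeches
            if sp.get("who") == "Ophelia" and "Hamlet" in text_of(sp)]

print(len(chained), chained == swapped == combined)  # prints: 1 True
```

All three variants select the same single speech in the sample, because both conditions have to hold no matter which one is tested first.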

  2. A semicolon-separated list of all speakers (<speaker>) in Act IV, without duplicates. Requires, at least in our solution, string-join() and distinct-values().

    string-join(distinct-values(//body/div[4]//speaker),'; ')

    Working from the inside out, we find all acts (<div> children of <body>) and filter them to keep just the fourth one. Then we find all descendant <speaker> elements of that fourth act (descendants, not children, since the speakers are nested several levels below the acts). We wrap that in the distinct-values() function to get rid of the duplicates. Finally, we wrap all of that in the string-join() function to fuse the individual speaker names into a single list, with a semicolon plus a space between names.
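The same inside-out steps can be sketched in Python. The sample document is invented, and dict.fromkeys() is our stand-in for distinct-values(): it keeps the first occurrence of each name in order, whereas the XPath specification leaves the order returned by distinct-values() implementation-dependent.

```python
import xml.etree.ElementTree as ET

# Invented skeleton: four acts, the fourth with a nested scene <div>.
SAMPLE = """
<body>
  <div><sp><speaker>Bernardo</speaker></sp></div>
  <div><sp><speaker>Polonius</speaker></sp></div>
  <div><sp><speaker>King</speaker></sp></div>
  <div>
    <div>
      <sp><speaker>King</speaker></sp>
      <sp><speaker>Queen</speaker></sp>
      <sp><speaker>King</speaker></sp>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

acts = root.findall("div")   # <div> children of <body>, like //body/div
act_four = acts[3]           # XPath's [4] counts from 1; Python counts from 0

# Descendant speakers at any depth, like div[4]//speaker
names = [speaker.text for speaker in act_four.iter("speaker")]

distinct = list(dict.fromkeys(names))   # drop duplicates, keep first-seen order
result = "; ".join(distinct)            # like string-join(..., '; ')

print(result)  # prints: King; Queen
```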

  3. The number of speeches (<sp>) in each act (//body/div). Our solution requires count(). (The counts you should find are 251 for Act 1, 201 for Act 2, 249 for Act 3, 179 for Act 4, and 257 for Act 5.)

    //body/div/count(descendant::sp) or //body/div/count(.//sp)

    We start with //body/div, which retrieves a sequence of the five <div> children of <body>, that is, the five acts. For each of those five acts we then get a count of its <sp> descendants. The hard part with the second version is that the dot is necessary; if you omit it and write //body/div/count(//sp), you count all of the <sp> elements that are descendants of the document node, not of the act you’re processing at the moment, so you’d wind up counting all of the speeches in the entire play each time. That means that you’d get the same number for each of the five acts, and it would be wrong for all of them. The dot means “start this path from the current context,” and since the current context is the item selected by the preceding path step, for each of the five acts you look only at the <sp> descendants of that individual act.
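The difference between counting from the current act and counting from the top of the document can be reproduced in a few lines of Python. This is only a sketch over an invented three-act skeleton; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# Invented skeleton with 3, 1, and 2 speeches in its three acts.
SAMPLE = """
<body>
  <div><sp/><sp/><sp/></div>
  <div><sp/></div>
  <div><div><sp/><sp/></div></div>
</body>
"""

root = ET.fromstring(SAMPLE)
acts = root.findall("div")

# Search relative to each act, like //body/div/count(.//sp)
per_act = [len(act.findall(".//sp")) for act in acts]

# Search from the top every time, like //body/div/count(//sp):
# the act in the loop is ignored, so every act gets the grand total.
from_top = [len(root.findall(".//sp")) for act in acts]

print(per_act)   # prints: [3, 1, 2]
print(from_top)  # prints: [6, 6, 6]
```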

  4. The speaker (<speaker>) of all speeches (<sp>) longer than 4000 characters. Requires, at least in our solution, string-length(). (There are two such speeches, one by Hamlet and one by Ghost.)

    //sp[string-length() gt 4000]/speaker

    We start by finding all of the speeches and then filter them by checking their string length and comparing that value to 4000, and we keep only the ones that are longer than 4000 characters. This isn’t how we’d do this in Real Life because our character count includes all textual characters anywhere inside the speech, which means stage directions, speaker names, and the extra space characters and end-of-line characters used to pretty-print the document. There are ways to count just the characters that are part of spoken text, and to get rid of extraneous white space, but we don’t bother with that here. Once we have just the qualifying speeches, one more path step gets the <speaker> child of each of those speeches in turn.
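To see how much the pretty-printing and the non-spoken elements inflate the count, here is a Python sketch over one invented, pretty-printed speech. The three measurements are our own illustration of the caveat above, not the method used in the answer.

```python
import xml.etree.ElementTree as ET

# One invented speech, indented the way a pretty-printed file would be.
SAMPLE = """<sp>
    <speaker>Ghost</speaker>
    <l>Mark me.</l>
    <stage>Exit</stage>
</sp>"""

sp = ET.fromstring(SAMPLE)

# Everything inside the speech, like string-length() on the <sp> itself
raw = "".join(sp.itertext())
raw_length = len(raw)

# The same text with runs of white space collapsed and the ends trimmed
normalized_length = len(" ".join(raw.split()))

# Only the text of the verse lines, also with white space collapsed
spoken = " ".join("".join(line.itertext()) for line in sp.iter("l"))
spoken_length = len(" ".join(spoken.split()))

print(normalized_length, spoken_length)  # prints: 19 8
```

In this sample the raw length is larger than both of the other figures, because it also counts the indentation and the line breaks.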

    1. The number of lines (<l> elements) in each speech (<sp> element).

      //sp/count(descendant::l)

    2. The number of lines in the longest speech.

      max(//sp/count(descendant::l))

    3. The longest speech (<sp>) itself.

      //sp[count(descendant::l) eq max(//sp/count(descendant::l))]

    4. The <head> of the scene that contains that speech.

      //sp[count(descendant::l) eq max(//sp/count(descendant::l))]/ancestor::div[1]/head

      or //sp[count(descendant::l) eq max(//sp/count(descendant::l))]/preceding-sibling::head
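The four steps above can be traced in Python as a rough sketch. The sample document is invented, and because ElementTree elements do not know their parents, we build a parent_of lookup and a nearest_div() helper (both names are ours) to stand in for ancestor::div[1].

```python
import xml.etree.ElementTree as ET

# Invented act with two scenes; speeches have 2, 4, and 1 lines.
SAMPLE = """
<body>
  <div>
    <div>
      <head>Scene A</head>
      <sp><speaker>X</speaker><l/><l/></sp>
    </div>
    <div>
      <head>Scene B</head>
      <sp><speaker>Y</speaker><l/><l/><l/><l/></sp>
      <sp><speaker>X</speaker><l/></sp>
    </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
speeches = list(root.iter("sp"))

# Step 1: lines per speech, like //sp/count(descendant::l)
line_counts = [len(sp.findall(".//l")) for sp in speeches]

# Step 2: the largest of those counts, like max(...)
longest_count = max(line_counts)

# Step 3: the speech or speeches whose count equals that maximum
longest = [sp for sp, n in zip(speeches, line_counts) if n == longest_count]

# Step 4: climb to the nearest enclosing <div> and read its <head>
parent_of = {child: parent for parent in root.iter() for child in parent}


def nearest_div(element):
    """Return the closest <div> ancestor, or None if there is none."""
    node = parent_of.get(element)
    while node is not None and node.tag != "div":
        node = parent_of.get(node)
    return node


heads = [nearest_div(sp).find("head").text for sp in longest]

print(line_counts, longest_count, heads)  # prints: [2, 4, 1] 4 ['Scene B']
```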

  5. Bonus

    How can you use XPath to get the semi-spurious Rosencrantz and Guildenstern out of the answer to #2? Your answer should cater to the following possibilities:

    string-join(distinct-values(tokenize(replace(string-join(//body/div[4]//speaker, " "), " and |, ", " "), " ")), ", ")

    To break this down from the inside out, we started with our original piece of XPath:

    //body/div[4]//speaker

    This returns a sequence of all of the <speaker> elements in the act, which contain the names of individual speakers, but also, as a unitary string in the sequence, Rosencrantz and Guildenstern, and also, potentially, things like King, Gertrude, Hamlet, and Servant. We could strip out the commas and the conjunction “and” separately from each speaker by adding a path step:

    //body/div[4]//speaker/replace(.,' and |,', ' ')

    The replace() function takes three arguments: the string inside which we’re performing the replacement (here it’s the individual <speaker> element), a regex that matches the substring to replace, and a string that serves as the replacement. Our regex uses the pipe (|) connector, also called the “or” connector, to match either “ and ” (with leading and trailing spaces) or a comma, and it replaces each match with a single space character. If we just stripped out the matches, we’d wind up with RosencrantzGuildenstern; by replacing “ and ” with a space, instead of just deleting it, we avoid creating this unwanted value.
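The effect of the replacement string is easy to try out with Python's re.sub(), which we use here as a stand-in for the XPath replace() function. The two speaker strings are the ones mentioned above.

```python
import re

pattern = r" and |,"   # " and " with its spaces, or a comma

speaker = "Rosencrantz and Guildenstern"

# Deleting the match fuses the two names together.
deleted = re.sub(pattern, "", speaker)

# Replacing the match with a space keeps them apart.
spaced = re.sub(pattern, " ", speaker)

print(deleted)  # prints: RosencrantzGuildenstern
print(spaced)   # prints: Rosencrantz Guildenstern

# A comma that is followed by a space leaves a doubled space behind,
# so the names are best split on runs of white space afterwards.
listed = re.sub(pattern, " ", "King, Gertrude, Hamlet, and Servant")
print(listed.split())  # prints: ['King', 'Gertrude', 'Hamlet', 'Servant']
```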

    What we do instead of running replace() over each <speaker> value separately, though, is run string-join() over all of the <speaker> values first, using a space character as the connector, so that we have to perform the replace() operation only once. First we do the string-join():

    string-join(//body/div[4]//speaker, " ")

    and then we wrap the replace() function around it:

    replace(string-join(//body/div[4]//speaker, " "), " and |, ", " ")

    At this point we have a single long string of white-space separated speakers, from which we’ve purged the commas and conjunctions. The next step is to split that one long string into a sequence of shorter individual strings, each one representing a name. We do that with the tokenize() function, splitting on white space:

    tokenize(replace(string-join(//body/div[4]//speaker, " "), " and |, ", " "), " ")

    Our list has duplicate values, which we can remove with the distinct-values() function:

    distinct-values(tokenize(replace(string-join(//body/div[4]//speaker, " "), " and |, ", " "), " "))

    All that’s left is to form the distinct individual speaker names into a comma-separated string, with a space after each comma, since that’s how lists like this are normally formatted. We use string-join() to do that:

    string-join(distinct-values(tokenize(replace(string-join(//body/div[4]//speaker, " "), " and |, ", " "), " ")), ", ")
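For readers who want to watch the intermediate values, here is the same sequence of steps as a Python sketch, one variable per step. The speakers list is invented sample input, and the Python calls are stand-ins for the XPath functions of the same purpose.

```python
import re

# Invented sample of <speaker> values, including compound ones.
speakers = ["King", "Rosencrantz and Guildenstern", "King, Gertrude", "Rosencrantz"]

# string-join(..., " "): one long space-separated string
joined = " ".join(speakers)

# replace(..., " and |, ", " "): purge the conjunction and the commas
cleaned = re.sub(r" and |, ", " ", joined)

# tokenize(..., " "): split on single spaces
tokens = cleaned.split(" ")

# distinct-values(...): drop duplicates (here keeping first-seen order)
distinct = list(dict.fromkeys(tokens))

# string-join(..., ", "): the final comma-separated list
result = ", ".join(distinct)

print(cleaned)  # prints: King Rosencrantz Guildenstern King Gertrude Rosencrantz
print(result)   # prints: King, Rosencrantz, Guildenstern, Gertrude
```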

    If you change your <oXygen/> preferences to show space characters as dots (which we recommend), this path goes from being completely unreadable to being merely hard to read. Code that’s hard to read is an opportunity for developer error. Can we do better?

    XPath 3.1 introduced a new arrow operator (=>), which pipes the output of one function into the input of another. The result is the same as with the deep nesting in the example above, but it’s more legible because we can read from left to right, instead of from the inside out. Here’s what the same process would look like with the arrow operator:

    //body/div[4]//speaker =>
    string-join(" ") =>
    replace(" and |,", " ") =>
    tokenize("\s+") =>
    distinct-values() =>
    string-join(", ")

    Notice how the lines are added in the order in which we wrapped new functions around old ones above. (The division into lines is for legibility; the XPath has the same meaning whether it’s all on one line or split over multiple lines.) When you use the arrow operator, you omit the first argument to the function on the right of the operator, since that is supplied automatically by the code on the left of the operator. This illustrates one limitation of the arrow operator syntax: the output of the operation on the left can be only the first argument to the function on the right. In most cases, including this one, that’s what we want. This notation has the same meaning as the deeply nested version above, but because we write it from left to right, we can build it up step by step more easily. Not only is it more legible, but it’s also easier to develop.
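Python has no arrow operator, but the idea can be imitated with a small helper, which also makes the first-argument rule visible. Everything here is a sketch of our own: pipe() and the four wrapper functions are hypothetical names written for this illustration, and the speakers list is invented.

```python
import re
from functools import reduce


def string_join(items, separator):
    return separator.join(items)


def replace(text, pattern, replacement):
    return re.sub(pattern, replacement, text)


def tokenize(text, pattern):
    return re.split(pattern, text)


def distinct_values(items):
    return list(dict.fromkeys(items))


def pipe(value, *steps):
    """Run value through steps left to right.

    Each step is a tuple (function, extra arguments...); the running
    value is always passed as the FIRST argument, as with =>.
    """
    return reduce(lambda acc, step: step[0](acc, *step[1:]), steps, value)


speakers = ["King", "Rosencrantz and Guildenstern", "King, Gertrude"]

piped = pipe(
    speakers,
    (string_join, " "),
    (replace, r" and |,", " "),
    (tokenize, r"\s+"),
    (distinct_values,),
    (string_join, ", "),
)

# The same computation written inside out, for comparison
nested = string_join(
    distinct_values(
        tokenize(replace(string_join(speakers, " "), r" and |,", " "), r"\s+")
    ),
    ", ",
)

print(piped)            # prints: King, Rosencrantz, Guildenstern, Gertrude
print(piped == nested)  # prints: True
```

Because pipe() can only supply the running value as the first argument, a step that needed it in second position would have to be wrapped in another function first, which is the same limitation described above for the arrow operator.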

    The chaining of functions in this way is called pipelining, and you can read more about it at Patterns and antipatterns in XSLT micropipelining.