Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-02-19T21:39:22+0000


Test #4: XPath (answers)

Using Bad Hamlet, provide an XPath expression that retrieves:

  1. All speeches by Gertrude that contain Hamlet’s name. Requires, at least in our solution, contains(). (There are fifteen such speeches.)

    //sp[@who = "Gertrude"][contains(.,"Hamlet")]

    We find all speeches, filter them to keep just the ones spoken by Gertrude, and then filter those to keep just the ones that contain the string “Hamlet” anywhere inside them. Because neither predicate depends on position, the predicates can be in either order, and you can also combine them into a single predicate with and.
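
    The two alternatives mentioned above might be written as follows. This is a sketch that assumes the same @who attribute values as our solution; both expressions should select the same speeches as the original.

    ```xpath
    (: the same two predicates, in the opposite order :)
    //sp[contains(., "Hamlet")][@who = "Gertrude"]

    (: both tests combined into a single predicate :)
    //sp[@who = "Gertrude" and contains(., "Hamlet")]
    ```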

  2. A comma-separated list of all speakers (<speaker>) in Act II, without duplicates. Requires, at least in our solution, string-join() and distinct-values().

    string-join(distinct-values(//body/div[2]//speaker),', ')

    Working from the inside out, we find all acts (<div> children of <body>) and filter them to keep just the second one. Then we find all <speaker> elements that are descendants of that second act (descendants rather than children, since the speakers are nested several levels below the act). We wrap that in the distinct-values() function to get rid of the duplicates. Finally, we wrap all of that in the string-join() function to fuse the individual speaker names into a single string, with a comma plus a space between names.
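
    The inside-out construction described above can be traced one layer at a time. Each expression below is complete and could be evaluated on its own; the comments are only a sketch of what each layer contributes.

    ```xpath
    (: the second act :)
    //body/div[2]

    (: every <speaker> element anywhere inside the second act :)
    //body/div[2]//speaker

    (: the same names as atomic values, each appearing once :)
    distinct-values(//body/div[2]//speaker)

    (: those values fused into one string, separated by comma plus space :)
    string-join(distinct-values(//body/div[2]//speaker), ', ')
    ```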

  3. The number of speeches (<sp>) in each act (//body/div). Our solution requires count(). (You should find 251 speeches in Act 1, 201 in Act 2, 249 in Act 3, 179 in Act 4, and 257 in Act 5.)

    //body/div/count(descendant::sp)

    or, equivalently:

    //body/div/count(.//sp)

    We start with //body/div, which retrieves a sequence of the five <div> children of <body>, that is, the five acts. For each of those five acts we then get a count of its <sp> descendants. The tricky part of the second version is that the dot is required. If you omit it and write //body/div/count(//sp), the path inside count() starts at the document node rather than at the act you’re processing at the moment, so you’d wind up counting all of the speeches in the entire play each time. That means that you’d get the same number for each of the five acts, and it would be wrong for all of them. The dot means “start this path from the current context.” Because count() is evaluated once for each item selected by the preceding path step, the current context is each act in turn, so for each of the five acts you look only at the <sp> descendants of that individual act.
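
    One way to make the per-act context explicit is a for expression, which names the act being processed instead of relying on the dot. This is an alternative sketch rather than the solution we had in mind; it assumes an XPath 2.0 or later processor, and $act is just an illustrative variable name.

    ```xpath
    (: visit each act in turn and count only the speeches inside that act :)
    for $act in //body/div
    return count($act//sp)
    ```

    Here $act//sp plays the same role as .//sp in our solution: the path starts from the individual act, not from the document node.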

  4. The speaker (<speaker>) of every speech (<sp>) that is exactly 200 characters long. Requires, at least in our solution, string-length(). (There are two such speeches, one by Hamlet and one by First Clown.)

    //sp[string-length() eq 200]/speaker

    We start by finding all of the speeches and then filter them by checking their string length, keeping only the ones that are exactly 200 characters long. This isn’t how we’d do this in Real Life because our character count includes all textual characters anywhere inside the speech, which means stage directions, speaker names, and the extra space characters and end-of-line characters used to pretty-print the document. There are ways to count just the characters that are part of spoken text, and to get rid of extraneous white space, but we don’t bother with that here. Once we have just the 200-character speeches, one more path step gets the <speaker> child element of each of those speeches in turn.
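
    As a sketch of the white-space cleanup mentioned above, the speech can be passed through normalize-space() before it is measured. This variant is not part of our solution. It trims leading and trailing white space and collapses internal runs to a single space, but it still counts speaker names and stage directions, so it need not return the same two speeches.

    ```xpath
    (: measure each speech after normalizing its white space :)
    //sp[string-length(normalize-space()) eq 200]/speaker
    ```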