Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-05-02T00:31:08+0000


Test #4: XPath

Instructions

This test has two required parts plus an optional bonus (extra credit) section. The first part asks questions about your understanding of XPath and the second asks you to create XPath expressions and use them to learn about a Bad Hamlet file similar to the one you’ve been using for practice. You’ll find the file at http://dh.obdurodon.org/even-worse-hamlet.xml. This file contains altered content that is different from the Bad Hamlet version that you’ve been using in your XPath assignments, so be sure to work with this new file.

Don’t forget to set the XPath version in the <oXygen/> XPath toolbar or XPath builder to 3.1. You may also want to revisit our “XPath functions we use most” tutorial.

Part 1: Questions about XPath

Your answers do not have to look like ours as long as they show a clear understanding of the terms.

  1. Question: Define nodes, sequences and atomic values. Give an example of how each of those concepts might arise when you use XPath to explore Hamlet in <oXygen/>. Your examples of these three concepts might involve either XPath expressions themselves or the results that XPath expressions return.

    • Nodes

      • Definition: Nodes are the units of information that make up the XML hierarchical tree. The most important types of nodes are the document node and element nodes, attribute nodes, and text nodes, all of which can be selected using XPath. Nodes may contain other nodes, for example, a <body> element may have paragraph <p> child elements. XPath describes the relationships among nodes in terms of axes, such as child, parent, ancestor, preceding- or following-sibling, and descendant.

      • How they are used: Nodes are used in path expressions to select parts of the XML tree. For example, the XPath expression //body/div[1]/descendant::sp selects all speeches in Act 1 by referring to four types of nodes: the document node (the initial slash), all <body> descendants of the document node (there is only one), the first <div> child of each <body> element, and all speeches that are on the descendant axis from that first <div>. In this example, each path step uses the nodes selected by the immediately preceding path step as a starting point (called the current context) and selects new nodes.

    • Sequences

      • Definition: A sequence is an ordered collection of zero or more items, such as an ordered collection of nodes selected by a step in an XPath path expression. There can also be sequences of atomic values, that is, items that are not nodes in the tree, such as strings and numbers. For example, the XPath expression //sp ! string-length(.) returns a sequence of integers, each one representing the number of characters (letters, punctuation, spaces, etc.) in a speech.

      • How they are used: We often use XPath expressions in exploratory data analysis (EDA) to select sequences of nodes or to create sequences of atomic values. For example, //body/div selects a sequence of <div> elements that represent the acts in Hamlet, and //body/div => count() returns a sequence of a single integer that represents the number of acts.

    • Atomic values

      • Definition: Unlike nodes, atomic values are not located in the XML tree. The atomic values that we use most often are strings, integers, doubles (non-integer numbers), and boolean values (true or false).

      • How they are used: Suppose you want to find the string length of each speaker’s name throughout the play. You could do that with //speaker ! string-length(), which would return a sequence of integers that indicate the length, in character count, of each speaker element. These integers are atomic values because they don’t exist anywhere in the XML tree, and are instead constructed in response to your instruction to count something.
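The three concepts can also be seen outside <oXygen/>. What follows is only a minimal sketch in Python, assuming a tiny made-up sample (not the real Hamlet file) and the standard library’s xml.etree.ElementTree module, which supports just a small subset of XPath. ElementTree has no simple map operator, so a list comprehension stands in for it here.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
SAMPLE = """
<body>
  <div type="act">
    <sp><speaker>Hamlet</speaker><l n="1">To be</l></sp>
    <sp><speaker>Ghost</speaker><l n="2">Mark me</l></sp>
  </div>
</body>
"""

body = ET.fromstring(SAMPLE)

# Nodes: each item selected here is an element that lives in the XML tree.
speakers = body.findall(".//speaker")

# Sequence: an ordered collection of items (here, a list of element nodes).
speaker_count = len(speakers)

# Atomic values: integers computed from the nodes, not found in the tree.
# Roughly analogous to: //speaker ! string-length()
name_lengths = [len("".join(s.itertext())) for s in speakers]

print(speaker_count, name_lengths)
```

The list of elements plays the role of a sequence of nodes, and the list of integers plays the role of a sequence of atomic values.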

  2. Question: What is the difference between an axis and a predicate in a path expression? To answer this question, give an example of each within an XPath expression, explain how they are distinguished syntactically (that is, how each is spelled when used in an XPath expression), and explain what each contributes to the overall meaning of the XPath expression you use to illustrate them.

    • Axis

      • Definition: The axis is the part of a path expression that describes the direction that XPath looks in the hierarchical tree of a document. Common axes include parent, child, and descendant. If no axis is specified explicitly in a path step, by default XPath looks for nodes on the child axis; you can override that default by specifying an alternative axis.

      • Example within an expression: Suppose we want to know the types of elements that can have stage-direction children. We can ask for that information with //stage/parent::* ! name(). This finds all <stage> elements and then, for each one in turn, looks on the parent axis to select its parent element, which could be of any type. We then use the name() function to return the names of those parent elements instead of the elements themselves. The names are atomic values because although the element nodes are in the tree, the names, which are strings, are not. If we were using this expression for EDA in Real Life we would extend it to //stage/parent::* ! name() => distinct-values() => sort() to remove duplicate element names and sort the list for easier reading.

    • Predicate

      • Definition: A predicate filters the results of an XPath step in order to retain items selected by the preceding path step only if they meet a specific condition. Predicates do not select new items (nodes or atomic values); they just filter the items selected by a path expression to decide which ones to keep and which ones to ignore.

      • Example within an expression: Suppose we want to find the third act in Hamlet using an XPath expression. The expression //body/div[3] selects all <div> children of <body>, which is all acts. The predicate then filters that sequence of acts to keep only the third item in the sequence, that is, the third act.
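The contrast between looking in a direction (axis) and filtering what was found (predicate) can be sketched in Python. This is an illustration under stated assumptions, not the course’s method: the sample document is invented, and because ElementTree elements do not record their parents, the parent axis is emulated by walking the tree from the top.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play, invented for illustration only.
SAMPLE = """
<body>
  <div n="1"><stage>Enter</stage><sp><l>Line<stage>Aside</stage></l></sp></div>
  <div n="2"><sp><stage>Exit</stage></sp></div>
  <div n="3"><sp><l>Another line</l></sp></div>
</body>
"""

body = ET.fromstring(SAMPLE)

# Axis: record the name of every element that has a <stage> child.
# Roughly analogous to:
#   //stage/parent::* ! name() => distinct-values() => sort()
parent_names = sorted({parent.tag for parent in body.iter()
                       for child in parent if child.tag == "stage"})

# Predicate: select all <div> children, then keep only the third one.
# Roughly analogous to: //body/div[3]
third_div = body.findall("div[3]")

print(parent_names)
print([d.get("n") for d in third_div])
```

The axis step finds new nodes (the parents), while the predicate only narrows down a selection that the path step had already made.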

  3. Question: Explain the difference between the simple map operator ! and the arrow operator =>. For example, consider the two expressions //sp ! count(.) and //sp => count() and how they return different results. Give one example each of a reasonable way you might use these operators to explore Hamlet.

    • Simple map operator (!)

      • Definition: The simple map (or bang) operator is attached to an XPath expression to indicate that the thing on the right must be done once for each item on the left. For example, //sp ! count(.) says to find all <sp> elements and for each one, count how many times it occurs. Since this expression counts each individual speech separately it returns 1137 instances of the integer 1, that is, one value for each speech in the play. This is probably not something you would ask for in Real Life. A more useful expression might be //body/div ! count(descendant::sp). This expression selects each act and counts the number of speeches it contains, so, because there are five acts, it returns a sequence of five integers.

      • Example: Suppose we wanted to return the string length of each speech (<sp> element). We would use the bang operator to do this because we want to run the function separately for each instance of speech, and not just once for the sequence of all speeches. We could write the expression //sp ! string-length() and return a sequence of 1137 integers, each of which is the number of characters within a single speech.

    • Arrow operator (=>)

      • Definition: This operator is used in an XPath expression to apply the function on the right to the entire sequence on the left. For example, //sp => count() says to find all <sp> elements in the document and use a sequence of those elements as input into the count() function to obtain an integer (in this case, 1137).

      • Example: Suppose we wanted to return a deduplicated sequence of speaker names within the play. We could write an expression that starts by selecting all <speaker> nodes (//speaker) and we could then apply the distinct-values() function to that sequence to remove any duplicate values. We could do this by nesting the path expression inside the function parentheses, that is, distinct-values(//speaker), but because we are used to reading from left to right (and not from inside to outside), we find //speaker => distinct-values() more legible. The arrow operator says to take the sequence on the left and make the entire sequence the input to a single instance of the function on the right.

        You may have noticed that you cannot use the string-length() function with the arrow operator to compute the length of all speeches, e.g., //sp => string-length(), and trying to do that raises an error that says that more than one item is not allowed as the first argument to the function. This is because the arrow operator operates on the entire sequence to the left all at once and the string-length() function is defined as accepting only a single item, and not a sequence of multiple items, as its input. You could use the arrow operator if there were only one thing on the left, so that, for example, //body => string-length() will work because there is only one <body> element in the document.
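The difference between the two operators resembles the difference between calling a function once per list item and calling it once on the whole list. The following Python sketch assumes an invented three-speech sample, and string_length() is a hypothetical stand-in written for this illustration that imitates the one-item restriction described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration only.
SAMPLE = "<body><sp>To be</sp><sp>Or not</sp><sp>Aye</sp></body>"
body = ET.fromstring(SAMPLE)
speeches = body.findall(".//sp")


def string_length(arg):
    """Hypothetical stand-in for string-length(): at most one item allowed."""
    items = arg if isinstance(arg, list) else [arg]
    if len(items) > 1:
        raise TypeError("more than one item is not allowed as the argument")
    return len("".join(items[0].itertext())) if items else 0


# Simple map (!): run the function once for each item on the left.
# Roughly analogous to: //sp ! string-length()
per_speech = [string_length(sp) for sp in speeches]

# Arrow (=>): pass the whole sequence to a single function call.
# Roughly analogous to: //sp => count()
total = len(speeches)

# Roughly analogous to: //sp => string-length(), which raises an error.
try:
    string_length(speeches)
    arrow_error = None
except TypeError as exc:
    arrow_error = str(exc)

# Roughly analogous to: //body => string-length(), which is fine
# because the sequence on the left contains only one item.
body_length = string_length([body])

print(per_speech, total, arrow_error, body_length)
```

The comprehension yields one integer per speech, while the single call yields one value for the whole sequence or fails when the function cannot accept more than one item.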

Part 2: Creating and using XPath expressions

The functions we used to answer the following questions include contains(), count(), distinct-values(), not(), sort(), and string-join(). All of these are described in Michael Kay except sort() because it was introduced in XPath 3.1 and Mike’s book was written when 2.0 was the most recent version. The sort() function returns a sequence of items sorted into ascending order, which for strings means alphabetical order. There may be more than one correct answer to some of the questions.

Questions 5–9 build on one another. If you get stuck at some point, you can still receive partial credit for the following questions by explaining and illustrating how you would answer them if you had the requisite input. For example, if you can’t get the 77 lines you want for question 5, select some alternative lines as input into question 6 and describe and illustrate how you would find the speakers of speeches that contain those lines.

  1. Question: All line elements in the play <l> are supposed to have attributes of type @n, but some don’t, which is a markup mistake. What XPath expression will select the lines that don’t have @n attributes? (Hint: There are five such lines.)

  2. Question: Building on the preceding question, what XPath expression will tell you how many such lines there are? Your expression must return a single integer value, that is, XPath needs to do the counting instead of returning the lines and your finding the answer with your human eyeballs by looking next to the Description.

  3. Question: Hamlet’s Ghost (referred to as Ghost), although not appearing much, is an important symbol in the play as it represents Hamlet’s dead father. What XPath expression finds the scenes where Ghost is featured as a speaker? (Hint: There are 2 such scenes.)

  4. Question: What XPath expression finds all speeches spoken by Ghost? Your XPath expression must select the speeches themselves, and not just the speakers. (Hint: there are 14 such speeches.)

  5. Question: What XPath expression will find every line (<l> or <ab> element) in which the name Hamlet is spoken? Caution: There are lines that contain stage direction (<stage>) elements that mention Hamlet’s name, but being mentioned inside a stage direction isn’t the same as being spoken. Your XPath expression must include only lines where the name Hamlet is spoken within speech. (Hint: there are 77 such lines, 10 instances of <l> and 67 of <ab>.)

  6. Question: What XPath expression will return the speakers of each speech that contains a line (<l> or <ab> element) that mentions Hamlet? (Hint: There are 68 such speakers because some speeches contain more than one line that mentions Hamlet. Some of the speaker names are repeats because the same person may have multiple speeches that mention Hamlet by name.)

  7. Question: What expression would deduplicate the results of the last expression? In other words, you should return a sequence of strings where each name is listed only once. (Hint: There are 13 such speaker names.)

  8. Question: What XPath expression will sort the sequence in alphabetical order?

  9. Question: What XPath expression will return the sequence as a comma-separated list?

Part 3: Optional extra-credit questions

  1. What XPath expression will return a deduplicated list of all element names within the document? (Hint: You’ll need the name() function, which you can look up in Michael Kay. There are 28 distinct element names.)

    • Possible answers:

      //* ! name() => distinct-values()
      /descendant::* ! name() => distinct-values()

      We start by selecting all elements in the document. The expression //* returns a sequence of all elements because the double slash indicates that we start at the document node (because the expression begins with a slash), we look on the descendant axis, and all elements are descendants of the document node. The asterisk matches all element nodes, regardless of the element type.

      That expression returns every element node in the document, but we are looking for the names of the elements, and not the elements themselves (which include their attributes and contents). To say “for each element we find, return just the name of the element,” we use the name() function, and because we need to apply it to each element individually, we use the simple map (or bang) operator !. The expression //* ! name(), then, returns a sequence of strings, each of which is the name of an element in the document.

      Most elements in the document appear more than once and the task was to return a deduplicated list, so we use the arrow operator to remove the duplicates with the distinct-values() function. The arrow operator processes the entire sequence to the left all at once, so the input is a long sequence of element names that include duplicates and the output is a shorter list without duplicates.
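The two stages of this answer (collect every name, then deduplicate) can be mirrored in Python. This is a sketch on an invented sample rather than the real file. Note one assumption about ordering: this code keeps names in order of first appearance, whereas the order in which distinct-values() returns its results depends on the XPath processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration only.
SAMPLE = """
<play>
  <body>
    <sp><speaker>Hamlet</speaker><l>One</l><l>Two</l></sp>
    <sp><speaker>Ghost</speaker><l>Three</l></sp>
  </body>
</play>
"""
root = ET.fromstring(SAMPLE)

# Roughly analogous to: //* ! name()
# One name per element, duplicates included, in document order.
all_names = [element.tag for element in root.iter()]

# Roughly analogous to: => distinct-values()
# dict.fromkeys() keeps the first occurrence of each name.
distinct_names = list(dict.fromkeys(all_names))

print(len(all_names), distinct_names)
```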

  2. What XPath expression will select all speech <sp> elements that have both <l> and <ab> children? (Hint: There are 7 such speeches.)

    • Possible answers:

      //sp[l and ab]
      //sp[l] intersect //sp[ab]

      For this answer you will need to find all speech <sp> element nodes and filter the results to include only those speeches that have both <l> and <ab> children. Our first solution uses the and operator to construct a compound predicate. Our second solution uses the intersect operator (Kay, pp. 628–31) to select all speeches that contain lines on the left and all that contain anonymous blocks on the right and then keep only the speeches that are members of both the left and the right groups.
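Both strategies can be imitated in Python to confirm that they select the same speeches. The sample below is invented for illustration, and the list-based intersection is only a stand-in for the XPath intersect operator, which ElementTree does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, invented for illustration only.
SAMPLE = """
<body>
  <sp n="1"><l>verse</l></sp>
  <sp n="2"><ab>prose</ab></sp>
  <sp n="3"><l>verse</l><ab>prose</ab></sp>
</body>
"""
body = ET.fromstring(SAMPLE)

# Compound predicate, roughly analogous to: //sp[l and ab]
both_by_predicate = [sp for sp in body.iter("sp")
                     if sp.find("l") is not None and sp.find("ab") is not None]

# Intersection, roughly analogous to: //sp[l] intersect //sp[ab]
with_l = body.findall(".//sp[l]")
with_ab = body.findall(".//sp[ab]")
both_by_intersect = [sp for sp in with_l if sp in with_ab]

print([sp.get("n") for sp in both_by_predicate])
print([sp.get("n") for sp in both_by_intersect])
```

The first approach filters one selection with a two-part test; the second makes two selections and keeps only the elements that appear in both.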

  3. What XPath expression will return the ratio of <l> to <ab> children for each of the speeches selected in the previous step and sort them from lowest to highest? (Hint: There are 7 such ratios, ranging from a low of 0.117 to a high of 6, and the number 1 appears twice in that list because two of the speeches in question have the same number of elements of both types.)

    • Possible answer:

      //sp[l and ab] ! (count(l) div count(ab)) => sort()

      We use the bang operator to perform the operation on the right once for each item on the left, where the items on the left are the ones we selected to answer the previous question. On the right side we count the line children and the anonymous block children of each speech and divide the line count by the anonymous block count. Because the question also asks for the ratios from lowest to highest, we then use the arrow operator to pass the entire sequence of ratios to the sort() function.

  4. Given the 7 values in the preceding question, what XPath expressions will return just the lowest value, just the highest value, and just the average (arithmetic mean) of all 7 values? (Hint: You’ll want to look up the appropriate functions in Michael Kay.)

    • Possible answers:

      //sp[l and ab] ! (count(l) div count(ab)) => max()
      //sp[l and ab] ! (count(l) div count(ab)) => min()
      //sp[l and ab] ! (count(l) div count(ab)) => avg()

      The expression in the preceding question returns a sequence of numerical values and we can use the arrow operator and the max(), min(), and avg() functions to return just the largest, smallest, and average (mean) value for that sequence.
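The last two questions compute one ratio per speech and then summarize the whole sequence of ratios. A rough Python equivalent, assuming an invented sample with three qualifying speeches (so its numbers are not the ones in the hints above), might look like this:

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical sample, invented for illustration only.
SAMPLE = """
<body>
  <sp><l/><l/><l/><ab/></sp>
  <sp><l/><ab/></sp>
  <sp><l/><ab/><ab/><ab/><ab/></sp>
  <sp><l/><l/></sp>
</body>
"""
body = ET.fromstring(SAMPLE)

# Once per speech, roughly analogous to:
#   //sp[l and ab] ! (count(l) div count(ab))
ratios = [len(sp.findall("l")) / len(sp.findall("ab"))
          for sp in body.iter("sp")
          if sp.find("l") is not None and sp.find("ab") is not None]

# Once for the whole sequence, roughly analogous to appending
# => sort(), => min(), => max(), or => avg() to the expression above.
ordered = sorted(ratios)
lowest, highest, average = min(ratios), max(ratios), mean(ratios)

print(ordered, lowest, highest, average)
```

The speech with no <ab> children is filtered out before any division happens, which is also why the predicate in the XPath version protects against dividing by zero.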

What to submit

Write your answers in a properly formatted markdown file with a filename that conforms to our usual filenaming conventions, with an .md filename extension, and upload it to Canvas. You can remind yourself about markdown syntax at the GitHub three-minute guide to Mastering markdown that you read earlier. The test is open book and you can use any references you’d like, except that you cannot receive help from another person.

Should you have any questions, please ask in the #xpath channel in our Slack workspace. We can’t give you the answer, but we’ll do whatever we can short of that to help.