XSLT keys

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2024-12-04T19:53:32+0000

The task: description and preparation

Suppose you want to find all of the scenes in which each character in Hamlet speaks. In order to write XSLT to perform that task you first need to know how scenes, speakers, and characters are encoded in your document. Your source is a TEI XML edition of Hamlet with the following markup conventions:

- How scenes are identified: Scenes are <div> children of <div> parents, where the parents are acts. There are no other <div> children of <div> parents in the document, so the XPath pattern div/div matches the twenty scenes and nothing else. Scenes have a <head> child that contains the act and scene numbers in human-readable form, e.g., <head>Act 1, Scene 1</head>.
- How speakers are identified: Speeches are <sp> children of scenes, that is, on the child axis from parent scene <div> elements. Speeches have @who attributes that identify the speakers, e.g., <sp who="#Hamlet">. When characters speak in unison, the speaker identifiers are separated by whitespace, e.g., <sp who="#Marcellus1 #Barnardo">. The speaker of a speech is also identified by a human-readable <speaker> child, e.g., <speaker>Rosencrantz and Guildenstern</speaker>.

All characters in the play, whether they speak or not, are listed in the cast list as elements with unique @xml:id values and human-readable names as content; for example, the cast-list entry for Hamlet has the @xml:id value Hamlet and the content Hamlet. The @xml:id value matches the speaker identifiers on the @who attributes of <sp> elements after removing the leading hash character, so that, for example, from <sp who="#Marcellus1 #Barnardo"> we can tokenize the attribute value, remove the hashes, and look up the @xml:id values Marcellus1 and Barnardo to learn that Marcellus and Bernardo speak in unison at that location.

Mapping between a speaker reference on a @who attribute and a speaker name requires dereferencing the pointer, and not simply removing the hash. While the association of a @who value of #Hamlet and the character name Hamlet might appear to suggest that stripping the hash from the reference will yield a human-friendly name, that strategy would not work with Marcellus (#Marcellus1, with a trailing digit) or Bernardo (#Barnardo, with “a” instead of “e”). There are other, greater discrepancies between the machine-oriented @xml:id values and human-readable names, e.g., the @xml:id value ham-p.-king corresponds to the human-friendly name Player King, a character who speaks four times.
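
The dereferencing step described here can be sketched as a small stand-alone stylesheet. This is an illustration rather than part of the solution: it assumes the input is in the TEI namespace (hence the xpath-default-namespace value) and, purely as a demonstration, reports only the speeches spoken in unison.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="3.0">
  <!-- Assumption: the input document is in the TEI namespace -->
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Visit only speeches whose @who holds more than one pointer -->
    <xsl:for-each select="//sp[count(tokenize(@who)) gt 1]">
      <!-- Split @who on whitespace, drop each leading hash, and look up the
           element that carries the matching @xml:id value; current() is the
           <sp> element, which tells id() which document to search -->
      <xsl:value-of separator=" + "
        select="tokenize(@who) ! substring-after(., '#') ! id(., current())"/>
      <xsl:text>&#x0a;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because id() returns the cast-list element itself, the human-readable name comes from that element's content, not from the pointer, which is why identifiers such as Marcellus1 or ham-p.-king cause no trouble.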

The slow way (without keys)

Construct a deduplicated sequence of speaking characters. Since not all characters speak, we can’t just start from the cast list, which includes non-speaking characters. We also can’t use the @who values as they appear because, for example, the (entire) @who value in <sp who="#Marcellus1 #Barnardo"> does not correspond to any (single) character in the cast list. We can, though, tokenize the @who values to isolate the identifiers of characters who speak in unison, combine those results for all speeches, and then deduplicate the result. We can do that with:
```
<xsl:variable name="unique-speakers" as="xs:string+"
  select="//sp/@who ! tokenize(.) => distinct-values()"/>
```
which returns a sequence of 35 unique strings.
For each unique speaker value, select all of the scenes (<div> children of <div> parents) where that speaker appears in the @who attribute of a child speech (<sp> element), whether alone or together with other speakers. We can do this for scenes where Hamlet speaks, for example, with:
```
//div/div[sp/@who[contains-token(., '#Hamlet')]]
```
We can do this for all characters inside an <xsl:for-each> over the $unique-speakers variable, but because <xsl:for-each> changes the context (each item is now a string, not a node) and cuts us off from the input document, we need to save a pointer to the input document in a global variable (here called $root) before we enter the <xsl:for-each>:
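```
<xsl:variable name="root" as="document-node()" select="/"/>
<xsl:template match="/">
  <xsl:for-each select="$unique-speakers">
    <!-- current() is the speaker pointer; $root leads back to the input -->
    <xsl:variable name="character-scenes" as="element(div)+"
      select="$root//div/div[sp/@who[contains-token(., current())]]"/>
    <xsl:value-of select=". || ': ' || count($character-scenes) || '&#x0a;'"/>
  </xsl:for-each>
</xsl:template>
```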
We output the character names with their scene counts as a sanity check, that is, to verify that the results are sensible. The first few lines of the output are:
```
#Barnardo: 2
#Francisco: 1
#Horatio: 9
```
The code above says to look at all 20 scenes separately for each character and, for each scene, check whether there is a speech by that character. That check requires looking at all speech children of the scene until a speech by that character is found, which means that if the character doesn’t speak in the scene, it requires looking at all speeches. This means looking at 20 scenes 35 times (700 combinations) and, for each combination (that is, 700 times), looking at, in the worst case, every speech within the scene. This is the inefficiency that we remedy with keys below.
Construct a human-friendly report of which characters speak in which scenes. So far we’ve been operating with the machine-oriented character identifiers and with entire scenes (<div> elements). We now need to:
1. Obtain a human-friendly name for each speaker. We can use the id() function to look up the cast-list element for each speaker, the string value of which will be the human-friendly speaker name. For example, the XPath expression id('Barnardo', $root) will select the cast-list element whose content is Bernardo. To perform the lookup starting from a speaker reference in a @who attribute we need to 1) remove the hash that is present in, for example, who="#Barnardo", since the hash isn’t part of the @xml:id value on the cast-list element, and 2) specify the document, using the $root variable, because we’ll be inside the <xsl:for-each>, which, as explained above, cuts us off from the original input document. We want just the string value, and we’ll take care of that when we format the output in a way that atomizes the data.
2. Obtain human-friendly identifiers for the scenes in which a character speaks. We’ll use the <head> children of the scene <div> elements for this purpose. For example, if we ask for the <head> children of the scenes in which Ophelia speaks with the XPath expression //div/div[sp/@who[contains-token(., '#Ophelia')]]/head, we get a sequence of five <head> elements, one for each scene in which Ophelia speaks. As with the speaker names, above, we’ll get rid of the <head> markup when we format the output.
3. Create a human-friendly report. To format the output we’ll use a combination of string-join() (to combine the scene identifiers into a semicolon-separated list) and the || concatenation operator (to join the speaker name, a separator (we use a colon followed by a space character), the string-joined list of scenes, and a trailing newline), so that our complete, finished XSLT code is:
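```
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="3.0">
  <!-- Assumption: the input document is in the TEI namespace -->
  <xsl:output method="text"/>
  <xsl:variable name="root" as="document-node()" select="/"/>
  <xsl:variable name="unique-speakers" as="xs:string+"
    select="//sp/@who ! tokenize(.) => distinct-values()"/>
  <xsl:template match="/">
    <xsl:for-each select="$unique-speakers">
      <xsl:sort select="id(substring-after(., '#'), $root)"/>
      <xsl:variable name="character-scenes" as="element(head)+"
        select="$root//div/div[sp/@who[contains-token(., current())]]/head"/>
      <xsl:value-of select="id(substring-after(., '#'), $root) || ': '
        || string-join($character-scenes, '; ') || '&#x0a;'"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```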
  The first few lines of output look like:
```
A Captain: Act 4, Scene 4
A Gentleman: Act 4, Scene 5
A Priest: Act 5, Scene 1
```
  We sort the results by the human-friendly character name (with <xsl:sort>) and we have to change the datatype of the $character-scenes variable from element(div)+ to element(head)+ because we’re now selecting the <head> children of the scenes, instead of the scenes themselves.

The fast way (with keys)

Most of the code above remains, except that we use keys to select the <div> elements for the scenes. We don’t use keys to select the cast-list elements because the XPath id() function already performs fast lookups.
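
A minimal sketch of what a keyed version might look like follows. It reuses the names that appear on this page ($root, $unique-speakers, $character-scenes, and the key name scene-by-speaker); the TEI namespace declaration and the exact layout are assumptions, so treat it as an illustration rather than a definitive listing.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="3.0">
  <!-- Assumption: the input document is in the TEI namespace -->
  <xsl:output method="text"/>
  <xsl:variable name="root" as="document-node()" select="/"/>
  <xsl:variable name="unique-speakers" as="xs:string+"
    select="//sp/@who ! tokenize(.) => distinct-values()"/>
  <!-- Index every scene under each speaker pointer found on its child speeches -->
  <xsl:key name="scene-by-speaker" match="div/div" use="sp/@who ! tokenize(.)"/>
  <xsl:template match="/">
    <xsl:for-each select="$unique-speakers">
      <xsl:sort select="id(substring-after(., '#'), $root)"/>
      <!-- Key lookup replaces the predicate that scanned all scenes and speeches -->
      <xsl:variable name="character-scenes" as="element(head)+"
        select="key('scene-by-speaker', ., $root)/head"/>
      <xsl:value-of select="id(substring-after(., '#'), $root) || ': '
        || string-join($character-scenes, '; ') || '&#x0a;'"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```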

The only changes are the addition of the key definition and the use of the key() function to obtain the scenes. Here’s how keys work:

A key is defined with <xsl:key>, which takes three attributes, as follows:
- The @name attribute names the key. Keys must have a name because you can define more than one, and you need to specify which one you’re using when you invoke them.
- The @match attribute specifies a pattern, as it does for templates. The key defined above will match all scenes.
- The @use attribute is an expression evaluated relative to whatever the key matches. The example above says to tokenize the value of the @who attributes of all <sp> children of each scene and use any of those individual speaker pointers to look up the scene.
The key() function uses a key, as defined above, to select nodes. The key() function takes two or three arguments:
- The first argument identifies by name the specific key to be used.
- The second argument identifies the key value to use for the lookup. For example, key('scene-by-speaker', '#Hamlet') selects all scenes in which the string #Hamlet appears as a token in the @who attribute of an <sp> child of that scene.
- The optional third argument specifies the root of the tree in which to look. Keys are not defined for a specific document. The key() function with two arguments looks in the document that contains the context node, but because there is no context node inside our <xsl:for-each> (it iterates over strings, not nodes), we include the third argument, which in this case is the document node of Hamlet.
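
To see the pieces in isolation, here is a small self-contained sketch that does not need the Hamlet file. The toy document in $toy, its scene titles, and the pointers #x and #y are all hypothetical; only the key definition mirrors the one discussed above. The toy elements are in no namespace, so no xpath-default-namespace is declared.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="text"/>
  <!-- Hypothetical toy input: one act containing two scenes -->
  <xsl:variable name="toy" as="document-node()">
    <xsl:document>
      <div>
        <div><head>Scene A</head><sp who="#x #y"/><sp who="#x"/></div>
        <div><head>Scene B</head><sp who="#y"/></div>
      </div>
    </xsl:document>
  </xsl:variable>
  <!-- Each scene is indexed under every token of every child sp/@who -->
  <xsl:key name="scene-by-speaker" match="div/div" use="sp/@who ! tokenize(.)"/>
  <xsl:template name="xsl:initial-template">
    <!-- The context items here are strings, so key() needs its third
         argument to know which tree to search -->
    <xsl:for-each select="'#x', '#y'">
      <xsl:value-of select=". || ': '
        || string-join(key('scene-by-speaker', ., $toy)/head, '; ') || '&#x0a;'"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

When run with an XSLT 3.0 processor starting from the initial template (no source document is needed), this prints `#x: Scene A` followed by `#y: Scene A; Scene B`. Scene A is returned only once for #x even though #x appears on two of its speeches, because key() returns each matching node once, in document order.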