Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2024-12-04T19:53:32+0000


XSLT keys

Why we use keys in XSLT

When you need to search for data, use keys. As with template rules, don’t put off learning how to use keys or dismiss them as an advanced feature. They are an essential tool of the trade. Searching for data without using keys is like using a screwdriver to hammer nails. (Michael Kay, The ten most common XSLT programming mistakes)

Keys in XSLT are like a back-of-the-book index. If you needed to find all mentions of a particular topic in a printed book without an index, you would have to look at every page to see whether the topic is discussed there. With a back-of-the-book index, though, you would look through the index (which is typically alphabetized, so you can find the entry you want quickly) and then skip to the pages listed there. The person who compiled the index had to read the entire contents, but only once, after which a quick look in the index replaces a laborious search through every page.

The task: description and preparation

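Suppose you want to find all of the scenes in which each character in Hamlet speaks. In order to write XSLT to perform that task you first need to know how scenes, speakers, and characters are encoded in your document. Your source is a TEI XML edition of Hamlet, and the expressions used below rely on the following markup conventions: acts are <div> elements and scenes are <div> children of acts; each scene has a <head> child that holds a human-readable label such as “Act 4, Scene 4”; speeches are <sp> children of scenes; the @who attribute of a speech holds one or more whitespace-separated pointers, each a hash mark followed by the @xml:id value of a character in the cast list (for example, #Barnardo); and the string value of the cast-list element that carries that @xml:id is the character’s name.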
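A fragment with markup of the kind that the later expressions assume might look like the sketch below. It is an illustration, not an excerpt from the edition: the <castItem> and <role> element names, the TEI namespace declaration, and the speech spoken in unison are assumptions made for the example, and the speech text is left out.

```xml
<text xmlns="http://www.tei-c.org/ns/1.0">
  <front>
    <castList>
      <!-- Assumed cast-list markup: the @xml:id is the lookup target,
           the string value is the human-friendly name -->
      <castItem><role xml:id="Barnardo">Bernardo</role></castItem>
      <castItem><role xml:id="Francisco">Francisco</role></castItem>
    </castList>
  </front>
  <body>
    <div><!-- an act -->
      <div><!-- a scene: a <div> child of a <div> parent -->
        <head>Act 1, Scene 1</head>
        <!-- @who points to one character ... -->
        <sp who="#Barnardo"><!-- speech content --></sp>
        <!-- ... or (hypothetically) to several who speak in unison -->
        <sp who="#Barnardo #Francisco"><!-- speech content --></sp>
      </div>
    </div>
  </body>
</text>
```

The second speech shows why a @who value cannot be used as a character identifier without tokenizing it first.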

General logic

To produce a sequence of speaking characters and the scenes in which they speak, we might adopt the following strategy:

  1. Construct a deduplicated sequence of speaking characters.

  2. For each unique speaker, construct a sequence of scenes in which that character speaks.

  3. Construct a human-friendly report of which characters speak in which scenes.

The slow way (without keys)

  1. Construct a deduplicated sequence of speaking characters. Since not all characters speak, we can’t just start from the cast list, which includes non-speaking characters. We also can’t use the @who values as they appear because a speech spoken in unison has a @who value that lists several character identifiers, and that (entire) value does not correspond to any (single) character in the cast list. We can, though, tokenize the @who values to isolate the identifiers of characters who speak in unison, combine those results for all speeches, and then deduplicate the result. We can do that with:

     <xsl:variable name="unique-speakers" as="xs:string+" select="//sp/@who ! tokenize(.) => distinct-values()"/>

    which returns a sequence of 35 unique strings.

  2. For each unique speaker value, select all of the scenes (<div> children of <div> parents) where that speaker appears in the @who attribute of a child speech (<sp> element), whether alone or together with other speakers. We can do this for scenes where Hamlet speaks, for example, with:

    //div/div[sp/@who[contains-token(., '#Hamlet')]]

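    We can do this for all characters inside an <xsl:for-each> over the $unique-speakers variable, but because <xsl:for-each> changes the context and cuts us off from the input document, we need to save a pointer to the input document in a global variable (here called $root) before we enter the <xsl:for-each>. Inside the loop the context item is a string, not a node, so any path into the document has to start from $root.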

    We output the character names with their scene counts as a sanity check, that is, to verify that the results are sensible. The first few lines of the output are:

    #Barnardo: 2
    #Francisco: 1
    #Horatio: 9

    The code above says to look at all 20 scenes separately for each character and, for each scene, check whether there is a speech by that character. That check requires looking at all speech children of the scene until a speech by that character is found, which means that if the character doesn’t speak in the scene, it requires looking at all speeches. This means looking at 20 scenes 35 times (700 combinations) and, for each combination (that is, 700 times), looking at, in the worst case, every speech within the scene. This is the inefficiency that we remedy with keys below.

  3. Construct a human-friendly report of which characters speak in which scenes. So far we’ve been operating with the machine-oriented character identifiers and with entire scenes (<div> elements). We now need to:

    1. Obtain a human-friendly name for each speaker. We can use the id() function to look up the cast-list element for each speaker, the string value of which will be the human-friendly speaker name. For example, the XPath expression id('Barnardo', $root) will select the element whose @xml:id value is Barnardo, and the string value of that element is Bernardo. To perform the lookup starting from a speaker reference in a @who attribute we need to 1) remove the hash that is present in a pointer such as #Barnardo, since that isn’t part of the @xml:id value on the target element, and 2) specify the document, using the $root variable, because we’ll be inside the <xsl:for-each>, which, as explained above, cuts us off from the original input document. We want just the string value, and we’ll take care of that when we format the output in a way that atomizes the data.

    2. Obtain human-friendly identifiers for the scenes in which a character speaks. We’ll use the <head> children of the scene <div> elements for this purpose. For example, if we ask for the <head> children of the scenes in which Ophelia speaks with the XPath expression //div/div[sp/@who[contains-token(., '#Ophelia')]]/head, we get a sequence of five <head> elements, one for each scene in which Ophelia speaks. As with the speaker names, above, we’ll get rid of the <head> markup when we format the output.

    3. Create a human-friendly report. To format the output we’ll use a combination of string-join() (to combine the scene identifiers into a semicolon-separated list) and the || concatenation operator (to join the speaker name, a separator (we use a colon followed by a space character), the string-joined list of scenes, and a trailing newline).

      The first few lines of output look like:

      A Captain: Act 4, Scene 4
      A Gentleman: Act 4, Scene 5
      A Priest: Act 5, Scene 1

      We sort the results by the human-friendly character name, and we have to change the datatype of the $character-scenes variable from element(div)+ to element(head)+ because we’re now selecting the <head> children of the scenes, instead of the scenes themselves.
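Pulling the three steps together, a stylesheet along the following lines would produce a report in that format. This is a minimal sketch rather than the original listing: it assumes that the input is in the TEI namespace, that every speaker identifier has a matching @xml:id in the cast list, and that every scene has a <head>. The variable names follow the ones mentioned in the text.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <!-- Pointer to the input document, for use inside xsl:for-each -->
    <xsl:variable name="root" as="document-node()" select="/"/>

    <!-- Step 1: tokenize every @who value, then deduplicate -->
    <xsl:variable name="unique-speakers" as="xs:string+"
        select="//sp/@who ! tokenize(.) => distinct-values()"/>

    <xsl:template match="/">
        <xsl:for-each select="$unique-speakers">
            <!-- Sort by the human-friendly name, not by the identifier -->
            <xsl:sort select="id(substring-after(., '#'), $root)"/>
            <!-- Step 2 (slow): every scene is re-examined for every speaker -->
            <xsl:variable name="character-scenes" as="element(head)+"
                select="$root//div/div[sp/@who[contains-token(., current())]]/head"/>
            <!-- Step 3: name, separator, scene list, newline;
                 || and string-join() atomize the elements -->
            <xsl:value-of select="id(substring-after(., '#'), $root) || ': '
                || string-join($character-scenes, '; ') || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Inside the <xsl:for-each> the context item is a string, which is why the speaker is referred to as current() inside the predicate and why both the path and the id() call are anchored to $root.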

The fast way (with keys)

Most of the code above remains, except that we use keys to select <div> elements for the scenes. We don’t use keys to select the cast-list elements that supply the character names because the XPath id() function already performs fast lookups.
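A keyed version might look like the sketch below, under the same assumptions as before (TEI namespace, a matching @xml:id for every speaker, a <head> in every scene). The key name scenes-by-speaker is a hypothetical choice for this example, not a name taken from the original listing.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <xsl:variable name="root" as="document-node()" select="/"/>
    <xsl:variable name="unique-speakers" as="xs:string+"
        select="//sp/@who ! tokenize(.) => distinct-values()"/>

    <!-- Index each scene under every speaker identifier found in its speeches -->
    <xsl:key name="scenes-by-speaker" match="div/div"
        use="sp/@who ! tokenize(.)"/>

    <xsl:template match="/">
        <xsl:for-each select="$unique-speakers">
            <xsl:sort select="id(substring-after(., '#'), $root)"/>
            <!-- Look the scenes up in the index instead of rescanning them;
                 the third argument names the document to search -->
            <xsl:variable name="character-scenes" as="element(head)+"
                select="key('scenes-by-speaker', ., $root)/head"/>
            <xsl:value-of select="id(substring-after(., '#'), $root) || ': '
                || string-join($character-scenes, '; ') || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The third argument to key() is needed for the same reason that $root is needed elsewhere: the context item inside the loop is a string, so the processor has to be told which document to search.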

The only changes are the addition of the key definition and the use of the key to obtain the scenes.

How keys work

Keys achieve good performance by constructing the equivalent of a back-of-the-book index for a document the first time the key is used with the key() function. If this were a back-of-the-book index, the unique speaker identifiers would be the index entries, each of which would point to a list of scenes in which the speaker participates. On first use the processor has to look at the entire document, much as a human who is indexing a print book has to look at the whole book. But after that single full traversal, subsequent uses of the same key look only at the index, and not at the full document. This means that if you’re going to use a key only once, there is probably nothing to gain from defining it, but the more times you call it, the greater the performance benefit.
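To make the index analogy concrete, the sketch below builds a comparable index by hand as an XPath 3.1 map, in one pass over the scenes, and then reads only from the map. It is a conceptual model and not a description of how any particular processor implements keys; the variable name scene-index is hypothetical, and the TEI namespace is again an assumption.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <!-- One traversal of the scenes: each scene is filed under every
         speaker identifier that occurs in it, and entries for the same
         identifier are combined into a single sequence of scenes -->
    <xsl:variable name="scene-index" as="map(xs:string, element(div)+)"
        select="map:merge(
            for $scene in //div/div,
                $id in distinct-values($scene/sp/@who ! tokenize(.))
            return map { $id : $scene },
            map { 'duplicates' : 'combine' })"/>

    <xsl:template match="/">
        <!-- Every lookup now consults the index, not the document -->
        <xsl:for-each select="map:keys($scene-index)">
            <xsl:sort/>
            <xsl:value-of
                select=". || ': ' || count($scene-index(.)) || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The map keys play the role of index entries and the map values play the role of the page lists. An <xsl:key> declaration asks the processor to maintain an equivalent structure on your behalf.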