Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2024-12-04T19:53:32+0000
When you need to search for data, use keys. As with template rules, don’t put off learning how to use keys or dismiss them as an advanced feature. They are an essential tool of the trade. Searching for data without using keys is like using a screwdriver to hammer nails. (Michael Kay, The ten most common XSLT programming mistakes)
Keys in XSLT are like a back-of-the-book index. If you need to find all mentions of a particular topic in a printed book without an index, you would have to look at every page to see whether the topic is discussed there. With a back-of-the-book index, though, you would look through the index (which is typically alphabetized, so you can find the entry you want quickly) and then skip to the pages listed there. The person who compiled the index had to read the entire contents, but only once, after which a quick look in the index replaces a laborious search through every page.
Suppose you want to find all of the scenes in which each character in Hamlet speaks. In order to write XSLT to perform that task you first need to know how scenes, speakers, and characters are encoded in your document. Your source is a TEI XML edition of Hamlet with the following markup conventions:
How scenes are identified: Scenes are <div> children of <div> parents, where the parents are acts. There are no other <div> children of <div> parents in the document, so the XPath pattern div/div matches the twenty scenes and nothing else. Scenes have a <head> child that contains the act and scene numbers in human-readable form, e.g., <head>Act 1, Scene 1</head>.
How speakers are identified: Speeches are <sp> children of scenes, that is, on the child axis from parent scene <div> elements. Speeches have @who attributes that identify the speakers, e.g., <sp who="#Hamlet">. When characters speak in unison, the speaker identifiers are separated by whitespace, e.g., <sp who="#Marcellus1 #Barnardo">. The speaker of a speech is also identified by a human-readable <speaker> child, e.g., <speaker>Rosencrantz and Guildenstern</speaker>.
All characters in the play, whether they speak or not, are listed in the cast list as elements with unique @xml:id values and human-readable names as content, e.g., the cast-list element with the @xml:id value Hamlet has the content Hamlet. The @xml:id value matches the speaker identifiers on the @who attributes of <sp> elements after removing the leading hash character, so that, for example, from <sp who="#Marcellus1 #Barnardo"> we can tokenize the attribute value, remove the hashes, and look up the @xml:id values Marcellus1 and Barnardo to learn that Marcellus and Bernardo speak in unison at that location.
Mapping between a speaker reference on a @who attribute and a speaker name requires dereferencing the pointer, and not simply removing the hash. While the association of a @who value of #Hamlet and the character name Hamlet might appear to suggest that stripping the hash from the reference will yield a human-friendly name, that strategy would not work with Marcellus (#Marcellus1, with trailing digit) or Bernardo (#Barnardo, with "a" instead of "e"). There are other, greater discrepancies between the machine-oriented @xml:id values and human-readable names, e.g., the @xml:id value ham-p.-king corresponds to the human-friendly name Player King, a character who speaks four times.
To produce a sequence of speaking characters and the scenes in which they speak, we might adopt the following strategy:
Construct a deduplicated sequence of speaking characters.
For each unique speaker, construct a sequence of scenes in which that character speaks.
Construct a human-friendly report of which characters speak in which scenes.
Construct a deduplicated sequence of speaking characters. Since not all characters speak, we can’t just start from the cast list, which includes non-speaking characters. We also can’t use the @who values as they appear because, for example, the (entire) @who value in <sp who="#Marcellus1 #Barnardo"> does not correspond to any (single) character in the cast list. We can, though, tokenize the @who values to isolate the identifiers of characters who speak in unison, combine those results for all speeches, and then deduplicate the result with distinct-values(). Saving that result in a variable called $unique-speakers gives us a sequence of 35 unique strings.
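The following is a minimal sketch of this first step, not the original listing. It assumes an XSLT 3.0 processor with XPath 3.1 support (needed for one-argument tokenize()) and assumes that the edition is in the TEI namespace; if yours is not, drop the xpath-default-namespace attribute. It outputs only the number of unique speaker pointers it finds.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Assumption: the input is in the TEI namespace (see attribute above) -->
    <xsl:output method="text"/>

    <!-- Split every @who value on whitespace, pool the tokens from all
         speeches, and keep one copy of each distinct pointer -->
    <xsl:variable name="unique-speakers" as="xs:string*"
        select="distinct-values(//sp/@who ! tokenize(.))"/>

    <xsl:template match="/">
        <!-- Report how many unique speaker pointers were found -->
        <xsl:value-of select="count($unique-speakers)"/>
    </xsl:template>
</xsl:stylesheet>
```

The one-argument form of tokenize() splits on whitespace after trimming, so a value such as #Marcellus1 #Barnardo yields two pointers and never an empty string.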
For each unique speaker value, select all of the scenes (<div> children of <div> parents) where that speaker appears in the @who attribute of a child speech (<sp> element), whether alone or together with other speakers. For the scenes where Hamlet speaks, for example, we can use an XPath expression like //div/div[sp/@who[contains-token(., '#Hamlet')]]. We can do this for all characters inside an <xsl:for-each> over the $unique-speakers variable, but because the <xsl:for-each> changes the context and cuts us off from the input document, we need to save a pointer to the input document in a global variable (here called $root) before we enter the <xsl:for-each>.
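The approach just described might be sketched as follows. This is a hedged reconstruction and not the original listing: the variable names $root, $unique-speakers, and $character-scenes come from the surrounding text, while the TEI namespace declaration and the exact expressions are assumptions.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Assumption: the input is in the TEI namespace -->
    <xsl:output method="text"/>

    <!-- Pointer to the input document, saved while it is still the context -->
    <xsl:variable name="root" as="document-node()" select="/"/>
    <xsl:variable name="unique-speakers" as="xs:string*"
        select="distinct-values(//sp/@who ! tokenize(.))"/>

    <xsl:template match="/">
        <xsl:for-each select="$unique-speakers">
            <!-- The context item is now a string, so the path must start
                 from $root; current() is the speaker pointer being processed -->
            <xsl:variable name="character-scenes" as="element(div)+"
                select="$root//div/div[sp/@who[contains-token(., current())]]"/>
            <!-- Sanity check: pointer, colon, number of scenes, newline -->
            <xsl:value-of
                select="current() || ': ' || count($character-scenes) || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Inside the predicate the context item is an @who attribute, which is why the sketch refers to the speaker with current() and not with a dot.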
We output the character names with their scene counts as a sanity check, that is, to verify that the results are sensible. The first few lines of the output are:
#Barnardo: 2
#Francisco: 1
#Horatio: 9
The code above says to look at all 20 scenes separately for each character and, for each scene, check whether there is a speech by that character. That check requires looking at all speech children of the scene until a speech by that character is found, which means that if the character doesn’t speak in the scene, it requires looking at all speeches. This means looking at 20 scenes 35 times (700 combinations) and, for each combination (that is, 700 times), looking at, in the worst case, every speech within the scene. This is the inefficiency that we remedy with keys below.
Construct a human-friendly report of which characters speak in which scenes. So far we’ve been operating with the machine-oriented character identifiers and with entire scenes (<div> elements). We now need to:
Obtain a human-friendly name for each speaker. We can use the id() function to look up the cast-list element for each speaker, the string value of which will be the human-friendly speaker name. For example, the XPath expression id('Barnardo', $root) will select the cast-list element whose content is Bernardo. To perform the lookup starting from a speaker reference in a @who attribute we need to 1) remove the hash that was present in, for example, <sp who="#Barnardo">, since that isn’t part of the @xml:id value on the cast-list element, and 2) specify the document, using the $root variable, because we’ll be inside the <xsl:for-each>, which, as explained above, cuts us off from the original input document. We want just the string value, and we’ll take care of that when we format the output in a way that atomizes the data.
Obtain human-friendly identifiers for the scenes in which a character speaks. We’ll use the <head> children of the scene <div> elements for this purpose. For example, if we ask for the <head> children of the scenes in which Ophelia speaks with the XPath expression //div/div[sp/@who[contains-token(., '#Ophelia')]]/head, we get a sequence of five <head> elements, one for each scene in which Ophelia speaks. As with the speaker names, above, we’ll get rid of the <head> markup when we format the output.
Create a human-friendly report. To format the output we’ll use a combination of string-join() (to combine the scene identifiers into a semicolon-separated list) and the || concatenation operator (to join the speaker name, a separator (we use a colon followed by a space character), the string-joined list of scenes, and a trailing newline).
The first few lines of output look like:
A Captain: Act 4, Scene 4
A Gentleman: Act 4, Scene 5
A Priest: Act 5, Scene 1
We sort the results by the human-friendly character name, and we have to change the datatype of the $character-scenes variable from element(div)+ to element(head)+ because we’re now selecting the <head> children of the scenes, instead of the scenes themselves.
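A complete stylesheet along the lines described above might look like the following sketch. It is not the original listing: the TEI namespace declaration, the use of substring-after() to strip the hash, and the separator '; ' are assumptions, and the sketch also assumes that every speaker pointer resolves to exactly one cast-list element.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Assumption: the input is in the TEI namespace -->
    <xsl:output method="text"/>

    <xsl:variable name="root" as="document-node()" select="/"/>
    <xsl:variable name="unique-speakers" as="xs:string*"
        select="distinct-values(//sp/@who ! tokenize(.))"/>

    <xsl:template match="/">
        <xsl:for-each select="$unique-speakers">
            <!-- Sort by the dereferenced, human-friendly name -->
            <xsl:sort select="id(substring-after(., '#'), $root)"/>
            <!-- Scene headings instead of whole scenes, hence element(head)+ -->
            <xsl:variable name="character-scenes" as="element(head)+"
                select="$root//div/div[sp/@who[contains-token(., current())]]/head"/>
            <!-- || and string-join() atomize their operands, which removes
                 the markup around the name and the headings -->
            <xsl:value-of select="id(substring-after(., '#'), $root)
                || ': ' || string-join($character-scenes, '; ') || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Because xml:id attributes are always treated as identifiers, id() works here without a DTD or schema.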
Most of the code above remains, except that we use a key to find the scenes in which each character speaks. We don’t use a key to select the cast-list elements because the XPath id() function already performs fast lookups. The only changes in the new version are the addition of the key definition and the use of the key to obtain the scenes. Here’s how keys work:
A key is defined with <xsl:key>, which takes three attributes, as follows:
The @name attribute names the key. Keys must have a name because you can define more than one, and you need to specify which one you’re using when you invoke them.
The @match attribute identifies an XPath pattern, as it does for templates. The key in our example matches all scenes.
The @use attribute is an XPath expression evaluated relative to whatever the key matches. The key in our example says to tokenize the value of the @who attributes of all <sp> children of each scene and use any of those individual speaker pointers to look up the scene.
The key() function uses a key, as defined above, to select nodes. The key() function takes two or three arguments:
The first argument identifies by name the specific key to be used.
The second argument identifies the key value to use for the lookup.
For example, key('scene-by-speaker', '#Hamlet') selects all scenes in which the string #Hamlet appears as a token in the @who attribute of an <sp> child of that scene.
Keys are not defined for a specific document. The key() function with two arguments looks in the current document, that is, the one that contains the context node. Because the context item inside our <xsl:for-each> is a string and not a node, there is no current document there, so we include the optional third argument, which specifies the root of the tree in which to look, which in this case is the document node of Hamlet.
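The key definition and the three-argument key() call described above can be sketched as follows. The key name scene-by-speaker and the variable names come from the text; the exact @use expression, the TEI namespace declaration, and the '; ' separator are assumptions, so treat this as one plausible version and not as the original listing.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Assumption: the input is in the TEI namespace -->
    <xsl:output method="text"/>

    <xsl:variable name="root" as="document-node()" select="/"/>
    <xsl:variable name="unique-speakers" as="xs:string*"
        select="distinct-values(//sp/@who ! tokenize(.))"/>

    <!-- Index each scene under every speaker pointer on its speeches -->
    <xsl:key name="scene-by-speaker" match="div/div"
        use="sp/@who ! tokenize(.)"/>

    <xsl:template match="/">
        <xsl:for-each select="$unique-speakers">
            <xsl:sort select="id(substring-after(., '#'), $root)"/>
            <!-- Third argument: search the tree rooted at $root, because
                 the context item here is a string, not a node -->
            <xsl:variable name="character-scenes" as="element(head)+"
                select="key('scene-by-speaker', current(), $root)/head"/>
            <xsl:value-of select="id(substring-after(., '#'), $root)
                || ': ' || string-join($character-scenes, '; ') || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Compared with filtering //div/div with a predicate, only the <xsl:key> declaration and the @select of $character-scenes are different.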
Keys achieve good performance by constructing the equivalent of a back-of-the-book
index for a document the first time the key is used with the
key()
function. If this were a back-of-the-book
index, the unique speaker identifiers would be the index entries, each of which
would point to a list of scenes in which the speaker participates. On first use the
processor has to look at the entire document, much as a human who is indexing a
print book has to look at the whole book. But after that single full traversal,
subsequent uses of the same key resource look only at the index, and not at the full
document. This means that if you’re going to use a key only once it’s probably
better not to build it at all, but the more times you call it, the greater the
performance benefit.
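To make the index metaphor concrete, the sketch below builds a comparable lookup table by hand as an XPath 3.1 map from speaker pointers to scenes and then consults it once per speaker. It is only an illustration of the idea under stated assumptions (TEI namespace, XPath 3.1 map support, the hypothetical variable name $index); it does not describe how any particular processor implements keys internally.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <!-- Assumption: the input is in the TEI namespace -->
    <xsl:output method="text"/>

    <!-- One pass over the scenes: file each scene under every speaker
         pointer it contains; 'combine' merges entries with the same key -->
    <xsl:variable name="index" as="map(xs:string, element(div)+)"
        select="map:merge(
            (for $scene in //div/div,
                 $speaker in distinct-values($scene/sp/@who ! tokenize(.))
             return map:entry($speaker, $scene)),
            map{'duplicates': 'combine'})"/>

    <xsl:template match="/">
        <xsl:for-each select="map:keys($index)">
            <xsl:sort/>
            <!-- Each lookup consults the map, not the document -->
            <xsl:value-of select=". || ': ' || count($index(.)) || '&#x0a;'"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

In practice <xsl:key> is the more idiomatic choice, since the processor builds and manages the index for you.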