Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-04-15T17:34:06+0000


Test #5: XSLT (answers)

The problem

Using Bad Hamlet, write an XSLT stylesheet that will produce an HTML5 table with a header row followed by five data rows, one for each act, and three columns. The leftmost column will give the title of the act (from the <head> element within the act itself). The middle column will give the number of scenes in that act. The rightmost column will contain a comma-separated list of all speakers in that act, without duplicates. Our output is:

Act | Number of Scenes | Characters
Act 1 | 5 | Bernardo, Francisco, Horatio, Marcellus, King, Cornelius and Voltimand, Laertes, Polonius, Hamlet, Gertrude, Marcellus and Bernardo, All, Ophelia, Ghost, Marcellus and Horatio
Act 2 | 2 | Polonius, Reynaldo, Ophelia, King, Gertrude, Rosencrantz, Guildenstern, Voltimand, Hamlet, Rosencrantz and Guildenstern, First Player
Act 3 | 4 | King, Rosencrantz, Guildenstern, Gertrude, Polonius, Ophelia, Hamlet, First Player, Rosencrantz and Guildenstern, Horatio, Prologue, Player King, Player Queen, Lucianus, All, Ghost
Act 4 | 7 | King, Gertrude, Hamlet, Rosencrantz and Guildenstern, Rosencrantz, Guildenstern, Fortinbras, Captain, Gentleman, Horatio, Ophelia, Laertes, Danes, Servant, Sailor, Messenger
Act 5 | 2 | First Clown, Second Clown, Hamlet, Horatio, Laertes, Priest, Gertrude, King, All, Osric, Lord, Fortinbras, First Ambassador

The characters can be in any order, and you don’t have to worry about the fact that in some acts people speak both separately and together. For example, in our solution, under Act 3 we list separately Rosencrantz, Guildenstern, and Rosencrantz and Guildenstern. In Real Life we’d fix that peculiarity, but for test purposes we’ll ignore it.

Our solution

One solution is:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0" exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>XSLT test</title>
            </head>
            <body>
                <table>
                    <tr>
                        <th>Act</th>
                        <th>Number of Scenes</th>
                        <th>Characters</th>
                    </tr>
                    <xsl:apply-templates select="//body/div"/>
                </table>
            </body>
        </html>
    </xsl:template>
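    <xsl:template match="div">
        <!-- One table row per act -->
        <tr>
            <td>
                <!-- Title of the act, from its <head> child -->
                <xsl:apply-templates select="head"/>
            </td>
            <td>
                <!-- Scenes are the <div> children of the act -->
                <xsl:value-of select="count(div)"/>
            </td>
            <td>
                <!-- Speakers in this act only (note the leading dot), deduplicated -->
                <xsl:value-of select="string-join(distinct-values(.//speaker),', ')"/>
            </td>
        </tr>
    </xsl:template>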
</xsl:stylesheet>

We want a border around our table cells, and the correct way to configure that is to use CSS. The old @border attribute is invalid in HTML5, although <oXygen/> may not catch it. If you did it for the test it’s okay, but in Real Life (and in your projects) you’ll need to use CSS.
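A minimal sketch of the CSS approach, assuming the rules are embedded in a <style> element inside the <head> that the first template writes. The particular border width, style, and color are assumptions made here for illustration; they are not part of the original solution.

```xml
<!-- Replacement for the <head> literal result element in the first template.
     The border values below are illustrative assumptions. -->
<head>
    <title>XSLT test</title>
    <style>
        table { border-collapse: collapse; }
        th, td { border: 1px solid black; }
    </style>
</head>
```

A linked external stylesheet would work equally well; the point is only that the border is specified in CSS rather than with an attribute on the <table> element.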

The first template creates the table, writes the header row, selects all of the acts (<div> elements that are children of <body>), and applies templates to them. The second template matches those <div> elements and generates a new table row for each of them. It retrieves the title of the act from the act’s <head> child and writes it into the leftmost column. For the middle column, the template finds all of the scenes of the current act (<div> children of the act <div>) by looking on the default child axis, and wraps that result in the count() function to get the number of scenes in the act. To output this value you must use <xsl:value-of>, which we discuss in greater detail below.

For the rightmost column, the list of speakers, the template finds all of the <speaker> descendants of the current act with the path .//speaker, where the dot (.) represents the current act and the double slash (//) reaches down through all of its descendants. The most common error in the answers was to start this path with a double slash without the leading dot. In that case the path starts at the top of the entire document and selects all of the speakers in the entire play. As a result, each row lists every speaker in the play rather than only the characters who speak in that act, and the output is identical for all five acts.
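One way to see the difference between the two paths is a small diagnostic stylesheet. This is a sketch that assumes it is run against the same Bad Hamlet file; it is not part of the test solution. For each act it prints the number of distinct speakers found with the relative path and with the absolute path.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- Inside xsl:for-each the current node is one act at a time -->
        <xsl:for-each select="//body/div">
            <xsl:value-of select="head"/>
            <xsl:text>: </xsl:text>
            <!-- Relative path: only the speakers inside the current act -->
            <xsl:value-of select="count(distinct-values(.//speaker))"/>
            <xsl:text> with the dot, </xsl:text>
            <!-- Absolute path: restarts at the top of the document,
                 so it counts the speakers of the entire play -->
            <xsl:value-of select="count(distinct-values(//speaker))"/>
            <xsl:text> without it&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The second number should be the same on every line, because the absolute path ignores which act is currently being processed.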

Once you’ve found the values (.//speaker), you can remove the duplicates by wrapping that result in distinct-values() and then use string-join() to combine them, with a comma and space (, ) in between. To output this value you must again use <xsl:value-of>; the most common error in this situation was to use <xsl:apply-templates> instead. The problem is that in XSLT 2.0, the version we use here, <xsl:apply-templates> can be applied only to nodes, and the result of count(), distinct-values(), or string-join() is not a node. Here’s why:
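The contrast can be sketched with the third cell of the second template. The commented-out instruction is the mistaken version; we would expect an XSLT 2.0 processor to reject it with a type error, although the exact message depends on the processor.

```xml
<td>
    <!-- Mistaken: distinct-values() returns atomic values, not nodes,
         so there is nothing for templates to be applied to:
         <xsl:apply-templates select="distinct-values(.//speaker)"/>
    -->
    <!-- Correct: xsl:value-of writes the atomic value into the output -->
    <xsl:value-of select="string-join(distinct-values(.//speaker), ', ')"/>
</td>
```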

Atomic value? What’s an atomic value?

One type of object in the XML world is an atomic value, which is a fixed piece of information, such as the string Hamlet or the integer 2. Atomic values are not nodes in trees; what’s atomic about them is that they have a particular value that has nothing to do with their context or their role in a particular document. All of the data in our transformation originates in nodes in the tree representation of our play, and we can apply templates to those nodes with <xsl:apply-templates>. We can’t, though, apply templates to the string Hamlet or the integer 2, because templates can be applied only to nodes in a tree (a specific XML document), and not to atomic values. The string Hamlet when it’s the value of a particular <speaker> element might be associated with that <speaker> node, but the six-character string of text that names the protagonist of the play has a meaning that is independent of whether it’s the value of a particular <speaker> element. Much as we can’t apply templates to the abstract six-character string Hamlet, independently of the nodes in the tree representation of the play, we can’t apply templates to the results of count(), string-join(), or distinct-values()—because those functions return atomic values, and not nodes.

How come? The situation with count() is easy. The result of this function is a newly computed integer, which does not exist anywhere in the input XML document. The situation with string-join() is similarly straightforward. Its result is a concatenation of information retrieved from several different nodes and stitched together. There never was a node in the input XML document that contained the string-joined string, and since it was never a node, we can’t apply templates to it.

The situation with distinct-values() is a little subtler, but nonetheless clear. Suppose Hamlet speaks twice in Act 2, so that the act (the <div> element) contains, somewhere on its descendant axis, two instances of <speaker>Hamlet</speaker>. If we apply distinct-values(), one of those is kept and one is ignored, but which? In XPath terms the answer is implementation-dependent, a technical term meaning that an XPath processor may keep whichever instance it wants, as long as at the end there is no duplication. This means that distinct-values() divorces the content from the elements that were the original source of that information, and that, in turn, means that we can’t tell which instance of Hamlet we’re dealing with. XPath resolves that uncertainty by saying that it isn’t either instance; it’s the atomic value (the string value) that corresponds equally well to either node. And that, in turn, means that we can’t apply templates to the results of distinct-values(), because whatever the result is, it isn’t a node in the tree.
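The difference between the nodes and the values derived from them can be checked directly. The sketch below uses a hypothetical miniature act stored in a variable (three invented <speaker> elements in no namespace, not taken from Bad Hamlet), so it can be run against any XML input file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="text"/>
    <!-- Hypothetical sample data for illustration only -->
    <xsl:variable name="act">
        <div>
            <speaker>Hamlet</speaker>
            <speaker>Polonius</speaker>
            <speaker>Hamlet</speaker>
        </div>
    </xsl:variable>
    <xsl:template match="/">
        <!-- Three <speaker> nodes in the tree: prints 3 -->
        <xsl:value-of select="count($act//speaker)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Two distinct values after deduplication: prints 2 -->
        <xsl:value-of select="count(distinct-values($act//speaker))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- The deduplicated items are atomic values: prints true -->
        <xsl:value-of
            select="distinct-values($act//speaker) instance of xs:anyAtomicType+"/>
        <xsl:text>&#10;</xsl:text>
        <!-- They are not nodes: prints false -->
        <xsl:value-of select="distinct-values($act//speaker) instance of node()+"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

Neither of the two Hamlet elements survives as a node; what remains is a single value that no longer belongs to either of them.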

So how do we get the value? We use <xsl:value-of>. <xsl:value-of> is the correct way to output an atomic value, whether it is the result of count(), distinct-values(), or string-join(), or of another function such as translate() (which returns a string) or string-length() (which returns an integer).
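As a brief illustration with the two additional functions, the following sketch operates on the literal string Hamlet instead of on the play, so it can be run against any XML input file. The choice of string and of characters to translate is an assumption made here for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- string-length() returns an integer: prints 6 -->
        <xsl:value-of select="string-length('Hamlet')"/>
        <xsl:text>&#10;</xsl:text>
        <!-- translate() returns a string; here it uppercases the vowels:
             prints HAmlEt -->
        <xsl:value-of select="translate('Hamlet', 'aeiou', 'AEIOU')"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```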