Digital humanities


Author: Janis Chinn (janis.chinn@gmail.com) Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-02-21T16:08:44+0000


Introduction to XSLT

The basics

You already know how to mark up (XML), constrain (Relax NG), and navigate (XPath) your documents; XSLT (eXtensible Stylesheet Language Transformations) is one way to transform your document, manipulate the tree, and output the results as XML, HTML, SVG, or plain text. You might use XSLT to generate project pages for display on your site, to generate intermediary pages for analysis and development, or to feed pieces of your data into another format for analysis with another tool, one that requires data in a particular format that is different from your main XML structure. Since XSLT is XML-aware, it uses XPath to navigate and manipulate your document, which means that when you use XSLT to implement a transformation (see below), you automatically use XPath within XSLT to find the pieces you want to transform (XPath expressions and XPath patterns) and to manipulate the data (XPath expressions).

An XSLT stylesheet is an XML document that must be valid against the XSLT schema. The root element is <xsl:stylesheet> and the elements inside the root are primarily <xsl:template> elements (also called template rules). These template elements typically have a @match attribute whose value is an XPath pattern; the attribute instructs the processor to use that template to process all matching nodes. For example, a template rule that matches <p> elements will be used to process <p> elements in the input document.

XSLT is a declarative programming language (unlike most programming languages with which you are likely to be familiar), which means, in part, that the templates in a stylesheet are not applied from top to bottom. Instead, program execution passes from template to template because an <xsl:apply-templates> element inside a template rule tells the system what to process next. One consequence of this model is that the order of template rules inside the stylesheet doesn’t matter, because they are not applied in the order in which they appear in the stylesheet. Rather, they get applied whenever an <xsl:apply-templates> element or the equivalent specifies that a particular type of node must be processed. When that happens, for each node selected for processing, if there is a template anywhere in the stylesheet that matches it, the processor will find that template and it will fire.
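
As a minimal sketch of this behavior, the stylesheet below lists its two template rules in the "wrong" order on purpose. It assumes a hypothetical input document whose root element is <doc> and whose children are <p> elements; the output element names <document> and <para> are likewise invented for illustration. The rule for <doc> fires first, even though it appears second, because processing reaches the <doc> element before it reaches any <p>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <!-- Listed first, but fires only when a <p> is reached during processing -->
    <xsl:template match="p">
        <!-- <para> is a hypothetical output element -->
        <para>
            <xsl:apply-templates/>
        </para>
    </xsl:template>
    <!-- Listed second, but fires earlier, because <doc> is the root element -->
    <xsl:template match="doc">
        <!-- <document> is a hypothetical output element -->
        <document>
            <!-- Hands control to whichever templates match the children of <doc> -->
            <xsl:apply-templates/>
        </document>
    </xsl:template>
</xsl:stylesheet>
```

Swapping the two rules should make no difference to the result.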

XSLT builds in default rules to handle nodes for which there is no explicit template rule, which means that you have to write your own template rules only where you want something other than the default behavior. The default behavior is that if you try to apply templates to an element for which you haven’t created an explicit template, the system will pass silently into that element and apply templates to its children, until eventually the only thing left is to output the text. For that reason, if your stylesheet contains no templates at all, applying the stylesheet to the document will output all the plain text in your XML, without any markup; the default behavior will navigate from the document node at the top of the tree all the way down, outputting text whenever it encounters it. (This is rarely what you want!)

A typical stylesheet has the following exoskeleton, which <oXygen/> will generate for you when you create a new XSLT document:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    
</xsl:stylesheet>

For most purposes, you’ll want to be sure the @version attribute is set to 2.0, which should be the default behavior in <oXygen/> (and you can make it the default if it isn’t already).

Remember that running an XSLT stylesheet that contains no template rules on your XML will essentially strip out all your markup and output plain text. This is rarely what you want. Typically you’ll need to add at least one template rule to generate useful output.

Namespaces

The input namespace

If your XML document is in a namespace, you’ll need to tell your stylesheet about the namespace in order to process it with XSLT. To do this, add an @xpath-default-namespace attribute to the root <xsl:stylesheet> element and set its value to the value of the namespace declaration from the input XML file. For example, if you are transforming a TEI XML document with the following namespace declaration:

<TEI xmlns="http://www.tei-c.org/ns/1.0">

the root <TEI> element states that all elements within the document are in the TEI namespace (unless you explicitly say otherwise). If you were to write a template rule in your XSLT matching just TEI, it wouldn’t be applied, because the system would be looking for <TEI> elements in no namespace, whereas the XML declares that the <TEI> element is in the http://www.tei-c.org/ns/1.0 namespace. (Generally, if you run a transformation where you have template rules, none of them gets applied, and you just get plain text in the output [as if you had had no template rules], it’s because of mismatched namespaces. In that situation, no template rules are being applied because they only match elements in no namespace and all of the elements in your input XML are in a namespace, which means that the transformation falls back on the default behavior described above.) To tell your stylesheet always to look for elements in the TEI namespace, our <xsl:stylesheet> element should look something like this:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xpath-default-namespace="http://www.tei-c.org/ns/1.0">

Should you have input in mixed namespaces (perhaps a TEI document in the TEI namespace that contains embedded SVG in the SVG namespace), see your instructors for guidance about how to deal with it.

The output namespace

The @xpath-default-namespace attribute specifies the namespace of the input XML. If your output is going to be in a namespace (for example, if you are outputting HTML, which must be in the HTML namespace), you also need to specify the output namespace. When outputting HTML, the namespace declaration is http://www.w3.org/1999/xhtml, so if you are transforming TEI to HTML, your root <xsl:stylesheet> element should read:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">

The xmlns attribute (the default namespace declaration) says that the default namespace for all elements that you are creating in your output document is the HTML namespace. The @xpath-default-namespace attribute says that the namespace for all elements in your input document is the TEI namespace. If your input or output is in no namespace, you should omit the corresponding declaration, and if they are in other namespaces, you’ll need to use the appropriate namespace values.

Controlling your output with <xsl:output>

You should always have an <xsl:output> element to control the type and formatting of your output. <xsl:output> is a top-level element, which means it must be a child of the root <xsl:stylesheet> element (making it a sibling to all your template rules, which are also top-level elements). <xsl:output> is usually placed at the top of the document, as a first child of the root <xsl:stylesheet> element, because that makes it easier for humans to find, but as long as it is a child (not a grandchild or other descendant) of the root element, your document will be valid. Officially, <xsl:output> is an optional element, which means that if it’s omitted you won’t get an error message, and the system will try to guess the kind of output you want, which can lead to errors if it guesses wrong. At minimum, <xsl:output> should have a @method attribute. You may also need to set a value for the optional @indent and @doctype-system attributes. Here are some guidelines:

For HTML5, then, putting it all together, you should use:

<xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>

Telling templates when to fire by using the @match attribute with an XPath pattern

Except in situations you are unlikely to encounter in this course, <xsl:template> requires the attribute @match, which matches an XPath pattern. An XPath pattern is not the same as a full XPath expression; it is just a piece of one, the minimum XPath needed to describe what you want to match. For example, to match all <p> elements in the document, write match="p" instead of match="//p". In other words, templates don’t specify where to look for the elements they match because they sit around waiting for the elements to come to them (courtesy of <xsl:apply-templates> or built-in processing rules), and for that reason they only have to describe what it is that they match, and not how or where to find it.

With that said, by varying the completeness of the pattern, you can get more or less specific about how to handle, say, <p> elements in different parts of the XML tree. If you want to treat <p> elements inside a <chapter> differently from <p> elements inside an <introduction>, you can create separate templates that match chapter/p and introduction/p, with as little context as you can get away with to specify the difference. But you don’t need (= shouldn’t have) a full path; your XPath pattern must be the simplest pattern that will match what you want to match. Most of your stylesheets will consist of <xsl:template> elements for each type of element that might arise in your input document (unless the built-in behavior, described above, which applies if there is no template, already does what you want, in which case you should not create an explicit template just to mimic that behavior).
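
The chapter/p and introduction/p case might be sketched as follows. The element names <chapter> and <introduction> come from the paragraph above; the class value "intro" is a hypothetical hook for styling and is not part of the original example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml">
    <!-- Matches only <p> elements whose parent is <chapter> -->
    <xsl:template match="chapter/p">
        <p>
            <xsl:apply-templates/>
        </p>
    </xsl:template>
    <!-- Matches only <p> elements whose parent is <introduction> -->
    <xsl:template match="introduction/p">
        <!-- The class value "intro" is a hypothetical hook for CSS styling -->
        <p class="intro">
            <xsl:apply-templates/>
        </p>
    </xsl:template>
</xsl:stylesheet>
```

Note that neither pattern begins with // or spells out the full path from the root; one step of context is enough to tell the two kinds of paragraph apart.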

Most (if not all) stylesheets you’ll write in this course will begin functionally with a template matching the document node, which is the (generally invisible) parent of the root element and is the uppermost node in the hierarchy of every XML document. When an XSLT stylesheet is applied to an XML document, the system always starts at the document node when looking for templates to apply. To match the document node, use the XPath pattern /. Any instructions that should fire only once to create the superstructure for your output will typically be created inside this template, and you’ll need at least one <xsl:apply-templates> element in order to interact with the lower branches of your tree. If you’re planning on outputting HTML, the template that matches the document node is the place to create your HTML superstructure, and within this superstructure you’ll want to include, typically, an <xsl:apply-templates> element that tells the processor how to build the HTML output inside that superstructure. For example, a typical XML-to-HTML transformation might start with code like:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Title goes here</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

This matches the document node, creates the HTML superstructure that will go into the output, and then, inside the HTML <body> element, applies templates to the children of the document node (by default, <xsl:apply-templates> means apply templates to the children of the node currently being processed). The only child element of the document node is always the root element of your input XML. Your stylesheet will also include other templates that specify what to do with the various elements of your input XML (see below).

Think of your <xsl:apply-templates> elements as place-holders that mark where to output the results of applying the templates they call. For example, any content you want to appear immediately inside the HTML <body> element that you’re creating can be placed correctly by putting the <xsl:apply-templates> element between the <body> start and end tags.

XPath expressions vs XPath patterns

One common source of confusion for new XSLT coders involves the difference between <xsl:template match="XPath pattern"> and <xsl:apply-templates select="XPath expression"/>. The terms are unfortunately similar, but here is how they work:

Processing something other than immediate child nodes

By default, <xsl:apply-templates> means apply templates to all child nodes (elements and text) of the current context, that is, the node currently being processed. You are not restricted to processing only child nodes, though; <xsl:apply-templates> optionally takes a @select attribute, which tells the system what nodes to apply templates to. The value of @select is a full XPath expression; a relative path is evaluated from the current context, that is, from whatever node is being processed at the time, while a path that begins with / or // starts from the document node. For example, if you are transforming TEI to HTML and the only XML you want to process is in the <teiHeader>, you can replace the general <xsl:apply-templates/> with <xsl:apply-templates select="//teiHeader"/>. If @select is omitted, the system will default to applying templates to all children (not all descendants) of the current node. This behavior means that you don’t need to specify @select if what you want to select is all of the children of the current context.
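
A minimal sketch of the <teiHeader> case, assuming TEI input and HTML output as in the namespace examples earlier; the title text is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Title goes here</title>
            </head>
            <body>
                <!-- Process only the <teiHeader>; the rest of the input is never visited -->
                <xsl:apply-templates select="//teiHeader"/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```

Because this stylesheet has no template for <teiHeader> or its contents, the built-in rules apply, so the sketch as written would output only the text of the header inside <body>.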

<xsl:apply-templates> is usually an empty element, but you may include <xsl:sort> between separate start and end tags to sort the nodes you’re applying templates to. By adding @select and @order to <xsl:sort> (see Michael Kay for details), you can specify what to sort by (the default is the textual value of the element, but you can override that) and whether to sort in ascending or descending order (the default is ascending).
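
For instance, a sort might look like the following sketch. The element names <cast>, <character>, and <name> are hypothetical and stand in for whatever your own markup uses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <!-- <cast>, <character>, and <name> are hypothetical element names -->
    <xsl:template match="cast">
        <!-- apply-templates has separate start and end tags so it can hold xsl:sort -->
        <xsl:apply-templates select="character">
            <!-- Sort by the text of the <name> child, in descending order -->
            <xsl:sort select="name" order="descending"/>
        </xsl:apply-templates>
    </xsl:template>
</xsl:stylesheet>
```

Omitting @order would give ascending order, and omitting @select would sort by the textual value of each <character> element itself.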

Any elements you want to handle specially (that is, for which the built-in behavior is not what you want) will need their own template rules. Remember, though, that templates fire every time the system encounters a matching node in the XML, so if you want an element to be created once (for instance, the <html> element), it should go within a template that matches a node that only appears once (for instance, /). If you’re generating HTML <p> elements, on the other hand, you’ll need those to be inside a template that will fire many times because you want to generate many <p> elements, not one giant <p> element which contains the text of all the paragraphs. Similarly, if you are creating an HTML table with a lot of rows, you typically want only one table, so you should create that directly inside the <body> element and then create the <tr> elements for the rows in a template that fires once for each row you want to create. If, say, you want to create one table row for each <character> element in your input, your XSLT will probably look something like:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0"
    xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Title goes here</title>
            </head>
            <body>
                <table>
                    <tr>
                        <!-- header row with <th> elements to label the columns -->
                    </tr>
                    <xsl:apply-templates select="//character"/>
                </table>
            </body>
        </html>
    </xsl:template>
    <xsl:template match="character">
        <tr>
            <!-- apply other templates to create the cells in the table row
                 for a particular character-->
        </tr>
    </xsl:template>
</xsl:stylesheet>

Note that you create only one <table>, so you do that inside a template that fires just once, the template that matches the document node. You create one row for each character, though, so you create your <tr> elements inside a template that matches <character> elements, and therefore fires once for each <character> element. The only <tr> that gets created inside the template rule for the document node is the one that labels the columns.
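
One possible way to fill in the row template is sketched below. It assumes, purely for illustration, that each <character> element has <name> and <role> children; neither name comes from the example above, so substitute the elements your own XML actually contains.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml">
    <!-- Fires once per <character>, so it creates one table row each time -->
    <xsl:template match="character">
        <tr>
            <!-- <name> and <role> are hypothetical children of <character> -->
            <td><xsl:apply-templates select="name"/></td>
            <td><xsl:apply-templates select="role"/></td>
        </tr>
    </xsl:template>
</xsl:stylesheet>
```

Each <td> is created inside the row template because you want one set of cells per character, just as you want one <tr> per character.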

<xsl:apply-templates> vs. <xsl:value-of>

Sometimes you get the content of an element or attribute by applying templates to it and sometimes you use <xsl:value-of>. The difference between <xsl:apply-templates> and <xsl:value-of> is that <xsl:value-of> can output only plain text: the textual content of a node (throwing away any markup), or the string form of an atomic value, such as the result of many functions (an atomic value is essentially any value that isn’t a node, such as a string or an integer). The result of <xsl:value-of> is always plain text (technically, a text node), and it represents a dead end in the XML tree insofar as it cannot contain markup, which means that you cannot apply templates to any part of it. If, for example, you are processing a paragraph node tagged as <p>, <xsl:value-of select="."/> will return the textual value of the paragraph, throwing away any internal markup. If you want to process that internal markup (for example, if the paragraph contains titles or foreign words or emphasis or anything that should be processed separately), <xsl:value-of> will make it impossible to process those elements, and all you’ll get is their textual content, as if they weren’t marked up in the first place. If, on the other hand, the paragraph has no internal markup, there is normally no difference between the output of <xsl:apply-templates> and that of <xsl:value-of>. As a rule of thumb:

Both <xsl:apply-templates> and <xsl:value-of> can take a @select attribute to specify what should be processed. That attribute is optional with <xsl:apply-templates> (if you don’t use @select, you will apply templates to the child nodes of the current context, whatever they may be), but you should always supply @select with <xsl:value-of> (XSLT 1.0 requires it; XSLT 2.0 alternatively allows the value to be constructed by content placed inside the element, but in that case the element cannot be empty). (The <xsl:value-of> element optionally also accepts a @separator attribute, which allows you to specify a separating string to use when <xsl:value-of> outputs a sequence of values. The result is similar to the specification of a separator in the string-join() function.) If you’re curious, you can read more about the differences between <xsl:apply-templates> and <xsl:value-of> in our guide to advanced XSLT features.
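
The contrast can be sketched in a single template. The element names <book>, <title>, and <author> are assumptions made for illustration, as is the choice of an HTML <li> for the output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns="http://www.w3.org/1999/xhtml">
    <!-- <book>, <title>, and <author> are hypothetical input elements -->
    <xsl:template match="book">
        <li>
            <!-- apply-templates: markup inside <title> can still be handled by other templates -->
            <xsl:apply-templates select="title"/>
            <xsl:text> by </xsl:text>
            <!-- value-of: text only; with several <author> children, the values are joined by ", " -->
            <xsl:value-of select="author" separator=", "/>
        </li>
    </xsl:template>
</xsl:stylesheet>
```

If <author> could itself contain markup that you want to process, <xsl:apply-templates select="author"/> would be the safer choice, at the cost of handling the separators yourself.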

White space

As you know, white space is generally normalized automatically when processing XML documents. But what if you need to preserve the white space from your original document in your transformation? How do you distinguish that situation from one where there’s extra white space in your XML document because it was pretty-printed (lines wrapped and extra spaces used for indentation), and the white space isn’t meaningful and shouldn’t be retained? Although these cases aren’t common, when they do come up they are critical to correctly outputting your document. To resolve them you’ll want to use <xsl:preserve-space>, <xsl:strip-space>, or a combination of the two. These are both top-level elements (children of the root <xsl:stylesheet> element) that take the attribute @elements, the value of which is a white-space-separated list of the names of the elements whose white space you want to preserve or strip out. If you want to affect all the elements in the document, you can set the value of the @elements attribute to *. Typically, XSLT will do what you expect and you won’t need to use these elements at all. If a problem arises, though, you can use <xsl:preserve-space> or <xsl:strip-space> to override the default behavior and control the processing manually.
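
A minimal sketch of the two declarations used together follows; <poem> and <line> are hypothetical element names. Keep in mind that both declarations affect only text nodes in the input that consist entirely of white space (such as the line breaks and indentation between elements in a pretty-printed file), not white space that sits next to other text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
    <!-- Discard white-space-only text nodes throughout the input ... -->
    <xsl:strip-space elements="*"/>
    <!-- ... except those that are children of <poem> or <line> (hypothetical names) -->
    <xsl:preserve-space elements="poem line"/>
</xsl:stylesheet>
```

The specific element names take precedence over the * wildcard, which is why the second declaration can carve out exceptions to the first.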

Outputting mixed content

XSLT usually does The Right Thing when it is outputting just elements or just plain text, but mixed-content output (that is, a mixture of elements and plain text) can lead to awkward white-space handling. You can avoid having to worry about the intricacies of XSLT white-space handling by applying the following rule of thumb: when you are outputting mixed content, wrap all plain text in <xsl:text> tags. For example, instead of writing:

<xsl:template match="book">
    <item>
        <cite>
            <xsl:apply-templates select="title"/>
        </cite>
        by 
        <xsl:apply-templates select="author"/>
    </item>
</xsl:template>

you should use:

<xsl:template match="book">
    <item>
        <cite>
            <xsl:apply-templates select="title"/>
        </cite>
        <xsl:text> by </xsl:text>
        <xsl:apply-templates select="author"/>
    </item>
</xsl:template>

Putting it all together

By way of illustrating a complete transformation, here are a sample XML document (whose content you may recognize from the first week of class) and a sample XSLT stylesheet to transform the XML into HTML for publication on the web.

<letter>
    <head>
        <context>The following letter was written shortly after Wilde’s 
        release from prison:</context>
    </head>
    <content>
        <dateline>
            <location>Rouen</location>,
            <date>
                <month>August</month>
                <year>1897</year>
            </date>
        </dateline>
        <salutation><person type="recipient">My own Darling Boy</person>,</salutation>
        <body>
            <p>I got your telegram half an hour ago, and just send a line to say that 
                I feel that my only hope of again doing beautiful work in art is being 
                with you. It was not so in the old days, but now it is different, and you 
                can really  recreate in me that energy and sense of joyous power on which 
                art depends.</p>
            <p>Everyone is furious with me for going back to you, but they don’t 
                understand us. I feel that it is only with you that I can do anything at 
                all. Do remake my ruined life for me, and then our friendship and love 
                will have a different meaning to the world.</p>
            <p>I wish that when we met at <location>Rouen</location> we had not parted at 
                all. There are such wide abysses now of space and land between us. But we 
                love each other.</p>
        </body>
        <valediction>Goodnight, dear. Ever yours, <person type="sender">Oscar</person>
        </valediction>
    </content>
</letter>

The XML is pretty straightforward. The root element, <letter>, has two children, a <head> and a <content> element, and the latter contains the dateline, salutation, body, and valediction of the letter. Locations within the text are tagged, but for the sake of simplicity and brevity, the sender and recipient are tagged only in the salutation and valediction, as <person> elements. (That is, the personal pronouns that refer to them in the body of the letter are not tagged.)

Our sample output will be an HTML document that does not include any information from the <head> element; it outputs our paragraphs as HTML paragraphs and italicizes all persons and locations. The result of the transformation can be seen below:

Rouen, August 1897

My own Darling Boy,

I got your telegram half an hour ago, and just send a line to say that I feel that my only hope of again doing beautiful work in art is being with you. It was not so in the old days, but now it is different, and you can really recreate in me that energy and sense of joyous power on which art depends.

Everyone is furious with me for going back to you, but they don’t understand us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for me, and then our friendship and love will have a different meaning to the world.

I wish that when we met at Rouen we had not parted at all. There are such wide abysses now of space and land between us. But we love each other.

Goodnight, dear. Ever yours, Oscar

This is relatively simple to accomplish. The stylesheet is included below, followed by a discussion of how it works:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Oscar Wilde Letter 2</title>
            </head>
            <body>
                <xsl:apply-templates select="//content"/>
            </body>
        </html>
    </xsl:template>
    <xsl:template match="dateline">
        <h4>
            <xsl:apply-templates/>
        </h4>
    </xsl:template>
    <xsl:template match="location|person">
        <em>
            <xsl:apply-templates/>
        </em>
    </xsl:template>
    <xsl:template match="p|salutation|valediction">
        <p>
            <xsl:apply-templates/>
        </p>
    </xsl:template>
</xsl:stylesheet>

Lines 1–3 are created by <oXygen/> when you tell it to create a new XSLT stylesheet. The only part that we’ve added is the HTML namespace declaration on line 2, so that all output will be in the HTML namespace:

    xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema"

Line 4 tells the system what type of document we’re outputting: an HTML5 document with indenting. Lines 6–13 set up our HTML superstructure (we’ve added a <title>, which will show up in the browser tab, but not in the browser window), populating our <body> element with the results of applying templates to all <content> elements wherever they appear. There’s only one <content> element, and no template for <content>, so the system falls back on the default behavior and applies templates to all of its children. (Note that we never apply templates to <head> or <context>, so they will not be output at all in our result document.)

The children of <content> are <dateline>, <salutation>, <body>, and <valediction>, and we have templates for all of those except <body>. That means that we’re relying on the default behavior for <body>, which is, again, to apply templates to its children. The template for <dateline>, on lines 15–19, outputs the contents of the element inside an HTML <h4> element. There’s no @select attribute on the <xsl:apply-templates> here, so the system will apply templates to all children of the element (ignoring text nodes that contain nothing but white space, there are three: the <location> element, the text() node after it that contains a comma and some white space, and the <date> element). We don’t have template rules for the second and third of these, so the built-in rules will take care of them: the text node is copied to the output, and for <date> the system applies templates to its children until only text is left to output. The <location> element is processed by the template on lines 20–24, which outputs the content wrapped in an <em> element (typically rendered as italics in the browser).

The template on lines 25–29 actually covers three different elements: <p> elements, <salutation> elements, and <valediction> elements. For all three, it outputs the contents inside an HTML <p> element. This way each <p>, <salutation>, and <valediction> element in the input XML will become an HTML paragraph in our output. Since we again applied templates without a @select attribute, we again revert to the default behavior of applying templates to all child nodes (elements and text) of any <p>, <salutation>, or <valediction> element. Finally, the template on lines 20–24, which we mentioned earlier, will tag the contents of any <location> or <person> element as an HTML <em> element, normally causing it to be italicized in the browser.