Introduction to XSLT

You already know how to mark up (XML), constrain (Relax NG), and navigate (XPath) your documents; XSLT (eXtensible Stylesheet Language Transformations) is one way to transform a document, manipulate the tree, and output the results as XML, HTML, SVG, or plain text. You might use XSLT to generate project pages for display on your site, to create intermediary documents for analysis and development, or to feed pieces of your data into another format for analysis with another tool, one that requires data in a particular format that is different from your main XML structure. Since XSLT is XML-aware, it can use XPath to navigate and manipulate your document, which means that when you use XSLT to implement a transformation (see below), you automatically use XPath within XSLT to find the pieces you want to transform (XPath expressions and XPath patterns) and to manipulate the data (XPath expressions).

An XSLT stylesheet is an XML document that must be valid against the XSLT schema. The root element in this schema is <xsl:stylesheet> and the children of the root are primarily <xsl:template> elements. These template elements typically have a @match attribute whose value is an XPath pattern; the pattern tells the processor to use that template to process all matching nodes. For example, a template that matches <p> elements will be used to process <p> elements in the input document.

XSLT is a declarative programming language (unlike most programming languages with which you are likely to be familiar), which means, among other things, that templates are not applied in order from the top of the file to the bottom. Instead, program execution passes from template to template because an <xsl:apply-templates> element inside a template rule tells the system what to process next. One consequence of this model is that the order of template rules inside the stylesheet doesn’t matter. Templates are applied whenever an <xsl:apply-templates> element or the equivalent specifies that a particular type of node must be processed. As a result, for every element or other item in your input document that is specified as the target of an <xsl:apply-templates> element, if there is a template anywhere in the stylesheet that matches it, the processor will find it and the template will fire.

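XSLT builds in default rules to handle nodes for which there is no explicit template rule, which means that you have to write your own template rules only where you want something other than the default behavior. The default behavior is that the processor applies templates to the children of the document node and of every element, and outputs the content of every text node it encounters.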

For that reason, if your stylesheet contains no templates at all, the built-in templates will do all of the processing. This means that applying a stylesheet with no user-defined templates to a document will output all the plain text in the XML, without any markup. The default behavior will navigate from the document node at the top of the tree all the way down, throwing away markup and outputting text whenever it encounters it. This is rarely what you want; you will normally want to create at least one template rule in your stylesheet.

A typical stylesheet has the following exoskeleton, which <oXygen/> will generate for you when you create a new XSLT document:
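A minimal sketch of such a skeleton follows. The exact namespace declarations that <oXygen/> generates may differ between versions, so treat the xs and math declarations here as assumptions rather than as a guaranteed default.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <!-- Top-level elements such as <xsl:output> and <xsl:template> go here -->

</xsl:stylesheet>
```

Because this skeleton contains no template rules of its own, running it against an XML document relies entirely on the built-in rules.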

The @version attribute needs to be set to 3.0, which should be the default behavior in <oXygen/> (and you can make it the default if it isn’t already).

Namespaces

The input namespace

If your XML document is in a namespace, you’ll need to tell your stylesheet about the namespace in order to process it with XSLT. To do this, add an @xpath-default-namespace attribute to the root <xsl:stylesheet> element and set its value to the value of the namespace declaration from the input XML file. For example, if you are transforming a TEI XML document whose root element carries the namespace declaration <TEI xmlns="http://www.tei-c.org/ns/1.0">, the root <TEI> element states that all elements within the document are in the TEI namespace (unless you explicitly say otherwise on some descendant element). If you were to write a template rule in your XSLT matching just TEI and you hadn’t told the system that input elements were in the TEI namespace, your template wouldn’t be applied because it would match <TEI> elements only if they are in no namespace, whereas the XML declares that the <TEI> element is in the http://www.tei-c.org/ns/1.0 namespace.

If you run a transformation where you have template rules but none of them gets applied, so that you just get plain text in the output, it’s often because of mismatched namespaces. In that situation, no template rules are being applied because they only match elements in no namespace and all of the elements in your input XML are in a namespace, which means that the transformation falls back on the default behavior described above.

To tell your stylesheet always to look for elements in the TEI namespace, the <xsl:stylesheet> element should look something like this:
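A minimal sketch, assuming the rest of the root element follows the usual generated skeleton. The single template rule is included only to show that an unprefixed name in a pattern now refers to the TEI namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    version="3.0">
    <!-- With xpath-default-namespace set, "TEI" here means <TEI> in the TEI namespace -->
    <xsl:template match="TEI">
        <xsl:apply-templates/>
    </xsl:template>
</xsl:stylesheet>
```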

In the example above the @xpath-default-namespace attribute has a value equal to the TEI namespace value declared on your XML. If your input XML is in some other namespace, the value of @xpath-default-namespace should match that. If your input XML is not in a namespace, do not include the @xpath-default-namespace attribute because its presence will cause all templates to match only elements in the declared namespace, and it will prevent them from matching elements in no namespace.

Should you have input in mixed namespaces (perhaps a TEI document in the TEI namespace that contains embedded SVG in the SVG namespace), see your instructors for guidance about how to deal with it.

The output namespace

The @xpath-default-namespace attribute specifies the namespace of the input XML. If your output is going to be in a namespace (for example, if you are outputting HTML, which must be in the HTML namespace), you also need to specify the output namespace. When outputting HTML, the namespace declaration is http://www.w3.org/1999/xhtml, so if you are transforming TEI to HTML, your root <xsl:stylesheet> element must read:
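A minimal sketch of such a root element, assuming the same skeleton as before. It is laid out so that the input namespace falls on line 5 and the output namespace on line 6; the other attributes are assumptions about what the generated skeleton contains.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <!-- Template rules go here -->
</xsl:stylesheet>
```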

Line 6 says that the default namespace for all literal elements that you are creating in your output document is the HTML namespace. Line 5, copied from the example above, says that the namespace for all elements in your input document is the TEI namespace. If your input or output are in no namespace you must omit these declarations, and if they are in other namespaces, you’ll need to use the appropriate namespace values.

Controlling your output with <xsl:output>

You should always have an <xsl:output> element to control the type and formatting of your output. <xsl:output> is a top-level element, which means it must be a child of the root <xsl:stylesheet> element (making it a sibling to all your template rules, which are also top-level elements). <xsl:output> is usually placed at the top of the document, as a first child of the root <xsl:stylesheet> element, because that makes it easier for humans to find, but as long as it is a child (not a grandchild or other descendant) of the root element, your document will be valid. Officially, <xsl:output> is an optional element, which means that if it’s omitted you won’t get an error message, and the system will try to guess the kind of output you want, which can lead to errors if it guesses wrong. At minimum, <xsl:output> should have a @method attribute. When we create HTML output, we also set some additional attributes (see below for an example). Here are some guidelines:
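One way to follow this advice for HTML output is sketched below. All five attributes are standard serialization attributes in XSLT 3.0, but the particular values are an assumption about what suits a typical project, not the only reasonable choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <!-- method="xhtml" with html-version="5" requests HTML5 written in XML syntax -->
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <!-- Template rules go here -->
</xsl:stylesheet>
```

For other kinds of output you would change @method, for example to xml or text.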

Telling templates when to fire by using the @match attribute with an XPath pattern

Except in situations you are unlikely to encounter in our course, <xsl:template> requires the attribute @match, whose value is an XPath pattern. An XPath pattern is not the same as a full XPath expression; it is just a piece of one, the minimum XPath needed to describe what you want to match. For example, to match all <p> elements in the document, write match="p" instead of match="//p". In other words, templates don’t specify where to look for the elements they match; they don’t have to do that because they sit around waiting for the elements to come to them (courtesy of <xsl:apply-templates> or other rules). For that reason they only have to describe what it is that they match, and not how or where to find it.

More about XPath path expressions and XPath patterns

An XPath path expression, which is what we have been practicing in our XPath unit, is evaluated from a current context. In our XPath explorations in <oXygen/>, the current context is the last place we clicked inside the document, and most of the time we’ve been ignoring that and instead beginning our path expressions with a slash, which means no matter what the current context, start the path at the document node. The advantage of always starting at the document node is that we don’t have to think about the current context, but that works only for our exploratory data analysis, and it won’t work in an XSLT environment.

A common mistake is to write match="//p" instead of match="p". The reason this is a mistake, even though it happens to work, is that it makes your code harder to read because the leading double slash does not affect the meaning, yet its presence implies that it must be there for a reason. All you want to specify in the value of a @match attribute, or any XPath pattern, is enough to identify unambiguously the nodes to which you want the template to apply. If we were transforming our TEI version of Hamlet, for example, templates that apply only to acts could include match="body/div", templates that apply only to scenes could include match="div/div", and templates that apply to both acts and scenes, but not to <div> elements in the header, could include match="body//div".
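The act and scene patterns can be sketched as two separate template rules. This assumes that acts are <div> children of <body> and scenes are <div> children of acts; the HTML <section> elements and their class values are hypothetical output choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <!-- Acts: <div> elements whose parent is <body> -->
    <xsl:template match="body/div">
        <section class="act">
            <xsl:apply-templates/>
        </section>
    </xsl:template>
    <!-- Scenes: <div> elements whose parent is another <div> -->
    <xsl:template match="div/div">
        <section class="scene">
            <xsl:apply-templates/>
        </section>
    </xsl:template>
</xsl:stylesheet>
```

Under that assumption each <div> matches exactly one of the two patterns, so the rules never compete. A rule with match="body//div" is left out of this sketch because it would match the same elements as both rules above.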

How to read XPath expressions and XPath patterns

We find it most helpful to read XPath expressions from the left, path step by path step, because each step specifies the current context(s) for the next step. An XPath expression like //body/div, then, means start at the document node, find all <body> elements on its descendant axis, and then, for each <body> element, find all <div> elements on its child axis.

We find it most helpful to read XPath patterns from the right. For example, an XPath pattern like body/div means find all <div> elements that are children of <body> elements. Reading from the right helps us avoid thinking that we have to navigate to the leftmost component of the pattern first. We don’t have to do that because XPath patterns match, but they don’t traverse.

When to use XPath expressions and when to use XPath patterns in XSLT

Where we use XPath expressions and where we use XPath patterns is specified by the languages that use XPath, and is not up to us. In XSLT, the value of the @match attribute is defined as an XPath pattern, and the value of the @select attribute on <xsl:apply-templates> (see below) is defined as an XPath expression, for which the current context item is the item that the @match attribute matched. If @match matches multiple items (for example, if it matches acts and there are multiple acts in a play), the rule fires once for each of them, and only one of them will be the current context item at a given moment in the transformation process.

As the examples above illustrate, by varying the completeness of the pattern, you can get more or less specific about how to handle, say, <p> elements in different parts of the XML tree. If you want to treat <p> elements inside a <chapter> differently from <p> elements inside an <introduction>, you can create separate templates with match="chapter/p" and match="introduction/p", with as little context as you can get away with to specify the difference. But you don’t need (= shouldn’t have) a full path; your XPath pattern must be the simplest pattern that will match what you want to match. Most of your stylesheets will consist of <xsl:template> elements for each type of element that might arise in your input document (unless the built-in behavior, described above, which applies if there is no template, already does what you want, in which case you should not create an explicit template just to mimic that behavior).

Your first template

Most (if not all) stylesheets you’ll write in this course will begin with a template matching the document node, which is both the (generally invisible) parent of the root element and the uppermost node in the hierarchy of every XML document. When an XSLT stylesheet is applied to an XML document, the system always starts at the document node when looking for templates to apply. To match the document node, use the XPath pattern /. Any instructions that should fire only once to create the superstructure for your output will typically be created inside this template, and you’ll need at least one <xsl:apply-templates> element in order to interact with the lower branches of your tree. If you’re planning on outputting HTML, the template that matches the document node is the place to create your HTML superstructure, and within this superstructure you’ll want to include, typically, an <xsl:apply-templates> element that tells the processor how to build the HTML output inside that superstructure. For example, a typical XML-to-HTML transformation might start with code like:
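A minimal sketch of such a starting point, assuming input in no namespace and HTML output; the title text is a placeholder and the <xsl:output> values are one reasonable choice among several.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <!-- Fires once, for the document node, and builds the HTML superstructure -->
    <xsl:template match="/">
        <html>
            <head>
                <title>Placeholder title</title>
            </head>
            <body>
                <!-- Results of processing the rest of the tree are output here -->
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```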

The template rule above matches the document node, creates the HTML superstructure that will go into the output, and then, inside the HTML <body> element, applies templates to the children of the document node (by default, <xsl:apply-templates> means apply templates to the children of the node currently being processed). The only child element of the document node is always the root element of your input XML. Your stylesheet will also include other templates that specify what to do with the various elements of your input XML (see below).

Think of your <xsl:apply-templates> elements as place-holders that mark where to output the results of applying the templates they call. For example, any content you want to appear immediately inside the HTML <body> element that you’re creating can be placed correctly by putting the <xsl:apply-templates> element between the <body> start- and end-tags.

Processing something other than immediate child nodes

By default, <xsl:apply-templates> means apply templates to all child nodes (elements and text) of the current context, that is, the node currently being processed. You are not restricted to processing only child nodes, though; <xsl:apply-templates> optionally takes a @select attribute, which tells the system what nodes to apply templates to. The value of @select is a full XPath expression and will start from the current context, that is, from whatever node is being processed at the time. For example, if you are transforming TEI to HTML and the only XML you want to process is in the <teiHeader>, you can replace the general <xsl:apply-templates/> with <xsl:apply-templates select="//teiHeader"/>. If @select is omitted, the system will default to applying templates to all children of the current context node. That this is the default behavior means that you don’t need to (= shouldn’t) specify @select if what you want to select is all of the children of the current context.
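A short sketch of the <teiHeader> case, assuming TEI input and HTML output; the title text is a placeholder.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:template match="/">
        <html>
            <head>
                <title>Placeholder title</title>
            </head>
            <body>
                <!-- Only the <teiHeader> and its descendants are processed;
                     the text of the document is never selected -->
                <xsl:apply-templates select="//teiHeader"/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```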

Any elements you want to handle specially (that is, for which the built-in behavior is not what you want) will need their own template rules. Remember, though, that templates fire every time the system encounters a matching node in the XML, so if you want an element to be created once (for instance, the <html> element), it should go within a template that matches a node that only appears once (for instance, /). If you’re generating HTML <p> elements, on the other hand, you’ll need those to be inside a template that will fire many times because you want to generate many <p> elements, not one giant <p> element which contains the text of all the paragraphs. Similarly, if you are creating an HTML table with a lot of rows, you typically want only one table, so you should create that directly inside the <body> element and then create the <tr> elements for the rows in a template that fires once for each row you want to create. If, say, you want to create one table row for each <character> element in your input, your XSLT will probably look something like:
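A sketch under stated assumptions: the input is in no namespace, and each <character> element has hypothetical <name> and <role> children (these element names and the column labels are invented for illustration).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <!-- Fires once: creates the single table and its header row -->
    <xsl:template match="/">
        <html>
            <head>
                <title>Characters</title>
            </head>
            <body>
                <table>
                    <tr>
                        <th>Name</th>
                        <th>Role</th>
                    </tr>
                    <xsl:apply-templates select="//character"/>
                </table>
            </body>
        </html>
    </xsl:template>
    <!-- Fires once per <character>: creates one data row each time -->
    <xsl:template match="character">
        <tr>
            <td>
                <xsl:apply-templates select="name"/>
            </td>
            <td>
                <xsl:apply-templates select="role"/>
            </td>
        </tr>
    </xsl:template>
</xsl:stylesheet>
```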

Note that you create only one <table>, so you do that inside a template that fires just once, the template that matches the document node. You create one row for each character, though, so you create your <tr> elements inside a template that matches <character> elements, and therefore fires once for each <character> element. The only <tr> that gets created inside the template rule for the document node is the one that labels the columns, since you want just one of those.

<xsl:apply-templates> vs. <xsl:value-of>

Sometimes you get the content of an element or attribute by applying templates to it and sometimes you use <xsl:value-of>. The difference between <xsl:apply-templates> and <xsl:value-of> is that, as far as nodes are concerned, <xsl:value-of> can return only a text node, that is, plain text. We often use <xsl:value-of> to return constructed textual values. For example, if we want to create a table cell that contains the number of speeches by Hamlet, that number is a value that we construct by using the count() function, along the lines of:
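A hedged sketch: it assumes that each speech is an <sp> element with a <speaker> child whose text is exactly Hamlet, which may not match how speakers are identified in your edition of the play.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:template match="/">
        <html>
            <head>
                <title>Speech counts</title>
            </head>
            <body>
                <table>
                    <tr>
                        <td>Hamlet</td>
                        <td>
                            <!-- count() returns a number, not a node from the input,
                                 so there is nothing to apply templates to -->
                            <xsl:value-of select="count(//sp[speaker = 'Hamlet'])"/>
                        </td>
                    </tr>
                </table>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```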

The output created by <xsl:value-of> is always a text node, and it represents a dead end in the XML tree insofar as it cannot contain markup (because it is a single text node) and it isn’t on the tree (because you constructed it). This means that you cannot apply templates to its children (it doesn’t have any) or anything else on an axis from it (because it isn’t on a tree and therefore doesn’t have axes). If, for example, you are processing a paragraph node tagged as <p>, <xsl:value-of select="."/> will return the textual value of the paragraph, throwing away any internal markup. If you want to process that internal markup (for example, if the paragraph contains titles or foreign words or emphasis or anything that should be processed separately), <xsl:value-of> will make it impossible to process those elements, and all you’ll get is their textual content, as if they weren’t marked up in the first place. If, on the other hand, the paragraph has no internal markup, there is no difference in behavior between <xsl:apply-templates> and <xsl:value-of>. As a rule of thumb, use <xsl:apply-templates> to process nodes from the input tree, especially ones that may contain markup, and use <xsl:value-of> to output values that you construct.

Both <xsl:apply-templates> and <xsl:value-of> can take a @select attribute to specify what should be processed. That attribute is optional with <xsl:apply-templates>; as we noted above, if you don’t use @select, you will apply templates to all of the child nodes of the current context, whatever they may be. In the case of <xsl:value-of>, though, you will in practice always supply @select (XSLT 1.0 required it; later versions also allow the value to be built from the content of the element, but we do not use that form here). For example, <xsl:value-of select="."/> will output the string value of the current context node, throwing away any markup. <xsl:value-of select="string-length(.)"/> will output a single integer, representing the number of characters in the string value of the current context node.

Outputting mixed content

XSLT usually does The Right Thing when it is outputting just elements or just plain text, but mixed-content output (that is, a mixture of elements and plain text) can lead to awkward white-space handling. You can avoid having to worry about the intricacies of XSLT white-space handling by applying the following rule of thumb: when you are outputting mixed content, wrap all plain text in <xsl:text> tags. For example, instead of writing:
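The contrast can be sketched as follows, assuming a hypothetical <character> element with a <name> child. The discouraged form is described in the comment, and the template shows the recommended form.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <!-- Discouraged: typing the label "Name: " directly between the <p> tags.
         Any line breaks and indentation around literal text that contains
         non-white-space characters are copied to the output along with it. -->
    <xsl:template match="character">
        <p>
            <!-- Recommended: only the characters inside <xsl:text> are output;
                 the white-space-only text around the instructions is stripped -->
            <xsl:text>Name: </xsl:text>
            <xsl:apply-templates select="name"/>
        </p>
    </xsl:template>
</xsl:stylesheet>
```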

Putting it all together

By way of illustrating a complete transformation, here are a sample XML document (whose content you may recognize from the first week of class) and a sample XSLT stylesheet to transform the XML into HTML for publication on the web.

The XML is pretty straightforward. The root element is <letter>, which has two children, a <head> and a <content> element, and the latter contains the body of the letter and the rest of the elements. Locations within the text are tagged, but for the sake of simplicity and brevity, the sender and recipient are tagged only in the salutation and valediction, as <person> elements. (That is, the personal pronouns that refer to them in the body of the letter are not tagged.)
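The structure described above can be sketched as follows. The text is taken from the output shown below, but the contents of <head> are not reproduced, and the exact placement of the <person> tags is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<letter>
    <head>
        <!-- Metadata about the letter; its contents are not shown in this sketch -->
    </head>
    <content>
        <dateline><location>Rouen</location>, <date>August 1897</date></dateline>
        <salutation>My own <person>Darling Boy</person>,</salutation>
        <body>
            <p>I got your telegram half an hour ago, and just send a line to say that I feel that my only hope of again doing beautiful work in art is being with you. It was not so in the old days, but now it is different, and you can really recreate in me that energy and sense of joyous power on which art depends.</p>
            <p>Everyone is furious with me for going back to you, but they don’t understand us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for me, and then our friendship and love will have a different meaning to the world.</p>
            <p>I wish that when we met at <location>Rouen</location> we had not parted at all. There are such wide abysses now of space and land between us. But we love each other.</p>
        </body>
        <valediction>Goodnight, dear. Ever yours, <person>Oscar</person></valediction>
    </content>
</letter>
```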

Our sample output will be an HTML document that does not include any information from the <head> element; it outputs our paragraphs as HTML paragraphs and italicizes all persons and locations. The result of the transformation can be seen below:

Rouen, August 1897

My own Darling Boy,

I got your telegram half an hour ago, and just send a line to say that I feel that my only hope of again doing beautiful work in art is being with you. It was not so in the old days, but now it is different, and you can really recreate in me that energy and sense of joyous power on which art depends.

Everyone is furious with me for going back to you, but they don’t understand us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for me, and then our friendship and love will have a different meaning to the world.

I wish that when we met at Rouen we had not parted at all. There are such wide abysses now of space and land between us. But we love each other.

Goodnight, dear. Ever yours, Oscar

Our stylesheet that performs this transformation is below, followed by a discussion of how it works:
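The following is a sketch of such a stylesheet, laid out so that its lines correspond to the line ranges in the discussion. The <title> text, the xs and math declarations, and the <xsl:output> attribute values are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xs math"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Letter</title><!-- title text is illustrative -->
            </head>
            <body>
                <xsl:apply-templates select="//content"/>
            </body>
        </html>
    </xsl:template>
    <xsl:template match="dateline">
        <h4>
            <xsl:apply-templates/>
        </h4>
    </xsl:template>
    <xsl:template match="location | person">
        <em>
            <xsl:apply-templates/>
        </em>
    </xsl:template>
    <xsl:template match="p | salutation | valediction">
        <p>
            <xsl:apply-templates/>
        </p>
    </xsl:template>
</xsl:stylesheet>
```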

Lines 1–7 are created by <oXygen/> when you tell it to create a new XSLT stylesheet. The only part that we’ve added is the HTML namespace declaration on line 5, so that all output will be in the HTML namespace.

Lines 10–19 set up our HTML superstructure (we’ve added a <title>, which will show up in the browser tab, but not in the browser window), populating our <body> element with the results of applying templates to all <content> elements wherever they appear. There’s only one <content> element, and no template for <content>, so the system falls back on the default behavior and applies templates to all of its children. (Note that we never apply templates to <head> or <context>, so they will not be output at all in our result document.)

The children of <content> are <dateline>, <salutation>, <body>, and <valediction>, and we have templates for all of those except <body>. That means that we’re relying on the default behavior for <body>, which is, again, to apply templates to its children. The <dateline> element, whose template is on lines 20–24, will process the contents of the <dateline> element and output the results inside an HTML <h4> element. There’s no @select attribute on the <xsl:apply-templates> here, so the system will apply templates to all children of the element (there are three: the <location> element, the text node after it that contains a comma and some white space, and the <date> element). We don’t have template rules for the second and third of these, so the built-in rules will take care of them; the <location> element is processed by the template on lines 25–29, which outputs the content wrapped in an <em> element (typically rendered as italics in the browser). The @match attribute in this template rule uses the union operator (|) to say that the same template matches both <location> and <person> elements.

Using the HTML <em> element to italicize locations, as we do above, is not good practice because HTML elements have semantics (that is, meanings). The <em> element is appropriate only for emphasized text, and there is nothing emphatic about a placename. Using HTML elements in a semantically incorrect way in order to achieve a rendering effect is called tag abuse, and it should be avoided. It would be more correct to tag the locations by wrapping them in <span class="location"> and then using a CSS rule like span.location { font-style: italic; } to italicize them.

The template on lines 30–34 matches three different types of elements: <p> elements, <salutation> elements, and <valediction> elements. For all three, it outputs the contents inside an HTML <p> element. This way any <p>, <salutation>, and <valediction> element in the input XML will become an HTML paragraph in our output. Since we again applied templates without a @select attribute, we again revert to the default behavior of applying templates to all child nodes (elements and text) of any <p>, <salutation>, or <valediction> element. Those child nodes (and their children, all the way down the tree) will be processed by whatever templates match them, whether those are templates that we wrote or built-in default templates.
