User-defined functions in XSLT

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2022-12-08T15:54:56+0000

Why create your own functions?

Long XSLT templates can be difficult to understand, debug, modify, and maintain because they implement a lot of functionality in one place. For example, if you have a template that computes complex values and outputs them in different rows and cells of an HTML table, the logic of the table organization may be difficult to see and understand if it is broken up, within your template, by the logic that computes the content values. You can improve the legibility by computing the values at the beginning of the template and assigning them to variables, which you then reference in place in the constructed HTML output, but if you need to perform the same types of computation in multiple locations, you don’t want to have to duplicate the code. What you’d like is to be able to use functions that do exactly what you need anywhere in your stylesheet with as much ease as you use standard XPath library functions. If you had that facility, a template that is responsible primarily for HTML table structure could focus on that structure, and much of the code that computes the values that will eventually populate the cells could be located elsewhere.

This type of organization is automatic when you use the standard function library. For example, if you want to insert the mean of each row of values in a table into a cell at the end of the row, you can pass the values into the standard XPath avg() function and it will return the result. Your template doesn’t have to include code that divides the sum of the values by the count of the values (that is, the code that computes the average) because that responsibility lies elsewhere, and all you need to do is invoke the avg() function by name and supply its input. But, as noted above, there is no standard library function that will compute the median or the mode of a sequence of values, and although you can piece the functionality together yourself, if you do it inside the <td> cell tags that will hold the result, or even in a variable that you compute within the template and then reference in the cell, your template grows longer and harder to work with.

An alternative to monolithic templates is to decompose your computation into stand-alone functions that you can invoke similarly to the way you invoke functions from the standard library. This approach may require additional code initially in order to manage the overhead of defining the functions, but there are at least three types of compensatory benefits:

If you isolate stand-alone bits of functionality in their own functions, you can concentrate on them when you develop them and then not have to think about (or even notice) how they work when you use them in your templates. This is similar to the way you don’t have to think about the standard library functions beyond knowing their signatures (names, parameters, types, etc.). As a result, your templates become shorter and easier to read because they are not trying to do everything in one place.

Functions can invoke other functions, both those from the standard library and those that you write yourself. This means that not only can a complex template offload some of its computation onto a user-defined function, but a user-defined function that needs to do several things can offload its individual tasks onto smaller user-defined functions.
If you later decide to change your implementation of a function, you can do it in its own place, without having to drill down to the few relevant lines inside a large template that does many things at once. In a collaborative development environment, such as with our course projects, this makes it easy for one team member to be responsible for developing a particular bit of functionality while another team member concentrates on structuring the template that will use the output of that functionality.
The code we develop does not always do what we think it is doing. For example, if we want to know how long the individual speeches in Hamlet are and we measure them with string-length() we might fail to notice ahead of time that we don’t want to count the whitespace characters introduced by pretty-printing. Or we might want to perform integer division but we accidentally use the div operator instead of idiv. If the input we’re focusing on happens not to have extraneous whitespace or happens not to leave a remainder under division, we may not realize that our code would give results, but not the results we want, with different input. This is the most dangerous type of computational mistake because it doesn’t raise an error message; it just quietly gives the result we asked for even when that is not the result we thought we were asking for. For this reason software developers write tests that supply a range of different sample inputs to a function and verify that the result is what it should be, a development strategy called unit testing.

If a large, composite template that produces a lot of output does not perform as expected, locating the source of the error can be challenging. If, though, we can test separately each function that contributes to that composite output, we have a better chance of finding and fixing the source of a problem. A unit is a small, isolatable bit of functionality and the goal of unit testing is to anticipate the variety of types of input a function might receive and confirm that it performs as intended with any input it might encounter. We don’t discuss how to perform unit testing here, but our point is that isolating small coherent bits of functionality in their own functions makes it possible to test separately the parts of the logic that contribute to a final composite result. In other words, you want to write testable code.
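The two silent mistakes mentioned above are easy to reproduce. The following is a minimal sketch, not part of the original tutorial: an XSLT 3.0 stylesheet, started at its initial template, that uses a hypothetical padded sample string and puts the unintended and intended expressions side by side.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical sample: "To be" padded with two spaces on each side -->
    <xsl:variable name="speech" select="'  To be  '"/>

    <xsl:template name="xsl:initial-template">
        <!-- div keeps the fractional part: 3.5 -->
        <xsl:value-of select="7 div 2"/>
        <xsl:text>&#10;</xsl:text>
        <!-- idiv discards the remainder: 3 -->
        <xsl:value-of select="7 idiv 2"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Counts the padding as well as the text: 9 -->
        <xsl:value-of select="string-length($speech)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Counts only the whitespace-normalized text: 5 -->
        <xsl:value-of select="string-length(normalize-space($speech))"/>
    </xsl:template>
</xsl:stylesheet>
```

With input such as 6 and 2, or a string without padding, each pair of expressions would agree, which is exactly why the mistake can go unnoticed without tests that vary the input.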

It may seem at first as if the fragmentation of complex processing into separate functions will complicate development because the code on which a template relies may be located in different places in the stylesheet. A more useful perspective is that:

We already work with standard library functions without having to think about where the code that makes them work is located. Once we are confident that a function does what we want, we don’t have to look inside it while developing a template or another function that uses it.
Templates and functions become harder to test and to debug as they attempt to do more. A common guideline is that if your template or function requires more than one screen of code, you might want to consider separating out smaller units of functionality into their own function definitions. (Revising an implementation without changing the core functionality is called refactoring.) There’s nothing magic about one screen of code; the point is that it is usually easier to develop something small that relies on functions encoded elsewhere than to have to consider all of the functionality at the same time and in the same place.
Small, independent units of functionality can often be reused, and coding something once and reusing it takes less time and effort and is less prone to error than reimplementing essentially the same functionality separately in multiple places.

How does functional programming work?

Functional programming entails some assumptions that may at first seem counter-intuitive to developers who are new to the paradigm. For example, XPath and XSLT cannot change the input file as they process it, although they can return an altered version of it as output. XPath and XSLT also cannot change variable values, which means that we cannot increment a counter with a statement like $x = $x + 1 (although there are, of course, alternative ways of maintaining a count). The benefit of these constraints is that they make it possible for functions to be applied to sequences in any order, including in parallel (that is, to process several items in an input sequence at once, instead of looping through them one by one, waiting for each to finish before beginning the next). The order in which operations happen in XSLT is largely transparent to the user because in a declarative language we state the desired result of an operation without having to know all of the details about how our code produces that result, but we nonetheless benefit from more efficient execution when the computer can do things as soon as it has the resources available to do them, and is not constrained by a possibly suboptimal execution that reflects a step-by-step human perspective.
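As a minimal sketch of one alternative to incrementing a counter, the stylesheet below numbers the speeches with position() and obtains the total with count(), so no variable is ever updated. It assumes an input document whose speeches are no-namespace <sp> elements; the output layout is purely illustrative.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <xsl:for-each select="//sp">
            <!-- position() supplies the running number; nothing is incremented -->
            <xsl:value-of select="position(), string-length(normalize-space(.))"
                separator=": "/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
        <!-- The total is computed in one step instead of being accumulated -->
        <xsl:value-of select="'Speeches: ' || count(//sp)"/>
    </xsl:template>
</xsl:stylesheet>
```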

The Computational stylesheets chapter of Kay (pp. 985–1000) provides a clear introduction to functional programming in XSLT that is accessible to those who are not trained professionally in computer science. This would be a good time to read it!

One example of where not being constrained by a pre-determined order of execution can be helpful is that when you write XPath like //sp ! string-length() to compute the length of each speech in a play, the processor does not have to loop through the speeches one by one, waiting for the first one to be measured before the second can begin. And although an XPath expression like this one is guaranteed to return the results in a way that is consistent with document order, it is not required to process them in document (or any other specific) order. The processor is able to optimize the order of execution only because XSLT and XPath cannot change state (that is, cannot change the input document, variable values, or other items on which other processing steps depend), so no operation has to wait for another to finish to ensure that the state hasn’t changed midstream.

The XPath for statement may look like a sequential operation, but for $speech in //sp return string-length($speech) can process the speeches in any order, and even in parallel, exactly like //sp ! string-length(). It has to return results so that the order of the output items corresponds to the order of the input items, but it doesn’t have to process them in that order. The same is true of <xsl:for-each> and the same is true when we apply templates to a sequence of nodes. In technical terms these are mapping operations, rather than loops, where mapping means that each item in the input sequence is mapped to a corresponding item in the output sequence. As Kay explains:

Although there is a defined order of processing, each item in the sequence is processed independently of the others; there is no way that the processing of one item can influence the way other items are processed. This also means that you can’t break out of the loop. Think of the items as being processed in parallel. (323)

The crucial difference between mapping and looping that makes mapping potentially more efficient computationally is that mapping doesn’t have to happen one item at a time in input order.
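The equivalence of the three mapping constructs can be sketched as follows. This is an illustrative stylesheet rather than code from the tutorial, and it assumes an input document with no-namespace <sp> elements. Each of the three lines of output contains the same space-separated list of lengths, in document order, however the processor chooses to schedule the work.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- 1. Simple map operator -->
        <xsl:value-of select="//sp ! string-length()"/>
        <xsl:text>&#10;</xsl:text>

        <!-- 2. XPath for expression -->
        <xsl:value-of select="for $speech in //sp return string-length($speech)"/>
        <xsl:text>&#10;</xsl:text>

        <!-- 3. xsl:for-each, collected in a variable so that it can be output the same way -->
        <xsl:variable name="lengths" as="xs:integer*">
            <xsl:for-each select="//sp">
                <xsl:sequence select="string-length()"/>
            </xsl:for-each>
        </xsl:variable>
        <xsl:value-of select="$lengths"/>
    </xsl:template>
</xsl:stylesheet>
```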

There are, of course, operations that have to happen in a particular order. For example, if we compute a value with count() or sum() or some other function that outputs a numerical result and then format it with format-number(), the result of the formatting operation cannot be created until the result of the operation that does the counting or addition is available for use as its input. But even in cases like this, if we are counting and formatting many things (for example, the number of speeches in all of the scenes of a play), we don’t have to do all of the counting for all of the speeches first and then format all of the counts, and we don’t have to count and format each individual scene, from data to final output, before we begin the next. The functional model means that the order in which we apply the steps in the processing pipeline (data → counting → formatting → output) to the scenes can be optimized by the processor to take advantage of whatever input is available at any point in the operation.

A functional language also ensures that a function will always produce the same output when it is given the same input in the same context. This is how the XPath generate-id() function can output the same value for the same node when invoked in different locations in the stylesheet, a strategy you may have applied to completing our Shakespeare sonnets task. It is also why, perhaps surprisingly, we cannot measure the amount of time an operation takes by calling current-time() before and after the execution; the value returned by current-time(), like the value returned by any XPath function, is guaranteed to be the same each time it is called with the same input (which in this case means each time it is called, since this particular function does not accept input arguments). The next time you run the entire transformation the value will be different, but within a single transformation the same function with the same input is not permitted to yield different results.

A few XPath functions are not fully deterministic, that is, do not always produce exactly the same result when invoked with the same input. For example, the distinct-values() function is non-deterministic with respect to ordering, which means that two invocations of the function with the same input will always produce the same inventory of values, but it is not guaranteed to return them in the same order. See the discussion of deterministic and non-deterministic functions in the XPath functions spec for details.

Part of the efficiency of functional programming comes from the way function chains can be processed. If, in our earlier example, we want to compute the character length of speeches in a play after whitespace normalization, we can chain together two standard library functions, e.g.:

//sp ! normalize-space() ! string-length()

Because functional processing is not required to perform repeated operations as one-by-one, sequential iterations, and because existing values (the input document, variables) cannot be changed, the processor is free to perform the two operations (whitespace normalization, string-length measurement) whenever its optimizations suggest. It may create all of the whitespace-normalized values first and then compute their length, it may whitespace-normalize each string and then immediately compute its length, or it may use some combination of the two. And, as mentioned earlier, it may perform those operations over the members of the input sequence in any order, including some at the same time, even though it must return them in a way that is consistent with the input order.

The parts of a user-defined function

A function has a signature, which consists of its name (in a namespace), the input it accepts (parameters), and the result it returns. The rest of the function body constructs the result, which is returned when the function is called. These parts are described in more detail below.

A preliminary note about terminology: Parameters are the input items that a function expects, as described in the function declaration. Arguments are the values supplied inside the parentheses when a function is used. When a function is called and executed, it assigns each argument in the function call, in order, to a parameter inside the function body, in the same order. The close relationship of parameters to arguments makes it difficult at times to decide which term to use, and for all practical purposes they can be thought of as equivalent.

Function name and namespace

A user-defined function created with <xsl:function> has a name, and you should choose one that is reasonably self-documenting, that is, that describes what the function does. Use a consistent naming strategy for your functions; for example, don’t use camel case (e.g., myFirstFunction()), kebab case (e.g., my-second-function()), and snake case (e.g., my_third_function()) function names in the same stylesheet. XSLT won’t care if you mix your naming strategies, but if you do, you’ll forget the names of your own functions when it comes time to use them.

The name of a user-defined function must be in a namespace and the namespace must be different from reserved namespaces, such as those used for the library functions built into XPath. You should choose a namespace value that is unlikely to be chosen by someone else, so that you will be able to import and use someone else’s functions without introducing conflicts with your own functions. It is customary to use a URI as a namespace value, but that is not required, and the URI is not obligatorily a URL, which is to say that it does not have to point to an existing resource on the Internet. In line 4 of the example below we use http://www.obdurodon.org as the namespace for our user-defined functions and we bind it to the namespace prefix djb:; this is identical to the way we bind the prefix xsl: to the XSLT namespace. When we then define the function in line 10 we use its full name, with the namespace prefix, and when we eventually use it in a transformation we’ll call it as djb:mean(). For your own projects you might want to use a short version of your project name as the namespace prefix and the main URL of your project as the associated namespace value.

The skeleton of a stylesheet that includes a user-defined function looks like the following:

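A minimal sketch of such a skeleton, laid out to agree with the line references in the surrounding text (namespace declaration on line 4, function on lines 10 to 12). The output settings and the comments are assumptions, not part of the description:

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="xml" indent="yes"/>

    <!-- User-defined function; its content is added step by step -->
    <xsl:function name="djb:mean">
        <!-- empty for now -->
    </xsl:function>

</xsl:stylesheet>
```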

As mentioned above, we declare the namespace for our function in line 4, and our <xsl:function> element, which defines our function, begins on line 10 and ends on line 12. At the moment this element is empty, and we will add functionality to it below in a graduated way.

Functions in XSLT are identified by their name and the number of arguments they expect (the number of arguments expected is called the function’s arity). The skeletal example above does not yet specify any input parameters, which means that it has an arity of zero and must be called without any arguments. This isn’t what we want, since we need to be able to input into our function the values for which we are computing a mean, and we describe in the following section how we specify that our function requires that input. The fact that functions are identified by a combination of their name and their arity means that multiple functions can have the same name as long as they accept different numbers of arguments. A function that appears to accept a variable number of arguments is understood by XSLT as different functions that happen to share the same name but have different arity. For example, tokenize("my input", "\s+") takes two arguments, the string being split and the regex on which to perform the splitting, and there is also a one-argument version (e.g., tokenize("my input")) that splits on whitespace, much as if \s+ had been supplied as the second argument. From an XSLT perspective, those are two different functions with different arity that happen to share a name.

The reason this detail matters is that you cannot define a function that takes a variable number of arguments, which you might want to do if you would like there to be a default value for one of your arguments, as is the case with tokenize(), above. If you want to achieve that effect you need to define separate functions, each with a different number of parameters, and the one with fewer parameters can then call the one with more parameters, passing along the original input and supplying default values for the additional parameters. The fact that functions are defined by a combination of their name and their arity also means that even though you can (and should) specify the datatypes of your parameters (see below), you cannot define two different functions with the same name and the same arity that differ only in the datatypes required for their parameters.
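The default-value pattern described above can be sketched as follows. The function name djb:join and its default separator are hypothetical and serve only as illustration: the one-parameter version does nothing except call the two-parameter version with the default supplied.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Arity 2: the caller supplies the separator -->
    <xsl:function name="djb:join" as="xs:string">
        <xsl:param name="items" as="xs:string*"/>
        <xsl:param name="separator" as="xs:string"/>
        <xsl:sequence select="string-join($items, $separator)"/>
    </xsl:function>

    <!-- Arity 1: same name, delegates to the arity-2 version with a default -->
    <xsl:function name="djb:join" as="xs:string">
        <xsl:param name="items" as="xs:string*"/>
        <xsl:sequence select="djb:join($items, ', ')"/>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <!-- Outputs: a, b, c -->
        <xsl:value-of select="djb:join(('a', 'b', 'c'))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Outputs: a-b-c -->
        <xsl:value-of select="djb:join(('a', 'b', 'c'), '-')"/>
    </xsl:template>
</xsl:stylesheet>
```

Keeping all of the real logic in the version with more parameters means that a later change to the implementation has to be made in only one place.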

Function parameters and datatypes

The first children of the <xsl:function> element, which defines a function, are its parameters, which are specified as empty <xsl:param> elements. Our function to compute a mean will accept one argument, a sequence of numbers, so we need to augment our function definition by adding an <xsl:param> child element:

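A sketch of the function at this stage, assuming that it sits inside a stylesheet that declares the xsl: and djb: prefixes; the parameter name input matches the later reference to $input:

```xslt
<xsl:function name="djb:mean">
    <xsl:param name="input"/>
    <!-- function body will be added after the parameter -->
</xsl:function>
```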

You can call your parameters whatever you want. If a function requires more than one argument, we must declare one parameter for each argument, and the arguments supplied when the function is used are assigned to the parameters in the order in which they appear, so that the first argument becomes the value of the first parameter, etc.

XSLT function arguments are positional, that is, their role in the function is determined by the way they are assigned to parameters in the order in which they are listed inside the parentheses when the function is used. There are programming languages that have named parameters that are not sensitive to order, but XPath and XSLT do not support named function parameters and all function parameters are therefore positional.

It is customary to pass all information that a function requires into the function as arguments, that is, as input supplied inside the function parentheses each time the function is used. Functions also have access to stylesheet variables without their being passed in explicitly because stylesheet variables are global by definition, and therefore available anywhere in the stylesheet, including inside user-defined functions. Despite the availability of stylesheet variables, some developers avoid using them inside user-defined functions and instead pass them in explicitly as function arguments inside the function parentheses (and then assign them internally to function parameters) when they are needed, reasoning (correctly) that functions that are not wholly self-contained, and that rely on information elsewhere in the stylesheet that is not supplied explicitly to the function when it is invoked, are more difficult to debug, maintain, and reuse. Other developers are less categorical, reasoning (also correctly) that using some types of stylesheet variables (such as constants with atomic values that do not depend on the context item) can simplify the function code. A function does not have access to any nodes in the tree being processed unless they are either supplied as arguments when the function is called or assigned to stylesheet variables and therefore globally available.

The standard library function avg() accepts not only numerical values as input, but also durations (there are pre-defined datatypes for durations expressed in years, months, days, hours, minutes, seconds), and it’s capable of averaging those. Since the purpose of writing our own version is just to illustrate how to go about developing a user-defined function, we’ve taken the liberty of simplifying the task, so our version accepts only numbers as input, and not durations.

An <xsl:param> element has an optional @as attribute that specifies the datatype of the expected input. Since our function requires a sequence of numbers as its input, we can (= should) use this attribute to specify that the datatype of our single input parameter must be a sequence of numerical values. We don’t care whether the numerical values are integers (whole numbers) or doubles (numbers with digits to the right of the decimal point, like 4.2) or some of each, but since every integer can be treated as a double (e.g., the integer 4 is numerically equal to the double 4.0), we specify the input as a sequence of doubles.

We’ve decided that our function can only accept arguments that are numeric (or that can be treated as numeric), but we also need to decide whether the function should accept an empty sequence, that is, a sequence of zero (numeric) items. If you decide that supplying an empty argument should be valid, you need to decide what the return value would be, since the arithmetic mean of the input values doesn’t have an obvious meaning when the input contains no values. Because the standard library avg() function returns an empty sequence when the input is an empty sequence, we’ve decided (somewhat arbitrarily) to mimic that behavior: we will not regard an empty input sequence as an error and we will return an empty sequence as a result. We will, though, regard input values that are not numbers and not able to be understood as numbers (for example, strings) as erroneous, and we’ll terminate the operation and report an error should we encounter that sort of input to our function.

You want to plan what your function will do with edge cases before you implement it. Whether you allow an empty sequence as input to your function or regard it as an error is up to you, and what the function should return if you do allow an empty sequence as input is also up to you, since there’s no obvious single correct answer. The take-away is that you should make those decisions before you write the code that will handle the input.

Although the @as attribute is optional as far as XSLT syntax is concerned, it is good practice (and required for all work in our course) always to specify the datatypes of all parameters. For that reason, we now enhance our skeletal function declaration by adding a datatype specification (see line 2):

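A sketch of the declaration with the datatype added on line 2, assuming that the enclosing stylesheet binds the xs: prefix to http://www.w3.org/2001/XMLSchema:

```xslt
<xsl:function name="djb:mean">
    <xsl:param name="input" as="xs:double*"/>
    <!-- function body will be added after the parameter -->
</xsl:function>
```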

Datatype specifications use the same occurrence indicators as Relax NG, so the asterisk means that the input must be a sequence of zero or more numbers (or items that can be treated as numbers). The most common datatypes for atomic values are xs:integer, xs:double (numbers with decimal components), and xs:string. Elements can be specified as element() (which allows any element type) or, if a specific element type is required, with the element name inside the parentheses, so that a function that required as input a sequence of one or more <speech> elements would specify the type of the parameter as element(speech)+.
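As an illustration of an element type with an occurrence indicator, the following sketch declares a hypothetical function, djb:speech-lengths(), that requires one or more <speech> elements and returns one integer per speech. It assumes an input document with no-namespace <speech> elements.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical function: one or more <speech> elements in, one integer per speech out -->
    <xsl:function name="djb:speech-lengths" as="xs:integer+">
        <xsl:param name="speeches" as="element(speech)+"/>
        <xsl:sequence select="$speeches ! string-length(normalize-space(.))"/>
    </xsl:function>

    <xsl:template match="/">
        <!-- A document with no <speech> elements raises a type error here, -->
        <!-- because the plus sign does not allow an empty sequence -->
        <xsl:value-of select="djb:speech-lengths(//speech)"/>
    </xsl:template>
</xsl:stylesheet>
```

Changing the plus sign to an asterisk in both @as attributes would make the function accept an empty sequence and return an empty sequence instead of raising an error.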

The main reason we always specify datatypes on our parameters (and variables) is that we want the XSLT processor to notify us should we accidentally supply a value of the wrong type. Inexperienced developers sometimes omit the @as in a misguided quest to reduce the number of error notifications. This is misguided because reducing the number of notifications does not reduce the number of errors, and it is obviously better to be notified when something you did not expect happens than to receive, accept, and trust an erroneous result because you never noticed that you were passing your function an erroneous value. It isn’t possible to trap all developer errors, but checking that your datatypes are what you expect is much better than checking nothing.

When it comes time to use our parameters within the function body, we refer to them by name, prefixed with a dollar sign, so that, for example, the parameter we declare above can be referenced as $input. Parameters are thus similar to XSLT variables: they are declared with a @name attribute and given a name that does not begin with a dollar sign, but when they are referenced subsequently, the dollar sign is prepended to them.

Function result and datatype

A function returns the result of evaluating the code inside the function body after the parameter declarations. Unlike in some other programming languages, there is no explicit return statement; the result of the function that is returned is simply the result of evaluating the function body. The body may include literal result elements, it may incorporate the result of applying templates to whatever was passed in as input, and it may construct new atomic values. It is common to return a result with the <xsl:sequence> element, although this is not required or expected.

If our function is returning a single atomic value (such as a string or number, or something that can be treated as a string or number), we can use <xsl:value-of> instead of <xsl:sequence>. But because <xsl:value-of> can only create a text node (that is, can only create a single string), if we want to return elements or attributes or more than one thing, our only option is <xsl:sequence> or a sequence constructor (that is, literal elements and other content). There are also some subtle differences in whitespace handling when we return multiple values with <xsl:value-of> vs <xsl:sequence>.
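The difference can be sketched with two hypothetical zero-argument functions that evaluate the same expression. The one that uses <xsl:sequence> returns three separate integers, while the one that uses <xsl:value-of> returns a single text node.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Returns the three integers as three items -->
    <xsl:function name="djb:with-sequence" as="xs:integer*">
        <xsl:sequence select="1, 2, 3"/>
    </xsl:function>

    <!-- Returns one text node whose string value is "1 2 3" -->
    <xsl:function name="djb:with-value-of" as="text()">
        <xsl:value-of select="1, 2, 3"/>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <!-- Outputs: 3 -->
        <xsl:value-of select="count(djb:with-sequence())"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Outputs: 1 -->
        <xsl:value-of select="count(djb:with-value-of())"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Outputs: 1 2 3 -->
        <xsl:value-of select="string(djb:with-value-of())"/>
    </xsl:template>
</xsl:stylesheet>
```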

Just as it is possible to specify the datatype of input parameters by using the @as attribute on <xsl:param> elements, it is also possible (and good practice, and required for all code in this course) to specify the datatype of the result of a function by adding the @as attribute to the <xsl:function> element. Our function will return an empty sequence if the input is an empty sequence, but in all other cases where we supply valid input it will return a single double (that is, a single numerical value that may have digits to the right of the decimal point), so we now enhance our skeleton to add that information (see line 1):

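A sketch of the declaration with the result type added on line 1. The question mark allows either an empty sequence or exactly one double:

```xslt
<xsl:function name="djb:mean" as="xs:double?">
    <xsl:param name="input" as="xs:double*"/>
    <!-- function body will be added after the parameter -->
</xsl:function>
```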

Function body

Our function returns an empty sequence if the input is an empty sequence, and otherwise it divides the sum of the input values by their count. We implement this logic inside the function body, after the <xsl:param> elements, as follows:

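A sketch of one implementation that fits this description and the line references in the discussion that follows (parameter type on line 2, empty <xsl:when> on line 6). The variable name input-count and the comments are assumptions:

```xslt
<xsl:function name="djb:mean" as="xs:double?">
    <xsl:param name="input" as="xs:double*"/>
    <xsl:variable name="input-count" as="xs:integer" select="count($input)"/>
    <xsl:choose>
        <!-- No input values: the empty xsl:when returns an empty sequence -->
        <xsl:when test="$input-count eq 0"/>
        <xsl:otherwise>
            <!-- Mean = sum of the values divided by their count -->
            <xsl:sequence select="sum($input) div $input-count"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:function>
```

Under these assumptions a call such as djb:mean((5, 2, 4, 7, 2)) promotes the integers to doubles and returns 4, while djb:mean(()) returns an empty sequence.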

We count the number of input items and bind that count to a variable because we use the value twice, first to test whether there are any input items (since zero input items is a special case that requires special handling) and then, if there are input values, to compute the mean. Binding a value that we use more than once to a variable makes our code easier to understand, it reduces the computational overhead (since we don’t perform the same computation more than once), and it reduces the opportunity for error. The empty <xsl:when> statement on line 6 correctly returns an empty sequence if there are no input values, and the <xsl:otherwise> statement below it computes the mean in the traditional way if there are input values. The @as specification on line 2 ensures that the function will correctly raise an error if any of the input values are not numbers (or items that can be treated as numbers).

Our function now matches the completed version at the top of this tutorial, except that we have not yet added any documentation. We address that concern in the next section.

Documenting a function

While the structure of a function definition is partially self-documenting (we can see the name, the parameters, and the type information), we recommend adding explicit documentation that describes in reasonably natural language what the function does and what its input and output look like (and all code submitted for this course requires that type of documentation). Our preference is to format this documentation as XML comments immediately after the <xsl:function> start-tag, and we find the comments easiest to read if they are arranged as in the example above because both the comment as a whole and its individual parts stand out by virtue of the separators and the blank lines. With that said, any documentation that you find easy to read is acceptable.

Some developers prefer to write comments before the function definition, rather than inside it. We prefer comments inside the <xsl:function> element because when we collapse the element in the <oXygen/> editor to reduce clutter when we aren't working on it, comments inside are automatically collapsed as well, while preceding comments would have to be collapsed as a separate step.

Act	Scenes
Mean

Act	Scenes
1	5
2	2
3	4
4	7
5	2
Mean	4
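A table like the one above could be generated by a template along the following lines. This is a hedged sketch: the element names act and scene are assumptions about the markup of the play, the table is output without a namespace or styling, and the function is repeated in compact form so that the stylesheet is self-contained.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="html" indent="yes"/>

    <!-- Compact version of the mean function -->
    <xsl:function name="djb:mean" as="xs:double?">
        <xsl:param name="input" as="xs:double*"/>
        <xsl:variable name="input-count" as="xs:integer" select="count($input)"/>
        <xsl:if test="$input-count gt 0">
            <xsl:sequence select="sum($input) div $input-count"/>
        </xsl:if>
    </xsl:function>

    <!-- The template is concerned only with table structure -->
    <xsl:template match="/">
        <table>
            <tr>
                <th>Act</th>
                <th>Scenes</th>
            </tr>
            <!-- Assumed markup: <act> elements that contain <scene> children -->
            <xsl:for-each select="//act">
                <tr>
                    <td><xsl:value-of select="position()"/></td>
                    <td><xsl:value-of select="count(scene)"/></td>
                </tr>
            </xsl:for-each>
            <tr>
                <th>Mean</th>
                <!-- The computation itself lives in the function -->
                <td><xsl:value-of select="djb:mean(//act ! count(scene))"/></td>
            </tr>
        </table>
    </xsl:template>
</xsl:stylesheet>
```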

Computing a mode

The mode is the value or values that occur most frequently in the input. If, for example, the input is (1, 2, 3, 2, 1), there are two modal values, 1 and 2, because each of them occurs twice, and that’s the greatest frequency with which any value occurs in the input sequence. We’ll make the following assumptions in our code that computes modal values:

We’ll allow an input of zero items (that is, an empty sequence) and return the empty sequence as the computed mode.
We’ll restrict our input to integers. Computing a mode of non-integer values is possible (the mode will be the most frequent value(s), whether those values are integers or not), but because doubles can vary in large or small ways, they are much less likely to repeat than integer values, which means that a sequence of doubles is unlikely to contain repeating values. If in your work you have a need for modal values of sequences of doubles, you can adjust the function accordingly.

The following function computes a mode for an input sequence of zero or more integers:

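
The listing is not reproduced here, so the following is only a minimal sketch that is consistent with the description in this section, not necessarily the original code. The line numbers cited in the surrounding text refer to the original listing and may not match this sketch. The namespace URI bound to the djb: prefix is an assumption; any URI works as long as it is used consistently.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <!-- The djb: namespace URI above is an assumed placeholder -->
    <xsl:function name="djb:mode" as="xs:integer*">
        <!-- ================================================== -->
        <!-- djb:mode()                                          -->
        <!--                                                     -->
        <!-- Input: zero or more integers                        -->
        <!-- Returns: the modal value(s), or the empty sequence  -->
        <!--   when the input is empty                           -->
        <!-- ================================================== -->
        <xsl:param name="input" as="xs:integer*"/>
        <!-- One empty element per distinct value, holding the value and its frequency -->
        <xsl:variable name="input-items" as="element(djb:input-item)*">
            <xsl:for-each-group select="$input" group-by=".">
                <djb:input-item value="{current-grouping-key()}"
                    count="{count(current-group())}"/>
            </xsl:for-each-group>
        </xsl:variable>
        <!-- Greatest frequency; empty when there is no input -->
        <xsl:variable name="max-count" as="xs:double?" select="max($input-items/@count)"/>
        <!-- Keep only the values whose frequency equals the greatest frequency -->
        <xsl:sequence select="$input-items[@count = $max-count]/@value"/>
    </xsl:function>
</xsl:stylesheet>
```

Because the declared return type is xs:integer*, the @value attributes selected in the last step are atomized and converted to integers when the function returns.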

We allow an input sequence of zero or more integers (line 12). If the input sequence is empty, we return an empty sequence; otherwise we return one or more integers, so we set the return data type (line 1) as zero or more integers. The function logic is as follows:

We create a variable called $input-items, which will be a sequence of zero or more empty <djb:input-item> elements. Those elements are a convenience for holding each input value and its frequency together, so that, for example, if the input value 7 occurs 3 times in the input sequence, that information will be stored as <djb:input-item value="7" count="3"/>.

Using temporary elements for this data-management purpose works, but it isn’t very efficient computationally because elements have a lot of inherent properties that we aren’t using because we don’t need them; all we need is a way to keep the values and their frequencies together. In the optional Odds and ends section, below, we explore more efficient alternatives, and we would probably use one of those in Real Life. We’ve used temporary elements here for pedagogical reasons, that is, because you’re already familiar with elements, attributes, and path expressions.
We use <xsl:for-each-group> to form the items in the input sequence into groups according to their values. Since the scene counts in Hamlet are, in order, (5, 2, 4, 7, 2), where the value 2 occurs twice and the other three distinct values occur once, we’ll wind up with four groups:
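```
<djb:input-item value="5" count="1"/>
<djb:input-item value="2" count="2"/>
<djb:input-item value="4" count="1"/>
<djb:input-item value="7" count="1"/>
```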
We find the greatest frequency of any value with max($input-items/@count). If there are zero input values there will be no largest count value (which is why we set the data type as xs:double?, with a question mark to make it optional); otherwise there will be exactly one largest count (although, for reasons described above, there may be more than one input value that occurs the maximum number of times).
The return value of our function is a sequence of zero or more values, which we obtain with $input-items[@count = $max-count]/@value. This filters our <djb:input-item> elements to select only those with @count values equal to the largest count value. We then take a path step from those elements to their @value attributes, which are the input numbers that occur the maximum number of times, and we return a sequence of those numbers.
If there are no input items, the result of filtering them in the last step will necessarily be an empty sequence, so we do not require special handling for empty input.

If we add a row to the table of information about scenes in Hamlet to render the mode and populate the cell with a call to our djb:mode() function over the scene counts, it returns the value 2, which is the modal number of scenes per act in the play because there are 2 acts that contain 2 scenes and all other scene counts occur in only one act.

Odds and ends

You can regard this section of the tutorial as optional, but we hope that you’ll at least look through it to see what’s there. In particular, maps are a feature introduced with XSLT 3.0 and XPath 3.1 that work quite differently from other features of the language. We don’t introduce them in a comprehensive way here because they require their own tutorial, but since we might use them in Real Life if we were creating a user-defined function to compute an arithmetic mode, we thought we should at least demonstrate how to use them for that purpose. XPath and XSLT developers survived without maps before XSLT 3.0 was introduced, which is to say that the alternative implementation that we discuss above is both idiomatic and fully adequate to the task, and you should feel free to employ that method in your own work if you are not comfortable using maps.

Verification

We tested our function against the examples at https://www.mathsisfun.com/mode.html and verified that we returned the expected modal values in each case. That site discusses alternatives to computing an exact mode in situations where the exact mode is not useful, such as binning (grouping values by range) when all individual values occur only once. We did not implement those methods here, so we always return the modal values, even where they may be of limited practical use.

In Real Life we would perform this verification with proper unit tests (using XSpec, a framework for unit testing XSLT stylesheets), but we have not yet implemented an XSpec test suite for our djb:mode() function.

More examples

Priscilla Walmsley’s FunctX XQuery functions site provides excellent documented examples of both the standard XPath and XQuery function library and her own user-defined functions (in a namespace conventionally mapped to the functx: prefix).

Named templates

The <xsl:function> element was added to XSLT only in version 2.0, and similar functionality in version 1.0 was achieved with the help of named templates and <xsl:call-template>. Since the introduction of <xsl:function> we find that we rarely use the older method, but Kay pp. 349–50 discusses the similarities and differences, including when each might be preferable.
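
As a rough illustration of the older idiom, the sketch below packages a mean computation as a named template and calls it with <xsl:call-template>. The template name mean, the parameter name values, and the input elements <act-counts> and <n> are all hypothetical and chosen only for this example.

```xslt
<!-- Named template: reusable logic without xsl:function -->
<xsl:template name="mean">
    <xsl:param name="values"/>
    <xsl:value-of select="sum($values) div count($values)"/>
</xsl:template>

<!-- The call is an XSLT instruction, so it cannot appear inside an XPath expression -->
<xsl:template match="act-counts">
    <td>
        <xsl:call-template name="mean">
            <xsl:with-param name="values" select="n"/>
        </xsl:call-template>
    </td>
</xsl:template>
```

One practical difference is visible here: a function can be called from within any XPath expression, while a named template has to be invoked with its own instruction.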

Recursion and iteration

Functions and named templates can call themselves, a process known as recursion. Recursion is a common idiom in XSLT programming (and in functional programming in general); see Kay pp. 274–80 and 350–53 for examples and discussion. XSLT 3.0 (not covered in Kay) introduces <xsl:iterate>, which offers an alternative approach to some tasks that would previously have been addressed with recursion; for more information see the xsl:iterate section of the Saxon documentation and the discussion at 7.2 The xsl:iterate Instruction in the XSLT 3.0 spec.
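
To make the contrast concrete, here is a small sketch that sums a sequence of integers twice, once with a recursive function and once with <xsl:iterate>. The function names djb:sum-recursive and djb:sum-iterate are hypothetical, and in Real Life we would simply call the standard sum() function; the point is only to show the shape of each approach. The xs: and djb: prefixes are assumed to be declared on the stylesheet.

```xslt
<!-- Recursion: the function calls itself on the rest of the sequence -->
<xsl:function name="djb:sum-recursive" as="xs:integer">
    <xsl:param name="input" as="xs:integer*"/>
    <xsl:sequence select="
        if (empty($input)) then 0
        else head($input) + djb:sum-recursive(tail($input))"/>
</xsl:function>

<!-- Iteration: a parameter carries the running total from one item to the next -->
<xsl:function name="djb:sum-iterate" as="xs:integer">
    <xsl:param name="input" as="xs:integer*"/>
    <xsl:iterate select="$input">
        <xsl:param name="total" as="xs:integer" select="0"/>
        <xsl:on-completion>
            <!-- Returned after the last item (or immediately for empty input) -->
            <xsl:sequence select="$total"/>
        </xsl:on-completion>
        <xsl:next-iteration>
            <xsl:with-param name="total" select="$total + ."/>
        </xsl:next-iteration>
    </xsl:iterate>
</xsl:function>
```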

Imports, includes, and packages

Our example above embeds our user-defined function definition inside the stylesheet in which it is used to create a result, but a common use of functions involves creating them in a separate XSLT stylesheet and importing them into the stylesheets that then use them to perform transformations. The advantage of this approach is that functions that you use repeatedly in different projects can be written once and reused, just as we use the same standard library functions in different projects. There are three mechanisms for importing functions (and other information) from one stylesheet into another: <xsl:import> (Kay pp. 357–67), <xsl:include> (Kay pp. 372–76), and XSLT packages, new in XSLT 3.0 and not discussed in Kay. You can read more about packages in the Saxon documentation (xsl:package, xsl:use-package, and the links from those pages) and Section 3.5 Packages of the XSLT 3.0 spec.
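
A minimal sketch of the <xsl:import> approach might look like the following. The filename djb-functions.xsl is hypothetical, and we assume that it defines djb:mode() in the same namespace that is bound to the djb: prefix here (the URI shown is a placeholder).

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <!-- xsl:import elements must come before all other top-level declarations -->
    <xsl:import href="djb-functions.xsl"/>
    <xsl:template match="/">
        <!-- The imported function is called exactly like a locally defined one -->
        <p>
            <xsl:value-of select="djb:mode((5, 2, 4, 7, 2))"/>
        </p>
    </xsl:template>
</xsl:stylesheet>
```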

Alternative ways of computing a mode

When I shared this task on the xml.com Slack, several readers contributed alternative ways of computing a mode. Some approaches implement the logic that identifies the modes entirely in XPath, while others rely also on XSLT elements. Some approaches use maps, while others do not. Here is an overview of some of those alternatives. In Real Life we would use whichever we found easiest to understand and implement.

The syntax for creating a user-defined function is the same in the method above and in all of these alternatives, all of these functions are called in the same way, and they all return the same results. The reason to read this section, then, is not to learn something new about functions as much as to think about the variety of ways XPath and XSLT allow us to approach a task. That variety does tell us something important about functions, though, because the only thing that changes across these different methods is the function body, and the signature remains constant, which means that the template that uses the function does not have to be modified when we change the function body. We find this a persuasive example of the benefits of modular approaches to coding, that is, of isolating a small amount of functionality so that it can be modified without having to change (or even think about) the way other parts of the program are going to use it.

Pure XPath without maps

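
The listing for this method is not reproduced here. The sketch below is our reconstruction from the description in this section, using the variable names $dist and $mx that the text mentions; the line numbers cited in the text refer to the original listing and may not match. The xs: and djb: prefixes are assumed to be declared on the stylesheet.

```xslt
<xsl:function name="djb:mode" as="xs:integer*">
    <xsl:param name="input" as="xs:integer*"/>
    <xsl:sequence select="
        let $dist := distinct-values($input),
            $mx := max($dist ! count(index-of($input, .)))
        return
            $dist[count(index-of($input, .)) = $mx]"/>
</xsl:function>
```

With empty input both $dist and $mx are empty, so the filter selects nothing and the function returns the empty sequence.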

This approach uses the let … return construction introduced in XPath 3.0. A value is bound to a variable by writing the let keyword, the variable name, and then := (not just an equal sign) followed by the value, after which the result of evaluating an XPath expression that (typically) uses those variables can be returned. If there are multiple bindings, they need to be separated by commas.

The := symbol is informally called the walrus operator because—at least if you have a good imagination—it looks like the eyes and tusks of a walrus lying on its side. For reasons that have to do with the technical meaning of the terms operator, assignment, and binding in computer science, := is not, strictly speaking, an operator, and when we associate values with variables we describe the process as binding, rather than assignment. A jargon-neutral way to describe the effect of the := symbol is that it associates the value to the right with the variable name to the left.

This method creates two variables: $dist is a sequence of the distinct values from the input (each of which appears in the input with a frequency of one or more) and $mx is the highest such frequency. To compute the frequencies we take each distinct value and use the XPath index-of() function to find the offsets of the position in the original sequence where the value appears. If, for example, the input were (5, 2, 4, 7, 2), index-of($input, 2) would return the sequence (2, 5) because the value 2 appears at positions 2 and 5 in the input. For our purposes we don’t really care about the specific offsets; we care only how many offsets there are for each distinct input value, since we can count the offsets as a surrogate for counting the frequency of the values themselves. The snippet $dist ! count(index-of($input, .)) creates a sequence of those counts, one per distinct value in the original input, and we wrap that sequence of frequencies in the max() function to find the largest such value (there must be exactly one, even if it occurs more than once because multiple input items appear at maximal frequency). We then bind that maximal frequency value to the variable $mx.

Finally, we use the maximal frequency to filter our distinct values by counting how often each one occurs (again) and finding the one(s) that appear most frequently. The predicate in line 15 uses the same index-of() strategy to count the frequency of each distinct value in the original input sequence, and it retains those for which the count equals $mx. This strategy is similar to the one we used in questions 4 and 5 of XSLT assignment #4 to find the speaker of the longest speech in Hamlet.

This method can be condensed into a single line, which is as elegant as it is challenging for a human (well, for this human) to parse. The one-line version is:

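
The one-line listing is not reproduced here, but the walkthrough in this section implies an expression of the following shape. This is a reconstruction under that assumption, wrapped in the same function signature as before.

```xslt
<xsl:function name="djb:mode" as="xs:integer*">
    <xsl:param name="input" as="xs:integer*"/>
    <!-- For each item, keep it only if its position equals the n-th offset of its value,
         where n is the greatest frequency of any value in the input -->
    <xsl:sequence select="$input[index-of($input, .)[max($input ! count(index-of($input, .)))]]"/>
</xsl:function>
```

The outer predicate evaluates to a single integer or to the empty sequence for each item. An integer predicate is a positional test, so an item survives only when its own position is the offset that the inner expression selects.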

The logic is largely the same, but it dispenses with interim variables and it avoids duplicate output without using the distinct-values() function, which means that it has to implement an alternative way of dealing with values that appear more than once in the input sequence. Reading from the inside out:

Starting with the innermost predicate, $input ! count(index-of($input, .)), we count the number of times each value appears in the input sequence (counting the offsets of the values as a surrogate for counting the values themselves). We don’t remove duplicates, so if, for example, our input is (1, 2, 3, 4, 5, 3, 4), the predicate returns (1, 1, 2, 2, 1, 2, 2). We wrap this in the max() function, which tells us that the modal value(s) occur twice. With this input data, then, we can replace the inner predicate with the integer value 2, which means that our expression is equivalent to $input[index-of($input, .)[2]].
Inside the remaining (outer) predicate we now get the offset positions for each value in the input and filter them to keep only the second such position. Here are examples of how that works for input values that occur only once (we use 2 as an example) and for values that occur twice (we use 3 as an example):
- When the input value is 2, the expression $input[index-of($input, .)[2]] evaluates to $input[()] because there is no second offset position for the value 2 in the input. This is a valid expression that selects nothing, which means that the input value 2 is not added to the output sequence.
- When the input value is 3, the expression $input[index-of($input, .)[2]] evaluates to $input[6] because 6 is the second offset position for the value 3 in the input. The sixth item of the input is therefore included in the output sequence.
The clever part of this approach that makes up for our not having removed the duplicates earlier is that we filter the offset positions to keep only the second one. Input values that appear only once don’t have a second offset position, so they yield an empty sequence, which is not an error in XPath, and those input values are not included in the output. But the values that appear twice do have a second offset position, so their second appearance (but not the first) is included in the output, thus removing the duplicates by selecting only one instance of the desired input value.

It took us long enough to work through the logic that we probably wouldn’t use this method in Real Life, but figuring out how it works was a rewarding experience that enhanced our knowledge and understanding of XPath.

Pure XPath with maps

The following XPath-oriented function uses maps. XPath that uses functions in the map namespace must declare that namespace, which means adding the following attribute to the root <xsl:stylesheet> element:

xmlns:map="http://www.w3.org/2005/xpath-functions/map"

We do not try to explain how to use maps in any general way here because they really require their own tutorial, but we do try to explain how the map-related parts of the following code work:

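
The listing is not reproduced here. The following sketch is reconstructed from the explanation in this section and uses the variable names $distinct, $freqs, and $max that the text describes; the line numbers cited in the text refer to the original listing and may not match. It assumes that the xs:, djb:, and map: prefixes are declared on the stylesheet.

```xslt
<xsl:function name="djb:mode" as="xs:integer*">
    <xsl:param name="input" as="xs:integer*"/>
    <xsl:sequence select="
        let $distinct := distinct-values($input),
            $freqs := map:merge(
                (for $d in $distinct
                 return map:entry(count(index-of($input, $d)), $d)),
                map{'duplicates': 'combine'}),
            $max := max(map:keys($freqs))
        return
            $max ! $freqs(.)"/>
</xsl:function>
```

One choice in this sketch is ours rather than something the text specifies: the final step is written as $max ! $freqs(.) instead of $freqs($max). The two are equivalent when there is input, but the first form returns the empty sequence for empty input, where $max is empty, instead of passing an empty key to the map.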

The let … return construction allows us to create three variables and then return a result that depends on them. The variables are:

$distinct is a deduplicated sequence of the input values, that is, the input values with duplicates removed. We use this to count the number of times each one appears in the original input sequence.
$freqs is a map with a key for each unique frequency, where the value of the key is a sequence of values that appear in the input sequence with that frequency. We create this map by merging a sequence of one-item maps, the construction of which we discuss below.
$max is the largest key value, that is, the highest frequency with which any value occurs in the input.

Maps (see 3.11.1 Maps in the XPath 3.1 spec) are key:value pairs in which the keys (in our case the number of times a value appears in the input sequence) are atomic values (in our case integers, since they are frequency counts) that must be unique. Since there may be multiple input values that occur the same number of times, we approach the task by first constructing a separate one-item map for each unique integer value in the input, where the frequency with which that integer appears in the input is the key and the integer itself is the value. The part of our code that does this is:

The map:entry() function (line 15) constructs a one-item map and takes two arguments, the first of which is the key (in our case we count the number of times a particular distinct input value occurs in the input sequence) and the second of which is the value (in our case the integer whose frequency we are counting). When this XPath for … return expression finishes, it creates a sequence of one-item maps, which serve as the first argument to the map:merge() function wrapped around them (lines 13–16 in the full example, above). With our Hamlet example this produces the following four one-item maps: map{1: 5}, map{2: 2}, map{1: 4}, and map{1: 7}.

The order of these four maps is unpredictable because the order of the items returned by the distinct-values() function is unpredictable.

The map:merge() function as we use it takes two arguments, the first of which is a sequence of maps to combine into a single map and the second of which is a map of options that describe how the merge should proceed. The options are … er … optional, but because maps are not allowed to have duplicate keys and three of our one-item maps have the same key, we want to specify that in case of duplicate keys we want the values to be merged into a sequence. We do that by constructing, as our second argument to the map:merge() function, a one-item map with the key "duplicates" and the value "combine" (both key and value are in quotation marks because both are strings).
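
The effect of the "duplicates" option can be seen in isolation by merging literal maps. The sketch below hard-codes the one-item maps for the Hamlet scene counts (5, 2, 4, 7, 2) in one possible order; the variable name $freqs follows the text, and the xs: and map: prefixes are assumed to be declared on the stylesheet.

```xslt
<xsl:variable name="freqs" as="map(xs:integer, xs:integer+)" select="
    map:merge(
        (map{1: 5}, map{2: 2}, map{1: 4}, map{1: 7}),
        map{'duplicates': 'combine'}
    )"/>
<!-- The three maps that share the key 1 are combined into one entry,
     so $freqs(1) is the sequence (5, 4, 7) and $freqs(2) is 2 -->
```

We use single quotation marks around the option strings here only because the expression sits inside an attribute value that is delimited by double quotation marks.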

This part of the code illustrates two ways to create new one-item maps. One is to use the map:entry(key, value) function (line 15) and the other is to use the map constructor syntax map{ key: value } (line 16). These are discussed in the Maps in XPath section of the Saxon documentation.

The output of the map:merge() function is a single map with two key:value pairs: map{1: (5, 4, 7), 2: 2}.

The order of the key:value pairs inside a map, like the order of items in the output of distinct-values(), is unpredictable.

We bind this merged map to the variable $freqs (lines 13–16), generate a sequence of its keys with the map:keys() function (inside the parentheses in line 17; in our case the result is (1, 2)), identify the largest key value with the regular XPath max() function (to the right of the := symbol in line 17), and bind that value to a variable called $max (line 18). There is guaranteed to be exactly one largest key value because our merge operation ensured that all keys would be unique by combining into a single sequence, with a single key, the values for keys that were duplicated in the earlier sequence of one-item maps.

One way to look up the values associated with a key in a map is to follow the map name with parentheses and write the key value inside the parentheses. For that reason, $freqs($max) returns the value associated with the largest key, which in our case is the value 2. This means that the most frequently any value occurs in the input sequence is twice (our key), and the value that occurs twice is 2, which is the count of scenes in Acts 2 and 5. Had there been more than one act with the same most frequent number of scenes (for example, had there been 2 acts with 2 scenes, 2 acts with 4 scenes, and 1 act with 6 scenes), our initial input would have been the sequence (2, 2, 4, 4, 6), the highest frequency (the value of $max) would still have been 2, and $freqs($max) would have returned the sequence (2, 4).

XSLT with maps

The XPath map functions map:entry() and map:merge() and the map constructor map{ } used above have counterparts in XSLT. A one-item map can be created with <xsl:map-entry> and one-item maps can be merged by wrapping the <xsl:map-entry> elements in <xsl:map>. Below is an implementation that uses these XSLT map methods:

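
The listing is not reproduced here. The sketch below is reconstructed from the description in this section and uses the variable names $groups and $keyed-by-freq that the text mentions; the declared types and other details are our assumptions. It assumes that the xs:, djb:, and map: prefixes are declared on the stylesheet.

```xslt
<xsl:function name="djb:mode" as="xs:integer*">
    <xsl:param name="input" as="xs:integer*"/>
    <!-- One one-item map per distinct input value: key = value, value = frequency -->
    <xsl:variable name="groups" as="map(xs:integer, xs:integer)*">
        <xsl:for-each-group select="$input" group-by=".">
            <xsl:map-entry key="current-grouping-key()" select="count(current-group())"/>
        </xsl:for-each-group>
    </xsl:variable>
    <!-- Inverted and merged: key = frequency, value = all input values with that frequency -->
    <xsl:variable name="keyed-by-freq" as="map(xs:integer, xs:integer+)">
        <xsl:map>
            <xsl:for-each-group select="$groups" group-by="?*">
                <xsl:map-entry key="current-grouping-key()"
                    select="current-group() ! map:keys(.)"/>
            </xsl:for-each-group>
        </xsl:map>
    </xsl:variable>
    <!-- Look up the entry for the greatest frequency -->
    <xsl:sequence select="$keyed-by-freq?(max(map:keys($keyed-by-freq)))"/>
</xsl:function>
```

Grouping before building the outer map matters because <xsl:map> does not accept duplicate keys; each frequency therefore has to be reduced to a single <xsl:map-entry> first.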

We begin the same way that we did with our implementation in the main body of this tutorial, by grouping the input values according to value. We do this inside an <xsl:variable> element to construct a variable called $groups, the members of which are one-item maps with input values as keys and frequencies as values. With our Hamlet example the value of $groups is a sequence of four one-item maps: (map{5: 1}, map{2: 2}, map{4: 1}, map{7: 1}).

We now want to invert the keys and values and merge the result into a single map, and because keys must be unique, that means that we need to combine the three one-item maps for input items that occur only once into a single map entry. XSLT does not have a counterpart to the XPath map:merge() function, and although there’s no impediment to using that function in an otherwise XSLT context, we instead implement the merge with XSLT resources to create a new variable called $keyed-by-freq. We use grouping again, this time grouping the one-item maps according to their values, and we use the ?* notation to return, as a grouping key, the value of the individual map items (that is, the frequencies; see 3.11.3 The Lookup Operator ("?") for Maps and Arrays in the XPath 3.1 spec for more information). Since we are now grouping by the values of the maps in the $groups variable, which are frequencies, the keys for the new map will be 1 and 2. The values in the new map are the keys from the old map, and we obtain those by applying the map:keys() function to each item in the group we are processing at the moment. Because the new keys are the frequencies, their largest value is the greatest frequency, which we compute by applying the max() function to the keys of $keyed-by-freq (and we find those with the XPath map:keys() function). We then use the lookup operator mentioned above to look up (that is, retrieve) the value associated with that key in our map.
