Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-12-08T15:54:56+0000
The XPath and XQuery Functions
and Operators 3.1 specification (W3C Recommendation, 21 March 2017) documents
the standard XPath functions that are automatically available for use in XSLT. The
functions we use most are in a namespace conventionally bound to the
fn:
namespace prefix; this is the default function
namespace, which means that it doesn’t have to be (and usually isn’t) specified, so
that, for example, fn:contains()
and
contains()
refer to the same function. Additional
standard functions are available automatically, in their own namespaces, for
mathematics, arrays, and maps. Finally, functions that are not part of any formal
specification but that are nonetheless in wide use are available in popular external
libraries, such as Priscilla Walmsley’s FunctX and Michael Kay’s Saxon extension functions.
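As a minimal sketch of how these namespaces look in practice, the stylesheet below calls the same function with and without a prefix and then calls a function from the mathematics namespace. It assumes that the `fn:` and `math:` prefixes are bound explicitly on the root element (in an XSLT stylesheet a prefix has to be declared before it can be used) and that the processor supports XSLT 3.0 with a named initial template.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- The same function, called with and without the prefix -->
        <xsl:value-of select="fn:contains('stylesheet', 'sheet')"/>
        <xsl:text>&#x0a;</xsl:text>
        <xsl:value-of select="contains('stylesheet', 'sheet')"/>
        <xsl:text>&#x0a;</xsl:text>
        <!-- Mathematical functions are not in the default function namespace,
             so they always need their prefix -->
        <xsl:value-of select="math:sqrt(16)"/>
    </xsl:template>
</xsl:stylesheet>
```

The first two calls return the same result because both names resolve to the same function in the default function namespace.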
Despite the richness and variety of these resources, once you begin to develop any sort of substantial XSLT stylesheets you are likely to discover a need for functions that are not available in existing resources. To meet this need XSLT provides a facility for users to create their own functions, using standard XSLT and XPath resources as building blocks, and integrating user-defined functions into your XSLT stylesheets can often make them easier to write, understand, debug, modify, and maintain. This tutorial offers a hands-on introduction to creating your own XSLT functions. The major parts of the tutorial are:
The text of this tutorial is relatively long, but much of it is prose and the amount of actual code is quite limited. At the same time, many features of the code (not only concerning functions) will be new, and we encourage you to try the examples and think about ways in which what you will learn about user-defined functions can be applied to your own XSLT. While it is not always necessary to create your own user-defined functions, we’ve found that the mindful incorporation of user-defined functions into most of our stylesheets ultimately reduces our development time and simplifies our development effort and process overall.
Statisticians work with three basic types of averages: the arithmetic mean, the median, and the mode:
The arithmetic mean, which is what we usually think of as an
average
, is equal to the sum of the numbers we are averaging
divided by their count. For example, the mean of the 5 values
(3, 0, 3, 7, 8)
is
4.2
because the sum of the five values
is 21
and the result of dividing
21
by
5
(the number of values in the
sequence) is 4.2
.
The median is the number in the middle of a sorted sequence of
values. The median of (3, 0, 3, 7, 8)
is 3
because, once the values are sorted as (0, 3, 3, 7, 8), 2 of the other 4 values (0 and 3) are less than or equal to the middle value 3 and the remaining 2 (7 and 8) are greater than or equal to it
. If
there’s an even number of values there isn’t a single one in the middle, so
in that case the median is defined as the mean of the two middle values. For
example, if there are 6 values, we sort them and average the middle 2 (the
third and fourth) to obtain the median.
The mode is the value that occurs most frequently in a sequence of
values. The mode of (3, 0, 3, 7, 8)
is
3
because that value occurs twice and
all other values occur only once. A non-empty sequence of values always has
at least one mode, but it can have more. For example, the two modal values
of (3, 0, 3, 8, 8)
are
3
and
8
because they both occur twice and the
only other value (0
) occurs only once.
If all values in the sequence are unique, that is, occur only once, every
one is a modal value.
A function that computes these values must also have defined behavior for unusual input, commonly called edge cases. For example, what is the mean, median, or mode of a sequence of zero items? What is the mean, median, or mode if some or all of the supplied values are not numeric? A function will always do something no matter what the input, even if that something is to raise an error or return an empty sequence, and it’s up to the developer to decide what that something should be.
The avg()
function in the standard XPath library
computes an arithmetic mean, but there are no standard library functions to compute
the median or mode. To learn about user-defined functions we’ll do three things:
In this tutorial we’ll implement, step by step, our own version of the
standard library avg()
function, that
is, our own function to compute an arithmetic mean. In Real Life we would
just use the standard library function, and we reimplement that
functionality here only for pedagogical purposes, so that we can use the
built-in function as a point of reference.
In this tutorial we’ll then implement our own function to compute a mode. There is no standard function that performs this operation, which means that creating a user-defined function for this purpose may be genuinely useful.
As a homework task you’ll implement your own function to compute a median. We describe that task in more detail on the assignment page.
The user-defined function below computes the arithmetic mean of a sequence of numbers. We will explain the individual steps below (= don’t be concerned about the parts that are unfamiliar at first), but so that you can keep the final result in mind, our function will eventually look like the following:
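A minimal sketch of such a function, consistent with the behavior this tutorial describes (numeric input typed as `xs:double*`, an empty sequence returned for empty input), might read as follows. The stylesheet wrapper, the initial template, and the exact arrangement of the function body are assumptions made for illustration and may differ from the tutorial's own listing.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- djb:mean: arithmetic mean of zero or more numbers (illustrative sketch) -->
    <xsl:function name="djb:mean" as="xs:double?">
        <xsl:param name="input" as="xs:double*"/>
        <xsl:choose>
            <xsl:when test="empty($input)">
                <!-- Empty input: return an empty sequence, as avg() does -->
                <xsl:sequence select="()"/>
            </xsl:when>
            <xsl:otherwise>
                <!-- Sum of the values divided by their count -->
                <xsl:sequence select="sum($input) div count($input)"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

    <!-- Hypothetical entry point, so the sketch can be tried without an input document -->
    <xsl:template name="xsl:initial-template">
        <xsl:value-of select="djb:mean((3, 0, 3, 7, 8))"/>
    </xsl:template>
</xsl:stylesheet>
```

Run with a processor that supports XSLT 3.0 and a named initial template, the sketch should output 4.2, the mean of the five sample values discussed earlier.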
The actual function definition, excluding comments, takes only 11 lines of code. It can be made even shorter, but the version above offers what we regard as a reasonable balance of concision and legibility.
We can think of a function as a bit of code that accepts input (called the function
parameters) and returns a result. Not all functions require
user-supplied input (for example, the standard library
current-date()
function does not accept input
parameters) and functions may return an empty sequence as their result. An empty
sequence is not an error in XPath or in XSLT, which means that returning an empty
sequence is a variety of returning a result.
Long XSLT templates can be difficult to understand, debug, modify, and maintain because they implement a lot of functionality in one place. For example, if you have a template that computes complex values and outputs them in different rows and cells of an HTML table, the logic of the table organization may be difficult to see and understand if it is broken up, within your template, by the logic that computes the content values. You can improve the legibility by computing the values at the beginning of the template and assigning them to variables, which you then reference in place in the constructed HTML output, but if you need to perform the same types of computation in multiple locations, you don’t want to have to duplicate the code. What you’d like is to be able to use functions that do exactly what you need anywhere in your stylesheet with as much ease as you use standard XPath library functions. If you had that facility, a template that is responsible primarily for HTML table structure could focus on that structure, and much of the code that computes the values that will eventually populate the cells could be located elsewhere.
This type of organization is automatic when you use the standard function library.
For example, if you want to insert the mean of each row of values in a table into a
cell at the end of the row, you can pass the values into the standard XPath
avg()
function and it will return the result. Your
template doesn’t have to include code that divides the sum of the values by the
count of the values (that is, the code that computes the average) because that
responsibility lies elsewhere, and all you need to do is invoke the
avg()
function by name and supply its input. But,
as noted above, there is no standard library function that will compute the median
or the mode of a sequence of values, and although you can piece the functionality
together yourself, if you do it inside the
<td>
cell tags that will hold the result, or
even in a variable that you compute within the template and then reference in the
cell, your template grows longer and harder to work with.
An alternative to monolithic templates is to decompose your computation into stand-alone functions that you can invoke similarly to the way you invoke functions from the standard library. This approach may require additional code initially in order to manage the overhead of defining the functions, but there are at least three types of compensatory benefits:
If you isolate stand-alone bits of functionality in their own functions, you can concentrate on them when you develop them and then not have to think about (or even notice) how they work when you use them in your templates. This is similar to the way you don’t have to think about the standard library functions beyond knowing their signatures (names, parameters, types, etc.). As a result, your templates become shorter and easier to read because they are not trying to do everything in one place.
Functions can invoke other functions, both those from the standard library and those that you write yourself. This means that not only can a complex template offload some of its computation onto a user-defined function, but a user-defined function that needs to do several things can offload its individual tasks onto smaller user-defined functions.
If you later decide to change your implementation of a function, you can do it in its own place, without having to drill down to the few relevant lines inside a large template that does many things at once. In a collaborative development environment, such as with our course projects, this makes it easy for one team member to be responsible for developing a particular bit of functionality while another team member concentrates on structuring the template that will use the output of that functionality.
The code we develop does not always do what we think it is doing. For
example, if we want to know how long the individual speeches in
Hamlet are and we measure them with
string-length()
we might fail to notice
ahead of time that we don’t want to count the whitespace characters
introduced by pretty-printing. Or we might want to perform integer division
but we accidentally use the div
operator
instead of idiv
. If the input we’re
focusing on happens not to have extraneous whitespace or happens not to
leave a remainder under division, we may not realize that our code would
give results, but not the results we want, with different input. This is the
most dangerous type of computational mistake because it doesn’t raise an
error message; it just quietly gives the result we asked for even when that
is not the result we thought we were asking for. For this reason software
developers write tests that supply a range of different sample inputs to a
function and verify that the result is what it should be, a development
strategy called unit testing.
If a large, composite template that produces a lot of output does not perform as expected, locating the source of the error can be challenging. If, though, we can test separately each function that contributes to that composite output, we have a better chance of finding and fixing the source of a problem. A unit is a small, isolatable bit of functionality and the goal of unit testing is to anticipate the variety of types of input a function might receive and confirm that it performs as intended with any input it might encounter. We don’t discuss how to perform unit testing here, but our point is that isolating small coherent bits of functionality in their own functions makes it possible to test separately the parts of the logic that contribute to a final composite result. In other words, you want to write testable code.
It may seem at first as if the fragmentation of complex processing into separate functions will complicate development because the code on which a template relies may be located in different places in the stylesheet. A more useful perspective is that:
We already work with standard library functions without having to think about where the code that makes them work is located. Once we are confident that a function does what we want, we don’t have to look inside it while developing a template or another function that uses it.
Templates and functions become harder to test and to debug as they attempt to
do more. A common guideline is that if your template or function requires
more than one screen of code, you might want to consider separating out
smaller units of functionality into their own function definitions.
(Revising an implementation without changing the core functionality is
called refactoring.) There’s nothing magic about “one screen of code”; the point is that it is usually easier to develop
something small that relies on functions encoded elsewhere than to have to
consider all of the functionality at the same time and in the same
place.
Small, independent units of functionality can often be reused, and coding something once and reusing it takes less time and effort and is less prone to error than reimplementing essentially the same functionality separately in multiple places.
Functional programming entails some assumptions that may at first seem
counter-intuitive to developers who are new to the paradigm. For example, XPath and
XSLT cannot change the input file as they process it, although they can return an
altered version of it as output. XPath and XSLT also cannot change variable values,
which means that we cannot increment a counter with a statement like
$x = $x + 1
(although there are, of course,
alternative ways of maintaining a count). The benefit of these constraints is that
they make it possible for functions to be applied to sequences in any order,
including in parallel (that is, to process several items in an input sequence at
once, instead of looping through them one by one, waiting for each to finish before
beginning the next). The order in which operations happen in XSLT is largely
transparent to the user because in a declarative language we state the desired
result of an operation without having to know all of the details about how our code
produces that result, but we nonetheless benefit from more efficient execution when
the computer can do things as soon as it has the resources available to do them, and
is not constrained by a possibly suboptimal execution that reflects a step-by-step
human perspective.
The Computational stylesheets
chapter of Kay (pp. 985–1000) provides a
clear introduction to functional programming in XSLT that is accessible to those
who are not trained professionally in computer science. This would be a good time
to read it!
One example of where not being constrained by a pre-determined order of execution can
be helpful is that when you write XPath like
//sp ! string-length()
to compute the length of
each speech in a play, the processor does not have to loop through the speeches one
by one, waiting for the first one to be measured before the second can begin. And
although an XPath expression like this one is guaranteed to return the
results in a way that is consistent with document order, it is not required
to process them in document (or any other specific) order. The processor is
able to optimize the order of execution only because XSLT and XPath cannot change
state (that is, cannot change the input document, variable values, or
other items on which other processing steps depend), so no operation has to wait for
another to finish to ensure that the state hasn’t changed midstream.
The XPath for
expression may look like a
sequential operation, but
for $speech in //sp return string-length($speech)
can process the speeches in any order, and even in parallel, exactly like
//sp ! string-length()
. It has to return
results so that the order of the output items corresponds to the order
of the input items, but it doesn’t have to process them in that order.
The same is true of <xsl:for-each>
and
the same is true when we apply templates to a sequence of nodes. In technical
terms these are mapping operations, rather than loops,
where mapping means that each item in the input sequence is mapped to
a corresponding item in the output sequence. As Kay explains:
Although there is a defined order of processing, each item in the sequence is processed independently of the others; there is no way that the processing of one item can influence the way other items are processed. This also means that you can’t break out of the loop. Think of the items as being processed in parallel. (323)
The crucial difference between mapping and looping that makes mapping potentially more efficient computationally is that mapping doesn’t have to happen one item at a time in input order.
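The equivalence of these mapping operations can be sketched with a small self-contained stylesheet. Because no play is available here, the sketch assumes a hypothetical global variable `$speeches` holding three short strings in place of `//sp`; the three constructions should produce the same lengths in the same order.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical stand-in for the speeches of a play -->
    <xsl:variable name="speeches" as="xs:string+"
        select="'To be', 'or not to be', 'that is the question'"/>

    <xsl:template name="xsl:initial-template">
        <!-- Simple map operator -->
        <xsl:value-of select="$speeches ! string-length()"/>
        <xsl:text>&#x0a;</xsl:text>
        <!-- XPath for expression -->
        <xsl:value-of select="for $speech in $speeches return string-length($speech)"/>
        <xsl:text>&#x0a;</xsl:text>
        <!-- xsl:for-each, with a space written between items by hand -->
        <xsl:for-each select="$speeches">
            <xsl:value-of select="string-length()"/>
            <xsl:if test="position() ne last()">
                <xsl:text> </xsl:text>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Each of the three output lines should read `5 12 20`. In none of the three cases can the processing of one string affect the processing of another, which is what leaves the processor free to choose the order of evaluation.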
There are, of course, operations that have to happen in a particular order. For
example, if we compute a value with count()
or
sum()
or some other function that outputs a
numerical result and then format it with
format-number()
, the result of the formatting
operation cannot be created until the result of the operation that does the
counting or addition is available for use as its input. But even in cases like
this, if we are counting and formatting many things (for example, the number of
speeches in all of the scenes of a play), we don’t have to do all of the
counting for all of the speeches first and then format all of the counts, and we
don’t have to count and format each individual scene, from data to final output,
before we begin the next. The functional model means that the order in which we
apply the steps in the processing pipeline (data → counting → formatting →
output) to the scenes can be optimized by the processor to take advantage of
whatever input is available at any point in the operation.
A functional language also ensures that a function will always produce the same
output when it is given the same input in the same context. This is how the XPath
generate-id()
function can output the same value
for the same node when invoked in different locations in the stylesheet, a strategy
you may have applied to completing our Shakespeare sonnets task. It is also why, perhaps surprisingly, we cannot
measure the amount of time an operation takes by calling
current-time()
before and after the execution; the
value returned by current-time()
, like the value
returned by any XPath function, is guaranteed to be the same each time it is called
with the same input (which in this case means each time it is called, since this
particular function does not accept input arguments). The next time you run the
entire transformation the value will be different, but within a single
transformation the same function with the same input is not permitted to yield
different results.
A few XPath functions are not fully deterministic, that is, do not
always produce exactly the same result when invoked with the same input. For
example, the distinct-values()
function is
non-deterministic with respect to ordering, which means that two invocations of
the function with the same input will always produce the same inventory of
values, but it is not guaranteed to return them in the same order. See the
discussion of deterministic and non-deterministic functions in the XPath functions
spec for details.
Part of the efficiency of functional programming comes from the way function chains can be processed. If, in our earlier example, we want to compute the character length of speeches in a play after whitespace normalization, we can chain together two standard library functions, e.g.:
//sp ! normalize-space() ! string-length()
Because functional processing is not required to perform repeated operations as one-by-one, sequential iterations, and because existing values (the input document, variables) cannot be changed, the processor is free to perform the two operations (whitespace normalization, string-length measurement) whenever its optimizations suggest. It may create all of the whitespace-normalized values first and then compute their length, it may whitespace-normalize each string and then immediately compute its length, or it may use some combination of the two. And, as mentioned earlier, it may perform those operations over the members of the input sequence in any order, including some at the same time, even though it must return them in a way that is consistent with the input order.
A function has a signature, which consists of its name (in a namespace), the input it accepts (parameters), and the result it returns. The rest of the function body constructs the result, which is returned when the function is called. These parts are described in more detail below.
A preliminary note about terminology: Parameters are the input items that a function expects, as described in the function declaration. Arguments are the values supplied inside the parentheses when a function is used. When a function is called and executed, it assigns each argument in the function call, in order, to a parameter inside the function body, in the same order. The close relationship of parameters to arguments makes it difficult at times to decide which term to use, and for all practical purposes they can be thought of as equivalent.
A user-defined function created with
<xsl:function>
has a name, and you
should choose one that is reasonably self-documenting, that is, that describes
what the function does. Use a consistent naming strategy for your functions; for
example, don’t use camel case (e.g.,
myFirstFunction()
), kebab case
(e.g., my-second-function()
), and snake
case (e.g., my_third_function()
)
function names in the same stylesheet. XSLT won’t care if you mix your naming
strategies, but if you do, you’ll forget the names of your own functions when it
comes time to use them.
The name of a user-defined function must be in a namespace and the namespace must
be different from reserved namespaces, such as those used for the
library functions built into XPath. You should choose a namespace value that is
unlikely to be chosen by someone else, so that you will be able to import and
use someone else’s functions without introducing conflicts with your own
functions. It is customary to use a URI as a namespace value, but that is not
required, and the URI is not obligatorily a URL, which is to say that it does
not have to point to an existing resource on the Internet. In line 4 of the
example below we use http://www.obdurodon.org
as the namespace for our user-defined functions and we bind it to the namespace
prefix djb:
; this is identical to the way we
bind the prefix xsl:
to the XSLT namespace.
When we then define the function in line 10 we use its full name, with the
namespace prefix, and when we eventually use it in a transformation we’ll call
it as djb:mean()
. For your own projects you
might want to use a short version of your project name as the namespace prefix
and the main URL of your project as the associated namespace value.
The skeleton of a stylesheet that includes a user-defined function looks like the following:
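The sketch below is one arrangement that agrees with the line references in this section: the `djb:` namespace declaration falls on line 4 and the (still empty) function definition occupies lines 10 through 12. The remaining attributes and the `<xsl:output>` settings are assumptions made for illustration.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="xml" indent="yes"/>

    <!-- User-defined functions are top-level children of xsl:stylesheet -->
    <xsl:function name="djb:mean">
        <!-- Parameters and the function body will go here -->
    </xsl:function>
</xsl:stylesheet>
```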
As mentioned above, we declare the namespace for our function in line 4, and our
<xsl:function>
element, which defines
our function, begins on line 10 and ends on line 12. At the moment this element
is empty, and we will add functionality to it below in a graduated way.
Functions in XSLT are identified by their name and the number of arguments they
expect (the number of arguments expected is called the function’s
arity). The skeletal example above does not yet specify any input
parameters, which means that it has an arity of zero and must be called without
any arguments. This isn’t what we want, since we need to be able to input into
our function the values for which we are computing a mean, and we describe in the
following section how we specify that our function requires that input. The fact
that functions are identified by a combination of their name and their arity
means that multiple functions can have the same name as long as they accept
different numbers of arguments. A function that appears to accept a variable
number of arguments is understood by XSLT as different functions that happen to
share the same name but have different arity. For example,
tokenize("my input", "\s+")
takes two
arguments, the string being split and the regex on which to perform the
splitting, and there is also a one-argument version (e.g.,
tokenize("my input")
) that splits on whitespace, roughly as if
\s+
had been supplied as the
second argument (the one-argument version also ignores leading and trailing whitespace). From an XSLT perspective, those are two different functions
with different arity that happen to share a name.
The reason this detail matters is that you cannot define a function that takes a
variable number of arguments, which you might want to do if you would like there
to be a default value for one of your arguments, as is the case with
tokenize()
, above. If you want to achieve that
effect you need to define separate functions, each with a different number of
parameters, and the one with fewer parameters can then call the one with more
parameters, passing along the original input and supplying default values for
the additional parameters. The fact that functions are defined by a combination
of their name and their arity also means that even though you can (and should)
specify the datatypes of your parameters (see below), you cannot define two
different functions with the same name and the same arity that differ only in
the datatypes required for their parameters.
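The default-value pattern described above can be sketched as follows. The function name `djb:join`, its parameters, and the default separator are all hypothetical and serve only to illustrate two functions that share a name but differ in arity.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Arity 2: the caller supplies the separator -->
    <xsl:function name="djb:join" as="xs:string">
        <xsl:param name="items" as="xs:string*"/>
        <xsl:param name="separator" as="xs:string"/>
        <xsl:sequence select="string-join($items, $separator)"/>
    </xsl:function>

    <!-- Arity 1: delegates to the arity-2 version with a default separator -->
    <xsl:function name="djb:join" as="xs:string">
        <xsl:param name="items" as="xs:string*"/>
        <xsl:sequence select="djb:join($items, ', ')"/>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <xsl:value-of select="djb:join(('a', 'b', 'c'))"/>
        <xsl:text>&#x0a;</xsl:text>
        <xsl:value-of select="djb:join(('a', 'b', 'c'), '-')"/>
    </xsl:template>
</xsl:stylesheet>
```

The first call should produce `a, b, c` and the second `a-b-c`. The real logic lives only in the two-parameter version, so a later change to the implementation has to be made in just one place.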
The first children of the <xsl:function>
element, which defines a function, are its parameters, which are specified as
empty <xsl:param>
elements. Our function
to compute a mean will accept one argument, a sequence of numbers, so we need to
augment our function definition by adding an
<xsl:param>
child element:
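A sketch of the function declaration at this stage might look like the fragment below. It assumes that the parameter is named `input` (the name used later in this tutorial) and that the fragment sits inside an `<xsl:stylesheet>` element on which the `djb:` prefix is bound.

```xslt
<xsl:function name="djb:mean">
    <xsl:param name="input"/>
</xsl:function>
```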
You can call your parameters whatever you want. If a function requires more than one argument, we must declare one parameter for each argument, and the arguments supplied when the function is used are assigned to the parameters in the order in which they appear, so that the first argument becomes the value of the first parameter, etc.
XSLT function arguments are positional, that is, their role in the function is determined by the way they are assigned to parameters in the order in which they are listed inside the parentheses when the function is used. There are programming languages that have named parameters that are not sensitive to order, but XPath and XSLT do not support named function parameters and all function parameters are therefore positional.
It is customary to pass all information that a function requires into the function as arguments, that is, as input supplied inside the function parentheses each time the function is used. Functions also have access to stylesheet variables without their being passed in explicitly because stylesheet variables are global by definition, and therefore available anywhere in the stylesheet, including inside user-defined functions. Despite the availability of stylesheet variables, some developers avoid using them inside user-defined functions and instead pass them in explicitly as function arguments inside the function parentheses (and then assign them internally to function parameters) when they are needed, reasoning (correctly) that functions that are not wholly self-contained, and that rely on information elsewhere in the stylesheet that is not supplied explicitly to the function when it is invoked, are more difficult to debug, maintain, and reuse. Other developers are less categorical, reasoning (also correctly) that using some types of stylesheet variables (such as constants with atomic values that do not depend on the context item) can simplify the function code. A function does not have access to any nodes in the tree being processed unless they are either supplied as arguments when the function is called or assigned to stylesheet variables and therefore globally available.
The standard library function avg()
accepts not only numerical values as input, but also durations (there are
pre-defined datatypes for durations expressed in years, months, days, hours,
minutes, seconds), and it’s capable of averaging those. Since the purpose of
writing our own version is just to illustrate how to go about developing a
user-defined function, we’ve taken the liberty of simplifying the task, so
our version accepts only numbers as input, and not durations.
An <xsl:param>
element has an optional
@as
attribute that specifies the datatype of
the expected input. Since our function requires a sequence of numbers as its
input, we can (= should) use this attribute to specify that the datatype of our
single input parameter must be a sequence of numerical values. We don’t care
whether the numerical values are integers (whole numbers) or doubles (numbers
with digits to the right of the decimal point, like
4.2
) or some of each, but since every
integer can be treated as a double (e.g., the integer
4
is numerically equal to the double
4.0
), we specify the input as a sequence of
doubles.
We’ve decided that our function can only accept arguments that are numeric (or
that can be treated as numeric), but we also need to decide whether the function
should accept an empty sequence, that is, a sequence of zero (numeric) items. If
you decide that supplying an empty argument should be valid, you need to decide
what the return value would be, since the arithmetic mean of the input
values
doesn’t have an obvious meaning when the input contains no
values. Because the standard library avg()
function returns an empty sequence when the input is an empty sequence, we’ve
decided (somewhat arbitrarily) to mimic that behavior: we will not regard an
empty input sequence as an error and we will return an empty sequence as a
result. We will, though, regard input values that are not numbers and not able
to be understood as numbers (for example, strings) as erroneous, and we’ll
terminate the operation and report an error should we encounter that sort of
input to our function.
You want to plan what your function will do with edge cases before you implement it. Whether you allow an empty sequence as input to your function or regard it as an error is up to you, and what the function should return if you do allow an empty sequence as input is also up to you, since there’s no obvious single correct answer. The take-away is that you should make those decisions before you write the code that will handle the input.
Although the @as
attribute is optional as far
as XSLT syntax is concerned, it is good practice (and required for all work in
our course) always to specify the datatypes of all parameters. For that reason,
we now enhance our skeletal function declaration by adding a datatype
specification (see line 2):
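With the datatype added, the declaration might look like the fragment below, where the `@as` attribute appears on line 2. As before, this is a sketch that assumes the enclosing `<xsl:stylesheet>` element binds the `djb:` prefix, and also that it binds `xs:` to `http://www.w3.org/2001/XMLSchema`.

```xslt
<xsl:function name="djb:mean">
    <xsl:param name="input" as="xs:double*"/>
</xsl:function>
```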
Datatype specifications use the same occurrence indicators as Relax NG, so the
asterisk means that the input must be a sequence of zero or more numbers (or
items that can be treated as numbers). The most common datatypes for atomic
values are xs:integer
,
xs:double
(numbers with decimal components),
and xs:string
. Elements can be specified as
element()
(which allows any element type) or,
if a specific element type is required, with the element name inside the
parentheses, so that a function that required as input a sequence of one or more
<speech>
elements would specify the type
of the parameter as element(speech)+
.
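As an illustration of an element datatype with an occurrence indicator, the sketch below defines a hypothetical function `djb:total-length` that requires one or more `<speech>` elements. It assumes an input document whose `<speech>` elements are in no namespace.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical helper: total whitespace-normalized length of one or more speeches -->
    <xsl:function name="djb:total-length" as="xs:integer">
        <xsl:param name="speeches" as="element(speech)+"/>
        <xsl:sequence select="sum($speeches ! string-length(normalize-space(.)))"/>
    </xsl:function>

    <xsl:template match="/">
        <xsl:value-of select="djb:total-length(//speech)"/>
    </xsl:template>
</xsl:stylesheet>
```

If the input document contains no `<speech>` elements, the plus sign in the parameter type means that the call raises a type error instead of quietly returning zero.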
The main reason we always specify datatypes on our parameters (and variables) is
that we want the XSLT processor to notify us should we accidentally supply a
value of the wrong type. Inexperienced developers sometimes omit the
@as
in a misguided quest to reduce the number
of error notifications. This is misguided because reducing the number of
notifications does not reduce the number of errors, and it
is obviously better to be notified when something you did not expect happens
than to receive, accept, and trust an erroneous result because you never noticed
that you were passing your function an erroneous value. It isn’t possible to
trap all developer errors, but checking that your datatypes are what you expect
is much better than checking nothing.
When it comes time to use our parameters within the function body, we refer to
them by name, prefixed with a dollar sign, so that, for example, the parameter
we declare above can be referenced as $input
.
Parameters are thus similar to XSLT variables: they are declared with a
@name
attribute and given a name that does not
begin with a dollar sign, but when they are referenced subsequently, the dollar
sign is prepended to them.
A function returns the result of evaluating the code inside the function body
after the parameter declarations. Unlike in some other programming languages,
there is no explicit return statement; the result of the function
that is returned is simply the result of evaluating the function body. The body
may include literal result elements, it may incorporate the result of applying
templates to whatever was passed in as input, and it may construct new atomic
values. It is common to return a result with the
<xsl:sequence>
element, although this is
not required or expected.
If our function is returning a single atomic value (such as a string or
number, or something that can be treated as a string or number), we can use
<xsl:value-of>
instead of
<xsl:sequence>
. But because
<xsl:value-of>
can only create a
text node (that is, can only create a single string), if we want to return
elements or attributes or more than one thing, our only option is
<xsl:sequence>
or a sequence
constructor (that is, literal elements and other content). There are also
some subtle differences in whitespace handling when we return multiple
values with <xsl:value-of>
vs
<xsl:sequence>
.
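The difference can be seen in a small experiment. This is a sketch, not part of the tutorial's own code: the two function names and the djb: namespace URI are hypothetical, and the stylesheet is meant to be run from the named initial template (for example with the -it option in Saxon).

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical function: returns three separate integers -->
    <xsl:function name="djb:with-sequence" as="item()*">
        <xsl:sequence select="(1, 2, 3)"/>
    </xsl:function>

    <!-- Hypothetical function: returns one text node with the string value "1 2 3" -->
    <xsl:function name="djb:with-value-of" as="item()*">
        <xsl:value-of select="(1, 2, 3)"/>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <!-- Three items -->
        <xsl:value-of select="count(djb:with-sequence())"/>
        <xsl:text>&#x0a;</xsl:text>
        <!-- One item, because xsl:value-of merged the values into a single text node -->
        <xsl:value-of select="count(djb:with-value-of())"/>
    </xsl:template>
</xsl:stylesheet>
```

The first count is 3 and the second is 1, which is why a function that should return several values needs xsl:sequence.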
Just as it is possible to specify the datatype of input parameters by using the
@as
attribute on
<xsl:param>
elements, it is also
possible (and good practice, and required for all code in this course) to
specify the datatype of the result of a function by adding the
@as
attribute to the
<xsl:function>
element. Our function
will return an empty sequence if the input is an empty sequence, but in all
other cases where we supply valid input it will return a single double (that is,
a single numerical value that may have digits to the right of the decimal
point), so we now enhance our skeleton to add that information (see line 1):
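A sketch of the skeleton with the return type added, under the same assumptions as before (the djb: namespace URI is illustrative). Counting from the xsl:function start-tag, the new @as attribute is on line 1.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org" version="3.0">
    <!-- Assumption: the URI bound to djb: is illustrative -->
    <xsl:function name="djb:mean" as="xs:double?">
        <xsl:param name="input" as="xs:double*"/>
        <!-- The function body is still to be written -->
    </xsl:function>
</xsl:stylesheet>
```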
Our function returns an empty sequence if the input is an empty sequence, and
otherwise it divides the sum of the input values by their count. We implement
this logic inside the function body, after the
<xsl:param>
elements, as follows:
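The logic just described might be implemented along the following lines. This is a sketch reconstructed from the prose: the variable name n and the djb: namespace URI are assumptions, and the line numbers cited in the discussion refer to the original listing, so they need not match this one.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org" version="3.0">
    <!-- Assumption: the URI bound to djb: is illustrative -->
    <xsl:function name="djb:mean" as="xs:double?">
        <xsl:param name="input" as="xs:double*"/>
        <!-- Count once and reuse: the count is needed for the test and for the division -->
        <xsl:variable name="n" as="xs:integer" select="count($input)"/>
        <xsl:choose>
            <!-- Empty input: the empty xsl:when returns an empty sequence -->
            <xsl:when test="$n eq 0"/>
            <!-- Otherwise divide the sum of the values by their count -->
            <xsl:otherwise>
                <xsl:sequence select="sum($input) div $n"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>
</xsl:stylesheet>
```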
We count the number of input items and bind that count to a variable because we
use the value twice, first to test whether there are any input items (since zero
input items is a special case that requires special handling) and then, if there
are input values, to compute the mean. Binding a value that we use more than
once to a variable makes our code easier to understand, it reduces the
computational overhead (since we don’t perform the same computation more than
once), and it reduces the opportunity for error. The empty
<xsl:when>
statement on line 6 correctly
returns an empty sequence if there are no input values, and the
<xsl:otherwise>
statement below it
computes the mean in the traditional way if there are input values. The
@as
specification on line 2 ensures that
the function will correctly raise an error if any of the input values are not
numbers (or items that can be treated as numbers).
Our function now matches the completed version at the top of this tutorial, except that we have not yet added any documentation. We address that concern in the next section.
While the structure of a function definition is partially self-documenting (we
can see the name, the parameters, and the type information), we recommend adding
explicit documentation that describes in reasonably natural language what the
function does and what its input and output look like (and all code submitted
for this course requires that type of documentation). Our preference is to
format this documentation as XML comments immediately after the
<xsl:function>
start-tag, and we find
the comments easiest to read if they are arranged as in the example above
because both the comment as a whole and its individual parts stand out by virtue
of the separators and the blank lines. With that said, any documentation that
you find easy to read is acceptable.
Some developers prefer to write comments before the function definition,
rather than inside it. We prefer comments inside the
<xsl:function>
element because when
we collapse the element in the <oXygen/> editor to reduce clutter when
we aren't working on it, comments inside are automatically collapsed as
well, while preceding comments would have to be collapsed as a separate
step.
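One possible layout for such a comment block is sketched below. The separators, headings, and wording are assumptions meant only to illustrate the idea of comments placed inside the function with visible separators and blank lines; the djb: namespace URI is also illustrative.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org" version="3.0">
    <xsl:function name="djb:mean" as="xs:double?">
        <!-- ================================================== -->
        <!-- djb:mean                                           -->
        <!--                                                    -->
        <!-- Returns the arithmetic mean of a sequence of       -->
        <!--   numbers                                          -->
        <!--                                                    -->
        <!-- Parameter:                                         -->
        <!--   $input as xs:double* : zero or more numbers      -->
        <!--                                                    -->
        <!-- Returns:                                           -->
        <!--   xs:double? : the mean, or an empty sequence if   -->
        <!--   the input is empty                               -->
        <!-- ================================================== -->
        <xsl:param name="input" as="xs:double*"/>
        <xsl:variable name="n" as="xs:integer" select="count($input)"/>
        <xsl:choose>
            <xsl:when test="$n eq 0"/>
            <xsl:otherwise>
                <xsl:sequence select="sum($input) div $n"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>
</xsl:stylesheet>
```

Because the comments sit inside the element, collapsing the function in an editor hides them along with the code.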
We call a user-defined function the same way we call a standard library function, so
djb:mean((1, 3, 2, 4, 3))
returns the value
2.6
, which is the result of dividing the sum
(13
) by the count
(5
) and the same as the value returned by
supplying the same input to the standard library
avg()
function.
We need the inner parentheses because both the standard library
avg()
function and our custom
djb:mean()
function take a single argument,
which is a sequence of zero or more numbers (or items that can be treated as
numbers). Since the arguments of a function are separated by commas, had we
written, incorrectly, djb:mean(1, 3, 2, 4, 3)
we would have raised an error, notifying us that there is no function called
djb:mean()
with an arity of 5. That is,
without the inner parentheses we would have been trying to pass 5 separate
arguments to the function and our function accepts only one input argument. The
inner parentheses ensure that we’ll pass a single argument that will be
understood as a sequence of 5 values.
The five acts of Hamlet have 5, 2, 4, 7, and 2 scenes, respectively, so the mean number of scenes is 4. We can confirm that by transforming bad-hamlet.xml with the following stylesheet:
[Stylesheet listing; its literal text consists of the title and heading “Scenes per act” and the table labels “Act”, “Scenes”, and “Mean”.]
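A stylesheet of this kind could look like the sketch below. It assumes that bad-hamlet.xml marks up acts as act elements with scene children in no namespace; those element names, the djb: namespace URI, and the choice of a pull-style xsl:for-each are assumptions, not the tutorial's confirmed code.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="xml" indent="yes"/>

    <!-- The user-defined function is embedded in the stylesheet that uses it -->
    <xsl:function name="djb:mean" as="xs:double?">
        <xsl:param name="input" as="xs:double*"/>
        <xsl:variable name="n" as="xs:integer" select="count($input)"/>
        <xsl:choose>
            <xsl:when test="$n eq 0"/>
            <xsl:otherwise>
                <xsl:sequence select="sum($input) div $n"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

    <xsl:template match="/">
        <html>
            <head>
                <title>Scenes per act</title>
            </head>
            <body>
                <h1>Scenes per act</h1>
                <table>
                    <tr>
                        <th>Act</th>
                        <th>Scenes</th>
                    </tr>
                    <!-- Assumption: acts are act elements and scenes are their scene children -->
                    <xsl:for-each select="//act">
                        <tr>
                            <td><xsl:value-of select="position()"/></td>
                            <td><xsl:value-of select="count(scene)"/></td>
                        </tr>
                    </xsl:for-each>
                    <tr>
                        <th>Mean</th>
                        <!-- One argument: a sequence of per-act scene counts -->
                        <td><xsl:value-of select="djb:mean(//act ! count(scene))"/></td>
                    </tr>
                </table>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```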
The output of the transformation is:
Scenes per act (page title and heading)
Act | Scenes
1 | 5
2 | 2
3 | 4
4 | 7
5 | 2
Mean | 4
The mode is the value or values that occur most frequently in the input. If, for
example, the input is (1, 2, 3, 2, 1)
, there
are two modal values, 1
and
2
, because each of them occurs twice, and
that’s the greatest frequency with which any value occurs in the input sequence.
We’ll make the following assumptions in our code that computes modal values:
We’ll allow an input of zero items (that is, an empty sequence) and return the empty sequence as the computed mode.
We’ll restrict our input to integers. Computing a mode of non-integer values is possible (the mode will be the most frequent value(s), whether those values are integers or not), but because doubles can differ by arbitrarily small amounts, a sequence of doubles is much less likely than a sequence of integers to contain repeated values. If in your work you have a need for modal values of sequences of doubles, you can adjust the function accordingly.
The following function computes a mode for an input sequence of zero or more integers:
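A function along these lines can be sketched as follows. The variable names input-items and max-count, the djb:input-item element, and the expressions in the last two steps are taken from the discussion; the djb: namespace URI and the small test template are assumptions, and the line numbers cited in the prose refer to the original listing (which also contained documentation comments).

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <xsl:function name="djb:mode" as="xs:integer*">
        <xsl:param name="input" as="xs:integer*"/>
        <!-- One temporary element per distinct value, recording the value and its frequency -->
        <xsl:variable name="input-items" as="element(djb:input-item)*">
            <xsl:for-each-group select="$input" group-by=".">
                <djb:input-item value="{current-grouping-key()}"
                    count="{count(current-group())}"/>
            </xsl:for-each-group>
        </xsl:variable>
        <!-- Greatest frequency; empty if there is no input -->
        <xsl:variable name="max-count" as="xs:double?" select="max($input-items/@count)"/>
        <!-- Keep the values whose frequency equals the greatest frequency -->
        <xsl:sequence select="$input-items[@count = $max-count]/@value"/>
    </xsl:function>

    <!-- Hypothetical test: the scene counts for the five acts of Hamlet -->
    <xsl:template name="xsl:initial-template">
        <xsl:value-of select="djb:mode((5, 2, 4, 7, 2))"/>
    </xsl:template>
</xsl:stylesheet>
```

Run from the named initial template, this outputs 2. The attribute values are converted to integers automatically because the function declares its return type as xs:integer*.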
We allow an input sequence of zero or more integers (line 12). If the input sequence is empty, we return an empty sequence; otherwise we return one or more integers, so we set the return data type (line 1) as zero or more integers. The function logic is as follows:
We create a variable called
$input-items
, which will be a
sequence of zero or more empty
<djb:input-item>
elements. Those
elements are a convenience for holding each input value and its frequency
together, so that, for example, if the input value
7
occurs 3 times in the input sequence,
that information will be stored as
<djb:input-item value="7" count="3"/>
.
Using temporary elements for this data-management purpose works, but it isn’t very efficient computationally because elements have a lot of inherent properties that we aren’t using because we don’t need them; all we need is a way to keep the values and their frequencies together. In the optional Odds and ends section, below, we explore more efficient alternatives, and we would probably use one of those in Real Life. We’ve used temporary elements here for pedagogical reasons, that is, because you’re already familiar with elements, attributes, and path expressions.
We use <xsl:for-each-group>
to form
the items in the input sequence into groups according to their values. Since
the scene counts in Hamlet are, in order,
(5, 2, 4, 7, 2)
, where the value
2
occurs twice and the other three
distinct values occur once, we’ll wind up with four groups:
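For the Hamlet input the four temporary elements would look like the following, shown here as a fragment that assumes the djb: prefix is declared in the stylesheet. Groups are created in order of first appearance, so the element for the value 2 comes second.

```xml
<djb:input-item value="5" count="1"/>
<djb:input-item value="2" count="2"/>
<djb:input-item value="4" count="1"/>
<djb:input-item value="7" count="1"/>
```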
We find the greatest frequency of any value with
max($input-items/@count)
. If there are
zero input values there will be no largest count value (which is why we set
the data type as xs:double?
, with a
question mark to make it optional); otherwise there will be exactly one
largest count (although, for reasons described above, there may be more than
one input value that occurs the maximum number of times).
The return value of our function is a sequence of zero or more values, which
we obtain with
$input-items[@count = $max-count]/@value
.
This filters our <djb:input-item>
elements to select only those with
@count
values equal to the largest
count value. We then take a path step from those elements to their
@value
attributes, which are the input
numbers that occur the maximum numbers of times, and we return a sequence of
those numbers.
If there are no input items, the result of filtering them in the last step will necessarily be an empty sequence, so we do not require special handling for empty input.
If we add a row to the table of information about scenes in Hamlet to
render the mode and populate the cell with the result of calling
djb:mode()
on the same per-act scene counts, it returns the value 2
, which is the modal
number of scenes per act in the play because there are 2 acts that contain 2 scenes
and all other scene counts occur in only one act.
You can regard this section of the tutorial as optional, but we hope that you’ll at least look through it to see what’s there. In particular, maps are a new feature of XPath 3.1 (and XSLT 3.0) that work quite differently from other features of the language. We don’t introduce them in a comprehensive way here because they require their own tutorial, but since we might use them in Real Life if we were creating a user-defined function to compute an arithmetic mode, we thought we should at least demonstrate how to use them for that purpose. XPath and XSLT developers survived without maps before XSLT 3.0 was introduced, which is to say that the alternative implementation that we discuss above is both idiomatic and fully adequate to the task, and you should feel free to employ that method in your own work if you are not comfortable using maps.
We tested our function against the examples at https://www.mathsisfun.com/mode.html and verified that we returned the expected modal values in each case. That site discusses alternatives to computing an exact mode in situations where the exact mode is not useful, such as binning (grouping values by range) when all individual values occur only once. We did not implement those methods here, so we always return the modal values, even where they may be of limited practical use.
In Real Life we would perform this verification with proper unit tests (using
XSpec, a framework for unit testing XSLT stylesheets), but we have not yet
implemented an XSpec test suite for our
djb:mode()
function.
Priscilla Walmsley’s FunctX XQuery
functions site provides excellent documented examples of both the
standard XPath and XQuery function library and her own user-defined functions
(in a namespace conventionally mapped to the
functx:
prefix).
The <xsl:function>
element was added to
XSLT only in version 2.0, and similar functionality in version 1.0 was achieved
with the help of named templates and
<xsl:call-template>
. Since the
introduction of <xsl:function>
we find
that we rarely use the older method, but Kay pp. 349–50 discusses the
similarities and differences, including when each might be preferable.
Functions and named templates can call themselves, a process known as
recursion. Recursion is a common idiom in XSLT programming (and
in functional programming in general); see Kay pp. 274–80 and 350–53 for
examples and discussion. XSLT 3.0 (not covered in Kay) introduces
<xsl:iterate>
, which offers an
alternative approach to some tasks that would previously have been addressed
with recursion; for more information see the xsl:iterate section of the Saxon documentation and the discussion at 7.2 The xsl:iterate
Instruction in the XSLT 3.0 spec.
Our example above embeds our user-defined function definition inside the
stylesheet in which it is used to create a result, but a common use of functions
involves creating them in a separate XSLT stylesheet and importing them into the
stylesheets that then use them to perform transformations. The advantage of this
approach is that functions that you use repeatedly in different projects can be
written once and reused, just as we use the same standard library functions in
different projects. There are three mechanisms for importing functions (and
other information) from one stylesheet into another:
<xsl:import>
(Kay pp. 357–67),
<xsl:include>
(Kay pp. 372–76), and XSLT
packages, new in XSLT 3.0 and not discussed in Kay. You can read more about
packages in the Saxon documentation (xsl:package, xsl:use-package, and the links from those pages) and Section 3.5 Packages of the
XSLT 3.0 spec.
When I shared this task on the xml.com Slack, several readers contributed alternative ways of computing a mode. Some approaches implement the logic that identifies the modes entirely in XPath, while others rely also on XSLT elements. Some approaches use maps, while others do not. Here is an overview of some of those alternatives. In Real Life we would use whichever we found easiest to understand and implement.
The syntax for creating a user-defined function is the same in the method above and in all of these alternatives, all of these functions are called in the same way, and they all return the same results. The reason to read this section, then, is not to learn something new about functions as much as to think about the variety of ways XPath and XSLT allow us to approach a task. That variety does tell us something important about functions, though, because the only thing that changes across these different methods is the function body, and the signature remains constant, which means that the template that uses the function does not have to be modified when we change the function body. We find this a persuasive example of the benefits of modular approaches to coding, that is, of isolating a small amount of functionality so that it can be modified without having to change (or even think about) the way other parts of the program are going to use it.
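A sketch of an XPath-only function body that follows the description in this section. The variable names $dist and $mx come from the discussion; the djb: namespace URI is an assumption, and the line numbers cited in the prose refer to the original listing.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:function name="djb:mode" as="xs:integer*">
        <xsl:param name="input" as="xs:integer*"/>
        <!-- $dist: the distinct input values -->
        <!-- $mx: the greatest number of positions at which any distinct value occurs -->
        <!-- Result: the distinct values that occur $mx times -->
        <xsl:sequence select="
            let $dist := distinct-values($input),
                $mx := max($dist ! count(index-of($input, .)))
            return $dist[count(index-of($input, .)) eq $mx]"/>
    </xsl:function>
</xsl:stylesheet>
```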
This approach uses the let … return
construction introduced in XPath 3.0. Values can be bound to variables using
the let
operator followed by
:=
(not just an equal sign), after which
the result of evaluating an XPath expression that (typically) uses those
variables can be returned. If there are multiple
let
statements, they need to be separated
by commas.
The :=
symbol is informally called the
walrus operator because—at least if you have a good
imagination—it looks like the eyes and tusks of a walrus lying on its
side. For reasons that have to do with the technical meaning of the
terms operator, assignment, and binding
in computer science, :=
is not,
strictly speaking, an operator, and when we associate values with
variables we describe the process as binding, rather than
assignment. A jargon-neutral way to describe the effect
of the :=
symbol is that it associates
the value to the right with the variable name to the left.
This method creates two variables: $dist
is
a sequence of the distinct values from the input (each of which appears in
the input with a frequency of one or more) and
$mx
is the highest such frequency. To
compute the frequencies we take each distinct value and use the XPath
index-of()
function to find the positions (offsets) in
the original sequence where the value appears. If, for
example, the input were (5, 2, 4, 7, 2)
,
index-of($input, 2)
would return the
sequence (2, 5)
because the value
2
appears at positions 2 and 5 in the
input. For our purposes we don’t really care about the specific offsets; we
care only how many offsets there are for each distinct input value, since we
can count the offsets as a surrogate for counting the frequency of the
values themselves. The snippet
$dist ! count(index-of($input, .))
creates
a sequence of those counts, one per distinct value in the original input,
and we wrap that sequence of frequencies in the
max()
function to find the largest such
value (there must be exactly one, even if it occurs more than once because
multiple input items appear at maximal frequency). We then bind that maximal
frequency value to the variable $mx
.
Finally, we use the maximal frequency to filter our distinct values by
counting how often each one occurs (again) and finding the one(s) that
appear most frequently. The predicate in line 15 uses the same
index-of()
strategy to count the frequency
of each distinct value in the original input sequence, and it retains those
for which the count equals $mx
. This
strategy is similar to the one we used in questions 4 and 5 of XSLT assignment #4 to find the
speaker of the longest speech in Hamlet.
This method can be condensed into a single line, which is as elegant as it is challenging for a human (well, for this human) to parse. The one-line version is:
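Based on the fragments quoted in the explanation, the one-line version can be sketched as the following function body. The exact spelling of the original line is an assumption, as is the djb: namespace URI.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:function name="djb:mode" as="xs:integer*">
        <xsl:param name="input" as="xs:integer*"/>
        <!-- Keep an input item only if its own position is the nth position at which -->
        <!-- its value occurs, where n is the greatest frequency in the input -->
        <!-- For (1, 2, 3, 4, 5, 3, 4) this keeps the items at positions 6 and 7: (3, 4) -->
        <xsl:sequence
            select="$input[index-of($input, .)[max($input ! count(index-of($input, .)))]]"/>
    </xsl:function>
</xsl:stylesheet>
```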
The logic is largely the same, but it dispenses with interim variables and it
avoids duplicate output without using the
distinct-values()
function, which means
that it has to implement an alternative way of dealing with values that
appear more than once in the input sequence. Reading from the inside
out:
Starting with the innermost predicate,
$input ! count(index-of($input, .))
,
we count the number of times each value appears in the input
sequence (counting the offsets of the values as a surrogate for
counting the values themselves). We don’t remove duplicates, so if,
for example, our input is
(1, 2, 3, 4, 5, 3, 4)
, the
predicate returns
(1, 1, 2, 2, 1, 2, 2)
. We wrap
this in the max()
function, which
tells us that the modal value(s) occur twice. With this input data,
then, we can replace the inner predicate with the integer value
2
, which means that our expression
is equivalent to
$input[index-of($input, .)[2]]
.
Inside the remaining (outer) predicate we now get the offset
positions for each value in the input and filter them to keep only
the second such position. Here are examples of how that works for
input values that occur only once (we use
2
as an example) and for values
that occur twice (we use 3
as an
example):
When the input value is 2
,
the expression
$input[index-of($input, .)[2]]
evaluates to $input[()]
because there is no second offset position for the value
2
in the input. This is a
valid expression that selects nothing, which means that the
input value 2
is not added
to the output sequence.
When the input value is 3
,
the expression
$input[index-of($input, .)[2]]
evaluates to $input[6]
because 6
is the second
offset position for the value
3
in the input. The sixth
item of the input is therefore included in the output
sequence.
The clever part of this approach, which makes up for our not having removed the duplicates earlier, is that we filter the offset positions to keep only the second one. Input values that appear only once don’t have a second offset position, so they yield an empty sequence, which is not an error in XPath, and those input values are not included in the output. But the values that appear twice do have a second offset position, so their second appearance (but not the first) is included in the output, thus removing the duplicates by selecting only one instance of the desired input value.
It took us long enough to work through the logic that we probably wouldn’t use this method in Real Life, but figuring out how it works was a rewarding experience that enhanced our knowledge and understanding of XPath.
The following XPath-oriented function uses maps. XPath that uses functions in
the map namespace must declare that namespace, which means adding the
following attribute to the root
<xsl:stylesheet>
element:
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
We do not try to explain how to use maps in any general way here because they really require their own tutorial, but we do try to explain how the map-related parts of the following code work:
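A sketch of a map-based function that follows the description in this section. The variable names $distinct, $freqs, and $max and the use of map:entry(), map:merge(), and map:keys() come from the discussion; the range variable $value, the djb: namespace URI, and the guard for empty input are assumptions. The guard is added here because calling a map as a function with an empty sequence as the key is a type error. Line numbers cited in the prose refer to the original listing.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:function name="djb:mode" as="xs:integer*">
        <xsl:param name="input" as="xs:integer*"/>
        <!-- $freqs maps each frequency to the input values that occur that often -->
        <!-- The options map tells map:merge() to combine the values of duplicate keys -->
        <xsl:sequence select="
            let $distinct := distinct-values($input),
                $freqs := map:merge(
                    for $value in $distinct
                    return map:entry(count(index-of($input, $value)), $value),
                    map { 'duplicates': 'combine' }),
                $max := max(map:keys($freqs))
            return
                if (exists($max)) then $freqs($max) else ()"/>
    </xsl:function>
</xsl:stylesheet>
```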
The let … return
construction allows us to
create three variables and then return a result that depends on them. The
variables are:
$distinct
is a deduplicated
sequence of the input values, that is, the input values with
duplicates removed. We use this to count the number of times each
one appears in the original input sequence.
$freqs
is a map with a key for each
unique frequency, where the value of the key is a sequence of values
that appear in the input sequence with that frequency. We create
this map by merging a sequence of one-item maps, the construction of
which we discuss below.
$max
is the largest key value, that
is, the highest frequency with which any value occurs in the
input.
Maps (see 3.11.1 Maps in the XPath 3.1 spec) are key:value pairs in which the keys (in our case the number of times a value appears in the input sequence) are atomic values (in our case integers, since they are frequency counts) that must be unique. Since there may be multiple input values that occur the same number of times, we approach the task by first constructing a separate one-item map for each unique integer value in the input, where the frequency with which that integer appears in the input is the key and the integer itself is the value. The part of our code that does this is the for … return expression inside the call to map:merge().
The map:entry()
function (line 15)
constructs a one-item map and takes two arguments, the first of which is the
key (in our case we count the number of times a particular distinct input
value occurs in the input sequence) and the second of which is the value (in our case the integer
whose frequency we are counting). When this XPath
for … return
expression finishes, it
creates a sequence of one-item maps, which serve as the first argument to
the map:merge()
function wrapped around
them (lines 13–16 in the full example, above). With our Hamlet
example this produces the following four one-item maps: map{1: 5}, map{2: 2}, map{1: 4}, and map{1: 7}.
The order of these four maps is unpredictable because the order of the
items returned by the
distinct-values()
function is
unpredictable.
The map:merge()
function as we use it takes
two arguments, the first of which is a sequence of maps to combine into a
single map and the second of which is a map of options that describe how the
merge should proceed. The options are … er … optional, but because maps are
not allowed to have duplicate keys and three of our one-item maps have the
same key, we want to specify that in case of duplicate keys we want the
values to be merged into a sequence. We do that by constructing, as our
second argument to the map:merge()
function, a one-item map with the key
"duplicates"
and the value
"combine"
(both key and value are in
quotation marks because both are strings).
This part of the code illustrates two ways to create new one-item maps.
One is to use the
map:entry(key, value)
function (line
15) and the other is to use the map constructor syntax
map{ key: value }
(line 16). These are
discussed in the Maps in XPath section of the Saxon documentation.
The output of the map:merge()
function is
the following single map with two key:value pairs: map{1: (5, 4, 7), 2: 2}.
The order of the key:value pairs inside a map, like the order of items in
the output of distinct-values()
, is
unpredictable.
We bind this merged map to the variable
$freqs
(lines 13–16), generate a sequence
of its keys with the map:keys()
function
(inside the parentheses in line 17; in our case the result is
(1, 2)
), identify the largest key value
with the regular XPath max()
function (to
the right of the := symbol in line 17), and bind that value to a
variable called $max
(line 18). There is
guaranteed to be exactly one largest key value because our merge operation
ensured that all keys would be unique by combining into a single sequence,
with a single key, the values for keys that were duplicated in the earlier
sequence of one-item maps.
One way to look up the values associated with a key in a map is to follow the
map name with parentheses and write the key value inside the parentheses.
For that reason, $freqs($max)
returns the
value associated with the largest key, which in our case is the value
2
. This means that the highest frequency with which any
value occurs in the input sequence is 2 (our key), and the value that
occurs twice is 2
, which is the count of
scenes in Acts 2 and 5. Had there been more than one act with the same most
frequent number of scenes (for example, had there been 2 acts with 2 scenes,
2 acts with 4 scenes, and 1 act with 6 scenes), our initial input would have
been the sequence (2, 2, 4, 4, 6)
, the
highest frequency (the value of $max
)
would still have been 2
, and
$freqs($max)
would have returned the
sequence (2, 4)
.
The XPath map functions map:entry()
and
map:merge()
and the map constructor
map{ }
used above have counterparts in
XSLT. A one-item map can be created with
<xsl:map-entry>
and one-item maps
can be merged by wrapping the
<xsl:map-entry>
elements in
<xsl:map>
. Below is an
implementation that uses these XSLT map methods:
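A sketch of an XSLT-oriented implementation that follows the description in this section. The variable names groups and keyed-by-freq, the use of xsl:map-entry, xsl:map, map:keys(), and the ?* lookup come from the discussion; the djb: namespace URI and the exact form of the final lookup are assumptions.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:function name="djb:mode" as="xs:integer*">
        <xsl:param name="input" as="xs:integer*"/>
        <!-- One single-entry map per distinct value: key is the value, value is its frequency -->
        <xsl:variable name="groups" as="map(*)*">
            <xsl:for-each-group select="$input" group-by=".">
                <xsl:map-entry key="current-grouping-key()" select="count(current-group())"/>
            </xsl:for-each-group>
        </xsl:variable>
        <!-- Invert: group the small maps by frequency, so each frequency becomes a key -->
        <!-- whose value is the sequence of input values that occur that often -->
        <xsl:variable name="keyed-by-freq" as="map(*)">
            <xsl:map>
                <xsl:for-each-group select="$groups" group-by="?*">
                    <xsl:map-entry key="current-grouping-key()"
                        select="current-group() ! map:keys(.)"/>
                </xsl:for-each-group>
            </xsl:map>
        </xsl:variable>
        <!-- Look up the values stored under the largest key; empty input yields an empty result -->
        <xsl:sequence select="$keyed-by-freq?(max(map:keys($keyed-by-freq)))"/>
    </xsl:function>
</xsl:stylesheet>
```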
We begin the same way that we did with our implementation in the main body of
this tutorial, by grouping the input values according to value. We do this
inside an <xsl:variable>
element to
construct a variable called $groups
, the
members of which are one-item maps with input values as keys and frequencies
as values. The value of $groups
is the following sequence of four one-item maps: map{5: 1}, map{2: 2}, map{4: 1}, and map{7: 1}.
We now want to invert the keys and values and merge the result into a single
map, and because keys must be unique, that means that we need to combine the
three one-item maps for input items that occur only once into a single map
entry. XSLT 3.0 does not have a counterpart to the duplicate-combining option of the XPath
map:merge()
function, and although there’s
no impediment to using that function in an otherwise XSLT context, we
instead implement the merge with XSLT resources to create a new variable
called $keyed-by-freq
. We use grouping
again, this time grouping the one-item maps according to their values, and
we use the ?*
notation to return, as a
grouping key, the value of the individual map items (that is, the
frequencies; see 3.11.3
The Lookup Operator ("?") for Maps and Arrays in the XPath 3.1 spec
for more information). Since we are now grouping by the values of the maps
in the $groups
variable, which are
frequencies, the keys for the new map will be
1
and 2
.
The values in the new map are the keys from the old map, and we obtain those
by applying the map:keys()
function to
each item in the group we are processing at the moment. Because the new keys
are the frequencies, their largest value is the greatest frequency, which we
compute by applying the max()
function to
the keys of $keyed-by-freq
(and we find
those with the XPath map:keys()
function).
We then use the lookup operator mentioned above to look up (that is,
retrieve) the value associated with that key in our map.