Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2024-03-19T16:59:54+0000
This is an abridged version of our comprehensive XSLT user-defined functions tutorial. The full tutorial illustrates the development of a function to compute the mode of a sequence of integers, and is intended as preparation for XSLT functions assignment #1, which asks learners to develop their own function to compute a median. This abridged tutorial uses the median as an example, and is intended for a context where user-defined functions are introduced as a drive-by topic, that is, one that is practiced in class but has no accompanying homework assignment.
We can think of a function as a bit of code that accepts zero or more input items (called the function parameters) and returns a result (including, possibly, an empty sequence).
You already have experience using built-in XPath functions, but once you begin to develop any substantial XSLT transformations you are likely to discover a need for additional functions that you can use in your stylesheets similarly to the way you use standard XPath functions. For example, XPath has a built-in function to compute an arithmetic mean (avg()) but not to compute a median, and in this activity we’ll create a user-defined function to remedy that lack. We’ll then practice using it inside an XSLT stylesheet.
Statisticians work with three basic types of averages: the arithmetic mean, the median, and the mode. The median is the number in the middle of a sorted sequence of values. For example, the median of (3, 0, 3, 7, 8) is 3 because 2 of the 5 values are less than or equal to 3 and 2 of the values are greater than or equal to 3. If there’s an even number of values there isn’t a single one in the middle, and in that case the median is defined as the mean of the two middle values. For example, (3, 0, 6, 5, 3, 4) has 6 values, and when we sort them, the middle 2 (the third and fourth) are 3 and 4, so the median of the 6 values is 3.5, which is the mean of 3 and 4.
A user may invoke a function with unusual input (sometimes called edge cases). For example, what should a median function do if the input is an empty sequence? Should it accept only integers (whole numbers), or also doubles (numbers with a decimal point, like 3.14159)? What should it do if the input sequence includes non-numeric values? A function will always do something no matter what the input, even if that something is to raise an error, and it’s up to the developer to decide what the something should be.
In this activity we’re going to require our median function to accept a single parameter, which must be a sequence of one or more doubles. This means that it will raise an error if we call it with an empty sequence, with a sequence that includes non-numerical values, or with more than one argument. If we call it with valid input, it will return a single numerical value, which will be a double that represents the computed median of the input sequence.
The XSLT element <xsl:function> creates a user-defined function. A user-defined function, just like a standard XPath function, has a signature, which consists of its name (in a namespace), the input it accepts (parameters), and the result it returns. The rest of the function (the body) constructs the result, which is returned when the function is called. These parts are described in more detail below.
The name of a user-defined function must be in a user-defined namespace. It is customary to use a URI as a namespace value, but that is not required, and the URI does not have to be a URL, which is to say that it does not have to point to an existing resource on the Internet. Below we use djb: as the namespace prefix and http://www.obdurodon.org as the associated namespace value. For your own projects you might want to use a short version of your project name as the namespace prefix and the main URL of your project as the associated namespace value.
The skeleton of a stylesheet that includes a user-defined function looks like the following:
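A minimal sketch of such a skeleton, assuming XSLT 3.0 and the djb: prefix and namespace value described above. The xs: prefix is declared here only because datatypes are needed later, and the comments mark where the remaining pieces would go:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">

    <!-- User-defined function; its name must be in a user-defined namespace -->
    <xsl:function name="djb:median">
        <!-- parameters and function body go here -->
    </xsl:function>

    <!-- Templates that call djb:median() go here -->

</xsl:stylesheet>
```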
A function definition specifies zero or more parameters by including zero or more empty <xsl:param> elements as children of the <xsl:function> element. Our function requires one argument, which must be a sequence of one or more doubles, so we need to augment the function definition above by adding an <xsl:param> child to our <xsl:function> element:
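As a sketch, the function element with its parameter might look like the following fragment. It assumes that the enclosing stylesheet binds xs: to http://www.w3.org/2001/XMLSchema and djb: to http://www.obdurodon.org, and that the parameter is named input:

```xml
<xsl:function name="djb:median">
    <!-- One parameter: a sequence of one or more doubles (note the plus sign) -->
    <xsl:param name="input" as="xs:double+"/>
    <!-- function body goes here -->
</xsl:function>
```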
Datatype specifications use the same occurrence indicators as Relax NG, so the plus sign means that the input must be a sequence of one or more doubles.
When it comes time to use our parameters within the function body, we refer to them by name, prefixed with a dollar sign, so that, for example, the parameter we declare above can be referenced as $input. Parameters are thus similar to XSLT variables: they are declared with a @name attribute and given a name that does not begin with a dollar sign, but when they are referenced subsequently, the dollar sign is prepended to them. We’ll see an example of that use below.
A function returns the result of evaluating the code inside the function body. Just as you should specify the datatype of input parameters by using the @as attribute on <xsl:param> elements, you should specify the datatype of the result of a function by adding the @as attribute to the <xsl:function> element:
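A sketch of the function signature with both datatype declarations in place, under the same namespace assumptions as before (xs: for XML Schema datatypes, djb: for the function name). The body is still missing at this stage:

```xml
<!-- as="xs:double" declares that the function returns exactly one double -->
<xsl:function name="djb:median" as="xs:double">
    <xsl:param name="input" as="xs:double+"/>
    <!-- function body goes here; until it produces a double, a processor
         may report a type error for this declaration -->
</xsl:function>
```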
Unlike in some other programming languages, there is no explicit return statement; the result of the function that is returned is simply the result of evaluating the function body. We implement the logic that computes the median inside the function body, after any <xsl:param> elements, as follows:
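One way the complete function might be written is sketched below. It is a fragment that belongs inside a stylesheet that declares the xs: and djb: namespaces, and it assumes a processor that supports the XPath 3.1 sort() function. The variable names sorted and n are illustrative choices, and because this is not necessarily the original listing, the line numbers cited in the discussion will not line up with it:

```xml
<xsl:function name="djb:median" as="xs:double">
    <xsl:param name="input" as="xs:double+"/>
    <!-- Sort and count once, and bind the results to variables for reuse -->
    <xsl:variable name="sorted" as="xs:double+" select="sort($input)"/>
    <xsl:variable name="n" as="xs:integer" select="count($input)"/>
    <xsl:choose>
        <!-- Even count: average the two values that straddle the midpoint -->
        <xsl:when test="$n mod 2 eq 0">
            <!-- integer div integer yields an xs:decimal, e.g. 6 div 2 is 3 -->
            <xsl:variable name="half" as="xs:decimal" select="$n div 2"/>
            <xsl:sequence select="avg(($sorted[$half], $sorted[$half + 1]))"/>
        </xsl:when>
        <!-- Odd count: take the single middle value -->
        <xsl:otherwise>
            <xsl:sequence select="$sorted[$n idiv 2 + 1]"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:function>
```

Using <xsl:sequence> rather than <xsl:value-of> keeps the result a number instead of turning it into a text node, which is what the as="xs:double" declaration on the function requires.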
We start by sorting the input values (line 15) and counting them (line 16), and we bind the results of those operations to variables so that we can reuse them below. Since the median is computed differently for an odd vs even number of input values, we use <xsl:choose> (lines 17–27) to manage the branching. The XPath mod operator returns the remainder of integer division, so if we divide the count by 2, the remainder will be 0 if there is an even number of input items and 1 if there is an odd number.
The median of an even number of sorted input values (lines 18–22) is the mean of the middle two values. We know that the count is an even number, which means when we divide by 2 the result will be an integer that corresponds to the position in the sorted input just before the midpoint. For example, if there are 6 input values, 6 div 2 equals 3 and the third and fourth items straddle the midpoint. We assign the value of our division to the variable $half and then select the value at that offset into the sorted input sequence plus the one after it, which gives us a sequence of two values (in this case, the third and fourth items in the sorted sequence of 6 items). We then use the standard XPath library function avg() to average them.
The median of an odd number of values (lines 23–26) is the middle value in the sorted sequence. We find that offset by performing integer division (idiv, not div; integer division ignores any remainder) on the count and adding 1. For example, if there are 5 items in the sequence, 5 idiv 2 equals 2, and when we add 1 to that, we get 3, which is the middle position in a sequence of 5 items.
You can call a user-defined function the same way you call a standard library function, so djb:median((1, 3, 2, 4, 3)) returns the value 3. The five acts of Hamlet have 5, 2, 4, 7, and 2 scenes, respectively, so the median number of scenes is 4. You can confirm that by transforming bad-hamlet.xml with the following stylesheet:
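[Stylesheet listing. Its literal output text: the title and heading “Scenes per act”, the column headers “Act” and “Scenes”, and the row label “Median”.]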
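A stylesheet along these lines could be sketched as follows. The element names act and scene are hypothetical, because the markup of bad-hamlet.xml is not shown here, so the two paths that use them would need to be adjusted to the actual document. The sketch also assumes an XSLT 3.0 processor that supports the XPath 3.1 sort() function and the simple map operator (!):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="xhtml" html-version="5" indent="yes"/>

    <!-- Median of one or more doubles, as developed above -->
    <xsl:function name="djb:median" as="xs:double">
        <xsl:param name="input" as="xs:double+"/>
        <xsl:variable name="sorted" as="xs:double+" select="sort($input)"/>
        <xsl:variable name="n" as="xs:integer" select="count($input)"/>
        <xsl:choose>
            <xsl:when test="$n mod 2 eq 0">
                <xsl:variable name="half" as="xs:decimal" select="$n div 2"/>
                <xsl:sequence select="avg(($sorted[$half], $sorted[$half + 1]))"/>
            </xsl:when>
            <xsl:otherwise>
                <xsl:sequence select="$sorted[$n idiv 2 + 1]"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

    <xsl:template match="/">
        <!-- ASSUMPTION: each act is an <act> element with <scene> children;
             these names are hypothetical, so change them to match the input -->
        <xsl:variable name="acts" as="element()+" select="//act"/>
        <html>
            <head>
                <title>Scenes per act</title>
            </head>
            <body>
                <h1>Scenes per act</h1>
                <table>
                    <tr>
                        <th>Act</th>
                        <th>Scenes</th>
                    </tr>
                    <!-- One row per act: its position and its scene count -->
                    <xsl:for-each select="$acts">
                        <tr>
                            <td><xsl:value-of select="position()"/></td>
                            <td><xsl:value-of select="count(scene)"/></td>
                        </tr>
                    </xsl:for-each>
                    <tr>
                        <th>Median</th>
                        <!-- The integer counts are promoted to doubles by the
                             function conversion rules when the call is made -->
                        <td><xsl:value-of select="djb:median($acts ! count(scene))"/></td>
                    </tr>
                </table>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>
```

The counts are computed once per row for display and once more as a sequence for the function call; binding the sequence of counts to a variable would avoid the repetition at the cost of a slightly longer template.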
The result of that transformation is:
Scenes per act

Act     Scenes
1       5
2       2
3       4
4       7
5       2
Median  4