Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2024-03-19T16:59:54+0000


User-defined functions in XSLT

About this tutorial

This is an abridged version of our comprehensive XSLT user-defined functions tutorial. The full tutorial illustrates the development of a function to compute the mode of a sequence of integers, and is intended as preparation for XSLT functions assignment #1, which asks learners to develop their own function to compute a median. This abridged tutorial uses the median as an example, and is intended for a context where user-defined functions are introduced as a drive-by topic, that is, one that is practiced in class but has no accompanying homework assignment.

What is a function?

We can think of a function as a bit of code that accepts zero or more input items (called the function parameters) and returns a result (including, possibly, an empty sequence).

Why create your own functions?

You already have experience using built-in XPath functions, but once you begin to develop any substantial XSLT transformations you are likely to discover a need for additional functions that you can use in your stylesheets similarly to the way you use standard XPath functions. For example, XPath has a built-in function to compute an arithmetic mean (avg()) but not to compute a median, and in this activity we’ll create a user-defined function to remedy that lack. We’ll then practice using it inside an XSLT stylesheet.

What is a median and how is it computed?

Statisticians work with three basic types of averages: the arithmetic mean, the median, and the mode. The median is the number in the middle of a sorted sequence of values. For example, the median of (3, 0, 3, 7, 8) is 3 because, once the values are sorted as (0, 3, 3, 7, 8), two values come before the middle 3 and two come after it. If there’s an even number of values there isn’t a single one in the middle, and in that case the median is defined as the mean of the two middle values. For example, (3, 0, 6, 5, 3, 4) has 6 values, and when we sort them as (0, 3, 3, 4, 5, 6), the middle 2 (the third and fourth) are 3 and 4, so the median of the 6 values is 3.5, which is the mean of 3 and 4.

Thinking about user-defined functions

A user may invoke a function with unusual input (sometimes called edge cases). For example, what should a median function do if the input is an empty sequence? Should it accept only integers (whole numbers), or also doubles (numbers with a decimal point, like 3.14159)? What should it do if the input sequence includes non-numeric values? A function will always do something no matter what the input, even if that something is to raise an error, and it’s up to the developer to decide what the something should be.

In this activity we’re going to require our median function to accept a single parameter, which must be a sequence of one or more doubles. This means that it will raise an error if we call it with an empty sequence, with a sequence that includes non-numerical values, or with more than one argument. If we call it with valid input, it will return a single numerical value, which will be a double that represents the computed median of the input sequence.

The parts of a user-defined function

The XSLT element <xsl:function> creates a user-defined function. A user-defined function, just like a standard XPath function, has a signature, which consists of its name (in a namespace), the input it accepts (parameters), and the result it returns. The rest of the function (the body) constructs the result, which is returned when the function is called. These parts are described in more detail below.

Function name and namespace

The name of a user-defined function must be in a user-defined namespace. It is customary to use a URI as a namespace value, but that is not required, and the URI need not be a URL, which is to say that it does not have to point to an existing resource on the Internet. Below we use djb: as the namespace prefix and http://www.obdurodon.org as the associated namespace value. For your own projects you might want to use a short version of your project name as the namespace prefix and the main URL of your project as the associated namespace value.

The skeleton of a stylesheet that includes a user-defined function looks like the following:
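A minimal sketch of such a skeleton, assuming the djb: prefix and namespace value named above, might look like the following. The xs: binding (needed later for datatype names), version="3.0", and exclude-result-prefixes are assumptions added for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:djb="http://www.obdurodon.org"
  exclude-result-prefixes="#all"
  version="3.0">

  <!-- User-defined function: the name must carry a namespace prefix -->
  <xsl:function name="djb:median">
    <!-- parameters and the function body go here -->
  </xsl:function>

  <!-- Templates that call djb:median() go here -->
</xsl:stylesheet>
```

Binding the prefix on the root element makes it available both where the function is declared and wherever it is called.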


Function parameters and datatypes

A function definition specifies zero or more parameters by including zero or more empty <xsl:param> elements as children of the <xsl:function> element. Our function requires one argument, which must be a sequence of one or more doubles, so we need to augment the function definition above by adding an <xsl:param> child to our <xsl:function> element:
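As a sketch, the function declaration with its single parameter might read as follows. This is a fragment, not a complete stylesheet: it assumes that the djb: prefix is bound as described above and that xs: is bound to http://www.w3.org/2001/XMLSchema on the stylesheet root.

```xml
<xsl:function name="djb:median">
  <!-- One parameter, named input, typed as a sequence of one or more doubles -->
  <xsl:param name="input" as="xs:double+"/>
  <!-- function body goes here -->
</xsl:function>
```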


Datatype specifications use the same occurrence indicators as RELAX NG (?, * and +), so the plus sign means that the input must be a sequence of one or more doubles.

When it comes time to use our parameters within the function body, we refer to them by name, prefixed with a dollar sign, so that, for example, the parameter we declare above can be referenced as $input. Parameters are thus similar to XSLT variables: they are declared with a @name attribute and given a name that does not begin with a dollar sign, but when they are referenced subsequently, the dollar sign is prepended to them. We’ll see an example of that use below.

Function result and datatype

A function returns the result of evaluating the code inside the function body. Just as you should specify the datatype of input parameters by using the @as attribute on <xsl:param> elements, you should specify the datatype of the result of a function by adding the @as attribute to the <xsl:function> element:
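A sketch of the declaration with the result type added, again as a fragment that assumes the xs: and djb: prefixes are bound on the stylesheet root:

```xml
<!-- as="xs:double" on xsl:function declares that the result is exactly one double -->
<xsl:function name="djb:median" as="xs:double">
  <xsl:param name="input" as="xs:double+"/>
  <!-- function body goes here; it must evaluate to a single xs:double -->
</xsl:function>
```

With both @as attributes in place, the processor can report a type error if the function is called with unsuitable input or if the body produces something other than a single double.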


Function body

Unlike in some other programming languages, XSLT has no explicit return statement; a function simply returns the result of evaluating its body. We implement the logic that computes the median inside the function body, after any <xsl:param> elements, as follows:
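The following is a sketch of a function body consistent with the description in this section, not necessarily the original listing. The variable names $sorted and $n are assumptions ($half is the name used in the discussion), and sort() assumes a processor that supports XPath 3.1. The fragment is laid out so that the line numbers cited in the discussion line up: sorting on line 15, counting on line 16, and <xsl:choose> on lines 17–27.

```xml
<xsl:function name="djb:median" as="xs:double">
  <!-- ============================================================ -->
  <!-- djb:median                                                   -->
  <!--                                                              -->
  <!-- Parameter: $input as xs:double+ (one or more doubles)        -->
  <!-- Returns:   xs:double, the median of $input                   -->
  <!--                                                              -->
  <!-- Odd count:  the middle value of the sorted sequence          -->
  <!-- Even count: the mean of the two middle values                -->
  <!--                                                              -->
  <!-- Sketch only: the names $sorted and $n are assumptions,       -->
  <!-- and sort() requires a processor that supports XPath 3.1      -->
  <!-- ============================================================ -->
  <xsl:param name="input" as="xs:double+"/>
  <xsl:variable name="sorted" as="xs:double+" select="sort($input)"/>
  <xsl:variable name="n" as="xs:integer" select="count($input)"/>
  <xsl:choose>
    <xsl:when test="$n mod 2 eq 0">
      <!-- Even count: $half is the position just before the midpoint -->
      <xsl:variable name="half" select="$n div 2"/>
      <xsl:sequence select="avg($sorted[position() = ($half, $half + 1)])"/>
    </xsl:when>
    <xsl:otherwise>
      <!-- Odd count: integer division plus 1 gives the middle position -->
      <xsl:sequence select="$sorted[$n idiv 2 + 1]"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:function>
```

As with the earlier fragments, this assumes the xs: and djb: prefixes are bound on the stylesheet root.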


We start by sorting the input values (line 15) and counting them (line 16), and we bind the results of those operations to variables so that we can reuse them below. Since the median is computed differently for an odd vs even number of input values, we use <xsl:choose> (lines 17–27) to manage the branching. The XPath mod operator returns the remainder of integer division, so if we divide the count by 2, the remainder will be 0 if there is an even number of input items and 1 if there is an odd number.

The median of an even number of sorted input values (lines 18–22) is the mean of the middle two values. We know that the count is an even number, which means when we divide by 2 the result will be an integer that corresponds to the position in the sorted input just before the midpoint. For example, if there are 6 input values, 6 div 2 equals 3 and the third and fourth items straddle the midpoint. We assign the value of our division to the variable $half and then select the value at that offset into the sorted input sequence plus the one after it, which gives us a sequence of two values (in this case, the third and fourth items in the sorted sequence of 6 items). We then use the standard XPath library function avg() to average them.

The median of an odd number of values (lines 23–26) is the middle value in the sorted sequence. We find that offset by performing integer division (idiv, not div; integer division ignores any remainder) on the count and adding 1. For example, if there are 5 items in the sequence, 5 idiv 2 equals 2, and when we add 1 to that, 3 is the middle position in a sequence of 5 items.
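The difference between mod, div, and idiv is easy to check directly. The following small stylesheet is an illustrative sketch, not part of the median function; it can be run against any XML document and prints one result per line.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <xsl:output method="text"/>
  <!-- The input document is ignored; any well-formed XML will do -->
  <xsl:template match="/">
    <!--
      6 mod 2      = 0    even count
      5 mod 2      = 1    odd count
      6 div 2      = 3    position just before the midpoint of 6 items
      5 div 2      = 2.5  div keeps the fraction
      5 idiv 2     = 2    idiv discards the fraction
      5 idiv 2 + 1 = 3    middle position of 5 items
    -->
    <xsl:value-of select="6 mod 2, 5 mod 2, 6 div 2, 5 div 2, 5 idiv 2, 5 idiv 2 + 1"
      separator="&#10;"/>
  </xsl:template>
</xsl:stylesheet>
```

Because 5 div 2 yields 2.5, it cannot serve as a position in a sequence, which is why the odd branch uses idiv.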

Calling a user-defined function inside an XSLT transformation

You can call a user-defined function the same way you call a standard library function, so djb:median((1, 3, 2, 4, 3)) returns the value 3. The five acts of Hamlet have 5, 2, 4, 7, and 2 scenes, respectively, so the median number of scenes is 4. You can confirm that by transforming bad-hamlet.xml with the following stylesheet:
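A sketch of such a stylesheet follows. It assumes that bad-hamlet.xml marks acts as act elements with scene children, in no namespace; those element names, the variable names inside the function, and the output settings are assumptions rather than details confirmed by this tutorial, and sort() assumes XPath 3.1 support.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:djb="http://www.obdurodon.org"
  exclude-result-prefixes="#all" version="3.0">
  <xsl:output method="html" html-version="5" indent="yes"/>

  <!-- djb:median: the median of a sequence of one or more doubles -->
  <xsl:function name="djb:median" as="xs:double">
    <xsl:param name="input" as="xs:double+"/>
    <xsl:variable name="sorted" as="xs:double+" select="sort($input)"/>
    <xsl:variable name="n" as="xs:integer" select="count($input)"/>
    <xsl:choose>
      <xsl:when test="$n mod 2 eq 0">
        <!-- Even count: average the two middle values -->
        <xsl:variable name="half" select="$n div 2"/>
        <xsl:sequence select="avg($sorted[position() = ($half, $half + 1)])"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- Odd count: take the single middle value -->
        <xsl:sequence select="$sorted[$n idiv 2 + 1]"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <xsl:template match="/">
    <html>
      <head>
        <title>Scenes per act</title>
      </head>
      <body>
        <h1>Scenes per act</h1>
        <table>
          <tr>
            <th>Act</th>
            <th>Scenes</th>
          </tr>
          <!-- Assumed input structure: act elements with scene children -->
          <xsl:for-each select="//act">
            <tr>
              <td><xsl:value-of select="position()"/></td>
              <td><xsl:value-of select="count(scene)"/></td>
            </tr>
          </xsl:for-each>
          <tr>
            <th>Median</th>
            <!-- The integer counts are promoted to doubles when passed in -->
            <td><xsl:value-of select="djb:median(//act/count(scene))"/></td>
          </tr>
        </table>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

The call djb:median(//act/count(scene)) passes one scene count per act, so the function receives a sequence of five values.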


The result of that transformation is:




    
Scenes per act

Act     Scenes
1       5
2       2
3       4
4       7
5       2
Median  4