# <oo>→<dh> Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-12-05T16:53:17+0000

## User-defined functions in XSLT: exercise 1

### Context

For your last assignment you read a tutorial that introduced XSLT user-defined functions. The sample task that we used in that tutorial to illustrate how to create and use a user-defined function involved computing the arithmetic mode(s) of a sequence of integers, and then using that function to find the modal number of scenes in the acts of Hamlet.

The tutorial discussed two of the three types of average in common use in statistical overviews, the mean and the mode, but it made only passing reference to the median, which is the third type of average. The median is determined by arranging all of the values in order and selecting the one in the middle. The median is useful when the input data is unbalanced because the median is not as strongly affected by outliers as the mean. For example, if all 100 persons who live in a hypothetical city earn \$50,000/yr each and Elon Musk suddenly moves in and earns \$2.3 billion (in unrealized stock options, rather than salary, but it’s reasonable to think that that is nonetheless a type of earned wealth), the mean earnings for the 101 persons is \$22,821,782.18 (the sum of all 101 earnings amounts divided by 101). While in a certain sense this is the average earning of those persons, in a different sense the average person earns \$50,000, which is the median value, since if we arrange the 101 values in order from least to most, the middle one is \$50,000. None of the three types of average tells the whole story (the whole point of statistics is to summarize at the expense of individual details), but each nonetheless reports something that is true about the data.
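As a rough illustration of the arithmetic above, the following sketch builds the 101 hypothetical earnings values and reports their mean alongside the middle value in sorted order. The variable name `$earnings` and the named initial template (which lets an XSLT 3.0 processor run the stylesheet without a source document) are our assumptions for this example, and the position `51` is hard-coded only because we know that there are exactly 101 values here.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- 100 residents at 50,000 each, plus one outlier (hypothetical data) -->
        <xsl:variable name="earnings" as="xs:integer+"
            select="(for $i in 1 to 100 return 50000), 2300000000"/>
        <!-- The mean is pulled far upward by the single outlier -->
        <xsl:text>Mean: </xsl:text>
        <xsl:value-of select="format-number(avg($earnings), '#,##0.00')"/>
        <!-- The 51st of 101 sorted values is the middle one -->
        <xsl:text>&#10;Median: </xsl:text>
        <xsl:value-of select="format-number(sort($earnings)[51], '#,##0')"/>
    </xsl:template>
</xsl:stylesheet>
```

Run this way, the stylesheet should report a mean of 22,821,782.18 and a median of 50,000, matching the figures in the paragraph above.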

When we constructed a function to compute a mode in the tutorial, we required the input to be sequences of one or more integers. We made that decision for the following reasons:

• Our reason for requiring at least one input value was that we couldn’t come up with a meaningful way to understand what the most frequent value would be if there were no values. We could, as an alternative, have returned an empty sequence when the input was an empty sequence, and this is what the `avg(())` standard library function returns. In Real Life returning the empty sequence is better, but it requires more code, and for tutorial purposes we wanted to keep the example as simple as possible.

• Our reason for requiring integers was that doubles can be infinitely varied, which meant that it was likely that most doubles would appear only once in the input, a data shape where the mode is not very informative. The mean and the median don’t have that problem.

For this exercise we will assume that computing the median of an empty sequence is an error, but any other sequence of numerical values (integer or double; negative, positive, or zero) is acceptable.

Your task for this assignment is to write a function that computes an arithmetic median and to apply it to finding the median number of scenes per act in Hamlet. The actual median value for the five scene counts (5, 2, 4, 7, 2 for Acts 1–5, respectively) in Hamlet is 4 because 2 of the 5 input counts are lower than 4 (2 instances of 2) and 2 are higher (5, 7).

Although the count of scenes in an act of Hamlet will always be a positive integer, you should write your function in a more general way that allows one or more doubles as input. If a function requires doubles and you give it integers it will (in most cases) convert them to doubles internally (this is called type promotion; see Kay, p. 548), so requiring doubles does not prevent you from submitting integers. There are three possible types of input: a sequence that contains an odd number of values, a sequence that contains an even number of values, and an empty sequence.

• The median of an odd number of input values is the value in the middle when you sort the input items from lowest to highest. For example, the median of (5, 2, 4, 7, 2) is 4, as explained above.

• The median of an even number of input values is the mean of the two values in the middle. For example, the median of (5, 4, 7, 2) is 4.5 because the middle two values are 4 and 5 and the mean of 4 and 5 is 4.5. You should use the standard library `avg()` function to compute a mean.

• Your function should require at least one item in the input sequence, so an empty sequence of input values should raise an error, as should an input item that is not a double and cannot be promoted to (that is, regarded as equal to) a double.
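A small sketch of how the standard library `avg()` function behaves with the two middle values from the even-length example above, and with an empty sequence. The named initial template is our assumption for making the fragment runnable on its own; it is not part of the assignment.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- Mean of the two middle values of (2, 4, 5, 7): 4.5 -->
        <xsl:value-of select="avg((4, 5))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- avg(()) returns the empty sequence, so counting its result gives 0 -->
        <xsl:value-of select="count(avg(()))"/>
    </xsl:template>
</xsl:stylesheet>
```

Note that `avg(())` quietly returns nothing rather than raising an error, which is why your own function has to enforce the requirement of at least one input item through its parameter type.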

### How to proceed

• Define a namespace for your function and bind it to a prefix. The namespace does not have to be a working URL, but we’d suggest using your project URL and binding it to a project-related prefix. For example, the Ovid project might define `xmlns:ovid="http://ovid.obdurodon.org"`. You could then invoke your function as `ovid:median()`.

• Create an `<xsl:function>` element for your user-defined function. Specify a function name (in your namespace) as the value of the `@name` attribute and a return type as the value of the `@as` attribute.

• Create an empty `<xsl:param>` element inside your function body with `@name` and `@as` attributes. This parameter will hold the values you input into the function when you invoke it.

• The rest of your function body will compute the return value of your function. XSLT doesn’t have an explicit return statement; whatever the function body evaluates to becomes the value that gets returned by the function.
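The steps above can be sketched with a deliberately different function, so as not to give away the median logic. Everything specific here is an assumption made for illustration: the namespace URI `http://www.obdurodon.org`, the function name `djb:range`, and the parameter name `$input`. The function returns the difference between the largest and smallest of one or more doubles.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:djb="http://www.obdurodon.org"
    exclude-result-prefixes="#all">
    <xsl:output method="text"/>

    <!-- Hypothetical example function (not the median): max minus min -->
    <xsl:function name="djb:range" as="xs:double">
        <!-- xs:double+ means one or more doubles; an empty sequence is a type error -->
        <xsl:param name="input" as="xs:double+"/>
        <!-- No return statement: the value of the body is the function result -->
        <xsl:sequence select="max($input) - min($input)"/>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <!-- Integers are promoted to doubles when passed to the function -->
        <xsl:value-of select="djb:range((5, 2, 4, 7, 2))"/>
    </xsl:template>
</xsl:stylesheet>
```

With the Hamlet scene counts as input this should output `5` (7 minus 2). Your median function will have the same outer shape, with a different name and a different body.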

The logic to compute a median is easy to state in plain language, and we do that above. Note that your function needs to deal with three possible situations: an odd number of input values, an even number of input values, and no input values. Here are a few code snippets that you may find helpful:

• You can count the number of items in the input sequence with `count()`. You can determine whether that count is odd or even by dividing it by 2 and looking at the remainder, and the XPath `mod` operator will return the value of the remainder of a division operation. For example, `3 mod 2` returns a value of `1`, which is the remainder after dividing 2 into 3. With that approach zero looks like an even number because `0 mod 2 = 0`, but, as we mention above, zero input items requires special handling.
• You can sort the input sequence with `sort()`.
• You can get a value at a particular position in the input sequence with a numerical predicate. For example, if you have assigned your input sequence to a variable called `$input`, you can sort the sequence and then get the third item in sorted order with `sort($input)[3]`. Do not hard-code the value `3` in your function. The number `3` is the middle position only when there are exactly five values. Your function must find the median for any non-zero number of values, which means that you have to write code that will find the middle value of an odd number of items and the two middle values of an even number (which, as explained above, you then average).
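The individual building blocks in the list above can be tried out in isolation, for example with a sketch like the following. The variable name `$input` follows the text; the named initial template and the particular positions selected are our assumptions, chosen only to show the syntax rather than to locate the middle of the sequence.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <xsl:variable name="input" as="xs:double+" select="(5, 2, 4, 7, 2)"/>
        <!-- Number of items: 5 -->
        <xsl:value-of select="count($input)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Remainder after dividing the count by 2: 1, so the count is odd -->
        <xsl:value-of select="count($input) mod 2"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Sorted from lowest to highest: 2 2 4 5 7 -->
        <xsl:value-of select="sort($input)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- A numerical predicate selects by position: first item is 2 -->
        <xsl:value-of select="sort($input)[1]"/>
        <xsl:text>&#10;</xsl:text>
        <!-- last() is the final position: 7 -->
        <xsl:value-of select="sort($input)[last()]"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Two positions can be selected at once: items 2 and 3 are 2 4 -->
        <xsl:value-of select="sort($input)[position() = (2, 3)]"/>
    </xsl:template>
</xsl:stylesheet>
```

In your function the positions will need to be computed from `count($input)` instead of being written as literal numbers.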

To test your function and ensure that it works with different types of input, you can insert lines where you supply alternative input. For example, if you add to the XSLT something like (using your own function name in your own namespace, which might differ from ours):

``<xsl:value-of select="djb:median((5, 4, 7, 2))"/>``

near the area where you are reporting descriptive statistics for Hamlet, it will compute the median of the direct input. Test your median function with odd numbers of values, even numbers of values, integers, doubles, non-numbers (e.g., strings), and an empty sequence. Include sequences of different lengths, both already sorted and unsorted. Input that includes strings and the empty sequence should raise errors; input that consists of one or more numeric values should return the median. Perform a sanity check, that is, input some sequences where you know what the median should be and verify that the result returned by the function is correct.

You can input an empty sequence, for testing purposes, as `djb:median(())`, with inner parentheses for reasons explained in the tutorial (in the first box in the section “Calling our user-defined function inside an XSLT transformation”). Alternatively, you can ask for the median of counts from Hamlet that don’t exist, along the lines of:

``<xsl:value-of select="djb:median(//play/count(*))"/>``

This expression will return an empty sequence because there is no element called `<play>` in this markup.

### What to submit

Submit just your XSLT, which we will run against Hamlet and with test data to verify your function for computing the median.