# <oo>→<dh> Digital humanities

## User-defined functions in XSLT: exercise 1

### Context

For your last assignment you read a tutorial that introduced XSLT user-defined functions. The sample task that we used in that tutorial to illustrate how to create and use a user-defined function involved computing the arithmetic mode(s) of a sequence of integers, and then using that function to find the modal number of scenes in the acts of Hamlet.

The tutorial discussed two of the three types of average in common use in statistical overviews, the mean and the mode, but it made only passing reference to the median, which is the third type of average. The median is determined by arranging all of the values in order and selecting the one in the middle. The median is useful when the input data is unbalanced because the median is not as strongly affected by outliers as the mean. For example, if all 100 persons who live in a hypothetical city earn \$50,000/yr each and Elon Musk suddenly moves in and earns \$2.3 billion (in unrealized stock options, rather than salary, but it’s reasonable to think that that is nonetheless a type of earned wealth), the mean earnings for the 101 persons is \$22,821,782.18 (the sum of all 101 earnings amounts divided by 101). While in a certain sense this is the average earning of those persons, in a different sense the average person earns \$50,000, which is the median value, since if we arrange the 101 values in order from least to most, the middle one is \$50,000. None of the three types of average tells the whole story (the whole point of statistics is to summarize at the expense of individual details), but each nonetheless reports something that is true about the data.
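As a check on the arithmetic in the example above, the mean for the 101 residents (100 at \$50,000 plus one at \$2.3 billion) works out as follows:

$$
\frac{100 \times 50{,}000 + 2{,}300{,}000{,}000}{101} = \frac{2{,}305{,}000{,}000}{101} \approx 22{,}821{,}782.18
$$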

When we constructed a function to compute a mode in the tutorial, we required the input to be a sequence of one or more integers. We made that decision for the following reasons:

• Our reason for requiring at least one input value was that we couldn’t come up with a meaningful way to understand what the most frequent value would be if there were no values. We could, as an alternative, have returned an empty sequence when the input was an empty sequence, which is what the standard library `avg()` function does when called as `avg(())`. In Real Life returning the empty sequence is better, but it requires more code, and for tutorial purposes we wanted to keep the example as simple as possible.

• Our reason for requiring integers was that doubles can be infinitely varied, which meant that it was likely that most doubles would appear only once in the input, a data shape where the mode is not very informative. The mean and the median don’t have that problem.

For this exercise we will assume that computing the median of an empty sequence is an error, but any other sequence of numerical values (integer or double; negative, positive, or zero) is acceptable.

Your task for this assignment is to write a function that computes an arithmetic median and to apply it to find the median number of scenes per act in Hamlet. The actual median value for the five scene counts (5, 2, 4, 7, 2 for Acts 1–5, respectively) in Hamlet is 4 because 2 of the 5 input counts are lower than 4 (2 instances of 2) and 2 are higher (5, 7).

Although the count of scenes in an act of Hamlet will always be a positive integer, you should write your function in a more general way that allows one or more doubles as input. If a function requires doubles and you give it integers it will (in most cases) convert them to doubles internally (this is called type promotion; see Kay, p. 548), so requiring doubles does not prevent you from submitting integers. Do not use `[3]` to select the middle value, because the third value is the middle value only when there are exactly five values; your code needs to find the middle value for a sequence of any number of values. There are three possible types of input: a sequence that contains an odd number of values, a sequence that contains an even number of values, and an empty sequence.

• The median of an odd number of input values is the value in the middle when you sort the input items from lowest to highest. For example, the median of (5, 2, 4, 7, 2) is 4, as explained above.

• The median of an even number of input values is the mean of the two values in the middle. For example, the median of (5, 4, 7, 2) is 4.5 because the middle two values are 4 and 5 and the mean of 4 and 5 is 4.5. You should use the standard library `avg()` function to compute a mean.

• Your function should require at least one item in the input sequence, so an empty sequence of input values should raise an error, as should an input item that is not a double and cannot be promoted to (that is, regarded as equal to) a double.
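The requirements in the last bullet can be met by type declarations alone, without explicit error-handling code. The following is a minimal sketch that illustrates this with a hypothetical function, `ex:range()`, which returns the difference between the largest and smallest input values. The function name, the `ex` prefix, and its namespace URI are invented for this illustration; they are not part of the tutorial or of the assignment, and the function does not compute a median.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns/illustration"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical illustration only: ex:range() is not the median -->
    <xsl:function name="ex:range" as="xs:double">
        <!-- xs:double+ means "one or more doubles": an empty sequence
             or a string is rejected; integers are promoted to doubles -->
        <xsl:param name="input" as="xs:double+"/>
        <xsl:sequence select="max($input) - min($input)"/>
    </xsl:function>

    <xsl:template match="/">
        <!-- Integer input is accepted; 7 - 2 yields 5 -->
        <xsl:value-of select="ex:range((5, 2, 4, 7, 2))"/>
    </xsl:template>
</xsl:stylesheet>
```

With these declarations, calls such as `ex:range(())` or `ex:range(('a', 'b'))` should raise a type error because the argument does not match `xs:double+`, while a sequence of integers is promoted and processed normally.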

### How to proceed

You can copy the sample XSLT in the “Calling our user-defined function inside an XSLT transformation” section of our tutorial and add your own `<xsl:function>` element alongside ours. You may either use our user-defined function namespace or declare and use your own. You can then add a line to report the median number of scenes near the place where we output the mean and mode.

The logic to compute a median is easy to state in plain language, although the statements are a bit different depending on whether you have an odd or an even number of values. Here are a few code snippets that you may find helpful:

• You can count the number of items in the input sequence with `count()`.
• You can sort the input sequence with `sort()`.
• You can get a value at a particular position in the input sequence with a numerical predicate. For example, if you have assigned your input sequence to a variable called `$input`, you can sort the sequence and then get the third item in sorted order with `sort($input)[3]`. Do not hard-code the value `3` in your function. The number `3` is the middle position only when there are exactly five values. Your function must find the median for any non-zero number of values.
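The three building blocks above can be tried out in isolation before you combine them. The sketch below is an illustration under stated assumptions, not a solution: it assigns the Hamlet scene counts to a variable and reports the count, the sorted sequence, and the smallest and largest values. It assumes a processor that supports `sort()` and can be run against any XML input, since it ignores the source document. Note that the last expression uses a computed position rather than a hard-coded one.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
    <xsl:output method="text"/>

    <xsl:template match="/">
        <!-- Sample input: scene counts for Acts 1-5 of Hamlet -->
        <xsl:variable name="input" select="(5, 2, 4, 7, 2)"/>

        <!-- count() returns the number of items: 5 -->
        <xsl:value-of select="'count:', count($input), '&#x0a;'"/>

        <!-- sort() returns the items in ascending order: 2 2 4 5 7 -->
        <xsl:value-of select="'sorted:', sort($input), '&#x0a;'"/>

        <!-- A numerical predicate selects by position: [1] is the smallest, 2 -->
        <xsl:value-of select="'smallest:', sort($input)[1], '&#x0a;'"/>

        <!-- The position can be computed instead of hard-coded;
             the item at position count($input) is the largest, 7 -->
        <xsl:value-of select="'largest:', sort($input)[count($input)], '&#x0a;'"/>
    </xsl:template>
</xsl:stylesheet>
```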

To test your function and ensure that it works with different types of input, you can insert lines where you supply alternative input. For example, if you add to the XSLT something like (using your own function name in your own namespace, which might differ from ours):

``<xsl:value-of select="djb:median((5, 4, 7, 2))"/>``

near the area where you are reporting descriptive statistics for Hamlet, it will compute the median of the direct input. You should test your median function with odd numbers of values, even numbers of values, integers, doubles, non-numbers (e.g., strings), and an empty sequence. Include sequences that are already sorted and those that are not. Input that includes strings and the empty sequence should raise errors; input that consists of one or more numeric values should return the median. Perform a sanity check, that is, input some sequences where you know what the median should be and verify that the result returned by the function is correct.

You can input an empty sequence, for testing purposes, as `djb:median(())`, with inner parentheses for reasons explained in the tutorial (in the first box in the “Calling our user-defined function inside an XSLT transformation” section). Alternatively, you can ask for the median of counts from Hamlet that don’t exist, along the lines of:

``<xsl:value-of select="djb:median(//play/count(descendant::scene))"/>``

This expression will return an empty sequence because there is no element called `<play>` in this markup.

### What to submit

Submit just your XSLT, which we will run against Hamlet and with test data to verify your function for computing the median.