Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-12-05T16:53:17+0000


User-defined functions in XSLT: exercise 1

Context

For your last assignment you read a tutorial that introduced XSLT user-defined functions. The sample task that we used in that tutorial to illustrate how to create and use a user-defined function involved computing the arithmetic mode(s) of a sequence of integers, and then using that function to find the modal number of scenes in the acts of Hamlet.

The tutorial discussed two of the three types of average in common use in statistical overviews, the mean and the mode, but it made only passing reference to the median, which is the third type of average. The median is determined by arranging all of the values in order and selecting the one in the middle. The median is useful when the input data is unbalanced because the median is not as strongly affected by outliers as the mean. For example, if all 100 persons who live in a hypothetical city earn $50,000/yr each and Elon Musk suddenly moves in and earns $2.3 billion (in unrealized stock options, rather than salary, but it’s reasonable to think that that is nonetheless a type of earned wealth), the mean earnings for the 101 persons is $22,821,782.18 (the sum of all 101 earnings amounts divided by 101). While in a certain sense this is the average earning of those persons, in a different sense the average person earns $50,000, which is the median value, since if we arrange the 101 values in order from least to most, the middle one is $50,000. None of the three types of average tells the whole story (the whole point of statistics is to summarize at the expense of individual details), but each nonetheless reports something that is true about the data.
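The arithmetic in this example can be checked with a short XSLT sketch. It is an illustration only. It assumes an XSLT 3.0 processor, so that it can run without an input document through the xsl:initial-template entry point, and the variable name is ours.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- 100 residents at $50,000 each, plus one outlier at $2.3 billion -->
        <xsl:variable name="earnings" as="xs:integer+"
            select="(for $i in 1 to 100 return 50000, 2300000000)"/>
        <xsl:text>Mean: </xsl:text>
        <xsl:value-of select="format-number(avg($earnings), '#,##0.00')"/>
        <!-- The values are already in ascending order, so the median of
             101 values is the 51st one -->
        <xsl:text>&#10;Median: </xsl:text>
        <xsl:value-of select="format-number($earnings[51], '#,##0.00')"/>
    </xsl:template>
</xsl:stylesheet>
```

The mean comes out at 22,821,782.18 and the median at 50,000.00, matching the figures in the paragraph above.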

When we constructed a function to compute a mode in the tutorial, we required the input to be sequences of one or more integers. We made that decision for the following reasons:

For this exercise we will assume that computing the median of an empty sequence is an error, but any other sequence of numerical values (integer or double; negative, positive, or zero) is acceptable.

The task

Your task for this assignment is to write a function that computes an arithmetic median and to apply it to finding the median number of scenes per act in Hamlet. The actual median value for the five scene counts (5, 2, 4, 7, 2 for Acts 1–5, respectively) in Hamlet is 4 because 2 of the 5 input counts are lower than 4 (2 instances of 2) and 2 are higher (5, 7).

Although the count of scenes in an act of Hamlet will always be a positive integer, you should write your function in a more general way that allows one or more doubles as input. If a function requires doubles and you give it integers it will (in most cases) convert them to doubles internally (this is called type promotion; see Kay, p. 548), so requiring doubles does not prevent you from submitting integers. There are three possible types of input: a sequence that contains an odd number of values, a sequence that contains an even number of values, and an empty sequence.
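Type promotion can be observed directly with a small sketch. The function ex:is-double and the ex namespace below are hypothetical names invented for this illustration, and the stylesheet assumes an XSLT 3.0 processor so that it can run without an input document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <!-- Hypothetical helper: reports whether the value it receives is a double -->
    <xsl:function name="ex:is-double" as="xs:boolean">
        <xsl:param name="value" as="xs:double"/>
        <xsl:sequence select="$value instance of xs:double"/>
    </xsl:function>
    <xsl:template name="xsl:initial-template">
        <!-- Outside the function, the literal 5 is an integer, not a double -->
        <xsl:value-of select="5 instance of xs:double"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Inside the function, the same 5 has been promoted to a double -->
        <xsl:value-of select="ex:is-double(5)"/>
    </xsl:template>
</xsl:stylesheet>
```

The first line of output is false and the second is true: the integer is converted to a double when it is bound to the parameter, which is why a function that requires doubles still accepts integer scene counts.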

How to proceed

The logic to compute a median is easy to state in plain language, and we do that above. Note that your function needs to deal with three possible situations: an odd number of input values, an even number of input values, and no input values. Here are a few code snippets that you may find helpful:
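As a hedged illustration of the building blocks involved (not necessarily the snippets the course originally supplied), the sketch below sorts a sequence, counts its members, tests whether the count is odd or even, and finds the middle position. It uses the Hamlet scene counts as literal data, the variable names are ours, and it assumes an XSLT 3.0 processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- Scene counts for Acts 1 to 5, promoted to doubles by the as attribute -->
        <xsl:variable name="values" as="xs:double+" select="(5, 2, 4, 7, 2)"/>
        <!-- Sort numerically; the sort key is each value itself -->
        <xsl:variable name="sorted" as="xs:double+">
            <xsl:perform-sort select="$values">
                <xsl:sort select="."/>
            </xsl:perform-sort>
        </xsl:variable>
        <xsl:variable name="n" as="xs:integer" select="count($sorted)"/>
        <!-- For an odd count this is the middle position; for an even
             count it is the lower of the two middle positions -->
        <xsl:variable name="mid" as="xs:integer" select="($n + 1) idiv 2"/>
        <xsl:text>Sorted: </xsl:text>
        <xsl:value-of select="$sorted" separator=", "/>
        <xsl:text>&#10;Count mod 2 (1 means odd, 0 means even): </xsl:text>
        <xsl:value-of select="$n mod 2"/>
        <xsl:text>&#10;Middle position: </xsl:text>
        <xsl:value-of select="$mid"/>
        <xsl:text>&#10;Value at middle position: </xsl:text>
        <xsl:value-of select="$sorted[$mid]"/>
    </xsl:template>
</xsl:stylesheet>
```

For the Hamlet data the sorted sequence is 2, 2, 4, 5, 7, the count is odd, the middle position is 3, and the value there is 4.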

To test your function and ensure that it works with different types of input, you can insert lines where you supply alternative input. For example, if you add to the XSLT a line that calls your function on a literal sequence of values (using your own function name in your own namespace, which might differ from ours) near the area where you are reporting descriptive statistics for Hamlet, it will compute the median of the direct input. You should test your median function with odd numbers of values, even numbers of values, integers, doubles, non-numbers (e.g., strings), and an empty sequence. Include sequences of different lengths, both sequences that are already sorted and those that are not. Input that includes strings and the empty sequence should raise errors; input that consists of one or more numeric values should return the median. Perform a sanity check: input some sequences for which you know what the median should be and verify that the result returned by the function is correct.

You can input an empty sequence, for testing purposes, as djb:median(()), with inner parentheses for reasons explained in the tutorial (in the first box in the “Calling our user-defined function inside an XSLT transformation” section). Alternatively, you can ask for the median of counts from Hamlet that don’t exist, for example by passing your function a path expression that looks for <play> elements. Such an expression will return an empty sequence because there is no element called <play> in this markup.
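The testing strategy described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference solution: the function name ex:median and its namespace are hypothetical, the median of an even number of values is assumed to be the mean of the two middle values (the conventional definition), and the processor is assumed to support XSLT 3.0 together with the XPath 3.1 sort() function.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.com/ns"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <!-- Hypothetical median function; xs:double+ rejects the empty
         sequence and literal strings, and accepts integers by promotion -->
    <xsl:function name="ex:median" as="xs:double">
        <xsl:param name="input" as="xs:double+"/>
        <xsl:variable name="sorted" as="xs:double+" select="sort($input)"/>
        <xsl:variable name="n" as="xs:integer" select="count($sorted)"/>
        <xsl:variable name="mid" as="xs:integer" select="($n + 1) idiv 2"/>
        <xsl:sequence select="
                if ($n mod 2 eq 1) then
                    $sorted[$mid]
                else
                    ($sorted[$mid] + $sorted[$mid + 1]) div 2"/>
    </xsl:function>
    <xsl:template name="xsl:initial-template">
        <!-- Odd count, unsorted integers (the Hamlet scene counts): 4 -->
        <xsl:value-of select="ex:median((5, 2, 4, 7, 2))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Even count, already sorted: 2.5 -->
        <xsl:value-of select="ex:median((1, 2, 3, 4))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Mixed decimal, negative, and zero values: 0 -->
        <xsl:value-of select="ex:median((3.5, -1, 0))"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Single value: 42 -->
        <xsl:value-of select="ex:median(42)"/>
        <!-- Each of the following should raise a type error; move one
             at a time out of this comment to confirm:
             <xsl:value-of select="ex:median(())"/>
             <xsl:value-of select="ex:median(('a', 'b'))"/>
        -->
    </xsl:template>
</xsl:stylesheet>
```

The error cases are left inside a comment because a processor may report such type errors when it compiles the stylesheet, which would prevent the passing tests from running at all.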

What to submit

Submit just your XSLT, which we will run against Hamlet and with test data to verify your function for computing the median.