Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2022-12-05T16:53:17+0000
For your last assignment you read a tutorial that introduced XSLT user-defined functions. The sample task that we used in that tutorial to illustrate how to create and use a user-defined function involved computing the arithmetic mode(s) of a sequence of integers, and then using that function to find the modal number of scenes in the acts of Hamlet.
The tutorial discussed two of the three types of average in common use in statistical overviews, the mean and the mode, but it made only passing reference to the median, which is the third type of average. The median is determined by arranging all of the values in order and selecting the one in the middle. The median is useful when the input data is unbalanced because the median is not as strongly affected by outliers as the mean. For example, if all 100 persons who live in a hypothetical city earn $50,000/yr each and Elon Musk suddenly moves in and earns $2.3 billion (in unrealized stock options, rather than salary, but it’s reasonable to think that that is nonetheless a type of earned wealth), the mean earnings for the 101 persons is $22,821,782.18 (the sum of all 101 earnings amounts divided by 101). While in a certain sense this is the average earning of those persons, in a different sense the average person earns $50,000, which is the median value, since if we arrange the 101 values in order from least to most, the middle one is $50,000. None of the three types of average tells the whole story (the whole point of statistics is to summarize at the expense of individual details), but each nonetheless reports something that is true about the data.
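The arithmetic in the example above can be checked with a few lines of XSLT. The following is only a sketch: the variable name $earnings is our own, and the position 51 is hard-coded because we know that this particular sequence has exactly 101 items. It can be run against any XML document, since the template ignores the content of the input.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- 100 residents at 50000 each, plus one outlier at 2.3 billion -->
        <xsl:variable name="earnings" as="xs:integer+"
            select="((1 to 100) ! 50000, 2300000000)"/>
        <!-- Mean, rounded to two decimal places: 22821782.18 -->
        <xsl:value-of select="'mean: ' || round(avg($earnings), 2)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Middle (51st) of the 101 sorted values: 50000 -->
        <xsl:value-of select="'middle value: ' || sort($earnings)[51]"/>
    </xsl:template>
</xsl:stylesheet>
```

The single outlier pulls the mean up to more than 22 million, while the middle value stays at 50000.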
When we constructed a function to compute a mode in the tutorial, we required the input to be sequences of one or more integers. We made that decision for the following reasons:
Our reason for requiring at least one input value was that we couldn’t come up with a meaningful way to understand what the most frequent value would be if there were no values. We could, as an alternative, have returned an empty sequence when the input was an empty sequence, which is what the standard library function avg() returns for avg(()). In Real Life returning the empty sequence is better, but it requires more code, and for tutorial purposes we wanted to keep the example as simple as possible.
Our reason for requiring integers was that doubles can be infinitely varied, which meant that it was likely that most doubles would appear only once in the input, a data shape where the mode is not very informative. The mean and the median don’t have that problem.
For this exercise we will assume that computing the median of an empty sequence is an error, but any other sequence of numerical values (integer or double; negative, positive, or zero) is acceptable.
Your task for this assignment is to write a function that computes an arithmetic median and to apply it to find the median number of scenes per act in Hamlet. The actual median value for the five scene counts (5, 2, 4, 7, 2 for Acts 1–5, respectively) in Hamlet is 4 because 2 of the 5 input counts are lower than 4 (2 instances of 2) and 2 are higher (5, 7).
Although the count of scenes in an act of Hamlet will always be a positive integer, you should write your function in a more general way that allows one or more doubles as input. If a function requires doubles and you give it integers it will (in most cases) convert them to doubles internally (this is called type promotion; see Kay, p. 548), so requiring doubles does not prevent you from submitting integers. There are three possible types of input: a sequence that contains an odd number of values, a sequence that contains an even number of values, and an empty sequence.
The median of an odd number of input values is the value in the middle when you sort the input items from lowest to highest. For example, the median of (5, 2, 4, 7, 2) is 4, as explained above.
The median of an even number of input values is the mean of the two values in the middle. For example, the median of (5, 4, 7, 2) is 4.5 because the middle two values are 4 and 5 and the mean of 4 and 5 is 4.5. You should use the standard library avg() function to compute a mean.
Your function should require at least one item in the input sequence, so an empty sequence of input values should raise an error, as should an input item that is not a double and cannot be promoted to (that is, regarded as equal to) a double.
Define a namespace for your function and bind it to a prefix. The namespace does not have to be a working URL, but we’d suggest using your project URL and binding it to a project-related prefix. For example, the Ovid project might define xmlns:ovid="http://ovid.obdurodon.org". You could then invoke your function as ovid:median().
Create an <xsl:function> element for your user-defined function. Specify a function name (in your namespace) as the value of the @name attribute and a return type as the value of the @as attribute.
Create an empty <xsl:param> element inside your function body with @name and @as attributes. This parameter will hold the values you input into the function when you invoke it.
The rest of your function body will compute the return value of your function. XSLT doesn’t have an explicit return statement; whatever the function body evaluates to becomes the value that gets returned by the function.
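The steps above can be sketched with a function that is deliberately not the median. Everything specific here is an assumption for illustration: the function name ovid:spread, the parameter name $input, and the choice of xs:double+ as the parameter type (the + occurrence indicator means one or more items, so an empty sequence or a string raises a type error when the function is called). The body simply returns the difference between the largest and smallest values.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ovid="http://ovid.obdurodon.org"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>

    <!-- Hypothetical function: @name is in our namespace, @as is the return type -->
    <xsl:function name="ovid:spread" as="xs:double">
        <!-- One or more doubles; integers are promoted when the function is called -->
        <xsl:param name="input" as="xs:double+"/>
        <!-- No return statement: the value of the body is what gets returned -->
        <xsl:sequence select="max($input) - min($input)"/>
    </xsl:function>

    <xsl:template match="/">
        <!-- Outputs 5, that is, 7 minus 2 -->
        <xsl:value-of select="ovid:spread((5, 2, 4, 7, 2))"/>
    </xsl:template>
</xsl:stylesheet>
```

A median function could follow the same outline, with a different name and a body that implements the logic described in this assignment.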
The logic to compute a median is easy to state in plain language, and we do that above. Note that your function needs to deal with three possible situations: an odd number of input values, an even number of input values, and no input values. Here are a few code snippets that you may find helpful:
You can count the items in a sequence with count(). You can determine whether that count is odd or even by dividing it by 2 and looking at the remainder; the XPath mod operator returns the remainder of a division operation. For example, 3 mod 2 returns 1, which is the remainder after dividing 2 into 3. With that approach zero looks like an even number because 0 mod 2 = 0, but, as we mention above, zero input items require special handling.
You can sort a sequence with sort().
If your input is a sequence of five values in a variable called $input, you can sort the sequence and then get the third item in sorted order with sort($input)[3]. Do not hard-code the value 3 in your function. The number 3 is the middle position only when there are exactly five values. Your function must find the median for any non-zero number of values, which means that you have to write code that will find the middle value of an odd number of items and the two middle values of an even number (which, as explained above, you then average).
To test your function and ensure that it works with different types of input, you can insert lines where you supply alternative input. For example, if you add to the XSLT something like (using your own function name in your own namespace, which might differ from ours):
<xsl:value-of select="djb:median((5, 4, 7, 2))"/>
near the area where you are reporting descriptive statistics for Hamlet, it will compute the median of the direct input. You should test, for your median function, odd numbers of values, even numbers of values, integers, doubles, non-numbers (e.g., strings), and an empty sequence. Include sequences of different lengths and both sequences that are already sorted and those that are not. Input that includes strings and the empty sequence should raise errors; input that consists of one or more numeric values should return the median. Perform a sanity check, that is, input some sequences where you know what the median should be and verify that the result returned by the function is correct.
You can input an empty sequence, for testing purposes, as djb:median(()), with inner parentheses for reasons explained in the tutorial (in the first box in the “Calling our user-defined function inside an XSLT transformation” section).
Alternatively, you can ask for the median of counts from Hamlet that don’t exist, along the lines of:
djb:median(//play/count(*))
The argument in this expression evaluates to an empty sequence because there is no element called <play> in this markup.
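Before testing a complete median function, it can help to see what the count(), mod, and sort() building blocks mentioned earlier return for odd, even, and empty input. The following sketch uses variable names of our own choosing ($odd, $even, $empty) and the sample values from this assignment. It can be run against any XML document, since the template ignores the content of the input.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all" version="3.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <xsl:variable name="odd" as="xs:integer*" select="(5, 2, 4, 7, 2)"/>
        <xsl:variable name="even" as="xs:integer*" select="(5, 4, 7, 2)"/>
        <xsl:variable name="empty" as="xs:integer*" select="()"/>
        <!-- Remainders after dividing each count by 2: 1 0 0 -->
        <xsl:value-of
            select="count($odd) mod 2, count($even) mod 2, count($empty) mod 2"/>
        <xsl:text>&#10;</xsl:text>
        <!-- The five values in sorted order: 2 2 4 5 7 -->
        <xsl:value-of select="sort($odd)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Third item in sorted order (the middle only because there are five): 4 -->
        <xsl:value-of select="sort($odd)[3]"/>
    </xsl:template>
</xsl:stylesheet>
```

The first line of output shows that the empty sequence and the even-length sequence both produce a remainder of 0, which is why the empty case cannot be detected with mod alone.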
Submit just your XSLT, which we will run against Hamlet and with test data to verify your function for computing the median.