Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-04-10T18:23:30+0000


Test #6: Schematron

The data

The data for the Schematron test is the Shakespearean sonnets file that you worked with previously, which is located at http://dh.obdurodon.org/shakespeare-sonnets.xml.

The tasks

Create a Schematron schema to validate the following two features of the sonnets file. Test your schema against both good and bad data to be sure that it both catches actual problems and does not erroneously report a problem when there isn’t one. The bonus tasks at the end are optional. Upload only your Schematron schema to Canvas.

Task 1: Does the Roman numeral match the position of the sonnet?

The Roman numeral of the sonnet should be the same as its position in the sequence of sonnets. For example, the sonnet with Roman numeral CXXVI should be the 126th <sonnet> child of the root element.

The challenge for this task is that the XPath position() function will return the position of the current context item as an Arabic numeral, which means that you have to compare an Arabic numeral to a Roman numeral. In order to perform that comparison you need to convert one format to the other, that is, you need to compare either two Roman numerals (which are strings) or two integer values (represented by Arabic numerals). Converting Roman numerals to Arabic numerals is difficult (see the note below), but converting Arabic numerals to Roman numerals is easy because XPath comes with a format-integer() function that can do that for you. This means that you can convert the position value to a Roman numeral, which is a string, and then compare it to the Roman numeral string that is the value of the @number attribute of the sonnet.

The format-integer() function is new in XPath 3.0, so it isn’t in Michael Kay’s book, but you can read about it in the spec at https://www.w3.org/TR/xpath-functions-31/#func-format-integer. The version you want takes two arguments: an integer value (such as the position() of a particular sonnet or a count of its preceding sibling <sonnet> elements plus 1) and what the specification calls a picture string, which describes how the integer is to be formatted. Legal picture string values are predefined by the specification, and the picture string to format an integer value as an upper-case Roman numeral string is I. For example, format-integer(123, 'I') returns the string value "CXXIII".

The strategy we would recommend, then, is to match each sonnet, determine its position, convert the position to a Roman numeral by using the format-integer() function, and verify that the result is equal to the Roman numeral that is the value of the @number attribute on the sonnet.
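The strategy above can be sketched roughly as follows. This is one possible shape for such a rule, not the only correct one. It assumes that the sonnets file is not in a namespace and that your Schematron processor supports queryBinding="xslt3" (needed for XPath 3 functions such as format-integer()). The variable names position and expected are our own choices for illustration. The sketch counts preceding siblings instead of calling position(), because the value that position() returns inside a Schematron rule can depend on how the processor is implemented.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt3">
    <sch:pattern>
        <!-- Fire once for every <sonnet> element (assumed to be in no namespace) -->
        <sch:rule context="sonnet">
            <!-- Position among sibling sonnets, computed explicitly -->
            <sch:let name="position" value="count(preceding-sibling::sonnet) + 1"/>
            <!-- The same position as an upper-case Roman numeral; 126 becomes "CXXVI" -->
            <sch:let name="expected" value="format-integer($position, 'I')"/>
            <!-- The assertion fails if @number is absent or differs from the expected string -->
            <sch:assert test="@number eq $expected">
                Sonnet in position <sch:value-of select="$position"/> is numbered
                "<sch:value-of select="@number"/>" but should be numbered
                "<sch:value-of select="$expected"/>".
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

To try a rule like this, change one @number value in a copy of the sonnets file (for example, IV to IIII) and confirm that only that sonnet is reported.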

By the way, this validation will catch two types of errors. It will find sonnets whose Roman numeral is well formed but does not match their position (for example, because a sonnet is missing or out of order), and it will also find Roman numerals that are malformed because of a typing mistake. Because we aren’t used to working with Roman numerals it’s easy to make a mistake when we try to type them, but the XPath format-integer() function is incapable of making that kind of mistake.

Task 2: Does every sonnet have exactly 14 lines?

Sonnets normally have fourteen lines, but two Shakespearean sonnets famously do not: one has 15 lines and one has 12 lines. We don’t want this reported as an error in our XML because it is correct for those two sonnets, but we do want Schematron to tell us about sonnets with a line count other than 14 so that we can correct any missing or extra lines for other sonnets.

Your task is to write a Schematron rule that will check whether each sonnet has exactly 14 lines and raise a warning (not an error) if the line count is not equal to 14. By default the notifications that Schematron displays are considered error messages, and they’re flagged that way in three places:

You can tell Schematron to report a warning instead of an error, though, and if you do that you get the same three types of notification, except that everything that is red for an error will be yellow for a warning. Since sonnets with a line count other than 14 are not necessarily errors, but we do want to be notified about them because they could be errors, we’re going to ask Schematron to report any counts other than 14 as warnings. The way to tell Schematron that a <sch:assert> or <sch:report> should issue a warning instead of raising an error is to add a @role attribute to the <sch:assert> or <sch:report> with a value of either warn or warning.
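As a rough illustration of the @role attribute described above, the following sketch raises a warning when a sonnet does not have 14 lines. It assumes that each line of verse is a <line> child of its <sonnet> and that the file is not in a namespace, so check the element names in your own copy of the data. The variable name lineCount is our own choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt3">
    <sch:pattern>
        <sch:rule context="sonnet">
            <!-- Assumption: lines of verse are <line> children of <sonnet> -->
            <sch:let name="lineCount" value="count(line)"/>
            <!-- role="warn" downgrades the notification from an error to a warning -->
            <sch:assert test="$lineCount eq 14" role="warn">
                Sonnet <sch:value-of select="@number"/> has
                <sch:value-of select="$lineCount"/> lines instead of 14.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

With a rule along these lines, the two sonnets that legitimately have 15 and 12 lines would be flagged as warnings, not errors, which is the behavior the task asks for.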

You don’t have to (= shouldn’t) specify a @role attribute value if you want the issue to be treated as an error, since that’s the default behavior. Schematron will let you specify a value of either error or fatal, but in Real Life we’ve never done that because Schematron is easier to read if we don’t clutter it up by specifying a behavior that we’ll get anyway because it’s the default.

By the way, if you want to be notified of some property of your document informationally (not an error or warning, but nonetheless a notification) you can specify a @role value of information or info, which will produce the same behavior as with errors and warnings except that the coloring will be blue. In Real Life what you consider an error, a cause for warning, or a cause for an informational notification is up to you. If you start to type a @role attribute on a <sch:assert> or <sch:report> element, <oXygen/> will prompt you about your options, so you don’t have to memorize them.
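For comparison, an informational notification might look something like the sketch below, which uses <sch:report> with a @role value of info to announce how many sonnets the file contains. The property being reported is a hypothetical example of our own and is not part of the test. As before, the sketch assumes that <sonnet> elements are children of the root element and are not in a namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt3">
    <sch:pattern>
        <!-- Match the root element, whatever it is called -->
        <sch:rule context="/*">
            <!-- A report fires when its test is true; true() makes it fire every time -->
            <sch:report test="true()" role="info">
                This file contains <sch:value-of select="count(sonnet)"/> sonnets.
            </sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```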

Optional bonus tasks

Converting Roman numerals to Arabic numeral notation

Converting Roman numerals to Arabic numeral notation is difficult, but it isn’t impossible, and there’s an example, with good, clear discussion, at “Converting Roman numerals with XQuery & XSLT I–IV”. The code is in XQuery, but the XQuery used there is mostly XPath, so even without having studied XQuery explicitly you can get a sense of the logic by reading the explanation.