Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-03-16T17:28:03+0000


Schematron assignment #1: answers

The text

In a three-way election for Best Stooge Ever, each candidate (Curly, Larry, Moe) wins between 0% and 100% of the votes. Assume that all votes are cast for one of the three candidates (no abstentions, write-ins, invalid ballots, etc.), which means that when you add the percentages for the three candidates, the result must be exactly 100%. Assume also that we’re recording percentage of the vote, not raw votes, and that the percentages are all integer values. (In Real Life we’d probably record the raw count and calculate the percentages, but in real life we wouldn’t be voting for Best Stooge Ever in the first place!) Here’s a Relax NG schema for the results of the election:

start = results
results = element results { stooge+ }
stooge = element stooge { name, xsd:int }
name = attribute name { "Curly" | "Larry" | "Moe" }

Here’s a sample XML document that is valid against the preceding schema:

<results>
    <stooge name="Curly">50</stooge>
    <stooge name="Larry">35</stooge>
    <stooge name="Moe">15</stooge>
</results>

We could have written a better Relax NG schema, but we didn’t, and although our sloppy schema works with the results above, it also allows erroneous results like the following:

<results>
    <stooge name="Curly">55</stooge>
    <stooge name="Larry">38</stooge>
    <stooge name="Moe">11</stooge>
</results>

The task

The problem here is that the three percentage values total 104%, and no matter how good our coding, it is not possible to prevent this type of error by using Relax NG alone. Your assignment is to write a Schematron schema that verifies that the three percentages always total exactly 100%. Test your results by creating the Relax NG schema, your Schematron schema, and a sample XML document that you can validate against both schemas in <oXygen/>. Enter correct and incorrect values and verify that the Schematron schema is working correctly. For homework, upload only your Schematron schema.

Our solution

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="results">
            <assert test="sum(stooge) eq 100">
                The sum of the vote percentages does not equal 100%.
            </assert>
        </rule>
    </pattern>
</schema>

A digression about namespaces

We’ve set the Schematron namespace as the default namespace with xmlns="http://purl.oclc.org/dsdl/schematron". Notice that there is no namespace prefix in this statement; when we set the value of the @xmlns attribute equal to a value, we are declaring a default namespace, which will apply to the element on which the declaration occurs (the root <schema> element) and all of its descendants. We could, alternatively, have bound the Schematron namespace to the prefix sch:, which is what <oXygen/> does by default. In that case our root element might have looked like:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process">

This version binds the prefix sch: to the Schematron namespace, which means that all elements that begin with this prefix are in that namespace. In this case no default namespace is declared, so every Schematron instruction will have to be preceded by the sch: namespace prefix. It is possible to do both, that is, to bind a prefix to a namespace and to declare that namespace as the default, although we don’t find that very useful, since either method alone will do the job.
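
For comparison, here is a sketch of how our solution would look with the sch: prefix bound and no default namespace declared. It is the same schema as our solution, with only the element names changed, and without the sqf: declaration, which the schema does not use.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="results">
            <!-- Same test as in our solution; only the Schematron element names carry the prefix -->
            <sch:assert test="sum(stooge) eq 100">
                The sum of the vote percentages does not equal 100%.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

The prefix goes only on the Schematron elements. The names inside @context and @test (results, stooge) refer to elements in the document being validated, which are in no namespace, so they take no prefix.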

These two ways of ensuring that Schematron instruction elements are in the Schematron namespace are equivalent, so you can use either one. They do have different implications that may matter in other Schematron applications, though, and we’ll discuss those when they come up. In our sample solution above we removed the declaration of the Schematron Quick Fix namespace, which <oXygen/> binds to the prefix sqf:, because we aren’t using it. We removed it just to simplify the display, but if you prefer to leave it in and ignore it, that does no harm, and it will be available if you later decide to use it. You can learn about SQF, which we have found useful in real projects, in a demo video from the <oXygen/> team.

The Schematron file that we wrote uses only one <rule> inside one <pattern>, and we defined the value of the @context attribute of our <rule> element (equivalent to the @match attribute in <xsl:template> elements in XSLT) as results, which is an XPath pattern (not a full XPath path expression). Any <results> element in our document will be submitted to any tests we define inside this <rule>. The <assert> element inside this <rule> uses the XPath sum() function to total the values of all <stooge> elements located on the child axis of our current context, <results>, and compare that value to 100. It asserts that the sum will equal 100, and therefore raises an error (using the error message that we wrote as the content of the <assert> element) if it doesn’t.

Inside the <assert>, we write the error message that Schematron will generate when this test fails, that is, when the XML document breaks the rule. We wrote “The sum of the vote percentages does not equal 100%.”, but you could write anything that you feel would be informative to someone trying to correct the error.
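
One way to make the message more informative is to include the actual total. The following is a minimal sketch, not part of our solution, that uses the Schematron <value-of> element to insert the computed sum into the message; the wording of the message is invented for this illustration.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="results">
            <assert test="sum(stooge) eq 100">
                <!-- value-of is evaluated from the same context node as @test -->
                The vote percentages total <value-of select="sum(stooge)"/>%, but they should total 100%.
            </assert>
        </rule>
    </pattern>
</schema>
```

Run against the erroneous sample above (55, 38, and 11), a message like this should report a total of 104, which tells the person fixing the data how far off the numbers are.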

About XPath path expressions and XPath patterns

XPath path expressions in our Schematron

An XPath path expression, which is what we have been practicing in our XPath unit, is evaluated from a current context. In our XPath explorations, the <oXygen/> XPath exploration interface doesn’t have a current context, so we typically begin our path expressions with a slash, which means start at the document node. If you just type an element name in the <oXygen/> XPath exploration interface, it won’t find anything because it won’t know where to start looking. In technologies based on XPath, though, like Schematron, every XPath expression always has a current context, which is the node that the asserts and reports have fired on (in this case, a <results> element node).

Our rule fires once for each <results> element in the document. There happens to be only one, but in principle there could be more; for example, there could be a different election every year, each with its own <results> element, inside a single document that wraps them all in some higher-level root element. XPath expressions in the asserts and reports are relative to the current context, so when we ask for the sum of <stooge> elements, we mean the sum of <stooge> element children (because the child axis is the default XPath axis) of the <results> element being processed at the moment. This feature means that we can check the sums of multiple <results> elements, and for each one Schematron will look only at the children of the one being processed at that moment.

A common mistake is to write sum(//stooge) instead of sum(stooge). The reason this is a mistake, even though it happens to get the correct result, is that if you have multiple elections you’ll be summing all of the <stooge> values in the entire document, and not just in an individual <results> element. In other words, in this case you’ll get the correct result only accidentally. If you want to sum the <stooge> values that are children of a specific <results> element, you want to use the child axis to restrict yourself to only those <stooge> elements.
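
To see the difference, consider a hypothetical document with two elections. The <elections> wrapper and the @year attribute are assumptions made for this illustration; they are not part of the Relax NG schema above, which would have to be extended to allow them.

```xml
<elections>
    <!-- sum(stooge) with this <results> as context: 50 + 35 + 15 = 100 -->
    <results year="2020">
        <stooge name="Curly">50</stooge>
        <stooge name="Larry">35</stooge>
        <stooge name="Moe">15</stooge>
    </results>
    <!-- sum(stooge) with this <results> as context: 55 + 38 + 11 = 104 -->
    <results year="2021">
        <stooge name="Curly">55</stooge>
        <stooge name="Larry">38</stooge>
        <stooge name="Moe">11</stooge>
    </results>
</elections>
```

With sum(stooge), the rule fires twice, and only the second election is reported as an error. With sum(//stooge), each firing adds up all six values and gets 204, so both elections are reported, including the one that is correct.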

XPath patterns in our Schematron

An XPath pattern does not have to navigate anywhere and does not have a current context. An XPath pattern is typically only a partial path expression (only as long as it needs to be) that specifies what has to be matched for a rule to fire. A @context value like results will fire on each <results> element everywhere in the document; it says that whenever a location in the document matches the pattern (in this case, is a <results> element), the rule fires. If we were processing Hamlet, a value like body/div would process all <div> children of all <body> elements everywhere in the document.

Because XPath patterns match partial paths, they never need to begin with a double slash, since the double slash would add no meaning. In a full XPath path expression, a leading double slash means look everywhere in the document for a match, but XPath patterns do that automatically. For this reason, although writing <sch:rule context="//results"> will get the correct result, it’s the wrong answer, since the leading double slash adds no meaning to the pattern.

How to read XPath path expressions and XPath patterns

We find it most helpful to read XPath path expressions from the left, path step by path step, because each step specifies the current context(s) for the next step. An XPath expression like //body/div, then, means start at the document node, find all <body> elements on its descendant axis, and then, for each <body> element, find all <div> elements on its child axis.

We find it most helpful to read XPath patterns from the right. For example, an XPath pattern like body/div means find all <div> elements that are children of <body> elements. Reading from the right helps us avoid thinking that we have to navigate to the leftmost component of the pattern, and we don’t have to do that because XPath patterns match, but they don’t traverse.

When to use XPath expressions and when to use XPath patterns

Where we use XPath expressions and where we use XPath patterns is specified by the languages that use XPath, and is not up to us. In Schematron, the value of the @context attribute is defined as an XPath pattern, and the value of the @test attribute is defined as an XPath expression, for which the current context is the node that the @context attribute matched. If @context matches multiple nodes (for example, if there are multiple <results> elements in the document), the rule fires once for each of them, so only one of them will be the current context at a given moment in the validation process.
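
As a summary, here is our rule again with comments marking which attribute holds a pattern and which holds an expression. The code is unchanged from our solution; only the comments are new.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <!-- @context holds an XPath pattern: it matches every <results> element,
             wherever it occurs, so no leading // is needed -->
        <rule context="results">
            <!-- @test holds an XPath expression: stooge is short for child::stooge,
                 evaluated from the <results> element the rule is firing on -->
            <assert test="sum(stooge) eq 100">
                The sum of the vote percentages does not equal 100%.
            </assert>
        </rule>
    </pattern>
</schema>
```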

The bonus tasks

You can stop here and consider the assignment complete, but for more Schematron practice, you’re welcome to add additional rules to check for additional types of error. The following types of errors could have been controlled by writing a better Relax NG schema, but for the purpose of learning Schematron, let’s do it in Schematron:

  1. There should be exactly three votes, with exactly one for each Stooge. No duplicate Stooges and no missing Stooges.
  2. Each individual Stooge’s vote should range from 0 to 100. No negative integers and no integers greater than 100. (The Relax NG schema is ensuring that all values are integers, so you don’t have to worry about that.)
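
A test document for these checks needs to pass both the Relax NG schema and the sum test while still being wrong. The following sketch is one such document, with values made up for the purpose: the percentages total exactly 100 and there are three <stooge> elements, but one name is duplicated, one is missing, and two values are out of range.

```xml
<results>
    <!-- Duplicate name, and a value greater than 100 -->
    <stooge name="Curly">120</stooge>
    <!-- Negative value; there is no entry for Larry -->
    <stooge name="Curly">-20</stooge>
    <stooge name="Moe">0</stooge>
</results>
```

Because 120 + (-20) + 0 = 100, the rule from the main task accepts this document, so any errors reported against it must come from the bonus rules.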

Our bonus solution

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="results">
            <assert test="sum(stooge) eq 100">
                The sum of the vote percentages should equal 100%.
            </assert>
            <assert test="count(stooge) eq 3">
                There should be exactly 3 stooges.
            </assert>
            <assert test="count(stooge) eq count(distinct-values(stooge/@name))">
                No two stooges should have the same name.
            </assert>
        </rule>
        <rule context="stooge">
            <report test="number(.) lt 0 or number(.) gt 100">
                Vote percentages must be between 0 and 100.
            </report>
        </rule>
    </pattern>
</schema>

To specify that there should be three stooges, we added a second <assert> within the same rule, since the context is the same—that is, we want this new assertion to fire on the <results> element. This time we use the count() function to count stooges on the child axis, and compare that value to 3. To test that no stooges are repeated, we take advantage of the @name attribute and compare the count of all stooges against the count of the distinct values of stooge names. If all of the names are distinct from one another, the count of stooges will be equal to the count of distinct stooge names.

Finally, we want to test that for any given stooge, the vote percentage is within the range from 0 to 100. Since this is something that applies separately to each individual <stooge>, and not the <results> element as a whole, we created a new <rule> where the value of the @context attribute is now stooge. This means that it will fire once for each <stooge> element, and that it will check the value for that individual stooge. Inside that rule, we used a <report>, which outputs its message when the test inside it is true (because it is reporting that the real situation matches what the test requires), as opposed to <assert>, which triggers when false (because it is informing the developer that something asserted has failed to be satisfied). The test here is whether the percentage of votes for the stooge being examined is less than 0 or greater than 100, and we separate these using the XPath or logical operator.

To test the value of each stooge’s content without getting a stylesheet error, we used the XPath number() function, which converts the content (a string of characters) into a number if possible, or into the special value NaN (Not a Number) if not. If the content of one of our <stooge> elements is not within the range from 0 to 100, the report fires, targeted at the specific <stooge> whose numbers were fudged, because <stooge> is the context of our rule. If the content is not a number at all (e.g., an error like <stooge name="Moe">Moe</stooge>, which is unlikely because the Relax NG schema requires integers), number() returns NaN and both comparisons evaluate to false instead of raising an error, so validation continues, but this particular test does not report the bad value.

About NaN

NaN exists to cater to situations where you need to perform numerical comparisons and you can’t be certain that every value you evaluate can be converted to a number. We used this in real life in a situation where we had to sort some years numerically, and in the field where we were entering years, in some cases we had the string value unknown. Because number('unknown') evaluates to NaN, and NaN can be compared to a real number without throwing an error, we were able to sort by year without having to convert unknown explicitly into a fake numerical value. NaN has the unusual property of never being equal to itself, so number('moe') eq number('moe') evaluates to false!
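
Ordering comparisons against NaN are also false in both directions, so a test like number(.) lt 0 or number(.) gt 100 is false for non-numeric content. If you want the range check to flag such content as well, one option is to assert the positive condition, which is false for NaN. This is sketched here as an alternative to the <report> in our bonus solution, not as part of it, and the message wording is our own:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="stooge">
            <!-- For non-numeric content number(.) is NaN, both comparisons are false,
                 the assertion fails, and the message is reported -->
            <assert test="number(.) ge 0 and number(.) le 100">
                Vote percentages must be numbers between 0 and 100.
            </assert>
        </rule>
    </pattern>
</schema>
```

For integer content this behaves the same as the <report> version; the two differ only when number() returns NaN.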