Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-12T16:22:12+0000


Schematron assignment #1: answers

The text

In a three-way election for Best Stooge Ever, each candidate (Curly, Larry, Moe) wins between 0% and 100% of the votes. Assume that all votes are cast for one of the three candidates (no abstentions, write-ins, invalid ballots, etc.), which means that when you add the percentages for the three candidates, the result must be exactly 100%. Assume also that we’re recording percentage of the vote, not raw votes, and that the percentages are all integer values. (In Real Life we’d probably record the raw count and calculate the percentages, but in real life we wouldn’t be voting for Best Stooge Ever in the first place!) Here’s a Relax NG schema for the results of the election:

start = results
results = element results { election+ }
election = element election { year, stooge+ }
year = attribute year { xsd:gYear }
stooge = element stooge { name, xsd:int }
name = attribute name { "Curly" | "Larry" | "Moe" }

Here’s a sample XML document that is valid against the preceding schema:


  
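<results>
  <election year="…">
    <stooge name="Curly">50</stooge>
    <stooge name="Larry">35</stooge>
    <stooge name="Moe">15</stooge>
  </election>
  <election year="…">
    <stooge name="Curly">53</stooge>
    <stooge name="Larry">33</stooge>
    <stooge name="Moe">14</stooge>
  </election>
</results>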

We could have written a better Relax NG schema, but we didn’t, and although our sloppy schema works with the results above, it also allows erroneous results like the following:

<results>
    <election year="…">
        <stooge name="Curly">55</stooge>
        <stooge name="Larry">38</stooge>
        <stooge name="Moe">11</stooge>
    </election>
</results>

The task

The problem here is that the three percentage values total 104%, and no matter how good our coding, it is not possible to prevent this type of error by using Relax NG alone. Your assignment is to write a Schematron schema that verifies that the three percentages always total exactly 100%. Test your results by creating the Relax NG schema, your Schematron schema, and a sample XML document that you can validate against both schemas in <oXygen/>. Enter correct and incorrect values and verify that the Schematron schema is working correctly. For homework, upload only your Schematron schema.
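
One way to set up such a test is to associate both schemas with the sample document through xml-model processing instructions, which <oXygen/> honors during validation. The following is only a sketch: the file names stooges.rnc and stooges.sch and the two years are assumptions made for this example, not part of the assignment. The first election totals 100% and the second totals 104%, so only the second should raise a Schematron error.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file names: replace them with the names of your own schema files -->
<?xml-model href="stooges.rnc" type="application/relax-ng-compact-syntax"?>
<?xml-model href="stooges.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<results>
    <!-- The years are invented for illustration -->
    <election year="2001">
        <!-- 50 + 35 + 15 = 100: should pass -->
        <stooge name="Curly">50</stooge>
        <stooge name="Larry">35</stooge>
        <stooge name="Moe">15</stooge>
    </election>
    <election year="2002">
        <!-- 55 + 38 + 11 = 104: should fail the Schematron test -->
        <stooge name="Curly">55</stooge>
        <stooge name="Larry">38</stooge>
        <stooge name="Moe">11</stooge>
    </election>
</results>
```

Both elections are valid against the Relax NG schema, which is the point of the exercise: only Schematron can detect the bad total.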

Our solution

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="election">
            <assert test="sum(stooge) eq 100">
                The sum of the vote percentages does not equal 100%.
            </assert>
        </rule>
    </pattern>
</schema>

A digression about namespaces

We’ve set the Schematron namespace as the default namespace with xmlns="http://purl.oclc.org/dsdl/schematron". There is no namespace prefix in this declaration: when we set the value of the @xmlns attribute, we are declaring a default namespace, which applies to the element on which the declaration occurs (the root <schema> element) and to all of its descendants. We could, alternatively, have bound the Schematron namespace to the prefix sch:, which is what <oXygen/> does by default. In that case our root element might have looked like:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process">

This version binds the prefix sch: to the Schematron namespace, which means that all elements that begin with this prefix are in that namespace. In this case no default namespace is declared, so every Schematron instruction will have to be preceded by the sch: namespace prefix. It is possible to do both, that is, to bind a prefix to a namespace and to declare that namespace as the default, although we don’t find that very useful, since either method alone will do the job.
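
As an illustration, the solution above could be written with the sch: prefix as follows. This is a sketch of the same schema, not a different one. The prefix goes on Schematron elements only. The values of @context and @test are XPath, and they still refer to the unprefixed <election> and <stooge> elements of the document being validated.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="election">
            <!-- Same test as in the unprefixed solution; only the Schematron element names change -->
            <sch:assert test="sum(stooge) eq 100">
                The sum of the vote percentages does not equal 100%.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```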

These two ways of ensuring that Schematron instruction elements are in the Schematron namespace are equivalent, so you can use either one. They do have different implications that may matter in other Schematron applications, though, and we’ll discuss those when they come up. In our sample solution above we removed the declaration of the Schematron Quick Fix namespace, which <oXygen/> binds to the prefix sqf:, because we aren’t using it. We removed it just to simplify the display, but if you prefer to leave it in and ignore it, that does no harm, and it will be available if you later decide to use it. You can learn about SQF, which we have found useful in real projects, in a demo video from the <oXygen/> team.

The Schematron file that we wrote uses only one <rule> inside one <pattern>, and we defined the value of the @context attribute of our <rule> element (equivalent to the @match attribute in <xsl:template> elements in XSLT) as election, which is an XPath pattern (not a full XPath path expression). Any <election> element in our document will be submitted to any tests we define inside this <rule>. The <assert> element inside this <rule> uses the XPath sum() function to total the values of all <stooge> elements located on the child axis of our current context, a single <election> element, and compare that value to 100. It asserts that the sum will equal 100, and therefore raises an error (using the error message that we wrote as the content of the <assert> element) if it doesn’t.

Inside the <assert>, we write the error message that Schematron will report when the test fails, that is, when the XML document breaks the rule. We wrote “The sum of the vote percentages does not equal 100%.”, but you could write anything that you feel would be informative to someone trying to correct the error.

About XPath path expressions and XPath patterns

XPath path expressions in our Schematron

The value of the @context attribute on a Schematron <rule> element is an XPath pattern. Like the value of the @match attribute on an <xsl:template> element, which is also an XPath pattern, the value of @context should be just enough XPath to match the node where we want our Schematron rules to be applied. We don’t need (= should not write) a full XPath expression because we don’t have to navigate to the location; we just have to describe how to match it. This means, among other things, that it is always a mistake to begin the value of a @context attribute with a double slash. A leading double slash won’t prevent your code from working, but it’s nonetheless a mistake because it makes it harder to read and harder to understand.
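
A short sketch of the difference, using the rule from our solution (the second pattern, election/stooge, is added here only to illustrate a pattern with more than one step):

```xml
<!-- Preferred: the pattern says only what to match -->
<rule context="election">
    <assert test="sum(stooge) eq 100">
        The sum of the vote percentages does not equal 100%.
    </assert>
</rule>

<!-- Matches the same nodes, but the leading double slash adds nothing: avoid it -->
<!-- <rule context="//election"> -->

<!-- A longer pattern is appropriate only when it narrows the match,
     e.g. context="election/stooge" matches <stooge> elements whose parent is <election> -->
```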

Our rule fires once for each <election> element in the document. There are different elections in different years, each with its own <election> element, and they are all inside a single <results> root element. The XPath pattern that we specify as the value of the @context attribute ensures that the rules fire separately for each election, which is what we want, since if there is an error in the values for one election, we want the validation to tell us which election is the source of the error.

The XPath expressions used in the asserts and reports are relative to the current context, so when we ask for the sum of <stooge> elements, we mean the sum of <stooge> element children (because the child axis is the default XPath axis) of the <election> element being processed at the moment. A common mistake is to write sum(//stooge) instead of sum(stooge). The reason this is a mistake is that if you have multiple elections you’ll be summing all of the <stooge> values in the entire document, and not just in an individual <election> element. If you want to sum the <stooge> values that are children of a specific <election> element, you want to use the child axis to restrict yourself to only those <stooge> elements.
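
The difference can be worked through with the two sample elections shown earlier, whose values are 50, 35, 15 and 53, 33, 14. The sketch below keeps the correct test and describes the mistaken one in a comment:

```xml
<rule context="election">
    <!-- Correct: sums only the <stooge> children of the current <election>.
         First election:  50 + 35 + 15 = 100
         Second election: 53 + 33 + 14 = 100 -->
    <assert test="sum(stooge) eq 100">
        The sum of the vote percentages does not equal 100%.
    </assert>
</rule>

<!-- Mistaken alternative: test="sum(//stooge) eq 100"
     The double slash starts from the document node, so every <stooge> in the document is summed:
     (50 + 35 + 15) + (53 + 33 + 14) = 200
     The test would therefore fail for both elections, even though each one totals 100 on its own. -->
```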

How to read XPath path expressions and XPath patterns

We find it most helpful to read XPath path expressions from the left, path step by path step, because each step specifies the current context(s) for the next step. An XPath expression like //body/div, then, means start at the document node, find all <body> elements on its descendant axis, and then, for each <body> element, find all <div> elements on its child axis.

We find it most helpful to read XPath patterns from the right. For example, an XPath pattern like body/div means find all <div> elements that are children of <body> elements. Reading from the right helps us avoid thinking that we have to navigate to the leftmost component of the pattern, and we don’t have to do that because XPath patterns match, but they don’t traverse.

When to use XPath expressions and when to use XPath patterns

Where we use XPath expressions and where we use XPath patterns is specified by the languages that use XPath, and is not up to us. In Schematron, the value of the @context attribute is defined as an XPath pattern and the value of the @test attribute is defined as an XPath expression, for which the current context is the node that the @context attribute matched. If @context matches multiple nodes (for example, if there are multiple <election> elements in the document, as is the case here), the rule fires once for each of them, so only one of them will be the current context at a given moment in the validation process.

The bonus tasks

You can stop here and consider the assignment complete, but for more Schematron practice, you’re welcome to add additional rules to check for additional types of error. The following types of errors could have been controlled by writing a better Relax NG schema, but for the purpose of learning Schematron, let’s do it in Schematron:

  1. There should be exactly three votes, with exactly one for each Stooge. No duplicate Stooges and no missing Stooges.
  2. Each individual Stooge’s vote should range from 0 to 100. No negative integers and no integers greater than 100. (The Relax NG schema is ensuring that all values are integers, so you don’t have to worry about that.)

Our bonus solution

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="election">
            <assert test="sum(stooge) eq 100">
                The sum of the vote percentages should equal 100%.
            </assert>
            <assert test="count(stooge) eq 3">
                There should be exactly 3 stooges.
            </assert>
            <assert test="count(stooge) eq count(distinct-values(stooge/@name))">
                No two stooges should have the same name.
            </assert>
        </rule>
        <rule context="stooge">
            <report test="number(.) lt 0 or number(.) gt 100">
                Vote percentages must be between 0 and 100.
            </report>
        </rule>
    </pattern>
</schema>

To specify that there should be three stooges, we added a second <assert> within the same rule, since the context is the same—that is, we want this new assertion to fire once for each <election> element. This time we use the count() function to count stooges on the child axis, and compare that value to 3. To test that no stooges are repeated, we take advantage of the @name attribute and compare the count of all stooges against the count of the distinct values of stooge names. If all of the names are distinct from one another, the count of stooges will be equal to the count of distinct stooge names.
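
To see why the third assertion is needed in addition to the first two, consider the following hypothetical election (the year and the values are invented for illustration). It is valid against the Relax NG schema, and it passes both the sum test and the count test, but one Stooge appears twice and another is missing.

```xml
<!-- Hypothetical data for illustration only -->
<election year="2003">
    <stooge name="Curly">50</stooge>
    <stooge name="Curly">35</stooge>
    <stooge name="Moe">15</stooge>
</election>
<!-- sum(stooge) is 100, so the first assertion passes
     count(stooge) is 3, so the second assertion passes
     count(distinct-values(stooge/@name)) is 2 (Curly, Moe), so the third assertion fails -->
```

Because the Relax NG schema allows only three names, three <stooge> elements with three distinct names means that every Stooge is present exactly once.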

Finally, we want to test that for any given stooge, the vote percentage is within the range from 0 to 100. Since this is something that applies separately to each individual <stooge>, and not the <election> element as a whole, we created a new <rule> where the value of the @context attribute is now stooge. This means that it will fire once for each <stooge> element, and that it will check the value for that individual stooge. Inside that rule, we used a <report>, which outputs its message when the test inside it is true (because it is reporting that the real situation matches what the test requires), as opposed to <assert>, which triggers when false (because it is informing the developer that something asserted has failed to be satisfied). The test here is whether the percentage of votes for the stooge being examined is less than 0 or greater than 100, and we separate these using the XPath or logical operator.
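
The same check could also be phrased as an <assert> by stating the condition we want to hold instead of the condition we want to catch. The following sketch should behave the same way as our <report>, on the assumption (guaranteed here by the Relax NG schema) that the content of every <stooge> is an integer:

```xml
<rule context="stooge">
    <!-- An assert fires when its test is false, so we assert that the value is in range -->
    <assert test="number(.) ge 0 and number(.) le 100">
        Vote percentages must be between 0 and 100.
    </assert>
</rule>
```

Which of the two to use is largely a matter of which reads more naturally for the condition at hand.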

eq, =, and number()

To test the value of each stooge’s content we used the XPath number() function, which converts the content of a <stooge> element (a string of characters) into a number. We have to do this because value comparison (using operators such as eq, lt, and gt) requires not only that there be exactly one item on each side of the comparison operator, but also that the two items be of the same datatype. It looks to a human as if the stooge votes in our XML are numbers, and therefore comparable to numbers in the XPath expression inside the @test attribute, but they could just as easily be understood as strings of characters that happen to be digits. Since XPath cannot know whether they represent a number or a string in the XML, if we try to compare one of those values with a number in our XPath, we’ll raise an error about unmatched datatypes: Cannot compare xs:untypedAtomic to xs:integer. xs:untypedAtomic means that our Schematron knows that the value inside a <stooge> element is an atomic value, but it cannot know whether it is a string or a number or any other specific type of atomic value. Using the number() function inside our @test to cast (the technical term for convert) the value to a number lets our XPath comparison proceed. (We did not need number() in sum(stooge) eq 100 because the sum() function converts untyped values to numbers itself and returns a number.)

Value comparison (like eq) requires that the datatypes on both sides of the comparison operator be the same, but general comparison (like =) does not. General comparison will automatically treat the value in the XML as a number if we are comparing it to a number, so we don’t have to cast it explicitly. We might think that it would be better to use general comparison so that we don’t have to fuss with the datatype ourselves, but because value comparison is stricter, it provides more protection against coding errors, and our goal is not to reduce error messages, but to reduce errors. Using general comparison here is not a mistake, but value comparison is the better choice when comparing one thing to one thing. We use general comparison primarily when one of the comparands must be a sequence.
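
For comparison, here is a sketch of the range check written with general comparison operators. It is an alternative to our solution, not a replacement for it:

```xml
<rule context="stooge">
    <!-- General comparison: the untyped content of <stooge> is converted to a number
         automatically because it is being compared to a number, so number() is not needed.
         Inside an attribute value the character < must be written as &lt; and we escape > as &gt; for symmetry. -->
    <report test=". &lt; 0 or . &gt; 100">
        Vote percentages must be between 0 and 100.
    </report>
</rule>
```

The need to escape the less-than sign is a small additional argument for the value comparison operators lt and gt, which can be written in an attribute value without escaping.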