Schematron tutorial

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2022-08-08T23:47:40+0000

About schematron

Schematron is a constraint language that can be used for validating aspects of your XML that are difficult to express using a grammar-based schema language like Relax NG. In the example that will be described in detail below, I encoded the start and end pages of some text I was transcribing from a printed source. I used Relax NG to constrain those values to integers, and I used pattern facets to specify the allowable range of integers (I knew how long the printed source was, so I could specify that no page number could be higher than the last page number in the book), but Relax NG isn’t capable of catching situations where I might carelessly type an end-page number that is less than the associated start-page number. This is exactly the sort of constraint that is easy to express in Schematron, and for this project I augmented my Relax NG schema with a very small Schematron schema whose only function was to guard against this type of error.

It’s easy to think that we aren’t going to make errors of that sort. We do, of course, make all sorts of errors, but in this particular project I was working with a team of coders, and the more I could button down the validation in <oXygen/>, the more I could avoid errors that they might make because they had misunderstood the conventions. For example, in the rendered output of this project, page ranges are given in abbreviated form, so that a text that runs from pages 123 to 136 would be shown as “123–36” (the hundreds value is not repeated). What the coders have to enter, though, is the full number in both cases; my XSLT transformation is responsible for knowing when to truncate the hundreds value (or any other part of a number). A coder could easily enter just two digits for the end page, misunderstanding the difference between what gets entered in the source and what gets rendered in the output, and Schematron validation can catch and avert this type of error.

Schematron is typically used together with a schema language like Relax NG, and <oXygen/> can validate an XML document against both Relax NG and Schematron simultaneously. Like XSLT, Schematron is written in XML according to a particular schema (that is, a particular tag set). Because Schematron is used primarily to take care of the few details that cannot be handled in Relax NG, the rule sets in my projects tend to be fairly small.

The superstructure of a Schematron schema

The working part of a Schematron schema consists of <assert> and <report> elements, which rely on XPath to define what precisely they are asserting or reporting. If you create a new Schematron schema in <oXygen/>, it will create the following superstructure:

<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process">
    
</sch:schema>

To avoid having to type the sch: namespace prefix in front of all of our Schematron elements, we modify this by setting the Schematron namespace as the default (line 3 below), and once we’ve done that we can remove the prefix from the root element. If your document is in a specific namespace, you’ll also need to account for this with a <ns> child of your root <schema> element (line 4 below). To put this all together, the following root element would be used to create a schema to constrain Bad Hamlet, which is in the TEI namespace:

<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
        <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>

</schema>

When you work with TEI documents, all TEI elements (but not attributes) in your XPath expressions within the schema need to be preceded by the tei: namespace prefix that you defined, so that the system will know how to find them in Bad Hamlet. This means that //body/div would not apply to anything in this schema, because the elements in that XPath expression are not namespaced. Since elements in the document are all in the TEI namespace, what you want instead is //tei:body/tei:div. Note, though, that attributes by default are not in a namespace even when the rest of the document is. This means that, for example, a @ref attribute on a <div> element could be addressed as //tei:body/tei:div/@ref (and if you put a namespace prefix in front of the attribute name, you wouldn’t find anything because the attribute isn’t in a namespace).
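
As a minimal sketch of how the prefix is used in practice, the following schema assumes that every top-level <div> in a TEI document is supposed to carry a @type attribute. That requirement is invented here purely for illustration and is not taken from Bad Hamlet. Note the tei: prefix on each element in @context and the absence of a prefix on the attribute in @test:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- Bind the prefix "tei" to the TEI namespace for use in XPath -->
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <pattern>
        <!-- Elements are namespaced, so each step needs the tei: prefix -->
        <rule context="tei:body/tei:div">
            <!-- Attributes are in no namespace, so @type takes no prefix -->
            <assert test="@type">A top-level div must have a type attribute
                (hypothetical requirement, for illustration only)</assert>
        </rule>
    </pattern>
</schema>
```

If the @context value were written as body/div, the rule would never fire against a TEI document, and no error would be raised to alert you to the mismatch.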

For most purposes, think of the root <schema> element as containing a set of <pattern> elements. A <pattern> element contains one or more <rule> elements, and every <rule> element has a @context attribute. The <rule> element contains, as child elements, <assert> and <report> elements, which have @test attributes. See below for explanations and examples of the type of content that goes in these elements and attributes. For now, though, you should try first to get a sense of the overall structure of a Schematron schema. That structure looks roughly like the following (within a particular <rule> element, the <assert> element may be replaced by a <report> element):

<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="">
            <assert test=""> ... </assert>
        </rule>
    </pattern>
</schema>

Schematron patterns and rules

For everything that you’d like to test in a Schematron schema, create a <pattern> element that contains a <rule> element. The value of the @context attribute on the <rule> element is the place in the XML structure where the rule should kick in—that is, it’s the Schematron counterpart to the @match attribute in XSLT, which specifies when a template should kick in. Here’s the part of one of my files that contains start and end page numbers:

<text-pages-r>
    <start>36</start>
    <end>37</end>
</text-pages-r>

I created a Schematron rule that fires whenever an <end> element is encountered in the document, so I specified “end” as the value of the @context attribute:

<pattern>
    <rule context="end">
 
    </rule>
</pattern>

The value of the @context attribute is an XPath pattern, that “XPath-like” structure that we also use for the value of the @match attribute on an <xsl:template> element in XSLT. As in the XSLT situation, you can specify context (e.g., a value of body/div would apply to <div> elements only if they are the immediate children of a <body> element) and you can use predicates (e.g., a value of sp[@who eq 'Hamlet'] would match <sp> elements only when they have a @who attribute with the value “Hamlet”).
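
The two @context values mentioned above might be used as in the following sketch. The assertions themselves (that a top-level <div> contains a <head>, and that Hamlet's speeches contain a <speaker>) are hypothetical and chosen only to give the rules something to test. The sketch also assumes a document whose elements are not in a namespace; for a TEI document each element name would need the tei: prefix.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <!-- Matches <div> only when it is a child of <body> -->
        <rule context="body/div">
            <assert test="head">A top-level div must contain a head</assert>
        </rule>
    </pattern>
    <pattern>
        <!-- The predicate narrows the match to speeches by Hamlet -->
        <rule context="sp[@who eq 'Hamlet']">
            <assert test="speaker">A speech by Hamlet must contain
                a speaker element</assert>
        </rule>
    </pattern>
</schema>
```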

Warning: If a Schematron <pattern> element contains more than one <rule> element with @context values that match the same node, only the first one will fire. For example, if you have a <rule> that matches all children of <div> elements and another rule inside the same <pattern> that matches all <p> elements, both rules will appear to match <p> children of <div> elements, but the rule that occurs second, after the other, will be ignored. To avoid writing rules that are ignored you can use more precise @context values, write more detailed @test patterns on your <assert> or <report> elements, or make the competing <rule> elements children of different <pattern> parents (the conflict matters only when two <rule> elements are children of one and the same <pattern>). Don’t, though, get in the habit of putting every <rule> inside a separate <pattern>. <pattern> is a useful way to group <rule> elements, which helps keep track of long or complex sets of Schematron constraints.
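
The warning above can be illustrated with the following fragments, which would sit inside the root <schema> element. The element names follow the example in the warning; the tests themselves (an @n attribute on children of <div>, and no empty paragraphs) are assumptions made up for the illustration.

```xml
<!-- Problem: both rules share one pattern. A <p> that is a child of a
     <div> matches the first rule, so the second rule is skipped for it
     and an empty paragraph there goes unreported. -->
<pattern>
    <rule context="div/*">
        <assert test="@n">Every child of a div must have an n attribute</assert>
    </rule>
    <rule context="p">
        <report test="not(normalize-space())">A paragraph must not be empty</report>
    </rule>
</pattern>

<!-- One possible fix: move the competing rule into its own pattern, so
     that both rules are evaluated for a <p> child of a <div>. -->
<pattern>
    <rule context="div/*">
        <assert test="@n">Every child of a div must have an n attribute</assert>
    </rule>
</pattern>
<pattern>
    <rule context="p">
        <report test="not(normalize-space())">A paragraph must not be empty</report>
    </rule>
</pattern>
```

In the first version the second rule still fires for <p> elements that are not children of a <div>, which is part of what makes this kind of problem easy to overlook.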

<assert> and <report> rules

The actual constraint checking in a Schematron schema is performed by <assert> and <report> elements, which are children of the <rule> element. For this project I wrote an <assert> rule that asserts that when an <end> element is encountered, its numerical value must be greater than the numerical value of the immediately preceding <start> element:

<assert test="number(.) &gt; number(preceding-sibling::start)">The
    end page cannot be less than the start page</assert>

The textual content of the <assert> element is the warning that will be shown to the coder if the condition that is being asserted is violated, and you can write whatever you will find most informative. The value of the @test attribute is an XPath expression that evaluates to True or False. The test in this case says that the value of the current context node (the specific <end> element being tested at the moment, represented by the dot according to the standard XPath convention) must be greater than the value of a preceding-sibling <start> element. My Relax NG schema is already ensuring that the <text-pages-r> container element contains exactly one <start> and one <end> element, in that order, and that those two elements contain integers within a certain range. Since I know that the current context <end> element must have exactly one preceding sibling <start> element, I don’t have to use a numerical predicate or otherwise worry about whether I’ll find what I’m looking for. All I have to do is compare the values of what I already know that I’ll find: two integers. The one precaution is the number() function around each value. Relax NG validation does not pass type information to the Schematron engine, so under the xslt2 query binding the content of both elements is untyped, and XPath 2.0 compares two untyped values as strings. In a string comparison “36” counts as greater than “123”, which is exactly the error I want to catch; converting both sides with number() forces a numeric comparison.

Note that the greater-than operator is written using an XML entity (&gt;). This ensures that it won’t be mistaken for markup, and will be understood as a greater-than character when the XPath is evaluated by the Schematron validation engine. (Strictly speaking, only the less-than character has to be escaped inside an attribute value, but escaping the greater-than character as well is harmless.)

When I put this all together, I have:

<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
    xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="end">
            <assert test="number(.) &gt; number(preceding-sibling::start)">The
                end page cannot be less than the start page</assert>
        </rule>
    </pattern>
</schema>

And that’s my entire Schematron schema for this project! All the rest of my validation was handled by Relax NG.

The preceding Schematron schema uses an <assert> rule to assert that some condition should be met, and it generates a user-specified message whenever the test described in the @test attribute on the <assert> element fails. While <assert> generates a message whenever a condition fails to be met, <report>, which uses the same syntax as <assert> (a @test contains an XPath expression and the textual content of the element contains a message to be generated or not according to the result of the test), does the opposite: it generates a message whenever a condition does occur. Almost anything you can test with <assert> you can also test with <report> by testing for the opposite, and vice versa, so use whichever is clearest to you.
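
As a sketch of that equivalence, the page-number check could be rewritten with <report> by wrapping the test in not(), so that the message is generated when the unwanted condition occurs. The number() calls are a precaution added for this illustration: they force a numeric comparison, because under the xslt2 query binding two untyped values would otherwise be compared as strings.

```xml
<rule context="end">
    <!-- Fires when the end page is NOT greater than the start page -->
    <report test="not(number(.) &gt; number(preceding-sibling::start))">The
        end page cannot be less than the start page</report>
</rule>
```

Written this way, the test makes it easy to see that a text whose start and end pages are equal is also flagged, since the condition requires the end page to be strictly greater.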

Attaching a Schematron schema to an XML file in <oXygen/>

You can name your Schematron file whatever you want, but by convention Schematron files end in “.sch”, much as Relax NG compact-syntax files conventionally end in “.rnc”, XSLT files in “.xsl”, and XML files in “.xml”. To associate a Schematron schema with an XML document in <oXygen/>, use the same strategy as with a Relax NG schema: while in your XML document, go to the menu bar, select Document, then Schema, then Associate schema …, and then navigate to your Schematron schema (selecting the appropriate type from the drop-down menu, if necessary). <oXygen/> will insert a line into the top of the XML file that will look something like (my Schematron file in this project is called “aa.sch”):

<?xml-model href="aa.sch" type="application/xml" 
    schematypens="http://purl.oclc.org/dsdl/schematron"?>

As with Relax NG, you don’t have to worry about this line; <oXygen/> takes care of it for you. Once you’ve associated the Schematron schema with the document, <oXygen/> will use it for real-time validation, alongside your Relax NG schema.
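
For orientation, the top of a document associated with both kinds of schema might look roughly like the following sketch. The Relax NG file name “aa.rnc” is an assumption made for this illustration (only “aa.sch” is named above), and the root element of a real document would normally be something larger than the page-range fragment shown here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Relax NG compact-syntax schema (hypothetical file name) -->
<?xml-model href="aa.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Schematron schema, as inserted by the association step above -->
<?xml-model href="aa.sch" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<text-pages-r>
    <start>36</start>
    <end>37</end>
</text-pages-r>
```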
