Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-08-08T23:47:40+0000
Schematron is a constraint language that can be used for validating aspects of your XML that are difficult to express using a grammar-based schema language like Relax NG. In the example that will be described in detail below, I encoded the start and end pages of some text I was transcribing from a printed source. I used Relax NG to constrain those values to integers, and I used pattern facets to specify the allowable range of integers (I knew how long the printed source was, so I could specify that no page number could be higher than the last page number in the book), but Relax NG isn’t capable of catching situations where I might carelessly type an end-page number that is less than the associated start-page number. This is exactly the sort of constraint that is easy to express in Schematron, and for this project I augmented my Relax NG schema with a very small Schematron schema whose only function was to guard against this type of error.
It’s easy to think that we aren’t going to make errors of that sort. We do, of course, make all sorts of errors, but in this particular project I was working with a team of coders, and the more I could button down the validation in <oXygen/>, the more I could avoid errors that they might make because they had misunderstood the conventions. For example, in the rendered output of this project, page ranges are given in abbreviated form, so that a text that runs from pages 123 to 136 would be shown as “123–36” (the hundreds value is not repeated). What the coders have to enter, though, is the full number in both cases; my XSLT transformation is responsible for knowing when to truncate the hundreds value (or any other part of a number). A coder could easily enter just two digits for the end page, misunderstanding the difference between what gets entered in the source and what gets rendered in the output, and Schematron validation can catch and avert this type of error.
Schematron is typically used together with a schema language like Relax NG, and <oXygen/> can validate an XML document against both Relax NG and Schematron simultaneously. Like XSLT, Schematron is written in XML according to a particular schema (that is, a particular tag set). Because Schematron is used primarily to take care of the few details that cannot be handled in Relax NG, the rule sets in my projects tend to be fairly small.
The working part of a Schematron schema consists of
<assert>
and
<report>
elements, which rely on XPath to
define what precisely they are asserting or reporting. If you create a new
Schematron schema in <oXygen/>, it will create the following
superstructure:
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
xmlns:sqf="http://www.schematron-quickfix.com/validator/process">
</sch:schema>
To avoid having to type the sch:
namespace
prefix in front of all of our Schematron elements, we modify this by setting the
Schematron namespace as the default (line 3 below), and once we’ve done that we can
remove the prefix from the root element. If your document is in a specific
namespace, you’ll also need to account for this with a
<ns>
child of your root
<schema>
element (line 4 below). To put this
all together, the following root element would be used to create a schema to
constrain Bad Hamlet, which is in the TEI namespace:
<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
xmlns="http://purl.oclc.org/dsdl/schematron">
<ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
</schema>
When you work with TEI documents, all TEI elements (but not attributes) in
your XPath expressions within the schema will then need to be preceded by the
tei:
namespace prefix that you defined, so that
the system will know how to find them in Bad Hamlet. This means that
//body/div
would not apply to anything in this
schema, because the elements in that XPath expression are not namespaced. Since
elements in the document are all in the TEI namespace, what you want instead is
//tei:body/tei:div
. Note, though, that
attributes by default are not in a namespace even when the rest of the
document is. This means that, for example, a
@ref
attribute on a
<div>
element could be addressed as
//tei:body/tei:div/@ref
(and if you put a
namespace prefix in front of the attribute name, you wouldn’t find anything because
the attribute isn’t in a namespace).
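As a minimal sketch of the prefixing described above, the following schema binds the tei: prefix and then uses it on the element names only. The requirement that each top-level division carry a @ref attribute is purely illustrative and not something the TEI or Bad Hamlet actually demands; the pattern, rule, and assert wrappers are explained in the sections that follow.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- Bind the tei: prefix to the TEI namespace for use in XPath -->
    <ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <pattern>
        <!-- Element names carry the prefix -->
        <rule context="tei:body/tei:div">
            <!-- The attribute name does not, because attributes are in
                 no namespace by default (illustrative constraint only) -->
            <assert test="@ref">A div directly inside body should have a
                ref attribute</assert>
        </rule>
    </pattern>
</schema>
```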
For most purposes, think of the root <schema>
element as containing a set of <pattern>
elements. A <pattern>
element contains one
or more <rule>
elements, and every
<rule>
element has a
@context
attribute. The
<rule>
element contains, as child elements,
<assert>
and
<report>
elements, which have
@test
attributes. See below for explanations and
examples of the type of content that goes in these elements and attributes. For now,
though, you should try first to get a sense of the overall structure of a Schematron
schema. That structure looks roughly like the following (within a particular
<rule>
element, the
<assert>
element may be replaced by a
<report>
element):
<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
xmlns="http://purl.oclc.org/dsdl/schematron">
<pattern>
<rule context="">
<assert test=""> ... </assert>
</rule>
</pattern>
</schema>
For everything that you’d like to test in a Schematron schema, create a
<pattern>
element that contains a
<rule>
element. The value of the
@context
attribute on the
<rule>
element is the place in the XML
structure where the rule should kick in—that is, it’s the Schematron counterpart to
the @match
attribute in XSLT, which specifies
when a template should kick in. Here’s the part of one of my files that contains
start and end page numbers:
<text-pages-r>
<start>36</start>
<end>37</end>
</text-pages-r>
I created a Schematron rule that fires whenever an
<end>
element is encountered in the
document, so I specified “end” as the value of the
@context
attribute:
<pattern>
<rule context="end">
</rule>
</pattern>
The value of the @context
attribute is an XPath
pattern, that “XPath-like” structure that we also use for the value of the
@match
attribute on an
<xsl:template>
element in XSLT. As in the
XSLT situation, you can specify context (e.g., a value of
body/div
would apply to
<div>
elements only if they are the
immediate children of a <body>
element) and
you can use predicates (e.g., a value of
sp[@who eq 'Hamlet']
would match
<sp>
elements only when they have a
@who
attribute with the value “Hamlet”).
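A predicate in the context might be used along the following lines. This is only a sketch: it follows the un-namespaced form of the example above (a TEI document would need the tei: prefix on the element names), and the constraint that such a speech contain a speaker child is an assumption made up for illustration.

```xml
<pattern>
    <!-- Fires only for sp elements whose who attribute is exactly "Hamlet" -->
    <rule context="sp[@who eq 'Hamlet']">
        <!-- Illustrative constraint: the speech must have a speaker child -->
        <assert test="speaker">A speech attributed to Hamlet should
            contain a speaker element</assert>
    </rule>
</pattern>
```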
Warning: If a Schematron <pattern>
element
contains more than one <rule>
element with
@context
values that match the same node, only
the first one will fire. For example, if you have a
<rule>
that matches all children of
<div>
elements and another rule inside the same
<pattern>
that matches all
<p>
elements, both rules will appear to match
<p>
children of
<div>
elements, but the rule that occurs
second, after the other, will be ignored. To avoid writing rules that are ignored
you can use more precise @context
values, write
more detailed @test
patterns on your
<assert>
or
<report>
elements, or make the competing
<rule>
elements children of different
<pattern>
parents (the conflict matters only
when two <rule>
elements are children of one
and the same <pattern>
). Don’t, though, get in
the habit of putting every <rule>
inside a
separate <pattern>
.
<pattern>
is a useful way to group
<rule>
elements, which helps keep track of long
or complex sets of Schematron constraints.
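The situation described in the warning can be sketched as two alternative fragments. The constraints themselves (an @n attribute on children of a div, and a check for empty paragraphs) are assumptions chosen only to make the example concrete.

```xml
<!-- Problem: both rules match a p that is a child of a div, so within
     this single pattern only the first rule fires for that node -->
<pattern>
    <rule context="div/*">
        <assert test="@n">Children of a div should have an n attribute</assert>
    </rule>
    <rule context="p">
        <!-- Never evaluated for p elements that are children of a div -->
        <report test="not(normalize-space())">Empty paragraph</report>
    </rule>
</pattern>

<!-- One possible fix: move the competing rules into separate patterns,
     so that each one is evaluated independently -->
<pattern>
    <rule context="div/*">
        <assert test="@n">Children of a div should have an n attribute</assert>
    </rule>
</pattern>
<pattern>
    <rule context="p">
        <report test="not(normalize-space())">Empty paragraph</report>
    </rule>
</pattern>
```

In the second version a p child of a div is checked against both rules, because the one-rule-per-node limit applies within a pattern, not across patterns.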
<assert>
and
<report>
rules
The actual constraint checking in a Schematron schema is performed by
<assert>
and
<report>
elements, which are children of the
<rule>
element. For this project I wrote an
<assert>
rule that asserts that when an
<end>
element is encountered, its numerical
value must be greater than the numerical value of the immediately preceding
<start>
element:
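<assert test="number(.) &gt; number(preceding-sibling::start)">The
end page cannot be less than the start page</assert>
The number() calls are there because the schema declares queryBinding="xslt2": in XPath 2.0, a general comparison between two untyped element values compares them as strings, so without the conversion an end page of “36” would count as greater than a start page of “123”.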
The textual content of the <assert>
element
is the warning that will be shown to the coder if the condition that is being
asserted is violated, and you can write whatever you will find most informative. The
value of the @test
attribute is an XPath
expression that evaluates to True
or
False
. The test in this case says that the
value of the current context node (the specific
<end>
element being tested at the moment,
represented by the dot according to the standard XPath convention) must be greater
than the value of a preceding-sibling
<start>
element. My Relax NG schema is
already ensuring that the <text-pages-r>
container element contains exactly one
<start>
and one
<end>
element, in that order, and that those
two elements contain integers within a certain range. Since I know that the current
context <end>
element must have exactly one
preceding sibling <start>
element, I don’t
have to use a numerical predicate or otherwise worry about whether I’ll find what
I’m looking for. All I have to do is compare the values of what I already know that
I’ll find: two integers.
Note that the greater-than operator is written using an XML entity
(&gt;
). This ensures that it won’t be
mistaken for markup, and will be understood as a greater-than character when the
XPath is evaluated by the Schematron validation engine.
When I put this all together, I have:
<schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
xmlns:sqf="http://www.schematron-quickfix.com/validator/process"
xmlns="http://purl.oclc.org/dsdl/schematron">
<pattern>
<rule context="end">
<assert test="number(.) &gt; number(preceding-sibling::start)">The
end page cannot be less than the start page</assert>
</rule>
</pattern>
</schema>
And that’s my entire Schematron schema for this project! All the rest of my validation was handled by Relax NG.
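To see the rule at work, it may help to imagine it applied to a small test document such as the one below. The wrapper element name is hypothetical and exists only so that the fragment is well formed; the start and end values are invented for illustration.

```xml
<!-- Hypothetical wrapper element, used only to hold two test cases -->
<entries>
    <!-- Passes: the end page is greater than the start page -->
    <text-pages-r>
        <start>36</start>
        <end>37</end>
    </text-pages-r>
    <!-- Fails the assertion: the end page precedes the start page,
         so the message inside the assert element is shown -->
    <text-pages-r>
        <start>37</start>
        <end>36</end>
    </text-pages-r>
</entries>
```

The truncated-range slip described earlier (a start page of 123 with an end page entered as 36) should be flagged in the same way, provided the two values are compared as numbers and not as strings.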
The preceding Schematron schema uses an
<assert>
rule to assert that some condition
should be met, and it generates a user-specified message whenever the test described
in the @test
attribute on the
<assert>
element fails. While
<assert>
generates a message whenever a
condition fails to be met,
<report>
, which uses the same syntax as
<assert>
(a
@test
contains an XPath expression and the
textual content of the element contains a message to be generated or not according
to the result of the test), does the opposite: it generates a message whenever a
condition does occur. Almost anything you can test with
<assert>
you can also test with
<report>
by testing for the opposite, and
vice versa, so use whichever is clearest to you.
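As a rough sketch of that equivalence, the page-number check could be rewritten with a report element by inverting the test. Two details here are assumptions of this sketch and not part of the schema above: the values are wrapped in number() to force a numeric comparison, and the message is reworded to match the inverted test. Unlike the greater-than sign, the less-than sign must be escaped as an entity, because a raw less-than character is not allowed inside an XML attribute value.

```xml
<pattern>
    <rule context="end">
        <!-- report fires when the test is TRUE, so the condition is the
             opposite of the one used with assert -->
        <report test="number(.) &lt;= number(preceding-sibling::start)">The
            end page must be greater than the start page</report>
    </rule>
</pattern>
```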
You can name your Schematron file whatever you want, but by convention Schematron
files end in “.sch”, much as Relax NG compact-syntax files conventionally end in
“.rnc”, XSLT files in “.xsl”, and XML files in “.xml”. To associate a Schematron
schema with an XML document in <oXygen/>, use the same strategy as with a
Relax NG schema: while in your XML document, go to the menu bar, select
Document
, then Schema
, then Associate schema …
, and then
navigate to your Schematron schema (selecting the appropriate type from the
drop-down menu, if necessary). <oXygen/> will insert a line into the top of
the XML file that will look something like (my Schematron file in this project is
called “aa.sch”):
<?xml-model href="aa.sch" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
As with Relax NG, you don’t have to worry about this line; <oXygen/> takes care of it for you. Once you’ve associated the Schematron schema with the document, <oXygen/> will use it for real-time validation, alongside your Relax NG schema.
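With both schemas associated, the top of the XML file might look roughly like the following sketch. The Relax NG file name “aa.rnc” and the choice of root element are assumptions made for illustration; only the Schematron line is taken from the project described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Relax NG compact-syntax schema (hypothetical file name) -->
<?xml-model href="aa.rnc" type="application/relax-ng-compact-syntax"?>
<!-- Schematron schema, as inserted by the association described above -->
<?xml-model href="aa.sch" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<text-pages-r>
    <start>36</start>
    <end>37</end>
</text-pages-r>
```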
The main source for Schematron information is http://www.schematron.com/.