Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-11-05T17:02:14+0000


Test #5: Schematron

The tasks

We damaged some of the markup in the Bad Hamlet file that we’ve been using in class, and your task for this test is to write Schematron that will report the damage. You should create a Schematron schema that, when used to validate http://dh.obdurodon.org/bad-hamlet.xml, will address the following issues. If you don’t know what XPath functions to use, or how to use them, you can do what we do and look them up in the references linked under the XPath section of our main course page. If you have time, we encourage you to try the bonus questions, which, like the required questions, are similar to tasks we might implement in Schematron in Real Life.

If your Schematron does not report any errors in the XML, that doesn’t mean that there aren’t any, since the Schematron may not be doing what you think it should be doing. For that reason, you’ll want to edit the XML to ensure that your Schematron raises an error report when there is an error and does not raise an error report when there is not an error.

Before you start: Schematron traps to watch out for

Namespaces

Schematron is namespace-aware and Bad Hamlet is in the TEI namespace. This means that a <sch:rule> element like:

<sch:rule context="l">

will not match any <l> elements in the document because it will look only for <l> elements in no namespace, and the <l> elements in this document are in the TEI namespace. Similarly, any element name inside a @test attribute on an <sch:assert> or <sch:report> must also be specified as being in the TEI namespace.

Confusingly, although elements inherit a default namespace from their ancestors (which is why all elements in our TEI document are in the TEI namespace after we declare it, just once, as the default on the root <TEI> element), attributes are in no namespace unless they have an explicit prefix. This means that, for example, the @n attributes in our TEI document are in no namespace, but @xml:id attributes are in the XML namespace.
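A small fragment may make the distinction concrete. This is a sketch only: the attribute values below are invented for illustration and are not taken from Bad Hamlet. The comments note which namespace each name is in.

```xml
<!-- Hypothetical fragment; the @n and @xml:id values are made up -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <!-- <TEI>, <text>, <body>, and <l> are all in the TEI namespace:
         the default namespace declared on the root applies to its descendants -->
    <text>
        <body>
            <!-- @n has no prefix, so it is in no namespace;
                 @xml:id has the xml: prefix, so it is in the XML namespace -->
            <l n="1" xml:id="example-line-1">Who's there?</l>
        </body>
    </text>
</TEI>
```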

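So how do we write Schematron for documents in a namespace? The XML namespace is automatic, so we don’t have to declare or define it; we can just refer to @xml:id attributes and our Schematron will understand their namespace. But we do need to declare the TEI namespace and specify it on TEI elements (but not on any attributes). There are a few ways to manage namespaces in Schematron, and we’re going to recommend here one that is verbose, but also explicit: declare a namespace prefix for the TEI and use it whenever you refer to a TEI element. Here’s how to do that (you can copy this code block, erase our rules, and use the framework for your own Schematron):

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
    queryBinding="xslt2">
    <sch:ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <sch:pattern>
        <sch:rule context="tei:sp">
            <sch:assert test="parent::tei:div">The parent of this speech is not a div, and it should</sch:assert>
        </sch:rule>
        <sch:rule context="tei:l">
            <sch:assert test="@n">No @n attribute</sch:assert>
            <sch:assert test="@xml:id">No @xml:id attribute</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>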

We declare the TEI namespace in line 4, bind it to the prefix tei: (end of line 4), and use the prefix whenever we refer to an element (in this case in the @context and @test attribute values). Where we need to specify an axis, it goes before the namespace prefix (line 7). Attributes are not in a default namespace, so there is no namespace prefix on line 10, but we do have to use a namespace prefix when attributes are explicitly in a namespace in the XML (line 11). We have to declare the TEI prefix (line 4), but the XML prefix is automatically available in any XML environment and does not have to be declared.

Competing <rule> elements

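One easily overlooked feature of Schematron (it’s in the documentation, but without any special emphasis or warning) is that if multiple <sch:rule> elements are children of the same <sch:pattern> element and match the same node, only the first of those rules is applied to that node. For example, if your Schematron says something incorrect like the following:

<sch:pattern>
    <!-- Both rules match <l>; within this pattern only the first is applied to it -->
    <sch:rule context="tei:l">
        <!-- tests that apply to lines -->
    </sch:rule>
    <sch:rule context="tei:l | tei:ab">
        <!-- tests that apply to lines and anonymous blocks -->
    </sch:rule>
</sch:pattern>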

both <sch:rule> elements match <l> elements. This means that the second one will fire on <ab> elements (which it also matches), but it won’t see <l> elements because there is an earlier <sch:rule> element in the same <sch:pattern> parent that also matches <l> elements.

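There are two ways to deal with this: put the competing rules in separate <sch:pattern> elements, or keep them in one pattern and write @context values that do not overlap, for example by combining the tests for a shared element into a single rule.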

Many is the time we’ve struggled to fix Schematron tests that were not finding bad XML, only to realize that the tests were correct, but they were being masked by preceding-sibling <sch:rule> elements with the same or an overlapping @context value.
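One way to avoid the masking without leaving a single pattern is to let one rule carry both the shared tests and the tests that apply to only some of the matched elements. The following is a minimal sketch, not a solution to any task: the assertions are placeholders, the tei: prefix is assumed to be declared with <sch:ns> as described above, and the guard not(self::tei:l) is just one possible way to limit a test to lines.

```xml
<sch:pattern>
    <!-- One rule for both element types, so no rule masks another -->
    <sch:rule context="tei:l | tei:ab">
        <!-- Shared test: applies to <l> and <ab> alike -->
        <sch:assert test="normalize-space(.) != ''">This element should not be empty</sch:assert>
        <!-- Line-only test: always passes on <ab>, checks @n on <l> -->
        <sch:assert test="not(self::tei:l) or @n">A line should have an @n attribute</sch:assert>
    </sch:rule>
</sch:pattern>
```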

For this test we suggest answering each question in a separate <sch:pattern> element. Not only will this help protect you from identical or overlapping @context values, but it will also make it easier for you to concentrate on one task at a time. We recommend commenting out all of the <sch:pattern> elements except the one you’re working on at the moment (Schematron is XML, which means that you can use XML comment delimiters), so that you aren’t distracted by error messages about other issues, and then uncomment them all at the end and verify that they all work as expected.
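A skeleton along these lines might look like the following. It is a sketch under assumptions: the @id values and the true() placeholder tests are invented for illustration and should be replaced with your own.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:ns uri="http://www.tei-c.org/ns/1.0" prefix="tei"/>
    <!-- Task 1: finished and temporarily commented out
    <sch:pattern id="task1">
        <sch:rule context="tei:sp">
            <sch:assert test="true()">Replace with your Task 1 test</sch:assert>
        </sch:rule>
    </sch:pattern>
    -->
    <!-- Task 2: the pattern currently being developed -->
    <sch:pattern id="task2">
        <sch:rule context="tei:l | tei:ab | tei:stage">
            <sch:assert test="true()">Replace with your Task 2 test</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

XML comments cannot be nested, so a pattern that contains comments of its own has to lose them (or have them moved outside) before it can be commented out as a whole.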

Required task 1: Missing @who attributes

All speeches (<sp> elements) should have both <speaker> children and @who attributes, and your Schematron should alert you about any that are missing one or the other of these features. As it happens, all of them do contain <speaker> children, but two of them are missing @who attributes, and since in Real Life you wouldn’t know that in advance, your Schematron needs to check both conditions.

Required task 2: Spurious leading and trailing spaces

Back when you first created XML, we emphasized the importance of being consistent in the use of whitespace at the beginnings and ends of elements. It’s easy for a human not to notice, for example, that in the following speech:

<sp who="…">
    <speaker>Horatio</speaker>
    <l>What art thou that usurp'st this time of
        night,</l>
    <l>Together with that fair and warlike
        form</l>
    <l>In which the majesty of buried Denmark</l>
    <l> Did sometimes march? by heaven I charge
        thee, speak! </l>
</sp>

the first three lines correctly have no leading or trailing space characters, but the last one incorrectly has both. Your Schematron should notify you when <l>, <ab>, or <stage> elements begin or end with whitespace.

Required task 3: Character @xml:id values

The @xml:id values associated with the <role> elements in the cast list observe the following orthographic conventions:

Write Schematron that will verify that all @xml:id values associated with <role> elements conform to one of those three patterns. You do not have to check whether the characters do or do not speak, and you do not have to verify that the characters with numbers at the end of their names are in the correct ordinal position. You just need to verify that all @xml:id values on <role> elements match one or another of these patterns.

We used the XPath matches() function in our solution (also for some of the bonus tasks), and you may want to read up on that in Michael Kay before attempting this task.
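One detail of matches() that is easy to overlook: it returns true if the pattern matches anywhere in the string, so a pattern meant to describe an entire value needs the ^ and $ anchors. The rule below is an unrelated illustration rather than a solution. It assumes, hypothetically, that @n values on <l> elements should consist only of digits, and that the tei: prefix has been declared with <sch:ns>.

```xml
<sch:pattern>
    <sch:rule context="tei:l[@n]">
        <!-- Without ^ and $, a value such as "12a" would pass,
             because "12" matches \d+ somewhere inside it -->
        <sch:assert test="matches(@n, '^\d+$')">The @n value should contain only digits</sch:assert>
    </sch:rule>
</sch:pattern>
```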

Optional bonus task 1

Enhance the Schematron you wrote for Required task 3 to verify that the number at the end of each @xml:id value for the third type of character mentioned above matches the character’s position in the sequence of 37 <role> elements. For example, your Schematron should verify that the <role> element with the @xml:id value sha-ham-role23 is, in fact, the 23rd of the 37 <role> elements.

Optional bonus task 2

Enhance the Schematron you wrote for Required task 3 to verify that all characters in the first two groups speak and that all characters in the third group do not speak. When a character speaks, that character’s @xml:id value appears as the value of a @who attribute on an <sp> element. For example, in:

<sp who="…">
    <speaker>Horatio</speaker>
    <l>Hail to your lordship!</l>
</sp>

the value of the @who attribute matches one of the @xml:id attributes on a <role>. Note that when characters speak in unison, their @xml:id values are combined, as whitespace-separated words, inside a single @who value, e.g.:

<sp who="… …">
    <speaker>Rosencrantz and Guildenstern</speaker>
    <l>We'll wait upon you.</l>
</sp>
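
Splitting a @who value into its individual words can be done with the XPath tokenize() function. The sketch below only illustrates that function and is not a solution to a bonus task: it reports how many identifiers a @who attribute contains whenever there is more than one. The variable name speakers is our own choice, and the tei: prefix is assumed to be declared with <sch:ns>.

```xml
<sch:pattern>
    <sch:rule context="tei:sp[@who]">
        <!-- normalize-space() trims the value first, so tokenize() returns no empty strings -->
        <sch:let name="speakers" value="tokenize(normalize-space(@who), '\s+')"/>
        <sch:report test="count($speakers) gt 1">This speech is shared by <sch:value-of select="count($speakers)"/> speakers</sch:report>
    </sch:rule>
</sch:pattern>
```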

Optional bonus task 3

Verify that there is no speaker (word included in a @who attribute, but see above about speech in unison) who does not appear as the @xml:id value of a <role> element in the cast list.

This is not the same as the preceding task because it does the validation in the opposite direction. This task will report an error if there is a @who value (or part of a @who value after splitting into words) that does not correspond to one of the target @xml:id values. The preceding task verifies that every @xml:id value on a <role> element appears as a @who value (or part of a @who value) on an <sp> element except when the @xml:id value matches the pattern for non-speaking characters.