Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-04-03T17:02:07+0000


Test #5: Schematron answers

The text

This test uses a poem from this semester’s Dickinson project, which you can find at schematron-instance.xml. We have modified the XML for use in this Schematron test, including introducing content and markup that was not present in the project data. The original developers are not responsible for these modifications, which were made only for testing purposes.

You may assume that Relax NG validation is ensuring that all required elements and attributes are present and in allowed contexts. For example, you do not have to validate that the document contains a <date>, that the date is where it should be in the document, that the date is a four-digit year, that the date is allowed to have a @period attribute, or that late is a permitted value for that attribute.

Required tasks

Create Schematron rules that enforce the following constraints:

  1. A <line> element cannot begin or end with a space character.
  2. A <line> element cannot be empty or contain just whitespace.
  3. A <stanza> element with a @type attribute that has the value quatrain must contain exactly four lines. Your error message should report the actual line count of stanzas that fail this test.
  4. A poem with a <date> element that has a @period attribute with the value late must be dated between 1875 and 1886, inclusive. Your error message should report the actual date that appears in the poem.

Optional, extra-credit tasks

  1. The first two types of constraints described above for <line> elements also apply to all element descendants of the <metadata> element. That is, none of those elements can begin or end with a space character and they cannot be blank or consist entirely of whitespace. You should write one rule that tests all of these elements, and not a separate rule for each of them.
  2. The real first line of the poem must match the text of the first line as given in the metadata section, except that there may be whitespace differences. (This is a common real-life exception because pretty-printing could introduce whitespace differences.)
  3. Any theme mentioned in the <poem_themes> metadata element must correspond to the name of at least one child element inside a <line>. For example, if there were a <theme> element with the value obdurodon, there would have to be an element called <obdurodon> in a line of the poem. No fair checking by specific names; this rule has to work with any element type, including those not present in this particular poem. Your error message should report the value of the spurious <theme> element.
  4. All child elements of <line> must have names that correspond to <theme> element children of the <poem_themes> metadata element. This is the mirror image of the preceding rule, and your error message should report the name of the element type that appears in the body but is not listed among the metadata themes.

Our solution

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process">
    <sch:pattern>
        <!-- ====================================================== -->
        <!-- Global variables                                       -->
        <sch:let name="metadata-themes" value="//theme"/>
        <sch:let name="body-themes" value="//line/* ! name()"/>
        <!-- ====================================================== -->
        <sch:rule context="stanza[@type eq 'quatrain']">
            <!-- A quatrain must contain 4 lines -->
            <sch:let name="line-count" value="count(line)"/>
            <sch:assert test="$line-count eq 4">Quatrains must contain exactly four lines. This
                stanza contains <sch:value-of select="$line-count"/> lines.</sch:assert>
        </sch:rule>
        <sch:rule context="line">
            <!-- A line cannot be empty or contain just whitespace -->
            <sch:report test="string-length(normalize-space(.)) = 0">Lines cannot be empty and
                cannot contain just whitespace</sch:report>
            <!-- A line cannot start with a space character -->
            <sch:report test="starts-with(., ' ')">Lines should not begin with space
                characters</sch:report>
            <!-- A line cannot end with a space character -->
            <sch:report test="ends-with(., ' ')">Lines should not end with space
                characters</sch:report>
        </sch:rule>
        <sch:rule context="date[@period eq 'late']">
            <!-- Late period poems must be dated between 1875 and 1886 -->
            <sch:let name="year" value="number(.)"/>
            <sch:assert test="$year ge 1875 and $year le 1886">Late-period poems must have a
                date between 1875 and 1886, inclusive, and <sch:value-of select="$year"/> does not
                fall within that range.</sch:assert>
        </sch:rule>
        <sch:rule context="line/*">
            <!-- Inline element types must be listed among the metadata themes -->
            <sch:let name="name" value="name()"/>
            <sch:assert test="$name = $metadata-themes">The element type "<sch:value-of
                    select="$name"/>" is not among the metadata themes: "<sch:value-of
                    select="string-join($metadata-themes, ', ')"/>"</sch:assert>
        </sch:rule>
        <sch:rule context="theme">
            <!-- Every theme element in the metadata must appear inside a line in the body -->
            <sch:assert test=". = $body-themes">The metadata theme "<sch:value-of select="."/>" does
                not appear in the body themes: "<sch:value-of
                    select="string-join($body-themes, ', ')"/>"</sch:assert>
        </sch:rule>
    </sch:pattern>
    <sch:pattern>
        <sch:rule context="metadata//*[not(self::poem_themes)]">
            <!-- Metadata entries (all descendants) cannot be empty or contain just whitespace -->
            <sch:report test="string-length(normalize-space(.)) = 0">Metadata entries cannot be
                empty and cannot contain just whitespace</sch:report>
            <sch:report test="starts-with(., ' ')">Metadata entries cannot begin with space
                characters.</sch:report>
            <sch:report test="ends-with(., ' ')">Metadata entries cannot end with space
                characters.</sch:report>
        </sch:rule>
        <!-- The real first line must match the metadata first line, except for whitespace -->
        <sch:rule context="body/descendant::line[1]">
            <sch:assert test="normalize-space(.) eq //first_line ! normalize-space(.)">The first
                line of the poem must contain the same text as the &lt;first_line&gt; element in the
                metadata.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>

Discussion

Overview

If multiple rules inside an <sch:pattern> element are able to match the same thing (for example, if one specifies all lines as the context and another specifies only first lines), only the first one (in order) will fire. The way to work around this is to separate otherwise overlapping rules into different <sch:pattern> elements. You did not have to do this for the required activities, none of which required rule contexts that would overlap. As an alternative to splitting your rules into different <sch:pattern> elements as a way of avoiding overlapping contexts, you can write tests that take the differences into account. We illustrate both of these strategies below.

Basic solution

All of these rules can go inside the same <sch:pattern> element:

<sch:rule context="stanza[@type eq 'quatrain']">
    <!-- A quatrain must contain 4 lines -->
    <sch:let name="line-count" value="count(line)"/>
    <sch:assert test="$line-count eq 4">Quatrains must contain exactly four lines. This
        stanza contains <sch:value-of select="$line-count"/> lines.</sch:assert>
</sch:rule>
<sch:rule context="line">
    <!-- A line cannot be empty or contain just whitespace -->
    <sch:report test="string-length(normalize-space(.)) = 0">Lines cannot be empty and
        cannot contain just whitespace</sch:report>
    <!-- A line cannot start with a space character -->
    <sch:report test="starts-with(., ' ')">Lines should not begin with space
        characters</sch:report>
    <!-- A line cannot end with a space character -->
    <sch:report test="ends-with(., ' ')">Lines should not end with space
        characters</sch:report>
</sch:rule>
<sch:rule context="date[@period eq 'late']">
    <!-- Late period poems must be dated between 1875 and 1886 -->
    <sch:let name="year" value="number(.)"/>
    <sch:assert test="$year ge 1875 and $year le 1886">Late-period poems must have a
        date between 1875 and 1886, inclusive, and <sch:value-of select="$year"/> does not
        fall within that range.</sch:assert>
</sch:rule>

To verify that quatrains contain exactly four lines, we match on <stanza> elements with a @type value of quatrain, which means that the rule fires once per <stanza> element if the element has a @type attribute with the value quatrain (and ignores other <stanza> elements). We use the count() function to count its <line> children, and we save this value in a variable because we’re going to use it twice (for the test and the report), and by saving it in a variable we have to count it only once. We then assert that the count must be equal to 4, and report an error, with the actual count, if the test fails. With this poem the test will give the correct result if you don’t specify the @type in the @context because this poem happens not to contain any stanzas that are not quatrains, but you should specify the @type attribute in the rule anyway. The poet also wrote poems with stanzas that were not quatrains, so specifying the attribute will protect you from spurious error reports on tercets or sestets in other poems.

We write separate tests to verify that a line does not begin or end with a space character. Those aren’t very robust tests, though, because they check only for literal space characters, and not for other whitespace characters, like tabs and newlines. A more robust test to check whether there is any whitespace character (not just a literal space) at the beginning of a line might be:

<sch:report test="matches(., '^\s')">

Because matches() checks for a regex, rather than a string, we can use the regex notation \s, which means any whitespace character; this saves us from having to check for each possible whitespace character separately. If you want to use starts-with(), you could use a more complex test, checking for each type of whitespace character individually:

<sch:report test="starts-with(., ' ') or
starts-with(., '&#x09;') or
starts-with(., '&#x0a;')">

Because we can’t easily read tab or newline characters, we use numerical character references to represent them in our code (see Kay, p. 142). We would also want to write a more robust rule to check the end of the line, using '\s$' as the regex to match.
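
Putting the two regex tests together, a more robust version of the begin-and-end checks might look something like the following sketch. The message wording is ours, and this version is offered only as an illustration of the technique, not as a replacement for the solution above.

```xml
<sch:rule context="line">
    <!-- \s matches any whitespace character: space, tab, newline, carriage return -->
    <!-- ^ anchors the match to the start of the string value of the line -->
    <sch:report test="matches(., '^\s')">Lines should not begin with whitespace
        characters</sch:report>
    <!-- $ anchors the match to the end of the string value of the line -->
    <sch:report test="matches(., '\s$')">Lines should not end with whitespace
        characters</sch:report>
</sch:rule>
```

Because both tests are inside a single rule whose context is line, they are both applied to every <line> element.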

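To ensure that a <line> element isn’t empty, we measure its length with string-length() and report an error if the length is equal to 0. Because the requirement also covers lines that contain only whitespace, the value should first be passed through normalize-space(), which strips leading and trailing whitespace, so that a whitespace-only line also has a length of 0; the rule for metadata entries below works the same way.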

The context for these rules is line, so the rule fires once per <line> element.

To test the date range we match the <date> element only if it has a @period attribute with the value late, and we save the numeric value in a variable so that we can reuse it easily. If you want to use general comparison (for example, >=), you don’t need the number() function, but if you want to use value comparison (for example, ge), you do. The two types of comparison have the same meaning in this particular context, so you can use either one; they do have different meanings elsewhere, though, and you can remind yourself of the difference in the Comparison section of our page at https://dh.obdurodon.org/functions.xhtml. We assert that the year falls within the designated range, and report an error if it doesn’t.
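
The two styles of comparison can be put side by side, as in the sketch below. It assumes that the document is validated without schema-supplied datatypes, so that the content of <date> is untyped; the message wording is ours. Note that a < character inside an attribute value has to be written as &lt;.

```xml
<sch:rule context="date[@period eq 'late']">
    <!-- General comparison: the untyped value is converted to a number automatically -->
    <sch:assert test=". >= 1875 and . &lt;= 1886">General comparison: <sch:value-of
            select="."/> is not between 1875 and 1886, inclusive.</sch:assert>
    <!-- Value comparison: without number() the untyped value would be treated as a
         string, and comparing a string to a number raises a type error -->
    <sch:assert test="number(.) ge 1875 and number(.) le 1886">Value comparison:
            <sch:value-of select="."/> is not between 1875 and 1886,
        inclusive.</sch:assert>
</sch:rule>
```

In a real schema you would keep only one of the two assertions, since they test the same thing.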

Element children of <line> elements must be listed among the metadata <theme> elements

<sch:let name="metadata-themes" value="//theme"/>
<sch:rule context="line/*">
    <!-- Inline element types must be listed among the metadata themes -->
    <sch:let name="name" value="name()"/>
    <sch:assert test="$name = $metadata-themes">The element type "<sch:value-of
            select="$name"/>" is not among the metadata themes: "<sch:value-of
            select="string-join($metadata-themes, ', ')"/>"</sch:assert>
</sch:rule>

We create a variable called $metadata-themes that is equal to a sequence of all <theme> elements in the document, which are only the ones listed within the metadata section. In our rule, the @context value matches all children of the <line> element it is looking at at the moment, that is, all inline elements inside each line of the poem, one line at a time. For each element instance that the @context matches, we store the element name in a variable so that we can reuse it easily, once for the test and again for the report. The test asserts that the name of the inline element is equal to the value of one of the <theme> elements in the metadata section, and here we have to use general comparison because that is the only way to test whether any item on the left side of the equal sign (there’s only one, the name of the element being tested) is equal to any item on the right (the sequence of all <theme> values from the metadata). If you try to use eq for value comparison here, you’ll raise an error because value comparison can only compare one thing to one thing, and our right side contains a sequence of <theme> elements, that is, more than one. If the test fails, we report both the name of the problematic element that we matched and the list of expected values that we extracted from the <theme> elements in the metadata section.

Because this rule matches on children of <line> elements (context="line/*"), and not on <line> elements themselves, it does not compete or conflict with the earlier rule that matches on <line> elements themselves. For that reason, it can go inside the same <sch:pattern>.

Metadata entries (all descendants) cannot be empty or contain just whitespace

<sch:pattern>
    <sch:rule context="metadata//*[not(self::poem_themes)]">
        <!-- Metadata entries (all descendants) cannot be empty or contain just whitespace -->
        <sch:report test="string-length(normalize-space(.)) = 0">Metadata entries cannot be
            empty and cannot contain just whitespace</sch:report>
        <sch:report test="starts-with(., ' ')">Metadata entries cannot begin with space
            characters.</sch:report>
        <sch:report test="ends-with(., ' ')">Metadata entries cannot end with space
            characters.</sch:report>
    </sch:rule>
</sch:pattern>

We can test the whitespace behavior of all metadata elements at once by matching on all descendants of the <metadata> element, except that we don't want to test <poem_themes> in this way because it has element content, which means that it might contain whitespace for pretty-printing, and we care about whitespace only when it begins or ends real text. There are two ways to exclude <poem_themes> from the @context value:

  • Use a predicate to match only descendant elements of <metadata> that are not of type <poem_themes>. We do that above by using the self axis to say “match me unless I, myself, am an element of type <poem_themes>”. This type of test is the principal use case for the self axis.
  • XPath has an except operator that can be used to exclude items from a sequence. If you specify the value of @context as metadata//(* except poem_themes), that will match all element descendants of <metadata> except descendant <poem_themes> elements.

We have to put this <sch:rule> into a different <sch:pattern> than the rule that checks the date. The reason is that the @context for the date-checking rule is date[@period eq 'late'] and the context for checking all descendants of <metadata> is metadata//*[not(self::poem_themes)], and a late <date> element matches both of those XPath patterns. If a component of the document we are validating matches more than one @context value inside the same <sch:pattern>, only the first <sch:rule> will fire. We could have worked around this limitation with more complex predicates, but we find our code easier to write, read, debug, and maintain if we keep our @context XPath patterns as simple as possible, even when that requires us to create additional <sch:pattern> elements.
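
For comparison, here is one way the predicate-based workaround might look if both rules were kept in a single <sch:pattern>. This is only a sketch of the alternative, not the approach we took: the extra predicate on the second rule and the repeated whitespace tests in the first rule are our illustration of the cost involved.

```xml
<sch:pattern>
    <sch:rule context="date[@period eq 'late']">
        <sch:let name="year" value="number(.)"/>
        <sch:assert test="$year ge 1875 and $year le 1886">Late-period poems must have a
            date between 1875 and 1886, inclusive, and <sch:value-of select="$year"/> does not
            fall within that range.</sch:assert>
        <!-- The whitespace tests are repeated here because the rule below
             no longer matches late <date> elements -->
        <sch:report test="string-length(normalize-space(.)) = 0">Metadata entries cannot be
            empty and cannot contain just whitespace</sch:report>
        <sch:report test="starts-with(., ' ')">Metadata entries cannot begin with space
            characters.</sch:report>
        <sch:report test="ends-with(., ' ')">Metadata entries cannot end with space
            characters.</sch:report>
    </sch:rule>
    <!-- The second predicate condition makes the two contexts mutually exclusive -->
    <sch:rule context="metadata//*[not(self::poem_themes or self::date[@period eq 'late'])]">
        <sch:report test="string-length(normalize-space(.)) = 0">Metadata entries cannot be
            empty and cannot contain just whitespace</sch:report>
        <sch:report test="starts-with(., ' ')">Metadata entries cannot begin with space
            characters.</sch:report>
        <sch:report test="ends-with(., ' ')">Metadata entries cannot end with space
            characters.</sch:report>
    </sch:rule>
</sch:pattern>
```

The contexts no longer overlap, but the whitespace tests now appear twice and the second @context is harder to read, which is why we prefer separate <sch:pattern> elements.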

To test whether the first line of the real poem matches the <first_line> element in the metadata we could use either of those elements as the @context value. In both cases, though, we have to ensure that we don’t wind up with two @context attributes within a single <sch:pattern> element that match the same item in the document. We chose to match the real first line, and we normalize the whitespace on both it and the <first_line> element and then check whether they are the same.
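
Had we chosen the metadata element as the context instead, the rule might look something like the sketch below. Because <first_line> is a descendant of <metadata>, it would also match the whitespace-checking rule for metadata entries, so in this sketch we assume that the rule is given a pattern of its own.

```xml
<sch:pattern>
    <sch:rule context="first_line">
        <!-- (//body/descendant::line)[1] selects the first line of the poem in document order -->
        <sch:assert test="normalize-space(.) eq normalize-space((//body/descendant::line)[1])"
            >The &lt;first_line&gt; element in the metadata must contain the same text as the
            first line of the poem.</sch:assert>
    </sch:rule>
</sch:pattern>
```

Either way, both values are passed through normalize-space() before they are compared, so whitespace differences introduced by pretty-printing do not trigger an error.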