Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-29T00:49:48+0000


Test #7: Schematron

The tasks

The task for this test was to write Schematron that constrained an excerpt of Herman Melville’s Moby Dick in the following three ways:

  1. Verify that the formatting and structure of the titles found within <chapter-title> elements were correct for each <chapter>.
  2. Check that page numbers listed within the @page attributes maintained a logical increasing order as the chapters progressed.
  3. Make sure that quotation marks were balanced within the body of the excerpt.

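And, as extra credit:

  1. Check that the chapter number given in each <chapter-title> corresponded to the position of its <chapter> within the excerpt.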

Additional details and hints on how to approach these tasks can be found in the original assignment.

Our solution

Below is our solution to the Schematron test:



    
        
        
        
        
    [Schematron listing: the markup was not preserved in this copy; only the message text of each rule survives.]

    Chapter titles must follow the format "CHAPTER #. Title."

    The first chapter must appear on or after the first page, but instead appeared on page []

    Expected a strictly ascending sequence of @page values, but this chapter’s page number [] is less than the previous chapter’s page number [].

    Expected even count of quotation marks, but found an odd amount []. Check for unclosed quotations.

    Expected chapter title number to correspond with the chapter’s position, but found chapter number "" within Chapter .

There are several good ways to approach these tasks other than those written in our solution. A solution that does not directly match ours or even use all the same functions we use is not necessarily incorrect. Whether your implementation was similar to ours or you went about it a different way, please take a look below for additional explanation for each part of the tasks.

Explanations

Part one

In this part, we looked to raise an error whenever the titles included within the <chapter-title> elements did not follow this structure:

  1. The string CHAPTER, followed by a space.

  2. Then the chapter number, followed by a dot and then a space.

  3. Then the chapter name, followed by a dot.

The Schematron that accomplishes this looks like the following:


    Chapter titles must follow the format "CHAPTER #. Title."

Since we want to look at each <chapter-title> element individually and test whether or not it follows the desired formatting conventions, our @context value is chapter-title. While this constraint could also be applied by setting the @context of the rule to chapter and then pathing down to its <chapter-title> child for the @test, this would be slightly inaccurate: in Schematron, error messages are reported (that is, the dreaded red squiggly line in <oXygen/> appears) on the node matched by the @context attribute, so it is best to use the most specific context value for a given constraint where possible. This makes it easier for a user to locate where the invalidity is occurring within the XML. If the @context were equal to chapter, the error message would appear on the <chapter> element, and from there the user would have to figure out which specific elements within the chapter were incorrect. In contrast, an error message appearing on the <chapter-title> element better directs a user to the source of the error and is more representative of the scope of the constraint.

Our solution uses matches(), which accepts two arguments: the first is a string, and the second is a regular expression. matches() checks whether the regular expression matches anywhere within the string: if it does, the function returns true, and if it does not, the function returns false. We chose the regular expression ^CHAPTER [0-9]+\. .+\.$, which checks for the leading string "CHAPTER", followed by a single space character, then one or more digits, then a dot and a space, then the text of the chapter title, and finally a dot. Because matches() would otherwise succeed on a match anywhere within the string, the ^ and $ anchors are what force the pattern to account for the entire title, and the backslash in \. makes each dot a literal period rather than the regular expression wildcard.
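The rule described above can be sketched as follows. This is a minimal reconstruction from the explanation, not necessarily the exact listing used in our solution; it assumes ISO Schematron with the xslt2 query binding and that the string value of each <chapter-title> contains no surrounding whitespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- The context is the title itself, so any error is flagged on <chapter-title> -->
        <sch:rule context="chapter-title">
            <!-- ^ and $ anchor the pattern to the whole string; \. matches a literal dot -->
            <sch:assert test="matches(., '^CHAPTER [0-9]+\. .+\.$')">Chapter titles must follow the format "CHAPTER #. Title."</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

If the titles in your XML are pretty-printed with leading or trailing whitespace, wrapping the first argument in normalize-space() is one way to keep the anchors from failing on otherwise valid titles.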

Part two

This part focused on making sure the page numbers found within the @page attribute of each <chapter> element increased in strictly ascending order (the technical term for this is strictly monotonically increasing). Our solution was the following:


    The first chapter must appear on or after the first page, but instead appeared on page []

    Expected a strictly ascending sequence of @page values, but this chapter’s page number [] is less than the previous chapter’s page number [].

In our solution, we set the @context value of our first rule to chapter[1], which allows us to perform a unique test on just the first chapter, and then to chapter on the second rule, which allows us to both easily access the @page value of that <chapter> and the preceding <chapter> sibling. We then perform the following two checks:

  1. First, we check that the very first <chapter> in the document falls on a page number equal to or greater than one. This guarantees that our first chapter appears at a logical page number, since no chapter can begin on page zero or on a negative page number, but allows flexibility for it to appear on or after page one, since the first chapter might only begin after a title page or dedication page, for example, and thus not start exactly on one.

    Since the first chapter has no preceding siblings, attempting to access its previous sibling’s page number, such as with the XPath expression number(preceding-sibling::chapter[1]/@page), will result in a value of NaN, or not a number. NaN comparison works strangely: in XPath, every comparison of a number with NaN using lt, gt, or eq evaluates to false, so no number is less than, equal to, or greater than NaN. NaN itself is not equal to NaN, nor is it less than NaN or greater than NaN. Even more strangely, NaN plus a number is again NaN, and so is not greater than, equal to, or less than NaN. Needless to say, NaN comparison can be very unintuitive, so we avoid it here by not referencing the previous sibling and just comparing the current page number against the constant value of 1. (For a fascinating tour of the World of NaN, see “If it’s not a number, what is it? Demystifying NaN for the working programmer.”)

  2. Next, we make sure the rest of the chapters have strictly increasing page numbers by making sure that a given chapter’s page number is greater than the page number of the chapter that came just before it. Because a node in Schematron can match with only a single <rule> within a <pattern>, the first <chapter> in the document never sees the <report> within the second rule because it has already matched with the first rule with context chapter[1]. Without fear of selecting a previous sibling that doesn’t exist, we can find the preceding chapter’s page number and store it in a variable for easier referencing later on. We then make sure that the current @page is greater than the preceding chapter’s @page, thus enforcing a strictly increasing order across all chapters.
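The two checks above can be sketched as a single pattern with two rules. This is a minimal reconstruction under stated assumptions rather than our exact listing: the variable name previous-page is hypothetical, and the <report> here uses le so that a repeated page number is also flagged, which is one reading of "strictly ascending".

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- First chapter: compare only against the constant 1, so NaN never enters the test -->
        <sch:rule context="chapter[1]">
            <sch:assert test="number(@page) ge 1">The first chapter must appear on or after the first page, but instead appeared on page <sch:value-of select="@page"/>.</sch:assert>
        </sch:rule>
        <!-- All other chapters: chapter[1] already matched the rule above, so it never reaches this one -->
        <sch:rule context="chapter">
            <!-- previous-page is a hypothetical name for the preceding chapter's page number -->
            <sch:let name="previous-page" value="number(preceding-sibling::chapter[1]/@page)"/>
            <!-- Fires when the current page is less than or equal to the previous one -->
            <sch:report test="number(@page) le $previous-page">Expected a strictly ascending sequence of @page values, but this chapter’s page number (<sch:value-of select="@page"/>) is not greater than the previous chapter’s page number (<sch:value-of select="$previous-page"/>).</sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

The order of the two rules matters: if the general chapter rule came first, every chapter, including the first, would match it, and the chapter[1] rule would never fire.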

For both of the above tests, we made sure to cast the @page values we found to a numeric data type with the XPath function number(). This is an important step because comparisons between untyped or string values can differ from numerical comparison. As an example, "11" lt "7" (note the quotation marks, which mean that we’re dealing with strings, and not numerical values) evaluates to true, since in string comparison the first character of "11", which is 1, is less than 7, but number("11") lt number("7") (where the number() function casts the strings to numerical values) evaluates to the expected value of false.
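The difference between the two kinds of comparison can be checked with a small self-testing schema. This is an illustrative sketch, not part of our solution; it assumes an XPath 2.0 processor using the default codepoint collation, and it should validate any document silently because both assertions hold.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- The document element is used only as a convenient place to run the checks -->
        <sch:rule context="/*">
            <!-- String comparison: "11" sorts before "7" because "1" precedes "7" -->
            <sch:assert test="'11' lt '7'">As strings, "11" was expected to be less than "7".</sch:assert>
            <!-- Numeric comparison: 11 is not less than 7 -->
            <sch:assert test="not(number('11') lt number('7'))">As numbers, 11 was expected not to be less than 7.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```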

A solution that works within this document but is not ideal is to manually select the first, second, and third <chapter> and make sure that the @page of chapter[1] is less than that of chapter[2], which in turn is less than that of chapter[3]. While this performs the validation we intend it to, what if we were working across the whole novel and had to then manually select all 135 chapters of Moby Dick? Typing out 135 unique XPath expressions would get tiring quickly. Instead, using a method like the one above ensures that our tests will work across all chapters, regardless of how many there are in our XML.

Part three

Part three looked to ensure that quotation marks within the body were balanced. Our solution looked like this:


    
    
    
    Expected even count of quotation marks, but found an odd amount []. Check for unclosed quotations.

Since a closing quotation mark should always fall in the same paragraph as its opening quotation mark within this excerpt, we set our @context value to p.

This solution borrows heavily from the solution to Schematron assignment 2, where we use the same method of combining translate() and string-length() to count how many quotation marks are in a given <p>. Since quotation marks are already reserved characters in XPath to delineate string values, we use the entity reference &quot; to represent the quotation mark in the second argument of translate(). An equally effective strategy is to escape the quotation mark with another quotation mark: we reference a similar method involving apostrophes in our explanation to XSLT assignment 4.

Once we’ve counted how many quotation marks exist within the paragraph, we then mod that value by 2. If the number is divisible by 2, and thus even, it will have a remainder of 0, but if the number is not divisible by 2, and thus odd, it will have a remainder of 1. We use this property in our @test to assert that the number of quotation marks in a paragraph must be even, and if they are even, it means they are balanced within that paragraph.
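The counting and parity test described above can be sketched as follows. This is a minimal reconstruction, not necessarily our exact listing: the variable name quote-count is hypothetical, and the sketch assumes the excerpt uses straight double quotation marks rather than curly ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="p">
            <!-- Length of the paragraph minus its length with every " removed gives the number of " characters -->
            <sch:let name="quote-count" value="string-length(.) - string-length(translate(., '&quot;', ''))"/>
            <!-- A remainder of 0 after dividing by 2 means the count is even -->
            <sch:assert test="$quote-count mod 2 eq 0">Expected even count of quotation marks, but found an odd amount (<sch:value-of select="$quote-count"/>). Check for unclosed quotations.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Note that an even count shows only that the marks pair up within the paragraph; it cannot tell whether each mark is in the right place.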

Extra credit

The extra credit question involved making sure a given <chapter-title> correlated correctly to the offset position of that <chapter> element within the excerpt. Our solution looked like the following:


    
    
    Expected chapter title number to correspond with the chapter’s position, but found chapter number "" within Chapter .

Here, having our @context set to chapter is important because we want to be able to reference the position() of a given chapter against the number listed in its <chapter-title>.

We can’t compare the <chapter-title> to the position() of the chapter right away because the chapter number in the title is sandwiched between some other text. To extract it, we used substring-before() and substring-after() to separate out the number from the rest of the title text. From there, we can compare it to the chapter’s position and assert that they must be equal. As with part two, we make sure to cast our extracted chapter number from a string to a numerical value with the number() function to eliminate unexpected behavior when comparing untyped or string values. We do not need to cast the result of position() since position() already returns a numerical value.
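The extraction and comparison can be sketched as below. This is a minimal sketch under stated assumptions, not our exact listing: the variable names title-number and chapter-position are hypothetical, each <chapter> is assumed to have exactly one <chapter-title> child, and the position is computed with count(preceding-sibling::chapter) + 1 instead of position(), a swapped-in technique that does not depend on how the processor sets up the context for the rule.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- A separate pattern, because a node can match only one rule within a pattern -->
    <sch:pattern>
        <sch:rule context="chapter">
            <!-- Keep what follows "CHAPTER " and precedes the first dot, then cast it to a number -->
            <sch:let name="title-number" value="number(substring-before(substring-after(chapter-title, 'CHAPTER '), '.'))"/>
            <!-- One more than the number of earlier chapter siblings -->
            <sch:let name="chapter-position" value="count(preceding-sibling::chapter) + 1"/>
            <sch:assert test="$title-number eq $chapter-position">Expected chapter title number to correspond with the chapter’s position, but found chapter number "<sch:value-of select="$title-number"/>" within Chapter <sch:value-of select="$chapter-position"/>.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

If a title does not begin with "CHAPTER " followed by a number, the extracted value is NaN, the comparison is false, and the assertion fails, so malformed titles are flagged here as well.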