Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-14T18:03:05+0000

Test #7: Schematron

Test overview

For this test we’ll work with an XML excerpt of Herman Melville’s Moby Dick. You’ll be using Schematron to catch typos and verify the document’s structure. To make the XML more manageable, we have included content from only the first three chapters of the book and shortened each chapter to just the first several paragraphs. If you’d like to read the full novel, a copy is available from Project Gutenberg. We’ve also created a Relax NG schema for this excerpt; you don’t need it for the test, but you’re welcome to take a look at it.

The tasks

You’ll need to write three types of <assert> or <report> tests to perform the three types of validation described below. Whether you use <assert> or <report> in your solutions is up to you (use whichever you find easier to understand), but be sure to include 1) an informative error message that you would find helpful if you were editing this markup for a project and 2) code comments that explain what each part of your Schematron is testing.

The errors your tests will look for don’t appear in the supplied XML, which means that if you don’t see an error report, you can’t automatically know whether your rule is capable of recognizing an error should one occur. If you have a rule that doesn’t report an error, the normal way to check whether that’s because there aren’t any errors or because your rule isn’t matching anything at all is to introduce an error in the XML yourself (temporarily) to verify whether your rule can find it.
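As an illustration of that check-by-breaking workflow, here is a minimal sketch of a complete Schematron schema. Its single rule is not one of the three tasks: it only verifies that every <chapter> carries a @page attribute. The sketch assumes that the elements in the excerpt are in no namespace and that your Schematron processor supports the xslt2 query binding; adjust both if your setup differs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- The rule fires once for every <chapter> element (assumed to be in no namespace) -->
        <sch:rule context="chapter">
            <!-- An assert reports its message when the test is false,
                 here when the chapter has no @page attribute -->
            <sch:assert test="@page">
                Every chapter must have a page attribute that records its starting page.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

With the XML as supplied, this rule should stay silent. To confirm that it is actually matching something, temporarily delete the page attribute from one <chapter>, validate again, and then undo the change once the error message appears.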

For each task we provide a hint about the XPath functions we used, but in most cases there are alternative approaches that are just as good as ours. Your tests do not have to match ours as long as they are written in correct, legible XPath and provide the same functionality.

Here are the three types of errors for which you’ll need to test:

  1. The structure of the chapter titles in Moby Dick is very consistent, always following the pattern of:

    1. The string CHAPTER, followed by a space.

    2. Then the chapter number, followed by a dot and then a space.

    3. Then the chapter name, followed by a dot.

    For example:

    <chapter-title>CHAPTER 1. Loomings.</chapter-title>

    Write a Schematron rule that raises an error when the title featured in a <chapter-title> element deviates from this pattern. For example, the following <chapter-title> elements should all be reported as invalid:

    <chapter-title>CHAPTER 1.Loomings.</chapter-title>
    <chapter-title>Chapter 1. Loomings.</chapter-title>
    <chapter-title>CHAPTER A. Loomings.</chapter-title>

    In the preceding examples, the first is missing the space after the chapter number, the second hasn’t fully capitalized the word CHAPTER, and the third uses a letter where a number is expected.

    This is just a sampling of cases to test for, so make sure that your @test value checks for all aspects of the formatting, paying particular attention to capitalization and spacing. (Hint: we used matches() in our solution to this problem.)

  2. This XML document uses a @page attribute on each <chapter> element to record the page on which that chapter starts. We don’t have enough information to verify whether the page numbers in our XML are accurate, since they are transcribed from a print copy of the book, but we can use Schematron to catch values that would be illegal within the context of the XML. In particular, the values of the @page attributes should be strictly increasing across the chapters, meaning that each chapter should start at a later page number than all of the preceding ones (the technical term for this is strictly monotonically increasing). For example:

    <chapter page="1">...</chapter>
    <chapter page="7">...</chapter>
    <chapter page="11">...</chapter>

    Here each chapter correctly starts at a page number that is larger than the preceding ones. If the second chapter in this example started on page 23, though, that would be incorrect because the third chapter, which starts on page 11, would start at an earlier page than the chapter that precedes it.

    Write a Schematron rule that raises an error when an @page value of a <chapter> is set to a value that would break the logical ascending order. (Hint: we used number() in our solution to this problem.)

  3. Quotation marks are often used in Moby Dick in descriptions of the scenery Ishmael encounters along his journey. For example:

    <p>With halting steps I paced the streets, and passed the sign of "The Crossed
    Harpoons"—but it looked too expensive and jolly there...</p>

    Because of the way quotations are structured within this excerpt, every opening quotation mark should have a matching closing quotation mark. Write a Schematron rule that constrains the document by verifying that quotation marks found within the text are balanced, that is, that there is an even number of them, so that for every quotation mark that opens a quotation there is a quotation mark that closes it. (It is not possible to verify with Schematron that the quotation marks are around actual quotations, but at least we can verify that they are paired.)

    (Hint: we used the mod operator in our solution, which returns the remainder left over after dividing one number by another. See the Using mod section below for information about how mod works. Reviewing our methods for counting characters in Schematron assignment 2 might also prove useful here.)
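One detail about matches(), mentioned in the hint for the first task, is easy to overlook: the regular expression is not anchored automatically, so it succeeds if the pattern is found anywhere in the string unless you add ^ (start of string) and $ (end of string). The sketch below shows this on a check that is not one of the three tasks, namely that a @page value consists only of digits. It assumes that the elements are in no namespace and that the xslt2 query binding is available.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="chapter">
            <!-- '^[0-9]+$' requires the whole value to be one or more digits.
                 Without ^ and $, the pattern '[0-9]+' would also accept
                 a value such as "page 7", because it contains a digit. -->
            <sch:assert test="matches(@page, '^[0-9]+$')">
                The page value "<sch:value-of select="@page"/>" should contain only digits.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

To see the rule fire, temporarily change one @page value to something like page="7a" and validate again.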

Using mod

mod, which is short for the mathematical operation modulo, divides one number by another and returns the remainder. For example, the expression 7 mod 2 returns 1, since 7 divided by 2 is 3 with a remainder of 1. mod is commonly used as a way to check whether a given number is even: what number, when used to divide an even number, will always leave a remainder of 0, while when it is used to divide an odd number it will never leave a remainder of 0?
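If you would like to experiment with mod before using it in a real rule, one option is a throwaway schema whose assertions are plain arithmetic. The sketch below is only an illustration under the same assumptions as any XPath 2.0 Schematron (xslt2 query binding); the context "/" is used just so that the rule fires once, on the document node.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- The document node is the context only so that the rule fires exactly once -->
        <sch:rule context="/">
            <!-- 7 = 3 * 2 + 1, so the remainder is 1 -->
            <sch:assert test="7 mod 2 eq 1">Expected 7 mod 2 to be 1.</sch:assert>
            <!-- 12 = 4 * 3 + 0, so the remainder is 0 -->
            <sch:assert test="12 mod 3 eq 0">Expected 12 mod 3 to be 0.</sch:assert>
            <!-- 14 = 2 * 5 + 4, so the remainder is 4 -->
            <sch:assert test="14 mod 5 eq 4">Expected 14 mod 5 to be 4.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

All three assertions are true, so validation should report nothing. Changing one of the expected values (for example, to 14 mod 5 eq 3) should make the corresponding message appear.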

Bonus task (optional, extra credit)

For extra credit on the Schematron test, help constrain Moby Dick so the title number featured in each <chapter-title> element accurately corresponds to that chapter’s offset position within the book. For example, in:

    <chapter-title>CHAPTER 1. Loomings.</chapter-title>
    ...
    <chapter-title>CHAPTER 3. The Carpet-Bag.</chapter-title>

the markup would be invalid, even though it follows our <chapter-title> naming rules, because the chapter title numbered 3 belongs to the second (not third) chapter within the XML. (Hint: for our solution, we used substring-before(), substring-after(), and position().)

What to submit

All you need to upload for this test is your Schematron schema file with the file extension .sch. We will associate your schema with the XML ourselves and verify your Schematron rules.