Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-04-14T18:03:05+0000
For this test we’ll work with an excerpt of Herman Melville’s Moby Dick, which can be found at http://dh.obdurodon.org/moby-dick-excerpt.xml. Here, you’ll be using Schematron to catch typos and verify the document’s structure. To make the XML more manageable, we have included content from only the first three chapters of the book and shortened each chapter to just the first several paragraphs. If you’d like to read the full novel, a Gutenberg copy is available at https://www.gutenberg.org/files/2701/2701-h/2701-h.htm. We’ve also created a Relax NG schema for this excerpt, which you don’t need for the test, but if you’d like to take a look you’ll find it at http://dh.obdurodon.org/moby-dick.rnc.
You’ll need to write three types of <assert> or <report> tests to perform the three types of validation described below. Whether you use an <assert> or a <report> in your solutions is up to you (use whichever you find easiest to understand), but be sure to include 1) a meaningful error message that you would find helpful if you were editing this markup for a project and 2) code comments that help you understand what each part of your Schematron is testing.
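As a reminder of how the two elements differ, here is a minimal sketch that expresses the same constraint twice, once with <assert> and once with <report>. The <note> element and its @id attribute are hypothetical (they are not part of the Moby Dick excerpt), and we assume an XPath 2.0 query binding, so adjust the queryBinding value if your setup differs.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Hypothetical rule: every <note> must carry a non-empty @id -->
        <sch:rule context="note">
            <!-- <assert> fires when its test is FALSE -->
            <sch:assert test="string-length(@id) gt 0">
                This note has no @id value; every note needs one.
            </sch:assert>
            <!-- <report> fires when its test is TRUE; same constraint, inverted test -->
            <sch:report test="string-length(@id) eq 0">
                This note has no @id value; every note needs one.
            </sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Both tests are shown in one rule only for comparison. In a real schema you would keep just one of them, since otherwise the same mistake is reported twice.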
The errors your tests will look for don’t appear in the supplied XML, which means that if you don’t see an error report, you can’t automatically know whether your rule is capable of recognizing an error should one occur. If you have a rule that doesn’t report an error, the normal way to check whether that’s because there aren’t any errors or because your rule isn’t matching anything at all is to introduce an error in the XML yourself (temporarily) to verify whether your rule can find it.
For each task we provide a hint about the XPath functions we used, but in most cases there are alternative approaches that are no worse than ours. That is, your tests do not have to match ours as long as they are in correct, legible XPath that provides the same functionality.
Here are the three types of errors for which you’ll need to test:
The structure of the chapter titles in Moby Dick is very consistent, always following the pattern of:
The string CHAPTER, followed by a space.
Then the chapter number, followed by a dot and then a space.
Then the chapter name, followed by a dot.
For example:
<chapter-title>CHAPTER 1. Loomings.</chapter-title>
Write a Schematron rule that raises an error when the title featured in a <chapter-title> element deviates from this pattern. For example, the following <chapter-title> elements should all be reported as invalid:
<chapter-title>CHAPTER 1.Loomings.</chapter-title>
<chapter-title>Chapter 1. Loomings.</chapter-title>
<chapter-title>CHAPTER A. Loomings.</chapter-title>
In the preceding examples, the first is missing the space after the chapter number, the second hasn’t fully capitalized the word CHAPTER, and the third uses a letter where a number is expected.
This is just a sampling of cases to test for, so make sure that your @test value checks for all aspects of the formatting, paying particular attention to capitalization and spacing. (Hint: we used matches() in our solution to this problem.)
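One detail of matches() that is easy to overlook: it returns true if the pattern is found anywhere in the string, so a pattern that must describe the whole value has to be anchored with ^ and $. The sketch below shows this on a hypothetical <date> element (not part of the excerpt) that is supposed to contain only a date in YYYY-MM-DD form. It illustrates anchoring and is not a solution to the task above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Hypothetical element: <date>2023-04-14</date> -->
        <sch:rule context="date">
            <!-- ^ and $ force the pattern to cover the entire string value.
                 Without them, "around 2023-04-14 or so" would also pass,
                 because the pattern occurs somewhere inside that string. -->
            <sch:assert test="matches(., '^\d{4}-\d{2}-\d{2}$')">
                A date must consist of exactly four digits, a hyphen, two digits,
                a hyphen, and two digits, with nothing before or after.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Regular expressions are case-sensitive by default, and a literal space in the pattern matches exactly one space character.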
This XML document uses a @page attribute on each <chapter> element to denote the page number on which each chapter starts. We don’t have enough information to verify precisely whether the page numbers listed in our XML are accurate, since they are transcribed from a print copy of the book, but we can use Schematron to catch values that would be illegal within the context of the XML. For example, the values of the @page attributes should be strictly increasing across the chapters, meaning that each chapter should start at a later page number than the preceding ones (the technical term for this is strictly monotonically increasing). For example:
<chapter page="1">...</chapter>
<chapter page="7">...</chapter>
<chapter page="11">...</chapter>
Here each chapter correctly starts at a page number that is larger than the preceding ones. If the second chapter in this example started on page 23, though, that would be incorrect because the third chapter, which starts on page 11, would start at an earlier page than the chapter that precedes it.
Write a Schematron rule that raises an error when an @page value of a <chapter> is set to a value that would break the logical ascending order. (Hint: we used number() in our solution to this problem.)
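Why number() matters: when no schema supplies datatypes, attribute values are untyped, and as we understand the XPath 2.0 comparison rules, comparing two untyped values with < or > compares them as strings, so that "9" sorts after "10". The sketch below shows a numeric comparison on hypothetical markup (an <order> element with a @limit attribute that contains <item> elements with @price attributes, none of which appear in the excerpt). It is not a solution to the task above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Hypothetical markup: <order limit="10"><item price="9"/></order> -->
        <sch:rule context="item">
            <!-- number() converts each attribute value to a number before comparing.
                 Compared as strings, "9" would count as greater than "10".
                 Inside an attribute value, < must be written as &lt; -->
            <sch:assert test="number(@price) &lt;= number(../@limit)">
                The price of this item (<sch:value-of select="@price"/>) exceeds
                the order limit (<sch:value-of select="../@limit"/>).
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```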
Quotation marks are often used in Moby Dick in descriptions of the scenery Ishmael encounters along his journey. For example:
<p>With halting steps I paced the streets, and passed the sign of "The Crossed
Harpoons"—but it looked too expensive and jolly there...</p>
Because of the way quotations are structured within this excerpt, all start quotation marks should have a matching ending quotation mark. Write a Schematron rule that constrains the document by verifying that quotation marks found within the text are balanced, that is, that there is an even number of them, so that for every quotation mark that opens a quotation there is a quotation mark that closes it. (It is not possible to verify with Schematron that the quotation marks are around actual quotations, but at least we can verify that they are paired.)
(Hint: we used the mod operator in our solution, which returns the remainder left over when one number is divided by another. See the Using mod section below for information about how mod works. Reviewing our methods for counting characters in Schematron assignment 2 might also prove useful here.)
Using mod
mod, which is an abbreviation for the mathematical operation modulo, divides one number by another and returns the remainder. For example, the expression 7 mod 2 returns 1, since 7 divided by 2 is 3 with a remainder of 1. mod is commonly used to check whether a given number is even: what divisor always yields a remainder of 0 when it divides an even number, but never yields a remainder of 0 when it divides an odd number?
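To show what mod looks like inside a @test value, here is a small sketch on hypothetical markup (a <stanza> element that is supposed to contain <line> elements in groups of three, neither of which appears in the excerpt). It checks divisibility by 3 rather than evenness, so the question above is still yours to answer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Hypothetical markup: each <stanza> should hold 3, 6, 9, ... <line> children -->
        <sch:rule context="stanza">
            <!-- count(line) mod 3 is the remainder after dividing the line count by 3;
                 a remainder of 0 means the count is a multiple of 3 -->
            <sch:assert test="count(line) mod 3 eq 0">
                This stanza has <sch:value-of select="count(line)"/> lines, which leaves
                a remainder of <sch:value-of select="count(line) mod 3"/> when divided by 3.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```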
For extra credit on the Schematron test, help constrain Moby Dick so the title number featured in each <chapter-title> element accurately corresponds to that chapter’s offset position within the book. For example, in:
<chapter-title>CHAPTER 1. Loomings.</chapter-title>
...
...
...
<chapter-title>CHAPTER 3. The Carpet-Bag.</chapter-title>
...
...
...
...
the markup would be invalid, even though it follows our <chapter-title> naming rules, because the chapter title numbered 3 belongs to the second (not third) chapter within the XML. (Hint: for our solution, we used substring-before(), substring-after(), and position().)
All you need to upload for this test is your Schematron schema file with the file extension .sch. We will associate your schema with the XML ourselves and verify your Schematron rules.