Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2023-04-29T00:49:48+0000
The task for this test was to write Schematron that constrained an excerpt of Herman Melville’s Moby Dick in the following three ways:
The format of the <chapter-title> elements was correct for each <chapter>.
The @page attributes maintained a logical increasing order as the chapters progressed.
The quotation marks within the body were balanced.
And, as extra credit:
The chapter number in each <chapter-title> correlated correctly to the offset position of that <chapter> element within the excerpt.
Additional details and hints on how to approach these tasks can be found in the original assignment.
Below is our solution to the Schematron test:
Chapter titles must follow the format "CHAPTER #. Title."
The first chapter must appear on or after the first page, but instead appeared on page [ ]
Expected a strictly ascending sequence of @page values, but this chapter’s page number [ ] is less than the previous chapter’s page number [ ].
Expected even count of quotation marks, but found an odd amount [ ]. Check for unclosed quotations.
Expected chapter title number to correspond with the chapter’s position, but found chapter number " " within Chapter .
There are several good ways to approach these tasks other than those written in our solution. A solution that does not directly match ours or even use all the same functions we use is not necessarily incorrect. Whether your implementation was similar to ours or you went about it a different way, please take a look below for additional explanation for each part of the tasks.
In this part, we looked to raise an error whenever the titles included within the <chapter-title> elements did not follow this structure:
The string CHAPTER, followed by a space.
Then the chapter number, followed by a dot and then a space.
Then the chapter name, followed by a dot.
The Schematron that accomplishes this looks like the following:
Chapter titles must follow the format "CHAPTER #. Title."
Since we want to look at each <chapter-title> element individually and test whether or not it follows the desired formatting conventions, our @context value is chapter-title. While this constraint could also be applied by setting the @context of the rule to chapter and then pathing down to its <chapter-title> child for the @test, this would be slightly inaccurate: in Schematron, error messages are reported (that is, the dreaded red squiggly line in <oXygen/> appears) wherever the @context attribute is set, and so it is best to use the most specific context value for a given constraint where possible. This makes it easier for a user to locate where the invalidity is occurring within the XML. If the @context were equal to chapter, the error message would appear on the <chapter> element, and from there the user would have to figure out which specific elements within the chapter were incorrect. In contrast, an error message appearing on the <chapter-title> elements better directs a user to the source of the error and is more representative of the scope of the constraint.
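For comparison, the less specific alternative described above might be sketched as follows. This is an illustration and not part of our solution: the sch prefix and the xslt2 query binding are assumptions, the regular expression is the one explained in the paragraphs that follow, and the sketch assumes that each <chapter> has exactly one <chapter-title> child.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Less specific alternative: the context is the whole chapter, so a
         failure is reported on the chapter element, not on its title. -->
    <sch:rule context="chapter">
      <!-- The test has to path down from the chapter to its chapter-title child. -->
      <sch:assert test="matches(chapter-title, '^CHAPTER [0-9]+\. .+\.$')">Chapter titles must follow the format "CHAPTER #. Title."</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Both versions accept and reject the same titles; the only difference is the element on which the message is reported.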
Our solution uses matches(), which we call with two arguments. The first argument of matches() is a string, and the second argument is a regular expression. matches() checks whether the regular expression provided in the second argument matches anywhere within the string from the first argument: if it does, it returns true, and if it does not, it returns false. We chose the regular expression ^CHAPTER [0-9]+\. .+\.$, which checks for the leading string "CHAPTER", followed by a single space character, then one or more digits, then a dot and a space after the digits, then the text of the chapter title, and finally a dot after that text. Here are some considerations we made when creating this expression:
The expression should begin with the anchor ^ and end with the anchor $, since we want "CHAPTER" to be the first word that appears in the <chapter-title> and a literal dot to be the last character in the <chapter-title>. Without the anchors, characters appearing before "CHAPTER" or after the dot might be missed when we perform our test, since matches() only checks whether the regular expression occurs somewhere within our input string, but does not verify that the regular expression completely spans the provided string.
We chose to model our chapter numbers as [0-9]+, because even though we are only working in an excerpt that goes up to chapter 3, we would want our regular expression to permit multi-digit numbers should we ever apply our schema to a full version of the novel. A more restrictive version of this pattern could be [1-9][0-9]*, which disallows a chapter number of 0 (and any leading zeros), but since we verify this number again within our fourth constraint, we let this part of the expression be somewhat less restrictive in favor of better readability.
We chose to model the actual chapter title itself as .+, since each chapter’s title could have a variable number of words and special characters, and we wanted our rules to be applicable to any chapter in the novel, both within and beyond those featured in the excerpt. A more restrictive version of this pattern could exclude specific characters that should not appear in chapter titles, and it could require that each new word start with a capital letter.
Where we wanted literal dots to appear within the <chapter-title>, we made sure to escape the dot within our regular expression, since an unescaped dot in a regular expression matches any character.
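Putting these considerations together, the rule might be sketched as follows. Treat this as a reconstruction under assumptions, not as the exact text of our solution: the sch prefix and the xslt2 query binding are assumed, and the sketch assumes that the title contains no leading or trailing whitespace and no line breaks (by default, a dot in an XPath regular expression does not match a newline).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="chapter-title">
      <!-- Parts of the regular expression, in order:
           ^CHAPTER   the literal word at the very start, followed by one space
           [0-9]+     one or more digits (the chapter number)
           \.         an escaped, literal dot, followed by one space
           .+         one or more characters (the chapter name)
           \.$        a literal dot as the very last character -->
      <sch:assert test="matches(., '^CHAPTER [0-9]+\. .+\.$')">Chapter titles must follow the format "CHAPTER #. Title."</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Replacing [0-9]+ with [1-9][0-9]* in the test would give the more restrictive variant mentioned above.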
This part focused on making sure the page numbers found within the @page attributes within each <chapter> element increased in a strictly ascending order (the technical term for this is strictly monotonically increasing). Our solution was the following:
The first chapter must appear on or after the first page, but instead appeared on page [ ]
Expected a strictly ascending sequence of @page values, but this chapter’s page number [ ] is less than the previous chapter’s page number [ ].
In our solution, we set the @context value of our first rule to chapter[1], which allows us to perform a unique test on just the first chapter, and then to chapter on the second rule, which allows us to easily access both the @page value of that <chapter> and that of the preceding <chapter> sibling. We then perform the following two checks:
First, we check that the very first <chapter> in the document falls on a page number equal to or greater than one. This guarantees that our first chapter appears at a logical page number, since no chapter can begin at a zero or negative page number, but allows flexibility for it to appear on or after page one, since the first chapter might only begin after a title page or dedication page, for example, and thus not start exactly on one.
Since the first chapter has no preceding siblings, attempting to access its previous sibling’s page number, such as with the XPath expression preceding-sibling::chapter[1]/number(@page), will not yield a usable number: depending on how the expression is written, the result is either an empty sequence or NaN (not a number). NaN comparison works strangely: in XPath, comparing any number with NaN using lt, le, gt, ge, or eq returns false, so no number is less than, equal to, or greater than NaN. At the same time, NaN itself is not equal to NaN, nor is it less than NaN or greater than NaN. Even more strangely, NaN plus a number is still NaN, and so is not greater than, equal to, or less than NaN. Needless to say, NaN comparison can be very unintuitive, and so we avoid it here by not referencing the previous sibling and just comparing the current page number against the constant value of 1. (For a fascinating tour of the World of NaN, see "If it’s not a number, what is it? Demystifying NaN for the working programmer.")
Next, we make sure the rest of the chapters have monotonically increasing page numbers by making sure that a given chapter’s page number is greater than the page number of the chapter that came just before it. Because a node in Schematron can match with only a single <rule> within a <pattern>, the first <chapter> in the document never sees the <report> within the second rule because it has already matched with the first rule with context chapter[1]. Without fear of selecting a previous sibling that doesn’t exist, we can find the previous chapter’s page number and store this in a variable for easier referencing later on. We then make sure that the current @page is greater than the previous chapter’s @page, thus enforcing a strictly increasing order across all chapters.
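The two rules described above might be sketched as follows. This is a reconstruction under assumptions, not the exact text of our solution: the sch prefix, the xslt2 query binding, and the variable name previous-page are hypothetical, and the first rule is written here as an assert, which may differ from the original.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Rule 1: only the first chapter, compared against the constant 1,
         so no preceding sibling is ever consulted. -->
    <sch:rule context="chapter[1]">
      <sch:assert test="number(@page) ge 1">The first chapter must appear on or after the first page, but instead appeared on page [<sch:value-of select="@page"/>]</sch:assert>
    </sch:rule>
    <!-- Rule 2: every remaining chapter. The first chapter has already
         matched Rule 1, so it is never tested here. -->
    <sch:rule context="chapter">
      <!-- Page number of the nearest preceding chapter, cast to a number
           (previous-page is a hypothetical variable name). -->
      <sch:let name="previous-page" value="preceding-sibling::chapter[1]/number(@page)"/>
      <!-- Fires when the current page is not greater than the previous one. -->
      <sch:report test="number(@page) le $previous-page">Expected a strictly ascending sequence of @page values, but this chapter’s page number [<sch:value-of select="@page"/>] is less than the previous chapter’s page number [<sch:value-of select="$previous-page"/>].</sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because the sequence is meant to be strictly ascending, this sketch also reports two consecutive chapters that share the same page number.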
For both the above tests, we made sure to cast the @page values we found to a numeric data type with the XPath function number(). This is an important step because comparisons between untyped or string values can differ from numerical comparison. As an example, "11" lt "7" (note the quotation marks, which mean that we’re dealing with strings, and not numerical values) evaluates to true, since in string comparison the first character of "11", the digit 1, is less than the digit 7, but number("11") lt number("7") (where the number() function casts the strings to numerical values) evaluates to the expected value of false.
A solution that works within this document but is not ideal is to manually select the first, second, and third <chapter> and make sure that the @page of chapter[1] is less than that of chapter[2], which in turn is less than that of chapter[3]. While this performs the validation we intend it to, what if we were working across the whole novel and had to then manually select all 135 chapters of Moby Dick? Typing out 135 unique XPath expressions would get tiring quickly. Instead, using a method like the one above ensures that our tests will work across all chapters, regardless of how many there are in our XML.
Part three looked to ensure that quotation marks within the body were balanced. Our solution looked like this:
Expected even count of quotation marks, but found an odd amount [ ]. Check for unclosed quotations.
Since a closing quotation mark should always fall in the same paragraph as its opening quotation mark within this excerpt, we set our @context value to p.
This solution borrows heavily from the solution to Schematron assignment 2: we repeat the same method of using translate() and string-length() to count how many quotation marks are in a given <p>.
Since quotation marks are already reserved characters in XPath to delineate string values, we use the string representation &quot; to select the quotation mark in the second argument of translate(). An equally effective strategy is to escape the quotation mark with another quotation mark: we reference a similar method involving apostrophes in our explanation to XSLT assignment 4.
Once we’ve counted how many quotation marks exist within the paragraph, we then mod that value by 2. If the number is divisible by 2, and thus even, it will have a remainder of 0, but if the number is not divisible by 2, and thus odd, it will have a remainder of 1. We use this property in our @test to assert that the number of quotation marks in a paragraph must be even, and if it is even, we take the quotation marks to be balanced within that paragraph.
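The counting and parity test described above might be sketched as follows. This is a reconstruction under assumptions, not the exact text of our solution: the sch prefix, the xslt2 query binding, and the variable name quote-count are hypothetical, and the sketch counts only straight double quotation marks.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="p">
      <!-- translate() deletes every quotation mark, so the difference between
           the two string lengths is the number of quotation marks in the
           paragraph (quote-count is a hypothetical variable name). -->
      <sch:let name="quote-count" value="string-length(.) - string-length(translate(., '&quot;', ''))"/>
      <!-- A remainder of 0 after dividing by 2 means the count is even. -->
      <sch:assert test="$quote-count mod 2 eq 0">Expected even count of quotation marks, but found an odd amount [<sch:value-of select="$quote-count"/>]. Check for unclosed quotations.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

For a paragraph containing three quotation marks, the count is 3, 3 mod 2 is 1, and the assertion fails, so the message is reported on that <p>.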
The extra credit question involved making sure a given <chapter-title> correlated correctly to the offset position of that <chapter> element within the excerpt. Our solution looked like the following:
Expected chapter title number to correspond with the chapter’s position, but found chapter number " " within Chapter .
Here, having our @context set to chapter is important because we want to be able to reference the position() of a given chapter against the number listed in its <chapter-title>.
We can’t compare the <chapter-title> to the position() of the chapter right away because the chapter number in the title is sandwiched between some other text. To extract it, we used substring-before() and substring-after() to separate out the number from the rest of the title text. From there, we can compare it to the chapter’s position and assert that they must be equal. As with task 2, we make sure to cast our extracted chapter number from a string to a numerical value with the number() function to eliminate unexpected behavior when comparing untyped or string values. We do not need to cast the result of position() since the result of position() is already returned as a numerical value.
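The extraction and comparison described above might be sketched as follows. This is a reconstruction under assumptions, not the exact text of our solution: the sch prefix, the xslt2 query binding, and the variable names title-number and chapter-position are hypothetical. Note also that, in place of position(), this sketch computes the chapter’s position by counting its preceding <chapter> siblings, a substitution made so that the example does not depend on how a particular processor sets the context position.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="chapter">
      <!-- For a title such as "CHAPTER 1. Loomings." substring-after() leaves
           "1. Loomings." and substring-before() then keeps only "1". -->
      <sch:let name="title-number" value="substring-before(substring-after(chapter-title, 'CHAPTER '), '.')"/>
      <!-- Position of this chapter among its chapter siblings (used in this
           sketch instead of position(); see the note above). -->
      <sch:let name="chapter-position" value="count(preceding-sibling::chapter) + 1"/>
      <!-- number() casts the extracted string before the numeric comparison. -->
      <sch:assert test="number($title-number) eq $chapter-position">Expected chapter title number to correspond with the chapter’s position, but found chapter number "<sch:value-of select="$title-number"/>" within Chapter <sch:value-of select="$chapter-position"/>.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

If the title does not contain the expected text, the extracted string is empty, number() returns NaN, the comparison is false, and the assertion fails, which is the desired outcome for a malformed title.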