Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-12-27T22:03:54+0000


Using Schematron in editing

Background

Because scholars often feel passionate about defending their arguments, academic journals are frequently home to polemical exchanges of responses, rejoinders and ripostes, some of which may seem surprisingly strident. You can see an example of the genre from my own field (Slavic linguistics) at https://openaccess.leidenuniv.nl/bitstream/handle/1887/1902/344_073.pdf. (Biographic aside: Horace Lunt, to whom Frederik Kortlandt is responding in this article, was my professor.)

Because professors may be curiously insensitive to the virtues of brevity, rejoinders can easily outstrip in size the articles to which they are responding, and in a cascading polemic they can seem to grow insatiably. In an effort to combat this tendency, and to build in a natural mechanism for winding down such exchanges, a few journals have instituted size limits such as “authors are welcome to respond, but the response must be no longer than half the length of the original.” The hope is that one or the other of the participants in such a discussion will walk away before it deteriorates into something like “So’s your old man!” (4 words), “Sez you!” (2), and “Jerk” (1).

Schema and sample XML document instance

Imagine that you are the editor of an academic journal where submissions are managed in XML, and you are responsible for enforcing a “no more than half the length” policy like the one described above. The structure of your journal is described by the following Relax NG schema (in compact syntax):

start = journal
journal = element journal { issue+ }
issue = element issue { article+ }
article = element article { title, author, date, content }
content = element content { p+ }
title = element title { text }
author = element author { text }
date = element date { xsd:date }
p = element p { text } 

(This is clearly a simplification. In real life the structure would allow emphasis, bibliography, footnotes, etc.)

Here is an XML file that is valid against that schema. For convenience it contains only one issue, but in Real Life there might be multiple <issue> elements:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="journal.rnc" type="application/relax-ng-compact-syntax"?>
<journal>
    <issue>
        <article>
            <title>My favorite XPath function</title>
            <author>Eric Gratta</author>
            <date>2012-09-05</date>
            <content>
                <p>You can do anything with tokenize(). You can divide
                text into words, breaking on white space, on punctuation, or in
                any other way that meets your needs. You can even use tokenize()
                in a nested context, perhaps breaking on white space and then,
                within each white-space-delimited word, on hyphens. How cool is
                that!</p>
            </content>
        </article>
        <article>
            <title>Reflections on tokenize()</title>
            <author>David J. Birnbaum</author>
            <date>2012-10-10</date>
            <content>
                <p>Gratta’s attention to tokenize() is misplaced. matches() is much more useful.</p>
                <p>Sometimes, though, contains() alone is enough.</p>
            </content>
        </article>
        <article>
            <title>On the relative merits of tokenize() and matches()</title>
            <author>Eric Gratta</author>
            <date>2012-10-23</date>
            <content>
                <p>No way matches() is as cool as tokenize()!</p>
            </content>
        </article>
    </issue>
</journal>

The task

Your task is to write a Schematron schema that ensures that the <content> section of each <article> is no more than half the length of the <content> section of the immediately preceding <article>. To simplify the task for pedagogical purposes, we assume that every article in the XML document is part of the polemic, so you don’t have to worry about unrelated articles that wouldn’t be governed by the length restriction. We have also put the articles in chronological order, so you may assume that “the immediately preceding article” means the article that immediately precedes the current one both in order of publication and in document order in the XML document.

Three complications:

  1. If you compare the length of the current article’s content to that of the preceding article, you risk mishandling the first article, which has no preceding article and therefore no point of comparison.
  2. Any article after the second has multiple preceding ones, and you want to compare its length only to that of the immediately preceding one.
  3. You may measure length as either the number of characters or the number of words. In either case, though, you don’t want to include insignificant white space, such as the extra characters <oXygen/> introduces when pretty-printing. You can make those go away with the XPath normalize-space() function, which strips leading and trailing white space and collapses each internal run of white space into a single space. If you haven’t used this function, this would be a good time to look it up in Michael Kay. We use it all the time, and you’ll probably need it for your projects.
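
Each of the three complications can be addressed with a small piece of XPath. What follows is a minimal sketch of one possible approach, not the only (or necessarily the intended) solution. It assumes the xslt2 query binding, so that XPath 2.0 is available, and it measures length in characters. The variable names current-length and previous-length are our own inventions for illustration, as is the wording of the message.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Complication 1: the rule fires only for an article that has at least
             one preceding sibling article, so the first article is never tested. -->
        <sch:rule context="article[preceding-sibling::article]">
            <!-- Complication 3: normalize-space() removes insignificant white space
                 before the length is measured (hypothetical variable name). -->
            <sch:let name="current-length"
                value="string-length(normalize-space(content))"/>
            <!-- Complication 2: preceding-sibling is a reverse axis, so [1] selects
                 the nearest preceding article, not the first one in the issue. -->
            <sch:let name="previous-length"
                value="string-length(normalize-space(preceding-sibling::article[1]/content))"/>
            <!-- Doubling the current length avoids dividing an odd number by two. -->
            <sch:assert test="$current-length * 2 le $previous-length">
                A response may be no more than half the length of the article it answers.
                This one is <sch:value-of select="$current-length"/> characters long, and
                the preceding one is <sch:value-of select="$previous-length"/>.
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

To measure in words instead of characters, one option is to replace string-length(normalize-space(...)) with count(tokenize(normalize-space(...), ' ')) in both variables. Because normalize-space() has already reduced every run of white space to a single space, splitting on a single space is enough.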