Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-07T15:48:55+0000

Schematron assignment #2


This assignment is situated in the context of a Real Life linguistic documentation project in which we were asked to provide some XML assistance. Your assignment involves writing a bit of Schematron in the middle, but to explain why it was necessary we first describe both the linguistic project itself and the eventual XML conversion that the Schematron was ultimately used to facilitate.

The problem

Here’s a quote (simplified slightly):

Linguistic corpora often record transcriptions in multiple tiers, such as a transcription of the original utterance, a word-by-word gloss with grammatical information, and a more fluid, natural-language translation. The set of notational conventions most commonly used for this purpose by corpus linguists has been codified in the Leipzig Glossing Rules. Here is a Russian example based on that document:

Orth Мы с Марко поеха-л-и автобус-ом в Переделкино
Translit My s Marko poexa-l-i avtobus-om v Peredelkino.
ILG we with Marko go-PST-PL bus-by to Peredelkino.
Free 'Marko and I went to Peredelkino by bus.'

Other tiers might include International Phonetic Alphabet (IPA) and interlinear glossing or free translation into other languages.

Each of the computationally tractable tiers (everything except Free) should have the same number of words, and each word should have the same number of hyphens as the corresponding word on the other tiers.

From field notes to markup

Field linguists often type up this information in plain text, so that their starting point is something like:

Orth: Мы с Марко поеха-л-и автобус-ом в Переделкино
Translit: My s Marko poexa-l-i avtobus-om v Peredelkino.
ILG: we with Marko go-PST-PL bus-by to Peredelkino.
Free: Marko and I went to Peredelkino by bus.

Assume that you can get the raw text into the following XML easily:

    <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
    <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
    <ilg>we with Marko go-PST-PL bus-by to Peredelkino.</ilg>
    <free>Marko and I went to Peredelkino by bus.</free>

You don’t have to do the following for this assignment, but once you’ve learned a bit about regular expressions and the XPath tokenize() function, you will be able to write XSLT to convert this XML to a different XML structure, one where the pieces are aligned properly, that is, where every word and morpheme on the Orth, Translit, and ILG (interlinear gloss) tiers is associated with the corresponding word or morpheme on the other tiers (except the Free tier, which isn’t expected to match up; it’s a free translation, after all). But that works only if the person who entered the data originally got the spaces and hyphens right! If the numbers of spaces and hyphens don’t match up in the Orth, Translit, and ILG tiers, you can’t automate the alignment.

The task: use Schematron to get your data ready for XML-to-XML conversion

When we had to perform this type of plain-text-to-XML conversion for a real linguistic documentation project, the linguists’ initial, raw field notes had lots of errors: spaces instead of hyphens and vice versa, as well as other punctuation (periods, hash marks, etc.) in place of both spaces and hyphens. This is typical field data; it’s hard for a human to pay attention to counting spaces and punctuation marks, which is why we use markup languages in projects of this sort in the first place. Before we even tried to transform the data with XSLT to something that formalized the word-by-word and morpheme-by-morpheme alignment, we used Schematron to verify that the number of spaces and hyphens matched where it needed to. Schematron validation doesn’t mean that we can’t still have a mistake, of course, but it greatly reduces the risk of not noticing an error, since only if we were to make the same error (or the same type of error) in every associated tier, or if we made errors that cancelled each other out, would we fool the counter.

Your assignment, then, is to write a Schematron schema that will take input like:

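    <sentences>
        <sentence>
            <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
            <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
            <ilg>we with Marko go-PST-PL bus-by to Peredelkino.</ilg>
            <free>Marko and I went to Peredelkino by bus.</free>
        </sentence>
        <sentence>
            <orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
            <translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
            <ilg>we with Marina-INS go-PST-PL train-by to Kazan.</ilg>
            <free>Marina and I went to Kazan by train.</free>
        </sentence>
    </sentences>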


and verify that the first three lines (Orth, Translit, and ILG) of each <sentence> all have the same number of spaces and the same number of hyphens. You do not have to convert this XML to word-aligned or morpheme-aligned XML; all you have to do is write the Schematron that will verify whether the spaces and hyphens match. The verification is a prerequisite for the transformation, which would be the next step in Real Life, but for a Schematron assignment all you have to do is … well … write the Schematron.

Perhaps surprisingly, there is no basic XPath function that, given a string in which to look (like a sentence) and a character for which to look (like a space or a hyphen), will count the number of times the character appears in the string. Since there isn’t a single pre-existing function that will do that sort of counting, we’ll have to write our own, and here are three XPath-idiomatic ways to think about how to approach that task:
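As an illustrative sketch (assuming XPath 2.0 or later, and not necessarily the exact three strategies meant here), a character can be counted by (1) deleting it with translate() and comparing string lengths, (2) splitting on it with tokenize() and counting the pieces, or (3) exploding the string with string-to-codepoints() and counting the matching code points. The variable name $s and the sample string are made up for the illustration:

```xpath
(: Each item counts the hyphens in $s; swap '-' for ' ' (and 45 for 32) to count spaces :)
for $s in 'poexa-l-i avtobus-om'
return (
    (: 1. delete the character and measure how much shorter the string gets :)
    string-length($s) - string-length(translate($s, '-', '')),
    (: 2. split on the character; n separators leave n + 1 pieces :)
    count(tokenize($s, '-')) - 1,
    (: 3. explode into code points and keep those equal to 45, the hyphen :)
    count(string-to-codepoints($s)[. = 45])
)
(: evaluates to the sequence 3, 3, 3 :)
```

Two caveats about the second approach: the second argument of tokenize() is a regular expression, so characters with a special meaning there would need escaping, and tokenize() returns an empty sequence for an empty string, so the subtraction would yield -1 rather than 0 in that case.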

Any of the preceding three strategies will do the job, so choose the one you find easiest to understand. Don’t try to use any of them without first doing what we do when we have to work with functions we haven’t used before, though: look them up in Kay, read the descriptions, and look at the examples. You can test the functions separately in the <oXygen/> XPath toolbar to ensure that you understand how they work.
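To show how such a count fits into a schema, here is a minimal sketch, not a model answer. It assumes the element names used above (sentence, orth, translit, ilg), no namespace on the input, and an XPath 2.0 query binding; the variable names space-counts and hyphen-counts are invented for the example, and the translate() approach could be swapped for any other counting strategy.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Fire once per sentence so that each one is checked independently -->
        <sch:rule context="sentence">
            <!-- One count per aligned tier; the free tier is deliberately left out -->
            <sch:let name="space-counts"
                value="for $tier in (orth, translit, ilg)
                       return string-length($tier) - string-length(translate($tier, ' ', ''))"/>
            <sch:let name="hyphen-counts"
                value="for $tier in (orth, translit, ilg)
                       return string-length($tier) - string-length(translate($tier, '-', ''))"/>
            <!-- If every tier agrees, the counts collapse to a single distinct value -->
            <sch:assert test="count(distinct-values($space-counts)) eq 1">
                Space counts differ across orth, translit, and ilg:
                <sch:value-of select="string-join(for $n in $space-counts return string($n), ', ')"/>
            </sch:assert>
            <sch:assert test="count(distinct-values($hyphen-counts)) eq 1">
                Hyphen counts differ across orth, translit, and ilg:
                <sch:value-of select="string-join(for $n in $hyphen-counts return string($n), ', ')"/>
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Comparing distinct values keeps the rule short, but it only says that the tiers disagree, not which one is wrong; pairwise assertions would give more pointed messages at the cost of more repetition. Note also that this counts literal space characters, so a doubled space counts as two.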

To test your Schematron rules, create your own small sample XML document, with a handful of sentences formatted like the example above, with each tier in its own element but no internal markup separating words or morphemes. Be sure to create a wrapper element that contains multiple <sentence> children to ensure that you are able to process each sentence separately. You can make up your own examples in a language of your choice or copy examples from elsewhere. If you make up your own examples, don’t worry about the precision of your linguistic annotations; this is an exercise in Schematron, and not in field linguistics. It doesn’t matter what tiers you use, as long as you have at least two that have spaces and hyphens in them that are supposed to correspond. You should also make copies of some of your examples, muck up the spaces and hyphenation, and use that bad data to test whether your Schematron schema can catch the errors.
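A test document along these lines might look like the following sketch. The wrapper name sentences is an assumption (any wrapper will do), and the second sentence is a deliberately damaged copy of the first, so a working schema should accept the first and flag the second.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sentences>
    <!-- Clean: 6 spaces and 3 hyphens on each of orth, translit, and ilg -->
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-PL bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
    <!-- Damaged: a space replaces the first hyphen in translit (poexa l-i),
         so translit has 7 spaces and 2 hyphens while the other tiers have 6 and 3 -->
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-PL bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
</sentences>
```

It is worth adding further damaged copies that break only one property at a time, such as an extra hyphen with the spaces left intact, to confirm that the space check and the hyphen check each fire on their own.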

What to submit

You should turn in your solution to the above assignment in a Schematron schema file, that is, a file with the extension .sch, along with the XML document that you validated with your Schematron.