Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2023-04-07T15:48:55+0000
This assignment is situated in the context of a Real Life linguistic documentation project for which we were asked to provide some XML assistance. Your assignment involves writing a bit of Schematron in the middle of that workflow, but to explain why it was necessary, we first describe both the linguistic project itself and the eventual XML conversion that the Schematron was used to facilitate.
Here’s a quote from http://dh.obdurodon.org/schematron-class-01.html (simplified slightly):
Linguistic corpora often record transcriptions in multiple tiers, such as a transcription of the original utterance, a word-by-word gloss with grammatical information, and a more fluid, natural-language translation. The set of notational conventions most commonly used for this purpose by corpus linguists has been codified in the Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossing-rules.php). Here is a Russian example based on that document:
Orth | Мы | с | Марко | поеха-л-и | автобус-ом | в | Переделкино |
Translit | My | s | Marko | poexa-l-i | avtobus-om | v | Peredelkino. |
ILG | we | with | Marko | go-PST-PL | bus-by | to | Peredelkino. |
Free | 'Marko and I went to Peredelkino by bus.' |
Other tiers might include International Phonetic Alphabet (IPA) and interlinear glossing or free translation into other languages.
Each of the computationally tractable tiers (everything except Free) should have the same number of words, and each word should have the same number of hyphens as the corresponding word on the other tiers.
Field linguists often type up this information in plain text, so that their starting point is something like:
Orth: Мы с Марко поеха-л-и автобус-ом в Переделкино
Translit: My s Marko poexa-l-i avtobus-om v Peredelkino.
ILG: we with Marko go-PST-P bus-by to Peredelkino.
Free: Marko and I went to Peredelkino by bus.
Assume that you can get the raw text into the following XML easily:
<sentence>
<orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
<translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
<ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
<free>Marko and I went to Peredelkino by bus.</free>
</sentence>
You don’t have to do the following for this assignment, but once you’ve learned a bit about regular expressions and the XPath tokenize() function, you would be able to write XSLT to convert this XML to a different XML structure, one where the pieces are aligned properly, that is, so that every word and morpheme on the Orth, Translit, and ILG (interlinear gloss) tier is associated with the corresponding word or morpheme on the other tiers (except the Free tier, which isn’t expected to match up; it’s a free translation, after all). But that works only if the person who entered the data originally got the spaces and hyphens right! If the number of spaces and hyphens doesn’t match up in the Orth, Translit, and ILG tiers, you can’t automate the alignment.
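To make the dependence on correct spaces and hyphens concrete, here is a minimal sketch of the alignment step, written in Python rather than in the XSLT mentioned above. The function name align_tiers and the choice to raise an error on a mismatch are assumptions made for illustration; they are not part of the original project.

```python
def align_tiers(orth, translit, ilg):
    """Pair up words, and then morphemes, across three tiers.

    Hypothetical helper: raises ValueError when the tiers disagree on the
    number of words, or when corresponding words disagree on hyphens.
    """
    # Split every tier into words on single space characters.
    tiers = [tier.split(" ") for tier in (orth, translit, ilg)]
    if len({len(words) for words in tiers}) != 1:
        raise ValueError("tiers have different numbers of words")

    aligned = []
    for word_group in zip(*tiers):
        # Split each corresponding word into morphemes on hyphens.
        morphemes = [word.split("-") for word in word_group]
        if len({len(parts) for parts in morphemes}) != 1:
            raise ValueError(f"hyphen mismatch in {word_group}")
        aligned.append(list(zip(*morphemes)))
    return aligned


result = align_tiers(
    "Мы с Марко поеха-л-и автобус-ом в Переделкино",
    "My s Marko poexa-l-i avtobus-om v Peredelkino.",
    "we with Marko go-PST-P bus-by to Peredelkino.",
)
print(result[3])
# [('поеха', 'poexa', 'go'), ('л', 'l', 'PST'), ('и', 'i', 'P')]
```

If a single hyphen in any tier is typed as a space (or the other way around), the function stops with an error instead of producing an alignment, which is the situation the Schematron check is meant to catch before the transformation is attempted.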
When we had to perform this type of plain-text-to-XML conversion for a real linguistic documentation project, the linguists’ initial, raw field notes had lots of errors: spaces instead of hyphens and vice versa, as well as other punctuation (periods, hash marks, etc.) in place of both spaces and hyphens. This is typical of field data; it’s hard for a human to pay attention to counting spaces and punctuation marks, which is why we use markup languages in projects of this sort in the first place. Before we even tried to transform the data with XSLT into something that formalized the word-by-word and morpheme-by-morpheme alignment, we used Schematron to verify that the number of spaces and hyphens matched where it needed to. Passing Schematron validation doesn’t guarantee that the data is free of mistakes, of course, but it greatly reduces the risk of overlooking one: the counts would fool us only if we made the same error (or the same type of error) in every associated tier, or if we made errors that cancelled each other out.
Your assignment, then, is to write a Schematron schema that will take input like:
<sentence>
<orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
<translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
<ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
<free>Marko and I went to Peredelkino by bus.</free>
</sentence>
<sentence>
<orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
<translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
<ilg>we with Marina-INS go-PST-P train-by to Kazan.</ilg>
<free>Marina and I went to Kazan by train.</free>
</sentence>
and verify that the first three lines (Orth, Translit, and ILG) of each <sentence> all have the same number of spaces and the same number of hyphens. You do not have to convert this XML to word-aligned or morpheme-aligned XML; all you have to do is write the Schematron that will verify whether the spaces and hyphens match. The verification is a prerequisite for the transformation, which would be the next step in Real Life, but for a Schematron assignment all you have to do is … well … write the Schematron.
Perhaps surprisingly, there is no basic XPath function that, given a string in which to look (like a sentence) and a character for which to look (like a space or a hyphen), will count the number of times the character appears in the string. Since there isn’t a single pre-existing function that will do that sort of counting, we’ll have to write our own, and here are three XPath-idiomatic ways to think about how to approach that task:
You can use the translate() function to translate, for example, all space characters into nothing, that is, to delete them. See the third item in the itemized “Effects” list in Kay, p. 897, and the short paragraph below it, as well as the “Usage and examples” section on the following page. You can use the string-length() function to count the number of characters in a string. This means that you compute the length of the initial string, strip out the space characters with translate(), and then compute the length again. Subtracting the second length from the first gives the number of space characters. You can do the same with hyphens.
You can split each string into pieces with the tokenize() function, which you may want to look up in Kay, since you’ll need to know not only what to split up, but also how to specify whether you’re splitting on spaces, hyphens, or something else. You can then count the number of tokens with count(), and the number will be one greater than the number of spaces (or hyphens, depending on what you used to split) between them. You don’t actually have to adjust for that difference of one because the point is to compare the three tiers: as long as all of the counts are off by the same amount, you can still compare them.
You can use the string-to-codepoints() function to explode a string into a sequence of numerical values, one for each character. You then use the index-of() function to find the offset positions of all of the instances of a particular character in that sequence (you can use string-to-codepoints() on the character you’re looking for, and not just the string inside which you’re looking, so you don’t have to work explicitly with the numbers). You don’t care about the specific positions, but you do care about how many position values will be returned, since that will be equal to the number of times the target character appears in the string, and you can count the number of items in the sequence returned by index-of() by using the count() function. You’ll need to look up the unfamiliar functions in Kay in order to learn how they work.
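The arithmetic behind the three strategies can be sketched outside XPath. The following Python is only an analogy, not the Schematron you are asked to write: the helper names are hypothetical, and each Python operation stands in for the XPath function named in its comment, whose exact behavior you should still confirm in Kay.

```python
def count_by_deletion(text, char):
    # Analogue of string-length() before and after translate() deletes char:
    # the difference in length is the number of deleted characters.
    return len(text) - len(text.replace(char, ""))


def count_by_splitting(text, char):
    # Analogue of count(tokenize(...)): the number of pieces is one greater
    # than the number of separators, so subtract one to get the real count.
    return len(text.split(char)) - 1


def count_by_positions(text, char):
    # Analogue of count(index-of(string-to-codepoints(...), ...)):
    # turn the string into numbers, collect the positions that match the
    # target character's number, and count those positions.
    target = ord(char)
    codepoints = [ord(c) for c in text]
    positions = [i for i, cp in enumerate(codepoints, start=1) if cp == target]
    return len(positions)


translit = "My s Marko poexa-l-i avtobus-om v Peredelkino."
for counter in (count_by_deletion, count_by_splitting, count_by_positions):
    print(counter.__name__, counter(translit, " "), counter(translit, "-"))
# Each line reports 6 spaces and 3 hyphens.
```

All three helpers agree on this tier. For the comparison across tiers, the subtraction in the second helper is optional, for the reason given above: an offset that is the same on every tier does not affect whether the counts match.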
Any of the preceding three strategies will do the job, so choose the one you find easiest to understand. Don’t try to use any of them without first doing what we do when we have to work with functions we haven’t used before, though: look them up in Kay, read the descriptions, and look at the examples. You can test the functions separately in the <oXygen/> XPath toolbar to ensure that you understand how they work.
To test your Schematron rules, create your own small sample XML document, with a handful of sentences formatted like the example above, with each tier in its own element but no internal markup separating words or morphemes. Be sure to create a wrapper element that contains multiple <sentence> children to ensure that you are able to process each sentence separately. You can make up your own examples in a language of your choice or copy examples from http://www.eva.mpg.de/lingua/resources/glossing-rules.php. If you make up your own examples, don’t worry about the precision of your linguistic annotations; this is an exercise in Schematron, and not in field linguistics. It doesn’t matter what tiers you use, as long as you have at least two that have spaces and hyphens in them that are supposed to correspond. You should also make copies of some of your examples, muck up the spaces and hyphenation, and use that bad data to test whether your Schematron schema can catch the errors.
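When you build the bad data, it helps to know in advance which sentences your schema ought to flag. The Python sketch below is one possible cross-check and is not a substitute for the Schematron: the wrapper element name sentences, the function name, and the second (deliberately broken) sentence are all assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the wrapper name <sentences> is an assumption.
# The second sentence is a mucked-up copy with a space typed for a hyphen
# in the ILG tier ("go PST-P" instead of "go-PST-P").
SAMPLE = """<sentences>
<sentence>
<orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
<translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
<ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
<free>Marko and I went to Peredelkino by bus.</free>
</sentence>
<sentence>
<orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
<translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
<ilg>we with Marko go PST-P bus-by to Peredelkino.</ilg>
<free>Marko and I went to Peredelkino by bus.</free>
</sentence>
</sentences>"""


def mismatched_sentences(xml_text, tiers=("orth", "translit", "ilg")):
    """Return the 1-based positions of sentences whose tiers disagree."""
    root = ET.fromstring(xml_text)
    bad = []
    for position, sentence in enumerate(root.iter("sentence"), start=1):
        texts = [sentence.findtext(tier, default="") for tier in tiers]
        # A sentence passes only if every tier has the same counts.
        space_counts = {text.count(" ") for text in texts}
        hyphen_counts = {text.count("-") for text in texts}
        if len(space_counts) > 1 or len(hyphen_counts) > 1:
            bad.append(position)
    return bad


print(mismatched_sentences(SAMPLE))  # [2]
```

If your Schematron reports a problem for the same sentences that a check like this lists, and only for those, that is reasonable evidence that your rules process each sentence separately.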
You should turn in your solution to the above assignment in a Schematron schema file, that is, a file with the extension .sch, along with the XML document that you validated with your Schematron.