Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:54+0000
Linguistic corpora often record transcriptions in multiple tiers, such as a transcription of the original utterance, a word-by-word gloss with grammatical information, and a more fluid, natural-language translation. The set of notational conventions most commonly used for this purpose by corpus linguists has been codified in the Leipzig Glossing Rules (http://www.eva.mpg.de/lingua/resources/glossing-rules.php). Here is a Russian example from that document (modified to add the original orthography tier):
Orth | Мы | с | Марко | поеха-л-и | автобус-ом | в | Переделкино |
Translit | My | s | Marko | poexa-l-i | avtobus-om | v | Peredelkino. |
Gram | 1PL | COM | Marko | go-PST-PL | bus-INS | ALL | Peredelkino. |
ILG | we | with | Marko | go-PST-PL | bus-by | to | Peredelkino. |
Free | 'Marko and I went to Peredelkino by bus.' |
(COM = comitative; ALL = allative)
Other tiers might include an International Phonetic Alphabet (IPA) transcription, as well as interlinear glosses or free translations into other languages.
What Schematron can validate: Each of the computationally tractable tiers (in the example above, every tier except the free translation) should have the same number of words, and corresponding words in different tiers should have the same number of hyphens.
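The alignment check described above might be sketched in Schematron as follows. The markup is an assumption made only for illustration: the element names example and tier and the type attribute are hypothetical, since no XML encoding of the tiers is given here. The sketch also assumes XPath 2.0 (queryBinding="xslt2") and white-space-separated words.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Hypothetical markup: <example> wraps one <tier type="..."> per tier,
         and words within a tier are separated by white space. -->
    <!-- The free translation is skipped because it does not align word by word. -->
    <sch:rule context="example/tier[not(@type = 'free')]">
      <!-- The first aligned tier serves as the reference for all the others. -->
      <sch:let name="refWords"
        value="tokenize(normalize-space(../tier[not(@type = 'free')][1]), ' ')"/>
      <sch:let name="words" value="tokenize(normalize-space(.), ' ')"/>
      <!-- Hyphens per word: length with hyphens minus length without them. -->
      <sch:let name="refHyphens"
        value="for $w in $refWords
               return string-length($w) - string-length(translate($w, '-', ''))"/>
      <sch:let name="hyphens"
        value="for $w in $words
               return string-length($w) - string-length(translate($w, '-', ''))"/>
      <sch:assert test="count($words) eq count($refWords)">
        Tier "<sch:value-of select="@type"/>" has <sch:value-of select="count($words)"/>
        words, but the first tier has <sch:value-of select="count($refWords)"/>.
      </sch:assert>
      <sch:assert test="deep-equal($hyphens, $refHyphens)">
        The hyphen counts per word in tier "<sch:value-of select="@type"/>"
        (<sch:value-of select="$hyphens"/>) do not match those of the first tier
        (<sch:value-of select="$refHyphens"/>).
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because the comparison looks only at white space and hyphens, punctuation differences between tiers (such as a final period that appears in one tier but not another) do not trigger an error. When the word counts differ, both assertions are likely to fire, since the two sequences of hyphen counts then differ in length as well.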
The Rusian genealogy project at http://genealogy.obdurodon.org/ is an XML database. There are entries for people and for marriages, where a marriage contains pointers to the participants and to their offspring.
What Schematron can validate: The targets of all pointers must exist: husbands, wives, children, parents.
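A pointer check of this kind might look like the following minimal sketch. The element and attribute names (person with an id attribute, and marriage containing husband, wife, and child elements with a ref attribute) are hypothetical and are not taken from the project's actual schema. The sketch also assumes that people and marriages are in the same document.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Hypothetical markup: every <person> carries an @id, and a <marriage>
         points to people with <husband ref="..."/>, <wife ref="..."/>,
         and <child ref="..."/>. -->
    <sch:rule context="marriage/*[@ref]">
      <!-- The general comparison succeeds if @ref equals any person's @id. -->
      <sch:assert test="@ref = //person/@id">
        The <sch:name/> pointer "<sch:value-of select="@ref"/>" does not match
        the id of any person.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

If the entries are stored in separate files, the right-hand side of the comparison would have to reach across documents, for example with doc() or collection(), and the details of that depend on how the database is organized.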
In the Annotated Afanas′ev Library (http://aal.obdurodon.org), page ranges are entered as follows:
<text-pages-r> <start>36</start> <end>37</end> </text-pages-r>
What Schematron can validate: The value of the contents of the <end> element must be greater than or equal to the value of the contents of the <start> element.
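The page-range constraint might be sketched as a Schematron rule like the one below. It uses the element names from the sample above and assumes XPath 2.0 (queryBinding="xslt2") and purely numeric page values.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <sch:rule context="text-pages-r">
      <!-- Convert both values to numbers before comparing them. -->
      <sch:assert test="number(end) ge number(start)">
        The end page (<sch:value-of select="end"/>) must not be less than
        the start page (<sch:value-of select="start"/>).
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

The explicit number() conversion matters: in XPath 2.0, comparing the two elements directly in a document without a schema compares them as strings, so a range from 99 to 100 would be reported as an error. If either value is not numeric, number() returns NaN and the assertion fails, which also flags the entry.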