Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-12-05T20:52:53+0000


Schematron assignment #2: answers

The assignment

Write a Schematron schema that will take input like:

<sentence>
    <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
    <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
    <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
    <free>Marko and I went to Peredelkino by bus.</free>
</sentence>

and verify that the first three tiers (<orth>, <translit>, and <ilg>) all have the same number of spaces and the same number of hyphens.

An enlarged example

In order to verify that we could test our Schematron against a document that contained more than one sentence, we created a slightly enlarged example:

<?xml version="1.0" encoding="UTF-8"?>
<stuff>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
    <sentence>
        <orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
        <translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
        <ilg>we with Marina-INS go-PST-P train-by to Kazan.</ilg>
        <free>Marina and I went to Kazan by train.</free>
    </sentence>
</stuff>

If there’s a discrepancy between two tiers, it seemed most natural to report that as an error at the level of the sentence, rather than of one or another of the tiers, since although Schematron can easily recognize when the tiers don’t agree, there’s no way for it to tell which of the discrepant tiers contains an error.

A basic solution

Here is our simple solution:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces"
                value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces"
                value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthHyphens"
                value="string-length(orth) - string-length(translate(orth,'-',''))"/>
            <let name="translitHyphens"
                value="string-length(translit) - string-length(translate(translit,'-',''))"/>
            <let name="ilgHyphens" value="string-length(ilg) - string-length(translate(ilg,'-',''))"/>
            <report
                test="($orthSpaces, $translitSpaces, $ilgSpaces) != avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit
                    (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"
                />)</report>
            <report
                test="($orthHyphens, $translitHyphens, $ilgHyphens) != avg(($orthHyphens, $translitHyphens, $ilgHyphens))"
                >The hyphens don’t match: orth (<value-of select="$orthHyphens"/>) ~ translit
                    (<value-of select="$translitHyphens"/>) ~ ilg (<value-of select="$ilgHyphens"
                />)</report>
        </rule>
    </pattern>
</schema>

The @context attribute of the Schematron <rule> element

Because we want to report the error at the level of the <sentence> element (see above), we set the value of the @context attribute on our Schematron <rule> element to sentence.

Variables

A popular strategy for counting how often a character (here a space or a hyphen) occurs in a string is to strip that character out with the XPath translate() function and subtract the length of the stripped string from the length of the original. We use this strategy to set up six variables, recording the number of spaces and hyphens in our <orth>, <translit>, and <ilg> tiers. There are alternative strategies that will achieve the same result, but we find this one easiest to understand and to write.
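
As a concrete check of the arithmetic, and as one sketch of an alternative (not the approach used in our solution), consider the transliterated word poexa-l-i. The alternative below converts the string to a sequence of Unicode code points with string-to-codepoints() and counts those equal to 45, the code point of the hyphen; the variable name translitHyphensAlt is invented for illustration:

<!-- translate() strategy. For the single word 'poexa-l-i':
     string-length('poexa-l-i') is 9, string-length('poexali') is 7, and 9 - 7 = 2 hyphens -->
<let name="translitHyphens"
    value="string-length(translit) - string-length(translate(translit,'-',''))"/>
<!-- Alternative sketch: count the code points that equal 45 (the hyphen) -->
<let name="translitHyphensAlt"
    value="count(string-to-codepoints(translit)[. eq 45])"/>

Both expressions should yield the same count for any tier. Counting spaces works the same way with code point 32.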

Testing

We need to run a three-way test, that is, we need to compare the count of spaces or hyphens in three strings to determine whether they all have the same count. Some programming languages permit transitive comparisons, and in those languages you could write something like a = b = c to test whether a, b, and c are all equal. XPath is not that kind of language, though, so we need an alternative strategy. One straightforward approach would combine two tests in one:

$orthSpaces eq $translitSpaces and $orthSpaces eq $ilgSpaces

In a combined test with the and operator, the result of the test is false unless both parts succeed. We don’t need to run a third test that compares the transliteration to the interlinear glossing directly, because equality is transitive: if a = b and a = c, it follows that b = c.
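
Because a Schematron <report> fires when its test is true, the two-part equality test has to be negated before it can drive a <report>. Alternatively, it can be used unchanged in an <assert>, which fires when its test is false. A sketch of both phrasings, using the variables defined in the basic solution (the message text is only illustrative):

<!-- <report> fires when the test is true, so the equality test is negated -->
<report test="not($orthSpaces eq $translitSpaces and $orthSpaces eq $ilgSpaces)">The
    spaces don’t match.</report>
<!-- <assert> fires when the test is false, so the equality test is used as is -->
<assert test="$orthSpaces eq $translitSpaces and $orthSpaces eq $ilgSpaces">The
    spaces don’t match.</assert>

Only one of the two would appear in a real schema, since both produce a message under the same circumstances.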

We opted, though, for a more elegant approach. We created a sequence of the three values and compared that, using the general comparison operator !=, to the average of the three values. If any of the three values is not equal to the average (that is, if the test for general nonequality succeeds), they are not all the same.
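
The general comparison operators are existentially quantified: a comparison between a sequence and a single value is true if any item in the sequence satisfies it. The following XPath expressions, with made-up counts, sketch how that plays out in the comparison to the average:

(: all three counts agree: the average is 2 and no item differs from it, so this is false :)
(2, 2, 2) != avg((2, 2, 2))

(: one count disagrees: the average is 2.33..., every item differs from it, so this is true :)
(2, 2, 3) != avg((2, 2, 3))

(: three different counts: the average is 2, and 1 and 3 differ from it, so this is true :)
(1, 2, 3) != avg((1, 2, 3))

(: negating = is not equivalent: the item 2 equals the average, so this is false :)
not((1, 2, 3) = avg((1, 2, 3)))

The last expression shows why the test has to be written with != rather than as a negated = comparison, which would fail to catch the mismatch in (1, 2, 3).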

Reporting

To make it easier for the human to find the error, we used the Schematron <value-of> element to report the space or hyphen counts for each tier. It isn’t possible using this strategy to find the word where a mismatch in hyphens occurs because we’re doing our counting on the level of the entire line, and not on a word-by-word basis.
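
To see the report in action, one could validate a deliberately broken sentence. In the following sketch, a test input invented for illustration, the final -P has been removed from the gloss go-PST-P, so the <ilg> tier has two hyphens where the other tiers have three, while all three tiers still have six spaces:

<sentence>
    <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
    <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
    <ilg>we with Marko go-PST bus-by to Peredelkino.</ilg>
    <free>Marko and I went to Peredelkino by bus.</free>
</sentence>

With the basic schema, only the hyphen report should fire for this sentence, giving counts of 3, 3, and 2. It does not say which word is at fault.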

A more graceful solution

Comparing the sequence of all values to the average of all values, which we do above, works because only if all values are the same will every individual value be equal to the average of all of the values. A more elegant approach, though, might just count the number of distinct values:

<report test="count(distinct-values(($orthSpaces, $translitSpaces, $ilgSpaces))) ne 1">

If the values are all the same, there will be only one distinct value, so the report should fire whenever the count of distinct values is not 1.
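
With made-up counts, the distinct-values() approach evaluates as follows (a sketch for illustration):

(: all counts agree: one distinct value :)
count(distinct-values((3, 3, 3)))    (: 1 :)

(: one count disagrees: two distinct values :)
count(distinct-values((3, 3, 2)))    (: 2 :)

(: all counts disagree: three distinct values :)
count(distinct-values((1, 2, 3)))    (: 3 :)

Unlike the comparison to the average, this test involves no arithmetic, so it would work equally well for comparing values that are not numbers.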

An enhanced solution: finding which individual word has mismatched hyphens

Getting a word-level report is tricky because the individual words are not independent nodes in the XML tree, so it isn’t possible to set the value of the @context attribute to point to them. We need, then, to continue to run our tests on the level of entire sentences, and find some other way to do a word-by-word test.

Happily, it is possible to declare the XSLT namespace in Schematron and use XSLT features, including <xsl:function>, which lets us declare our own function. We use that strategy to let our user-defined function djb:hyphenation() generate a word-by-word result, which Schematron can then use to output a more specific error report. Here is the code:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" queryBinding="xslt2">
    <ns prefix="djb" uri="http://www.obdurodon.org"/>
    <xsl:function name="djb:hyphenation" as="xs:boolean+">
        <xsl:param name="orthWords" as="xs:string+"/>
        <xsl:param name="transWords" as="xs:string+"/>
        <xsl:param name="ilgWords" as="xs:string+"/>
        <xsl:for-each select="1 to count($orthWords)">
            <xsl:variable name="orthHyphens" as="xs:integer"
                select="string-length($orthWords[current()]) - 
                string-length(translate($orthWords[current()],'-',''))"/>
            <xsl:variable name="transHyphens" as="xs:integer"
                select="string-length($transWords[current()]) - 
                string-length(translate($transWords[current()],'-',''))"/>
            <xsl:variable name="ilgHyphens" as="xs:integer"
                select="string-length($ilgWords[current()]) - 
                string-length(translate($ilgWords[current()],'-',''))"/>
            <xsl:sequence select="$orthHyphens eq $transHyphens and $orthHyphens eq $ilgHyphens"/>
        </xsl:for-each>
    </xsl:function>
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces"
                value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces"
                value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthWords" value="tokenize(orth,'\s+')"/>
            <let name="transWords" value="tokenize(translit,'\s+')"/>
            <let name="ilgWords" value="tokenize(ilg,'\s+')"/>
            <let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>
            <report
                test="($orthSpaces, $translitSpaces, $ilgSpaces) != 
                    avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit
                    (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"
                />)</report>
            <report test="$results != true()">Word # <value-of
                    select="index-of($results,false())[1]"/> doesn't match: "<value-of
                    select="$orthWords[index-of($results,false())[1]]"/>" (orthographic, <value-of
                    select="string-length($orthWords[index-of($results,false())[1]]) - 
                    string-length(translate($orthWords[index-of($results,false())[1]],'-',''))"
                />) ~ "<value-of select="$transWords[index-of($results,false())[1]]"/>"
                (transliterated, <value-of
                    select="string-length($transWords[index-of($results,false())[1]]) - 
                    string-length(translate($transWords[index-of($results,false())[1]],'-',''))"
                />) ~ "<value-of select="$ilgWords[index-of($results,false())[1]]"/>" (interlinear
                gloss, <value-of
                    select="string-length($ilgWords[index-of($results,false())[1]]) - 
                    string-length(translate($ilgWords[index-of($results,false())[1]],'-',''))"
                />)</report>
        </rule>
    </pattern>
</schema>

Checking spaces

We use exactly the same strategy for checking spaces as we did in the simple solution, above.

Namespaces and using XSLT within Schematron

User-defined functions have to be in a user-defined namespace, and we’ve used the URI http://www.obdurodon.org as the namespace for our function and bound it to the prefix djb:. Schematron does not support the general XML namespace declaration syntax, so we have to use the Schematron-specific namespace declaration syntax instead:

<ns prefix="djb" uri="http://www.obdurodon.org"/>

We also have to declare the XSLT namespace, which we do using the standard XML namespace declaration syntax, writing:

xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

inside the <schema> start tag. The reason we use the standard XML namespace declaration syntax here is that this namespace declaration applies at a higher level, at the stage where the processor has to sort out which parts of the schema are in the Schematron namespace and which are in the XSLT namespace, that is, while looking at children of the root <schema> element. It deals with the djb namespace at a deeper level, where the Schematron processor is able to manage the namespace resolution.

The djb:hyphenation() function

This isn’t a general tutorial on writing your own XSLT functions (see the clear and comprehensive write-up in the Michael Kay book for that). In general terms, we break the three strings into words using the tokenize() function and pass those three sequences of words into the function, saving the output of the function in a variable we call $results:

<let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>

When the function returns, the $results variable will contain information that we can use to determine the position in the sequence of words in which a discrepancy in hyphen count appears (see below).

The function iterates over the words in an <xsl:for-each> element, calculates the number of hyphens in each word in corresponding positions in the three tiers, and compares those three numbers (we used the two-part test with and this time because it was more legible). The comparison returns (as the value of the <xsl:sequence> element) a Boolean (true or false) value depending on whether the counts of hyphens in the same word position in the three tiers are all equal to one another. The function generates one Boolean value for each word position in the sentence, and after it has examined all of the sets of words, it returns a sequence of Boolean values to the Schematron rule, where, as we noted above, it becomes the value of the $results variable.

If any value in the returned sequence is not equal to Boolean true(), the Schematron report finds the position of the first false() value (using the XPath index-of() function) and uses that position to find the specific words, count the hyphens in those words on all three tiers, and output a report that gives, for each tier, the tier identifier, the word, and the number of hyphens. The <report> element is difficult to read because although Schematron allows the use of the <let> element to create variables, it doesn’t permit the creation of variables inside a <report> element. This means that we can’t create convenience variables to hold our counts of string lengths and hyphens, which would make our code easier to read, and we have to do all of the measurement and arithmetic at once instead.
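
As an illustration with an invented result sequence (not output from the sample document, in which all words match), suppose the fourth of seven word positions has mismatched hyphen counts. The XPath expressions that drive the report would then evaluate as sketched here:

(: the result sequence contains a false() value, so the report test is true :)
(true(), true(), true(), false(), true(), true(), true()) != true()

(: index-of() returns every position that holds false(); [1] keeps the first, here 4 :)
index-of((true(), true(), true(), false(), true(), true(), true()), false())[1]

(: that position then selects the word on each tier, here 'poexa-l-i' :)
tokenize('My s Marko poexa-l-i avtobus-om v Peredelkino.', '\s+')[4]

If more than one position is mismatched, only the first is reported. Correcting it and validating again would surface the next one.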

Tightening our code

After we wrote the Schematron above, which meets all of our requirements, we noticed that inside the djb:hyphenation function we perform the same computation three times, with different input, to count the hyphens in each word. Furthermore, the calculation itself is pretty verbose, and therefore hard to read. To make things more legible, we refactored (revised) that part of our code, breaking the calculation out into a separate djb:countHyphens function that we could call three times. And just to keep in practice we tried a different strategy for testing whether all three counts were the same. Here’s the revised code; the changes are confined to the body of djb:hyphenation and the new djb:countHyphens function:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" queryBinding="xslt2">
    <ns prefix="djb" uri="http://www.obdurodon.org"/>
    <xsl:function name="djb:hyphenation" as="xs:boolean+">
        <xsl:param name="orthWords" as="xs:string+"/>
        <xsl:param name="transWords" as="xs:string+"/>
        <xsl:param name="ilgWords" as="xs:string+"/>
        <xsl:for-each select="1 to count($orthWords)">
            <xsl:variable name="orthHyphens" as="xs:integer"
                select="djb:countHyphens($orthWords[current()])"/>
            <xsl:variable name="transHyphens" as="xs:integer"
                select="djb:countHyphens($transWords[current()])"/>
            <xsl:variable name="ilgHyphens" as="xs:integer"
                select="djb:countHyphens($ilgWords[current()])"/>
            <xsl:sequence
                select="count(distinct-values(($orthHyphens, $transHyphens, $ilgHyphens))) eq 1"/>
        </xsl:for-each>
    </xsl:function>
    <xsl:function name="djb:countHyphens" as="xs:integer">
        <xsl:param name="word"/>
        <xsl:variable name="length" as="xs:integer" select="string-length($word)"/>
        <xsl:variable name="dehyphenatedLength" as="xs:integer"
            select="string-length(translate($word,'-',''))"/>
        <xsl:sequence select="$length - $dehyphenatedLength"/>
    </xsl:function>
    <pattern>
        <rule context="sentence">
            <let name="orthSpaces"
                value="string-length(orth) - string-length(translate(orth,' ',''))"/>
            <let name="translitSpaces"
                value="string-length(translit) - string-length(translate(translit,' ',''))"/>
            <let name="ilgSpaces" value="string-length(ilg) - string-length(translate(ilg,' ',''))"/>
            <let name="orthWords" value="tokenize(orth,'\s+')"/>
            <let name="transWords" value="tokenize(translit,'\s+')"/>
            <let name="ilgWords" value="tokenize(ilg,'\s+')"/>
            <let name="results" value="djb:hyphenation($orthWords,$transWords,$ilgWords)"/>
            <report
                test="($orthSpaces, $translitSpaces, $ilgSpaces) != 
                    avg(($orthSpaces, $translitSpaces, $ilgSpaces))"
                >The spaces don’t match: orth (<value-of select="$orthSpaces"/>) ~ translit
                    (<value-of select="$translitSpaces"/>) ~ ilg (<value-of select="$ilgSpaces"
                />)</report>
            <report test="$results != true()">Word # <value-of
                    select="index-of($results,false())[1]"/> doesn't match: "<value-of
                    select="$orthWords[index-of($results,false())[1]]"/>" (orthographic, <value-of
                    select="string-length($orthWords[index-of($results,false())[1]]) - 
                    string-length(translate($orthWords[index-of($results,false())[1]],'-',''))"
                />) ~ "<value-of select="$transWords[index-of($results,false())[1]]"/>"
                (transliterated, <value-of
                    select="string-length($transWords[index-of($results,false())[1]]) - 
                    string-length(translate($transWords[index-of($results,false())[1]],'-',''))"
                />) ~ "<value-of select="$ilgWords[index-of($results,false())[1]]"/>" (interlinear
                gloss, <value-of
                    select="string-length($ilgWords[index-of($results,false())[1]]) - 
                    string-length(translate($ilgWords[index-of($results,false())[1]],'-',''))"
                />)</report>
        </rule>
    </pattern>
</schema>
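
Although variables cannot be declared inside a <report>, <let> elements at the rule level are allowed, and the schema already uses them. One possible further tightening, offered as a sketch rather than as part of our solution, stores the position of the first mismatch in a rule-level variable (here called badPos, a name invented for illustration) and reuses djb:countHyphens() inside the report:

<!-- These lines would replace the second <report>, following the existing
     <let name="results" .../> inside the rule -->
<let name="badPos" value="index-of($results,false())[1]"/>
<report test="$results != true()">Word # <value-of select="$badPos"/> doesn’t match:
        "<value-of select="$orthWords[$badPos]"/>" (orthographic, <value-of
        select="djb:countHyphens($orthWords[$badPos])"/>) ~ "<value-of
        select="$transWords[$badPos]"/>" (transliterated, <value-of
        select="djb:countHyphens($transWords[$badPos])"/>) ~ "<value-of
        select="$ilgWords[$badPos]"/>" (interlinear gloss, <value-of
        select="djb:countHyphens($ilgWords[$badPos])"/>)</report>

The new <let> has to sit with the other rule-level <let> elements, before the <report> elements. When every word matches, badPos is simply empty and the report does not fire.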