Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-14T20:56:32+0000


Schematron assignment #2: answers

The assignment

Write a Schematron schema that will take input like:

<?xml version="1.0" encoding="UTF-8"?>
<stuff>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
    <sentence>
        <orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
        <translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
        <ilg>we with Marina-INS go-PST-P train-by to Kazan.</ilg>
        <free>Marina and I went to Kazan by train.</free>
    </sentence>
</stuff>

and verify that the first three tiers (<orth>, <translit>, and <ilg>) of each <sentence> all have the same number of spaces and the same number of hyphens.

When tiers disagree, it seemed most natural to report the error at the level of the sentence, rather than of one or another of the tiers. Schematron can easily recognize that the tiers don’t agree, but it has no way to tell which of the discrepant tiers contains the error.

A basic solution

Here is one possible solution:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="sentence">
            <!-- ================================================== -->
            <!-- Variables for checking spaces                      -->
            <!-- ================================================== -->
            <let name="orth_spaces"
                value="string-length(orth) - string-length(translate(orth, ' ', ''))"/>
            <let name="translit_spaces"
                value="string-length(translit) - string-length(translate(translit, ' ', ''))"/>
            <let name="ilg_spaces"
                value="string-length(ilg) - string-length(translate(ilg, ' ', ''))"/>
            <!-- ================================================== -->
            <!-- Test for checking spaces                           -->
            <!-- ================================================== -->
            <assert test="
                    ($orth_spaces, $translit_spaces, $ilg_spaces)
                    => distinct-values()
                    => count()
                    eq 1">Spaces don’t match: <value-of select="$orth_spaces"/> in orth,
                    <value-of select="$translit_spaces"/> in translit, <value-of
                    select="$ilg_spaces"/> in ilg.</assert>
            <!-- ================================================== -->
            <!-- Variables for checking hyphens                     -->
            <!-- ================================================== -->
            <let name="orth_hyphens"
                value="string-length(orth) - string-length(translate(orth, '-', ''))"/>
            <let name="translit_hyphens"
                value="string-length(translit) - string-length(translate(translit, '-', ''))"/>
            <let name="ilg_hyphens"
                value="string-length(ilg) - string-length(translate(ilg, '-', ''))"/>
            <!-- ================================================== -->
            <!-- Test for checking hyphens                          -->
            <!-- ================================================== -->
            <assert test="
                    ($orth_hyphens, $translit_hyphens, $ilg_hyphens)
                    => distinct-values()
                    => count()
                    eq 1">Hyphens don’t match: <value-of select="$orth_hyphens"/> in
                orth, <value-of select="$translit_hyphens"/> in translit, <value-of
                    select="$ilg_hyphens"/> in ilg.</assert>
        </rule>
    </pattern>
</schema>

The issues

This assignment poses two challenges. The first is how to count the number of spaces or hyphens in each of the tiers, since that is a prerequisite to comparing the counts. The second has two parts: where to report the comparison (that is, what should be the value of the @context attribute on the <rule> element) and how to compare the counts.


How to count hyphens and spaces

XPath does not have a function that counts the occurrences of a particular character inside a string. That is, there is no core function that takes two arguments, a string and a character, and returns the number of times the character appears in the string. Some of you may have tried to use the count() function, which may feel correct because the count() function is … well … how XPath counts things. But the count() function takes, as its one and only argument, a sequence of things to count, so if we want to use it, we need to give it that type of sequence. There is no way to say, with just the XPath count() function, “count the number of hyphen characters in this string.”

There are at least four ways to approach the task. We present them here from the simplest to the most complex, and if your goal is just to solve the immediate task, it’s okay to study the first, skip the other three, and resume after the last of them, where we discuss the @context attribute of the Schematron <rule> element. But the more complex approaches provide a terrific, contextualized introduction to increasingly advanced features of XPath, so even if you use the first method in Real Life, we encourage you to take the time to read about the others.

I. Use the string-length() and translate() functions

The number of space characters in a sentence is equal to the length of the original sentence minus the length of the original sentence after you’ve stripped out the space characters. In other words, string-length($tier) - string-length(translate($tier, ' ','')) is an indirect way of counting the number of space characters in a string represented by the variable $tier. The same is true of hyphens. This is the approach we use in our basic solution, above.
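To see the arithmetic at work, here is a small sketch that applies the expression to the transliteration tier of the first sample sentence. The literal string stands in for the $tier variable; in the real schema the value would come from the element itself.

```xpath
(: Count spaces and hyphens by comparing the string length before and after
   stripping the character with translate(). :)
let $tier := 'My s Marko poexa-l-i avtobus-om v Peredelkino.'
return (
    string-length($tier) - string-length(translate($tier, ' ', '')),   (: 6 spaces :)
    string-length($tier) - string-length(translate($tier, '-', ''))    (: 3 hyphens :)
)
```

The expression returns the sequence 6, 3.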

II. Use the tokenize() function

From a linguistic perspective we aren’t interested in spaces per se as much as we are in making sure that the number of whitespace-delimited words in the tiers is the same. We can count whitespace-delimited words with count(tokenize($tier)) (or, more legibly, tokenize($tier) => count()). The one-argument form of tokenize(), available since XPath 3.1, splits the string represented by the $tier variable into substrings wherever there is a run of one or more whitespace characters (ignoring any leading or trailing whitespace), and count() then counts the substrings. In a tier with single spaces between words, the number of words is one greater than the number of spaces, but since we care only whether the tiers agree, that difference doesn’t matter. Note that this approach and the preceding one return different results when there are multiple spaces between words, since this one splits on sequences of whitespace, while the preceding one counts each individual space character. Whether you want to require exactly one space character between words (and treat multiple spaces as a data-entry error) is up to you, and you can write your Schematron rule to enforce your preference in either case.
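The difference between the two approaches is easiest to see with a doubled space. In this sketch the two short strings are made up for illustration; the second has two spaces before the last word.

```xpath
let $single := 'My s Marko',    (: one space between words :)
    $double := 'My s  Marko'    (: two spaces before Marko :)
return (
    tokenize($single) => count(),    (: 3 words :)
    tokenize($double) => count(),    (: still 3 words :)
    string-length($single) - string-length(translate($single, ' ', '')),    (: 2 spaces :)
    string-length($double) - string-length(translate($double, ' ', ''))     (: 3 spaces :)
)
```

The expression returns 3, 3, 2, 3: the word counts agree, while the space counts expose the doubled space.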

We can split the string into substrings on hyphens in much the same way by supplying the hyphen as a second argument, as in tokenize($tier, '-'); the number of substrings is then one greater than the number of hyphens. That’s an unusual thing to do in Real Life (unlike splitting a line of text into words on whitespace, since words are real things and we often divide sentences into them), but as a strategy for counting, the method works the same way for hyphens as it does for spaces.
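A minimal sketch of the hyphen version, assuming we pass the hyphen as the second argument of tokenize() (a regular expression that here matches a single literal hyphen). The literal string again stands in for $tier.

```xpath
(: Splitting on '-' yields four pieces, so there are three hyphens. :)
let $tier := 'My s Marko poexa-l-i avtobus-om v Peredelkino.'
return count(tokenize($tier, '-')) - 1    (: 3 :)
```

For the comparison in the schema, subtracting 1 is optional, since all tiers would be off by the same amount.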

III. Use the string-to-codepoints(), codepoints-to-string(), and (optionally) index-of() functions

This approach is closer to the human description of the problem than the preceding ones because it is based on finding the character of interest in the string directly and counting how many times it appears. The preceding two approaches, on the other hand, do not count either spaces or hyphens directly, although they do count them indirectly. The two preceding approaches are simpler, so you might favor them in Real Life, but it’s worth reading through this third alternative to learn about some new XPath functions.

You can break a string into a sequence of individual characters (represented by one-character strings) within an XPath for expression, as in

for $char in string-to-codepoints($tier) 
return codepoints-to-string($char)

The way for expressions work is that the for clause is followed by a return clause, and for each item in the sequence after the for you get back the result of evaluating the return clause for that item. If you run this expression against the input string obdurodon with

for $char in string-to-codepoints("obdurodon")
return codepoints-to-string($char)

you’ll get back "o", "b", "d", "u", "r", "o", "d", "o", "n", a sequence of nine one-character strings. What the code says is use string-to-codepoints() to break the string into a sequence of integer values (codepoints), one for each character, and then for each of those integers use codepoints-to-string() to turn it into a one-character string. (The integers are the Unicode codepoints of the characters in decimal form. See the Wikipedia List of Unicode characters for some examples. If you want to see the numbers, you can return just $char, instead of first converting it from a number to a string.)
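If you are curious about the numbers, this variant of the expression above returns the codepoints themselves instead of converting them back to strings:

```xpath
(: Return the decimal Unicode codepoint of each character. :)
for $char in string-to-codepoints("obdurodon")
return $char
(: 111, 98, 100, 117, 114, 111, 100, 111, 110 :)
```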

An XPath-idiomatic alternative to a for expression uses the simple mapping operator (!), and with that approach we can write

$tier ! string-to-codepoints(.) ! codepoints-to-string(.)

Reading from left to right, this says to take the string we’re interested in (imagining that we’ve saved it as the value of a variable called $tier), break it into a sequence of integers with the string-to-codepoints() function, and turn each integer, in turn, into a string with the codepoints-to-string() function. Once we have a sequence of one-character strings we can count the occurrences of spaces or hyphens in the following indirect way. What’s indirect about it is that we won’t count the character instances themselves; what we’ll do instead is find the positions in the sequence where the character we care about occurs and we’ll count the number of positions. Since there is one position for each instance of the character we care about, the positions can serve as proxies for the characters, which is to say that the count of positions will equal the number of characters.

We can find the offset of a particular item in a sequence of items with the index-of() function, which takes two arguments: the first is the haystack (the sequence of items in which you’re searching) and the second is the needle (the single item you want to find in the sequence, which may occur more than once). For example,

index-of(
    string-to-codepoints('obdurodon') ! codepoints-to-string(.) 
    , 'o'
)

returns a sequence of three integers, 1, 6, and 8, because the letter o is the first, sixth, and eighth letter of the string obdurodon. For our purposes we don’t care about the specific positions, but we do care how many such positions there are, so if we wrap count() around that long expression, it will return the single integer value of 3 because the character occurs in three positions in the string obdurodon.
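Applied to the kind of data in the assignment, the same pattern might look like the following sketch, where the single word poexa-l-i stands in for a whole tier. It returns the positions first and then their count, just to show both steps side by side.

```xpath
let $chars := string-to-codepoints('poexa-l-i') ! codepoints-to-string(.)
return (
    index-of($chars, '-'),          (: positions of the hyphens: 6, 8 :)
    count(index-of($chars, '-'))    (: how many positions there are: 2 :)
)
```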

We can use this approach to count the number of space characters or hyphens in a tier, but what’s peculiar or unnatural about it is that we return the offset positions without caring what they are, since all we do is count them. That fact makes our code harder to understand, since when we see the index-of() function, our first thought is that we care about the specific values, and not just how many of them there are. We can avoid that confusion by applying a predicate to filter the sequence of one-character strings, using a comparison operation inside the predicate to find the instances of matching characters themselves and then counting them, without bothering to compute (distractingly) their offsets. The counts will be the same in either case, since each occurrence has its own offset; the advantage of the more direct approach is that it doesn’t leave us wondering why we computed specific offset values if we don’t care what they are. The new version, also using the simple mapping and arrow operators, looks like

$tier
! string-to-codepoints(.) 
! codepoints-to-string(.)[. eq '-'] 
=> count()

IV. Use the analyze-string() function (also introducing the serialize() function)

XPath 3.0 introduced the function analyze-string(), which lets us use regular expressions to find substrings inside a string. analyze-string() takes two arguments, the first (the haystack) is the string to parse and the second (the needle) is a regular expression that we match against the string. For example, if our regular expression were [aeiou] (a character class that matches any single vowel letter) and our string were obdurodon, the regex pattern would match one instance of o, then one of u, and then two more of o. The function also keeps track of what it doesn’t match. As the function name implies, analyze-string(), with its support for regex, is the most powerful and flexible way of performing string surgery, which means that in situations where the simpler methods above don’t meet your needs, analyze-string() is your superpower.

The output of analyze-string() is complex because it has to report, in the order in which it encounters them in the haystack, each matching and non-matching substring. The function reports this in an XML structure that is in a particular namespace and looks like:

<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
    <match>o</match>
    <non-match>bd</non-match>
    <match>u</match>
    <non-match>r</non-match>
    <match>o</match>
    <non-match>d</non-match>
    <match>o</match>
    <non-match>n</non-match>
</analyze-string-result>

Note that the output of the function has a predefined root element name (<analyze-string-result>) in a predefined namespace (declared as the default with @xmlns on the element), and the root contains a sequence of <match> and <non-match> elements (in that same namespace, because default namespaces are inherited by descendants). These child elements are in the order in which they are encountered in the input string, and if you put them through the string-join() function to stitch them together in order, you’ll reconstruct the original input string, which is sort of cool, even if not especially useful.
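As a quick check of that claim, this sketch selects all of the element children of the result, whatever their names, and joins their string values in order:

```xpath
(: Stitch the match and non-match pieces back together. :)
analyze-string('obdurodon', '[aeiou]')/* => string-join()
(: "obdurodon" :)
```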

You can see this structured XML result by running:

analyze-string('obdurodon', '[aeiou]') => serialize()

in the XPath/XQuery builder in <oXygen/>. The output of analyze-string() is an XML tree (not text with angle brackets), which isn’t human readable, so we have to pipe the output into the serialize() function if we want to convert it to XML that is formatted, with angle brackets, as human-readable text. (If you don’t do that, you won’t get any human-readable output.) We don’t need to serialize the output in order to count hyphens or spaces (counting is an XPath process on the tree), but if we want to peek for ourselves at what the immediate output of the function looks like, the serialize() function will do that for us.

We can use the output of analyze-string() (without serializing it) to give us <match> elements, which we can then count, since that count will be equal to the number of times our regex was matched in the input. For this particular assignment, then, we can run analyze-string() over each of the tiers, specifying a single hyphen or a single space character as the value of the regex to be matched, and we can then count the number of <match> elements in the result. We have to be mindful of the namespace, though! Our XPath expression would look like the following (in these examples we continue looking for vowel letters in the string obdurodon; for the homework assignment you would, of course, look instead for either hyphens or space characters in the content of each of the tiers):

analyze-string('obdurodon', '[aeiou]')/Q{http://www.w3.org/2005/xpath-functions}match => count()

We already know that the first part of this XPath expression outputs an XML structure in a particular namespace, but this may be the first time you are seeing the Q{} notation. Because we haven’t mapped any prefix to the namespace of the <match> elements that are output by the analyze-string() function, we need to specify it literally, which we do here by putting the namespace string inside the curly braces in Q{…}. The official term in the XPath spec for this way of representing a namespace is BracedURILiteral; see https://www.w3.org/TR/xpath-31/#doc-xpath31-BracedURILiteral. Since the function returns the root element of the analyze-string() output (with its contents), we can append a path step to look on its child axis in order to find and count the <match> child elements.
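For the homework data, the same pattern might look like the following sketch, which uses the transliteration tier of the first sample sentence as a literal string. The regular expressions are a single hyphen and a single space character.

```xpath
let $tier := 'My s Marko poexa-l-i avtobus-om v Peredelkino.'
return (
    analyze-string($tier, '-')/Q{http://www.w3.org/2005/xpath-functions}match => count(),   (: 3 hyphens :)
    analyze-string($tier, ' ')/Q{http://www.w3.org/2005/xpath-functions}match => count()    (: 6 spaces :)
)
```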

The Q{} notation is difficult to type, and <oXygen/> helpfully predefines the fn: prefix as mapped to this namespace, even when you don’t declare that prefix yourself. This means that as long as we are working inside <oXygen/> we can, alternatively and more legibly, write:

analyze-string('obdurodon', '[aeiou]')/fn:match => count()

That predefinition is an <oXygen/> convenience, though, and we can’t rely on every XPath processor to know about a prefix that we don’t declare ourselves. (The prefix is predefined in XQuery, which uses XPath, but it isn’t predefined in XPath itself, which means that it is not reliably available in XPath when used outside of XQuery.) For that reason, it’s safe to use fn: in <oXygen/>, but before you use it with any other XPath processor, test it and make sure that it’s predefined there, too.

There are a couple of other, alternative ways to negotiate the namespace that are safe to use here, but that should be used with caution elsewhere:

analyze-string('obdurodon', '[aeiou]')/*:match => count()

The notation *:match uses a namespace wildcard to match a <match> element in any namespace. The asterisk where the namespace prefix would usually go means that any namespace is acceptable, including none (that is, this expression also matches elements in no namespace).

Similarly, in:

analyze-string('obdurodon', '[aeiou]')/*[name() eq 'match'] => count()

our path step finds all element children of the root element (an asterisk matches all element nodes, of any type) and then uses the name() function inside a predicate to filter them to keep only those for which the element name is match. The name() function returns the name as it is written in the serialized XML (this is called a lexical QName, which is a name with an optional namespace prefix), character by character, so this comparison succeeds only because the <match> elements in the output above are written without a prefix. You could also use:

analyze-string('obdurodon', '[aeiou]')/*[local-name() eq 'match'] => count()

The local-name() function returns only the local part of the element name, ignoring any namespace, so this predicate has the same effect as the namespace wildcard above.
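If you like the predicate style but don’t want to ignore the namespace entirely, one option (our suggestion here, not something the solution above relies on) is to test the namespace-uri() function alongside local-name() in the same predicate:

```xpath
(: Keep only children whose local name is 'match' and whose namespace
   is the one that analyze-string() uses for its output. :)
analyze-string('obdurodon', '[aeiou]')
    /*[local-name() eq 'match'
       and namespace-uri() eq 'http://www.w3.org/2005/xpath-functions']
    => count()
(: 4 :)
```

This is wordier than the Q{} notation and selects the same elements, so it is mostly useful as a way of seeing the two parts of an element name separately.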

What won’t work here is what may seem to a human like the most natural approach:

analyze-string('obdurodon', '[aeiou]')/match => count()

This will return 0 because the path step looks for <match> elements only if they are in no namespace, so it selects nothing, and in the output of analyze-string() the <match> elements are in the namespace they inherit from the default specified on their parent. In other words, you can specify the namespace 1) with the Q{} notation or 2) with the fn: namespace prefix if it’s predefined for you, as it is in <oXygen/>; or you can sidestep it 3) with the *: namespace wildcard or 4) by using a predicate with name() or local-name() to match on the element name as it would appear in the serialized XML, irrespective of the namespace.


The @context attribute of the Schematron <rule> element

Because we want to report the error at the level of the <sentence> element (see above), we set the value of the @context attribute on our Schematron <rule> element to sentence. This ensures that if we have multiple sentences, an error will be reported (with a red squiggly line) on the sentence where it occurs, which makes it easy to find and fix. Since all of our tests, for both spaces and hyphens, are run at the level of <sentence>, we have just one Schematron <rule> element, which contains one <assert> to test the spaces and one to test the hyphens.

Variables

We use the first of the four strategies described above, the one that relies on the translate() function, to set up six variables, recording the number of spaces and hyphens in our <orth>, <translit>, and <ilg> tiers. We use the variables both because they make our <assert> statements briefer and easier to understand and because they save us from performing the same calculations more than once (once to run the test and again to report the results).

How to perform the comparisons

We need to run a three-way test, that is, we need to compare the count of spaces or hyphens in three strings to determine whether they all have the same count. Some programming languages permit transitive comparisons, and in those languages you could write something like a = b = c to test whether a, b, and c are all equal. XPath is not that kind of language, though, so we need an alternative strategy. One straightforward approach would combine two tests in one:

$orth_spaces eq $translit_spaces and $orth_spaces eq $ilg_spaces

In a combined test with the and operator, the result of the test is true only if both parts succeed. We don’t need to run a third test, comparing transliteration to interlinear glossing directly, because equality is transitive: if a = b and a = c, it is also true that b = c.

We opted, though, for a more elegant approach. If we take the distinct values of our counts for all tiers, removing any repeated values, there will be one item remaining only if all values are equal. For that reason, we pass the three values into the distinct-values() function, count the distinct values, test whether the count is equal to 1, and report a problem if it isn’t:

<assert test="($orth_spaces, $translit_spaces, $ilg_spaces) 
    => distinct-values()
    => count()
    eq 1">
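To try out the comparison logic outside of Schematron, here is a sketch that binds made-up counts to the three variable names (the values 6, 6, and 5 are hypothetical, chosen so that one tier disagrees) and evaluates both the two-part test and the distinct-values() test:

```xpath
let $orth_spaces := 6,
    $translit_spaces := 6,
    $ilg_spaces := 5    (: hypothetical mismatch :)
return (
    (: two equality tests joined with and :)
    $orth_spaces eq $translit_spaces and $orth_spaces eq $ilg_spaces,
    (: two distinct values remain (6 and 5), so the count is 2, not 1 :)
    ($orth_spaces, $translit_spaces, $ilg_spaces) => distinct-values() => count() eq 1
)
```

Both tests return false here, which in the schema would cause the <assert> to fire. If you change the third value to 6, both return true.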

Reporting

To make it easier for the human to find the error, we used the Schematron <value-of> element to report the space or hyphen counts for each tier. It is possible to find and report the specific words with discrepancies in their hyphens, but not with just the approach above because we’re doing our counting on the level of the entire line, and not on a word-by-word basis. If you need to do the equivalent of operating on the words for your project, or if you’re just curious, let us know and we’ll be happy to explore that with you.