Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-04-14T20:56:32+0000
Write a Schematron schema that will take input like:
<?xml version="1.0" encoding="UTF-8"?>
<stuff>
    <sentence>
        <orth>Мы с Марко поеха-л-и автобус-ом в Переделкино</orth>
        <translit>My s Marko poexa-l-i avtobus-om v Peredelkino.</translit>
        <ilg>we with Marko go-PST-P bus-by to Peredelkino.</ilg>
        <free>Marko and I went to Peredelkino by bus.</free>
    </sentence>
    <sentence>
        <orth>Мы с Марин-ой поеха-л-и поезд-ом в Казань</orth>
        <translit>My s Marin-oj poexa-l-i poezd-om v Kazan′.</translit>
        <ilg>we with Marina-INS go-PST-P train-by to Kazan.</ilg>
        <free>Marina and I went to Kazan by train.</free>
    </sentence>
</stuff>
and verify that the first three tiers (<orth>, <translit>, and <ilg>) of each <sentence> all have the same number of spaces and the same number of hyphens.
If there’s a discrepancy between two tiers, it seems most natural to report it as an error at the level of the sentence rather than on one or another of the tiers: although Schematron can easily recognize when the tiers don’t agree, there’s no way for it to tell which of the discrepant tiers contains the error.
Here is one possible solution:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <rule context="sentence">
            <!-- ================================================== -->
            <!-- Variables for checking spaces -->
            <!-- ================================================== -->
            <let name="orth_spaces"
                value="string-length(orth) - string-length(translate(orth, ' ', ''))"/>
            <let name="translit_spaces"
                value="string-length(translit) - string-length(translate(translit, ' ', ''))"/>
            <let name="ilg_spaces"
                value="string-length(ilg) - string-length(translate(ilg, ' ', ''))"/>
            <!-- ================================================== -->
            <!-- Test for checking spaces -->
            <!-- ================================================== -->
            <assert test="
                ($orth_spaces, $translit_spaces, $ilg_spaces)
                => distinct-values()
                => count()
                eq 1">Spaces don’t match: <value-of select="$orth_spaces"/> in orth,
                <value-of select="$translit_spaces"/> in translit, <value-of
                    select="$ilg_spaces"/> in ilg.</assert>
            <!-- ================================================== -->
            <!-- Variables for checking hyphens -->
            <!-- ================================================== -->
            <let name="orth_hyphens"
                value="string-length(orth) - string-length(translate(orth, '-', ''))"/>
            <let name="translit_hyphens"
                value="string-length(translit) - string-length(translate(translit, '-', ''))"/>
            <let name="ilg_hyphens"
                value="string-length(ilg) - string-length(translate(ilg, '-', ''))"/>
            <!-- ================================================== -->
            <!-- Test for checking hyphens -->
            <!-- ================================================== -->
            <assert test="
                ($orth_hyphens, $translit_hyphens, $ilg_hyphens)
                => distinct-values()
                => count()
                eq 1">Hyphens don’t match: <value-of select="$orth_hyphens"/> in
                orth, <value-of select="$translit_hyphens"/> in translit, <value-of
                    select="$ilg_hyphens"/> in ilg.</assert>
        </rule>
    </pattern>
</schema>
This assignment poses two challenges. The first is how to count the number of spaces or hyphens in each of the tiers, since that is a prerequisite to comparing the counts. The second has two parts: where to report the comparison (that is, what should be the value of the @context attribute on the <rule> element) and how to compare the counts.
XPath does not have a function that counts the number of characters of a particular type inside a string. That is, there is no core function that takes two arguments, a string and a character, and returns the number of times the character appears in the string. Some of you may have tried to use the count() function, which may feel correct because the count() function is … well … how XPath counts things. But the count() function takes, as its one and only argument, a sequence of things to count, so if we want to use it, we need to give it that type of sequence. There is no way to say, with just the XPath count() function, “count the number of hyphen characters in this string”.
There are at least four ways to approach the task. We present them here from the simplest to the most complex, and if your goal is just to solve the immediate task, it’s okay to study the first, skip the other three, and pick up below the last of them, where we discuss the @context attribute of the Schematron <rule> element. But the more complex approaches provide a terrific, contextualized introduction to increasingly advanced features of XPath, so even if you use the first method in Real Life, we encourage you to take the time to read about the others.
string-length() and translate() functions
The number of space characters in a sentence is equal to the length of the original sentence minus the length of the original sentence after you’ve stripped out the space characters. In other words,
string-length($tier) - string-length(translate($tier, ' ', ''))
is an indirect way of counting the number of space characters in a string represented by the variable $tier. The same is true of hyphens. This is the approach we use in our basic solution, above.
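As a quick check of the arithmetic, here is a minimal sketch that applies the expression to a literal string instead of a real tier. The sample string and the $tier binding are only for illustration, and the let expression assumes an XPath 3.0 or later processor.

```xpath
(: Illustrative only: $tier is bound to a literal string, not to a real tier :)
let $tier := 'go-PST-P bus-by'
return
  (: length with hyphens minus length without hyphens = number of hyphens :)
  string-length($tier) - string-length(translate($tier, '-', ''))
```

This should return 3, since the string is 15 characters long with its three hyphens and 12 characters long without them. Because string-length() counts characters rather than bytes, the same arithmetic applies to the Cyrillic <orth> tier.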
tokenize() function
From a linguistic perspective we aren’t interested in spaces per se as much as we are in making sure that the number of whitespace-delimited words in the tiers is the same. We can count whitespace-delimited words with count(tokenize($tier)) (or, more legibly, tokenize($tier) => count()). The tokenize() function splits the string represented by the $tier variable into substrings wherever one or more whitespace characters occur, and count() then counts the substrings. Note that this approach and the preceding one return different results when there are multiple spaces between words, since this one splits on sequences of whitespace, while the preceding one counts each individual space character. Whether you want to require exactly one space character between words (and treat multiple spaces as a data-entry error) is up to you, and you can write your Schematron rule to enforce your preference in either case.
We can split the string into substrings on hyphens the same way we do on space characters. That’s an unusual thing to do in Real Life (unlike splitting a line of text into words on whitespace, since words are real things and we often divide sentences into them), but as a strategy for counting, the method works the same way for hyphens as it does for spaces.
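The difference between the two approaches, and the effect of splitting on hyphens, can be sketched with a few made-up strings (the second has a doubled space after we). The one-argument form of tokenize() assumes an XPath 3.1 processor, and the variable names are only for illustration.

```xpath
(: Illustrative strings only; $doubled has two spaces after "we" :)
let $single  := 'we with Marko',
    $doubled := 'we  with Marko',
    $glossed := 'go-PST-P bus-by'
return (
  (: whitespace-delimited words: a run of whitespace counts once :)
  count(tokenize($single)),
  count(tokenize($doubled)),
  (: individual space characters: the doubled space counts twice :)
  string-length($single) - string-length(translate($single, ' ', '')),
  string-length($doubled) - string-length(translate($doubled, ' ', '')),
  (: substrings between hyphens: one more than the number of hyphens :)
  count(tokenize($glossed, '-'))
)
```

The result should be the sequence 3, 3, 2, 3, 4: both strings contain three words, but the translate() approach finds two spaces in the first and three in the second. The final value is one more than the number of hyphens, which makes no difference when the only goal is to compare the tiers with one another.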
string-to-codepoints(), codepoints-to-string(), and (optionally) index-of() functions
This approach is closer to the human description of the problem than the preceding ones because it is based on finding the character of interest in the string directly and counting how many times it appears. The preceding two approaches, on the other hand, do not count either spaces or hyphens directly, although they do count them indirectly. The two preceding approaches are simpler, so you might favor them in Real Life, but it’s worth reading through this third alternative to learn about some new XPath functions.
You can break a string into a sequence of individual characters (represented by one-character strings) within an XPath for expression, as in
for $char in string-to-codepoints($tier)
return codepoints-to-string($char)
The way for expressions work is that they are followed by a return clause, and for each item in the sequence after the for, you return the result of applying the return clause to that item. If you run this expression against the input string obdurodon with
for $char in string-to-codepoints("obdurodon")
return codepoints-to-string($char)
you’ll get back
"o", "b", "d", "u", "r", "o", "d", "o", "n"
which is a sequence of nine one-character strings. What the code says is use string-to-codepoints() to break the string into a sequence of integer values (codepoints), one for each character, and then for each of those integers use codepoints-to-string() to turn it into a one-character string. (The integers are the Unicode codepoints of the characters in decimal form. See the Wikipedia List of Unicode characters for some examples. If you want to see the numbers, you can return just $char, instead of first converting it from a number to a string.)
An XPath-idiomatic alternative to a for expression uses the simple mapping operator (!), and with that approach we can write
$tier ! string-to-codepoints(.) ! codepoints-to-string(.)
Reading from left to right, this says to take the string we’re interested in (imagining that we’ve saved it as the value of a variable called $tier), break it into a sequence of integers with the string-to-codepoints() function, and turn each integer, in turn, into a string with the codepoints-to-string() function. Once we have a sequence of one-character strings we can count the occurrences of spaces or hyphens in the following indirect way. What’s indirect about it is that we won’t count the character instances themselves; what we’ll do instead is find the positions in the sequence where the character we care about occurs and we’ll count the number of positions. Since there is one position for each instance of the character we care about, the positions can serve as proxies for the characters, which is to say that the count of positions will equal the number of characters.
We can find the offset of a particular item in a sequence of items with the index-of() function, which takes two arguments: the first is the haystack (the sequence of items in which you’re searching) and the second is the needle (the single item you want to find in the sequence, which may occur more than once). For example,
index-of(
    string-to-codepoints('obdurodon') ! codepoints-to-string(.),
    'o'
)
returns a sequence of three integers, 1, 6, and 8, because the letter o is the first, sixth, and eighth letter of the string obdurodon. For our purposes we don’t care about the specific positions, but we do care how many such positions there are, so if we wrap count() around that long expression, it will return the single integer value of 3 because the character occurs in three positions in the string obdurodon.
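Wrapped in count(), the index-of() expression might be laid out as in the following sketch; the line breaks and indentation are only one possible arrangement.

```xpath
(: count the positions at which 'o' occurs, ignoring what those positions are :)
count(
  index-of(
    string-to-codepoints('obdurodon') ! codepoints-to-string(.),
    'o'
  )
)
```

This should return 3, one for each of the positions 1, 6, and 8.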
We can use this approach to count the number of space characters or hyphens in a tier, but what’s peculiar or unnatural about it is that we return the offset positions without caring what they are, since all we do is count them. That fact makes our code harder to understand, since when we see the index-of() function, our first thought is that we care about the specific values, and not just how many of them there are. We can avoid that confusion by applying a predicate to filter the sequence of one-character strings, using a comparison operation inside the predicate to find the instances of matching characters themselves and then counting them, without bothering to compute (distractingly) their offsets. The counts will be the same in either case, since each occurrence has its own offset; the advantage of the more direct approach is that it doesn’t leave us wondering why we computed specific offset values if we don’t care what they are. The new version, also using the simple mapping and arrow operators, looks like
$tier
! string-to-codepoints(.)
! codepoints-to-string(.)[. eq '-']
=> count()
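A possible shortcut, which is not part of the solution above: since string-to-codepoints() already returns one integer per character, the predicate could compare codepoints directly and skip the conversion back to strings. The following is a sketch under the assumption that $tier holds a single string; the literal string and the $hyphen variable are only for illustration.

```xpath
let $tier := 'go-PST-P bus-by',
    (: codepoint of the hyphen, looked up rather than hard-coded :)
    $hyphen := string-to-codepoints('-')
return
  (: keep only the codepoints equal to the hyphen's, then count them :)
  string-to-codepoints($tier)[. eq $hyphen] => count()
```

For this sample string the expression should return 3. Whether comparing integers is easier to read than comparing one-character strings is a matter of taste.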
analyze-string() function (also introducing the serialize() function)
XPath 3.0 introduced the function analyze-string(), which lets us use regular expressions to find substrings inside a string. analyze-string() takes two arguments: the first (the haystack) is the string to parse and the second (the needle) is a regular expression that we match against the string. For example, if our regular expression were [aeiou] (a character class that matches any single vowel letter) and our string were obdurodon, the regex pattern would match one instance of o, then one of u, and then two more of o. The function also keeps track of what it doesn’t match. As the function name implies, analyze-string(), with its support for regex, is the most powerful and flexible way of performing string surgery, which means that in situations where the simpler methods above don’t meet your needs, analyze-string() is your superpower.
The output of analyze-string() is complex because it has to report, in the order in which it encounters them in the haystack, each matching and non-matching substring. The function reports this in an XML structure that is in a particular namespace and looks like:
<analyze-string-result xmlns="http://www.w3.org/2005/xpath-functions">
    <match>o</match>
    <non-match>bd</non-match>
    <match>u</match>
    <non-match>r</non-match>
    <match>o</match>
    <non-match>d</non-match>
    <match>o</match>
    <non-match>n</non-match>
</analyze-string-result>
Note that the output of the function has a predefined root element name (<analyze-string-result>) in a predefined namespace (declared as the default with @xmlns on the element), and the root contains a sequence of <match> and <non-match> elements (in that same namespace, because default namespaces are inherited by descendants). These child elements are in the order in which they are encountered in the input string, and if you put them through the string-join() function to stitch them together in order, you’ll reconstruct the original input string, which is sort of cool, even if not especially useful.
You can see this structured XML result by running:
analyze-string('obdurodon', '[aeiou]') => serialize()
in the XPath/XQuery builder in <oXygen/>. The output of analyze-string() is an XML tree (not text with angle brackets), which isn’t human readable, so we have to pipe the output into the serialize() function if we want to convert it to XML that is formatted, with angle brackets, as human-readable text. (If you don’t do that, you won’t get any human-readable output.) We don’t need to serialize the output in order to count hyphens or spaces (counting is an XPath process on the tree), but if we want to peek for ourselves at what the immediate output of the function looks like, the serialize() function will do that for us.
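If the serialized result comes back as one long line, one option is to ask for indentation. This sketch assumes an XPath 3.1 processor, where serialize() accepts a map of serialization parameters as its second argument; the exact whitespace in the result depends on the processor.

```xpath
(: 'indent' is a standard serialization parameter; true() requests line breaks and indentation :)
serialize(
  analyze-string('obdurodon', '[aeiou]'),
  map { 'indent': true() }
)
```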
We can use the output of analyze-string() (without serializing it) to give us <match> elements, which we can then count, since that count will be equal to the number of times our regex was matched in the input. For this particular assignment, then, we can run analyze-string() over each of the tiers, specifying a single hyphen or a single space character as the value of the regex to be matched, and we can then count the number of <match> elements in the result. We have to be mindful of the namespace, though! Our XPath expression would look like the following (in these examples we continue looking for vowel letters in the string obdurodon; for the homework assignment you would, of course, look instead for either hyphens or space characters in the content of each of the tiers):
analyze-string('obdurodon', '[aeiou]')/Q{http://www.w3.org/2005/xpath-functions}match => count()
We already know that the first part of this XPath expression outputs an XML structure in a particular namespace, but this may be the first time you are seeing the Q{} notation. Because we haven’t mapped any prefix to the namespace of the <match> elements that are output by the analyze-string() function, we need to specify it literally, which we do here by putting the namespace string inside the curly braces in Q{…}. The official term in the XPath spec for this way of representing a namespace is BracedURILiteral; see https://www.w3.org/TR/xpath-31/#doc-xpath31-BracedURILiteral. Since the function returns the root element of the analyze-string() output (with its contents), we can append a path step to look on its child axis in order to find and count the <match> child elements.
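For the assignment itself, the same pattern might be applied to a tier as in the following sketch. The literal string stands in for the content of a tier, and a single hyphen or a single space character serves as the regex; the $tier binding is only for illustration.

```xpath
(: Illustrative only: $tier is bound to a literal string, not to a real tier :)
let $tier := 'go-PST-P bus-by'
return (
  (: number of hyphens: one match element per hyphen :)
  analyze-string($tier, '-')/Q{http://www.w3.org/2005/xpath-functions}match => count(),
  (: number of space characters: one match element per space :)
  analyze-string($tier, ' ')/Q{http://www.w3.org/2005/xpath-functions}match => count()
)
```

For this sample string the result should be the sequence 3, 1.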
The Q{} notation is difficult to type, and <oXygen/> helpfully predefines the fn: prefix as mapped to this namespace, even when you don’t declare that prefix yourself. This means that as long as we are working inside <oXygen/> we can, alternatively and more legibly, write:
analyze-string('obdurodon', '[aeiou]')/fn:match => count()
That predefinition is an <oXygen/> convenience, though, and we can’t rely on every XPath processor to know about a prefix that we don’t declare ourselves. (The prefix is predefined in XQuery, which uses XPath, but it isn’t predefined in XPath itself, which means that it is not reliably available in XPath when used outside of XQuery.) For that reason, it’s safe to use fn: in <oXygen/>, but before you use it with any other XPath processor, test it and make sure that it’s predefined there, too.
There are a couple of other, alternative ways to negotiate the namespace that are safe to use here, but that should be used with caution elsewhere:
analyze-string('obdurodon', '[aeiou]')/*:match => count()
The notation *:match uses a namespace wildcard to match a <match> element in any namespace. The asterisk where the namespace prefix would usually go means that any namespace is acceptable, including none (that is, this expression also matches elements in no namespace).
Similarly, in:
analyze-string('obdurodon', '[aeiou]')/*[name() eq 'match'] => count()
our path step finds all element children of the root element (an asterisk matches all element nodes, of any type) and then uses the name() function inside a predicate to filter them to keep only those for which the element name is <match>. The name() function returns the name as it is written in the serialized XML (this is called a lexical QName, which is a name with an optional namespace prefix), character by character. In the output shown above the elements have no prefix, so that name is simply match.
You could also use:
analyze-string('obdurodon', '[aeiou]')/*[local-name() eq 'match'] => count()
The local-name() function returns the non-namespace part of the element name, so it has the same meaning as the wildcard namespace here.
What won’t work here is what may seem to a human like the most natural approach:
analyze-string('obdurodon', '[aeiou]')/match => count()
This will return no results because it looks for <match> elements only if they are in no namespace, and in the output of analyze-string() the <match> elements are in the namespace they inherit from the default specified on their parent. In other words, you can 1) specify the namespace with the Q{} notation; 2) specify it with the fn: namespace prefix if it’s predefined for you, as it is in <oXygen/>; 3) ignore it with the *: namespace wildcard; or 4) ignore it by using a predicate with name() or local-name() to match on the element name as it would appear in the serialized XML, irrespective of the namespace.
@context attribute of the Schematron <rule> element
Because we want to report the error at the level of the <sentence> element (see above), we set the value of the @context attribute on our Schematron <rule> element to sentence. This ensures that if we have multiple sentences, an error will be reported (with a red squiggly line) on the sentence where it occurs, which makes it easy to find and fix. Since all of our tests, for both spaces and hyphens, are run at the level of <sentence>, we have just one Schematron <rule> element, which contains one <assert> to test the spaces and one to test the hyphens.
We use the first of the four strategies described above, the one that relies on the translate() function, to set up six variables, recording the number of spaces and hyphens in our <orth>, <translit>, and <ilg> tiers. We use the variables both because they make our <assert> statements briefer and easier to understand and because they save us from performing the same calculations more than once (once to run the test and again to report the results).
We need to run a three-way test, that is, we need to compare the count of spaces or hyphens in three strings to determine whether they all have the same count. Some programming languages permit chained comparisons, and in those languages you could write something like a = b = c to test whether a, b, and c are all equal. XPath is not that kind of language, though, so we need an alternative strategy. One straightforward approach would combine two tests in one:
$orth_spaces eq $translit_spaces and $orth_spaces eq $ilg_spaces
In a combined test with the and operator, the result of the test is true only if both parts succeed. We don’t need to run a third test, comparing transliteration to interlinear glossing directly, because equality is transitive: if the transliteration and the interlinear glossing each have the same count as the orthographic tier, they necessarily have the same count as each other.
We opted, though, for a more elegant approach. If we take the distinct values of our counts for all tiers, removing any repeated values, there will be one item remaining only if all values are equal. For that reason, we pass the three values into the distinct-values() function, count the distinct values, test whether the count is equal to 1, and report a problem if it isn’t:
<assert test="($orth_spaces, $translit_spaces, $ilg_spaces)
=> distinct-values()
=> count()
eq 1">
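To see why the count of distinct values works as an equality test, here is a small sketch with literal integers standing in for the three variables (the numbers are made up for illustration):

```xpath
(
  (: all three counts agree: one distinct value, so the test is true :)
  (6, 6, 6) => distinct-values() => count() eq 1,
  (: one count differs: two distinct values, so the test is false :)
  (6, 6, 5) => distinct-values() => count() eq 1
)
```

The result should be the sequence true(), false(). The same test would extend to a fourth or fifth tier without any additional comparisons.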
To make it easier for the human to find the error, we used the Schematron <value-of> element to report the space or hyphen counts for each tier. It is possible to find and report the specific words with discrepancies in their hyphens, but not with just the approach above because we’re doing our counting on the level of the entire line, and not on a word-by-word basis. If you need to do the equivalent of operating on the words for your project, or if you’re just curious, let us know and we’ll be happy to explore that with you.