Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2012-10-29T03:24:26+0000

Schematron test

The text

For this test we are using Skyrim, which you can download from [link]. For test purposes you should delete the line at the top that links to the Relax NG schema. You won’t need to do any Relax NG validation, and you don’t have the Relax NG schema file anyway, so deleting this line will spare you a lot of distracting “schema file not found” error messages.

How information about factions is represented in the text

The file has a cast of characters at the top that contains a list of characters and factions that are mentioned in the body section, below the cast list. For the test we are going to ignore characters and concentrate only on factions. An entry for a faction in the cast list looks like:

<faction id="MythicDawn" alignment="evil"/>

Meanwhile, a reference to a faction in the body looks like:

The <faction ref="MythicDawn">assassins</faction> first attacked …

Every <faction> element in the body should point to a matching <faction> element in the cast list. It does this through its @ref attribute: a <faction> element in the body always has a @ref attribute, and the value of that @ref attribute should match the @id attribute of some <faction> element in the cast list. In other words, the cast list should contain exactly the factions that occur in the body: every faction that occurs in the body should point to some faction in the cast list, and every faction in the cast list should be pointed to from somewhere in the body.

How a developer could screw up

There are at least two ways a developer could mangle these cross-references:

  1. It’s possible to encode a <faction> element in the body with a @ref attribute that doesn’t point to (that is, correspond to) the @id attribute of a <faction> element in the cast list. The @id attribute might be on a <character> element in the cast list, instead of on a <faction> element, or there might be no corresponding @id attribute at all.
  2. It’s possible to encode an unused <faction> element in the cast list, that is, one that is not pointed to by the @ref attribute of any <faction> element in the body. Since the inventory of factions in the cast list is supposed to summarize which factions occur in the body, such an error would bring the list out of sync with the reality of the body.
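To make the two kinds of error concrete, here is a small illustrative fragment. It is not taken from the Skyrim file: the wrapper element names (<skyrim>, <cast>, <body>, <paragraph>), the “Blades” and “DarkBrotherhood” values, and the attribute values on Jauffre are assumptions made up for the example.

```xml
<!-- Hypothetical fragment. The wrapper names <skyrim>, <cast>, <body>, and
     <paragraph>, the "Blades" faction, and Jauffre's attribute values are
     assumptions, not copied from the real file. -->
<skyrim>
  <cast>
    <character id="Jauffre" loyalty="empire" alignment="good"/>
    <faction id="MythicDawn" alignment="evil"/>
    <!-- Error 2: no <faction> in the body points to "Blades" -->
    <faction id="Blades" alignment="good"/>
  </cast>
  <body>
    <paragraph>
      <!-- Correct: @ref matches the @id of a <faction> in the cast list -->
      The <faction ref="MythicDawn">assassins</faction> first attacked …
      <!-- Error 1, first variant: "Jauffre" is the @id of a <character>, not of a <faction> -->
      <faction ref="Jauffre">the order</faction>
      <!-- Error 1, second variant: no element in the cast list has this @id -->
      <faction ref="DarkBrotherhood">the guild</faction>
    </paragraph>
  </body>
</skyrim>
```

Planting errors like these in your copy of the file is a quick way to confirm that your rules fire when they should.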

The task

You should write a Schematron schema that will guard against the two types of error described above by checking for consistency in two ways:

  1. Write a rule for factions in the cast list to verify that all factions mentioned there also occur in the body. That is, there should be no faction listed in the cast list that is not also present in the body.
  2. Write a rule for factions in the body that verifies that they have a @ref attribute that points to an @id attribute on a <faction> element in the cast list. Note that it isn’t enough to check for the existence of a corresponding @id attribute, since there are @id attributes on <character> elements in the cast list, and not just on <faction> elements. Not only must the @id exist, but it must be associated with a <faction> in the cast list.
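One possible shape for such a schema is sketched below. Treat it as a starting point under stated assumptions rather than as the expected answer: it assumes that the cast list is wrapped in an element named <cast>, that the body is wrapped in an element named <body>, and that the document is in no namespace. Check those names against the actual file and adjust the XPath expressions accordingly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Check 1: each faction in the cast list must be referenced from the body.
         The element names "cast" and "body" are assumptions. -->
    <sch:rule context="cast/faction">
      <sch:assert test="@id = //body//faction/@ref">
        The faction "<sch:value-of select="@id"/>" is in the cast list
        but is never referenced in the body.
      </sch:assert>
    </sch:rule>
    <!-- Check 2: each faction in the body must point to a <faction> in the
         cast list. Comparing against //cast/faction/@id (and not //cast/*/@id)
         is what excludes @id values that belong to <character> elements. -->
    <sch:rule context="body//faction">
      <sch:assert test="@ref = //cast/faction/@id">
        The faction reference "<sch:value-of select="@ref"/>" does not match
        the @id of any faction in the cast list.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Both tests rely on the XPath general comparison operator =, which is true when the value on the left matches at least one item in the sequence on the right. A <faction> in the body with no @ref at all also fails the second assertion, because an empty sequence never equals anything.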

You should check the effectiveness of your rules by introducing errors into the XML file to see whether your Schematron rules report them.

What to do if you finish early

You don’t have to do anything more than what’s described above. In case you finish early, though, consider whether the same type of validation could be used to check cross-references on <character> elements: is every character listed in the cast list also encountered in the body, and does every character mentioned in the body have a @ref attribute that points to the @id attribute of a <character> element in the cast list?

This turns out to be harder than with factions because there are elements in the body like:

the <character ref="hero Jauffre MartinSeptim">three of them</character> made their way

The problem here is that there is no <character> element in the cast list with an @id attribute whose value is hero Jauffre MartinSeptim. Instead, this is a pointer to three separate characters in the cast list. The strategy for checking coreference therefore has to involve breaking apart the value of the @ref attribute and checking each of the three pointers separately. This is the sort of task for which the XPath tokenize() function was created.
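As a rough sketch of how tokenize() might be used here, the rule below splits each @ref on spaces and requires every resulting token to match the @id of a <character> in the cast list. As before, the element names <cast> and <body> are assumptions about the file, and the schema requires an XPath 2.0 query binding.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- "cast" and "body" are assumed wrapper element names. -->
    <sch:rule context="body//character">
      <!-- normalize-space() trims stray whitespace so that tokenize()
           does not produce empty tokens; each token is then checked
           separately against the character @id values. -->
      <sch:assert test="every $r in tokenize(normalize-space(@ref), ' ')
                        satisfies $r = //cast/character/@id">
        At least one pointer in "<sch:value-of select="@ref"/>" does not
        match the @id of any character in the cast list.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Note that an every expression over an empty sequence is true, so a <character> with a missing or empty @ref would pass this test silently; adding a separate assertion that @ref is present would close that gap.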

There is a hypothetical parallel problem concerning the other half of the assignment. Suppose there is a <character id="Eric" loyalty="empire" alignment="neutral"/> element in the cast list, but the only time Eric occurs in the body is in combination with another character, e.g., <character ref="Eric David">. We can’t just check whether there is a @ref attribute in the body whose value matches the string Eric because there isn’t; here, too, we have to break apart the value of the @ref and check each part separately. This situation doesn’t happen to occur in our text, but it is possible and therefore something against which a well-designed development environment would protect the user.
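A sketch of this reverse check follows, again assuming wrapper elements named <cast> and <body> and an XPath 2.0 query binding. It gathers the individual pointers from every @ref on a <character> in the body and asks whether the @id under examination is among them.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- "cast" and "body" are assumed wrapper element names. -->
    <sch:rule context="cast/character">
      <!-- The for expression flattens all body @ref values into one
           sequence of individual pointers, so ref="Eric David"
           contributes both "Eric" and "David". -->
      <sch:assert test="@id = (for $r in //body//character/@ref
                               return tokenize(normalize-space($r), ' '))">
        The character "<sch:value-of select="@id"/>" is in the cast list
        but is never referenced in the body.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

With this test, a cast entry for Eric is satisfied by <character ref="Eric David"> in the body, whereas a plain comparison of @id against the unsplit @ref values would report it as unused.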