Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2018-02-17T19:42:12+0000

Test #2: Relax NG: Answers

Your task for this test was to create a schema for an XML document that found a balance between constraint and flexibility. Our solution (not the only possible solution) is below:

start = article
article = element article { meta, body }
meta = element meta { title, agency?, author*, date }
title = element title { text }
agency = element agency { text }
author = element author { text }
date = element date { text }

body = element body { p+ }
p = element p { mixed { ( q | place | name )* } }

q = element q { sp, mixed { ( place | name )* } }
place = element place { type, text }
name = element name { ref, text }

sp = attribute sp { text }
type = attribute type { "building" | "state" | "city" | "region" }
ref = attribute ref { "animal" | "human" }

We’ve aimed here for a balance in our schema between constraining the content as much as possible (to prevent errors) and allowing the same schema to be used for other (hypothetical) documents of this type. You may have struck this balance in a different place than we did, which is fine; the important thing is to consider the competing needs for control, on the one hand, and flexibility, on the other.

Take a look at our content model for <article>. We used the comma as our combination indicator in this case because we want to maintain a consistent order of the <meta> and <body> elements in our (hypothetical) corpus. A computer wouldn’t care about the order, but consistency makes things easier for a human.
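For comparison, here is a minimal sketch of what a looser alternative might look like, using the ampersand (the Relax NG interleave connector) instead of the comma. This is not part of our solution, and the contents of meta and body are simplified to plain text only so that the fragment stands on its own:

```rnc
# Hypothetical variant: <meta> and <body> may appear in either order
start = article
article = element article { meta & body }
meta = element meta { text }   # simplified for this sketch
body = element body { text }   # simplified for this sketch
```

With this variant a document that puts <body> before <meta> would also be valid, which is exactly the inconsistency that the comma in our solution rules out.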

We also used repetition indicators to show that certain elements are optional and that some may appear more than once. For example, a story may have more than one <author>, or it may be anonymous, so we used the asterisk to allow zero or more <author> elements. We used a question mark after <agency> because the agency of publication may be unknown, but (we assume) there will never be more than one. A tighter model that required exactly one of each of these elements would be correct if we could be confident that all documents in our (hypothetical) corpus would contain all of them exactly once. In the content model for <body> we follow <p> with a plus sign, since an article must have at least one paragraph, but it might (and probably does) have more.
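The tighter model mentioned above might be sketched as follows. It assumes a (hypothetical) corpus in which every article has exactly one agency and exactly one author, which is not something we know about the real documents:

```rnc
# Hypothetical stricter variant: no repetition indicators,
# so each child of <meta> must appear exactly once, in this order
start = meta
meta = element meta { title, agency, author, date }
title = element title { text }
agency = element agency { text }
author = element author { text }
date = element date { text }
```

Under this variant an anonymous article, or one with two authors, would fail validation, so it is only appropriate when that outcome would signal a real error.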

Two of the remaining elements, <p> and <q>, contain mixed content, and when we create a content model for an element that contains mixed content, we always use the same construction. Take a look at the content model for <q>:

q = element q { sp, mixed { ( place | name )* } }

First we have the label for the element (q), then the equal sign, and then we specify that the label represents an element (using the keyword element), followed by the element name as it appears in the XML (q). Then, inside our content model, which is between the curly braces, we specify the attributes first (if there are any), followed by a comma. We suggest putting attributes as the first item inside a content model because attributes are written inside the start tag, that is, near the beginning of the serialization. Importantly, attributes are not mixed into mixed content the way elements are, so although including the attribute labels inside a mixed group won’t raise a validation error, it nonetheless isn’t good practice because it makes your schema harder to understand.
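To illustrate the point about attribute placement, the sketch below shows the form we recommend next to the form we discourage. The second definition is commented out, because a grammar may define q only once without a combine operator; the remaining definitions are copied from our solution so that the fragment is complete:

```rnc
start = q

# Recommended: the attribute comes first, outside the mixed group
q = element q { sp, mixed { ( place | name )* } }

# Accepted by a validator, but harder to read: the attribute
# is buried inside the mixed group
# q = element q { mixed { sp, ( place | name )* } }

sp = attribute sp { text }
place = element place { type, text }
name = element name { ref, text }
type = attribute type { "building" | "state" | "city" | "region" }
ref = attribute ref { "animal" | "human" }
```

Both definitions accept the same documents; the difference is only in how easily a human reader can see that sp is an attribute and not part of the element content.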

We specify the actual content of the element as mixed, using the mixed keyword. Since it can contain zero or more places or names in any order, we represent that with a repeatable or-group. Reading from the inside out, the model says that you may choose to have a <place> or a <name>, and you can make that choice (the same choice or the other choice) zero or more times. The vertical bar (called a pipe) says make a choice and the asterisk outside the parentheses means that you can make that choice zero or more times. Note that the repetition indicator (asterisk) goes outside the parentheses, since what’s optional and repeatable is choosing one or the other element. The way to read this model is that a <q> element contains an obligatory sp attribute and then, between the start and end tags, zero or more instances of <place> and <name> elements in any order, with plain text allowed anywhere.
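It may help to see what the mixed keyword abbreviates. In Relax NG, mixed { P } is shorthand for interleaving P with text, so the content model for <q> could also be spelled out as in the sketch below. The commented-out contrast and the sample instance (including the values "mayor", "Pittsburgh", and "Ann") are our own illustrations, not part of the test:

```rnc
start = q

# Long form of: q = element q { sp, mixed { ( place | name )* } }
q = element q { sp, ( text & ( place | name )* ) }

# Contrast: with the asterisks inside the parentheses, a single <q>
# could contain places or names, but never both
# q = element q { sp, mixed { ( place* | name* ) } }

sp = attribute sp { text }
place = element place { type, text }
name = element name { ref, text }
type = attribute type { "building" | "state" | "city" | "region" }
ref = attribute ref { "animal" | "human" }

# A hypothetical instance that the active definition accepts:
# <q sp="mayor">We met <name ref="human">Ann</name> in
#   <place type="city">Pittsburgh</place> last week.</q>
```

We prefer the mixed keyword in practice because it states the intent directly, but the long form shows why plain text is allowed anywhere among the child elements.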

Last, you’ll see that we enumerated the permitted values in the patterns for the type and ref attributes, specifying them as or-groups of strings. Whether this is a good solution depends on how consistent the corpus is. The or-group protects you from entering a value that isn’t on the list (for example, a misspelling), but it also means that if a new document needs a value we didn’t anticipate, such as a different type of place, it won’t be valid until the schema is updated. Whether you use the or-group or the reserved word text here depends on how you envision the (hypothetical) corpus.
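The two options for an attribute value might be sketched like this, using the type attribute as the example. Only one definition of type can be active at a time, so the enumerated version from our solution is shown as a comment:

```rnc
start = place
place = element place { type, text }

# Enumerated (as in our solution): only these four strings are valid
# type = attribute type { "building" | "state" | "city" | "region" }

# Open alternative: any string is valid, including misspellings
type = attribute type { text }
```

The open version will never reject a new kind of place, but it also will not catch a value like "citty", so the choice comes down to how much you trust the markup and how often you expect the list to grow.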