Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-08-09T23:14:21+0000


Relax NG content models

What you know so far

The general guide to Relax NG at http://dh.obdurodon.org/relaxng.xhtml describes how to model elements that contain only other elements and elements that contain only text. To refresh your memories:

Element content

poem = element poem { stanza+ }

The preceding example says that there is an element of type <poem> that must contain one or more elements of type <stanza> and nothing else.

Text content

line = element line { text }

The preceding example says that there is an element of type <line> that contains only plain text. Note that text is a reserved word in Relax NG, and when used in a content model, it means plain text, and not an element of type <text>. (There is, of course, also a way to specify an element of type <text> if you need to, but this isn’t it.)
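To see the two kinds of content model working together, here is a minimal sketch of a complete schema. The start line and the definition of stanza are our own assumptions for illustration, since the examples above do not define them. The last definition, which the poem schema does not use, shows one way to declare an element that really is named text.

```rnc
# Assumed for illustration: <poem> is the root element
start = poem
poem = element poem { stanza+ }
# Assumed for illustration: a stanza contains one or more lines
stanza = element stanza { line+ }
line = element line { text }

# An element named <text> (hypothetical, not referenced above).
# The pattern name is escaped with a backslash because text is a reserved word;
# the element name after the element keyword does not need escaping.
\text = element text { text }
```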

One possibly unexpected property of the reserved keyword text is that it means zero or more textual characters. This means, perhaps counterintuitively, that an absence of text counts as text! For example,

<line/>

is an empty element written with the XML self-closing tag, which means the same thing as <line></line>. It is valid against a Relax NG content model that says that <line> elements must contain text, because zero textual characters still counts as text. There are a few ways to override this unusual (and often inconvenient) default behavior; ask us for the details if it comes up in your project.
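One possible way to require at least one character is to replace the keyword text with an XML Schema datatype and a length facet. This is only a sketch of one option, not necessarily the approach we would recommend for your project, and it works only for elements that contain nothing but text, because a datatype cannot be combined with child elements.

```rnc
# Hypothetical stricter variant of the model above:
# the content of <line> must be a string at least one character long
line = element line { xsd:string { minLength = "1" } }

# Valid:    <line>Whose woods these are I think I know.</line>
# Invalid:  <line/>
```

The xsd prefix is predeclared in Relax NG compact syntax, so no extra declaration is needed. Note that this sketch rules out only the completely empty case.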

Additional types of content models

Mixed content

In addition to the preceding, Relax NG provides a special notation for elements with mixed content, that is, with a combination of plain text and elements. This notation uses the reserved keyword mixed (reserved means that it has a special, pre-defined meaning that Relax NG automatically knows about) and an extra set of curly braces. As an example, if you have a paragraph that can contain, say, a mixture of plain text, <title> elements, and <emphasis> elements, this can be described in Relax NG as:

paragraph = element paragraph { mixed { ( title | emphasis )* } }

Reading from the outside in, this means that an element of type <paragraph> contains mixed content, that is, plain text mixed in with whatever is inside the embedded curly braces. What is embedded in this case is an or-group, which says to choose either an element of type <title> or an element of type <emphasis>, and to make that choice zero or more times. The vertical bar (pipe) means make a choice; the asterisk means do it zero or more times.

Importantly, the asterisk must be outside the parentheses because what is repeatable is the act of choosing between the two types of element content, and not the titles or instances of emphasis. ( title* | emphasis* ) is valid Relax NG, but it means something completely different from ( title | emphasis )*. What does it mean and why are the meanings different?

All together (now reading from the inside out), the model means you’ll have zero or more instances of elements of type <title> and <emphasis>, in any order, with plain text mixed in anywhere before, between, or after them. As we mention above, plain text means zero or more textual characters, so mixed content allows text before, between, and after elements, but does not require it.
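The following sketch places the paragraph model in a small complete schema and lists, as comments, some instances that it accepts. The start line and the definitions of title and emphasis are assumptions for illustration; in a real project those elements might have richer content.

```rnc
start = paragraph
paragraph = element paragraph { mixed { ( title | emphasis )* } }
# Assumed for illustration: both inline elements contain only text
title = element title { text }
emphasis = element emphasis { text }

# Each of these instances is valid against the model above:
#   <paragraph>I <emphasis>really</emphasis> enjoyed <title>Oliver Twist</title>.</paragraph>
#   <paragraph>Plain text only.</paragraph>
#   <paragraph><title>Oliver Twist</title><emphasis>twice</emphasis><title>Bleak House</title></paragraph>
#   <paragraph/>
```

The last two instances show that text is allowed but not required: one contains only elements, and the other contains nothing at all.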

There are other ways to model mixed content, but we recommend using the keyword mixed (which we use consistently in our own work) because the presence of that word makes it easy for human developers to see at a glance that they are working with a mixed content model.
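For comparison, here is a sketch of two of those other ways, shown commented out beneath the form we recommend. All three are intended to accept the same documents; title and emphasis are assumed to be defined elsewhere in the schema.

```rnc
# 1. Recommended: the mixed keyword
paragraph = element paragraph { mixed { ( title | emphasis )* } }

# 2. The interleave operator (&) written out; mixed { p } is shorthand for text & p
# paragraph = element paragraph { text & ( title | emphasis )* }

# 3. text as one more alternative in a repeatable or-group
# paragraph = element paragraph { ( text | title | emphasis )* }
```

Only the first form announces mixed content at a glance, which is why we prefer it.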

Note that if you use the keyword mixed, it is an error to use the keyword text inside the or-group. That’s because mixed already says that text is allowed anywhere inside the content.
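As a quick illustration, the commented-out definition below is the kind of model that a Relax NG validator is expected to reject as a schema error, and the definition after it is the corrected form. As before, title and emphasis are assumed to be defined elsewhere.

```rnc
# Error: text appears inside mixed, which already allows text everywhere
# paragraph = element paragraph { mixed { ( text | title | emphasis )* } }

# Correct: leave text out of the or-group
paragraph = element paragraph { mixed { ( title | emphasis )* } }
```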

Empty elements

As we discussed earlier, elements may also be empty. An example of the Relax NG syntax for describing an empty element is

lineBreak = element lineBreak { empty }

empty is also a reserved word, so it means that the element is empty; it does not mean that a line break contains an element called <empty>. (You can have an element called <empty>, of course, but you have to describe it differently in Relax NG.)
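For the curious, here is a sketch of one way to describe an element that really is called <empty>. The placeholder element is hypothetical and is there only to show how the escaped pattern name is referenced.

```rnc
# The pattern name is escaped with a backslash because empty is a reserved word;
# the element name after the element keyword does not need escaping.
# The final empty is the keyword, so this <empty> element has no content.
\empty = element empty { empty }

# Hypothetical container that refers to the escaped pattern name
placeholder = element placeholder { \empty* }
```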

An empty element, that is, one with no text or data content, may nonetheless have attributes, as in:

characterRef = element characterRef { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }

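The preceding Relax NG snippet means that a <characterRef> element has four components: an xml:id attribute, a type attribute, a living attribute, and no data content. These Relax NG statements can be understood as follows: the value of the xml:id attribute must be of datatype xsd:ID, which means that it must be a valid XML name (it cannot, for example, contain spaces) and must be unique within the document; the value of the type attribute may be any text; and the value of the living attribute must be exactly one of the three strings yes, no, or unknown.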

This construction might be used as follows in an XML document (assuming that by living we mean alive at the end of the story):

<characterRef xml:id="OliverTwist" type="orphan" living="yes"/>

Note that this is an empty element (using the self-closing single tag notation). The empty keyword in the Relax NG rule specifies that the element must not have content, that is, must not have anything between its start- and end-tags. Empty elements may have attributes because attributes are properties, rather than content.
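To tie the pieces together, here is a sketch of the same rules inside a small complete schema, followed by comments showing instances that should and should not validate. The cast wrapper element and the start line are our assumptions for illustration, as are the sample characters other than Oliver Twist.

```rnc
start = cast
# Assumed wrapper so that several references can appear in one document
cast = element cast { characterRef+ }
characterRef = element characterRef { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }

# Valid (attributes may appear in any order, and the two-tag form is also empty):
#   <characterRef xml:id="OliverTwist" type="orphan" living="yes"/>
#   <characterRef living="no" type="thief" xml:id="BillSikes"></characterRef>
#
# Invalid:
#   <characterRef xml:id="Fagin" type="thief" living="dead"/>
#       living must be yes, no, or unknown
#   <characterRef xml:id="Nancy" type="thief" living="no">Nancy</characterRef>
#       the element must not have content
#   <characterRef type="orphan" living="yes"/>
#       the xml:id attribute is required
```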