Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2022-08-09T23:14:21+0000
The general guide to Relax NG at http://dh.obdurodon.org/relaxng.xhtml describes how to model elements that contain only other elements and elements that contain only text. To refresh your memories:
poem = element poem { stanza+ }
The preceding example says that there is an element of type <poem> that must contain one or more elements of type <stanza> and nothing else.
line = element line { text }
The preceding example says that there is an element of type <line> that contains only plain text. Note that text is a reserved word in Relax NG, and when used in a content model, it means plain text, and not an element of type <text>. (There is, of course, also a way to specify an element of type <text> if you need to, but this isn’t it.)
One possibly unexpected property of the reserved keyword text is that it means zero or more textual characters. This means, perhaps counterintuitively, that an absence of text counts as text! For example, <line/> is an empty element written with the XML self-closing tag, which means the same thing as <line></line>. It is valid according to a Relax NG content model that says that <line> elements must contain text, because it contains zero textual characters. There are a few ways to override this unusual (and often inconvenient) default behavior; ask us for the details if it comes up in your project.
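One way to require at least one character, offered here as a sketch rather than as the only approach, is to replace text with an XML Schema string datatype constrained by a minLength parameter. This assumes that <line> holds character data only, because a datatype pattern like this cannot be combined with child elements. Whitespace counts as characters for xsd:string, so a line that contains only spaces would still be accepted.

```rnc
# Sketch: require at least one character inside <line>.
# Assumption: <line> contains character data only (no child elements).
line = element line { xsd:string { minLength = "1" } }

# Valid:    <line>Please, sir, I want some more.</line>
# Invalid:  <line/>
# Invalid:  <line></line>
```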
In addition to the preceding, Relax NG provides a special notation for elements with mixed content, that is, with a combination of plain text and elements. This notation uses the reserved keyword mixed (reserved means that it has a special, pre-defined meaning that Relax NG automatically knows about) and an extra set of curly braces. As an example, if you have a paragraph that can contain, say, a mixture of plain text, <title> elements, and <emphasis> elements, this can be described in Relax NG as:
paragraph = element paragraph { mixed { ( title | emphasis )* } }
Reading from the outside in, this means that an element of type <paragraph> contains mixed content, that is, plain text mixed in with whatever is inside the embedded curly braces. What is embedded in this case is an or-group, which says either an element of type <title> or an element of type <emphasis>, and that you make that choice zero or more times. The vertical bar (pipe) means make a choice; the asterisk means do it zero or more times.
Importantly, the asterisk must be outside the parentheses because what is repeatable is the act of choosing between the two types of element content, and not the titles or instances of emphasis. ( title* | emphasis* ) is valid Relax NG but it means something completely different from ( title | emphasis )*. What does it mean and why are the meanings different?
All together (now reading from the inside out), the model means you’ll have zero or more instances of elements of type <title> and <emphasis>, in any order, with plain text mixed in anywhere before, between, or after them. As we mention above, plain text means zero or more textual characters, so mixed content allows text before, between, and after elements, but does not require it.
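To make this concrete, here is a small self-contained schema built around the paragraph rule, with sample instances shown as comments. The start line and the definitions of title and emphasis are assumptions added only so that the sketch is complete; your own project may define those elements differently.

```rnc
# Assumption: <paragraph> is the root element of the sample documents.
start = paragraph
paragraph = element paragraph { mixed { ( title | emphasis )* } }

# Assumed definitions, for illustration only.
title = element title { text }
emphasis = element emphasis { text }

# All of the following are valid against this schema:
#   <paragraph>I <emphasis>loved</emphasis> reading <title>Oliver Twist</title>.</paragraph>
#   <paragraph>Plain text and nothing else.</paragraph>
#   <paragraph><title>Oliver Twist</title><emphasis>Really!</emphasis></paragraph>
#   <paragraph/>
```

The last two instances show that the text is optional: the third has elements with no text around them, and the fourth has neither text nor elements.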
There are other ways to model mixed content, but we recommend using the keyword mixed (which we use consistently in our own work) because the presence of that word makes it easy for human developers to see at a glance that they are working with a mixed content model.
Note that if you use the keyword mixed, it is an error to use the keyword text inside the or-group. That’s because mixed already says that text is allowed anywhere inside the content.
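The sketch below sets the recommended form beside the erroneous one and beside one of the alternative spellings that does not use mixed. The start line and the definitions of title and emphasis are assumptions added to make the fragment complete; the two alternatives are commented out so that paragraph is defined only once.

```rnc
start = paragraph

# Recommended: mixed already allows text anywhere, so text is not listed.
paragraph = element paragraph { mixed { ( title | emphasis )* } }

# Error: text appears inside mixed, which already supplies it.
# paragraph = element paragraph { mixed { ( text | title | emphasis )* } }

# Alternative without mixed: text becomes one more choice in the or-group.
# paragraph = element paragraph { ( text | title | emphasis )* }

# Assumed definitions, for illustration only.
title = element title { text }
emphasis = element emphasis { text }
```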
As we discussed earlier, elements may also be empty. An example of the Relax NG syntax for describing an empty element is
lineBreak = element lineBreak { empty }
empty is also a reserved word, so it means that the element is empty; it does not mean that a line break contains an element called <empty>. (You can have an element called <empty>, of course, but you have to describe it differently in Relax NG.)
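As a brief sketch of how that can look: in the compact syntax a reserved word may be written directly as an element name after the element keyword, but when it is used as the name of a definition it must be preceded by a backslash. The <stanza> content model here is hypothetical and exists only to show the escaped name in use.

```rnc
# Hypothetical content model, for illustration only.
stanza = element stanza { ( line | \empty )+ }
line = element line { text }

# \empty is a definition name; the unescaped word empty inside the
# curly braces is still the reserved keyword meaning "no content".
\empty = element empty { empty }
```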
An empty element, that is, one with no text or data content, may nonetheless have attributes, as in:
characterRef = element characterRef { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }
The preceding Relax NG snippet means that a <characterRef> element has four components: an xml:id attribute, a type attribute, a living attribute, and no data content. These Relax NG statements can be understood as follows:
The xml:id attribute has datatype xsd:ID, which is a datatype that refers to a string that must be unique in its document. (Note that XML is case sensitive, so the attribute name must be xml:id [lower-case] but the datatype in your Relax NG must be xsd:ID [upper-case].) That is, no element can have an xml:id attribute value that is the same as the xml:id attribute value of any other element. That the value is unique is handy for ensuring that you can distinguish, in this case, references to specific characters. The attribute name xml:id and the content model xsd:ID are the traditional way to define a unique identifier. You may use other attribute names (such as id) or content models (such as text), but if you do that, there is no guarantee that two different elements won’t have the same value, which isn’t what you want if you need to be able to point to something uniquely and reliably. Unique id values (that is, attributes modeled as datatype xsd:ID) are subject to the same naming restrictions as element and attribute names, so they cannot, for example, contain spaces or begin with digits.
The type attribute can be any text. This is a common type of content model for attributes where the actual content could be just about anything, that is, where you can’t identify in advance a small set of positive values. It offers less protection against error than if you were to list the legal values (see below), but it offers more flexibility in situations where your documents are varied and you don’t know all possible values in advance.
The living attribute is given as a choice of three possible string values (the quotation marks can be singles or doubles, as long as they are paired correctly, but string values must be quoted). This rule allows only those three values, and raises an error if the tagger enters anything else. We use an or-group like this when we can identify all possible values in advance.
This construction might be used as follows in an XML document (assuming that by living we mean alive at the end of the story):
<characterRef xml:id="OliverTwist" type="orphan" living="yes"/>
Note that this is an empty element (using the self-closing single tag notation). The empty keyword in the Relax NG rule specifies that the element must not have content, that is, must not have anything between its start- and end-tags. Empty elements may have attributes because attributes are properties, rather than content.
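The rules above can be gathered into one small schema, with instances that pass and fail shown as comments. The <cast> wrapper element and the start line are assumptions introduced only to make the sketch self-contained, and the failing instances are invented for illustration.

```rnc
# Assumption: a hypothetical <cast> root that holds the references.
start = element cast { characterRef+ }

characterRef = element characterRef { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }

# Valid:
#   <characterRef xml:id="OliverTwist" type="orphan" living="yes"/>
# Invalid, because "maybe" is not one of the three listed values:
#   <characterRef xml:id="Fagin" type="criminal" living="maybe"/>
# Invalid, because an xsd:ID value cannot contain a space:
#   <characterRef xml:id="Bill Sikes" type="criminal" living="no"/>
# Invalid, because the element has content between its tags:
#   <characterRef xml:id="Nancy" type="thief" living="no">Nancy</characterRef>
```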