Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2021-01-29T17:07:26+0000
The general guide to Relax NG at http://dh.obdurodon.org/relaxng.xhtml describes how to model elements that contain only other elements and elements that contain only text. To refresh your memories:
poem = element poem { stanza+ }
The preceding example says that there is an element of type <poem> that must contain one or more elements of type <stanza> and nothing else.
line = element line { text }
The preceding example says that there is an element of type <line> that contains only plain text. Note that text is a reserved word in Relax NG, and when used in a content model, it means plain text, and not an element of type <text>. (There is, of course, also a way to specify an element of type <text> if you need to, but this isn’t it.)
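As a minimal sketch of that distinction, the fragment below contrasts the keyword with an element that happens to be named text. The <wrapper> element and the pattern name textElement are hypothetical, chosen only for illustration. The point it relies on is that a word directly after element is read as an element name, while the same word inside the curly braces is read as the keyword.

```rnc
start = wrapper

# Keyword use: <wrapper> would hold plain text only
# wrapper = element wrapper { text }

# Element use: <wrapper> holds one element of type <text>,
# reached through a named pattern (textElement is a made-up name)
wrapper = element wrapper { textElement }
textElement = element text { text }
```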
One unexpected aspect of the reserved keyword text is that it means “zero or more textual characters”. This means, perhaps counterintuitively, that an absence of text counts as text! For example, <line/> is an empty element, and it uses the XML self-closing tag as a representation, which means the same thing as <line></line>. This is valid according to a Relax NG content model that says that <line> elements must contain text because it contains zero textual characters. There are a few ways to override this unusual default behavior; ask us for the details if it comes up in your project.
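One possible override, offered as a sketch rather than as the approach the authors necessarily have in mind, is to replace the keyword text with a datatype that carries a minimum length. The example assumes the XML Schema datatype library, whose xsd prefix is predeclared in the compact syntax.

```rnc
start = line

# xsd:token collapses whitespace before the length is checked,
# so <line/> and a <line> holding only spaces are both rejected
line = element line { xsd:token { minLength = "1" } }
```

A datatype pattern like this describes the whole content of the element, so it suits elements that hold only text, not elements with mixed content.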
In addition to the preceding, Relax NG provides a special notation for elements with mixed content, that is, with a combination of plain text and elements. This notation uses the reserved keyword mixed (reserved means that it has a special, pre-defined meaning that Relax NG automatically knows about) and an extra set of curly braces. As an example, if you have a paragraph that can contain, say, a mixture of plain text, <title> elements, and <emphasis> elements, this can be described in Relax NG as:
paragraph = element paragraph { mixed { ( title | emphasis )* } }
Reading from the outside in, this means that an element of type <paragraph> contains mixed content, that is, plain text mixed in with whatever is inside the embedded curly braces. What is embedded in this case is an or-group, which says either an element of type <title> or an element of type <emphasis>, and that you make that choice zero or more times. The vertical bar (pipe) means make a choice; the asterisk means do it zero or more times. All together (now reading from the inside out), the model means you’ll have zero or more instances of elements of type <title> and <emphasis>, in any order, with plain text mixed in anywhere before, between, or after them.
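To make the mixed content model concrete, here is a small self-contained schema built around the rule above, with sample instances shown as comments. The content models for <title> and <emphasis> (plain text) are assumptions added so that the fragment is complete.

```rnc
start = paragraph
paragraph = element paragraph { mixed { ( title | emphasis )* } }

# Assumed for illustration: both child elements contain only plain text
title = element title { text }
emphasis = element emphasis { text }

# Valid:   <paragraph>I read <title>Oliver Twist</title> and <emphasis>loved</emphasis> it.</paragraph>
# Valid:   <paragraph><emphasis>Really</emphasis><title>Bleak House</title></paragraph>
# Valid:   <paragraph>Only plain text here.</paragraph>
# Valid:   <paragraph/>
# Invalid: <paragraph>See <line>this</line>.</paragraph>   (<line> is not in the or-group)
```

The <paragraph/> case is valid because both the text and the or-group may occur zero times.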
There are other ways to model mixed content, but we recommend using the keyword mixed (which we use consistently in our own work) because the presence of that word makes it easy for human developers to see at a glance that they are working with a mixed content model.
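For comparison, the sketch below shows two of those other spellings next to the recommended one. All three should accept the same documents; the alternatives are commented out so that the fragment defines paragraph only once.

```rnc
# Recommended: the keyword announces the mixed content model
paragraph = element paragraph { mixed { ( title | emphasis )* } }

# Interleave: text may be interleaved with the or-group
# paragraph = element paragraph { text & ( title | emphasis )* }

# Plain or-group: text is just one more choice
# paragraph = element paragraph { ( text | title | emphasis )* }
```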
As we discussed earlier, elements may also be empty. An example of the Relax NG syntax for describing an empty element is
lineBreak = element lineBreak { empty }
empty is also a reserved word, so it means that the element is empty; it does not mean that <lineBreak> contains an element called <empty>. (You can have an element called <empty>, but you have to describe it differently in Relax NG.)
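The following sketch shows what the rule accepts and rejects, with instances given as comments. The last line illustrates one way to describe an element that is itself named empty; the pattern name emptyElement is hypothetical.

```rnc
start = lineBreak
lineBreak = element lineBreak { empty }

# Valid:   <lineBreak/>
# Valid:   <lineBreak></lineBreak>
# Invalid: <lineBreak>some text</lineBreak>
# Invalid: <lineBreak><empty/></lineBreak>

# An element named empty: the word after element is a name,
# and the word inside the braces is the keyword
# emptyElement = element empty { empty }
```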
An empty element, that is, one with no text or data content, may nonetheless have attributes. For example:
character = element character { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }
The preceding Relax NG snippet means that a <character> element has four components: an xml:id attribute, a type attribute, a living attribute, and no data content. These Relax NG statements can be understood as follows:
The xml:id attribute has datatype xsd:ID, which is a datatype that refers to a string that must be unique. That is, no element can have an xml:id attribute value that is the same as the xml:id attribute value of any other element. That the value is unique is handy for ensuring that you can distinguish, in this case, references to specific characters. The name xml:id and the content model xsd:ID are the traditional way to define a unique identifier. You may use other names (such as id) or content models (such as text), but if you do that, there is no guarantee that two different elements won’t have the same value, which isn’t what you want if you need to be able to point to something uniquely and reliably.
The type attribute can be any text. This is a common type of content model for attributes where the actual content could be just about anything, that is, where you can’t identify in advance a small set of possible values. It offers less protection against errors than if you were to list the legal values (see below), but it offers more flexibility in situations where your documents are varied and you don’t know all possible values in advance.
The living attribute is given as a choice of three possible string values (the quotation marks can be singles or doubles, as long as they are paired correctly, but string values must be quoted). This rule allows only those three values, and raises an error if the tagger enters anything else. We use an or-group like this when we can identify all possible values in advance.
This construction might be used as follows in an XML document (assuming that by living we mean alive at the end of the story):
<character xml:id="oliverTwist" type="orphan" living="yes"/>
Note that this is an empty element (using the self-closing single tag notation). The empty keyword in the Relax NG rule specifies that the element must not have content, that is, must not have anything between its start and end tags (empty elements may have attributes because attributes are properties, rather than content).
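Putting the pieces together, the sketch below wraps the <character> rule in a complete schema and lists a few instances as comments. It uses the living attribute that the prose describes. The <cast> wrapper element and all sample values other than the oliverTwist example are hypothetical.

```rnc
start = cast
cast = element cast { character+ }   # hypothetical wrapper element

character = element character { id, type, living, empty }
id = attribute xml:id { xsd:ID }
type = attribute type { text }
living = attribute living { "yes" | "no" | "unknown" }

# Valid:   <character xml:id="oliverTwist" type="orphan" living="yes"/>
# Valid:   <character living="no" type="thief" xml:id="billSikes"/>
#          (attributes may appear in any order)
# Invalid: <character xml:id="fagin" type="fence" living="maybe"/>
#          ("maybe" is not one of the three listed values)
# Invalid: <character xml:id="nancy" type="thief"/>
#          (living is required because it is not marked optional)
# Invalid: <character xml:id="dodger" type="thief" living="yes">Jack</character>
#          (empty forbids content between the tags)
```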