Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2024-07-16T21:43:04+0000
Relax NG (Regular Language for XML Next Generation) is a schema language that allows you to express in a formal way the structure and content of an XML document. For example:
start = poem
poem = element poem { head, body }
The above rules state that the root element of an XML document authored according to this schema must be <poem> and that the <poem> element must contain exactly one <head> element followed by exactly one <body> element. There are several ways to express the structure of an XML document using Relax NG, and the one described here (which is what we use in our own work) has a start rule followed by rules that take the form:
label = type name {content}
where label is a name you assign to the object you are modeling, type is the type of object (element or attribute), name is the name of the object (e.g., an element named poem), and content is what the element or attribute contains (see the details below). This type of statement uses what is called a named pattern. We find it most convenient to use a label for the pattern that is the same as the name of the element or attribute being defined (in the example above, the first occurrence of the word “poem,” to the left of the equal sign, is the label for the pattern and the second, after the word “element,” is the name of the element being defined), but Relax NG doesn’t require that.
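To make the difference between a label and an element name concrete, here is a minimal sketch in which the two deliberately differ. The labels poemPattern, headPattern, and bodyPattern are invented for illustration, and the plain-text content of <head> and <body> is an assumption. The documents this schema accepts still use the element names poem, head, and body.

```rnc
# Labels (left of the equal sign) are internal to the schema.
# Element names (after the word "element") are what appears in the XML.
start = poemPattern
poemPattern = element poem { headPattern, bodyPattern }
headPattern = element head { text }
bodyPattern = element body { text }
```

Using identical labels and names, as in the rest of this introduction, simply saves you from having to remember two sets of names.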
The content model for the element or attribute being defined must be wrapped in curly braces. This indicates what is directly part of the element being defined, whether it be attributes, elements, text, or all three. The term content model is somewhat misleading because:
A content model for an element includes attributes for that element even though attributes technically are properties of an element, not content. Content is what appears between the start- and end-tags, whereas attributes sit inside the start-tag. Attributes are nonetheless included in what is called a content model in Relax NG.
Elements and plain text nested immediately inside an element (between its start- and end-tags), and not inside another element nested within it, are called the children of that element. Elements and plain text that are nested more deeply are descendants, but not children. We’ll learn more about this shortly, when we get to the XPath portion of our course.
A content model contains only attributes (see above) and children of the element you are modeling, but not descendants. For example, in:
<speech style="…">
  <speaker>Hamlet</speaker>
  <lines>
    <line>Good morning, Ophelia!</line>
  </lines>
</speech>
the content model for a <speech> element would include the style attribute and the <speaker> and <lines> elements, but not the <line> element. That’s because the <line> is a descendant of <speech>, but not a child. The Relax NG declaration for a <speech> element might look like:
speech = element speech { style, speaker, lines }
The content model of an element includes only attributes and content that are directly parts of the element being specified. For example, we wouldn’t say that the element <speech> above contains <line> elements because <line> elements are not children of <speech>, even though they are descendants.
A content model consists of labels for attributes and elements that are defined elsewhere, as well as labels that are predefined, such as text for plain text. In the example above, the content model for speech contains labels for a style attribute and <speaker> and <lines> child elements, the content of which will have to be defined elsewhere in the schema. Those components of the content model are combined using combination indicators and (optionally) modified using repetition indicators.
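As a sketch of what “defined elsewhere” could look like for the speech example, the rules below supply one possible definition for each label. The content chosen for each one (plain text for the attribute and for <speaker> and <line>, one or more <line> elements inside <lines>) is an assumption made for illustration, not something the example dictates.

```rnc
speech = element speech { style, speaker, lines }

# Each label used in the content model above needs its own rule.
style = attribute style { text }
speaker = element speaker { text }
lines = element lines { line+ }

# <line> is a child of <lines>, so it appears in the content model
# for lines, not in the content model for speech.
line = element line { text }
```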
The two principal combination indicators are the comma (,) and the vertical bar (|, commonly called a pipe). (There is also an interleave connector, represented by an ampersand [&], which we do not discuss here.) The comma means sequence, that is, that the two components must occur in the specified order. In the example above, a <poem> must contain a <head> and a <body>, and the comma means that they must occur in that order. The pipe means choice; a content model of head | body would mean that the content may be either a <head> or a <body>, but that one or the other must occur.
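The contrast between sequence and choice can be sketched side by side. The element name fragment is hypothetical, introduced only so that both connectors can appear in one small schema, and the plain-text content of <head> and <body> is likewise an assumption. The comments show which documents each rule would accept or reject.

```rnc
start = poem | fragment

# Sequence: <head> must come first, then <body>.
poem = element poem { head, body }
#   accepted: <poem><head>…</head><body>…</body></poem>
#   rejected: <poem><body>…</body><head>…</head></poem>

# Choice: exactly one of the two, never both and never neither.
fragment = element fragment { head | body }
#   accepted: <fragment><head>…</head></fragment>
#   accepted: <fragment><body>…</body></fragment>
#   rejected: <fragment><head>…</head><body>…</body></fragment>
#   rejected: <fragment/>

head = element head { text }
body = element body { text }
```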
The four repetition indicators are nothing, question mark (?), plus sign (+), and asterisk (*, sometimes called a “splat”), and they operate as follows:
Nothing. Exactly one. The item is required and occurs only once. In the example above, the content model for poem contains head and body with no overt repetition indicator, which means that each must occur exactly once.
Question mark (?). Zero or one. The item may occur zero or one time, that is, it is optional. It cannot occur more than once. A statement like:
poem = element poem { head?, body }
would mean that a <poem> must have exactly one <body> element, which may or may not be preceded by one <head> element.
Plus sign (+). One or more. The item is required and may be repeated. A model like:
stanza = element stanza { line+ }
means that a <stanza> element must contain at least one <line> element, and may contain more.
Asterisk (*). Zero or more. The item is optional and repeatable. A model like:
poem = element poem { head, body, note* }
means that a <poem> element must contain exactly one <head> element followed by exactly one <body> element, followed by zero or more <note> elements.
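The four indicators can be seen together in one small schema. The model below is hypothetical (it is not the poem schema used elsewhere in this introduction), and the plain-text content of the leaf elements is an assumption. The comments list child sequences of <poem> that the model would accept or reject.

```rnc
start = poem
poem = element poem { head?, stanza+, note* }
head = element head { text }
stanza = element stanza { line+ }
line = element line { text }
note = element note { text }

# Child sequences of <poem> that this model accepts:
#   stanza
#   head, stanza
#   head, stanza, stanza, note, note
#
# Child sequences that it rejects:
#   head                  (stanza+ requires at least one stanza)
#   head, head, stanza    (head? allows at most one head)
#   note, stanza          (notes must come after the stanzas)
```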
The power of these simple combination and repetition indicators comes from using them together. For example:
bibliographicEntry = element bibliographicEntry
{ (author | editor)+, title+, publisher, publicationPlace, date }
means that a bibliographic entry must start with one or more authors or editors (any combination of any number of authors and editors is permitted, as long as there is at least one author or at least one editor), followed by one or more titles, followed by a publisher, a publication place, and a date. Note the use of parentheses at the beginning to create a subgroup. The correct way to read this statement is from the inside out: choose (thus the vertical bar) between author and editor (the two options inside the parentheses that are separated by the choice connector) at least once (the plus sign means there must be at least one author or one editor), then make another choice (the same or different) if you’d like, and when you’ve encoded all of the authors and editors you’d like, move on to the title(s). If you encode a bibliographic entry with no author or editor, or with the date before the publisher, or in any other way that violates the rule specified by the schema, a validating XML parser (see below) will notify you that you’ve made an error.
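To check your reading of the rule, it can help to list a few child sequences and decide whether each one conforms. The sketch below repeats the rule and annotates it with examples in comments. The definitions of the individual child elements are omitted, so this is a fragment rather than a complete schema.

```rnc
bibliographicEntry = element bibliographicEntry
  { (author | editor)+, title+, publisher, publicationPlace, date }

# Accepted child sequences:
#   author, title, publisher, publicationPlace, date
#   editor, author, author, title, title, publisher, publicationPlace, date
#
# Rejected child sequences:
#   title, publisher, publicationPlace, date
#       (no author or editor)
#   author, title, date, publisher, publicationPlace
#       (date is out of order)
#   author, title, editor, publisher, publicationPlace, date
#       (an editor may not follow a title)
```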
As we discussed earlier, XML elements typically contain elements, plain text, a mixture of the two (called mixed content), or nothing at all (called an empty element). Relax NG specifies plain text with the reserved word text, which means that in the following schema statement:
line = element line { text }
the system knows that a <line> element contains plain text (and not an element whose name, or generic identifier, is the word text). (There is, of course, also a way to create an element called <text>, should you need one.) Relax NG also supports a variety of data types, or data that looks like plain text, but that is constrained in specific ways. For example:
pageNumber = element pageNumber { xsd:int }
requires that a page number contain a string that can be interpreted as an integer. The library of pre-defined datatypes and the methods for modifying them using facets are described in Chapter 8 of Eric van der Vlist’s Relax NG book, which is available in digital form through the Pitt library system (authentication required).
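Returning to the four kinds of element content named above (elements, plain text, mixed content, and nothing at all), the fragment below sketches one way each might be declared in compact syntax. The element names paragraph, emph, and pageBreak are hypothetical. The keywords mixed and empty are part of Relax NG, but they are not otherwise covered in this introduction.

```rnc
# Element content: only child elements.
stanza = element stanza { line+ }

# Plain text.
line = element line { text }

# Mixed content: text interspersed with any number of <emph> elements.
paragraph = element paragraph { mixed { emph* } }
emph = element emph { text }

# Empty element: no text and no children, as in <pageBreak/>.
pageBreak = element pageBreak { empty }
```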
The rules specified in Relax NG constitute a schema, a type of document that details the hierarchy chosen to encode a text with XML. In this way, a schema functions as a type of blueprint or grammar, expressing constraints on the structure and content of encoded text. As in the examples above, the constraints expressed by the schema go beyond the simple rules of well-formedness and can dictate the order and content of elements and attributes.
There are two other schema languages in general use to constrain the structure of XML markup: DTD (Document Type Definition) and XML Schema. We use Relax NG instead of the other schema languages for two reasons. First, Relax NG is available in both an XML syntax and a compact syntax. Compact syntax, which is what we use in the examples above (and in our own work), allows for simpler, less verbose code that is easier to develop, read, and maintain. XML syntax, as the name implies, uses XML to specify the rules of the schema, which makes it much more verbose. The second reason we favor Relax NG is that it enables the creation of more complex constraints than are allowed by DTDs.
As we discussed earlier, all XML documents must be well formed, which means that 1) they must have a single root element and 2) all elements must be properly nested. Optionally, XML may also be valid, which means that it can be validated against rules specified in a schema. The point of validating a document is to ensure that the XML used to encode a text matches the markup patterns described in the schema. Validating a text essentially means comparing it to a predetermined set of rules and patterns. In this way, validating an XML document with a schema can prevent typos, errors, and inconsistencies, and most digital humanities projects are developed according to a schema (that is, they are valid, and not merely well formed). In principle, you should first perform document analysis to determine the inherent structure of your document, then create a schema to model that structure, and then mark up the text according to the rules specified in the schema. In practice, as you discover new complexities in the text you are marking up you will probably need to revise your schema to deal with issues you failed to notice during your initial document analysis.
The following example shows a fairly straightforward XML markup hierarchy:
<play>
<heading>
<title>The Importance of Being Earnest</title>
<year>1895</year>
</heading>
<body>
<quote speaker="Algernon">I don’t know that I am much interested in
your family life, Lane.</quote>
<quote speaker="Lane">No, sir; it is not a very interesting subject.
I never think of it myself.</quote>
</body>
</play>
When you create a schema to model and constrain a document like the one above, you specify the content and occurrence restrictions on each of the elements and attributes in your document. For example, the <year> element should contain only digits, and we can specify that the <year> element will not be able to contain letters, other characters, or elements with a rule like:
year = element year { xsd:int }
The above snippet of Relax NG specifies that year is an element. The expression xsd:int dictates that the content of the <year> element must be an integer. In practice, you might want to constrain the content further, excluding, for example, negative integers, zero, etc.
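Two ways of tightening the rule are sketched below. The first swaps in a different built-in datatype. The second keeps xsd:int and adds facets, and the bounds 1000 and 2100 are arbitrary placeholders chosen for illustration, not values taken from this introduction. Only one definition of year can be active at a time, so the second is shown commented out.

```rnc
# Option 1: a built-in datatype that already excludes zero and
# negative numbers.
year = element year { xsd:positiveInteger }

# Option 2: facets on xsd:int that restrict the value to a range.
# year = element year { xsd:int { minInclusive = "1000" maxInclusive = "2100" } }
```

With either rule in place, <year>1895</year> would be accepted, while <year>0</year> and <year>-44</year> would be reported as errors.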
An entire Relax NG schema for this sample of XML code might look like the following:
start = play
play = element play {heading, body}
heading = element heading {title, year}
title = element title {text}
year = element year {xsd:int}
body = element body {quote*}
quote = element quote {speaker, text}
speaker = attribute speaker {text}
To create a Relax NG Schema document in <oXygen/>, click on the icon shaped like a sheet of paper in the upper left hand corner of the screen. Then, select the document type “RELAX NG Schema - Compact” from the list (it will be under either “Recently used” or “New Document”) and click create. <oXygen/> inserts some boilerplate that isn’t actually needed, so once you’ve created the document, the first thing you should do is select and delete all of the content, so that you begin with a clean slate. Then type start = followed by the label you want to use to refer to your root element (as mentioned above, we use the same value for the label as for the generic identifier, so, for example, if your root element is going to be <play>, your schema will start with start = play). You can then go on to define the structure of your XML document and dictate the content of the elements and attributes.
In order for your schema to be valid, you need to specify the name and content of all
the elements and attributes you mention. Consider the following modification of the
schema for plays, above:
start = play
play = element play {heading, body}
heading = element heading {title, year}
year = element year {xsd:int}
body = element body {quote*}
quote = element quote {speaker, text}
speaker = attribute speaker {text}
The preceding schema is not valid, because the <heading> element must include a <title> element, but the <title> element itself is never defined. All elements and attributes must be defined in the schema, although reserved words, like text or xsd:int, do not require an explicit definition because Relax NG inherently knows what they mean.
Once you have created and saved your schema and are ready to validate an XML document, open the document you want to validate and click “Document” and select “Schema” from the drop down menu, and then select “Associate Schema.” Then select the schema you have created and saved and click “Okay.” Associating your schema with an XML document will insert a line of code at the top of your document. If your document validates properly, a small green box will appear at the top of the <oXygen/> window, similar to when you check for well-formedness. If the document is not valid, an error message will appear.
If your XML document is not associated with a schema, <oXygen/> automatically checks only for well-formedness. If it is associated with a schema, <oXygen/> automatically checks for well-formedness and validates the document against the schema.
<oXygen/> does real-time validation and well-formedness checking, so you don’t have to tell it to validate your document or check for well-formedness. However, although the real-time validation and well-formedness checking highlight the problem spots, they provide only brief error messages. If you want more detailed error reports, after you have associated a schema with your document you can click on the icon shaped like a white piece of paper with a red check mark in it, which will instruct <oXygen/> to validate the document against the schema and display an error report in a separate panel. You can also use a keyboard shortcut, which is Ctrl+Shift+v under Windows and Cmd+Shift+v under macOS. You can only validate a document that has a schema associated with it, but you can check any document for well-formedness. To do that, click on the little drop-down arrow to the right of the validation checkmark icon, and a list will appear that includes an option to check well-formedness. The keyboard shortcut for a well-formedness check is Ctrl+Shift+w under Windows and Cmd+Shift+w under macOS.
The preceding is just a brief introduction to a small number of basic features of Relax NG, intended to enable new users to begin to construct and apply simple schemas. It is not complete and it uses some non-standard terminology. After reading through this introduction, users might wish to consult a tutorial by the designers of the Relax NG standard, available at http://relaxng.org/compact-tutorial-20030326.html. See also the excellent book by Eric van der Vlist, Relax NG, available online through the Pitt library system (authentication required) and, freely, at the Internet Archive.