Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2024-07-16T21:43:04+0000

Introduction to Relax NG

Relax NG rules

Relax NG (Regular Language for XML Next Generation) is a schema language that allows you to express in a formal way the structure and content of an XML document. For example:

start = poem
poem = element poem { head, body }

The above rules state that the root element of an XML document authored according to this schema must be <poem> and that the <poem> element must contain exactly one <head> element followed by exactly one <body> element. There are several ways to express the structure of an XML document using Relax NG, and the one described here (which is what we use in our own work) has a start rule followed by rules that take the form:

label = type name {content}

where label is a name you assign to the object you are modeling, type is the type of object (element or attribute), name is the name of the object (e.g., an element named poem), and content is what the element or attribute contains (see the details below). This type of statement uses what is called a named pattern. We find it most convenient to use a label for the pattern that is the same as the name of the element or attribute being defined (in the example above, the first occurrence of the word “poem,” to the left of the equal sign, is the label for the pattern and the second, after the word “element,” is the name of the element being defined), but Relax NG doesn’t require that.
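
As a minimal sketch of that last point, the schema below describes the same <poem> structure but uses pattern labels that differ from the element names. The labels poemPattern, headPattern, and bodyPattern are hypothetical, and the text content given to <head> and <body> is an assumption made only so that the sketch is complete:

```rnc
start = poemPattern
# The label (left of the equal sign) is only a handle for referring to the
# pattern; the element name (after the keyword "element") is what must
# appear in the XML.
poemPattern = element poem { headPattern, bodyPattern }
headPattern = element head { text }
bodyPattern = element body { text }
```

A document validated against this sketch still has to use <poem>, <head>, and <body> as its element names; the labels never appear in the XML.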

Content models

The content model for the element or attribute being defined must be wrapped in curly braces. This indicates what is directly part of the element being defined, whether it be attributes, elements, text, or all three. The term content model is somewhat misleading because:

A content model for an element includes attributes for that element even though attributes technically are properties of an element and not content. That’s because content is the stuff between the start- and end-tags, while attributes are inside the start-tag, and not between it and the end-tag. Attributes are nonetheless included in what is called a content model in Relax NG.
Elements and plain text nested immediately inside an element (between its start- and end-tags, and not inside some other element nested within it) are called the children of that element. Elements and plain text that are nested more deeply are descendants, but not children. We’ll learn more about this shortly, when we get to the XPath portion of our course.

A content model contains only attributes (see above) and children of the element you are modeling, but not descendants. For example, in:
```
<speech style="...">
    <speaker>Hamlet</speaker>
    <lines>
        <line>Good morning, Ophelia!</line>
    </lines>
</speech>
```
the content model for a <speech> element would include the style attribute and the <speaker> and <lines> elements, but not the <line> element. That’s because the <line> is a descendant of <speech>, but not a child. The Relax NG declaration for a <speech> element might look like:

speech = element speech { style, speaker, lines }

Attributes have values but not content. In Relax NG they nonetheless are described as having content models. The content model for an attribute, then, models the possible values of the attribute, and not something that could strictly be called content.

The content model of an element includes only attributes and content that are directly parts of the element being specified. For example, we wouldn’t say that the element <speech> above contains <line> elements because <line> elements are not children of <speech>, even though they are descendants.

A content model consists of labels for attributes and elements that are defined elsewhere, as well as labels that are predefined, such as text for plain text. In the example above, the content model for speech contains labels for a style attribute and <speaker> and <lines> child elements, the content of which will have to be defined elsewhere in the schema. Those components of the content model are combined using combination indicators and (optionally) modified using repetition indicators.
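
To make “defined elsewhere” concrete, here is a sketch of what a complete schema for the <speech> example could look like. Only the content model of speech comes from the discussion above; the use of speech as the start pattern, the plain-text values and content, and the requirement of at least one <line> are assumptions chosen for illustration:

```rnc
start = speech
# The content model names only the attribute and the children of <speech>.
speech = element speech { style, speaker, lines }
# Each label used above needs its own definition somewhere in the schema.
style = attribute style { text }
speaker = element speaker { text }
# <line> is a child of <lines>, so it appears here, not in speech.
lines = element lines { line+ }
line = element line { text }
```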

The two principal combination indicators are comma (,) and the vertical bar (|, commonly called a pipe). (There is also an interleave connector, represented by an ampersand [&], which we do not discuss here.) The comma means sequence, that is, that the two components must occur in the specified order. In the example above, a <poem> must contain a <head> and a <body>, and the comma means that they must occur in that order. The pipe means choice; a content model of head | body would mean that the content may be either a <head> or a <body>, but that one or the other must occur.
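
The difference between the two indicators can be sketched with a small dictionary-style schema. All of the element names here (entry, term, definition, gloss, crossReference) are hypothetical and serve only to contrast sequence with choice:

```rnc
start = entry
# Sequence (comma): <term> must come first, then <definition>.
entry = element entry { term, definition }
term = element term { text }
# Choice (pipe): <definition> contains either a <gloss> or a
# <crossReference>, but one of the two must be present.
definition = element definition { gloss | crossReference }
gloss = element gloss { text }
crossReference = element crossReference { text }
```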

The four repetition indicators are nothing, question mark (?), plus sign (+), and asterisk (*, sometimes called a “splat”), and they operate as follows:

Nothing. Exactly one. The item is required and occurs only once. In the example above, the content model for poem contains head and body with no overt repetition indicator, which means that each must occur exactly once.
Question mark (?). Zero or one. The item may occur zero or one time, that is, it is optional. It cannot occur more than once. A statement like:
```
poem = element poem { head?, body }
```
would mean that a <poem> must have exactly one <body> element, which may or may not be preceded by one <head> element.
Plus sign (+). One or more. The item is required and may occur more than once. A model like:
```
stanza = element stanza { line+ }
```
means that a <stanza> element must contain at least one <line> element, and may contain more.
Asterisk (*). Zero or more. The item is optional and repeatable. A model like:
```
poem = element poem { head, body, note* }
```
means that a <poem> element must contain exactly one <head> element followed by exactly one <body> element, followed by zero or more <note> elements.
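
The fragments above can be combined into one small schema that uses all four repetition indicators. This is only a sketch: the decision to make <head> optional, to put stanzas inside <body>, and to give <head>, <line>, and <note> plain-text content are assumptions, not requirements of the examples above:

```rnc
start = poem
# head? : zero or one; body (no indicator) : exactly one; note* : zero or more
poem = element poem { head?, body, note* }
head = element head { text }
# stanza+ : at least one stanza is required
body = element body { stanza+ }
# line+ : at least one line per stanza
stanza = element stanza { line+ }
line = element line { text }
note = element note { text }
```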

The power of these simple combination and repetition indicators comes from using them together. For example:

bibliographicEntry = element bibliographicEntry 
  { (author | editor)+, title+, publisher, publicationPlace, date }

means that a bibliographic entry must start with one or more authors or editors (any combination of any number of authors and editors is permitted, as long as there is at least one author or at least one editor), followed by one or more titles, followed by a publisher, a publication place, and a date. Note the use of parentheses at the beginning to create a subgroup. The correct way to read this statement is from the inside out: choose (thus the vertical bar) between author and editor (the two options inside the parentheses that are separated by the choice connector) at least once (the plus sign means there must be at least one author or one editor), then make another choice (the same or different) if you’d like, and when you’ve encoded all of the authors and editors you’d like, move on to the title(s). If you happen to encode a bibliographic entry with no author or editor, or with the date before the publisher, or in any other way in violation of the rule specified by the schema, a validating (see below) XML parser will notify you that you’ve made an error.
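
For the rule above to be usable, each label in its content model also has to be defined. A minimal sketch follows, assuming that every child element simply contains plain text (the original rule does not say what the children contain) and that bibliographicEntry is the start pattern:

```rnc
start = bibliographicEntry
bibliographicEntry = element bibliographicEntry
  { (author | editor)+, title+, publisher, publicationPlace, date }
# Assumed definitions: each child holds plain text.
author = element author { text }
editor = element editor { text }
title = element title { text }
publisher = element publisher { text }
publicationPlace = element publicationPlace { text }
date = element date { text }
# Permitted child order, for example: editor, author, author, title,
# publisher, publicationPlace, date.
```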

Data types

As we discussed earlier, XML elements typically contain elements, plain text, a mixture of the two (called mixed content), or nothing at all (called an empty element). Relax NG specifies plain text with the reserved word text, which means that in the following schema statement:

line = element line { text }

the system knows that a <line> element contains plain text (and not an element whose name, or generic identifier, is the word text). (There is, of course, also a way to create an element called <text>, should you need one.) Relax NG also supports a variety of data types, or data that looks like plain text, but that is constrained in specific ways. For example:

pageNumber = element pageNumber { xsd:int }

requires that a page number contain a string that can be interpreted as an integer. The library of pre-defined datatypes, as well as methods for modifying them using facets, are described in Chapter 8 of Eric van der Vlist’s Relax NG book, which is available in digital form through the Pitt library system (authentication required).
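
As a brief illustration of facets, the sketch below narrows xsd:int to a range. The bounds 1 and 9999 are arbitrary values chosen for the example and are not taken from any particular project:

```rnc
start = pageNumber
# Facets go in curly braces after the datatype name and are separated by
# whitespace, not commas. Here the value must be an integer from 1 to 9999.
pageNumber = element pageNumber {
  xsd:int { minInclusive = "1" maxInclusive = "9999" }
}
```

With this rule, a <pageNumber> containing 0, -3, or xiv would be reported as invalid, while one containing 42 would be accepted.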

Relax NG as a schema

The rules specified in Relax NG constitute a schema, a type of document that details the hierarchy chosen to encode a text with XML. In this way, a schema functions as a type of blueprint or grammar, expressing constraints on the structure and content of encoded text. As in the examples above, the constraints expressed by the schema go beyond the simple rules of well-formedness and can dictate the order and content of elements and attributes.

There are two other schema languages in general use to constrain the structure of XML markup: DTD (Document Type Definition) and XML Schema. We use Relax NG instead of the other schema languages for two reasons. First, Relax NG is available in both an XML syntax and a compact syntax. Compact syntax, which is what we use in the examples above (and in our own work), allows for simpler, less verbose code that is easier to develop, read, and maintain. XML syntax, as the name implies, uses XML to specify the rules of the schema, which makes it much more verbose. The second reason we favor Relax NG is that it enables the creation of more complex constraints than are allowed by DTDs.

Validation

As we discussed earlier, all XML documents must be well formed, which means that 1) they must have a single root element and 2) all elements must be properly nested. Optionally, XML may also be valid, which means that it can be validated against rules specified in a schema. The point of validating a document is to ensure that the XML used to encode a text matches the markup patterns described in the schema. Validating a text essentially means comparing it to a predetermined set of rules and patterns. In this way, validating an XML document with a schema can prevent typos, errors, and inconsistencies, and most digital humanities projects are developed according to a schema (that is, they are valid, and not merely well formed). In principle, you should first perform document analysis to determine the inherent structure of your document, then create a schema to model that structure, and then mark up the text according to the rules specified in the schema. In practice, as you discover new complexities in the text you are marking up you will probably need to revise your schema to deal with issues you failed to notice during your initial document analysis.

Examples

The following example shows a fairly straightforward XML markup hierarchy:

<play>
  <heading>
    <title>The Importance of Being Earnest</title>
    <year>1895</year>
  </heading>
  <body>
    <quote speaker="Algernon">I don’t know that I am much interested in 
      your family life, Lane.</quote>
    <quote speaker="Lane">No, sir; it is not a very interesting subject. 
      I never think of it myself.</quote>
  </body>
</play>

When you create a schema to model and constrain a document like the one above, you specify the content and occurrence restrictions on each of the elements (or attributes, if there are any) in your document. For example, the <year> element should contain only digits, and we can specify that the <year> element will not be able to contain additional letters, characters, or elements with a rule like:

year = element year { xsd:int }

The above snippet of Relax NG specifies that year is an element. The expression xsd:int dictates that the content of the <year> element must be a single integer. In practice, you might want to constrain the content further, excluding, for example, negative integers, zero, etc.
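
One way to sketch such a tighter constraint is to swap in a different built-in datatype. The choice of xsd:positiveInteger here is an illustration, not necessarily what a given project would use:

```rnc
start = year
# xsd:positiveInteger accepts 1, 2, 3, ... and rejects zero and
# negative numbers, as well as anything that is not an integer.
year = element year { xsd:positiveInteger }
```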

An entire Relax NG schema for this sample of XML code might look like the following:

start = play
play = element play {heading, body}
heading = element heading {title, year}
title = element title {text}
year = element year {xsd:int}
body = element body {quote*}
quote = element quote {speaker, text}
speaker = attribute speaker {text}

Creating a Relax NG Schema in <oXygen/>

To create a Relax NG schema document in <oXygen/>, click on the icon shaped like a sheet of paper in the upper left-hand corner of the screen. Then select the document type “RELAX NG Schema - Compact” from the list (it will be under either “Recently used” or “New Document”) and click “Create.” <oXygen/> inserts some boilerplate that isn’t actually needed, so once you’ve created the document, the first thing you should do is select and delete all of the content, so that you begin with a clean slate. Then type start = followed by the label you want to use to refer to your root element (as mentioned above, we use the same value for the label as for the generic identifier, so, for example, if your root element is going to be <play>, your schema will begin with start = play). You can then go on to define the structure of your XML document and dictate the content of the elements and attributes. In order for your schema to be valid, you need to define every label you refer to, specifying the name and content of each element and attribute. Consider the following modification of the schema for plays, above:

start = play
play = element play {heading, body}
heading = element heading {title, year}
year = element year {xsd:int}
body = element body {quote*}
quote = element quote {speaker, text}
speaker = attribute speaker {text}

The preceding schema is not valid, because the content model for <heading> refers to title, but the title pattern (and with it the <title> element) is never defined. All elements and attributes must be defined in the schema, although built-in keywords like text and datatypes like xsd:int do not require an explicit definition because Relax NG inherently knows what they mean.

Once you have created and saved your schema and are ready to validate an XML document, open the document you want to validate, click “Document,” select “Schema” from the drop-down menu, and then select “Associate Schema.” Then select the schema you have created and saved and click “Okay.” Associating your schema with an XML document will insert a line of code at the top of your document. If your document validates properly, a small green box will appear at the top of the <oXygen/> window, similar to when you check for well-formedness. If the document is not valid, an error message will appear.

If your XML document is not associated with a schema, <oXygen/> automatically checks only for well-formedness. If it is associated with a schema, <oXygen/> automatically checks for well-formedness and validates the document against the schema.

<oXygen/> does real-time validation and well-formedness checking, so you don’t have to tell it to validate your document or check for well-formedness. However, although the real-time validation and well-formedness checking highlight the problem spots, they provide only brief error messages. If you want more detailed error reports, after you have associated a schema with your document you can click on the icon shaped like a white piece of paper with a red check mark in it, which will instruct <oXygen/> to validate the document against the schema and display an error report in a separate panel. You can also use a keyboard shortcut, which is Ctrl+Shift+v under Windows and Cmd+Shift+v under macOS. You can only validate a document that has a schema associated with it, but you can check any document for well-formedness. To do that, click on the little drop-down arrow to the right of the validation checkmark icon, which opens a list that includes an option to check well-formedness. The keyboard shortcut for a well-formedness check is Ctrl+Shift+w under Windows and Cmd+Shift+w under macOS.

To learn more

The preceding is just a brief introduction to a small number of basic features of Relax NG, intended to enable new users to begin to construct and apply simple schemas. It is not complete and it uses some non-standard terminology. After reading through this introduction, users might wish to consult a tutorial by the designers of the Relax NG standard, available at http://relaxng.org/compact-tutorial-20030326.html. See also the excellent book by Eric van der Vlist, Relax NG, available online through the Pitt library system (authentication required) and, freely, at the Internet Archive.