Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-09-23T16:29:09+0000


Relax NG test answer key

The Task

Please manually write a Relax NG schema that can validate an XML version of T.S. Eliot’s The hollow men, found at http://dh.obdurodon.org/the-hollow-men.xml. Your schema should aim for a sensible compromise between enforcing (don’t allow anything that shouldn’t be there) and enabling (allow things that are likely to show up in similar documents). For example, you don’t want to require all poems in a hypothetical corpus to have exactly the same number and arrangement of stanzas and lines as this one, or exactly the same title, but you also don’t want to allow markup that doesn’t belong in poems.

Create your answer in <oXygen/> as a Relax NG schema using compact syntax (that is, the type of schema we’ve been practicing). Your answer file must follow our file-naming conventions (http://dh.obdurodon.org/file-naming_conventions.xhtml), using .rnc as the filename extension (to indicate that you are submitting a Relax NG compact schema file). Please do not submit the XML; we will associate the XML with your schema ourselves. If you want to include any comments for us, format them as Relax NG comments; you can refresh your memory about how to do that at the top of our Patterns and anti-patterns page.

Solution

There are multiple ways to model this document type effectively in Relax NG, so your solution need not have matched ours exactly. Relax NG schemas are typically written to model not just a single document but a document type (in this case, perhaps this poem and others that may be similar to it). For that reason, you will want a schema that is neither overly permissive nor overly restrictive. Below is one solution:

start = poem
poem = element poem { metadata, epigraph*, section+ }
# 
# #####
# Metadata section
# #####
#
metadata = element metadata { title, author, year, source-url }
title = element title { text }
author = element author { text }
# I used xsd:int (which allows only integer values, and not arbitrary
# text) to be more specific about the kind of information that can go here!
year = element year { xsd:int }
# Similarly, I used xsd:anyURI instead of text to restrict the values
# allowed for the <source-url> element
source-url = element source-url { xsd:anyURI }
# 
# #####
# Epigraphs
# #####
#
epigraph = element epigraph { type, source, text }
# I could also have just left the content model of the type attribute as text, 
# but since there are only two distinct values used in this poem, I kept it 
# specific so I can't accidentally misspell one!
type = attribute type { "quote" | "saying" }
source = attribute source { text }
#
# #####
# Main content
# #####
#
section = element section { header, body }
header = element header { text }
# I'm using an "or" construction here because fragments are scattered inconsistently 
# between stanzas. My repeatable or-group allows <stanza> and <fragment> elements to
# appear in any order.
body = element body { (stanza | fragment)+ }
stanza = element stanza { line+ }
# Here I use a named value to specify the content of <line> and <fragment> because they
# behave exactly the same and have the exact same content. I do this so that if I want
# to make a change, I only have to do it in one place, rather than two.
line_like = (source?, figlang?, mixed { imagery* })
line = element line { line_like }
fragment = element fragment { line_like }
imagery = element imagery { image, text }
image = attribute image { text }
figlang = attribute figlang { text }

Discussion

Comments

We used Relax NG comments, which begin with hash marks, to make it easier to find the different sections of our schema: metadata, epigraph, and main content of the poem. You don’t have to add comments to your Relax NG, especially if it is as short as this schema, but we normally would.

Relax NG doesn’t care about the order of your declarations, but we find that grouping and labeling them this way makes life easier for the developer. A line that begins with a single hash mark in Relax NG is a comment, but a line that begins with two consecutive hash marks is not a regular comment: it is a documentation annotation that attaches to the declaration that follows it (see the RELAX NG compact syntax tutorial for details). For that reason we insert a space after the first hash mark in our divider lines, to ensure that <oXygen/> will recognize the line as a regular comment. By the way, comments do not have to begin at the start of a line; you can also include a comment at the end of a line of schema code, e.g.:

source = attribute source { text } # place from which the epigraph comes
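
To make the distinction concrete, here is a small sketch that shows the different kinds of hash-mark lines side by side. The wording of the comments is ours; only the source declaration comes from the solution.

```rnc
# A regular comment: ignored during validation
# ##### A divider line: still a regular comment because of the space
## A documentation annotation: attached to the declaration below
source = attribute source { text } # a regular comment at the end of a line
```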

Named values

As discussed in our Relax NG comments, we assign content models that are used in multiple places to a named value. In this case, <line> and <fragment> have the same content model, and using a named value ensures that the content models for the two element types will be the same.
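
For comparison, here is roughly what the two declarations would look like without the named value (a sketch, not part of our solution). The content model is written out twice, so any later change has to be made in both places, and nothing stops the two copies from drifting apart.

```rnc
# Without line_like: the same content model is duplicated
line = element line { source?, figlang?, mixed { imagery* } }
fragment = element fragment { source?, figlang?, mixed { imagery* } }
```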

Mixed content

We used the mixed keyword (inside a named value; see above) to model mixed content in our definition of <line> and <fragment> elements; mixed { imagery* } matches the same content as the repeatable or-group (text | imagery)*. Because attributes are not mixed in with plain text the way child elements are (attributes are sequestered inside the start tag), we don’t include them inside the mixed portion of the content model. Relax NG won’t care if you write them inside the mixed portion, but it’s best to make your schema as self-documenting as possible. Additionally, attributes are not repeatable, so putting them inside a repeatable or-group would misrepresent them as repeatable. Well-formedness ensures that you won’t be allowed to repeat them, but that’s all the more reason not to let your schema say that you can.
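
The two ways of writing the mixed portion can be compared side by side. In this sketch the names line_like_mixed and line_like_or are hypothetical, introduced only so that both versions can appear in one file; either should accept the same <line> and <fragment> content.

```rnc
# Using the mixed keyword, as in our solution
line_like_mixed = (source?, figlang?, mixed { imagery* })
# Using an explicit repeatable or-group of text and <imagery>
line_like_or = (source?, figlang?, (text | imagery)*)
```

In both versions the attributes sit outside the repeatable part, so the schema never suggests that an attribute could occur more than once.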

Dates

Years in this document are positive integers, so we used the xsd:int datatype, which rejects any value that is not an integer (although it also accepts zero and negative integers). We could, alternatively, have used a datatype specifically for years: xsd:gYear (see http://books.xmlschemata.org/relaxng/ch19-77127.html for discussion).
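
A minimal sketch of the xsd:gYear alternative, reusing the year declaration from our solution. One difference worth knowing is that xsd:gYear expects at least four digits, so a value like 1925 is valid while 25 is not.

```rnc
# Alternative to xsd:int: accept only values shaped like a calendar year
year = element year { xsd:gYear }
```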

URLs

URLs can be constrained with a datatype designed for Uniform Resource Identifiers: xsd:anyURI (see http://books.xmlschemata.org/relaxng/ch19-77009.html for discussion). Specifying text would also work, but, as with integers, above, using a more constrained datatype provides better protection from error.
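
If you wanted something stricter than xsd:anyURI alone, one option is a datatype parameter. The sketch below is an assumption on our part, not something the test required: it adds a pattern facet that accepts only values beginning with http:// or https://, which may be too strict for a corpus that cites other kinds of sources.

```rnc
# Hypothetical stricter variant: the pattern facet is our addition
source-url = element source-url { xsd:anyURI { pattern = "https?://.+" } }
```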

Text vs or-groups

Whether you describe a value with the keyword text or an or-group of string values depends on the range of values permitted. For example, in this document the only values that occur for the type attribute are "quote" and "saying". If you think those are the only values you will encounter in your project, it would be best to model the type attribute as:

type = attribute type { "quote" | "saying" }

Using an or-group instead of text will raise an error if you accidentally write "quotation" instead of "quote", or something else that would introduce inconsistency into your markup. If, though, you think that other values are likely and unpredictable, the flexibility provided by text might be more appropriate. We sometimes start development with text and then switch to an or-group of strings once we think we’ve seen all the values we are likely to encounter.
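
The progression described above might look like the following sketch. The two declarations are successive versions of the same definition, not lines that would coexist in one schema, so the earlier one is shown commented out.

```rnc
# Early in development, before the inventory of values is known:
# type = attribute type { text }

# Later, once the values have stabilized:
type = attribute type { "quote" | "saying" }
```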