Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-02-21T03:17:33+0000


Relax NG test answer key

The task

Your task is to create a Relax NG schema for our XML version of the first two paragraphs of the first letter of Mary Shelley’s Frankenstein, available at http://dh.obdurodon.org/2224_test-02_relax-ng.xml.

Your schema should constrain the XML while also allowing for the reasonable integration of new material (that is, similar content in other parts of the novel), and there may be more than one way to do that. Hint: You will want to use repeatable or-groups when you model mixed content, and you can read about those under Mixed content in our Relax NG content models posting.

You are not permitted to change the XML, so your schema has to be valid against the XML as given. Make sure that you associate your schema with the XML and use it to validate the XML file. If the XML is not valid against the schema, you’ll want to find the problem and adjust the schema.

Should you have any questions about the test, please post them in the Relax NG channel of our Slack discussion board (and when you respond to someone’s query, which you are encouraged to do, you can nudge them in the right direction, but don’t give away the answer). You may use any reference material you would like while creating your schema (books, Internet, etc.), except that you cannot receive help from another person, and your work needs to be your own. When you are finished, upload your schema to Canvas, where we have created an assignment for it (do not upload the XML).

Solution

There are multiple ways to model this document type effectively in Relax NG, so your solution need not have matched ours exactly. Among other things, Relax NG schemas are typically written to model not just a single document, but a document type—in this case, perhaps this text and others that may be similar to it. For that reason, you will want to construct a schema that is not overly permissive, but also one that is not overly restrictive.

start = letter
letter = element letter { metadata, body }
# 
# #####
# metadata
# #####
#
metadata = element metadata { title, addressee, from, date }
title =
  element title { text, num }
num = element num { xsd:int }
addressee =
  element addressee {
    mixed { person*, location? }
  }
from =
  element from {
    mixed { location* }
  }
#
# ##### date #####
#
date = element date { month, day, year }
month = element month { text }
day = element day { text }
year = element year { text }
# 
# #####
# body
# #####
#
body = element body { p+ }
p =
  element p {
    mixed { (location | person)* }
  }
#
# ##### location #####
#
location =
  element location { where, real, text }
where = attribute where { text }
real = attribute real { xsd:boolean }
#
# ##### person #####
#
person =
  element person { who, rel, text }
who = attribute who { text }
rel = attribute rel { text }

Discussion

Comments

We used Relax NG comments, which begin with hash marks, to make it easier to find the different sections of our schema: metadata, body, and inline elements (elements contained in mixed content). In Real Life the only time we don’t include these types of sectioning comments in our Relax NG is when it is so short that we can see it all easily on a single screen.

Relax NG doesn’t care about the order of your declarations, but we find that grouping and labeling them this way makes life easier for the developer. A line that begins with a single hash mark in Relax NG is a comment, but a line that begins with two consecutive hash marks is treated as a documentation annotation rather than an ordinary comment, so we insert a space after the first hash mark. Comments do not have to start at the beginning of a line; you can also include a comment at the end of a line of schema code, e.g.:

who = attribute who { text } # person to whom an epithet refers

Mixed content and repeatable or-groups

We used a repeatable or-group to model mixed content in our definition of <p> elements. This is a common way to model paragraph-like structures; in the case of our example, a paragraph is mostly plain text, but with <location> and <person> elements mixed in, and more than one can appear in a single paragraph.

A repeatable or-group is the best choice for a paragraph in this document, but where the order of the content items is fixed and there is no plain text, a content model should specify a sequence of elements. We took this approach in our models for <metadata> and <date>.
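
As a minimal sketch of the contrast described above, the fragment below sets the mixed-content model for <p> next to a sequence model. It relies on the location, person, month, day, and year definitions from the schema above, and the name p_alternative is hypothetical, introduced here only for comparison.

```rnc
# Mixed content as written in our schema: plain text with any number
# of <location> and <person> elements, in any order
p = element p { mixed { (location | person)* } }

# A spelling that accepts the same content by listing text as one of
# the choices (p_alternative is a hypothetical name for comparison)
p_alternative = element p { (text | location | person)* }

# A sequence, by contrast, fixes the order of the children and
# allows no plain text between them
date = element date { month, day, year }
```

Either spelling of the paragraph model should validate the test XML; which one to use is largely a matter of which the developer finds easier to read.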

Location

Locations in fictional works can be either real or imaginary, and our XML uses two attributes for <location> elements as a way of indicating both the target of a location reference and whether it is real or imaginary. The attribute that identifies the location reference matters because places are not always called by their names in the text, e.g.:

<location where="…" real="…">here</location>

The markup makes it possible for a computer to find all references to St. Petersburg even where the reference does not use the name of the city.

The real attribute allows us to distinguish real and unreal places, so that, for example, if we wanted to create a map of places mentioned, we wouldn’t waste our time trying to determine the longitude and latitude of places that cannot be mapped, such as Heaven. As it happens, all of the places mentioned in these two paragraphs are real. Since we want to allow only a true or false value for this attribute, we constrain the information to boolean values by using the xsd:boolean datatype. (For test purposes specifying the value as "true" | "false" is fine, but in Real Life we would use the more precise datatype.) Specifying just text is suboptimal because the possible values are predictable in advance (unlike with the who or where attributes), so it’s better to use a tighter content model that does not allow unanticipated values.

You can read about xsd:boolean at http://books.xmlschemata.org/relaxng/ch19-77025.html. The value space is limited to true and false and the lexical space allows us to write either true or 1 to represent the value true and either false or 0 to represent the value false. The distinction between value space and lexical space may seem peculiar on first encounter, but it turns out to be useful because it makes it possible to allow the same information value to be spelled in different ways and still be recognized as the same by a computer.
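
To make the comparison concrete, here is a sketch of the three ways of declaring the real attribute discussed above. A schema would use only one of them; the names real_datatype, real_enumerated, and real_text are hypothetical and appear here only so that the alternatives can sit side by side.

```rnc
# Three ways to declare the real attribute; use only one in a schema.
# The pattern names below are hypothetical, for comparison only.

# Datatype: accepts the lexical forms true, false, 1, and 0
real_datatype = attribute real { xsd:boolean }

# Enumeration: accepts only the literal strings true and false
real_enumerated = attribute real { "true" | "false" }

# Plain text: accepts any string at all, including typos such as "ture"
real_text = attribute real { text }
```

The enumeration is stricter than the datatype about spelling, since it rejects 1 and 0, while plain text offers no protection against unanticipated values.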

Person

For research purposes we might want to keep track of who appears, is mentioned, or talks in the novel. The who attribute allows us to use a unique and consistent identifier for each distinct person in the novel, regardless of how they might be identified in the prose (e.g., Mrs. Saville and Margaret and my dear sister are all the same person). The rel attribute is less interesting in this excerpt because the person is always the same person, but over the course of the novel it would allow us access to changing patterns in family relationships.

In Real Life we might manage the cast of characters by creating a metadata resource that associates the person who is Margaret Saville with the relationship that is sister, so that we wouldn’t have to specify both values each time Mrs. Saville is mentioned in a letter from her brother, Robert Walton. We’ll say more about that type of data resource in your project groups should the need arise.
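
One way such a metadata resource might be modeled is sketched below. This is purely illustrative: none of the element or attribute names (cast, character, id, name) come from the test XML, and a real project might organize the information quite differently.

```rnc
# Hypothetical schema for a separate cast-of-characters file;
# none of these names come from the test XML
start = cast
cast = element cast { character+ }
character = element character { id, relationship, name }
id = attribute id { xsd:ID }             # unique identifier for the person
relationship = attribute rel { text }    # e.g., relationship to the letter writer
name = element name { text }
```

With a resource along these lines, a <person> element in a letter could carry only the who attribute, and the relationship could be looked up from the cast list.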

Repetition indicators

We make frequent use of repetition indicators after certain content pieces in our schema to ensure that we get the right number of them (when only an exact number is correct) or to allow the number to vary if there is variation in the document. For example, we allow our <body> element to contain one or more <p> elements. There are exactly two paragraphs in this abridged version of the letter, but we would want our schema to be able to model contents with varying numbers of paragraph children.
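
The fragment below sketches how the choice of repetition indicator changes what <body> may contain. The first declaration is the one from our schema; the others use hypothetical names and are shown only for comparison. All of them rely on the p definition from the schema above.

```rnc
# The declaration from our schema: one or more paragraphs
body = element body { p+ }

# Alternatives, under hypothetical names, for comparison only
body_zero_or_more = element body { p* }    # an empty <body> is allowed
body_optional = element body { p? }        # zero or one paragraph
body_exactly_two = element body { p, p }   # exactly two paragraphs
```

The last alternative would validate the two-paragraph excerpt, but it would reject any other letter with a different number of paragraphs, which is why we prefer the plus sign here.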