Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2022-02-21T03:17:33+0000
Your task is to create a Relax NG schema for our XML version of the first two paragraphs of the first letter of Mary Shelley’s Frankenstein, available at http://dh.obdurodon.org/2224_test-02_relax-ng.xml.
Your schema should constrain the XML while also allowing for the reasonable
integration of new material (that is, similar content in other parts of the novel),
and there may be more than one way to do that. Hint: You will want to use
repeatable or-groups when you model mixed content, and you can read
about those under Mixed content
in our Relax NG content models posting.
You are not permitted to change the XML, so your schema has to be valid against the XML as given. Make sure that you associate your schema with the XML and use it to validate the XML file. If the XML is not valid against the schema, you’ll want to find the problem and adjust the schema.
Should you have any questions about the test, please post them in the Relax NG
channel of our Slack discussion board (and when you respond to someone’s query,
which you are encouraged to do, you can nudge them in the right direction, but don’t
give away the answer). You may use any reference material you would like while
creating your schema (books, Internet, etc.), except that you cannot receive help
from another person, and your work needs to be your own. When you are finished,
upload your schema to Canvas, where we have created an assignment for it (do not
upload the XML).
There are multiple ways to model this document type effectively in Relax NG, so your solution need not have matched ours exactly. Among other things, Relax NG schemas are typically written to model not just a single document, but a document type—in this case, perhaps this text and others that may be similar to it. For that reason, you will want to construct a schema that is not overly permissive, but also one that is not overly restrictive.
start = letter
letter = element letter { metadata, body }
#
# #####
# metadata
# #####
#
metadata = element metadata { title, addressee, from, date }
title =
  element title { text, num }
num = element num { xsd:int }
addressee =
  element addressee {
    mixed { person*, location? }
  }
from =
  element from {
    mixed { location* }
  }
#
# ##### date #####
#
date = element date { month, day, year }
month = element month { text }
day = element day { text }
year = element year { text }
#
# #####
# body
# #####
#
body = element body { p+ }
p =
  element p {
    mixed { (location | person)* }
  }
#
# ##### location #####
#
location =
  element location { where, real, text }
where = attribute where { text }
real = attribute real { xsd:boolean }
#
# ##### person #####
#
person =
  element person { who, rel, text }
who = attribute who { text }
rel = attribute rel { text }
We used Relax NG comments, which begin with hash marks, to make it easier to find the different sections of our schema: metadata, body, and inline elements (elements contained in mixed content). In Real Life the only time we don’t include these types of sectioning comments in our Relax NG is when it is so short that we can see it all easily on a single screen.
Relax NG doesn’t care about the order of your declarations, but we find that grouping and labeling them this way makes life easier for the developer. A line that begins with a single hash mark in Relax NG is a comment, but a line that begins with two consecutive hash marks is read as a documentation annotation rather than an ordinary comment, so we insert a space after the first hash mark in our sectioning lines. By the way, comments do not have to begin only at the start of a line; you can include a comment at the end of a line of schema code, e.g.:
who = attribute who { text } # person to whom an epithet refers
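The different kinds of hash-mark lines can be set side by side in a small fragment. This is a sketch rather than part of our solution, and the wording of the comments is ours; it assumes the compact-syntax rule that a documentation comment must be followed by the definition it annotates.

```rnc
# An ordinary comment, ignored during validation.
# ##### A sectioning comment; the space after the first hash mark keeps it ordinary #####

## A documentation comment, attached as an annotation to the definition that follows.
who = attribute who { text } # a trailing comment at the end of a line of schema code
```

For sectioning purposes the ordinary form is all we need, which is why every comment line in our schema has a space after the first hash mark.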
We used a repeatable or-group to model mixed content in our definition of <p>
elements. This is a common way to model paragraph-like structures; in our example,
a paragraph is mostly plain text, but with <location> and <person> elements mixed
in, and more than one of each can appear in a single paragraph.
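To make the role of the repeatable or-group concrete, here is a fragment that compares our definition with two alternatives. It relies on the location and person definitions from the schema above, and the alternatives are commented out because a schema may define p only once; treat it as an illustration, not as a required solution.

```rnc
# Form used in our solution: mixed content around a repeatable or-group.
# Text, <location>, and <person> may occur in any order and any number of times.
p = element p { mixed { (location | person)* } }

# An equivalent spelling that puts text inside the or-group itself:
# p = element p { (text | location | person)* }

# Not equivalent: without the or-group, every <location> must precede every <person>,
# although text may still appear anywhere.
# p = element p { mixed { location*, person* } }
```

The last form would be too restrictive for running prose, where people and places can be mentioned in any order.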
A repeatable or-group is the best choice for a paragraph in this document, but
where the order of the content items is fixed and there is no plain text, a
content model should specify a sequence of elements. We took this approach in
our models for <metadata> and <date>.
Locations in fictional works can be either real or imaginary, and our XML uses
two attributes for <location> elements as a way of indicating both the target of a
location reference and whether it is real or imaginary. The where attribute, which
identifies the location reference, matters because places are not always called by
their names in the text, e.g.:
<location where="…" real="true">here</location>
The markup makes it possible for a computer to find all references to St. Petersburg even where the reference does not use the name of the city.
The real attribute allows us to distinguish real and unreal places, so that, for
example, if we wanted to create a map of places mentioned, we wouldn’t waste our
time trying to determine the longitude and latitude of places that cannot be
mapped, such as Heaven. As it happens, all of the places mentioned in these two
paragraphs are real. Since we want to allow only a true or false value for this
attribute, we constrain the information to boolean values by using the xsd:boolean
datatype. (For test purposes specifying the value as "true" | "false" is fine, but
in Real Life we would use the more precise datatype.) Specifying just text is
suboptimal because the possible values are predictable in advance (unlike with the
who or where attributes), so it’s better to use a tighter content model that does
not allow unanticipated values.
You can read about xsd:boolean at http://books.xmlschemata.org/relaxng/ch19-77025.html.
The value space is limited to true and false and the lexical space allows us to
write either true or 1 to represent the value true and either false or 0 to
represent the value false. The distinction between value space and lexical space
may seem peculiar on first encounter, but it turns out to be useful because it
makes it possible to allow the same information value to be spelled in different
ways and still be recognized as the same by a computer.
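The three ways of declaring the real attribute discussed above can be compared in one fragment, loosest first. The names real_text and real_enum are hypothetical, introduced only so that all three declarations can sit side by side; our schema contains only the last one.

```rnc
# Loosest: any string is accepted, including unanticipated values such as "maybe".
real_text = attribute real { text }

# Tighter: only the two listed strings are accepted.
real_enum = attribute real { "true" | "false" }

# Datatype-based (our choice): accepts true and 1 for the value true,
# and false and 0 for the value false.
real = attribute real { xsd:boolean }
```

The enumerated form and the datatype both rule out unanticipated values; the datatype additionally tells downstream tools that true and 1 mean the same thing.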
For research purposes we might want to keep track of who appears, is mentioned,
or talks in the novel. The who attribute allows us to use a unique and consistent
identifier for each distinct person in the novel, regardless of how they might be
identified in the prose (e.g., Mrs. Saville and Margaret and my dear sister are all
the same person). The rel attribute is less interesting in this excerpt because the
person is always the same person, but over the course of the novel it would allow
us access to changing patterns in family relationships.
In Real Life we might manage the cast of characters by creating a metadata
resource that associates the person who is Margaret Saville with the relationship
that is sister, so that we wouldn’t have to specify both values each time
Mrs. Saville is mentioned in a letter from her brother, Robert Walton. We’ll say
more about that type of data resource in your project groups should the need arise.
We make frequent use of repetition indicators after certain content pieces in our
schema to ensure that we get the right number of them (when only an exact number
is correct) or to allow the number to vary if there is variation in the
document. For example, we allow our <body> element to contain one or more <p>
elements. There are exactly two paragraphs in this abridged version of the letter,
but we would want our schema to be able to model contents with varying numbers of
paragraph children.
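As a quick reference, the fragment below repeats three definitions from our schema and annotates what each repetition indicator permits. The definitions themselves are unchanged; only the trailing comments are new.

```rnc
# No indicator means exactly one; ? means zero or one;
# * means zero or more; + means one or more.
letter = element letter { metadata, body }                      # exactly one of each, in this order
body = element body { p+ }                                      # at least one paragraph
addressee = element addressee { mixed { person*, location? } }  # any number of persons, then at most one location
```

Using p+ rather than exactly two p elements is what lets the same schema validate other letters with different numbers of paragraphs.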