Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2023-03-06T23:49:57+0000
Your schema does not have to look the same as ours, but one possible schema for this XML document (and other songs of the same type) is:
# <place> elements have a kind attribute with values like building, nature, etc.
# <chara> elements have a ref attribute that contains a standardized character name
# The ref value is defined as text, rather than as an or-group of fixed values, to accommodate other shows
place = element place { kind, text }
kind = attribute kind { text }
chara = element chara { ref, text }
ref = attribute ref { text }
The schema tries to use self-documenting names and explanatory comments wherever possible, but here are a few additional details:
Repetition in metadata: For the metadata section, whether you made items repeatable depends on what you expect XML documents of the same type to look like. For our sample schema above we thought of each song in this type of corpus as having exactly two <title> elements, one for the show title and one for the song title, so it was important to allow more than one, which we did with a plus sign. The model above is really too loose, though, for two related reasons: 1) it allows multiple titles of the same type and 2) it allows titles of only one type and none of the other type. It also allows the types of titles to appear in any order, which may not matter to XML processing, but if different songs put their titles in different orders it introduces cognitive friction that can compromise the developer’s focus. In Real Life we would use a tighter model, requiring exactly two titles (not the looser “one or more”) and requiring them in a specific order. That might look like:
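A minimal sketch of such a tighter model, consistent with the label names discussed here. The attribute values "show" and "song", the text content of the titles, and the position of artist after the titles are assumptions made for illustration, and the fragment omits the rest of the grammar:

```rnc
# Exactly one show title followed by exactly one song title
metadata = element metadata { show_title, song_title, artist }

# Both labels describe <title> elements; they differ only in the type value
show_title = element title { show_title_type, text }
show_title_type = attribute type { "show" }   # assumed value
song_title = element title { song_title_type, text }
song_title_type = attribute type { "song" }   # assumed value

artist = element artist { text }
```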
With this revision the content model for <metadata> requires exactly two titles, one of each type, and in a specific order. Because the <title> elements have different values for the type attribute, we use different labels (show_title, song_title), both of which describe <title> elements but with different attribute labels (show_title_type, song_title_type). Those attribute labels both describe attributes with the name type but with different single, specific string values.
This revision is stricter, which is an advantage, since it prevents us from having more than two <title> elements, it requires both song and show titles, and it requires that the two, which are distinguished by the value of their type attributes, occur in a specific order. At the same time, it’s more difficult to read because of the indirection, that is, because we have to read through several schema lines to discover that each of the two type attributes has a single allowed value. The specificity of this part of the model makes this a good opportunity to insert a bit of Russian doll Relax NG syntax into our schema:
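A sketch of how that Russian doll version might read. As before, the values "show" and "song" and the text content are illustrative assumptions, not values confirmed by the sample document:

```rnc
metadata = element metadata { show_title, song_title, artist }

# The type attribute is declared in place, with no separate attribute label
show_title = element title { attribute type { "show" }, text }
song_title = element title { attribute type { "song" }, text }

artist = element artist { text }
```

The single allowed value of each type attribute is now visible on the same line as the element that carries it, which removes the indirection.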
With Russian doll notation, we don’t use labels for the attributes on the two types of titles inside the content models. What we do instead is define the type attribute right inside the content model for the element that uses it. We don’t recommend Russian doll notation in general because deeper nesting becomes difficult to read, but here we find that because of the simplicity of the attribute declarations the schema becomes clearer if we define them in place. Your mileage may vary, so any of the models above are acceptable, but in Real Life we would use this last one.
As for the artist, you may have allowed more than one, thinking maybe that different songs would be arranged by different people, or because you want to include composer, lyricist, arranger, and perhaps others. Our sample schema, above, assumed that one person was responsible for all of these creative roles, and if our corpus consisted of songs from this show we’d leave it at that, so that’s what we’ve done here. If you anticipate using your schema with other shows that may have different persons in different artistic roles, it would be better to make <artist> repeatable. Note that we’ve used a comment to document the assumption that dictated our specific schema rule.
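The two choices for <artist> differ by a single repetition indicator. The following fragment is a sketch only; the surrounding content model of metadata and the text content of artist are assumed for illustration:

```rnc
# Single-show corpus: one person fills every creative role
# metadata = element metadata { show_title, song_title, artist }

# Corpus spanning several shows: allow one or more artists
metadata = element metadata { show_title, song_title, artist+ }
artist = element artist { text }
```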
Attribute values as or-groups of strings: In our solution we chose to write out some attribute values as or-groups of strings so that we could constrain them to one of two or three options (for example, the value of the attribute tone can be "positive", "negative", or "neutral", but it has to be one of those three exact strings). Whether to specify a value as an or-group of strings or, more loosely, with the keyword text, depends on whether we can anticipate all possible values in advance, and we hardcoded values only for items where that was the case. Since we assumed that there are only two types of line delivery in a musical, "spoken" or "sung", we represented the delivery attribute as an or-group of strings as well. On the other hand, the ref attribute of the <chara> element is left open because there are many characters in this musical besides the three or four mentioned in the text and we anticipate using the same schema to validate those other songs. For the same reason we define the speaker attribute as text, since Natalya and Rach aren’t the only characters who sing in the musical. As long as the inventory of possibilities is finite (and not insanely long) it would not be a mistake to use an or-group of strings that lists all of the character names in the show, but that would limit the schema to this one show. If you want to be able to use the same schema for any musical, the list of possible characters becomes open-ended, and therefore not enumerable, so text becomes the only option.
Whether you hardcoded specific string values or left them open by using the keyword text, it is always best to leave a comment in your schema explaining why, as we do above.
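The contrast between closed and open attribute values described above can be sketched as follows. The attribute names and string values come from the discussion; the comments are one possible wording of the documentation it recommends:

```rnc
# Closed sets: every possible value is known in advance
tone = attribute tone { "positive" | "negative" | "neutral" }
delivery = attribute delivery { "spoken" | "sung" }

# Open sets: other songs and other shows introduce new character names,
# so any text is allowed
speaker = attribute speaker { text }
ref = attribute ref { text }
```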
Flexibility: The element <direction> indicates a stage direction, which will not always appear at the beginning of the lyrics section, and there’s no reason there couldn’t be multiple stage directions within a song. Putting it in a schema without attaching a repetition indicator and without allowing some flexibility in its location risks limiting the schema to just this song.
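One way to give <direction> both repeatability and positional freedom is a repeatable or-group. In this sketch only direction is taken from the discussion; the names lyrics and stanza, and the text content of direction, are hypothetical stand-ins for whatever your own schema uses:

```rnc
# Stage directions and stanzas may alternate freely, in any number.
# Note: this loose form would also accept lyrics with no stanza at all.
lyrics = element lyrics { (direction | stanza)+ }
direction = element direction { text }
# stanza is assumed to be defined elsewhere in the schema
```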
Empty elements: The element <toneshift/> is an empty element. Empty elements can be represented in two ways that are syntactically different but have exactly the same meaning:
<toneshift/> is a self-closing empty-element tag. Note the slash at the end. This notation does not use separate start- and end-tags; the single self-closing tag represents the entire element. Empty elements have no content (nothing between start- and end-tags, which is why they can get away with just a single tag). They may have attributes, although <toneshift> doesn’t.
<toneshift></toneshift> is a combination of a regular start-tag followed by a regular end-tag with nothing between them.
These notations mean exactly the same thing, but we prefer using the self-closing empty-element single-tag version because it is more self-documenting and easier to understand at a glance.
The HTML specification recommends using the single-tag notation only for elements that must always be empty and the combination of start- and end-tag with nothing between them for elements that could have content in principle but happen not to in a particular location. That recommendation is not a requirement of XML, where the two notations are exactly synonymous and can be used in exactly the same locations.
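In Relax NG compact syntax, an element that must have no content is declared with the empty keyword. A minimal sketch for the element discussed above, assuming it takes no attributes:

```rnc
# Matches both <toneshift/> and <toneshift></toneshift>
toneshift = element toneshift { empty }
```

Because an XML parser reports the two notations identically, a validator cannot tell them apart, so the schema needs only this one declaration.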
Content models: The content model of an element lists only its attributes plus whatever (elements or text) is directly between its start- and end-tags, that is, its children but not its deeper descendants. For example, if you have a <song> element that contains a <stanza> element that, in turn, contains lines, you wouldn’t list <line> in the content model of <song>, because <song> doesn’t have any <line> child elements (although it does have <line> descendants).
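The difference can be sketched as follows. The element names come from the example above; the repetition indicators and the text content of line are assumptions for illustration:

```rnc
# Not this: <line> is a grandchild of <song>, not a child
# song = element song { line+ }

# Each content model names only the children of its own element
song = element song { stanza+ }
stanza = element stanza { line+ }
line = element line { text }
```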