Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-10-02T20:51:10+0000


Test #2: Relax NG: answers

The most common mistake in the submitted schemas was making attributes repeatable. Repeating an attribute on an element would be a well-formedness violation, and schema rules cannot override well-formedness constraints. This means that even if your schema says that an attribute is repeatable, it isn’t, and it’s a mistake for your schema to say that it allows something that isn’t well-formed because the schema is easier to understand when it tells the truth about the model.
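As a minimal sketch of the difference, the fragment below contrasts a misleading repeatable attribute with an optional one. It reuses the line and rhyme patterns from our schema in simplified form; the start pattern is an assumption added only so that the fragment is a complete grammar.

```rnc
# Hypothetical start pattern, added only to make the fragment self-contained
start = line

# Misleading: the "+" claims that rhyme may repeat, but a <line> with two
# rhyme attributes is not well-formed XML, so it can never validate
# line = element line { rhyme+, text }

# Truthful: the attribute is either absent or present exactly once
line = element line { rhyme?, text }
rhyme = attribute rhyme { text }
```

The only meaningful repetition indicator for an attribute is the question mark: an attribute is either required (no indicator) or optional.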

Another common issue was to include attributes inside a mixed model. That is correct Relax NG syntax and it will do what you want, but it’s a Bad Idea because attributes (which are spelled out inside the start-tag) are not really mixed in with plain text the way elements are. Spelling out the attributes separately before textual content (see our model for the <line> element, below) makes the code easier for a human to understand.
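The two approaches can be sketched side by side as follows. Both accept the same documents; the patterns are simplified from our schema, and the start pattern is an assumption added only so that the fragment is a complete grammar.

```rnc
# Hypothetical start pattern, added only to make the fragment self-contained
start = line

# Valid, but misleading: attributes are written in the start-tag,
# so they are never actually interleaved with the text
# line = element line { mixed { rhetoric?, rhyme? } }

# Clearer: attributes first, then the textual content
line = element line { rhetoric?, rhyme?, text }
rhetoric = attribute rhetoric { "anaphora" | "internal-rhyme" | "irony" | "repetition" }
rhyme = attribute rhyme { text }
```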

A third issue involved modeling rhyme as an or-group of strings, that is, a list of possible fixed values. Additional songs will introduce new sounds, and specifying all possible rhyming sequences for a (hypothetical) corpus of songs as an or-group of strings isn’t realistic.

Your schema does not have to look the same as ours, but one possible schema for modeling the XML document (and others of the same type) is:

# <metadata> elements and attributes
#
# Two types of <title> are specified so that we
# can control the order in which they appear.
#
metadata =
  element metadata { show-title, song-title, authors?, producers?, date*, keywords, source?, note* }
show-title = element title { show-level, text }
show-level = attribute level { "show" }
song-title = element title { song-level, text }
song-level = attribute level { "song" }
authors = element authors { author+ }
author = element author { text }
producers = element producers { producer+ }
producer = element producer { text }
date = element date { type, text }
type = attribute type { text }
keywords = element keywords { keyword+ }
keyword = element keyword { text }
source = element source { xsd:anyURI }
note = element note { resp, iso-date, text }
resp = attribute resp { text }
iso-date = attribute date { xsd:date }
# 
# ###
# <lyrics> and <verse> elements and attributes
#
# Sung and spoken verses are modeled separately because
# the "mode" attribute applies only to sung verses
# ###
#
lyrics = element lyrics { (spoken | sung)+ }
spoken = element verse { voice, spoken-delivery, line+ }
sung = element verse { voice, sung-delivery, mode, line+ }
# Value of "voice" attribute specified as "text" to accommodate other songs
voice = attribute voice { text }
spoken-delivery = attribute delivery { "spoken" }
sung-delivery = attribute delivery { "sung" }
mode = attribute mode { "patter" | "lyric" }
# 
# ###
# <line> elements and attributes
#
# "rhetoric" attribute values are an or-list because the inventory, even over
#     a large corpus, is limited
# "rhyme" attribute values are plain text because the possibilities are
#     essentially limitless
# ###
#
line = element line { rhetoric?, rhyme?, next?, previous?, text }
rhetoric = attribute rhetoric { "anaphora" | "internal-rhyme" | "irony" | "repetition" }
rhyme = attribute rhyme { text }
# 
# "next" and "previous" connect split lines
#
next = attribute next { xsd:boolean }
previous = attribute previous { xsd:boolean }

Most of the schema should be self-explanatory, at least with the in-place comments, but here is some additional information about specific moments where we considered alternatives:

  • Repetition indicators on metadata elements: Whether you make metadata items required or optional, and singular or repeatable, depends on what you think XML versions of similar documents are likely to include. For example, we made the metadata <date> element repeatable because we could imagine re-release dates, but it’s reasonable not to allow for that hypothetical possibility.
  • Types of titles: Our schema requires the show title to precede the song title in the metadata section, and we implemented that by creating two named patterns for <title> elements, each with a single, fixed attribute value. Not only does this approach enforce a consistent order for the two types of title, where consistency can make the documents easier to read, but it also allows elements of type <title> to be repeatable without allowing multiple show titles or multiple song titles, so that <title> is repeatable only as long as the title type is different. For test purposes it’s fine to have made <title> repeatable and allowed any text as the value of the level attribute.
  • Datatypes: In Real Life we would use the datatype xsd:anyURI for links and xsd:date for dates formatted according to ISO standards, that is, YYYY-MM-DD, but for test purposes it’s okay to model both of these as just text. You can read about datatypes in Relax NG at http://books.xmlschemata.org/relaxng/relax-CHP-8-SECT-1.html#relax-CHP-8-SECT-1.1.
  • Date as both element and attribute: The XML contains both an element called <date> and an attribute called date, and you can’t use the word date as the label for both. See the last item in our Patterns, anti-patterns, and other Relax NG details for an explanation of how we deal with that issue here.
  • Spoken and sung verses: Sung verses can have a mode attribute and spoken verses can’t, so we used different labels to model the two types of <verse> element differently. Making the mode attribute optional on all <verse> elements isn’t the most serious of mistakes, but unless you consciously decided that you might want to allow different modes for spoken verses, too, it’s overly permissive in a way that could have been avoided.
  • Lines: If you model <line> as mixed content Relax NG will do what you want, but it’s nonetheless a mistake because attributes are not mixed in with plain text and therefore don’t really participate in mixed content. It’s best to write the attributes first and then, because the only thing allowed between the start- and end-tags is plain text, to model the content as text. This won’t affect the validation, but it plays a role in how easily a human reader can understand the schema, and making code easy for a human to understand is important because developers often reread and revise code.
  • Rhetoric: We used an or-group of strings for the value of the rhetoric attribute because we anticipate that even with a very large corpus the inventory of possible values for this attribute will be limited, which means that a value like text would wind up allowing errors and inconsistencies. Early in our development process, though, we might prefer a content model of text, which we would then replace with an or-group once we had seen enough data to be reasonably confident that we had encountered all possible values. For test purposes text is fine.
  • Rhyme: We used IPA (International Phonetic Alphabet) to represent the sound of the rhyming portion of lines that participate in rhyme. Because different speakers may pronounce the same words in different ways, any representation of the sound of a line is necessarily an abstraction or generalization. This attribute should not be modeled as an or-group of strings because the rhyming part of a line in a song can contain just about any sequence of sounds, so each additional song is likely to introduce new rhymes into the inventory.
  • Broken lines: The last line of the song is broken across two speakers, and we model this with Boolean attributes that indicate whether there is or is not a preceding or following portion (lead-in or continuation) of the same line. The term Boolean means that only two opposing values are allowed (true/false, yes/no, 0/1, etc.), and the Relax NG xsd:boolean datatype has a lexical space (that is, inventory of allowed spellings) of true, false, 0, and 1. For test purposes it’s fine to have typed this as "true" | "false" or as text, although in Real Life we would use a more constrained datatype than just text to reduce the opportunity for inconsistency. You can read about the eponymous early nineteenth-century philosopher, logician, and mathematician George Boole on Wikipedia.
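
For comparison, here is a sketch of two of the looser alternatives that the notes above describe as acceptable for test purposes: a single <verse> pattern with an optional mode attribute, and split-line attributes typed as an or-group instead of xsd:boolean. The pattern names verse and delivery and the start pattern are assumptions for illustration and are not part of our schema.

```rnc
# Hypothetical start pattern, added only to make the fragment self-contained
start = verse

# Looser alternative: one verse pattern with an optional mode.
# Unlike the two-pattern model, this also accepts
# <verse delivery="spoken" mode="patter">, which our schema rejects.
verse = element verse { voice, delivery, mode?, line+ }
voice = attribute voice { text }
delivery = attribute delivery { "spoken" | "sung" }
mode = attribute mode { "patter" | "lyric" }

# Or-group alternative to xsd:boolean: accepts only the spellings
# "true" and "false", not "1" or "0"
line = element line { next?, previous?, text }
next = attribute next { "true" | "false" }
previous = attribute previous { "true" | "false" }
```

The single-pattern version is shorter, but it moves the rule that only sung verses have a mode out of the schema and into the encoder's head.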