Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-01-13T22:08:11+0000


XML test answers

Original text

<?xml version="1.0" encoding="UTF-8"?>
<title>Alma Mater<title>
<stanza><l><latin language type=phrase>Alma Mater</latin language>, wise and glorious,</l>
<l>Child of Light & Bride of Truth,</l>
<l>Over fate and foe victorious,</l>
<l><action>Dowered</action> with eternal youth,</l>
<l><action>Crowned</action> with love of son and daughter,</l>
<l>Thou shalt <action>conquer</Action> as of yore,</l>
<l>Dear old <city><centralLoc>Pittsburgh</city></centralLoc>, Alma Mater,</l>
<l>God <action>preserve</action> Thee evermore!</stanza>

Corrected text

<?xml version="1.0" encoding="UTF-8"?>
<poem>
<title>Alma Mater</title>
<stanza><l><latin_language type="phrase">Alma Mater</latin_language>, wise and glorious,</l>
<l>Child of Light &amp; Bride of Truth,</l>
<l>Over fate and foe victorious,</l>
<l><action>Dowered</action> with eternal youth,</l>
<l><action>Crowned</action> with love of son and daughter,</l>
<l>Thou shalt <action>conquer</action> as of yore,</l>
<l>Dear old <city><centralLoc>Pittsburgh</centralLoc></city>, Alma Mater,</l>
<l>God <action>preserve</action> Thee evermore!</l></stanza>
</poem>

Explanation

  1. The original text was missing a root element, which every XML document must have in order to be well-formed. We fixed this by adding a <poem> start tag right after the XML declaration and the matching </poem> end tag after our <stanza> end tag.
  2. The <title> element was not properly closed. The second <title> tag is meant to be an end tag, so we added the missing / to make it </title>.
  3. Naming conventions do not allow for spaces within element names, as in <latin language>, and there are a few ways we could fix this. We decided just to replace the space with an underscore (i.e., <latin_language>), but we could, alternatively, have renamed the element <latinLang> or <latin>. It would be easier to decide on the best correction if we knew why the developers had decided to tag this element, that is, how the information about the language would be used in subsequent processing.
  4. The value of the type attribute on the (now corrected) element <latin_language> needs to be enclosed in either single or double quotation marks.
  5. Because ampersand is a reserved character (that is, a character used for markup) in XML, an ampersand that is part of the textual content needs to be escaped so that an XML parser recognizes it as text, and not as markup. The character entity used to represent ampersand in XML is &amp;. To remind yourselves about ampersand and other reserved characters, and how to code around them when they are part of your textual content, review the Entities and numerical character references section of the XML tutorial that you read last week.
  6. The <action> tags around conquer need to match. XML is case sensitive, so <action> and </Action> are treated as different names; we replaced the uppercase A with a lowercase a in the end tag. Whether you use upper or lower case is up to you, but it’s conventional in XML to start element names with lower-case characters. Being consistent is important because it reduces the opportunity for human error.
  7. The two sets of elements around Pittsburgh are not properly nested. To fix this we swapped the end tags (you could, alternatively, have swapped the start tags), so that <centralLoc> is properly nested within <city>.
  8. The final <l> element is missing the matching end tag, so we added </l> before </stanza>. Be careful to keep those properly nested!
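
If you want to confirm these corrections for yourself, one option is to hand each snippet to an XML parser and see whether it complains. The following is only a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the helper name check_well_formed and the shortened snippets are our own illustrations, not part of the test. The exact wording of the error messages depends on the parser, so we don't rely on it here.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# One shortened snippet per numbered error above (hypothetical excerpts).
BROKEN = {
    "1. no root element": "<title>Alma Mater</title><stanza></stanza>",
    "2. unclosed title": "<title>Alma Mater<title>",
    "3. space in element name": "<l><latin language>Alma Mater</latin language></l>",
    "4. unquoted attribute value": "<l><latin_language type=phrase>Alma Mater</latin_language></l>",
    "5. raw ampersand": "<l>Child of Light & Bride of Truth,</l>",
    "6. case mismatch": "<l>Thou shalt <action>conquer</Action> as of yore,</l>",
    "7. improper nesting": "<l><city><centralLoc>Pittsburgh</city></centralLoc></l>",
    "8. missing end tag": "<stanza><l>God preserve Thee evermore!</stanza>",
}

# A shortened version of the corrected text (XML declaration omitted for brevity).
CORRECTED = """<poem>
<title>Alma Mater</title>
<stanza><l><latin_language type="phrase">Alma Mater</latin_language>, wise and glorious,</l>
<l>Child of Light &amp; Bride of Truth,</l>
<l>Thou shalt <action>conquer</action> as of yore,</l>
<l>Dear old <city><centralLoc>Pittsburgh</centralLoc></city>, Alma Mater,</l>
<l>God <action>preserve</action> Thee evermore!</l></stanza>
</poem>"""

for label, snippet in BROKEN.items():
    # Each broken snippet should produce some parser error message.
    print(label, "->", check_well_formed(snippet))

print("corrected ->", check_well_formed(CORRECTED))
```

A parser stops at the first well-formedness error it meets, which is why the sketch tests each problem in its own small snippet instead of feeding it the whole original text at once.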

Bonus: Many of you caught the fact that the second time the phrase Alma Mater appeared in the text it was not tagged the way it was in the first instance. This is not a well-formedness error, so you weren’t expected to correct it. It is, though, an error that occurred in the developers’ document analysis; they needed to spend a little longer thinking about the text before they started marking it up.
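
Inconsistent tagging of this kind is invisible to a well-formedness check, but it can be hunted down once the document parses. The following is a rough sketch, again assuming Python's standard xml.etree.ElementTree module; the function name untagged_occurrences is hypothetical, and we assume that only the text inside <l> elements matters, so the <title> is deliberately left out. It only looks at text that sits directly inside each element, so it is a starting point, not a complete consistency checker.

```python
import xml.etree.ElementTree as ET


def untagged_occurrences(root, phrase, tag):
    """Collect text runs in <l> elements that contain phrase outside <tag>."""
    hits = []
    for line in root.iter("l"):
        for el in line.iter():
            runs = []
            if el.tag != tag:
                # Text directly inside an element other than <tag>.
                runs.append(el.text)
            if el is not line:
                # Text that follows this element's end tag (still inside the line).
                runs.append(el.tail)
            hits.extend(run.strip() for run in runs if run and phrase in run)
    return hits


# Two lines from the corrected text, enough to show the inconsistency.
sample = """<poem>
<title>Alma Mater</title>
<stanza><l><latin_language type="phrase">Alma Mater</latin_language>, wise and glorious,</l>
<l>Dear old <city><centralLoc>Pittsburgh</centralLoc></city>, Alma Mater,</l></stanza>
</poem>"""

root = ET.fromstring(sample)
hits = untagged_occurrences(root, "Alma Mater", "latin_language")
print(hits)
```

Whether each reported run should actually be tagged is still a document-analysis decision; the script only points to the places worth a second look.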