Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-01-27T00:39:58+0000


Test #1: XML: Answer key

Instructions

Answer the following questions and upload your answers to Canvas. All tests in this course are open book, so you can look things up and you can try out your code in <oXygen/>, but you cannot receive help from another person.

Create your answers in <oXygen/> as a plain-text document, which you can do by creating a new document (using the File → New menu options) and typing “text” into the “Type filter text” box. Do not use a word processor (like Microsoft Word) because word processors do things like changing your straight quotation marks to curly ones, which you don’t want. The file you upload must follow our file-naming conventions, using .txt as the filename extension (to indicate that you are submitting a plain-text file).

Questions

  1. Question: The following is an example of an XML document. What is the name of the root element of this example?

    
      sugar
      butter
      flour
      chocolate chips

    Answer: The root element, or the element that encloses all other elements in the document, is <ingredients>. A well-formed document must have exactly one root element.
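
    As a rough cross-check outside <oXygen/>, the following sketch parses a small recipe document and reports its root element. It assumes Python's standard xml.etree.ElementTree module, and the child element name <ingredient> is hypothetical; only the root name <ingredients> and the four text values come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction: the child element name <ingredient> is assumed.
doc = """<ingredients>
    <ingredient>sugar</ingredient>
    <ingredient>butter</ingredient>
    <ingredient>flour</ingredient>
    <ingredient>chocolate chips</ingredient>
</ingredients>"""

root = ET.fromstring(doc)               # raises ET.ParseError if not well-formed
root_name = root.tag                    # name of the single outermost element
items = [child.text for child in root]  # text of each direct child

print(root_name)
print(items)
```

    The parser hands back exactly one outermost element, which is what the answer means by the root.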

  2. Question: What is the difference between descriptive markup and presentational markup? Give an example of each type and explain 1) why the descriptive markup is descriptive, 2) why the presentational markup is presentational, and 3) why we recommend using descriptive markup in digital humanities projects.

    Answer:

    1. Descriptive markup focuses on the role and meaning of an element in its context, while presentational markup focuses on the rendered appearance of an element.

    2. A piece of italicized text might be used in a text to represent the thoughts of a character, and descriptive markup might tag that piece as <thought> to represent its meaning to the story. Meanwhile, presentational markup might simply tag that piece as <italic>.

    3. We prioritize descriptive markup in digital humanities work because there are more types of meaning in the world than types of appearance, which means a computer can derive the appearance from the meaning more reliably than if it has to derive the meaning from the appearance.

  3. Question: Which of the following scenarios would cause a well-formedness error (red square) in an XML document? (Some may be poor style because they’re confusing, while still being well-formed, and the task is to identify only the ones that are not well-formed.) For each example that is not well-formed, explain why it is not well-formed and how you might fix it.

    Answers listed under each scenario

    1. An element in the document does not contain any content, e.g.:

      <bookmark></bookmark>

      Answer: Empty elements are allowed in XML, so this does not raise a well-formedness error.
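
      A minimal sketch of this point, assuming Python's standard xml.etree.ElementTree parser as a stand-in for the check that <oXygen/> performs. The self-closing spelling <bookmark/> is not part of the question; it is included only as the equivalent shorthand for an empty element.

```python
import xml.etree.ElementTree as ET

paired = ET.fromstring("<bookmark></bookmark>")
self_closing = ET.fromstring("<bookmark/>")  # shorthand, added for comparison

# Both spellings parse, and both yield an element with no text and no children.
print(paired.tag, paired.text, len(paired))
print(self_closing.tag, self_closing.text, len(self_closing))
```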

    2. A speech is tagged as having two speakers, e.g.:

      <speech speaker="homer" speaker="bart">Doh!</speech>

      Answer: An element can have only one attribute with a given name, so having two attributes called speaker is a well-formedness violation. The most robust way to fix this is to combine the speaker names under a single attribute, e.g., <speech speaker="homer bart">Doh!</speech>.

      Some of you may have tried <speech speaker1="homer" speaker2="bart">Doh!</speech>. This is well formed and it received full credit on this test, but for reasons we’ll explore later, it is not as useful an approach as the one we suggest above.

      Duplicating the tag and its content to create two separate <speech> elements (i.e., <speech speaker="homer">Doh!</speech> <speech speaker="bart">Doh!</speech>) will produce a well-formed document, but isn’t informationally correct because it creates two speeches where only one existed in the source document. In this example, it would make it appear as though each character independently said D’oh, one after the other, rather than saying it in unison, which is what the original version tries to express. If you want to use two separate elements, there are strategies for representing the simultaneity with attributes (along the lines described in the answer to scenario 10, below).

      Removing one of the speaker attributes/values would create a well-formed document, but that’s not a useful strategy because it loses information. If, for example, you were to delete speaker="homer", you would get a green square at the expense of not knowing, when you perform analysis later, that Homer spoke here.
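
      The duplicate-attribute rule can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser in place of <oXygen/>. The helper name is_well_formed is hypothetical, and the combined form assumes that the two names share one speaker attribute separated by a single space.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

duplicated = '<speech speaker="homer" speaker="bart">Doh!</speech>'
# Assumption: both names in one attribute value, separated by a space.
combined = '<speech speaker="homer bart">Doh!</speech>'

print(is_well_formed(duplicated))
print(is_well_formed(combined))

# Later processing can recover the individual speakers from the single value.
speakers = ET.fromstring(combined).get("speaker").split()
print(speakers)
```

      Because the combined value splits cleanly on whitespace, no speaker information is lost.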

    3. A comment occurs inside an element, e.g.:

      
          
          As I do live, my honoured lord, 'tis true;
          And we did think it writ down in our duty
          To let you know of it.

      Answer: Comments can go nearly anywhere in an XML document, including inside elements. This would not raise a well-formedness error.
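
      A minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The element name <speech> and the comment text are hypothetical. The second string illustrates the “nearly” above: a comment placed inside a start-tag, rather than inside element content, is not accepted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical markup: element name and comment text are assumed.
in_content = """<speech>
    <!-- check this reading -->
    As I do live, my honoured lord, 'tis true;
</speech>"""
in_start_tag = "<speech <!-- check this reading -->>As I do live</speech>"

print(is_well_formed(in_content))
print(is_well_formed(in_start_tag))

# The comment is not part of the element's text content.
text = "".join(ET.fromstring(in_content).itertext()).strip()
print(text)
```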

    4. An attribute name (in this case, main speaker) contains a space character, e.g.:

      
          If it be,
          Why seems it so particular with thee?

      Answer: An attribute name cannot contain a space, since a space would cause <oXygen/> to see the two words as separate attribute names (in this case main, with no attribute value, which is itself a well-formedness error, and speaker). This could be fixed by combining the two-word name in several ways, such as snake case (e.g., main_speaker), kebab case (e.g., main-speaker), and camel case (e.g., mainSpeaker).
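
      The three repairs can be compared against the original spelling with a short sketch, assuming Python's standard xml.etree.ElementTree parser. The element name <speech> and the attribute value gertrude are hypothetical; the attribute names are the ones discussed above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

attribute_names = ["main speaker", "main_speaker", "main-speaker", "mainSpeaker"]

# Hypothetical markup: element name and attribute value are assumed.
results = {
    name: is_well_formed(f'<speech {name}="gertrude">If it be,</speech>')
    for name in attribute_names
}

for name, ok in results.items():
    print(f"{name!r}: {'well-formed' if ok else 'not well-formed'}")
```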

    5. An attribute value contains a space character, e.g.:

      
          If it be,
          Why seems it so particular with thee?

      Answer: An attribute value can contain a space, since the content of the value is a literal string, and not the name of an XML element or attribute. This will not raise a well-formedness error.
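
      A brief sketch of the same point, assuming Python's standard xml.etree.ElementTree parser; the element name <speech> and the value are hypothetical. The space survives parsing as part of the attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element name and attribute value are assumed.
doc = '<speech speaker="Queen Gertrude">If it be,</speech>'

speaker = ET.fromstring(doc).get("speaker")  # parses without error
print(speaker)
```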

    6. Paired start- and end-tags differ in capitalization:

      
          If it be,
          Why seems it so particular with thee?

      Answer: Start- and end-tags must be exactly the same, including capitalization, because XML is case-sensitive, so an element called <line> is a completely different type of element than one called <Line>. This error can be fixed by making the start- and end-tags agree exactly in the spelling of the element name, e.g., <line>…</line>.

      Most XML taggers prefer lower-case element names, so although <Line>…</Line> is also well-formed (and receives full credit on this test), we recommend standardizing on the lower-case spelling instead.
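
      Case sensitivity can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The helper name is_well_formed is hypothetical, and the sample text is borrowed from the scenario.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

mismatched = "<line>If it be,</Line>"
lower = "<line>If it be,</line>"
upper = "<Line>If it be,</Line>"

print(is_well_formed(mismatched))
print(is_well_formed(lower), is_well_formed(upper))

# Both consistent spellings parse, but they name different element types.
same_type = ET.fromstring(lower).tag == ET.fromstring(upper).tag
print(same_type)
```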

    7. A complete document lacks a single root node, e.g.:

      
      Curly
      Larry
      Moe

      Answer: All XML documents must have a root element that contains all other elements. This well-formedness error can be fixed by wrapping everything but the XML declaration (which is not document content and which is allowed only as the absolute first line of the file) in a root element (in this case something like <stooges>).
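
      A minimal sketch of both rules in this answer, assuming Python's standard xml.etree.ElementTree parser. The element name <stooge> and the helper is_well_formed are hypothetical; <stooges> is the wrapper suggested above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

declaration = '<?xml version="1.0"?>'
# Hypothetical markup: the element name <stooge> is assumed.
siblings = "<stooge>Curly</stooge><stooge>Larry</stooge><stooge>Moe</stooge>"

no_root = declaration + siblings                               # three top-level elements
wrapped = declaration + "<stooges>" + siblings + "</stooges>"  # one root element
late_declaration = "\n" + wrapped                              # declaration not at the very start

print(is_well_formed(no_root))
print(is_well_formed(wrapped))
print(is_well_formed(late_declaration))
```

      The root element wraps only the document content; the XML declaration stays outside it, at the very beginning of the file.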

    8. An element contains a less-than sign:

      1 < 2

      Answer: Since a less-than sign (<) is the symbol used for opening a tag, <oXygen/> thinks that it must be followed immediately by an element name, optional attribute information, and then by a greater-than sign. Since the character immediately after the less-than sign in this example is a space and element names cannot contain space characters, and also because there is no following greater-than sign, this raises a well-formedness error.

      You can fix this by replacing the reserved character with an appropriate character entity, which an XML processor will understand correctly as representing a literal character that could not be entered directly because it has a special meaning. The character entity that represents < is &lt; (the lt is a mnemonic for less than). See the Entities and numerical character references section of our XML tutorial for discussion and examples.

      Alternatively, you could replace the < symbol with its corresponding numerical character reference (or NCR) in Unicode; this works much like the character entities mentioned above, but with a Unicode value in place of the mnemonic. The NCR for < is &#60; (decimal) or &#x3c; (hex). We prefer the character entity because it is easier for a human to recognize and understand.
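
      The three spellings can be compared with a short sketch, assuming Python's standard xml.etree.ElementTree parser. The element name <claim> and the helper is_well_formed are hypothetical. After parsing, the entity and both NCR forms yield the same literal text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical markup: the element name <claim> is assumed.
raw = "<claim>1 < 2</claim>"
escaped = {
    "entity": "<claim>1 &lt; 2</claim>",
    "decimal NCR": "<claim>1 &#60; 2</claim>",
    "hex NCR": "<claim>1 &#x3c; 2</claim>",
}

print(is_well_formed(raw))

# Each escaped form is resolved by the parser to the literal character.
parsed = {label: ET.fromstring(xml).text for label, xml in escaped.items()}
for label, text in parsed.items():
    print(f"{label}: {text}")
```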

    9. An element contains a greater-than sign:

      2 > 1

      Answer: This will not raise a well-formedness error because although the greater-than sign represents the end of a tag, it has that special markup meaning only if it has been preceded by a less-than sign. Since that isn’t the case here, an XML processor, such as <oXygen/>, understands that it has a literal meaning and does not raise an error.

      There is a character entity for the greater-than sign, which is spelled &gt;, and it is not incorrect to use it, although we struggle to think of a situation where it is required to avoid having a greater-than sign mistaken for a tag end-delimiter.
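
      A brief sketch, assuming Python's standard xml.etree.ElementTree parser; the element name <claim> is hypothetical. The literal character and the entity parse to the same text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element name <claim> is assumed.
literal = ET.fromstring("<claim>2 > 1</claim>").text
entity = ET.fromstring("<claim>2 &gt; 1</claim>").text

print(literal)
print(entity)
print(literal == entity)
```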

    10. An element (in this case <quote>) starts in one context and ends in another (in this case, two different <line> elements), e.g.:

      
          And all should cry, Beware! Beware!
          His flashing eyes, his floating hair!
          Weave a circle round him thrice,
          And close your eyes with holy dread,
          For he on honey-dew hath fed
          And drunk the milk of Paradise.

      Answer: Elements must start and end in the same context (that is, inside the same parent element). There are a few ways to resolve this error, each of which involves compromises, and the simplest may be to tag separately each of the quote pieces inside each <line> element. This isn’t fully satisfactory by itself because the lines do not actually contain separate quotes (for example, this strategy wouldn’t let us count quotes in the text by counting <quote> elements), and you can adjust for that by adding an attribute that shows which <quote> elements go together. For example, if you tag all parts of the same quote as <quote group="a">, XML processing can use the shared attribute value a to know that these constitute a single textual quote. We’ll introduce other methods later in the course.
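
      The grouping strategy can be sketched as follows, assuming Python's standard xml.etree.ElementTree parser. The wrapper element <poem> and the helper is_well_formed are hypothetical, and the passage is cut to three lines for brevity. The count of distinct group values, rather than the count of <quote> elements, gives the number of textual quotes.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical markup: the wrapper <poem> is assumed; passage abbreviated.
overlapping = (
    "<poem>"
    "<line>And all should cry, <quote>Beware! Beware!</line>"
    "<line>His flashing eyes, his floating hair!</line>"
    "<line>Weave a circle round him thrice,</quote></line>"
    "</poem>"
)
fragmented = (
    "<poem>"
    '<line>And all should cry, <quote group="a">Beware! Beware!</quote></line>'
    '<line><quote group="a">His flashing eyes, his floating hair!</quote></line>'
    '<line><quote group="a">Weave a circle round him thrice,</quote></line>'
    "</poem>"
)

print(is_well_formed(overlapping))
print(is_well_formed(fragmented))

quotes = list(ET.fromstring(fragmented).iter("quote"))
fragment_count = len(quotes)                          # number of <quote> elements
quote_count = len({q.get("group") for q in quotes})   # number of distinct quotes
print(fragment_count, quote_count)
```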

      Just putting the <quote> tags around the last two words of the first line gets a green square, but it isn’t an adequate remedy because it changes the meaning of the document by failing to say that the following lines are also quoted text.

      The markup languages community refers to the issues involved in this example as overlap and discontinuity. The overlap involves quotes and lines that overlap one another, and the discontinuity involves tagging the quote components separately even though they are really parts of a single quote. XML was designed to represent hierarchy well, and it does, but that benefit comes at the expense of not being able to represent overlap or discontinuity in a way that feels natural to humans.

      We might think that if XML didn’t prohibit overlap the problem would go away, but the issue is that the prohibition against overlap is there for a reason: that’s what enables a computer to recognize XML as a hierarchy. Computers can navigate and process hierarchies more easily than overlapping or discontinuous structures, and, for better or for worse, the designers of XML decided to commit to hierarchy and accept the inconveniences that come with that commitment.