Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-09-13T00:44:02+0000


Test #1: XML answers

Instructions

Answer the following questions and upload your answers to Canvas. All tests in this course are open book, so you can look things up and you can try out your code in <oXygen/>, but you cannot receive help from another person.

Create your answers in <oXygen/> as a plain-text document, which you can do by creating a new document (using the File → New menu options) and typing the word text into the “Type filter text” box. Do not use a word processor (like Microsoft Word) because word processors do things like changing your straight quotation marks to curly ones, which you don’t want. The file you upload must follow our file-naming conventions, using .txt as the filename extension (to indicate that you are submitting a plain-text file).

Required questions

  1. Question: What is the difference between descriptive and presentational markup? Which one do we prioritize in the digital humanities and why?

    Answer: Descriptive markup focuses on the role and meaning of an element in its context. For example, a paragraph is (descriptively) a rhetorical subdivision of a prose text that is typically characterized by topical or thematic unity. Presentational markup focuses on the rendered appearance of an element. For example, a paragraph might look like (presentationally) a group of lines separated from preceding and following groups of lines by a blank line or with left indentation at the beginning. We prioritize descriptive markup in digital humanities work because there are more types of meaning in the world than types of appearance, which means a computer can derive the appearance from the meaning more reliably than if it has to derive the meaning from the appearance.
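
    As a minimal sketch of the difference, the following tags the same sentence both ways. The element names and the sample sentence are invented for illustration.

    ```xml
    <examples>
        <!-- Descriptive: the tags say what each piece of text is -->
        <sentence>She read <bookTitle>Dracula</bookTitle> aboard the <shipName>Demeter</shipName>.</sentence>
        <!-- Presentational: the tags say only how each piece of text looks -->
        <sentence>She read <italic>Dracula</italic> aboard the <italic>Demeter</italic>.</sentence>
    </examples>
    ```

    A stylesheet could render both <bookTitle> and <shipName> in italics, but nothing in the second version tells a computer which <italic> is a book and which is a ship.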

  2. XML elements can have four types of content, listed below. For each one provide a brief description and an example (in XML):

    1. Question: Element content

      Answer: An element may contain only other elements, with no plain text (except whitespace characters). An <act> element in a play might contain only a <heading> element (e.g., an element that itself contains just text like Act 1) followed by one or more <scene> elements that hold the contents of the individual scenes. If an <act> element has element content, it contains nothing except other elements.

    2. Question: Text content

      Answer: An element may contain only literal text. For example, the <heading> element mentioned above contains only the string of characters that reads Act 1; it does not contain any other elements.

    3. Question: Mixed content

      Answer: An element may contain plain text characters with elements mixed in. For example, within a paragraph we might tag personal names and placenames, so that the paragraph would contain plain text with those types of elements mixed in.

    4. Question: Empty element

      Answer: Empty elements have no content (no elements or plain text between the start- and end-tags), but they may have attributes (which are not technically content). If we want to indicate where images (which are non-textual) appear in our otherwise textual document, we might use an empty element that points, with the help of an attribute, to the image file, e.g.,

      <img src="https://upload.wikimedia.org/wikipedia/commons/a/a2/Whitby_Abbey_ruins_18.jpg"/>

      Empty elements may be spelled either as a start-tag followed immediately by an end-tag, with nothing between them, or as a self-closing single tag (illustrated here; note the slash at the end of the tag). These two representations of empty elements mean the same thing to an XML processor.
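
    The four content types can be seen together in one small document. This is only a sketch: the <stageDirection>, <person>, <place>, and <pageBreak> elements, the n attribute, and the sample text are assumptions made up for illustration.

    ```xml
    <!-- Element content: <act> holds only other elements (plus whitespace) -->
    <act>
        <!-- Text content: <heading> holds only characters -->
        <heading>Act 1</heading>
        <scene>
            <!-- Empty element, spelled as a self-closing tag -->
            <pageBreak n="2"/>
            <!-- Mixed content: plain text with elements mixed in -->
            <stageDirection>Enter <person>Hamlet</person>, newly arrived in <place>Elsinore</place>.</stageDirection>
            <!-- Empty element, spelled as a start-tag followed by an end-tag -->
            <pageBreak n="3"></pageBreak>
        </scene>
    </act>
    ```

    Like <act>, the <scene> element here has element content. The comments do not change that, because comments are not text content.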

  3. Question: If you paste the following document into <oXygen/> you’ll get a red square because your document won’t be well formed. What’s wrong with it and how can you fix it? (There may be more than one thing to fix.)

    
    Shopping list
    Apples
    Yogurt
    Ice cream

    Answer: There are two sets of mistakes in this example: there is no root element and the attribute values are not quoted. A correct version might be:

    
    
        Shopping list
        Apples
        Yogurt
        Ice cream

    You’re allowed to use <oXygen/> when completing this type of test, and we’d recommend doing that. Almost all of you noticed the missing root element in this question, but a few of you didn’t notice that the attribute values weren’t quoted. Had you pasted the document into <oXygen/> and added a root element, the lingering red square would have alerted you that there was an additional problem with the markup. It’s difficult for a human to notice these sorts of finicky details, which is why it can be helpful to use tools like <oXygen/> to edit XML.
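
    The two kinds of fix can be sketched as follows. The element and attribute names (<list>, <title>, <item>, type) and the attribute values are hypothetical, chosen only to illustrate the repairs. They are not necessarily the names used in the question.

    ```xml
    <!-- Hypothetical names throughout: list, title, item, type -->
    <!-- Fix 1: a single root element, <list>, wraps everything else -->
    <list>
        <title>Shopping list</title>
        <!-- Fix 2: attribute values are quoted (type="fruit", not type=fruit) -->
        <item type="fruit">Apples</item>
        <item type="dairy">Yogurt</item>
        <item type="dairy">Ice cream</item>
    </list>
    ```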

  4. Question: If you copy and paste the following document into <oXygen/> you’ll get a red square because your document won’t be well formed. What’s wrong with it and how can you fix it? (There may be more than one thing to fix.)

    
    1 < 2

    Answer: The less-than sign (<) is a reserved character, which means that it cannot be used to represent a literal textual angle bracket in an XML document. You can fix this by replacing the less-than sign with the spelling &lt;. When it comes time to render the document, this will be displayed the way you want, as a literal less-than sign. The reason for this constraint is that an XML processor assumes that a less-than sign is the beginning of a tag, and not a literal angle bracket character. The spelling &lt; is called a character entity.

    Here, too, using <oXygen/> can help. All of you noticed that the less-than sign was a problem, but a few of you didn’t notice that the end-tag was missing the slash character. Had you pasted the document into <oXygen/> and fixed the less-than sign, <oXygen/> would have alerted you about the mistake in the end-tag.
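
    A corrected version might look like the sketch below. The element name <statement> is a hypothetical stand-in, used only for illustration.

    ```xml
    <!-- <statement> is a hypothetical element name -->
    <!-- Not well formed: <statement>1 < 2<statement> -->
    <!-- Well formed: the less-than sign is spelled &lt; and the end-tag has its slash -->
    <statement>1 &lt; 2</statement>
    ```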

    By the way, don’t confuse backslash (\) with slash (/) (also called forward slash). To a computer these are completely different characters.

Optional extra-credit questions

  1. Question: What is the difference between a) well-formed, b) valid, and c) well-balanced XML?

    Answer: Well-formed XML must conform (tautologically) to the well-formedness constraints that are part of the official XML specification (the authoritative document that defines XML, available at https://www.w3.org/TR/xml/). These include requirements like a single root element, matching start- and end-tags, quoted attribute values, etc. Valid XML meets all well-formedness requirements and is also valid against a schema, a set of rules that further constrain which elements can appear where. For example, your schema might specify that there are elements called <p> in your document but no elements called <paragraph>. Without schema validation, XML wouldn’t care what you called your elements, or where they appeared, as long as they didn't violate any well-formedness constraints. Well-balanced XML satisfies all well-formedness requirements except the single root element. For example, a set of <letter> elements might constitute well-balanced XML that represents Oscar Wilde’s correspondence with Bosie, where each letter by itself could be a well-formed XML document, but if you concatenate them into a single document you would have to wrap all of them in a single root element (such as <letters>) for your well-balanced XML to constitute a well-formed XML document.
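
    The difference between well-balanced and well-formed can be sketched as follows. The n attribute and the placeholder text are assumptions for illustration.

    ```xml
    <!-- Well formed: a single root element, <letters>, wraps the individual letters -->
    <letters>
        <letter n="1">Text of the first letter</letter>
        <letter n="2">Text of the second letter</letter>
    </letters>
    <!-- Remove the <letters> start-tag and end-tag, and the two <letter> elements
         that remain are well balanced (every tag is matched and properly nested),
         but with no single root they no longer form a well-formed document. -->
    ```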

  2. Question: If Curly, Larry, and Moe all speak in unison and you try to represent it as:

    <speech speaker="Curly" speaker="Larry" speaker="Moe">Nyuk, nyuk nyuk!</speech>

    you’ll raise an error. Why is this not well formed and how can you fix it in a way that is amenable to machine processing?

    Answer: Repeating an attribute name on an element is a well-formedness violation. You can, however, set the value of an attribute equal to a sequence of items, e.g., <speech speaker="Curly Larry Moe">. XML by itself doesn’t know that an attribute value like this one represents three speakers, rather than one with a first, middle, and last name (e.g., John Fitzgerald Kennedy), but you can encode that specification in a schema.

    Some of you tried something like <speech speaker="Curly, Larry, and Moe">. This is not incorrect, but for reasons that will become clear later, the commas and the conjunction make it harder for a computer to process the value than if the names were just separated by space characters. Some of you tried something like <speech speaker1="Curly" speaker2="Larry" speaker3="Moe">. This also is not incorrect, but it’s easier for a computer to find, say, speeches where Curly is one (any one) of the speakers if the attribute name is always the same. We’ll see later, once we begin processing our XML (and not just tagging it), why it’s easier for a computer to deal with <speech speaker="Curly Larry Moe"> than with different attribute names for each speaker.
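
    One way to tell a schema that the value is a sequence of speakers is sketched below. It assumes Relax NG in its XML syntax as the schema language, which is only one of several that could express this, and it assumes that each speaker identifier is a single name-like token with no spaces.

    ```xml
    <!-- A sketch in Relax NG XML syntax; the choice of schema language is an assumption -->
    <element name="speech" xmlns="http://relaxng.org/ns/structure/1.0"
        datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
        <attribute name="speaker">
            <!-- <list> splits the attribute value on whitespace and checks each piece -->
            <list>
                <oneOrMore>
                    <!-- Each piece is one name-like token, that is, one speaker -->
                    <data type="NCName"/>
                </oneOrMore>
            </list>
        </attribute>
        <text/>
    </element>
    ```

    Against a schema like this one, speaker="Curly Larry Moe" is read as three speakers, while a value with commas in it, such as speaker="Curly, Larry, and Moe", would be rejected.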

  3. Question: One of the questions above notes that:

    
    1 < 2

    is not well formed. Yet:

    
    1 > 2

    is well formed. Why might an XML parser (the part of <oXygen/> that has to translate our character-by-character XML into a tree structure) treat these differently?

    Answer: Parsers read from beginning to end, so when a parser sees a greater-than sign it knows whether it has already seen a less-than sign. This means that it knows whether the greater-than sign is markup, since it can only be markup if it follows a less-than sign. When the parser sees a less-than sign, though, it doesn’t yet know what will come later, so it assumes that all less-than signs are the beginnings of tags, and not literal characters. This means that if your document contains a literal less-than sign you must represent it with a character entity, while a greater-than sign can be represented either as a character entity (&gt;) or as a literal > character.
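
    The three acceptable spellings can be seen side by side in the sketch below. The element names <comparisons> and <comparison> are hypothetical.

    ```xml
    <comparisons>
        <!-- A literal > is fine in text: no tag is open, so it cannot be closing one -->
        <comparison>1 > 2</comparison>
        <!-- The entity spelling of the greater-than sign means the same thing -->
        <comparison>1 &gt; 2</comparison>
        <!-- A literal < would be read as the start of a tag, so it must be spelled &lt; -->
        <comparison>1 &lt; 2</comparison>
    </comparisons>
    ```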