Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-09-12T23:54:59+0000


Test #1: XML answers

Instructions

Answer the following questions and upload your answers to Canvas. All tests in this course are open book, so you can look things up, but you cannot receive help from another person.

Create your answers in <oXygen/> as a plain-text document, which you can do by creating a new document (using the File → New menu) and typing “text” into the “Type filter text” box. Do not use a word processor (like Microsoft Word), because word processors change your straight quotation marks to curly ones, which you don’t want. Your answer file must follow our file-naming conventions (http://dh.obdurodon.org/file-naming_conventions.xhtml), using “.txt” as the filename extension (to indicate that you are submitting a plain-text file).

Required questions

  1. The following elements do not follow proper naming conventions. For each element, explain why this is the case and suggest an alternative element name that does not violate well-formedness. (2 points)

    1. <3musketeers> </3musketeers>

      Answer: XML allows digits in element names, but not as the first character. Change to something like <three_musketeers>.

    2. <phone book> </phone book>

      Answer: XML does not allow spaces in element names. Change to something like <phone_book> or <phoneBook> or <phone-book>. Fun fact: <phoneBook> is called camel case, <phone-book> is called kebab case, and <phone_book> is called snake case.

  2. Name two XML well-formedness requirements (other than the element naming restrictions above). (2 points; name more than 2 for extra credit)

    Answer:

    • An XML document must have a single root element, wrapped around everything else (except the optional XML declaration).

    • Elements must be properly nested. For example, if an element <y> is opened inside an element <x>, the <y> element must be closed before the <x> element is closed. Below is an example of improper nesting:

      <p><dialogue speaker="bear" subject="woman">“OW, OW!” growled 
      the <character who="bear">bear.</dialogue></character></p>

    • All start-tags must have matching end-tags (except in the case of self-closing empty element tags like <tag/>).

    • Attribute values must be quoted.

    • Attribute names are subject to the same rules as element names: no initial digits and no space characters anywhere.

    • Attribute names cannot be repeated on an element (e.g., <speech speaker="curly" speaker="larry">, when they speak in unison, is not well-formed; use <speech speaker="curly larry"> instead).

    • Certain characters cannot appear as literal textual content or attribute values, and must be replaced by entities. < and & must always be replaced by entities (&lt; and &amp;, respectively). > (&gt;), " (&quot;), and ' (&apos;) must be replaced only in contexts where they could be confused with markup.
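    The rules above can be checked mechanically. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree parser, which rejects input that is not well-formed. The function name is_well_formed and the sample strings are illustrative assumptions, not part of the test.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: one violation per rule, plus one corrected example.
samples = {
    "digit-initial name": "<3musketeers></3musketeers>",
    # The parser reads this as an element "phone" with a broken attribute.
    "space in name": "<phone book></phone book>",
    "two root elements": "<a/><b/>",
    "improper nesting": "<p><x><y></x></y></p>",
    "missing end-tag": "<p><x></p>",
    "unquoted attribute": "<speech speaker=curly/>",
    "repeated attribute": '<speech speaker="curly" speaker="larry"/>',
    "raw ampersand": "<p>salt & pepper</p>",
    "corrected example": '<speech speaker="curly larry">salt &amp; pepper</speech>',
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

    Only the corrected example should be accepted. The parser reports the first problem it meets and then stops, so a document with several violations has to be repaired one error at a time.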

  3. Describe how descriptive markup differs from presentational markup and give an example of each. (2 points)

    Answer:

    • Descriptive markup describes what a component of the text is, that is, what it means. Presentational markup describes what text looks like. <italic> is a presentational element, which describes appearance. <title> (which might eventually be rendered in italics) is descriptive.

    • Optional: One reason to use descriptive markup is that it can easily be transformed into presentational markup, but the reverse is often not the case. Descriptive markup also facilitates multipurposing: text is marked up once but given different presentations as it is used for different purposes.

  4. XML elements may have 1) element content, 2) text content, or 3) mixed content, or they may be 4) empty. Give an example of each. (4 points)

    Answer:

    • Element content: An element could contain only elements. For example, a <toDoList> element might contain nothing but <item> elements.

    • Text content: An element may contain only plain text. For example, an <item> element might contain only the name of a task in plain language.

    • Mixed content: An element could contain a mixture of plain text and elements, as is the case with the <p> element below:

      <p>In the morning early, while it was still grey dawn, the <character 
      who="Giant">Giant</character> strode off to the wood.</p>

    • Empty element: An element may have no content. These are typically used to mark moments in a document that have no associated text, or where textual content is represented by attribute values. For example, a page boundary might be encoded as an empty <pb/> element.
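    The four content types can also be told apart programmatically. The sketch below uses Python's standard xml.etree.ElementTree. The function name content_type, the sample document, and the n attribute on <pb/> are illustrative assumptions. It also assumes that whitespace-only text between child elements is formatting rather than content, which XML alone cannot decide without a schema.

```python
import xml.etree.ElementTree as ET


def content_type(element: ET.Element) -> str:
    """Classify an element as having empty, text, element, or mixed content."""
    has_children = len(element) > 0
    # Text may sit before the first child (element.text) or after any child (child.tail).
    pieces = [element.text] + [child.tail for child in element]
    # Assumption: whitespace-only text does not count as content.
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "text"
    return "empty"


# Hypothetical sample document that combines the examples from the answer.
doc = ET.fromstring(
    "<doc>"
    "<toDoList><item>Buy milk</item><item>Walk dog</item></toDoList>"
    '<p>the <character who="Giant">Giant</character> strode off.</p>'
    '<pb n="2"/>'
    "</doc>"
)

for element in doc.iter():
    print(element.tag, content_type(element))
```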

Extra-credit questions

  1. What is the difference between a well-formed and well-balanced XML document? (1 point)

    Answer: A well-balanced document is one that meets every well-formedness requirement except that it need not have a single root element. A well-formed document is always well-balanced, but only certain well-balanced documents (those with a single root element) are well-formed. Think of it this way: every square is a rectangle, but only certain rectangles (those with sides of equal length) are squares.
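    One rough way to see the difference is to wrap a fragment in a temporary root element and parse the result. The sketch below does this with Python's standard xml.etree.ElementTree. The function names and the <wrapper> element are hypothetical, and the check is only an approximation: for example, it would wrongly reject input that begins with an XML declaration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def is_well_balanced(fragment: str) -> bool:
    """Approximate check: the fragment parses once it is given a single root."""
    # "wrapper" is a hypothetical element name used only for this test.
    return is_well_formed(f"<wrapper>{fragment}</wrapper>")


# Hypothetical fragments: no single root, a single root, and bad nesting.
for fragment in ["<stanza/><stanza/>", "<poem><stanza/></poem>", "<a><b></a></b>"]:
    print(fragment, is_well_formed(fragment), is_well_balanced(fragment))
```

    The first fragment is well-balanced but not well-formed, the second is both, and the third is neither.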

  2. Why is the formal model of XML described as a tree? Include a small example to illustrate your explanation. (2 points)

    Answer: The XML formal model is based on a tree, an ordered hierarchy, because trees branch from a single trunk to a few large boughs to successively smaller branches to twigs to leaves, and XML documents also branch out hierarchically. The analogy is imperfect in several ways (e.g., tree roots, and not just branches, also spread out, and the single trunk is below the branches, while XML traditionally thinks of the root element as the top of the tree), but it is nonetheless standard terminology in computer science.

    A tree, in the computer-science sense, consists of components called nodes, arranged beneath a single root that contains all of the others. For example, the root element of a poem might be a <poem> element. A <poem> element contains <stanza> elements, which each contain <line> elements. <stanza> elements can be described as the parent nodes of <line> elements, and <line> elements as child nodes of <stanza> elements. The children of <line> elements, in turn, are text nodes.

    Below is a representation of a poem marked up in this way, where the poem consists of three two-line stanzas. Indentation shows nesting: element nodes are written as tags, and each text node is shown as the placeholder text (line), standing in for an actual line of verse:

      <poem>
        <stanza>
          <line>
            (line)
          <line>
            (line)
        <stanza>
          <line>
            (line)
          <line>
            (line)
        <stanza>
          <line>
            (line)
          <line>
            (line)
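    The parent and child relationships in the poem described above can be listed by walking the tree. This is a minimal sketch with Python's standard xml.etree.ElementTree. The function name describe and the way the sample poem string is built are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

# Hypothetical poem: three stanzas of two lines each, with placeholder text.
stanza = "<stanza><line>(line)</line><line>(line)</line></stanza>"
poem = ET.fromstring("<poem>" + stanza * 3 + "</poem>")


def describe(
    element: ET.Element, parent: Optional[ET.Element] = None
) -> Iterator[Tuple[str, Optional[str], int]]:
    """Yield (tag, parent tag, number of child elements) for every element node."""
    parent_tag = parent.tag if parent is not None else None
    yield element.tag, parent_tag, len(element)
    for child in element:
        # Recurse so that each parent is reported before its children.
        yield from describe(child, element)


for tag, parent_tag, child_count in describe(poem):
    print(f"<{tag}> parent={parent_tag} child elements={child_count}")

# Each <line> has no child elements; its only child is a text node.
print([line.text for line in poem.iter("line")])
```

    Only the root <poem> reports no parent, and every other node has exactly one, which is what makes the structure a tree.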