Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-09-06T02:34:26+0000


You aren’t a beginning XML developer anymore

Below are some mistakes that often arise when people are first learning XML. They aren’t errors in the strict sense, because they don’t raise well-formedness issues, but they are nonetheless mistakes because they impose limitations that more robust, idiomatic markup would avoid. Most beginners make mistakes like these because they lack the experience to anticipate adverse consequences, but you aren’t beginners anymore, and being alert to the issues below, and attentive to avoiding them, will pay off in your own projects.


Whitespace isn’t markup

A common beginner mistake is to mark up a poem (or song) along the following lines:



  The rime of the ancient mariner
  It is an ancient Mariner,
  And he stoppeth one of three.
  'By thy long grey beard and glittering eye,
  Now wherefore stopp'st thou me?
  
  The Bridegroom's doors are opened wide,
  And I am next of kin;
  The guests are met, the feast is set:
  May'st hear the merry din.'


We reproduce in this example only the first two quatrains of the first part of a much longer poem.

This markup is well formed and the stanzas are tagged, but the lines are not tagged. To a human it looks as if each stanza contains four lines, but because the lines are not tagged, to a computer each stanza contains just undifferentiated text. To see how this document looks to a computer, you can pretty-print it in <oXygen/>, either with Ctrl+Shift+p (Windows) or Cmd+Shift+p (Mac) or by clicking on the pretty-print icon above the main window (it looks like five horizontal lines with the middle three indented). When we do that, the document looks like:



  The rime of the ancient mariner
  It is an ancient Mariner, And he stoppeth one of three. 'By thy long grey beard and
    glittering eye, Now wherefore stopp'st thou me?

  The Bridegroom's doors are opened wide, And I am next of kin; The guests are met, the
    feast is set: May'st hear the merry din.'


This tells you that the lines are not easily or naturally accessible to XML processing, and that’s a mistake. You can fix it by tagging the lines explicitly:



  The rime of the ancient mariner
  
    It is an ancient Mariner,
    And he stoppeth one of three.
    'By thy long grey beard and glittering eye,
    Now wherefore stopp'st thou me?
  
  
    The Bridegroom's doors are opened wide,
    And I am next of kin;
    The guests are met, the feast is set:
    May'st hear the merry din.'
  


Lines matter in poetry (and in songs) because they are major structural units. Humans know this because when we render the text without the correct lineation it looks wrong. Prose texts are broken into lines, as well, but that’s because the page or screen has limited width, so the text wraps, arbitrarily, whenever it bumps up against the right edge. Since the lineation of a prose text most often is not informational, we normally don’t tag lines in prose. (Exception: medieval manuscript studies often use the lineation, even of prose documents, as a reference point, and in situations like that we do tag lines because we need to be able to find them when we later analyze and render our documents.)
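The difference between the untagged and tagged versions of the poem can also be seen from the processing side. The following is a minimal sketch in Python, using the standard-library xml.etree.ElementTree module. The element names (poem, title, stanza, line) are assumptions chosen for illustration, and only the first quatrain is included.

```python
import xml.etree.ElementTree as ET

# Element names (poem, title, stanza, line) are hypothetical, for illustration.
UNTAGGED = """<poem>
  <title>The rime of the ancient mariner</title>
  <stanza>It is an ancient Mariner,
  And he stoppeth one of three.
  'By thy long grey beard and glittering eye,
  Now wherefore stopp'st thou me?</stanza>
</poem>"""

TAGGED = """<poem>
  <title>The rime of the ancient mariner</title>
  <stanza>
    <line>It is an ancient Mariner,</line>
    <line>And he stoppeth one of three.</line>
    <line>'By thy long grey beard and glittering eye,</line>
    <line>Now wherefore stopp'st thou me?</line>
  </stanza>
</poem>"""


def child_counts(xml_text):
    """Return the number of child elements inside each stanza."""
    root = ET.fromstring(xml_text)
    return [len(list(stanza)) for stanza in root.iter("stanza")]


def first_stanza_lines(xml_text):
    """Return the text of each tagged line in the first stanza."""
    root = ET.fromstring(xml_text)
    return [line.text for line in root.find("stanza").findall("line")]


untagged_counts = child_counts(UNTAGGED)  # the stanza is one run of text
tagged_counts = child_counts(TAGGED)      # the stanza has four line children
tagged_lines = first_stanza_lines(TAGGED)
print(untagged_counts, tagged_counts)
print(tagged_lines[1])
```

In the untagged version the only way to recover the lines would be to split the text on newline characters, which treats whitespace as if it were markup.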

We also need to tag paragraphs in prose texts. Consider the following letter by Oscar Wilde:



  The following letter was written shortly after Wilde's release from prison:  
  Rouen, August 1897
  My own Darling Boy,
  I got your telegram half an hour ago, and just send a line to say that I feel 
    that my only hope of again doing beautiful work in art is being with you. It was 
    not so in the old days, but now it is different, and you can really recreate in 
    me that energy and sense of joyous power on which art depends.
    
  Everyone is furious with me for going back to you, but they don't understand us. I 
  feel that it is only with you that I can do anything at all. Do remake my ruined 
  life for me, and then our friendship and love will have a different meaning to 
  the world.
  
  I wish that when we met at Rouen we had not parted at all. There are such wide 
  abysses now of space and land between us. But we love each other.
  Goodnight, dear. Ever yours,
  Oscar

This looks to a human as if it has three paragraphs, but if we pretty-print it, we see:



  The following letter was written shortly after Wilde's release from prison:
  Rouen, August 1897
  My own Darling Boy,
  I got your telegram half an hour ago, and just send a line to say that I feel that my only
    hope of again doing beautiful work in art is being with you. It was not so in the old days, but
    now it is different, and you can really recreate in me that energy and sense of joyous power on
    which art depends. Everyone is furious with me for going back to you, but they don't understand
    us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for
    me, and then our friendship and love will have a different meaning to the world. I wish that
    when we met at Rouen we had not parted at all. There are such wide abysses now of space and land
    between us. But we love each other.
  Goodnight, dear. Ever yours,
  Oscar

As with the poem, above, the paragraphs here are not easily accessible to a computer because they aren’t tagged. Since we will need them to render a reading view in a culturally appropriate way, we should tag them:



  The following letter was written shortly after Wilde's release from prison:
  Rouen, August 1897
  My own Darling Boy,
  
    

I got your telegram half an hour ago, and just send a line to say that I feel that my only hope of again doing beautiful work in art is being with you. It was not so in the old days, but now it is different, and you can really recreate in me that energy and sense of joyous power on which art depends.

Everyone is furious with me for going back to you, but they don't understand us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for me, and then our friendship and love will have a different meaning to the world.

I wish that when we met at Rouen we had not parted at all. There are such wide abysses now of space and land between us. But we love each other.

Goodnight, dear. Ever yours, Oscar

We don’t tag the lines because lines in a prose text don’t fulfill the same type of structural and informational role as lines in poetry or song.
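To illustrate why paragraphs are informational in prose while lines are not, here is a small Python sketch that renders a reading view at two different widths. The element names (letter, body, p) are assumptions, and the letter is abbreviated to one sentence from each paragraph.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical element names; one sentence stands in for each paragraph.
LETTER = """<letter>
  <body>
    <p>It was not so in the old days, but now it is different, and you can
      really recreate in me that energy and sense of joyous power on which
      art depends.</p>
    <p>Everyone is furious with me for going back to you, but they don't
      understand us.</p>
    <p>But we love each other.</p>
  </body>
</letter>"""


def render(xml_text, width):
    """Wrap each tagged paragraph to the given width, with blank lines between."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.findall("./body/p"):
        # Collapse the source whitespace; line breaks in the XML are not data.
        text = " ".join("".join(p.itertext()).split())
        paragraphs.append(textwrap.fill(text, width=width))
    return "\n\n".join(paragraphs)


narrow = render(LETTER, 40)
wide = render(LETTER, 72)
print(narrow)
```

The number of output lines changes with the width, but the number of paragraphs stays at three in both renderings because it comes from the markup.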

Be alert to implicit hierarchy

If you’re marking up a one-paragraph letter, you might think of the hierarchy along the following lines:



  January 1893, Babbacombe Cliff
  My Own Boy,
  Your sonnet is quite lovely, and it is a marvel that those red-roseleaf lips of yours
    should be made no less for the madness of music and song than for the madness of kissing.
    Your slim gilt soul walks between passion and poetry. I know Hyacinthus, whom Apollo loved
    so madly, was you in Greek days. Why are you alone in London, and when do you go to
    Salisbury? Do go there to cool your hands in the grey twilight of Gothic things, and come
    here whenever you like. It is a lovely place and lacks only you; but go to Salisbury first.
  Always, with undying love,
  Yours, Oscar

In reality, the content of the <body> element isn’t plain text; it’s a single paragraph. If the letter had happened to contain more than one paragraph we would probably have noticed that paragraphs constitute a hierarchical level below <body> and above plain text, and we would have tagged them for reasons described above. In a one-paragraph letter, though, the paragraphing is easy to overlook because a single paragraph doesn’t provide the visual cues (blank line between paragraphs, perhaps also indentation) that would stand out in a multi-paragraph letter.

The reason we want to tag the paragraph even if there’s only one is that more often than not we work with corpora of similar documents, and not with just one document, and a corpus of letters is likely to include some that contain more than one paragraph. The more consistent our markup is within the corpus, the easier it becomes to analyze and process our XML.
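A brief Python sketch of what consistency buys during processing. The element names (letter, body, p) are assumptions, and the letters are abbreviated to single sentences. One query works across the whole corpus only when every letter tags its paragraphs, including the letters that have just one.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; each letter is abbreviated for the sketch.
ONE_PARAGRAPH = """<letter>
  <body>
    <p>Your slim gilt soul walks between passion and poetry.</p>
  </body>
</letter>"""

THREE_PARAGRAPHS = """<letter>
  <body>
    <p>Everyone is furious with me for going back to you, but they don't understand us.</p>
    <p>There are such wide abysses now of space and land between us.</p>
    <p>But we love each other.</p>
  </body>
</letter>"""

# Same one-paragraph letter, but with the paragraph level left implicit.
UNTAGGED_BODY = """<letter>
  <body>Your slim gilt soul walks between passion and poetry.</body>
</letter>"""


def paragraph_counts(corpus):
    """Count tagged paragraphs in the body of every letter in the corpus."""
    return [len(ET.fromstring(doc).findall("./body/p")) for doc in corpus]


consistent = paragraph_counts([ONE_PARAGRAPH, THREE_PARAGRAPHS])
inconsistent = paragraph_counts([UNTAGGED_BODY, THREE_PARAGRAPHS])
print(consistent, inconsistent)
```

With the untagged body, the same query reports zero paragraphs for a letter that a human reads as having one, so every later processing step would need a special case.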

Don’t confuse order with hierarchy

When we read a page in English in print or on the web, we normally read from left to right and top to bottom. This might lead us to think of markup like:



  The rime of the ancient mariner
  Part 1
  
    It is an ancient Mariner,
    And he stoppeth one of three.
    'By thy long grey beard and glittering eye,
    Now wherefore stopp'st thou me?
  
  
    The Bridegroom's doors are opened wide,
    And I am next of kin;
    The guests are met, the feast is set:
    May'st hear the merry din.'
  


The problem with this is that what we’ve tagged as <part> isn’t a part; it’s just the label for a part. A part in this poem contains a label followed by many quatrains (we include only the first two here). Not only is it misleading to tag the label as if it were a part, but the markup above does not provide any formal, machine-actionable representation of the fact that the label goes with the following quatrains, and not the preceding ones. That isn’t an issue with the first part because there are no preceding quatrains, but the situation is different with the other parts. A human knows, of course, that labels in this type of context precede the thing they’re labeling, but a computer doesn’t know that, so we need to make it explicit with markup. You can do that by using the <part> element to wrap (tag) the entire actual part, that is, the label plus the poetic content, along the lines of:



  The rime of the ancient mariner
  
    Part 1
    
      It is an ancient Mariner,
      And he stoppeth one of three.
      'By thy long grey beard and glittering eye,
      Now wherefore stopp'st thou me?
    
    
      The Bridegroom's doors are opened wide,
      And I am next of kin;
      The guests are met, the feast is set:
      May'st hear the merry din.'
    
  


Note that the <part> element is the actual part, so it contains both the label and the quatrains.
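The two approaches can be compared from the processing side with a short Python sketch. The element names (poem, part, label, stanza, line) are assumptions, each stanza is abbreviated to its first line, and a second part is added so that the question of which stanzas belong to which label actually arises.

```python
import xml.etree.ElementTree as ET

# Flat: <part> holds only the label, and stanzas are its siblings.
FLAT = """<poem>
  <part>Part 1</part>
  <stanza><line>It is an ancient Mariner,</line></stanza>
  <stanza><line>The Bridegroom's doors are opened wide,</line></stanza>
  <part>Part 2</part>
  <stanza><line>The Sun now rose upon the right:</line></stanza>
</poem>"""

# Nested: <part> wraps a label (hypothetical element name) and its stanzas.
NESTED = """<poem>
  <part>
    <label>Part 1</label>
    <stanza><line>It is an ancient Mariner,</line></stanza>
    <stanza><line>The Bridegroom's doors are opened wide,</line></stanza>
  </part>
  <part>
    <label>Part 2</label>
    <stanza><line>The Sun now rose upon the right:</line></stanza>
  </part>
</poem>"""


def stanzas_per_part_nested(xml_text):
    """The hierarchy says directly which stanzas belong to which part."""
    root = ET.fromstring(xml_text)
    return {p.findtext("label"): len(p.findall("stanza")) for p in root.findall("part")}


def stanzas_per_part_flat(xml_text):
    """The code must supply the rule that a label governs what follows it."""
    root = ET.fromstring(xml_text)
    counts, current = {}, None
    for child in root:
        if child.tag == "part":
            current = child.text
            counts[current] = 0
        elif child.tag == "stanza" and current is not None:
            counts[current] += 1
    return counts


nested_result = stanzas_per_part_nested(NESTED)
flat_result = stanzas_per_part_flat(FLAT)
print(nested_result, flat_result)
```

Both functions arrive at the same answer, but the flat version only works because the human convention (labels precede what they label) has been written into the code instead of being recorded in the markup.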

Don’t put numbers into your element names

If you care about the order of components in your document, you may be tempted to use markup like:



  The rime of the ancient mariner
  
    Part 1
    
      It is an ancient Mariner,
      And he stoppeth one of three.
      'By thy long grey beard and glittering eye,
      Now wherefore stopp'st thou me?
    
    
      The Bridegroom's doors are opened wide,
      And I am next of kin;
      The guests are met, the feast is set:
      May'st hear the merry din.'
    
  


This is well formed but it is nonetheless a mistake. Here’s why:

The reason you don’t want to use element names like <stanza1> and <stanza2> is that when an XML processor compares two element names, the only thing it can determine easily is whether the names are identical or not. A human can see that <stanza1> and <stanza2> are both stanzas, but to a computer they are just two different names. An element+attribute strategy, such as <stanza n="1"> and <stanza n="2">, records that both are stanzas, since the element name is identical, and it also records the stanza number, so that number is easily available to an XML processor.
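A minimal Python sketch of the difference, assuming a part element as the parent and abbreviating each stanza to its first line. With numbered names, a query for stanzas finds nothing, and the only workaround is pattern matching on the names themselves.

```python
import re
import xml.etree.ElementTree as ET

NUMBERED_NAMES = """<part>
  <stanza1>It is an ancient Mariner,</stanza1>
  <stanza2>The Bridegroom's doors are opened wide,</stanza2>
</part>"""

NUMBERED_ATTRIBUTES = """<part>
  <stanza n="1">It is an ancient Mariner,</stanza>
  <stanza n="2">The Bridegroom's doors are opened wide,</stanza>
</part>"""

names_root = ET.fromstring(NUMBERED_NAMES)
attrs_root = ET.fromstring(NUMBERED_ATTRIBUTES)

# No element is literally named "stanza", so this query finds nothing.
found_by_name = names_root.findall("stanza")

# Workaround: treat the element name as a string and match a pattern.
found_by_pattern = [c for c in names_root if re.fullmatch(r"stanza\d+", c.tag)]

# With a shared name, one query finds every stanza and its number.
found_with_attrs = attrs_root.findall("stanza")
numbers = [s.get("n") for s in found_with_attrs]

print(len(found_by_name), len(found_by_pattern), numbers)
```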

As we’ll see immediately below, most of the time you don’t want to number stanzas or lines at all, even in attributes.

XML elements have context

When tagging a document that contains sections, subsections, and sub-subsections, you might be tempted to use markup like the following:



  
Title of first section goes here

First paragraph of first section.

Title of first subsection of first section goes here

First paragraph of first subsection of first section.

Title of first sub-subsection of first sub-section of first section goes here

First paragraph of first sub-subsection of first subsection of first section.


In the example above we’ve used different element names for sections (and their titles) at different levels of the hierarchy. The main reason this is undesirable is that if you later decide to split up your document by making each section its own document, you would have to retag the subsections as sections and the sub-subsections as subsections (and likewise with their titles). What you should do instead is:



  
Title of first section goes here

First paragraph of first section.

Title of first subsection of first section goes here

First paragraph of first subsection of first section.

Title of first sub-subsection of first sub-section of first section goes here

First paragraph of first sub-subsection of first subsection of first section.


The reason you can use the same element names at different hierarchical levels is that XML elements have both a name and a context, and XML processing is aware of the context. This means that you can distinguish sections from subsections from sub-subsections (and likewise with their titles) during processing even if they are tagged identically, because they occur in different contexts, in this case at different levels of the hierarchy.
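As a sketch of how processing can recover the level from context alone, the following Python function walks nested sections that all share one name and reports the depth of each title. The element names (section, title, p) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Every level uses the same hypothetical names: section, title, p.
DOCUMENT = """<section>
  <title>Title of first section goes here</title>
  <p>First paragraph of first section.</p>
  <section>
    <title>Title of first subsection of first section goes here</title>
    <p>First paragraph of first subsection of first section.</p>
    <section>
      <title>Title of first sub-subsection of first sub-section of first section goes here</title>
      <p>First paragraph of first sub-subsection of first subsection of first section.</p>
    </section>
  </section>
</section>"""


def outline(section, depth=1):
    """Return (depth, title) pairs; depth comes from nesting, not from names."""
    entries = [(depth, section.findtext("title"))]
    for subsection in section.findall("section"):
        entries.extend(outline(subsection, depth + 1))
    return entries


entries = outline(ET.fromstring(DOCUMENT))
for depth, title in entries:
    print("  " * (depth - 1) + title)
```

If the inner section were later split off into its own document, the same function would report it at depth 1 without any retagging.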

That XML processing knows about context means that you normally shouldn’t number sequential items at all. You already know (see above) that you shouldn’t use element names like <act-1>, <act-2>, etc. for the acts of a play, but you also shouldn’t use <act n="1"> or <act>1</act>. What you want is just the element name, with no numbering in either an attribute or the textual content of the element, e.g., <act>.

The reason you don’t have to number sequential items in either markup or textual content is that the ordinal position of an element (first, second, etc.) is part of the context, which means that XML processing can distinguish elements with the same name in different positions without your having to specify the position explicitly, and you can output the numbers during processing if you need them. Letting the computer do the counting is valuable because humans can easily skip or repeat a number by accident, and also because if you later insert, remove, or rearrange items you would have to adjust any explicit numbering accordingly. If you let the computer number the items as it finds them, the numbering will always be correct.
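A short Python sketch of letting the computer do the counting. The play and act element names follow the examples above, and the act contents are placeholders. The numbers are computed from position at output time, so inserting an act requires no renumbering.

```python
import xml.etree.ElementTree as ET

# Placeholder content; no numbers appear anywhere in the markup.
PLAY = """<play>
  <act>opening</act>
  <act>ending</act>
</play>"""


def numbered_acts(play):
    """Pair each act with its ordinal position, counted during processing."""
    return [(number, act.text) for number, act in enumerate(play.findall("act"), start=1)]


play = ET.fromstring(PLAY)
before = numbered_acts(play)

# Insert a new act between the two existing ones.
new_act = ET.Element("act")
new_act.text = "middle"
play.insert(1, new_act)
after = numbered_acts(play)

print(before)
print(after)
```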

There are exceptions to the general policy of not numbering explicitly. If your numbering doesn’t begin at 1 (perhaps you’re encoding only acts 4, 5, and 6 from a play and you want to be able to render the original numbers) or if it isn’t continuous (perhaps you’re encoding only acts 1, 3, and 7 and you want to be able to render the original numbers), XML can’t know what the original numbers were. In situations like that you need to make additional information available in your markup.
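One possible way to handle both situations in a single processing step, sketched in Python: use an explicit n attribute when the markup supplies one, and fall back to the computed position otherwise. The attribute name n follows the <act n="1"> example above; the fallback policy and the placeholder act contents are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# An excerpt: only acts 1, 3, and 7 are encoded, so the numbers are recorded.
EXCERPT = """<play>
  <act n="1">(content of act)</act>
  <act n="3">(content of act)</act>
  <act n="7">(content of act)</act>
</play>"""

# A complete play: numbering starts at 1 and is continuous, so none is recorded.
COMPLETE = """<play>
  <act>(content of act)</act>
  <act>(content of act)</act>
</play>"""


def act_labels(xml_text):
    """Label each act with its explicit number if present, else its position."""
    root = ET.fromstring(xml_text)
    labels = []
    for position, act in enumerate(root.findall("act"), start=1):
        number = act.get("n", str(position))
        labels.append(f"Act {number}")
    return labels


excerpt_labels = act_labels(EXCERPT)
complete_labels = act_labels(COMPLETE)
print(excerpt_labels, complete_labels)
```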