Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-01-20T14:53:18+0000


You aren’t a beginning XML developer anymore

Below are some common pitfalls that arise when people are first learning XML. The issues they entail are common and general, so being alert to them, and careful to avoid them, will pay off in your own projects.

Tag lines and paragraphs

A common beginner mistake is to mark up a poem (or song) along the following lines:



<poem>
  <title>The rime of the ancient mariner</title>
  <stanza>It is an ancient Mariner,
  And he stoppeth one of three.
  'By thy long grey beard and glittering eye,
  Now wherefore stopp'st thou me?</stanza>

  <stanza>The Bridegroom's doors are opened wide,
  And I am next of kin;
  The guests are met, the feast is set:
  May'st hear the merry din.'</stanza>
</poem>

This poem has several parts, labeled as such in the text. Here and below we reproduce only the first two quatrains of the first part.

This markup is well formed and the stanzas are tagged, but the lines are not tagged. To a human it looks as if each stanza contains four lines, but because the lines are not tagged, to a computer it looks as if the stanza contains just undifferentiated text. To see how this document looks to a computer you can pretty-print it in <oXygen/> with Ctrl+Shift+p (Windows) or Cmd+Shift+p (Mac), or by clicking the pretty-print icon above the main window (it looks like five horizontal lines with the middle three indented). When we do that, the document looks like:



<poem>
  <title>The rime of the ancient mariner</title>
  <stanza>It is an ancient Mariner, And he stoppeth one of three. 'By thy long grey beard and
    glittering eye, Now wherefore stopp'st thou me?</stanza>

  <stanza>The Bridegroom's doors are opened wide, And I am next of kin; The guests are met, the
    feast is set: May'st hear the merry din.'</stanza>
</poem>

This tells you that the lines are not easily or naturally accessible to XML processing, and that’s a mistake. You can fix it by tagging the lines explicitly:



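<poem>
  <title>The rime of the ancient mariner</title>
  <stanza>
    <line>It is an ancient Mariner,</line>
    <line>And he stoppeth one of three.</line>
    <line>'By thy long grey beard and glittering eye,</line>
    <line>Now wherefore stopp'st thou me?</line>
  </stanza>
  <stanza>
    <line>The Bridegroom's doors are opened wide,</line>
    <line>And I am next of kin;</line>
    <line>The guests are met, the feast is set:</line>
    <line>May'st hear the merry din.'</line>
  </stanza>
</poem>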
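The difference between the two versions can be checked with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the element names (`poem`, `stanza`, `line`) are assumed for illustration, and the poem is abbreviated to its first stanza.

```python
import xml.etree.ElementTree as ET

# Stanza tagged, lines not tagged (element names are illustrative assumptions)
untagged = ET.fromstring("""<poem>
  <stanza>It is an ancient Mariner,
  And he stoppeth one of three.
  'By thy long grey beard and glittering eye,
  Now wherefore stopp'st thou me?</stanza>
</poem>""")

# The same stanza with every line tagged explicitly
tagged = ET.fromstring("""<poem>
  <stanza>
    <line>It is an ancient Mariner,</line>
    <line>And he stoppeth one of three.</line>
    <line>'By thy long grey beard and glittering eye,</line>
    <line>Now wherefore stopp'st thou me?</line>
  </stanza>
</poem>""")

# Without line markup the stanza has no child elements, only one run of text
untagged_children = len(untagged.find("stanza"))

# With line markup each line is a separate, addressable element
lines = [line.text for line in tagged.find("stanza").findall("line")]

print(untagged_children)  # number of child elements in the untagged stanza
print(len(lines))         # number of line elements in the tagged stanza
print(lines[1])           # the second line, retrieved by position
```

In the untagged version the only way to get at the second line would be to split the text on newline characters, which disappear as soon as the document is pretty-printed.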

Lines matter in poetry (and in songs) because they are major structural units. Humans know this because when we render the text without the correct lineation, it looks wrong. Prose texts are broken into lines, as well, but that’s because the page or screen has limited width, so the text wraps, arbitrarily, whenever it bumps up against the right edge. Since the lineation of a prose text most often is not informational, we normally don’t tag lines in prose. (Exception: medieval manuscript studies often use the lineation, even of prose documents, as a reference point, and in situations like that we do tag lines because we need to be able to find them when we later analyze and render our documents.)

Consider the following letter by Oscar Wilde:



<letter>
  <intro>The following letter was written shortly after Wilde's release from prison:</intro>
  <dateline>Rouen, August 1897</dateline>
  <salutation>My own Darling Boy,</salutation>
  <body>I got your telegram half an hour ago, and just send a line to say that I feel
    that my only hope of again doing beautiful work in art is being with you. It was
    not so in the old days, but now it is different, and you can really recreate in
    me that energy and sense of joyous power on which art depends.

  Everyone is furious with me for going back to you, but they don't understand us. I
  feel that it is only with you that I can do anything at all. Do remake my ruined
  life for me, and then our friendship and love will have a different meaning to
  the world.

  I wish that when we met at Rouen we had not parted at all. There are such wide
  abysses now of space and land between us. But we love each other.</body>
  <closing>Goodnight, dear. Ever yours,</closing>
  <signature>Oscar</signature>
</letter>

This looks to a human as if it has three paragraphs, but if we pretty-print it, we see:



<letter>
  <intro>The following letter was written shortly after Wilde's release from prison:</intro>
  <dateline>Rouen, August 1897</dateline>
  <salutation>My own Darling Boy,</salutation>
  <body>I got your telegram half an hour ago, and just send a line to say that I feel that my only
    hope of again doing beautiful work in art is being with you. It was not so in the old days, but
    now it is different, and you can really recreate in me that energy and sense of joyous power on
    which art depends. Everyone is furious with me for going back to you, but they don't understand
    us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for
    me, and then our friendship and love will have a different meaning to the world. I wish that
    when we met at Rouen we had not parted at all. There are such wide abysses now of space and land
    between us. But we love each other.</body>
  <closing>Goodnight, dear. Ever yours,</closing>
  <signature>Oscar</signature>
</letter>

As with the poem, above, the paragraphs here are not easily accessible to a computer because they aren’t tagged. Since we will need them to render a reading view in a culturally appropriate way, we should tag them:



<letter>
  <intro>The following letter was written shortly after Wilde's release from prison:</intro>
  <dateline>Rouen, August 1897</dateline>
  <salutation>My own Darling Boy,</salutation>
  <body>
    <p>I got your telegram half an hour ago, and just send a line to say that I feel that my only
      hope of again doing beautiful work in art is being with you. It was not so in the old days,
      but now it is different, and you can really recreate in me that energy and sense of joyous
      power on which art depends.</p>
    <p>Everyone is furious with me for going back to you, but they don't understand us. I feel
      that it is only with you that I can do anything at all. Do remake my ruined life for me, and
      then our friendship and love will have a different meaning to the world.</p>
    <p>I wish that when we met at Rouen we had not parted at all. There are such wide abysses now
      of space and land between us. But we love each other.</p>
  </body>
  <closing>Goodnight, dear. Ever yours,</closing>
  <signature>Oscar</signature>
</letter>

We don’t tag the lines because lines in a prose text don’t fulfill the same type of structural and informational role as lines in poetry or song.
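As a rough illustration of why tagged paragraphs help with rendering, the sketch below turns each paragraph into an HTML paragraph for a reading view. It uses Python's standard library; the element names (`letter`, `body`, `p`) are assumptions made for the example, and the letter is abbreviated to parts of its last two paragraphs.

```python
import xml.etree.ElementTree as ET
from html import escape

# Element names (letter, body, p) are illustrative assumptions
letter = ET.fromstring("""<letter>
  <body>
    <p>Everyone is furious with me for going back to you, but they don't
      understand us. I feel that it is only with you that I can do anything
      at all.</p>
    <p>I wish that when we met at Rouen we had not parted at all. There are
      such wide abysses now of space and land between us. But we love each
      other.</p>
  </body>
</letter>""")

html_paragraphs = []
for p in letter.iter("p"):
    # Line breaks inside a prose paragraph are arbitrary wrapping,
    # so collapse all whitespace runs to single spaces
    text = " ".join(p.text.split())
    html_paragraphs.append(f"<p>{escape(text, quote=False)}</p>")

print(len(html_paragraphs))  # one HTML paragraph per tagged paragraph
print(html_paragraphs[1])
```

The paragraph boundaries survive because they are tagged, while the line breaks inside each paragraph can be discarded without losing information.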

Don’t confuse order with hierarchy

When we read a page in English in print or on the web, we normally read from left to right and top to bottom. This might lead us to think of markup like:



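<poem>
  <title>The rime of the ancient mariner</title>
  <part>Part 1</part>
  <stanza>
    <line>It is an ancient Mariner,</line>
    <line>And he stoppeth one of three.</line>
    <line>'By thy long grey beard and glittering eye,</line>
    <line>Now wherefore stopp'st thou me?</line>
  </stanza>
  <stanza>
    <line>The Bridegroom's doors are opened wide,</line>
    <line>And I am next of kin;</line>
    <line>The guests are met, the feast is set:</line>
    <line>May'st hear the merry din.'</line>
  </stanza>
</poem>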

The problem with this is that what we’ve tagged as <part> isn’t a part; it’s just the label for a part. A part in this poem contains a label followed by many quatrains. Not only is it misleading to tag the label as if it were a part, but the markup above does not provide any formal, machine-actionable representation of the fact that the label goes with the following quatrains, and not the preceding ones. That isn’t an issue with the first part because there are no preceding quatrains, but the situation is different with the other parts. A human knows, of course, that labels in this type of context precede the thing they’re labeling, but a computer doesn’t know that, so we need to make it explicit with markup. You can do that by using the <part> element to tag the actual part, that is, the label plus the poetic content, along the lines of:



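<poem>
  <title>The rime of the ancient mariner</title>
  <part>
    <label>Part 1</label>
    <stanza>
      <line>It is an ancient Mariner,</line>
      <line>And he stoppeth one of three.</line>
      <line>'By thy long grey beard and glittering eye,</line>
      <line>Now wherefore stopp'st thou me?</line>
    </stanza>
    <stanza>
      <line>The Bridegroom's doors are opened wide,</line>
      <line>And I am next of kin;</line>
      <line>The guests are met, the feast is set:</line>
      <line>May'st hear the merry din.'</line>
    </stanza>
  </part>
</poem>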

Note that the <part> element is the actual part, so it contains both the label and the quatrains. (The actual part contains more quatrains, and the actual poem contains more parts, but we’ve abbreviated both here.)
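The practical consequence shows up as soon as a program asks which stanzas belong to which part. The sketch below, using Python's standard `xml.etree.ElementTree`, compares the two approaches on a heavily abbreviated poem; the element names (`poem`, `part`, `label`, `stanza`, `line`) are assumptions for the example, and each stanza is reduced to a single line.

```python
import xml.etree.ElementTree as ET

# Flat markup: <part> tags only the label (names are illustrative assumptions)
flat = ET.fromstring("""<poem>
  <part>Part 1</part>
  <stanza><line>It is an ancient Mariner,</line></stanza>
  <stanza><line>The Bridegroom's doors are opened wide,</line></stanza>
  <part>Part 2</part>
  <stanza><line>The Sun now rose upon the right:</line></stanza>
</poem>""")

# Hierarchical markup: <part> contains its label and its stanzas
nested = ET.fromstring("""<poem>
  <part>
    <label>Part 1</label>
    <stanza><line>It is an ancient Mariner,</line></stanza>
    <stanza><line>The Bridegroom's doors are opened wide,</line></stanza>
  </part>
  <part>
    <label>Part 2</label>
    <stanza><line>The Sun now rose upon the right:</line></stanza>
  </part>
</poem>""")

# In the flat version no <part> contains any stanzas
flat_counts = [len(part.findall("stanza")) for part in flat.findall("part")]

# In the nested version each part owns its label and its stanzas
nested_counts = {
    part.findtext("label"): len(part.findall("stanza"))
    for part in nested.findall("part")
}

print(flat_counts)    # [0, 0]
print(nested_counts)  # {'Part 1': 2, 'Part 2': 1}
```

With the flat markup a program would have to walk the siblings in document order and guess that each label governs whatever follows it. With the nested markup the relationship is stated by the hierarchy itself.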

Don’t put numbers into your element names

If you care about the order of components in your document, you may be tempted to use markup like:



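<poem>
  <title>The rime of the ancient mariner</title>
  <part>
    <label>Part 1</label>
    <stanza1>
      <line>It is an ancient Mariner,</line>
      <line>And he stoppeth one of three.</line>
      <line>'By thy long grey beard and glittering eye,</line>
      <line>Now wherefore stopp'st thou me?</line>
    </stanza1>
    <stanza2>
      <line>The Bridegroom's doors are opened wide,</line>
      <line>And I am next of kin;</line>
      <line>The guests are met, the feast is set:</line>
      <line>May'st hear the merry din.'</line>
    </stanza2>
  </part>
</poem>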

This is well formed but it is nonetheless a mistake. Here’s why:

The reason you don’t want to use element names like <stanza1> and <stanza2> is that when an XML processor compares two element names, the only thing it can determine easily is whether the names are identical or not. A human can see that both elements are stanzas, but a computer sees only two different names. The alternative is an element+attribute strategy: give every stanza the same element name and record its number in an attribute. Because the element name is identical, that strategy records that the stanzas are both stanzas, and because the number is in the attribute, it remains easily available to an XML processor.
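The contrast can be sketched with Python's standard `xml.etree.ElementTree`. The attribute name `n` and the element names other than `stanza1` and `stanza2` are assumptions chosen for the example, and each stanza is abbreviated to one line.

```python
import xml.etree.ElementTree as ET

# Numbers embedded in the element names
numbered_names = ET.fromstring("""<poem>
  <stanza1><line>It is an ancient Mariner,</line></stanza1>
  <stanza2><line>The Bridegroom's doors are opened wide,</line></stanza2>
</poem>""")

# One element name, with the number in an attribute (n is an assumed name)
with_attributes = ET.fromstring("""<poem>
  <stanza n="1"><line>It is an ancient Mariner,</line></stanza>
  <stanza n="2"><line>The Bridegroom's doors are opened wide,</line></stanza>
</poem>""")

# Names are compared for identity only, so "stanza" matches
# neither <stanza1> nor <stanza2>
found_by_name = numbered_names.findall("stanza")

# A shared name finds every stanza, and the number is still available
stanza_numbers = [s.get("n") for s in with_attributes.findall("stanza")]

print(found_by_name)   # []
print(stanza_numbers)  # ['1', '2']
```

Selecting all stanzas from the first version would require string tests on the element names, which is exactly the kind of workaround the element+attribute strategy avoids.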

So when do we need to specify the number? An XML processor can count, and by default it starts counting at 1 and counts consecutively. That’s what we want here: the first stanza is #1 and the next is #2, and the same is true for lines, so the numbers need not be written into the markup at all. But suppose we want to represent only selected stanzas, perhaps not including the first and perhaps not proceeding sequentially. In that case we need to specify the numbers because a computer won’t know where to begin, where to skip, etc.
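A small sketch of the excerpt case, again with Python's standard `xml.etree.ElementTree`: the document below contains only the second stanza (abbreviated to two lines), and the attribute name `n` is an assumption for the example.

```python
import xml.etree.ElementTree as ET

# An excerpt that omits the first stanza; n is an assumed attribute name
excerpt = ET.fromstring("""<poem>
  <stanza n="2">
    <line>The Bridegroom's doors are opened wide,</line>
    <line>And I am next of kin;</line>
  </stanza>
</poem>""")

stanzas = excerpt.findall("stanza")

# Counting by position always starts at 1 and proceeds consecutively
positions = [position for position, _ in enumerate(stanzas, start=1)]

# The explicit attribute preserves the stanza's number in the full poem
recorded = [int(stanza.get("n")) for stanza in stanzas]

print(positions)  # [1]
print(recorded)   # [2]
```

Counting alone would label this stanza as the first, which is wrong for the poem as a whole; only the explicit number records where the excerpt sits.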