Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2024-09-03T16:32:34+0000
Below are some mistakes that arise when people are first learning XML. They aren’t errors in the sense that they don’t raise well-formedness issues, but they are nonetheless mistakes because they have limitations that would not be present with more robust, idiomatic markup. Most beginners make mistakes like these because they lack the experience to anticipate adverse consequences, but you aren’t beginners any more, and being alert to the issues below, and attentive to avoiding them, will pay off in your own projects.
A common beginner mistake is to mark up a poem (or song) along the following lines:
The rime of the ancient mariner
It is an ancient Mariner,
And he stoppeth one of three.
'By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din.'
We reproduce in this example only the first two quatrains of the first part of a much longer poem.
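Written out with its tags, markup of this kind might look like the following sketch. The element names `<poem>`, `<title>`, and `<stanza>` are assumptions chosen for illustration; the point is only that each stanza is tagged while the lines inside it are plain text.

```xml
<!-- Sketch only: poem, title, and stanza are assumed element names. -->
<poem>
    <title>The rime of the ancient mariner</title>
    <stanza>
        It is an ancient Mariner,
        And he stoppeth one of three.
        'By thy long grey beard and glittering eye,
        Now wherefore stopp'st thou me?
    </stanza>
    <stanza>
        The Bridegroom's doors are opened wide,
        And I am next of kin;
        The guests are met, the feast is set:
        May'st hear the merry din.'
    </stanza>
</poem>
```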
This markup is well formed and the stanzas are tagged, but the lines are not tagged. To a human it looks as if each stanza contains four lines, but because the lines are not tagged, to a computer it looks as if each stanza contains just undifferentiated text. To see how this document looks to a computer you can pretty-print it in <oXygen/>, which you can do with Ctrl+Shift+p (Windows) or Cmd+Shift+p (Mac), or click on the pretty-print icon above the main window (it looks like five horizontal lines with the middle three indented). When we do that, the document looks like:
The rime of the ancient mariner
It is an ancient Mariner, And he stoppeth one of three. 'By thy long grey beard and
glittering eye, Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide, And I am next of kin; The guests are met, the
feast is set: May'st hear the merry din.'
This tells you that the lines are not easily or naturally accessible to XML processing, and that’s a mistake. You can fix it by tagging the lines explicitly:
The rime of the ancient mariner
It is an ancient Mariner,
And he stoppeth one of three.
'By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din.'
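With the tags written out, the corrected markup might look like the sketch below. The `<line>` element is the one used elsewhere in this document; `<poem>`, `<title>`, and `<stanza>` are assumed names for illustration.

```xml
<!-- Sketch only: poem, title, and stanza are assumed element names. -->
<poem>
    <title>The rime of the ancient mariner</title>
    <stanza>
        <line>It is an ancient Mariner,</line>
        <line>And he stoppeth one of three.</line>
        <line>'By thy long grey beard and glittering eye,</line>
        <line>Now wherefore stopp'st thou me?</line>
    </stanza>
    <stanza>
        <line>The Bridegroom's doors are opened wide,</line>
        <line>And I am next of kin;</line>
        <line>The guests are met, the feast is set:</line>
        <line>May'st hear the merry din.'</line>
    </stanza>
</poem>
```

Because every line is now its own element, pretty-printing can rewrap the whitespace between elements without merging one line of verse into the next.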
Lines matter in poetry (and in songs) because they are major structural units. Humans know this because when we render the text without the correct lineation it looks wrong. Prose texts are broken into lines, as well, but that’s because the page or screen has limited width, so the text wraps, arbitrarily, whenever it bumps up against the right edge. Since the lineation of a prose text most often is not informational, we normally don’t tag lines in prose. (Exception: medieval manuscript studies often use the lineation, even of prose documents, as a reference point, and in situations like that we do tag lines because we need to be able to find them when we later analyze and render our documents.)
We also need to tag paragraphs in prose texts. Consider the following letter by Oscar Wilde:
The following letter was written shortly after Wilde's release from prison:
Rouen, August 1897
My own Darling Boy,
I got your telegram half an hour ago, and just send a line to say that I feel
that my only hope of again doing beautiful work in art is being with you. It was
not so in the old days, but now it is different, and you can really recreate in
me that energy and sense of joyous power on which art depends.
Everyone is furious with me for going back to you, but they don't understand us. I
feel that it is only with you that I can do anything at all. Do remake my ruined
life for me, and then our friendship and love will have a different meaning to
the world.
I wish that when we met at Rouen we had not parted at all. There are such wide
abysses now of space and land between us. But we love each other.
Goodnight, dear. Ever yours,
Oscar
This looks to a human as if it has three paragraphs, but if we pretty-print it, we see:
The following letter was written shortly after Wilde's release from prison:
Rouen, August 1897
My own Darling Boy,
I got your telegram half an hour ago, and just send a line to say that I feel that my only
hope of again doing beautiful work in art is being with you. It was not so in the old days, but
now it is different, and you can really recreate in me that energy and sense of joyous power on
which art depends. Everyone is furious with me for going back to you, but they don't understand
us. I feel that it is only with you that I can do anything at all. Do remake my ruined life for
me, and then our friendship and love will have a different meaning to the world. I wish that
when we met at Rouen we had not parted at all. There are such wide abysses now of space and land
between us. But we love each other.
Goodnight, dear. Ever yours,
Oscar
As with the poem, above, the paragraphs here are not easily accessible to a computer because they aren’t tagged. Since we will need them to render a reading view in a culturally appropriate way, we should tag them:
The following letter was written shortly after Wilde's release from prison:
Rouen, August 1897
My own Darling Boy,
I got your telegram half an hour ago, and just send a line to say that I feel that my only
hope of again doing beautiful work in art is being with you. It was not so in the old days,
but now it is different, and you can really recreate in me that energy and sense of joyous
power on which art depends.
Everyone is furious with me for going back to you, but they don't understand us. I feel that
it is only with you that I can do anything at all. Do remake my ruined life for me, and then
our friendship and love will have a different meaning to the world.
I wish that when we met at Rouen we had not parted at all. There are such wide abysses now of
space and land between us. But we love each other.
Goodnight, dear. Ever yours,
Oscar
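As a sketch of what the paragraph tagging might look like, here is just the body of the letter. The `<body>` element name is the one used later in this document; `<p>` is an assumed (though conventional) name for a paragraph, and the elements around the body (introduction, dateline, salutation, closing, signature) are omitted because their names are not given here.

```xml
<!-- Sketch only: p is an assumed element name for a paragraph. -->
<body>
    <p>I got your telegram half an hour ago, and just send a line to say that I feel that my
        only hope of again doing beautiful work in art is being with you. It was not so in the
        old days, but now it is different, and you can really recreate in me that energy and
        sense of joyous power on which art depends.</p>
    <p>Everyone is furious with me for going back to you, but they don't understand us. I feel
        that it is only with you that I can do anything at all. Do remake my ruined life for me,
        and then our friendship and love will have a different meaning to the world.</p>
    <p>I wish that when we met at Rouen we had not parted at all. There are such wide abysses
        now of space and land between us. But we love each other.</p>
</body>
```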
We don’t tag the lines because lines in a prose text don’t fulfill the same type of structural and informational role as lines in poetry or song.
If you’re marking up a one-paragraph letter, you might think of the hierarchy along the following lines:
January 1893, Babbacombe Cliff
My Own Boy,
Your sonnet is quite lovely, and it is a marvel that those red-roseleaf lips of yours
should be made no less for the madness of music and song than for the madness of kissing.
Your slim gilt soul walks between passion and poetry. I know Hyacinthus, whom Apollo loved
so madly, was you in Greek days. Why are you alone in London, and when do you go to
Salisbury? Do go there to cool your hands in the grey twilight of Gothic things, and come
here whenever you like. It is a lovely place and lacks only you; but go to Salisbury first.
Always, with undying love,
Yours, Oscar
In reality, the content of the <body> element isn’t plain text; it’s a single paragraph. If the letter had happened to contain more than one paragraph we would probably have noticed that paragraphs constitute a hierarchical level below <body> and above plain text, and we would have tagged them for reasons described above. In a one-paragraph letter, though, the paragraphing is easy to overlook because a single paragraph doesn’t provide the visual cues (blank line between paragraphs, perhaps also indentation) that would stand out in a multi-paragraph letter.
The reason we want to tag the paragraph even if there’s only one is that more often than not we work with corpora of similar documents, and not with just one document, and a corpus of letters is likely to include some that contain more than one paragraph. The more consistent our markup is within the corpus, the easier it becomes to analyze and process our XML.
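For the one-paragraph letter, the body might therefore be tagged along the lines of the sketch below. As before, `<p>` is an assumed name for the paragraph element, and only the `<body>` portion is shown.

```xml
<!-- Sketch only: a single paragraph is still wrapped in its own (assumed) p element. -->
<body>
    <p>Your sonnet is quite lovely, and it is a marvel that those red-roseleaf lips of yours
        should be made no less for the madness of music and song than for the madness of
        kissing. Your slim gilt soul walks between passion and poetry. I know Hyacinthus, whom
        Apollo loved so madly, was you in Greek days. Why are you alone in London, and when do
        you go to Salisbury? Do go there to cool your hands in the grey twilight of Gothic
        things, and come here whenever you like. It is a lovely place and lacks only you; but go
        to Salisbury first.</p>
</body>
```

With this shape, the body of every letter in a corpus contains one or more paragraph elements, so the same processing can be applied regardless of how many paragraphs a given letter has.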
When we read a page in English in print or on the web, we normally read from left to right and top to bottom. This might lead us to think of markup like:
The rime of the ancient mariner
Part 1
It is an ancient Mariner,
And he stoppeth one of three.
'By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din.'
The problem with this is that what we’ve tagged as <part> isn’t a part; it’s just the label for a part. A part in this poem contains a label followed by many quatrains (we include only the first two here). Not only is it misleading to tag the label as if it were a part, but the markup above does not provide any formal, machine-actionable representation of the fact that the label goes with the following quatrains, and not the preceding ones. That isn’t an issue with the first part because there are no preceding quatrains, but the situation is different with the other parts. A human knows, of course, that labels in this type of context precede the thing they’re labeling, but a computer doesn’t know that, so we need to make it explicit with markup. You can do that by using the <part> element to wrap (tag) the entire actual part, that is, the label plus the poetic content, along the lines of:
The rime of the ancient mariner
Part 1
It is an ancient Mariner,
And he stoppeth one of three.
'By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din.'
Note that the <part> element is the actual part, so it contains both the label and the quatrains.
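A sketch of this wrapping structure, with the tags written out, might look like the following. The `<part>` and `<line>` elements appear in this document; `<poem>`, `<title>`, `<label>`, and `<stanza>` are assumed names used only for illustration.

```xml
<!-- Sketch only: poem, title, label, and stanza are assumed element names. -->
<poem>
    <title>The rime of the ancient mariner</title>
    <part>
        <!-- The label is a child of the part it labels, alongside the stanzas. -->
        <label>Part 1</label>
        <stanza>
            <line>It is an ancient Mariner,</line>
            <line>And he stoppeth one of three.</line>
            <line>'By thy long grey beard and glittering eye,</line>
            <line>Now wherefore stopp'st thou me?</line>
        </stanza>
        <stanza>
            <line>The Bridegroom's doors are opened wide,</line>
            <line>And I am next of kin;</line>
            <line>The guests are met, the feast is set:</line>
            <line>May'st hear the merry din.'</line>
        </stanza>
    </part>
</poem>
```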
If you care about the order of components in your document, you may be tempted to use markup like:
The rime of the ancient mariner
Part 1
It is an ancient Mariner,
And he stoppeth one of three.
'By thy long grey beard and glittering eye,
Now wherefore stopp'st thou me?
The Bridegroom's doors are opened wide,
And I am next of kin;
The guests are met, the feast is set:
May'st hear the merry din.'
This is well formed but it is nonetheless a mistake. Here’s why:
XML processors can count (we’ll show you how in a few weeks), so you can find the first or second or tenth or all of the odd-numbered stanzas without writing any numbers into your markup. In general it’s best not to insert explicit markup that you don’t need.
If you do need to number explicitly (see below), this is the situation for which attributes were invented. You could use markup like the following instead:
<line n="1">It is an ancient Mariner,</line>
This records that the line is a line and that it is line #1 (the attribute name n is common for values that are numbers, but you can give your attribute any name that will be meaningful for you).
The reason you don’t want to use element names like <stanza1> and <stanza2> is that when an XML processor compares two element names, the only thing it can determine easily is whether the names are identical or not identical. A human can see that these are two types of stanzas, but a computer can’t. The element+attribute strategy, though, records that the stanzas are both stanzas, since the element name is identical. But that strategy also records the stanza number, so it is easily available to an XML processor.
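Applied to stanzas, the element+attribute strategy might look like the sketch below. The attribute name `n` follows the `<line n="1">` example; `<poem>` and `<stanza>` are assumed element names.

```xml
<!-- Sketch only: the element name is shared, and the number lives in the n attribute. -->
<poem>
    <stanza n="1">
        <line>It is an ancient Mariner,</line>
        <line>And he stoppeth one of three.</line>
        <line>'By thy long grey beard and glittering eye,</line>
        <line>Now wherefore stopp'st thou me?</line>
    </stanza>
    <stanza n="2">
        <line>The Bridegroom's doors are opened wide,</line>
        <line>And I am next of kin;</line>
        <line>The guests are met, the feast is set:</line>
        <line>May'st hear the merry din.'</line>
    </stanza>
</poem>
```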
As we’ll see immediately below, most of the time you don’t want to number stanzas or lines at all, even in attributes.
When tagging a document that contains sections, subsections, and sub-subsections, you might be tempted to use markup like the following:
Title of first section goes here
First paragraph of first section.
Title of first subsection of first section goes
here
First paragraph of first subsection of first section.
Title of first sub-subsection of first sub-section of first
section goes here
First paragraph of first sub-subsection of first subsection of first section.
In the example above we’ve used different element names for sections (and their titles) at different levels of the hierarchy. The main reason this is undesirable is that if you later decide to split up your document by making each section its own document, you would have to retag the subsections as sections and the sub-subsections as subsections (and likewise with their titles). What you should do instead is:
Title of first section goes here
First paragraph of first section.
Title of first subsection of first section goes here
First paragraph of first subsection of first section.
Title of first sub-subsection of first sub-section of first section goes
here
First paragraph of first sub-subsection of first subsection of first section.
The reason you can use the same element names at different hierarchical levels is that XML elements have both a name and a context, and XML processing is aware of the context. This means that you can distinguish sections from subsections from sub-subsections (and likewise with their titles) during processing even if they are tagged identically because they occur in different contexts, in this case at different levels of the hierarchy.
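A sketch of the recommended nesting, with the tags written out, might look like this. The element names `<section>`, `<title>`, and `<p>` are assumptions for illustration; what matters is that the same names recur at every level.

```xml
<!-- Sketch only: section, title, and p are assumed element names. -->
<section>
    <title>Title of first section goes here</title>
    <p>First paragraph of first section.</p>
    <section>
        <title>Title of first subsection of first section goes here</title>
        <p>First paragraph of first subsection of first section.</p>
        <section>
            <title>Title of first sub-subsection of first sub-section of first section goes
                here</title>
            <p>First paragraph of first sub-subsection of first subsection of first section.</p>
        </section>
    </section>
</section>
```

Assuming the outermost `<section>` is the root of the document, an XPath expression such as `/section/section/title` would select only the subsection title, because the path itself encodes the depth.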
That XML processing knows about context means that you normally shouldn’t number sequential items at all. You already know (see above) that you shouldn’t use element names like <act-1>, <act-2>, etc. for the acts of a play, but you also shouldn’t use <act n="1"> or <act>1</act>. What you want is just the element name, with no numbering in either an attribute or the textual content of the element, e.g., <act>.
The reason you don’t have to number sequential items in either markup or textual content is that the ordinal position of an element (first, second, etc.) is part of the context, which means that XML processing can distinguish elements with the same name in different positions without your having to specify the position explicitly, and you can output the numbers during processing if you need them. Letting the computer do the counting is valuable because humans can easily skip or repeat a number by accident, and also because if you later insert or remove an item or rearrange them you would have to adjust any explicit numbering accordingly. If you let the computer number the items as it finds them, the numbering will always be correct.
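As a preview only, here is a minimal sketch of how a processor can supply the numbers. It is an XSLT 1.0 stylesheet that assumes a hypothetical document whose root element is `<play>` and whose children are unnumbered `<act>` elements; neither that structure nor this stylesheet is prescribed by the discussion above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <!-- Assumed input: <play><act>...</act><act>...</act><act>...</act></play> -->
    <xsl:output method="text"/>
    <xsl:template match="/play">
        <!-- Visit the acts in document order; position() is computed, not stored. -->
        <xsl:for-each select="act">
            <xsl:text>Act </xsl:text>
            <xsl:value-of select="position()"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Applied to a `<play>` with three `<act>` children, this outputs the lines "Act 1", "Act 2", and "Act 3". If an act is later inserted or removed, the output renumbers itself without any change to the XML.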
There are exceptions to the general policy of not numbering explicitly. If your numbering doesn’t begin at 1 (perhaps you’re encoding only acts 4, 5, and 6 from a play and you want to be able to render the original numbers) or if it isn’t continuous (perhaps you’re encoding only acts 1, 3, and 7 and you want to be able to render the original numbers), XML can’t know what the original numbers were. In situations like that you need to make additional information available in your markup.
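One way to supply that additional information is an attribute, as in the sketch below for a partial encoding of acts 4, 5, and 6. The `<play>` wrapper is an assumed element name, and `n` follows the attribute naming used earlier.

```xml
<!-- Sketch only: explicit numbers are recorded because they cannot be computed by counting. -->
<play>
    <act n="4">
        <!-- content of the original act 4 -->
    </act>
    <act n="5">
        <!-- content of the original act 5 -->
    </act>
    <act n="6">
        <!-- content of the original act 6 -->
    </act>
</play>
```

Here the first `<act>` in the file is in position 1, but its `n` attribute preserves the fact that it is act 4 of the original play.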