Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2025-02-21T18:59:57+0000


Invisible XML and ambiguity

The problem

Suppose you want to tag the Roman numerals in the following one-line plain-text document using ixml:

XV CL

The desired output is:


   XV
   CL
]]>

If you’re familiar with regular expressions, your first attempt might be something like:

{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
space: " ".
newline: #d?, #a.

The first line of the preceding grammar is an ixml comment. The other four lines say that:

Our input document may or may not have a newline at the end. If we use this grammar to parse an input document that doesn’t end with a newline, we’ll notice at least two undesirable details. The output of Markup Blitz after pretty-printing (other ixml processors produce similar results) is:



  XV
   
  CL
]]>

The output is close to our desired result, except that:

If our input document does have a newline at the end, there is one additional undesirable result:



   XV
    
   CL
   

]]>

In addition to the ]]> element and the ambiguity, there is also an unwanted ]]> element, the content of which is a single newline, which is why the end-tag is on the line below the one with the start-tag.

Below we discuss how to fix those issues.

Removing the ]]> and ]]>tags

We can tell the ixml processor not to output markup for the ]]> and ]]> elements by putting a minus sign before the lines where they are defined. That step alone suppresses just the markup, and we have to handle the content (the actual space or newline characters) separately. In the case of the space we want to keep the space character, while in the case of the newline we want to remove the characters. Much as a hyphen before the left side of the grammar rule removes the markup for the element, a hyphen before a character representation on the right side removes the character itself. The following modification, which adds one hyphen on the fourth line and three on the fifth, removes the ]]> tag but not the space character, while removing both the ]]> tags and the newline characters:

{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.

Unless we specify otherwise, the output of an ixml process will be written as a single long line, which can make it difficult for humans to read. Below we’ve broken the long line manually in a way that preserves the single space character that the ixml process correctly emits between the two ]]> elements:



    XV CL
]]>

If we reformat the output using a pretty-print utility like xmllint or xq, we get:



  XV
  CL
]]>

The space and newline are now behaving the way we want, but why is the result reported as ambiguous? When we look at the input it looks like two Roman numerals, and when we look at the output it tags the Roman numerals as we’d expect. So what about it is ambiguous?

Ambiguity in ixml

Coffeepot supports a command-line switch that lets us specify a particular parse, so that, for example, adding --parse:2 to the command line returns the second parse (the first is returned by default). What counts as first or second, etc. is not predictable; the processor is free to number the parses as it pleases. You can also ask for all parses at once by appending --parse-count:all. Here are the four possible parses:


   XV
    
   CL



   XV
    
   C
   L



   X
   V
    
   CL



   X
   V
    
   C
   L
]]>

Before you read further, see whether you can figure out why the parser finds four possible results and why the first one strikes most of us as correct, while we don’t think immediately of the other three.

The reason for this behavior is that although the grammar rule for a ]]> element looks like a regular expression, the plus sign has a different meaning in ixml than it has with regular expressions. With regular expressions the plus sign (and the asterisk) are greedy, which is a technical term that means that they match as many characters as possible. If you match the five-character string XV CL with the regular expression [IVXCLDM]+, it finds exactly two two-character Roman numerals. That happens because a plus sign in a regular expression doesn’t just match one or more characters, although we sometimes describe it that way; what it actually matches is as many characters as it can before it finds a non-matching character. That means that it doesn’t stop after the first character matches (it conform to the requirement of one or more) because the second also matches, but it does stop after the second character matches because the third (the space) doesn’t match.

Patterns with repetition (+ or *) in ixml, unlike in regular expressions, don’t match only the longest passible matching sequence; they simultaneously match all possible matching sequences. This means that, for example, the text XV is matched as one two-character Roman numeral (the same as with a regular expression), but it is also matched as two adjacent one-character Roman numerals. Since there are two two-character Roman numerals in the original input line, the four possible parses are to expand (treat as two consecutive one-character Roman numerals) just the first, just the second, both, or neither.

We learned the preceding from Steven Pemberton’s https://homepages.cwi.nl/~steven/ixml/advanced/tutorial.xhtml (select Grammars are not regular expressions from the drop-down list at the top). Steven’s entire tutorial provides a thoughtful overview of features of ixml grammars that may not be immediately obvious to new users.

Coping with ambiguity

The ixml grammar above produces ambiguous parses because it allows two immediately adjacent Roman numerals, that is, it allows XV to be parsed as XV]]>. We can prohibit that interpretation by changing our definition of a line; instead of saying that a line is one or more instances of Roman numerals and spaces in any order, we can say that it can have as many Roman numerals as we want, but the line must begin and end with a Roman numeral (that is, no spaces at the beginning or end) and there must be exactly one space between any two Roman numerals. (If we wanted, we could allow initial and final spaces and we could allow multiple consecutive spaces, but we’ve simplified the circumstances so that we can focus on identifying the Roman numerals correctly. The crucial detail is that Roman numerals cannot be immediately adjacent to other Roman numerals, which our first ixml grammar permitted.)

A common idiom in ixml is the use of the double plus sign (++) to mean one or more instances of the thing on the left separated by the thing on the right. That is, the rule:

line: roman++space.

means that a ]]> contains one or more ]]> elements separated by exactly one ]]> element between each two ]]> elements. This pattern prohibits two immediately consecutive Roman numerals, without any intervening space, which means that XV must be a single Roman numeral, and cannot be a sequence of X]]> followed immediately by V]]>.

With our new grammar:

line: roman++space, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.

the result is unambiguous:


   XV
   CL
]]>

To learn more about dealing with ambiguity in ixml, see:


Appendix: Visualizing ambiguity

This section is optional. Feel free to skip it, but we recommend at least looking it over quickly to get a sense of what it contains, so that you’ll know why you might want to come back to it at some point.

Coffeepot provides an option to output a graph of what the grammar matches; the technical term for this model is forest because it can be understood as multiple trees. A graph visualization can sometimes clarify why an ixml grammar is ambiguous, and you can create one by appending -G:filename.svg to the command line (replacing filename with your filename of choice). Reading these graphs can be challenging because they are very complete, and below the following image we’ll describe how to read it and highlight what to look for.

For the ambiguous grammar above the forest looks like:

SPPF cluster_id0 cluster_id1 cluster_id2 cluster_id3 cluster_id9 cluster_id10 cluster_id11 cluster_id12 cluster_id13 cluster_id14 cluster_id15 cluster_id22 cluster_id23 cluster_id24 cluster_id32 cluster_id33 cluster_id34 cluster_id35 cluster_id44 cluster_id45 cluster_id46 cluster_id47 cluster_id48 cluster_id49 cluster_id50 cluster_id51 cluster_id52 cluster_id53 cluster_id54 cluster_id55 cluster_id56 cluster_id57 cluster_id58 nodeid0 'X' « 1 » nodeid10 $3_plus « 2 » nodeid1 $3_plus « 1 » nodeid1->nodeid0 nodeid2 roman « 1 » nodeid2->nodeid1 nodeid3 $1_alt « 1 » nodeid3->nodeid2 nodeid54 $2_$1_alt-plus « 2 – 6 » nodeid9 'V' « 2 » nodeid10->nodeid9 nodeid11 $3_plus « 1 – 3 » anond3e13 nodeid11->anond3e13 anond3e13->nodeid0 anond3e13->nodeid10 nodeid12 roman « 2 » nodeid12->nodeid10 nodeid13 roman « 1 – 3 » nodeid13->nodeid11 nodeid14 $1_alt « 2 » nodeid14->nodeid12 nodeid53 $2_$1_alt-plus « 3 – 6 » nodeid15 $1_alt « 1 – 3 » nodeid15->nodeid13 nodeid22 ' ' « 3 » nodeid23 space « 3 » nodeid23->nodeid22 nodeid24 $1_alt « 3 » nodeid24->nodeid23 nodeid52 $2_$1_alt-plus « 4 – 6 » / 2 nodeid32 'C' « 4 » nodeid45 $3_plus « 5 » nodeid33 $3_plus « 4 » nodeid33->nodeid32 nodeid34 roman « 4 » nodeid34->nodeid33 nodeid35 $1_alt « 4 » nodeid35->nodeid34 nodeid51 $2_$1_alt-plus « 5 » nodeid44 'L' « 5 » nodeid45->nodeid44 nodeid46 $3_plus « 4 – 6 » anond3e41 nodeid46->anond3e41 anond3e41->nodeid32 anond3e41->nodeid45 nodeid47 roman « 5 » nodeid47->nodeid45 nodeid48 roman « 4 – 6 » nodeid48->nodeid46 nodeid49 $1_alt « 5 » nodeid49->nodeid47 nodeid50 $1_alt « 4 – 6 » nodeid50->nodeid48 nodeid51->nodeid49 nodeid52->nodeid50 anond3e57 nodeid52->anond3e57 anond3e57->nodeid35 anond3e57->nodeid51 anond3e61 nodeid53->anond3e61 anond3e61->nodeid24 anond3e61->nodeid52 anond3e65 nodeid54->anond3e65 anond3e65->nodeid14 anond3e65->nodeid53 nodeid55 $2_$1_alt-plus « 1 – 6 » / 2 anond3e69 nodeid55->anond3e69 anond3e72 nodeid55->anond3e72 nodeid56 $4_newline-optionⁿ anond3e69->nodeid15 anond3e69->nodeid53 anond3e72->nodeid3 anond3e72->nodeid54 epsilond3e76 ε nodeid56->epsilond3e76 nodeid57 line « 1 – 6 » anond3e78 nodeid57->anond3e78 anond3e78->nodeid55 anond3e78->nodeid56 nodeid58 $$ « 1 – 6 » nodeid58->nodeid57

The shapes with the thick brown edges describe the path associated with what Coffeepot decides to use as its first parse, the one that finds two two-character Roman numerals. The number ranges (e.g., 1 – 6 on the topmost node) identify the input characters; read this as the root node of the forest involves characters 1 until 6 (the second value is exclusive, that is, not included in the range; there is no sixth character). A one-character range is just a single digit, e.g., the ‘L’ « 5 » at the bottom means that the L is the fifth character.

Since we’re interested in ambiguity that affects ]]> elements, focus on the ellipses labeled roman and the house-like shapes that contain Roman-numeral characters. If you look at the left side of the image, there’s an ellipse labeled roman «1 – 3» that is an ancestor of the house-like shapes labeled ‘X’ « 1 » and ‘V’ « 2 ». This represents the two-character Roman numeral XV. But the X node also has an ancestor labeled roman « 1 » that is not an ancestor of the V node and the V node has an ancestor labeled roman « 2 » that is not an ancestor of the X node. These are the separate one-character Roman numerals X and V. You’ll find a similar structure on the right side for the CL portion of the input.

A graph of the forest created by parsing our input document with our unambiguous ixml grammar is much simpler:

SPPF cluster_id0 cluster_id9 cluster_id10 cluster_id11 cluster_id12 cluster_id19 cluster_id20 cluster_id21 cluster_id32 cluster_id33 cluster_id34 cluster_id35 cluster_id36 cluster_id38 cluster_id39 cluster_id40 cluster_id41 cluster_id42 nodeid0 'X' « 1 » nodeid10 $3_plus « 2 » nodeid9 'V' « 2 » nodeid10->nodeid9 nodeid11 $3_plus « 1 – 3 » anond3e7 nodeid11->anond3e7 anond3e7->nodeid0 anond3e7->nodeid10 nodeid12 roman « 1 – 3 » nodeid12->nodeid11 nodeid38 $2_space-starⁿ « 3 – 6 » nodeid19 ' ' « 3 » nodeid20 space « 3 » nodeid20->nodeid19 nodeid35 roman « 4 – 6 » nodeid21 'C' « 4 » nodeid33 $3_plus « 5 » nodeid32 'L' « 5 » nodeid33->nodeid32 nodeid34 $3_plus « 4 – 6 » anond3e20 nodeid34->anond3e20 anond3e20->nodeid21 anond3e20->nodeid33 nodeid35->nodeid34 nodeid36 $4_space-plus « 3 – 6 » anond3e27 nodeid36->anond3e27 anond3e27->nodeid20 anond3e27->nodeid35 nodeid38->nodeid36 nodeid39 $1_roman-plus-sep « 1 – 6 » anond3e33 nodeid39->anond3e33 nodeid40 $5_newline-optionⁿ anond3e33->nodeid12 anond3e33->nodeid38 epsilond3e37 ε nodeid40->epsilond3e37 nodeid41 line « 1 – 6 » anond3e39 nodeid41->anond3e39 anond3e39->nodeid39 anond3e39->nodeid40 nodeid42 $$ « 1 – 6 » nodeid42->nodeid41

All shapes havae thick brown edges because thick brown edges are associated with the first parse and in this case there is only one.