Invisible XML and ambiguity

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2025-04-01T14:22:58+0000

The problem

Suppose you want to tag the Roman numerals in the following one-line plain-text document using ixml:

XV CL

The desired output is:

<line>
   <roman>XV</roman>
   <roman>CL</roman>
</line>

If you’re familiar with regular expressions, your first attempt might be something like:

{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
space: " ".
newline: #d?, #a.

The first line of the preceding grammar is an ixml comment. The other four lines say that:

A <line> element contains one or more instances of <roman> elements and <space> elements in any order. There may or may not be a newline code (either Linux/MacOS or Windows) at the end.
A <roman> element contains a sequence of one or more Roman-numeral characters. (This pattern will also allow combinations of the characters that aren’t valid Roman numerals, such as IVX, but we’ll ignore that concern for the moment.)
A <space> element contains a single space character.
A <newline> element consists of an LF (#a) that may or may not be preceded by a CR (#d).

Our input document may or may not have a newline at the end. If we use this grammar to parse an input document that doesn’t end with a newline, we’ll notice at least two undesirable details. The output of Markup Blitz after pretty-printing (other ixml processors produce similar results) is:

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>XV</roman>
  <space> </space>
  <roman>CL</roman>
</line>

The output is close to our desired result, except that:

It contains <space> tags that we don’t want, although we do want the space character.
It is tagged as ambiguous; note the @ixml:state attribute on the root element.

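If our input document does have a newline at the end, there is one additional undesirable result:

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <roman>XV</roman>
   <space> </space>
   <roman>CL</roman>
   <newline>
</newline>
</line>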

In addition to the <space> element and the ambiguity, there is also an unwanted <newline> element, the content of which is a single newline, which is why the end-tag is on the line below the one with the start-tag.

Below we discuss how to fix those issues.

Removing the `<space>` and `<newline>` tags

We can tell the ixml processor not to output markup for the <space> and <newline> elements by putting a minus sign (hyphen) before their names on the lines where they are defined. That step alone suppresses just the markup, and we have to handle the content (the actual space or newline characters) separately. In the case of the space we want to keep the space character, while in the case of the newline we want to remove the characters. Much as a hyphen before the left side of the grammar rule removes the markup for the element, a hyphen before a character representation on the right side removes the character itself. The following modification, which adds one hyphen on the fourth line and three on the fifth, removes the <space> tags but not the space character, while removing both the <newline> tags and the newline characters:

{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.

Unless we specify otherwise, the output of an ixml process will be written as a single long line, which can make it difficult for humans to read. Below we’ve broken the long line manually in a way that preserves the single space character that the ixml process correctly emits between the two <roman> elements:

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
    <roman>XV</roman> <roman>CL</roman>
</line>

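If we reformat the output using a pretty-print utility like xmllint or xq, we get:

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>XV</roman>
  <roman>CL</roman>
</line>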

The space and newline are now behaving the way we want, but why is the result reported as ambiguous? When we look at the input it looks like two Roman numerals, and when we look at the output it tags the Roman numerals as we’d expect. So what about it is ambiguous?

Ambiguity in ixml

Coffeepot supports a command-line switch that lets us specify a particular parse, so that, for example, adding --parse:2 to the command line returns the second parse (the first is returned by default). What counts as first or second, etc. is not predictable; the processor is free to number the parses as it pleases. You can also ask for all parses at once by appending --parse-count:all. Here are the four possible parses:

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>XV</roman>
  <roman>CL</roman>
</line>

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>X</roman>
  <roman>V</roman>
  <roman>CL</roman>
</line>

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>XV</roman>
  <roman>C</roman>
  <roman>L</roman>
</line>

<line xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
  <roman>X</roman>
  <roman>V</roman>
  <roman>C</roman>
  <roman>L</roman>
</line>

Before you read further, see whether you can figure out why the parser finds four possible results and why the first one strikes most of us as correct, while we don’t think immediately of the other three.

The reason for this behavior is that although the grammar rule for a <roman> element looks like a regular expression, the plus sign has a different meaning in ixml than it has in regular expressions. In regular expressions the plus sign (like the asterisk) is greedy, a technical term meaning that it matches as many characters as possible. If you match the five-character string XV CL against the regular expression [IVXCLDM]+, it finds exactly two two-character Roman numerals. That happens because a plus sign in a regular expression doesn’t just match one or more characters, although we sometimes describe it that way; what it actually matches is as many characters as it can before it reaches a non-matching character. That means that it doesn’t stop after the first character matches (even though one character already satisfies the requirement of one or more) because the second also matches, but it does stop after the second character because the third (the space) doesn’t match.
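
To see the greedy behavior concretely, here is a minimal sketch in Python (our choice for illustration; it is not part of the ixml workflow described here). It uses the standard re module, and the lazy pattern +? is included only for contrast:

```python
import re

text = "XV CL"

# Greedy: + consumes as many Roman-numeral characters as it can
# before it reaches a character that does not match (the space).
greedy = re.findall(r"[IVXCLDM]+", text)
print(greedy)  # ['XV', 'CL']

# Lazy, for contrast: +? stops as soon as one character has matched.
lazy = re.findall(r"[IVXCLDM]+?", text)
print(lazy)  # ['X', 'V', 'C', 'L']
```

Whether the quantifier is greedy or lazy, the regular-expression engine commits to exactly one way of dividing up the input; it never reports several alternatives at once.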

Patterns with repetition (+ or *) in ixml, unlike in regular expressions, don’t match only the longest possible matching sequence; they simultaneously match all possible matching sequences. This means that, for example, the text XV is matched as one two-character Roman numeral (the same as with a regular expression), but it is also matched as two adjacent one-character Roman numerals. Since there are two two-character Roman numerals in the original input line, the four possible parses are to expand (treat as two consecutive one-character Roman numerals) just the first, just the second, both, or neither.

We learned the preceding from Steven Pemberton’s ixml tutorial at https://homepages.cwi.nl/~steven/ixml/advanced/tutorial.xhtml (select Grammars are not regular expressions from the drop-down list at the top). Steven’s entire tutorial provides a thoughtful overview of features of ixml grammars that may not be immediately obvious to new users.
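
The four parses described above can be enumerated with a small Python sketch. This is a toy model of the ambiguity only, not a description of how an ixml processor works internally, and the function names splits and all_parses are our own inventions. It assumes that the input consists of runs of Roman-numeral characters separated by single spaces:

```python
from itertools import product


def splits(run):
    """Yield every way to cut `run` into consecutive non-empty pieces."""
    if not run:
        yield []
        return
    for i in range(1, len(run) + 1):
        for rest in splits(run[i:]):
            yield [run[:i]] + rest


def all_parses(text):
    """Return every reading of `text`, one list of numerals per run.

    Each space-delimited run of characters can independently be read
    as one long numeral or as several shorter adjacent numerals.
    """
    runs = text.split(" ")
    return list(product(*(list(splits(run)) for run in runs)))


for parse in all_parses("XV CL"):
    print(parse)
```

Each of the two runs can be read in two ways, so the sketch prints 2 × 2 = 4 combinations, one of which is the reading with two two-character numerals.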

Coping with ambiguity

The ixml grammar above produces ambiguous parses because it allows two immediately adjacent Roman numerals, that is, it allows XV to be parsed as <roman>X</roman><roman>V</roman>. We can prohibit that interpretation by changing our definition of a line; instead of saying that a line is one or more instances of Roman numerals and spaces in any order, we can say that it can have as many Roman numerals as we want, but the line must begin and end with a Roman numeral (that is, no spaces at the beginning or end) and there must be exactly one space between any two Roman numerals. (If we wanted, we could allow initial and final spaces and we could allow multiple consecutive spaces, but we’ve simplified the circumstances so that we can focus on identifying the Roman numerals correctly. The crucial detail is that Roman numerals cannot be immediately adjacent to other Roman numerals, which our first ixml grammar permitted.)

A common idiom in ixml is the use of the double plus sign (++) to mean one or more instances of the thing on the left separated by the thing on the right. That is, the rule:

line: roman++space.

means that a <line> contains one or more <roman> elements, with exactly one <space> element between each two <roman> elements. This pattern prohibits two immediately consecutive Roman numerals without an intervening space, which means that XV must be a single Roman numeral and cannot be a sequence of <roman>X</roman> followed immediately by <roman>V</roman>.

With our new grammar:

line: roman++space, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.

the result is unambiguous:

<line>
   <roman>XV</roman>
   <roman>CL</roman>
</line>
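
The difference between the two line rules can be checked by counting parses. The following Python sketch is a simplified model under stated assumptions: it ignores the optional trailing newline, it is not how an ixml processor is implemented, and the names is_roman, count_any_order, and count_separated are hypothetical helpers of our own:

```python
ROMAN_CHARS = set("IVXCLDM")


def is_roman(s):
    """True if s is a non-empty run of Roman-numeral characters."""
    return len(s) > 0 and set(s) <= ROMAN_CHARS


def count_any_order(text):
    """Count parses for the rule  line: (roman; space)+."""
    ways = [1] + [0] * len(text)  # ways[i]: number of parses of text[:i]
    for i in range(1, len(text) + 1):
        for j in range(i):
            piece = text[j:i]
            if is_roman(piece) or piece == " ":
                ways[i] += ways[j]
    return ways[len(text)] if text else 0


def count_separated(text):
    """Count parses for the rule  line: roman++space."""
    # ends[i]: number of parses of text[:i] that end with a numeral
    ends = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(i):
            if not is_roman(text[j:i]):
                continue
            if j == 0:
                ends[i] += 1            # first numeral on the line
            elif text[j - 1] == " ":
                ends[i] += ends[j - 1]  # numeral preceded by one space
    return ends[len(text)]


print(count_any_order("XV CL"))   # 4
print(count_separated("XV CL"))   # 1
```

Under the any-order rule every cut point inside a run of Roman-numeral characters doubles the number of parses, while under the separated rule a numeral can start only at the beginning of the line or immediately after a space, which leaves a single parse.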

Appendix: Visualizing ambiguity

This section is optional. Feel free to skip it, but we recommend at least looking it over quickly to get a sense of what it contains, so that you’ll know why you might want to come back to it at some point.

Coffeepot provides an option to output a graph of what the grammar matches; the technical term for this model is forest because it can be understood as multiple trees. A graph visualization can sometimes clarify why an ixml grammar is ambiguous, and you can create one by appending -G:filename.svg to the command line (replacing filename with your filename of choice). Reading these graphs can be challenging because they are very complete, and below the following image we’ll describe how to read it and highlight what to look for.

For the ambiguous grammar above the forest looks like:

The shapes with the thick brown edges trace the path associated with what Coffeepot decides to use as its first parse, the one that finds two two-character Roman numerals. The number ranges (e.g., 1 – 6 on the topmost node) identify the input characters; read this one as saying that the root node of the forest spans characters 1 up to 6, where the second value is exclusive (that is, not included in the range; there is no sixth character). A one-character range is written as a single number, e.g., ‘L’ « 5 » at the bottom means that the L is the fifth character.

Since we’re interested in ambiguity that affects <roman> elements, focus on the ellipses labeled roman and the house-like shapes that contain Roman-numeral characters. If you look at the left side of the image, there’s an ellipse labeled roman «1 – 3» that is an ancestor of the house-like shapes labeled ‘X’ « 1 » and ‘V’ « 2 ». This represents the two-character Roman numeral XV. But the X node also has an ancestor labeled roman « 1 » that is not an ancestor of the V node and the V node has an ancestor labeled roman « 2 » that is not an ancestor of the X node. These are the separate one-character Roman numerals X and V. You’ll find a similar structure on the right side for the CL portion of the input.

A graph of the forest created by parsing our input document with our unambiguous ixml grammar is much simpler:

All shapes have thick brown edges because thick brown edges are associated with the first parse and in this case there is only one.
