Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2025-02-21T18:59:57+0000
Suppose you want to tag the Roman numerals in the following one-line plain-text document using ixml:
XV CL
The desired output is:
XV
CL
]]>
If you’re familiar with regular expressions, your first attempt might be something like:
{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
space: " ".
newline: #d?, #a.
The first line of the preceding grammar is an ixml comment. The other four lines say that:
A ]]>
element contains one
or more instances of ]]>
elements and ]]>
elements
in any order. There may or may not be a newline code (either Linux/MacOS or
Windows) at the end.
A ]]>
element contains a
sequence of one or more Roman-numeral characters. (This pattern will also
allow combinations of the characters that aren’t valid Roman numerals, such
as IVX
, but we’ll ignore that concern for the moment.)
A ]]>
element contains a
single space character.
A ]]>
element consists of
an LF (#a
) that may or may not be preceded by a CF
(#d
).
Our input document may or may not have a newline at the end. If we use this grammar to parse an input document that doesn’t end with a newline, we’ll notice at least two undesirable details. The output of Markup Blitz after pretty-printing (other ixml processors produce similar results) is:
XV
CL
]]>
The output is close to our desired result, except that:
It contains ]]>
tags that
we don’t want, although we do want the space character.
It is tagged as ambiguous; note the
@ixml:state
attribute on the root
element.
If our input document does have a newline at the end, there is one additional undesirable result:
XV
CL
]]>
In addition to the ]]>
element and
the ambiguity, there is also an unwanted
]]>
element, the content of
which is a single newline, which is why the end-tag is on the line below the one
with the start-tag.
Below we discuss how to fix those issues.
]]>
and
]]>
tagsWe can tell the ixml processor not to output markup for the
]]>
and
]]>
elements by putting a minus
sign before the lines where they are defined. That step alone suppresses just the
markup, and we have to handle the content (the actual space or newline characters)
separately. In the case of the space we want to keep the space character, while in
the case of the newline we want to remove the characters. Much as a hyphen before
the left side of the grammar rule removes the markup for the element, a hyphen
before a character representation on the right side removes the character itself.
The following modification, which adds one hyphen on the fourth line and three on
the fifth, removes the ]]>
tag but
not the space character, while removing both the
]]>
tags and the newline
characters:
{ Produces ambiguous result }
line: (roman; space)+, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.
Unless we specify otherwise, the output of an ixml process will be written as a
single long line, which can make it difficult for humans to read. Below we’ve broken
the long line manually in a way that preserves the single space character that the
ixml process correctly emits between the two
]]>
elements:
XV CL
]]>
If we reformat the output using a pretty-print utility like xmllint
or
xq
, we get:
XV
CL
]]>
The space and newline are now behaving the way we want, but why is the result reported as ambiguous? When we look at the input it looks like two Roman numerals, and when we look at the output it tags the Roman numerals as we’d expect. So what about it is ambiguous?
Coffeepot supports a command-line switch that lets us specify a particular parse, so that, for example, adding --parse:2 to the command line returns the second parse (the first is returned by default). What counts as first or second, etc. is not predictable; the processor is free to number the parses as it pleases. You can also ask for all parses at once by appending --parse-count:all. Here are the four possible parses:
XV
CL
XV
C
L
X
V
CL
X
V
C
L
]]>
Before you read further, see whether you can figure out why the parser finds four possible results and why the first one strikes most of us as correct, while we don’t think immediately of the other three.
The reason for this behavior is that although the grammar rule for a
]]>
element looks like a regular
expression, the plus sign has a different meaning in ixml than it has with regular
expressions. With regular expressions the plus sign (and the asterisk) are
greedy, which is a technical term that means that they match as many
characters as possible. If you match the five-character string XV CL
with the regular expression [IVXCLDM]+
, it finds
exactly two two-character Roman numerals. That happens because a plus sign in a
regular expression doesn’t just match one or more characters, although we sometimes
describe it that way; what it actually matches is as many characters as it
can before it finds a non-matching character. That means that it doesn’t
stop after the first character matches (it conform to the requirement of one or
more
) because the second also matches, but it does stop after the second
character matches because the third (the space) doesn’t match.
Patterns with repetition (+
or *
) in ixml, unlike in
regular expressions, don’t match only the longest passible matching sequence; they
simultaneously match all possible matching sequences. This means that, for
example, the text XV
is matched as one two-character Roman numeral (the
same as with a regular expression), but it is also matched as two adjacent
one-character Roman numerals. Since there are two two-character Roman numerals in
the original input line, the four possible parses are to expand (treat as two
consecutive one-character Roman numerals) just the first, just the second, both, or
neither.
We learned the preceding from Steven Pemberton’s https://homepages.cwi.nl/~steven/ixml/advanced/tutorial.xhtml (select
Grammars are not regular expressions
from the drop-down list at the top).
Steven’s entire tutorial provides a thoughtful overview of features of ixml grammars
that may not be immediately obvious to new users.
The ixml grammar above produces ambiguous parses because it allows two immediately
adjacent Roman numerals, that is, it allows XV
to be parsed as
X
. We
can prohibit that interpretation by changing our definition of a line; instead of
saying that a line is one or more instances of Roman numerals and spaces in any
order, we can say that it can have as many Roman numerals as we want, but the line
must begin and end with a Roman numeral (that is, no spaces at the beginning or end)
and there must be exactly one space between any two Roman numerals. (If we wanted,
we could allow initial and final spaces and we could allow multiple consecutive
spaces, but we’ve simplified the circumstances so that we can focus on identifying
the Roman numerals correctly. The crucial detail is that Roman numerals cannot be
immediately adjacent to other Roman numerals, which our first ixml grammar
permitted.)
A common idiom in ixml is the use of the double plus sign (++
) to mean
one or more instances of the thing on the left separated by the thing on the
right
. That is, the rule:
line: roman++space.
means that a ]]>
contains one or
more ]]>
elements separated by
exactly one ]]>
element between
each two ]]>
elements. This
pattern prohibits two immediately consecutive Roman numerals, without any
intervening space, which means that XV
must be a single Roman numeral,
and cannot be a sequence of
X]]>
followed immediately
by V]]>
.
With our new grammar:
line: roman++space, newline?.
roman: ["IVXCLDM"]+.
-space: " ".
-newline: -#d?, -#a.
the result is unambiguous:
XV
CL
]]>
Steven Pemberton’s Advanced Invisible XML (ixml) Tutorial
Norm Tovey-Walsh’s Writing Invisible XML grammars
Norm Tovey-Walsh’s Ambiguity in iXML and how to control it
This section is optional. Feel free to skip it, but we recommend at least looking it over quickly to get a sense of what it contains, so that you’ll know why you might want to come back to it at some point.
Coffeepot provides an option to output a graph of what the grammar matches; the
technical term for this model is forest because it can be understood as
multiple trees. A graph visualization can sometimes clarify why an ixml grammar is
ambiguous, and you can create one by appending -G:filename.svg to the
command line (replacing filename
with your filename of choice). Reading these
graphs can be challenging because they are very complete, and below the following
image we’ll describe how to read it and highlight what to look for.
For the ambiguous grammar above the forest looks like:
The shapes with the thick brown edges describe the path associated with what
Coffeepot decides to use as its first parse, the one that finds two two-character
Roman numerals. The number ranges (e.g., 1 – 6
on the topmost node)
identify the input characters; read this as the root node of the forest involves
characters 1 until 6
(the second value is exclusive, that is, not
included in the range; there is no sixth character). A one-character range is just a
single digit, e.g., the ‘L’ « 5 »
at the bottom means that the
L
is the fifth character.
Since we’re interested in ambiguity that affects
]]>
elements, focus on the
ellipses labeled roman
and the house-like shapes that contain
Roman-numeral characters. If you look at the left side of the image, there’s an
ellipse labeled roman «1 – 3»
that is an ancestor of the house-like
shapes labeled ‘X’ « 1 »
and ‘V’ « 2 »
. This represents
the two-character Roman numeral XV
. But the X
node also
has an ancestor labeled roman « 1 »
that is not an ancestor of the
V
node and the V
node has an ancestor labeled
roman « 2 »
that is not an ancestor of the X
node. These
are the separate one-character Roman numerals X
and V
.
You’ll find a similar structure on the right side for the CL
portion of
the input.
A graph of the forest created by parsing our input document with our unambiguous ixml grammar is much simpler:
All shapes havae thick brown edges because thick brown edges are associated with the first parse and in this case there is only one.