Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2025-02-27T17:33:13+0000
This three-session lesson plan introduces learners to two relatively new XML technologies, both first published in 2022:
Invisible XML (ixml) is a language that can be used to add markup to plain-text resources, a task you may previously have undertaken with regular expressions.
Processing ixml: You can practice using ixml at the online jωiXML Invisible XML Workbench without installing anything on your local machine. We nonetheless recommend working locally by installing the CoffeePot and Markup Blitz ixml processors because they can handle larger input files, and we provide instructions for those installations at Configuring XProc and ixml processors.
Optional: Our installation page also provides instructions for installing xmq and using it for stand-alone ixml processing. You don’t need xmq for this week’s activities, but as you develop ixml grammars, you might find it helpful to have access to an additional processor.
Authoring ixml: There is currently no support for ixml in <oXygen/>, which means that you can author your own ixml grammars as <oXygen/> plain-text documents, but <oXygen/> does not provide validation, processing, or other ixml-specific support. The conventional filename extension for ixml grammar documents is .ixml.
XProc 3.0 is a language for processing pipelines, that is, sequences of operations, where the output of one operation may function as the input to the next.
Processing XProc: To use XProc you need to install either XML Calabash 3 or MorganaXProc-IIIse on your local machine, and we provide instructions for those installations at Configuring XProc and ixml processors. <oXygen/> builds in support for XProc 1.0, but not (yet) for 3.0, so you need to install your own 3.0 processor. Once you’ve installed XML Calabash 3 or MorganaXProc-IIIse on your local machine, it’s possible to configure an <oXygen/> transformation scenario to use one or the other of them to process an XProc pipeline, but we find it simpler to process XProc at the command line.
Authoring XProc: You can author XProc 3.0 in <oXygen/>, which recognizes XProc as a file type, but the validation is not up to date with all features of XProc 3.0, which means that <oXygen/> may incorrectly report errors for valid code and may not provide all content completion options. The conventional filename extension for XProc documents is .xpl.
This unit takes place over three fifty-minute sessions:
Date | Topic | Homework due next time |
---|---|---|
Wed, 2025-02-26 | (Some other topic; not part of this unit) | |
Fri, 2025-02-28 | Invisible XML | |
Mon, 2025-03-03 | XProc 3.0 | |
Wed, 2025-03-05 | Using Invisible XML and XProc 3.0 together | (Homework for some other topic; not part of this unit) |
Why use ixml: Tagging Shakespeare sonnets (data)
All Shakespeare sonnets: https://github.com/djbpitt/ixml/blob/main/sonnets/sonnets.txt
First three sonnets only: https://github.com/djbpitt/ixml/blob/main/sonnets/sonnets-small.txt
Introduction to ixml grammars: Tagging a date
Ambiguity: Grammars are not regular expressions
Input: https://github.com/djbpitt/ixml/blob/main/roman/roman.txt
Ambiguous grammar: https://github.com/djbpitt/ixml/blob/main/roman/roman-ambiguous.ixml
Regex: greedy-roman-numeral (at http://regexpal.com)
Unambiguous grammar: https://github.com/djbpitt/ixml/blob/main/roman/roman-unambiguous.ixml
Tagging Shakespeare sonnets (grammars)
For Unicode character classes (e.g., L to match any letter) see Unicode Standard Annex #44 (Unicode Character Database), §5.7.1 General Category Values.
Files for the sonnets activity are located at https://github.com/djbpitt/ixml/tree/main/sonnets.
Our summary of XProc basics, simplified to focus on the type of XProc we write in this unit.
Step by step XProc for Shakespearean sonnets:
Create the XProc skeleton.
Read input into the pipeline input. Input can be either local or remote.
Emit the result on the pipeline output (we modify this later).
Test the pipeline.
Tag the input using the <p:invisible-xml> step. Requires an ixml grammar.
Transform the XML to HTML using the <p:xslt> step. Requires an XSLT stylesheet.
Specify serialization parameters for the output.
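The steps above can be sketched as a single small pipeline. This is a minimal sketch, not the pipeline used in class: the grammar filename (sonnets.ixml) and the stylesheet filename (sonnets-to-html.xsl) are hypothetical, and the serialization parameters are only one reasonable choice.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <!-- Pipeline input: a plain-text file. The href may be a local path
       (as here) or an http(s) URL for remote input. -->
  <p:input port="source">
    <p:document href="sonnets.txt" content-type="text/plain"/>
  </p:input>

  <!-- Pipeline output; the serialization map asks for indented
       HTML 5 in XML syntax (an assumption, adjust as needed) -->
  <p:output port="result"
    serialization="map{'method': 'xhtml', 'html-version': 5, 'indent': true()}"/>

  <!-- Tag the plain text; sonnets.ixml is a hypothetical grammar file -->
  <p:invisible-xml>
    <p:with-input port="grammar">
      <p:document href="sonnets.ixml" content-type="text/plain"/>
    </p:with-input>
  </p:invisible-xml>

  <!-- Transform the tagged XML to HTML; the stylesheet name is hypothetical -->
  <p:xslt>
    <p:with-input port="stylesheet" href="sonnets-to-html.xsl"/>
  </p:xslt>
</p:declare-step>
```

Because no explicit connection is given for the result port, the output of the last step (here the XSLT transformation) flows into it by default.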
Files for the Blithedale activity are located at https://github.com/djbpitt/ixml/tree/main/blithedale.
Below is a copy of blithedale.xpl, with line annotations below the code.
[Code listing of blithedale.xpl; the line numbers cited in the annotations below refer to this file.]
Lines 2–5: Note the namespace declarations inside the <p:declare-step> start-tag. The cx: namespace prefix is used for XProc extension steps, that is, those that are not part of any of the standard step and optional libraries. The xs: namespace prefix is used for standard datatypes, such as the Boolean (true/false) type in line 9.
Line 9: A static parameter is a variable that does not depend on input data. In this case we declare a $debug parameter that governs whether progress messages are displayed to the screen as the pipeline is processed. The default value is set to false() (no progress messages), and we can override that and set it to true() when we run the pipeline. To run with debug output (note the different syntax for specifying a true value for XML Calabash and MorganaXProc-IIIse):
xmlcalabash blithedale.xpl debug="?true()"
morgana blithedale.xpl -static:debug=true
Without debug information:
xmlcalabash blithedale.xpl
morgana blithedale.xpl
Lines 13–14: Read input over the Internet. Setting the value of @sequence to false means that the input must be exactly one document. Setting it to true allows any number of documents, including zero.
Lines 18–20: We set the output as empty because we’re going to write our output to disk using <p:store>. Since we don’t have exactly one output document (we have none), we have to set @sequence to true, since a value of false would require exactly one output document (see above concerning <p:input>).
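The input and output declarations described above might look roughly like the following. This is a sketch under stated assumptions: the URL and the output filename are placeholders, not the ones used in blithedale.xpl.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <!-- Exactly one input document, read from a (placeholder) URL -->
  <p:input port="source" sequence="false">
    <p:document href="https://example.org/input.txt" content-type="text/plain"/>
  </p:input>

  <!-- The primary output is explicitly empty, so it must allow a
       sequence (zero documents is not "exactly one") -->
  <p:output port="result" sequence="true">
    <p:empty/>
  </p:output>

  <!-- Results are written to disk instead; the filename is hypothetical -->
  <p:store href="output.txt"/>
</p:declare-step>
```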
Line 21: Here and elsewhere (lines 31, 41, 46, 54, 61, 69, 76, 84, 93, 104, 114, 124, 133) we use <p:identity>, which passes through its input unchanged, just as a place to hang a @message attribute. The value of a @message attribute is written to the screen when the step is executed, and the @use-when attribute value controls whether the step should be executed or not. If $debug is true(), the messages are displayed; if it is false() (the default), they aren’t.
We could put a @message attribute on any step, but we want the working steps to be executed regardless of the value of the $debug parameter, so we can’t put a @use-when attribute on them. Putting the message on a separate <p:identity> step lets us control whether it is rendered; the working step is always executed.
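A minimal sketch of this debug-message pattern follows. The message text and the working step are illustrative assumptions; only the $debug name and its false() default come from the description above.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" version="3.0">
  <!-- Static option: known before the pipeline runs, so it can be
       used in use-when expressions -->
  <p:option name="debug" as="xs:boolean" static="true" select="false()"/>

  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Runs (and prints its message) only when $debug is true -->
  <p:identity use-when="$debug" message="About to remove boilerplate"/>

  <!-- The working step has no use-when, so it always runs;
       the element name matched here is hypothetical -->
  <p:delete match="boilerplate"/>
</p:declare-step>
```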
Lines 25–32: Unicode documents may or may not begin with a byte order mark (BOM). We could have sniffed out the presence or absence of a BOM in our ixml step and handled it there, but instead we let XProc check for it and remove it if it is present. We used a <p:if> step as a wrapper for a simple XPath expression that strips the initial character if it is a BOM.
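One way to write such a conditional BOM removal is sketched below. It is not necessarily how blithedale.xpl does it: the test expression and the use of an inline text value template are assumptions, built around the substring(., 2) expression mentioned above.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" content-types="text"/>
  <p:output port="result" content-types="text"/>

  <!-- If the text begins with U+FEFF, replace the document with a copy
       that starts at the second character; otherwise p:if passes the
       input through unchanged -->
  <p:if name="bom-removal" test="starts-with(., '&#xFEFF;')">
    <p:identity>
      <p:with-input>
        <p:inline content-type="text/plain">{substring(., 2)}</p:inline>
      </p:with-input>
    </p:identity>
  </p:if>
</p:declare-step>
```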
Lines 36–40: Add basic XML markup according to the ixml grammar specified on Line 38. It can be challenging to add elaborate or nuanced markup with ixml, so our ixml step adds just some of the markup we want, and we use XProc steps and XSLT to massage it into the richer XML that we ultimately need.
Line 45: Like the @match attribute on <xsl:template> in XSLT, the @match attribute on an XProc <p:delete> step is an XPath pattern that matches elements anywhere in the document, so all we need is the element name, and not the full path. In this case we match either of the two elements that our ixml used to tag Project Gutenberg boilerplate and remove them.
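As an illustration, a deletion step of this kind might look like the sketch below. The element names gutenberg-header and gutenberg-footer are hypothetical stand-ins for whatever names the ixml grammar actually assigns.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- The pattern matches these elements wherever they occur, so no
       full path is needed; both element names are hypothetical -->
  <p:delete match="gutenberg-header | gutenberg-footer"/>
</p:declare-step>
```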
Lines 51–53: We can tag quotes automatically in XSLT by assuming that they occur in pairs within a paragraph; you can see how we do this by examining the XSLT stylesheet specified on Line 52. Our assumption could fail because plain-text documents often have stray unpaired quotation marks by accident, although as far as we can tell, quotation marks happen to be used correctly in our source. We cannot tag embedded quotes (marked with single quotes in the plain text) automatically because we cannot reliably distinguish single end-quotes from apostrophes.
Lines 58–60: We use the Schematron schema specified on Line 59 to verify that there are no remaining quotation marks within paragraphs, which would happen during the preceding step if there were an odd number of quotation mark characters. This Schematron validation will catch errors that result in an odd number of quotation marks within a paragraph, but if there is an even number of missing or superfluous quotation marks, they’ll cancel one another out. The Schematron validation, then, is imperfect, but better than nothing.
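A Schematron check along these lines could be as small as the following sketch. It assumes that paragraphs are tagged as p elements (a hypothetical name) and that the preceding step replaced every paired quotation mark with markup, so any surviving quotation mark character signals a problem.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- "p" is a hypothetical name for the paragraph element -->
    <sch:rule context="p">
      <sch:assert test="not(contains(., '&quot;'))">
        A straight double quotation mark remains in this paragraph;
        the quotation marks were probably not paired.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

In the pipeline, a schema like this would be supplied on the schema port of a <p:validate-with-schematron> step.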
Lines 66–68: We divided our markup enrichment over several steps, some using XSLT and some using XProc. We could have done all of the enrichment in a single XSLT stylesheet, but we found it easier to think about one thing at a time. Here we use the XSLT stylesheet specified on Line 67 to tag the front matter and adjust the capitalization in the chapter titles inside the body, and you can see how we do that by examining the XSLT stylesheet.
Lines 73–75: We could have added @id attributes to the body chapter titles and removed the associated elements with XSLT, but we opted to do it with XProc just to practice something different.
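The XProc approach can be sketched with a labeling step followed by a deletion step. The element names (chapter, title, number) and the form of the @id value are assumptions for illustration only.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Add an id attribute to each chapter title; the label expression
       is evaluated with the matched title as the context node -->
  <p:label-elements match="chapter/title" attribute="id"
      label="concat('ch', ../number)"/>

  <!-- Remove the elements whose values were copied into the ids -->
  <p:delete match="chapter/number"/>
</p:declare-step>
```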
Lines 81–83: This is our final cleanup of the XML, where we use the XSLT stylesheet specified on Line 82 to convert sequences of two hyphens (--) to em-dashes (—) and to convert single quotes, which are straight in the source ('), to curly (typographic) characters (‘ ’). Should the ixml analysis detect ambiguity, the ixml processor will write a namespaced @ixml:state attribute into the root element, and here we remove it.
Lines 88–92: Our XML is now ready for transformation to HTML or SVG final-form output, and before doing that we validate it against the Relax NG schema specified on Line 90 to confirm that the markup matches our expectations. We assign a @name attribute to this step on Line 88, and we’ll use that value on Line 120, below.
Line 97: For production purposes we don’t need to save the XML because the output we care about is the HTML and SVG that will be derived from it, but during development we might nonetheless want to be able to look at the XML. Accordingly, we employ a @use-when attribute to save the XML only when in debug mode. Since the working step itself is debug-only, we can put the @message on it instead of on a separate <p:identity> step.
Lines 101–03: We use the XSLT stylesheet specified on Line 102 to create an HTML reading view of our XML. Our HTML is HTML 5 with XML syntax, which is the only kind of HTML we use in Real Life.
Lines 108–113: We use <p:store> to write the HTML to disk. The serialization parameters specify that we want HTML 5 with XML syntax, we want to include an XML declaration but omit an HTML content-type specification, and we want to pretty-print the document. This combination of features is recommended for HTML 5 that uses XML syntax.
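The serialization settings just described could be expressed as in the sketch below. The output filename is hypothetical; the map keys are standard serialization parameter names, but the exact set used in blithedale.xpl may differ.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true">
    <p:empty/>
  </p:output>

  <!-- HTML 5 in XML syntax, with an XML declaration, without a
       content-type meta element, pretty-printed -->
  <p:store href="blithedale.xhtml"
      serialization="map{
        'method': 'xhtml',
        'html-version': 5,
        'omit-xml-declaration': false(),
        'include-content-type': false(),
        'indent': true()
      }"/>
</p:declare-step>
```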
Lines 118–23: We use a different XSLT stylesheet, specified on Line 122, to transform the same finalized XML to SVG. We have to use an explicit connection for the main (source) input port into this step because we’re taking input not from the immediately preceding step (which is what an implicit connection does), but from an earlier one. We put a @name attribute on the <p:validate-with-relax-ng> step that validates the final XML (line 88) so that we can point to it, and we use <p:pipe> on line 120 to tell the XProc processor to take the input into the new XSLT transformation from the result port of that step, instead of from the immediately preceding step, which would have been the default behavior otherwise.
This step introduces branching into our pipeline. Until now the content has been able to flow from the result port of each step into the source port of the immediately following one, but here we don’t want to use the HTML as the input to producing SVG, so we need to look further back to find the XML. As a result, there are two branches that emerge from the <p:validate-with-relax-ng> step that finalizes the XML. You can see this branching in the graph of the pipeline, below.
Lines 128–33: We use <p:store> to save the output of the SVG transformation to disk. Since that’s the last step, the SVG would normally also flow into the main result port, but we defined that as empty (lines 18–20), so the pipeline just throws away the SVG after writing it to disk and terminates the operation.
We configure the serialization method as XML because SVG is XML and it doesn’t have its own, SVG-specific serialization method, and we turn on pretty-printing to make the raw SVG easier to read. We omit the XML declaration because we might want to embed our SVG inside an HTML document (although we don’t do that here), and an XML declaration is not allowed on an embedded document.
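The branching and the SVG output described above can be sketched as follows. The step name final-xml and all filenames are hypothetical; what matters is that the second transformation reads explicitly from the named validation step rather than from the step immediately before it.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result" sequence="true">
    <p:empty/>
  </p:output>

  <!-- Named so that a later step can point back to its result port -->
  <p:validate-with-relax-ng name="final-xml">
    <p:with-input port="schema" href="blithedale.rng"/>
  </p:validate-with-relax-ng>

  <!-- Branch 1: implicit connection from the preceding step -->
  <p:xslt>
    <p:with-input port="stylesheet" href="to-html.xsl"/>
  </p:xslt>
  <p:store href="blithedale.xhtml"/>

  <!-- Branch 2: explicit connection back to the validated XML -->
  <p:xslt>
    <p:with-input port="source">
      <p:pipe step="final-xml" port="result"/>
    </p:with-input>
    <p:with-input port="stylesheet" href="to-svg.xsl"/>
  </p:xslt>
  <!-- SVG is serialized as plain XML, indented, without a declaration -->
  <p:store href="blithedale.svg"
      serialization="map{'method': 'xml', 'indent': true(),
                         'omit-xml-declaration': true()}"/>
</p:declare-step>
```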
Below is a graph of the pipeline, where, in the interest of making the graph less complicated and easier to read, we’ve omitted the <p:identity> steps that are used only during debugging. The large box labeled p:if / bom-removal is the step that removes the Unicode byte order mark if present. It’s complex because it includes what happens both when there is a BOM and when there isn’t, but if you ignore what’s inside the box for the moment and just look at what flows into it and out of it, you can see that the original input flows in, and what flows out is the original input after a possible leading BOM has been removed.
Four details to notice are that:
The overall flow of information through the pipeline is linear (the result port at the bottom of each step flows into the source port at the top of the next step) until the point where it splits to create two outputs, one in HTML and one in SVG. Where steps require secondary input (e.g., a validation step requires a schema, an XSLT step requires an XSLT stylesheet), those flow into the steps separately from boxes with cx:document in their top row.
After the step that deals with the BOM, the flow proceeds in linear fashion through:
A <p:invisible-xml> step that adds basic markup and transforms the plain text input into well-formed XML;
A <p:delete> step to remove the Gutenberg boilerplate header and footer;
A <p:xslt> step to tag the quotes;
A <p:validate-with-schematron> step to check the quotes;
A <p:xslt> step to tag the front matter;
A <p:label-elements> step to add @id attributes to the body titles;
A <p:delete> step to remove the elements whose values have been copied into the @id attributes created by the labeling step; and
A <p:validate-with-relax-ng> step to verify that the markup conforms to our expectations.
The result port of the step that validates with Relax NG has two out-arrows, each supplying input into the source port of a <p:xslt> step on the penultimate row of the graph. This step, then, creates a branch or split in the pipeline, and the two sides proceed on their own from there. One of the XSLT transformations creates the HTML reading view, while the other one creates the SVG. The output of each of the XSLT steps flows into its own <p:store> step, which writes the result to disk.
The graph shows the disposition of every output port (boxes in the bottom row of each step), both the primary ones (labeled result) and any possible additional ones (e.g., secondary for a <p:xslt> step; report for a validation step, etc.). All of those output ports either flow into another step or are discarded, which is represented by flowing into a small black dot. By the bottom of the graph everything has flowed into a small black dot, and the primary output, shown in the upper right of the graph, is therefore isolated, that is, nothing in the actual working pipeline flows into it. That’s consistent with our having specified that the primary output port should be empty.