Lesson plan: Invisible XML and XProc

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2025-02-27T17:33:13+0000

Date	Topic	Homework due next time
Wed, 2025-02-26	(Some other topic; not part of this unit)	Read Norm Tovey-Walsh’s Invisible XML introductory tutorial. You will not be able to follow Friday’s class if you haven’t read this tutorial. Install CoffeePot and Markup Blitz on your local machine, following our instructions at Configuring XProc and ixml processors. CoffeePot offers more debugging options than Markup Blitz, but Markup Blitz is likely to be faster (and often much faster) with large input files.
Fri, 2025-02-28	Invisible XML	Read Norm Tovey-Walsh’s Writing Invisible XML grammars and our Invisible XML and ambiguity. Complete and submit Invisible XML assignment 1. Install XML Calabash 3 and MorganaXProc-IIIse on your local machine, following our instructions at Configuring XProc and ixml processors. You will not be able to complete the XProc portion of this unit if you have not installed at least MorganaXProc-IIIse, and we recommend installing both. Read Part A of Martin Kraetke’s XProc 3.0 Tutorial (see the outline at the bottom of that page). You will not be able to follow Monday’s class if you haven’t read this part of the tutorial.
Mon, 2025-03-03	XProc 3.0	Read the remainder of Martin Kraetke’s XProc 3.0 Tutorial. You don’t have to memorize the details, but you’ll want to learn how to employ the features that are likely to be broadly useful (especially XSLT Transformations in XProc), and notice what else is there so that you can Look Stuff Up as the need arises. Complete and submit XProc assignment 1. You may wish to review our summary of XProc basics to remind yourself of key points before you begin to write your own pipeline. Feel free to copy from the pipelines below, making changes to accommodate the assignment-specific goals.
Wed, 2025-03-05	Using Invisible XML and XProc 3.0 together	(Homework for some other topic; not part of this unit)

Session 3: Invisible XML and XProc (Wednesday, 2025-03-05)

Files for the Blithedale activity are located at https://github.com/djbpitt/ixml/tree/main/blithedale.

Below is a copy of blithedale.xpl, with line annotations below the code.

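[Code listing of blithedale.xpl not preserved in this copy; consult the file in the repository linked above. The only surviving fragment is the XPath expression that strips the initial character:]

    text{{substring(.,2)}}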

Lines 2–5: Note the namespace declarations inside the start-tag. The cx: namespace prefix is used for XProc extension steps, that is, those that are not part of any of the Standard step and optional libraries. The xs: namespace prefix is used for standard datatypes, such as the Boolean (true/false) type in line 9.
Line 9: A static parameter is a variable that does not depend on input data. In this case we declare a $debug parameter that governs whether progress messages are displayed to the screen as the pipeline is processed. The default value is set to false() (no progress messages), and we can override that and set it to true() when we run the pipeline. To run with debug output (note the different syntax for specifying a true value for XML Calabash and MorganaXProc-IIIse):
- xmlcalabash blithedale.xpl debug="?true()"
- morgana blithedale.xpl -static:debug=true
Without debug information:
- xmlcalabash blithedale.xpl
- morgana blithedale.xpl
Lines 13–14: Read input over the Internet. Setting the value of @sequence to false means that the input must be exactly one document. Setting it true allows any number of documents, including zero.
Lines 18–20: We set the output as empty because we’re going to write our output to disk using <p:store>. Since we don’t have exactly one output document (we have none), we have to set @sequence to true, since a value of false would require exactly one output document (see above concerning <p:input>).
Line 21: Here and elsewhere (lines 31, 41, 46, 54, 61, 69, 76, 84, 93, 104, 114, 124, 133) we use <p:identity>, which passes through its input unchanged, just as a place to hang a @message attribute. The value of a @message attribute is written to the screen when the step is executed, and the @use-when attribute value controls whether the step should be executed or not. If $debug is true(), the messages are displayed; if it is false() (the default), they aren’t.

We could put a @message attribute on any step, but we want the working steps to be executed regardless of the value of the $debug parameter, so we can’t put a @use-when attribute on them. Putting the message on a separate <p:identity> step lets us control whether it is rendered; the working step is always executed.
Lines 25–32: Unicode documents may or may not begin with a byte order mark (BOM). We could have sniffed out the presence or absence of a BOM in our ixml step and handled it there, but instead we let XProc check for it and remove it if it is present. We used an XProc step (labeled p:if / bom-removal in the graph below) as a wrapper for a simple XPath expression that strips the initial character if it is a BOM.
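The BOM check can be illustrated outside XProc. The following is a minimal sketch in Python, not the pipeline's own code; the function name strip_bom is hypothetical. It assumes the text has already been decoded, so that a BOM, if present, appears as the single character U+FEFF at the start.

```python
BOM = "\ufeff"  # U+FEFF, the Unicode byte order mark


def strip_bom(text: str) -> str:
    """Return text without a leading BOM; leave all other input unchanged."""
    if text.startswith(BOM):
        # Same effect as the XPath expression substring(., 2):
        # keep everything from the second character onward.
        return text[1:]
    return text


if __name__ == "__main__":
    print(repr(strip_bom("\ufeffChapter I")))
    print(repr(strip_bom("Chapter I")))
```

As in the pipeline, the test comes first and the first character is dropped only when it really is a BOM, so input without one passes through untouched.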
Lines 36–40: Add basic XML markup according to the ixml grammar specified on Line 38. It can be challenging to add elaborate or nuanced markup with ixml, so our ixml step adds just some of the markup we want, and we use XProc steps and XSLT to massage it into the richer XML that we ultimately need.
Line 45: Like the @match attribute on <xsl:template> in XSLT, the @match attribute on an XProc <p:delete> step is an XPath pattern that matches elements anywhere in the document, so all we need is the element name, and not the full path. In this case we match the two elements that our ixml grammar used to tag the Project Gutenberg boilerplate header and footer, and we remove them.
Lines 51–53: We can tag quotes automatically in XSLT by assuming that they occur in pairs within a paragraph; you can see how we do this by examining the XSLT stylesheet specified on Line 52. Our assumption could fail because plain-text documents often contain stray unpaired quotation marks, although as far as we can tell, quotation marks are used consistently in our source. We cannot tag embedded quotes (marked with single quotes in the plain text) automatically because we cannot reliably distinguish single end-quotes from apostrophes.
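The pairing assumption can be sketched in a few lines. This is an illustration in Python rather than the XSLT the pipeline actually uses; the function name tag_quotes and the <q> element name are hypothetical, and the stylesheet specified on Line 52 remains the authoritative version. The sketch treats a paragraph as a plain string and does no XML escaping.

```python
import re


def tag_quotes(paragraph: str) -> str:
    """Wrap each pair of straight double quotation marks in a <q> element.

    Assumes quotation marks occur in pairs within the paragraph. An
    unpaired mark has no partner to match, so it is left in place.
    """
    # "([^"]*)" matches an opening mark, the quoted text, and a closing mark.
    return re.sub(r'"([^"]*)"', r"<q>\1</q>", paragraph)


if __name__ == "__main__":
    print(tag_quotes('"Yes," he said. "Not today."'))
    print(tag_quotes('He whispered, "stay'))
```

In the second call the single quotation mark survives untagged, which is exactly the kind of leftover that a later validation step can look for.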
Lines 58–60: We use the Schematron schema specified on Line 59 to verify that there are no remaining quotation marks within paragraphs, which would happen during the preceding step if there were an odd number of quotation mark characters. This Schematron validation will catch errors that result in an odd number of quotation marks within a paragraph, but if there is an even number of missing or superfluous quotation marks, they’ll cancel one another out. The Schematron validation, then, is imperfect, but better than nothing.
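The limitation of this check is easy to demonstrate. The sketch below is Python, not the Schematron schema specified on Line 59, and the function name has_unpaired_quote is hypothetical. It relies on the observation that, under the pairing assumption, a quotation mark is left over exactly when the paragraph contains an odd number of them.

```python
def has_unpaired_quote(paragraph: str) -> bool:
    """Report whether a paragraph holds an odd number of double quotation marks."""
    return paragraph.count('"') % 2 == 1


if __name__ == "__main__":
    # One closing mark is missing: three marks, so the error is caught.
    print(has_unpaired_quote('"Yes," he said. "Not today.'))
    # Two marks are missing: two marks remain, so the errors cancel out
    # and the check reports nothing, even though the text is wrong.
    print(has_unpaired_quote('"Yes, he said. "Not today.'))
```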
Lines 66–68: We divided our markup enrichment over several steps, some using XSLT and some using XProc. We could have done all of the enrichment in a single XSLT stylesheet, but we found it easier to think about one thing at a time. Here we use the XSLT stylesheet specified on Line 67 to tag the front matter and adjust the capitalization in the chapter titles inside the body, and you can see how we do that by examining the XSLT stylesheet.
Lines 73–75: We could have added @id attributes to the body chapter titles and removed the associated elements that supply their values with XSLT, but we opted to do it with XProc just to practice something different.
Line 81: This is our final cleanup of the XML, where we use the XSLT stylesheet specified on Line 82 to convert sequences of two hyphens (--) to em-dashes (—) and to convert single quotes, which are straight in the source ('), to curly (typographic) characters (‘ ’).

Should the ixml analysis detect ambiguity, the ixml processor will write a namespaced @ixml:state attribute onto the root element, and here we remove it.
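The hyphen and quote conversion in this cleanup step can be sketched as follows. This is Python, not the XSLT stylesheet specified on Line 82, and the rule for choosing between opening and closing quotes is an assumption made for illustration: a straight quote at the start of the text, or after whitespace, an opening bracket, or an em-dash, is treated as opening, and every other one as closing or an apostrophe. The function name typographic_cleanup is hypothetical.

```python
import re


def typographic_cleanup(text: str) -> str:
    """Convert -- to an em-dash and straight single quotes to curly ones."""
    # Two hyphens become one em-dash (U+2014).
    text = text.replace("--", "\u2014")
    # Assumed heuristic: a quote NOT preceded by an ordinary character
    # (that is, at the start, or after whitespace, ( or [ or an em-dash)
    # opens a quotation and becomes U+2018.
    text = re.sub(r"(?<![^\s(\[\u2014])'", "\u2018", text)
    # Every remaining straight quote is a closing quote or an apostrophe,
    # and both are written as U+2019.
    # Limitation: a word-initial apostrophe is wrongly treated as opening.
    return text.replace("'", "\u2019")


if __name__ == "__main__":
    print(typographic_cleanup("He said--'I don't know'--and left."))
```

Because closing single quotes and apostrophes share the same curly character, this conversion does not need to tell them apart, unlike the tagging of embedded quotes.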
Lines 88–92: Our XML is now ready for transformation to HTML or SVG final-form output, and before doing that we validate it against the Relax NG schema specified on Line 90 to confirm that the markup matches our expectations. We assign a @name attribute to this step on Line 88, and we’ll use that value on Line 120, below.
Line 97: For production purposes we don’t need to save the XML because the output we care about is the HTML and SVG that will be derived from it, but during development we might nonetheless want to be able to look at the XML. Accordingly, we employ a @use-when attribute to save the XML only when in debug mode. Since the working step itself is debug-only, we can put the @message on it instead of on a separate <p:identity> step.
Lines 101–03: We use the XSLT stylesheet specified on Line 102 to create an HTML reading view of our XML. Our HTML is HTML 5 with XML syntax, which is the only kind of HTML we use in Real Life.
Lines 108–13: We use <p:store> to write the HTML to disk. The serialization parameters specify that we want HTML 5 with XML syntax, we want to include an XML declaration but omit an HTML content-type specification, and we want to pretty-print the document. This combination of features is recommended for HTML 5 that uses XML syntax.
Lines 118–23: We use a different XSLT stylesheet, specified on Line 122, to transform the same finalized XML to SVG. We have to use an explicit connection for the main (source) input port into this step because we’re taking input not from the immediately preceding step (which is what an implicit connection does), but from an earlier one. We put a @name attribute on the Relax NG validation step that validates the final XML (line 88) so that we can point to it, and we use an explicit connection on line 120 to tell the XProc processor to take the input into the new XSLT transformation from the result port of that step, instead of from the immediately preceding step, which would have been the default behavior otherwise.

This step introduces branching into our pipeline. Until now the content has been able to flow from the result port of each step into the source port of the immediately following one, but here we don’t want to use the HTML as the input to producing SVG, so we need to look further back to find the XML. As a result, there are two branches that emerge from the validation step that finalizes the XML. You can see this branching in the graph of the pipeline, below.
Lines 128–33: We use <p:store> to save the output of the SVG transformation to disk. Since that’s the last step, the SVG would normally also flow into the main result port, but we defined that as empty (lines 18–20), so the pipeline just throws away the SVG after writing it to disk and terminates the operation.

We configure the serialization method as XML because SVG is XML and it doesn’t have its own, SVG-specific serialization method, and we turn on pretty-printing to make the raw SVG easier to read. We omit the XML declaration because we might want to embed our SVG inside an HTML document (although we don’t do that here), and an XML declaration is not allowed on an embedded document.

Below is a graph of the pipeline, where, in the interest of making the graph less complicated and easier to read, we’ve omitted the <p:identity> steps that are used only during debugging. The large box labeled p:if / bom-removal is the step that removes the Unicode byte order mark if present. It’s complex because it includes what happens both when there is a BOM and when there isn’t, but if you ignore what’s inside the box for the moment and just look at what flows into it and out of it, you can see that the original input flows in, and what flows out is the original input after a possible leading BOM has been removed.

Four details to notice are that:

The overall flow of information through the pipeline is linear (the result port at the bottom of each step flows into the source port at the top of the next step) until the point where it splits to create two outputs, one in HTML and one in SVG. Where steps require secondary input (e.g., a validation step requires a schema, an XSLT step requires an XSLT stylesheet), those flow into the steps separately from boxes with cx:document in their top row.
After the step that deals with the BOM, the flow proceeds in linear fashion through:
1. An Invisible XML step that adds basic markup and transforms the plain text input into well-formed XML;
2. A <p:delete> step to remove the Gutenberg boilerplate header and footer;
3. A <p:xslt> step to tag the quotes;
4. A Schematron validation step to check the quotes;
5. A <p:xslt> step to tag the front matter;
6. A labeling step to add @id attributes to the body titles;
7. A <p:delete> step to remove the elements whose values have been copied into the @id attributes created by the labeling step; and
8. A Relax NG validation step to verify that the markup conforms to our expectations.
The result port of the step that validates with Relax NG has two out-arrows, each supplying input into the source port of a <p:xslt> step on the penultimate row of the graph. This step, then, creates a branch or split in the pipeline, and the two sides proceed on their own from there. One of the XSLT transformations creates the HTML reading view, while the other one creates the SVG. The output of each of the XSLT steps flows into its own <p:store> step, which writes the result to disk.
The graph shows the disposition of every output port (boxes in the bottom row of each step), both the primary ones (labeled result) and any possible additional ones (e.g., secondary for a <p:xslt> step; report for a validation step, etc.). All of those output ports either flow into another step or are discarded, which is represented by flowing into a small black dot. By the bottom of the graph everything has flowed into a small black dot, and the primary output, shown in the upper right of the graph, is therefore isolated, that is, nothing in the actual working pipeline flows into it. That’s consistent with our having specified that the primary output port should be empty.
