Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2025-02-27T17:33:13+0000
This three-session lesson plan introduces learners to two relatively new XML technologies, both first published in 2022:
Invisible XML (ixml) is a language that can be used to add markup to plain-text resources, a task you may previously have undertaken with regular expressions.
Processing ixml: You can practice using ixml at the online jωiXML Invisible XML Workbench without installing anything on your local machine. We nonetheless recommend working locally by installing the CoffeePot and Markup Blitz ixml processors because they can handle larger input files, and we provide instructions for those installations at Configuring XProc and ixml processors.
Optional: Our installation page also provides instructions for installing xmq and using it for stand-alone ixml processing. You don’t need xmq for this week’s activities, but as you develop ixml grammars, you might find it helpful to have access to an additional processor.
Authoring ixml: <oXygen/> does not currently support ixml. You can author your own ixml grammars in <oXygen/> as plain-text documents, but <oXygen/> does not provide validation, processing, or other ixml-specific support. The conventional filename extension for ixml grammar documents is .ixml.
XProc 3.0 is a language for processing pipelines, that is, sequences of operations, where the output of one operation may function as the input to the next.
Processing XProc: To use XProc you need to install either XML Calabash 3 or MorganaXProc-IIIse on your local machine, and we provide instructions for those installations at Configuring XProc and ixml processors. <oXygen/> builds in support for XProc 1.0, but not (yet) for 3.0, so you need to install your own 3.0 processor. Once you’ve installed XML Calabash 3 or MorganaXProc-IIIse on your local machine, it’s possible to configure an <oXygen/> transformation scenario to use one or the other of them to process an XProc pipeline, but we find it simpler to process XProc at the command line.
Authoring XProc: You can author XProc 3.0 in <oXygen/>, which recognizes XProc as a file type, but the validation is not up to date with all features of XProc 3.0, which means that <oXygen/> may incorrectly report errors for valid code and may not provide all content completion options. The conventional filename extension for XProc documents is .xpl.
This unit takes place over three fifty-minute sessions:
| Date | Topic | Homework due next time |
|---|---|---|
| Wed, 2025-02-26 | (Some other topic; not part of this unit) | |
| Fri, 2025-02-28 | Invisible XML | |
| Mon, 2025-03-03 | XProc 3.0 | |
| Wed, 2025-03-05 | Using Invisible XML and XProc 3.0 together | (Homework for some other topic; not part of this unit) |
Why use ixml: Tagging Shakespeare sonnets (data)
All Shakespeare sonnets: https://github.com/djbpitt/ixml/blob/main/sonnets/sonnets.txt
First three sonnets only: https://github.com/djbpitt/ixml/blob/main/sonnets/sonnets-small.txt
Introduction to ixml grammars: Tagging a date
Ambiguity: Grammars are not regular expressions
Input: https://github.com/djbpitt/ixml/blob/main/roman/roman.txt
Ambiguous grammar: https://github.com/djbpitt/ixml/blob/main/roman/roman-ambiguous.ixml
Regex: greedy-roman-numeral (at http://regexpal.com)
Unambiguous grammar: https://github.com/djbpitt/ixml/blob/main/roman/roman-unambiguous.ixml
Tagging Shakespeare sonnets (grammars)
For Unicode character classes (e.g., L to match any letter) see Unicode Standard Annex #44 (Unicode Character Database), §5.7.1 General Category Values.
Files for the sonnets activity are located at https://github.com/djbpitt/ixml/tree/main/sonnets.
Our summary of XProc basics, simplified to focus on the type of XProc we write in this unit.
Step by step XProc for Shakespearean sonnets:
Create XProc skeleton.
Read input into pipeline input using <p:input>. Input can be either local or remote.
Emit result on pipeline output using <p:output> (we modify this later).
Test pipeline with <p:identity>.
Tag input using <p:invisible-xml> step. Requires ixml grammar.
Transform XML to HTML using <p:xslt> step. Requires XSLT stylesheet.
Specify serialization parameters on <p:output>.
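The steps above can be assembled as in the following minimal sketch. It is an illustration rather than the lesson's own code: the file names sonnets.ixml and sonnets-to-html.xsl are assumptions, as is the particular choice of serialization parameters.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <!-- Pipeline input: a local plain-text file (a remote URL would also work as the href) -->
    <p:input port="source" primary="true">
        <p:document href="sonnets-small.txt" content-type="text/plain"/>
    </p:input>
    <!-- Pipeline output, with serialization parameters for HTML 5 in XML syntax -->
    <p:output port="result"
        serialization="map{'method': 'xhtml', 'html-version': 5, 'indent': true()}"/>
    <!-- Tag the plain text according to an ixml grammar (file name is hypothetical) -->
    <p:invisible-xml>
        <p:with-input port="grammar">
            <p:document href="sonnets.ixml" content-type="text/plain"/>
        </p:with-input>
    </p:invisible-xml>
    <!-- Transform the resulting XML to HTML (stylesheet name is hypothetical) -->
    <p:xslt>
        <p:with-input port="stylesheet" href="sonnets-to-html.xsl"/>
    </p:xslt>
</p:declare-step>
```

Replacing the two working steps with a single <p:identity/> gives a pipeline that simply echoes its input, which is a convenient way to test the skeleton before adding the grammar and stylesheet.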
Files for the Blithedale activity are located at https://github.com/djbpitt/ixml/tree/main/blithedale.
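Below is a copy of blithedale.xpl, with line annotations below the code.
[Code listing not preserved here; see blithedale.xpl in the repository linked above. The line numbers in the annotations refer to that listing. The only text that survives from the listing is text{{substring(.,2)}}, which belongs to the BOM-removal step described below.]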
Lines 2–5: Note the namespace declarations inside the start-tag. The cx: namespace prefix is used for XProc extension steps, that is, those that are not part of any of the standard or optional step libraries. The xs: namespace prefix is used for standard datatypes, such as the Boolean (true/false) type in line 9.
Line 9: A static parameter is a variable that does not depend on input data. In this case we declare a $debug parameter that governs whether progress messages are displayed to the screen as the pipeline is processed. The default value is set to false() (no progress messages), and we can override that and set it to true() when we run the pipeline. To run with debug output (note the different syntax for specifying a true value for XML Calabash and MorganaXProc-IIIse):
xmlcalabash blithedale.xpl debug="?true()"
morgana blithedale.xpl -static:debug=true
Without debug information:
xmlcalabash blithedale.xpl
morgana blithedale.xpl
Lines 13–14: Read input over the Internet. Setting the value of @sequence to false means that the input must be exactly one document. Setting it to true allows any number of documents, including zero.
Lines 18–20: We set the output as empty because we’re going to write our output to disk using <p:store>. Since we don’t have exactly one output file (we have none), we have to set @sequence to true, since a value of false would require exactly one output document (see above concerning <p:input>).
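As a hedged illustration of the declarations discussed so far, the opening of such a pipeline might look like the sketch below. The input URL is a placeholder, not the address that blithedale.xpl reads from, and the single <p:identity/> stands in for the working steps. In XProc 3.0 the static parameter is declared as a <p:option> with static="true".

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:cx="http://xmlcalabash.com/ns/extensions"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="3.0">
    <!-- Static option: false() by default, overridable when the pipeline is run -->
    <p:option name="debug" as="xs:boolean" static="true" select="false()"/>
    <!-- Exactly one input document (sequence="false"); the URL is a placeholder -->
    <p:input port="source" primary="true" sequence="false"
        href="https://example.org/blithedale.txt"/>
    <!-- No documents on the primary output, so sequence must be true -->
    <p:output port="result" sequence="true">
        <p:empty/>
    </p:output>
    <!-- Stand-in for the working steps -->
    <p:identity/>
</p:declare-step>
```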
Line 21: Here and elsewhere (lines 31, 41, 46, 54, 61, 69, 76, 84, 93, 104, 114, 124, 133) we use <p:identity>, which passes through its input unchanged, just as a place to hang a @message attribute. The value of a @message attribute is written to the screen when the step is executed, and the @use-when attribute value controls whether the step should be executed or not. If $debug is true(), the messages are displayed; if it is false() (the default), they aren’t.
We could put a @message attribute on any step, but we want the working steps to be executed regardless of the value of the $debug parameter, so we can’t put a @use-when attribute on them. Putting the message on a separate <p:identity> step lets us control whether it is rendered; the working step is always executed.
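The pattern can be sketched as follows. This is a minimal illustration, not an excerpt from blithedale.xpl: the message text and the <p:add-attribute> working step (with its attribute name and value) are invented for the example.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    version="3.0">
    <p:option name="debug" as="xs:boolean" static="true" select="false()"/>
    <p:input port="source"/>
    <p:output port="result"/>
    <!-- Present only when $debug is true; prints its message and passes input through -->
    <p:identity use-when="$debug" message="About to mark the document as processed"/>
    <!-- Working step: no use-when, so it always runs (hypothetical example step) -->
    <p:add-attribute match="/*" attribute-name="processed" attribute-value="true"/>
</p:declare-step>
```

Because @use-when is evaluated statically, the <p:identity> step is absent from the pipeline altogether when $debug is false, so it adds no cost to a normal run.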
Lines 25–32: Unicode documents may or may not begin with a byte order mark (BOM). We could have sniffed out the presence or absence of a BOM in our ixml step and handled it there, but instead we let XProc check for it and remove it if it is present. We use a <p:if> step as a wrapper for a simple XPath expression that strips the initial character if it is a BOM.
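One way to express this idea is sketched below. It is not the code from blithedale.xpl: it uses the standard <p:text-replace> step, which may differ from the XPath expression used there, and the step name bom-removal is borrowed from the label in the pipeline graph.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source" content-types="text"/>
    <p:output port="result" content-types="text"/>
    <!-- If the text begins with U+FEFF, remove it; otherwise p:if passes the input through -->
    <p:if name="bom-removal" test="starts-with(., '&#xFEFF;')">
        <p:text-replace pattern="^&#xFEFF;" replacement=""/>
    </p:if>
</p:declare-step>
```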
Lines 36–40: Add basic XML markup according to the ixml grammar specified on Line 38. It can be challenging to add elaborate or nuanced markup with ixml, so our ixml step adds just some of the markup we want, and we use XProc steps and XSLT to massage it into the richer XML that we ultimately need.
Line 45: Like the @match attribute on <xsl:template> in XSLT, the @match attribute on an XProc <p:delete> step is an XPath pattern that matches elements anywhere in the document, so all we need is the element name, and not the full path. In this case we match either of the two elements that our ixml used to tag Project Gutenberg boilerplate, and the step removes them.
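A minimal sketch of such a step follows. The element names gutenberg-header and gutenberg-footer are hypothetical stand-ins for whatever names the ixml grammar assigns to the boilerplate.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source"/>
    <p:output port="result"/>
    <!-- The pattern names the elements only; no path from the root is needed -->
    <p:delete match="gutenberg-header | gutenberg-footer"/>
</p:declare-step>
```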
Lines 51–53: We can tag quotes automatically in XSLT by assuming that they occur in pairs within a paragraph; you can see how we do this by examining the XSLT stylesheet specified on Line 52. Our assumption could fail because plain-text documents often have stray unpaired quotation marks by accident, although as far as we can tell, quotation marks happen to be used correctly in our source. We cannot tag embedded quotes (marked with single quotes in the plain text) automatically because we cannot reliably distinguish single end-quotes from apostrophes.
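The pairing assumption can be illustrated with a short XSLT sketch. This is not the stylesheet used in the pipeline: the element names p and q are assumptions, as is the premise that the source uses straight double quotation marks, and the sketch pairs quotation marks only within a single text node.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <!-- Copy everything that no other template matches -->
    <xsl:mode on-no-match="shallow-copy"/>
    <!-- Within a paragraph, wrap each pair of double quotation marks in a q element -->
    <xsl:template match="p/text()">
        <xsl:analyze-string select="." regex="&quot;([^&quot;]*)&quot;">
            <xsl:matching-substring>
                <q><xsl:value-of select="regex-group(1)"/></q>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

With this approach an unpaired quotation mark is simply left in the text, which is what makes a later check for leftover quotation marks useful.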
Lines 58–60: We use the Schematron schema specified on Line 59 to verify that there are no remaining quotation marks within paragraphs, which would happen during the preceding step if there were an odd number of quotation mark characters. This Schematron validation will catch errors that result in an odd number of quotation marks within a paragraph, but if there is an even number of missing or superfluous quotation marks, they’ll cancel one another out. The Schematron validation, then, is imperfect, but better than nothing.
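A check of this kind might be written as in the sketch below, which assumes (hypothetically) that paragraphs are tagged as p and that the source uses straight double quotation marks. It is an illustration, not the schema from the repository.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="p">
            <!-- Report any paragraph whose own text still contains a quotation mark -->
            <sch:report test="text()[contains(., '&quot;')]">
                Stray quotation mark left in paragraph.
            </sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```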
Lines 66–68: We divided our markup enrichment over several steps, some using XSLT and some using XProc. We could have done all of the enrichment in a single XSLT stylesheet, but we found it easier to think about one thing at a time. Here we use the XSLT stylesheet specified on Line 67 to tag the front matter and adjust the capitalization in the chapter titles inside the body, and you can see how we do that by examining the XSLT stylesheet.
Lines 73–75: We could have added @id attributes to the body chapter titles and removed the associated elements (whose values are copied into those attributes) with XSLT, but we opted to do it with XProc just to practice something different.
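The two XProc steps involved might look like the following sketch. The element names chapter, title, and number, and the ch prefix on the identifier, are hypothetical; the actual pipeline may label and delete different elements.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source"/>
    <p:output port="result"/>
    <!-- Copy the value of a sibling number element into an id attribute on each title -->
    <p:label-elements match="chapter/title" attribute="id"
        label="concat('ch', ../number)"/>
    <!-- The number elements are now redundant -->
    <p:delete match="chapter/number"/>
</p:declare-step>
```

The @label value is an XPath expression evaluated once per matched element, with that element as the context node, which is why ../number can reach the sibling.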
Line 81: This is our final cleanup of the XML, where we use the XSLT stylesheet specified in line 82 to convert sequences of two hyphens (--) to em-dashes (—) and we convert single quotes, which are straight in the source ('), to curly (typographic) characters (‘ ’).
Should the ixml analysis detect ambiguity, the ixml processor will write a namespaced @ixml:state attribute into the root element, and here we remove it.
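Part of this cleanup can be sketched in XSLT as shown below. The sketch covers only the hyphen conversion and the attribute removal; the conversion of single quotes is omitted. It is an illustration, not the stylesheet from the repository. The attribute defined by the ixml specification for reporting ambiguity is ixml:state, in the namespace http://invisiblexml.org/NS.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ixml="http://invisiblexml.org/NS"
    exclude-result-prefixes="ixml" version="3.0">
    <!-- Copy everything that no other template matches -->
    <xsl:mode on-no-match="shallow-copy"/>
    <!-- Replace each pair of hyphens with an em-dash (U+2014) -->
    <xsl:template match="text()">
        <xsl:value-of select="replace(., '--', '&#x2014;')"/>
    </xsl:template>
    <!-- Drop the attribute that an ixml processor adds when the parse is ambiguous -->
    <xsl:template match="@ixml:state"/>
</xsl:stylesheet>
```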
Lines 88–92: Our XML is now ready for transformation to HTML or SVG final-form output, and before doing that we validate it against the Relax NG schema specified on Line 90 to confirm that the markup matches our expectations. We assign a @name attribute to this step on Line 88, and we’ll use that value on Line 120, below.
Line 97: For production purposes we don’t need to save the XML because the output we care about is the HTML and SVG that will be derived from it, but during development we might nonetheless want to be able to look at the XML. Accordingly, we employ a @use-when attribute to save the XML only when in debug mode. Since the working step itself is debug-only, we can put the @message on it instead of on a separate <p:identity> step.
Lines 101–03: We use the XSLT stylesheet specified on Line 102 to create an HTML reading view of our XML. Our HTML is HTML 5 with XML syntax, which is the only kind of HTML we use in Real Life.
Lines 108–113: We use <p:store> to write the HTML to disk. The serialization parameters specify that we want HTML 5 with XML syntax, we want to include an XML declaration but omit an HTML content-type specification, and we want to pretty-print the document. This combination of features is recommended for HTML 5 that uses XML syntax.
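A <p:store> step with serialization parameters of this kind might look like the sketch below. The output file name is hypothetical, and the parameter map is a reading of the description above rather than a copy of the original code.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source"/>
    <p:output port="result" sequence="true">
        <p:empty/>
    </p:output>
    <!-- Write HTML 5 in XML syntax to disk; the output file name is hypothetical -->
    <p:store href="blithedale.xhtml"
        serialization="map{
            'method': 'xhtml',
            'html-version': 5,
            'omit-xml-declaration': false(),
            'include-content-type': false(),
            'indent': true()
        }"/>
</p:declare-step>
```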
Lines 118–23: We use a different XSLT stylesheet, specified on Line 122, to transform the same finalized XML to SVG. We have to use an explicit connection for the main (source) input port into this step because we’re taking input not from the immediately preceding step (which is what an implicit connection does), but from an earlier one. We put a @name attribute on the <p:validate-with-relax-ng> step that validates the final XML (line 88) so that we can point to it, and on line 120 we tell the XProc processor to take the input into the new XSLT transformation from the result port of that step, instead of from the immediately preceding step, which would have been the default behavior otherwise.
This step introduces branching into our pipeline. Until now the content has been able to flow from the result port of each step into the source port of the immediately following one, but here we don’t want to use the HTML as the input to producing SVG, so we need to look further back to find the XML. As a result, there are two branches that emerge from the <p:validate-with-relax-ng> step that finalizes the XML. You can see this branching in the graph of the pipeline, below.
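The branching can be sketched as follows. All file names and the step name final-xml are hypothetical, and <p:pipe> is only one of the ways XProc 3.0 offers for making an explicit connection; the original pipeline may use another.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source"/>
    <p:output port="result" sequence="true">
        <p:empty/>
    </p:output>
    <!-- Named so that later steps can refer back to its result port -->
    <p:validate-with-relax-ng name="final-xml">
        <p:with-input port="schema">
            <p:document href="blithedale.rng" content-type="application/xml"/>
        </p:with-input>
    </p:validate-with-relax-ng>
    <!-- Branch 1: implicit connection from the immediately preceding step -->
    <p:xslt>
        <p:with-input port="stylesheet" href="to-html.xsl"/>
    </p:xslt>
    <p:store href="blithedale.xhtml"/>
    <!-- Branch 2: explicit connection back to the validated XML, not to the HTML -->
    <p:xslt>
        <p:with-input port="source">
            <p:pipe step="final-xml" port="result"/>
        </p:with-input>
        <p:with-input port="stylesheet" href="to-svg.xsl"/>
    </p:xslt>
    <p:store href="blithedale.svg"/>
</p:declare-step>
```

Without the explicit connection, the second <p:xslt> would read from the first <p:store>, which passes the HTML it has just written through on its own result port.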
Lines 128–33: We use <p:store> to save the output of the SVG transformation to disk. Since that’s the last step, the SVG would normally also flow into the main result port, but we defined that as empty (lines 18–20), so the pipeline just throws away the SVG after writing it to disk and terminates the operation.
We configure the serialization method as XML because SVG is XML and it doesn’t have its own, SVG-specific serialization method, and we turn on pretty-printing to make the raw SVG easier to read. We omit the XML declaration because we might want to embed our SVG inside an HTML document (although we don’t do that here), and an XML declaration is not allowed on an embedded document.
Below is a graph of the pipeline, where, in the interest of making the graph less complicated and easier to read, we’ve omitted the <p:identity> steps that are used only during debugging. The large box labeled p:if / bom-removal is the step that removes the Unicode byte order mark if present. It’s complex because it includes what happens both when there is a BOM and when there isn’t, but if you ignore what’s inside the box for the moment and just look at what flows into it and out of it, you can see that the original input flows in, and what flows out is the original input after a possible leading BOM has been removed.
Four details to notice are that:
The overall flow of information through the pipeline is linear (the result port at the bottom of each step flows into the source port at the top of the next step) until the point where it splits to create two outputs, one in HTML and one in SVG. Where steps require secondary input (e.g., a validation step requires a schema, an XSLT step requires an XSLT stylesheet), those flow into the steps separately from boxes with cx:document in their top row.
After the step that deals with the BOM, the flow proceeds in linear fashion through:
A <p:invisible-xml> step that adds basic markup and transforms the plain text input into well-formed XML;
A <p:delete> step to remove the Gutenberg boilerplate header and footer;
A <p:xslt> step to tag the quotes;
A <p:validate-with-schematron> step to check the quotes;
A <p:xslt> step to tag the front matter;
A <p:label-elements> step to add @id attributes to the body titles;
A <p:delete> step to remove the elements whose values have been copied into the @id attributes created by the labeling step; and
A <p:validate-with-relax-ng> step to verify that the markup conforms to our expectations.
The result port of the step that validates with Relax NG has two out-arrows, each supplying input into the source port of a <p:xslt> step on the penultimate row of the graph. This step, then, creates a branch or split in the pipeline, and the two sides proceed on their own from there. One of the XSLT transformations creates the HTML reading view, while the other one creates the SVG. The output of each of the XSLT steps flows into its own <p:store> step, which writes the result to disk.
The graph shows the disposition of every output port (boxes in the bottom row of each step), both the primary ones (labeled result) and any possible additional ones (e.g., secondary for a <p:xslt> step; report for a validation step, etc.). All of those output ports either flow into another step or are discarded, which is represented by flowing into a small black dot. By the bottom of the graph everything has flowed into a small black dot, and the primary output, shown in the upper right of the graph, is therefore isolated, that is, nothing in the actual working pipeline flows into it. That’s consistent with our having specified that the primary output port should be empty.