Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2025-02-27T17:33:13+0000


Lesson plan: Invisible XML and XProc

Technology overview: what you will learn

This three-session lesson plan introduces learners to two relatively new XML technologies, both first published in 2022: Invisible XML (ixml) and XProc 3.0.

Schedule overview

This unit takes place over three fifty-minute sessions:

Date: Wed, 2025-02-26
Topic: (Some other topic; not part of this unit)
Homework due next time:

  1. Read Norm Tovey-Walsh’s Invisible XML introductory tutorial. You will not be able to follow Friday’s class if you haven’t read this tutorial.

  2. Install CoffeePot and Markup Blitz on your local machine, following our instructions at Configuring XProc and ixml processors. CoffeePot offers more debugging options than Markup Blitz, but Markup Blitz is likely to be faster (and often much faster) with large input files.

Date: Fri, 2025-02-28
Topic: Invisible XML
Homework due next time:

  1. Read Norm Tovey-Walsh’s Writing Invisible XML grammars and our Invisible XML and ambiguity.

  2. Complete and submit Invisible XML assignment 1.

  3. Install XML Calabash 3 and MorganaXProc-IIIse on your local machine, following our instructions at Configuring XProc and ixml processors. You will not be able to complete the XProc portion of this unit if you have not installed at least MorganaXProc-IIIse, and we recommend installing both.

  4. Read Part A of Martin Kraetke’s XProc 3.0 Tutorial (see the outline at the bottom of that page). You will not be able to follow Monday’s class if you haven’t read this part of the tutorial.

Date: Mon, 2025-03-03
Topic: XProc 3.0
Homework due next time:

  1. Read the remainder of Martin Kraetke’s XProc 3.0 Tutorial. You don’t have to memorize the details, but you’ll want to learn how to employ the features that are likely to be broadly useful (especially XSLT Transformations in XProc), and notice what else is there so that you can Look Stuff Up as the need arises.

  2. Complete and submit XProc assignment 1. You may wish to review our summary of XProc basics to remind yourself of key points before you begin to write your own pipeline. Feel free to copy from the pipelines below, making changes to accommodate the assignment-specific goals.

Date: Wed, 2025-03-05
Topic: Using Invisible XML and XProc 3.0 together
Homework due next time: (Homework for some other topic; not part of this unit)

Session 1: Invisible XML (Friday, 2025-02-28)

  1. Why use ixml: Tagging Shakespeare sonnets (data)

  2. Introduction to ixml grammars: Tagging a date

  3. Ambiguity: Grammars are not regular expressions

  4. Tagging Shakespeare sonnets (grammars)

For Unicode character classes (e.g., L to match any letter) see Unicode Standard Annex #44 (Unicode Character Database), §5.7.1 General Category Values.
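
As a concrete illustration of the date activity and of Unicode character classes, the sketch below wraps a small ixml grammar in an XProc 3.0 pipeline (XProc is introduced in Session 2). The grammar, the sample date, and the decision to supply both inline are assumptions made for illustration; they are not the materials used in class. The month rule uses the classes Lu (uppercase letter) and Ll (lowercase letter) instead of explicit character ranges.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: grammar and sample input are invented for illustration -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:output port="result"/>
    <p:invisible-xml>
        <!-- The ixml grammar, supplied inline as a text document -->
        <p:with-input port="grammar">
            <p:inline content-type="text/plain">
date: day, -" ", month, -" ", year.
day: digit, digit?.
month: [Lu], [Ll]+.
year: digit, digit, digit, digit.
-digit: ["0"-"9"].
            </p:inline>
        </p:with-input>
        <!-- The plain text to be tagged -->
        <p:with-input port="source">
            <p:inline content-type="text/plain">28 February 2025</p:inline>
        </p:with-input>
    </p:invisible-xml>
</p:declare-step>
```

With this grammar the input should come back as a date element with day, month, and year children. The - mark on the literal spaces and on the digit rule keeps the separators and the individual digit wrappers out of the output. The grammar avoids ixml comments on purpose: they are written with curly braces, which XProc treats as text value templates inside p:inline by default.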

Session 2: XProc 3.0 (Monday, 2025-03-03)

  1. Files for the sonnets activity are located at https://github.com/djbpitt/ixml/tree/main/sonnets.

  2. Our summary of XProc basics, simplified to focus on the type of XProc we write in this unit.

  3. Step by step XProc for Shakespearean sonnets:

    1. Create XProc skeleton.

    2. Read input into the pipeline’s input port. Input can be either local or remote:

      • Local input

      • Remote input

    3. Emit the result on the pipeline’s output port (we modify this later).

    4. Test the pipeline.

    5. Tag input using a p:invisible-xml step. Requires ixml grammar.

    6. Transform XML to HTML using a p:xslt step. Requires XSLT stylesheet.

    7. Specify serialization parameters for the output.
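
The steps above can be sketched as a single small pipeline. This is one possible reading of the steps, not the pipeline built in class: the three file names (sonnets.txt, sonnets.ixml, sonnets-to-html.xsl) are hypothetical, and the choice to put the serialization parameters on p:output is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: all file names are assumptions -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">

    <!-- Read plain-text input; the href may be a local path or a remote URL -->
    <p:input port="source" primary="true" content-types="text">
        <p:document href="sonnets.txt" content-type="text/plain"/>
    </p:input>

    <!-- Emit the result, serialized as indented HTML -->
    <p:output port="result" primary="true"
        serialization="map{'method': 'html', 'indent': true()}"/>

    <!-- Tag the plain text using an ixml grammar -->
    <p:invisible-xml>
        <p:with-input port="grammar">
            <p:document href="sonnets.ixml" content-type="text/plain"/>
        </p:with-input>
    </p:invisible-xml>

    <!-- Transform the tagged XML to HTML -->
    <p:xslt>
        <p:with-input port="stylesheet" href="sonnets-to-html.xsl"/>
    </p:xslt>

</p:declare-step>
```

The content-type attribute on the grammar document is there because a processor may not recognize the .ixml extension as text. The steps need no explicit source connection because each one reads from the primary output of the step before it.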

Session 3: Invisible XML and XProc (Wednesday, 2025-03-05)

Files for the Blithedale activity are located at https://github.com/djbpitt/ixml/tree/main/blithedale.

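Below is a copy of blithedale.xpl, with line annotations below the code.

[Code listing: blithedale.xpl. Only one fragment of the listing is preserved here, the inline XQuery used by the BOM-removal step: text{{substring(.,2)}}]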

Below is a graph of the pipeline, where, in the interest of making the graph less complicated and easier to read, we’ve omitted the steps that are used only during debugging. The large box labeled p:if / bom-removal is the step that removes the Unicode byte order mark (BOM) if present. It’s complex because it includes what happens both when there is a BOM and when there isn’t, but if you ignore what’s inside the box for the moment and just look at what flows into it and out of it, you can see that the original input flows in, and what flows out is the original input after a possible leading BOM has been removed.
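
The idea behind the bom-removal step can be sketched as follows. This is a reconstruction under assumptions, based only on the step name p:if / bom-removal and the inline query text{{substring(.,2)}} that appears in the listing. The port declarations and the exact test expression are guesses, and the sketch is not a copy of blithedale.xpl.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch of BOM removal; ports and test expression are assumptions -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source" content-types="text"/>
    <p:output port="result"/>

    <!-- If the text begins with U+FEFF, keep everything from the second
         character on; if not, p:if passes its input through unchanged -->
    <p:if name="bom-removal" test="starts-with(., '&#xFEFF;')">
        <p:xquery>
            <p:with-input port="query">
                <!-- Braces are doubled because XProc expands single braces
                     in inline content as text value templates -->
                <p:inline content-type="text/plain">text{{substring(., 2)}}</p:inline>
            </p:with-input>
        </p:xquery>
    </p:if>
</p:declare-step>
```

After brace unescaping, the query that runs is text{substring(., 2)}, which is meant to return the input text minus its first character. The sketch relies on p:xquery accepting a plain-text query and returning a text document for a text-node result.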

Four details to notice are that:

  1. The overall flow of information through the pipeline is linear (the result port at the bottom of each step flows into the source port at the top of the next step) until the point where it splits to create two outputs, one in HTML and one in SVG. Where steps require secondary input (e.g., a validation step requires a schema, an XSLT step requires an XSLT stylesheet), those flow into the steps separately from boxes with cx:document in their top row.

  2. After the step that deals with the BOM, the flow proceeds in linear fashion through:

    1. A p:invisible-xml step that adds basic markup and transforms the plain text input into well-formed XML;

    2. A p:delete step to remove the Gutenberg boilerplate header and footer;

    3. A p:xslt step to tag the quotes;

    4. A p:validate-with-schematron step to check the quotes;

    5. A p:xslt step to tag the front matter;

    6. A p:label-elements step to add @id attributes to the body titles;

    7. A p:delete step to remove elements once their values have been copied into the @id attributes created by the labeling step; and

    8. A p:validate-with-relax-ng step to verify that the markup conforms to our expectations.

  3. The result port of the step that validates with Relax NG has two out-arrows, each supplying input into the source port of a p:xslt step on the penultimate row of the graph. This step, then, creates a branch or split in the pipeline, and the two sides proceed on their own from there. One of the XSLT transformations creates the HTML reading view, while the other one creates the SVG. The output of each of the XSLT steps flows into its own p:store step, which writes the result to disk.

  4. The graph shows the disposition of every output port (boxes in the bottom row of each step), both the primary ones (labeled result) and any possible additional ones (e.g., secondary for a p:xslt step; report for a validation step). All of those output ports either flow into another step or are discarded, which is represented by flowing into a small black dot. By the bottom of the graph everything has flowed into a small black dot, and the primary output, shown in the upper right of the graph, is therefore isolated, that is, nothing in the actual working pipeline flows into it. That’s consistent with our having specified that the primary output port should be empty.
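
The branch described in points 3 and 4 can be sketched in XProc as follows. The step name finalize-xml and the schema and stylesheet file names are taken from the graph. The output file names passed to p:store and the port declarations are assumptions, so treat this as an illustration of the pattern rather than an excerpt from blithedale.xpl.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch of the final branch; p:store targets are assumptions -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <p:input port="source" content-types="xml"/>

    <!-- The primary output is deliberately empty: results are written by p:store -->
    <p:output port="result" primary="true" sequence="true">
        <p:empty/>
    </p:output>

    <!-- Naming the step lets both branches read from its result port -->
    <p:validate-with-relax-ng name="finalize-xml">
        <p:with-input port="schema">
            <!-- The content-type override may be unnecessary in some processors -->
            <p:document href="blithedale.rnc" content-type="text/plain"/>
        </p:with-input>
    </p:validate-with-relax-ng>

    <!-- Branch 1: HTML reading view -->
    <p:xslt>
        <p:with-input port="source" pipe="result@finalize-xml"/>
        <p:with-input port="stylesheet" href="blithedale-to-reading-view.xsl"/>
    </p:xslt>
    <p:store href="blithedale-reading-view.xhtml"/>

    <!-- Branch 2: SVG graph -->
    <p:xslt>
        <p:with-input port="source" pipe="result@finalize-xml"/>
        <p:with-input port="stylesheet" href="blithedale-to-graph.xsl"/>
    </p:xslt>
    <p:store href="blithedale-graph.svg"/>
</p:declare-step>
```

Both p:xslt steps connect their source port explicitly to result@finalize-xml. Without that explicit connection, the second p:xslt would read from the p:store step before it. The sequence="true" on p:output is needed because an empty connection delivers zero documents.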

[Figure: graph of the blithedale pipeline. Steps shown, in order: p:if / bom-removal (containing p:when with p:xquery, and p:otherwise with p:identity); p:invisible-xml (grammar: blithedale.ixml); p:delete; p:xslt (blithedale-tag-quotes.xsl); p:validate-with-schematron (blithedale-check-quotes.sch); p:xslt (blithedale-tag-front.xsl); p:label-elements; p:delete; p:xslt (blithedale-cleanup-xml.xsl); p:validate-with-relax-ng / finalize-xml (blithedale.rnc); then two branches, p:xslt (blithedale-to-reading-view.xsl) and p:xslt (blithedale-to-graph.xsl), each followed by p:store. The pipeline’s primary result port is connected to an empty input.]