Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2025-03-05T18:02:02+0000

XProc assignment 1: Answer key

See the Assignment page for a discussion of the data and the task. Your solution does not have to match ours as long as it does what you want; in ours we included steps for all of the tasks suggested in the assignment. Our XProc is below, followed by line comments:

[XProc pipeline listing]

Here’s how it works:

Lines 2–5: Note the namespace declarations inside the start-tag. The cx: namespace prefix is used for XProc extension steps, that is, those that are not part of the standard step library or any of the optional step libraries. The xs: namespace prefix is used for standard datatypes, such as the Boolean (true/false) type in Line 11.
Line 11: A static parameter is a variable that does not depend on input data. In this case we declare a $debug parameter that governs 1) whether to save movies.xml (in addition to movies.xhtml and movies.svg, which are always created) and 2) whether to display progress messages to the screen as the pipeline is processed. The default value is set to false() (no progress messages), and we can override that and set it to true() when we run the pipeline. To run with debug output (note the different syntax for specifying a true value for XML Calabash and MorganaXProc-IIIse):
- xmlcalabash moves.xpl debug="?true()"
- morgana moves.xpl -static:debug=true
Without debug information:
- xmlcalabash moves.xpl
- morgana moves.xpl
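The declarations described so far might look roughly like the sketch below. This is a hedged illustration rather than our actual listing: XProc declares what we have been calling a static parameter with a p:option element that has static="true", and the cx: namespace URI shown is the one that XML Calabash uses for its extensions.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">
  <!-- Static option: its value is fixed before the pipeline runs,
       so @use-when expressions can test it. Defaults to false(). -->
  <p:option name="debug" as="xs:boolean" static="true" select="false()"/>

  <!-- input and output declarations and the working steps go here -->
</p:declare-step>
```

Because the option is static, its value has to be supplied on the command line before the pipeline starts, which is what the commands above do.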
Lines 15–16: Read input over the Internet. Setting the value of @sequence to false means that the input must be exactly one document. Setting it true allows any number of documents, including zero.
Lines 20–22: We set the output as empty because we’re going to write the results that we care about to disk using <p:store>. Since we don’t have exactly one output document (we have none), we have to set @sequence to true, since a value of false would require exactly one output document (see above concerning <p:input>).
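As a rough sketch of the two port declarations, assuming a placeholder URL (the real location of the data is given in the assignment, not here):

```xml
<!-- Exactly one document on the primary input.
     The URL is a hypothetical placeholder. -->
<p:input port="source" sequence="false"
         href="https://example.org/movieData.txt"/>

<!-- Nothing leaves the pipeline: the port is bound to an empty sequence,
     so @sequence must be true to permit zero documents. -->
<p:output port="result" sequence="true">
  <p:empty/>
</p:output>
```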
Line 23: Here and elsewhere (lines 34, 39, 46, 55, 66, 77, 88, 98, 107) we use <p:identity>, which passes through its input unchanged, just as a place to hang a @message attribute. The value of a @message attribute is written to the screen when the step is executed, and the @use-when attribute value controls whether the step should be executed or not. If $debug is true(), the messages are displayed; if it is false() (the default), they aren’t. The messages are just for our convenience, so that we can track the progress of the pipeline as it is executed.

We could put a @message attribute on any step, but we want the working steps to be executed regardless of the value of the $debug parameter, so we can’t put a @use-when attribute on them. Putting the message on a separate <p:identity> step lets us control whether it is rendered; the working step is always executed.
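A messaging step of this kind could be as small as the following sketch; the message text is invented for illustration:

```xml
<!-- Included only when the static $debug option is true; otherwise the step
     is dropped from the pipeline before it runs.
     The message text is a hypothetical example. -->
<p:identity use-when="$debug" message="Parsing plain text with ixml"/>
```

A working step that follows it would carry no @use-when attribute, so it runs whether or not the message is shown.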
Lines 29–34: Add basic XML markup according to the ixml grammar specified on Line 31.

You need to use Markup Blitz, rather than CoffeePot, for this step because the source file is too large for CoffeePot to process comfortably. You can use the abridged movieData-short.txt input with CoffeePot (see the link in the assignment), but not the full movieData.txt file.

If you’re running with MorganaXProc-IIIse configured according to our configuration instructions, you’ll automatically use Markup Blitz, which is what you want. XML Calabash added support for Markup Blitz only in version alpha21, so if you’re running an earlier version of XML Calabash (you can check your version with xmlcalabash version) you can either upgrade (see §2.1 Configuring XML Calabash (XProc) of our Configuring XProc and ixml processors for information about configuring XML Calabash alpha21 to use Markup Blitz) or just use MorganaXProc-IIIse.

If you’re running XML Calabash version alpha21 or later according to those configuration instructions, XML Calabash would normally default to using CoffeePot for ixml. We override that default behavior by using the @cx:processor attribute to tell XML Calabash to use Markup Blitz instead of CoffeePot.
Line 38: We use an XProc step to remove the first of the repeated film-entry elements, which contains header labels and not real film data.
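One standard step that can do this is p:delete, which removes whatever its @match pattern selects. The sketch below assumes a hypothetical element name, movie, as a child of the root element; substitute whatever your ixml grammar actually produces.

```xml
<!-- "movie" is a hypothetical element name. The pattern selects only the
     first such child of the root element, i.e. the header row. -->
<p:delete match="/*/movie[1]"/>
```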
Lines 43–45: We use a <p:xslt> step with our movies-countries.xsl XSLT stylesheet to modify the country and runtime information as described in the assignment.
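A three-line XSLT step of this shape might look like the following sketch. The document to transform arrives on the source port by implicit binding, so only the stylesheet port needs an explicit connection:

```xml
<p:xslt>
  <!-- The source port is bound implicitly to the previous step's result -->
  <p:with-input port="stylesheet" href="movies-countries.xsl"/>
</p:xslt>
```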
Lines 50–54: We created movies.rnc to model the XML we intend to create within our pipeline. Since we’ve now reached that stage, we validate the XML against the Relax NG schema.
Lines 61–65: We use movies.sch to validate the XML against a Schematron schema. Counting the number of countries is sort of silly, but the other check, which validates element content, is useful because we’re going to rely on the shape of the information there when we create SVG later in the pipeline.
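A Schematron validation step could be sketched as below. The @name value finalize-xml is the one that a later step in this pipeline uses to refer back to this step's result port; the rest is a minimal illustration rather than our exact markup.

```xml
<!-- Naming the step makes its result port addressable from later steps -->
<p:validate-with-schematron name="finalize-xml">
  <p:with-input port="schema" href="movies.sch"/>
</p:validate-with-schematron>
```

If validation succeeds, the source document passes through to the result port unchanged.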
Line 70: We don’t normally save the XML because it’s just a step on the way to the output we really want, which is the HTML and SVG. For debugging purposes, though, we might want to see it, so we use a <p:store> step (which writes its input to disk with the specified filename and then passes that same input along unchanged to the next step, like a <p:identity> step) that runs only in debug mode. We can put our progress message directly on this step because, unlike our other working steps, the whole step runs only in debug mode.
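A one-line sketch of such a debug-only save, with an invented message text:

```xml
<!-- Both the save and its message are skipped unless $debug is true.
     The message text is a hypothetical example. -->
<p:store href="movies.xml" use-when="$debug"
         message="Saving intermediate XML as movies.xml"/>
```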
Lines 74–76: The XML flows by implicit binding into this next step, which uses movies-to-html.xsl to create an HTML reading view of our data. We created movies.css, for styling the HTML, separately, and our XSLT transformation writes a <link> element into the HTML that points to it.
Lines 81–87: The output of the XSLT transformation flows by implicit binding into a <p:store> step, which writes the HTML to the specified filename. We set the serialization parameters for this step to use HTML 5 with XML syntax, to include the optional XML declaration, and to omit the HTML content type specification, which should not be used when creating HTML 5 with XML syntax. We turn on pretty-printing to make the raw, angle-bracketed HTML easier to read.

The input into the source port of a <p:store> step flows out of its result port unchanged, just as with a <p:identity> step. However, as described immediately below, we’re going to ignore that HTML going forward, so this <p:store> step represents the end of this branch of the pipeline.
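The serialization settings described above could be expressed as a map on the p:store step, as in this sketch. The keys are standard serialization parameter names; the choice of the xhtml method with html-version 5 is our reading of “HTML 5 with XML syntax” and may differ in detail from the original listing.

```xml
<p:store href="movies.xhtml"
         serialization="map{
           'method': 'xhtml',
           'html-version': 5,
           'omit-xml-declaration': false(),
           'include-content-type': false(),
           'indent': true()
         }"/>
```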
Lines 92–97: We run a new <p:xslt> step that transforms the XML (not the HTML that we just created!) into SVG with movies-to-svg.xsl. If we didn’t say otherwise, the output of the last step (the HTML that we created earlier) would flow into the source port of this <p:xslt> step through implicit binding. That isn’t the input we want, though, so we override it and pull the input instead from the step where we performed our Schematron validation. This is why we had to put a @name attribute on the <p:validate-with-schematron> step; the @name lets us point to it on Line 94, where the @step attribute value on the <p:pipe> element says to use the step called finalize-xml and the @port attribute value says to use the result port of that step.

This step creates a branch or fork in the pipeline. The flow of information has been linear (from each step to the next one by implicit binding) until now, but at this stage we reach backward and establish a second flow from the result port of an earlier step. Because a <p:validate-with-schematron> step passes its input through unchanged, we could have attached the name to either of the two steps before it, the <p:xslt> step that finalizes the XML or the <p:validate-with-relax-ng> step that validates it. The result would be the same with any of these three choices; we opted for the last one because it corresponds most directly to our understanding of the processing, and, specifically, that we don’t want to create the HTML and SVG output until we’ve fully validated the XML.
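The explicit binding that creates the fork can be sketched as follows. The step name, port name, and stylesheet filename come from the discussion above; the layout is illustrative.

```xml
<p:xslt>
  <!-- Explicit binding: read the validated XML from the named step,
       overriding the implicit connection to the HTML branch -->
  <p:with-input port="source">
    <p:pipe step="finalize-xml" port="result"/>
  </p:with-input>
  <p:with-input port="stylesheet" href="movies-to-svg.xsl"/>
</p:xslt>
```

XProc 3.0 also accepts a shorthand @pipe attribute on p:with-input (pipe="result@finalize-xml"), which expresses the same connection without a child element.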
Lines 102–106: We use a <p:store> step to write the SVG to disk with the specified filename. The serialization parameters are different from those for the HTML output because SVG is a different type of XML document than HTML 5 with XML syntax.

Unless we say otherwise, the output on the result port of the final step in a pipeline passes its information implicitly into the main output port, which we declared on Lines 20–22. However, because we specified in that declaration that the output should be empty, that specification overrides the implicit binding, which means that nothing flows out of the pipeline.

Below is a graph of the pipeline as created by XML Calabash, except that we added the coloring manually. We turned off debug mode, which means that the pipeline omits the <p:identity> steps that are used for messaging and the step that saves movies.xml, all of which run only in debug mode. See also the notes below the graph:

Steps are implicitly connected or bound to the steps that follow them immediately in the pipeline. Here are some details:
- The primary input and primary output ports are declared before anything else. The steps that follow them in this pipeline are all what we call working steps, that is, those that process the document according to our instructions. The primary input has only one port, called source, and it flows implicitly into the source port of the first working step. The primary output has only one port, called result, and the result port of the last working step flows implicitly into it. In this pipeline we override that implicit connection by specifying that the primary output port should ingest not from the implicit connection, but from an empty document.
- Most of the time we let the implicit connections control the flow of information, but sometimes we overrule it with an explicit connection or binding. See the discussion of the branching that creates both HTML and SVG output.
The pink represents the primary flow of information, most often from the result port of one step into the source port of the next.
The blue represents the end of the pipeline. In this project the last thing we do is use <p:store> steps to write the HTML and SVG output to disk. Because we configured the primary output port to be empty (lines 20–22), that explicit empty value overrides the implicit connection from the last step to the primary output. The implicit connection is still there, but the primary output port ignores it because we’ve told it that it should be empty.

You can see in the graph above that the <p:store> step has two output ports: result and result-uri. In addition to writing to disk, a <p:store> step copies the input from its source input port to its output result port unchanged, just like <p:identity>, and it writes the path to the saved file to the result-uri output port. We ignore both of those, with the result that they disappear without a trace.
The diamonds near the top represent the primary input and output steps. The house-shaped polygon at the top represents the plain text document that the primary input port loads. The house-shaped polygon on the far right represents the output of the primary output port, which is empty because that’s how we configured it.
Steps (other than the primary input and output) are represented by rectangles with three rows, as follows:
- The middle row holds the type of the step; for an example see the step near the upper left of the graph. If the step has a name, it is displayed after a slash character following the type; for an example see the name finalize-xml for the <p:validate-with-schematron> step.
- The top row represents the input ports for the step. The primary input into each step is the source port. Some steps have additional inputs, e.g., a grammar port for a <p:invisible-xml> step or a stylesheet port for a <p:xslt> step.
- The bottom row represents the output ports for the step. The primary output from each step is the result port. Some steps have additional outputs, e.g., a validation step emits a report on its report output port, which we ignore.
Where steps require secondary input (e.g., grammars for ixml, stylesheets for XSLT, schemas for validation), those enter through a secondary input port. Those inputs come from documents, also represented by rectangles with three rows, as follows:
- The top row always says cx:document.
- Because we read this input from existing resources, the middle row has an href value that gives the source filename.
- The bottom row is always a single result port, which flows into the appropriate secondary input port of the step that will use it.
All output ports emit output, but in this pipeline, except for the primary output on the result port (and sometimes even then), we ignore that output. In the graph this is represented by an arrow from the output port to a small black circle, which means that the output from that port disappears without a trace.