XProc assignment 1

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2025-03-01T17:42:30+0000

The task

This assignment builds on ixml Assignment 1. In the ixml assignment you created an ixml grammar that could convert a plain-text file of movie data to well-formed XML. In this XProc assignment you’ll embed your ixml processing in an XProc script and use additional XProc features to improve the final XML. Possible pipeline steps include:

Use an XProc step to remove the first record element, which contains column headings rather than real film data.
If your ixml grammar did not remove quotation marks from around movie titles, use an XSLT or XQuery step to do that.
Use an XSLT or XQuery step to modify the representation of countries. Currently the country value for every film looks like <country>UK</country> or, if there are two or more countries, <country>"UK, USA"</country>. In our solution we used XSLT to transform those to:
```
<countries>
  <country>UK</country>
</countries>
```
and
```
<countries>
  <country>UK</country>
  <country>USA</country>
</countries>
```
That is, a) we changed the main element for country information from singular <country> to plural <countries>; b) where there were two or more countries, we split them and removed the intervening comma plus space and the quotation marks; and c) we made each country a <country> child of the <countries> parent.

We opted not to sort the countries alphabetically because we don’t know whether the order is informational or random. If it is informational, we should leave it as is. If it is random, it would be better to alphabetize the contents.
Remove the space plus min from the runtimes, so that the timed values are just a number (for example, <runtime>9 min</runtime> becomes <runtime>9</runtime>), while <runtime>N/A</runtime> remains unchanged.
Create a Relax NG grammar that matches what you want your final XML markup to look like and add an XProc step to validate the XML against the Relax NG.
Create a Schematron schema to validate features of your XML markup that are amenable to Schematron validation. For example, with respect to countries you can verify that every film is associated with 1–7 countries (inclusive; 7 happens to be the largest number of countries associated with any single film in the dataset). With respect to runtime, you can verify that the runtime values are all either the string N/A or a positive integer.
Create an XSLT stylesheet to transform the enriched XML into valid XHTML 5. Save the XHTML document to disk. If you want to link your XHTML to CSS, author a CSS stylesheet separately and let the XSLT transformation create the HTML <link> element that points to the CSS.
Design an SVG visualization that tells an interesting story about the data, perhaps concerning some relationship involving some combination of year, country, and duration. Create an XSLT stylesheet to transform the enriched XML (not the XHTML; you’ll want to derive both the XHTML and the SVG directly from the XML) into your desired SVG. Save the SVG document to disk.
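
The country and runtime modifications described above could be sketched in XSLT 3.0 roughly as follows. This is a minimal sketch, not our actual stylesheet: it assumes that the ixml output is in no namespace, that the elements are named country and runtime, that multiple countries are separated by exactly a comma plus a space, and that the quotation marks are straight double quotes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <!-- Copy everything that is not explicitly changed below -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumed input: <country>UK</country> or <country>"UK, USA"</country> -->
  <xsl:template match="country">
    <countries>
      <!-- Strip the quotation marks, then split on comma plus space;
           the original order is kept (no sorting) -->
      <xsl:for-each select="tokenize(translate(., '&quot;', ''), ', ')">
        <country>
          <xsl:value-of select="."/>
        </country>
      </xsl:for-each>
    </countries>
  </xsl:template>

  <!-- Assumed input: <runtime>9 min</runtime>; <runtime>N/A</runtime>
       does not match this template and is copied unchanged -->
  <xsl:template match="runtime[ends-with(., ' min')]">
    <runtime>
      <xsl:value-of select="substring-before(., ' min')"/>
    </runtime>
  </xsl:template>
</xsl:stylesheet>
```

Because the stylesheet relies on shallow-copy for everything else, it doesn't need to know the names of the root or record elements.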

You don’t have to do all of the preceding, but we’d recommend completing at least the first three tasks. Feel free also to think of your own tasks.
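
The two Schematron checks mentioned above (1–7 countries per film, and runtime values that are either N/A or a positive integer) might look something like the following sketch. It assumes the enriched markup uses countries, country, and runtime elements in no namespace, and that your Schematron processor accepts the xslt2 query binding; adjust both to match your own setup.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Assumes each film has one <countries> parent with <country> children -->
    <sch:rule context="countries">
      <sch:assert test="count(country) ge 1 and count(country) le 7">
        A film must be associated with 1–7 countries; found <sch:value-of select="count(country)"/>.
      </sch:assert>
    </sch:rule>
    <!-- A positive integer here means digits only, with no leading zero -->
    <sch:rule context="runtime">
      <sch:assert test=". = 'N/A' or matches(., '^[1-9][0-9]*$')">
        Runtime must be either N/A or a positive integer; found "<sch:value-of select="."/>".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

These rules fire only where the elements exist, so checking that every film actually has a countries and a runtime element is better left to the Relax NG grammar.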

XProc and pipelining

ixml is not intended to provide the sort of flexibility and control over detail that is available from XSLT or XQuery. A common use of ixml is to transform plain text input, with structure represented indirectly (e.g., with newlines and spaces and punctuation), into basic XML, with structure encoded explicitly through markup. Once your ixml process has created XML, the entire XML toolchain becomes available, which means that you can enrich the output of the ixml process using other resources, such as XSLT or XQuery.

XProc, as a pipelining language, is well suited to describing a processing chain declaratively, along the lines of “first ingest the plain text and turn it into basic XML with ixml, then pass the result along to XSLT for more refined modification, then pass that result along to a different XSLT process for still more modification,” etc. With that said, not every XProc pipeline includes ixml; sometimes your starting point may already be XML, and when you do start with plain text, ixml isn’t the only way to get from there to basic XML. For example, XProc itself is able to add markup, and XPath includes functions that can parse plain text using regular expressions; if you use those functions within XSLT or XQuery, you can introduce basic markup into a plain-text source document without using ixml. Because the movie data is so regular, though, it’s a good candidate for initial processing with ixml.

In addition to invoking ixml, XSLT, and XQuery processing, XProc includes steps (elements) that can perform some of the same operations as XSLT or XQuery. For example, if you want to delete elements of a certain type from the XML output of an ixml process, you can either 1) use the XProc <p:delete> step and specify the elements to remove with a @match attribute or 2) within an XSLT transformation, either never apply templates to the elements you want to remove or apply templates to them that do nothing. Whether you use XProc steps that modify the XML directly or, instead, use XProc <p:xslt> and <p:xquery> steps to invoke XSLT stylesheets or XQuery modules that modify the XML is up to you. Our general preference is to use stand-alone XProc steps for simple modifications and rely on XSLT and XQuery for more complicated transformations.
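
To make the comparison concrete, here is a minimal sketch that shows both ways of deleting elements in one small pipeline. The element name note is hypothetical and chosen only for illustration, and in a real pipeline you would use one option or the other, not both.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- Option 1: a stand-alone XProc step.
       "note" is a hypothetical element name. -->
  <p:delete match="note"/>

  <!-- Option 2 (an alternative to Option 1): an XSLT step,
       here with the stylesheet supplied inline -->
  <p:xslt>
    <p:with-input port="stylesheet">
      <p:inline>
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
          <!-- Copy everything that has no template of its own -->
          <xsl:mode on-no-match="shallow-copy"/>
          <!-- An empty template: matched elements produce no output -->
          <xsl:template match="note"/>
        </xsl:stylesheet>
      </p:inline>
    </p:with-input>
  </p:xslt>
</p:declare-step>
```

The single-line <p:delete> is easier to read for a change this simple, which is why we tend to reserve XSLT for modifications that need real logic.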

How to proceed

We approached this assignment by following the order below:

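Create a new XProc 3.0 file in <oXygen/>. Configure the primary source and result ports as described in the XProc tutorial that you read earlier. To make sure that this part of your configuration works, set it up with just a single <p:identity> step. <p:identity> just copies the input to the output, with no change, which makes it a useful way to test whether your input and output are configured properly before you introduce real functionality. Your initial XProc script might look something like:
```
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <!-- Replace the href value with the URL or local path of the movie data file -->
    <p:input port="source" primary="true">
        <p:document href="..." content-type="text/plain"/>
    </p:input>
    <p:output port="result" primary="true" sequence="true"/>
    <!-- Copy the input to the output unchanged -->
    <p:identity/>
</p:declare-step>
```
You can get your input either over the Internet or from a local file, and you can use either the full data file or the abbreviated one.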
Once you’ve confirmed that your input and output ports are configured properly, remove the <p:identity> step and add real functionality. The first real functionality we added was a <p:invisible-xml> step, where we invoked the ixml grammar we wrote for the ixml assignment. You can look up how to configure a <p:invisible-xml> step at https://xprocref.org/3.1/p.invisible-xml.html.
Continue to add other steps that implement the functionality described above. Those steps might include <p:delete> (to remove the header row), <p:xslt> (to enhance the basic markup introduced with ixml and, later, to create HTML or SVG output), <p:validate-with-relax-ng> (to verify that the markup you’ve added conforms to your intentions), <p:validate-with-schematron> (for additional validation), and <p:store> (to save XML, HTML, or SVG results).

You’ll want to look up these steps at https://xprocref.org/ to learn how they work. We find the examples at the bottom of the individual pages especially helpful.
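
Put together, a linear version of the steps above might look something like the following sketch. Treat the details as assumptions rather than as our solution: movies.txt, movies.rng, and movies.sch are placeholder file names, the match pattern assumes that each record is a <movie> element, and the explicit content-type attributes are there only in case your processor doesn’t recognize the file extensions.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <!-- Plain-text input; movies.txt is a placeholder file name -->
  <p:input port="source">
    <p:document href="movies.txt" content-type="text/plain"/>
  </p:input>
  <p:output port="result"/>

  <!-- Parse the plain text with the ixml grammar from the ixml assignment -->
  <p:invisible-xml>
    <p:with-input port="grammar">
      <p:document href="movies.ixml" content-type="text/plain"/>
    </p:with-input>
  </p:invisible-xml>

  <!-- Remove the header row; assumes that records are <movie> siblings -->
  <p:delete match="movie[1]"/>

  <!-- Enrich the basic markup (countries, runtimes) -->
  <p:xslt>
    <p:with-input port="stylesheet" href="movies-countries.xsl"/>
  </p:xslt>

  <!-- Validate; by default both steps raise an error on invalid input -->
  <p:validate-with-relax-ng>
    <p:with-input port="schema">
      <p:document href="movies.rng" content-type="application/xml"/>
    </p:with-input>
  </p:validate-with-relax-ng>

  <p:validate-with-schematron>
    <p:with-input port="schema">
      <p:document href="movies.sch" content-type="application/xml"/>
    </p:with-input>
  </p:validate-with-schematron>
</p:declare-step>
```

Each step reads the primary result of the step before it through an implicit connection, so only the secondary inputs (grammar, stylesheet, schema) need to be specified.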
In our implementation we used <p:store> to save everything we wanted to save, which means that we could tell the main result port not to create any output. We did that by changing the <p:output> element to:
```
<p:output port="result" sequence="true">
  <p:empty/>
</p:output>
```
If you adopt this strategy, get at least one of your <p:store> steps working before you turn off the output from the principal output port. If you turn off the output port and don’t configure any <p:store> steps, you’ll get no output at all and you won’t be able to see what your code is doing.
If you create just one output (e.g., XML, HTML, or SVG), your pipeline will pass the data along through implicit connections from each step to the immediately following one. That’s fine if all you want to do, for example, is get from plain text to an HTML reading view. If, though, you want to create multiple outputs and your pipeline isn’t completely linear, you’ll need to use explicit connections. For example, if you use linear steps to finalize your XML and then want to generate both HTML and SVG output from the XML, you can use linear, implicit connections to go from the XML to the HTML, but you don’t want to go from the HTML to the SVG; you want to go back to the XML, so that it will be the source of both the HTML and the SVG. You’ll find information about explicit, non-linear flow at “Rerouting and interrupting the data flow” and “XProc 3.0 - Connecting steps using ports” (section headed “Explicit connections”).
For diagnostic purposes you can tell XProc to write messages to the screen while it processes the steps. You can do this by adding a @message attribute to any step, e.g. (the file name and message text are only illustrations):
```
<p:store href="movies.xml" message="Saving enriched XML"/>
```
To emit messages that aren’t attached to real steps, you can throw in a <p:identity> step with just a @message attribute, e.g.:
```
<p:identity message="Validation finished"/>
```
Recall that <p:identity> just copies from the source (input) to the result (output) unchanged, so you can hang a message on it without interfering with your real pipeline logic.
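
The non-linear flow, the empty principal output, and the messages described above can be combined as in the following sketch. It is an illustration under stated assumptions, not our actual pipeline: the step name enriched, the stylesheet names movies-to-html.xsl and movies-to-svg.xsl, and the output file names are all hypothetical, and for brevity the enriched XML is assumed to arrive on the source port.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <!-- For this sketch the enriched, validated XML arrives on the source port -->
  <p:input port="source"/>

  <!-- Nothing goes to the principal output; results are saved with p:store -->
  <p:output port="result" sequence="true">
    <p:empty/>
  </p:output>

  <!-- Give the XML a name so that both branches can read it explicitly -->
  <p:identity name="enriched" message="Enriched XML is ready"/>

  <!-- Branch 1: XML to XHTML -->
  <p:xslt message="Creating XHTML">
    <p:with-input port="source" pipe="result@enriched"/>
    <p:with-input port="stylesheet" href="movies-to-html.xsl"/>
  </p:xslt>
  <p:store href="movies.xhtml"/>

  <!-- Branch 2: XML to SVG; reads the XML again, not the XHTML -->
  <p:xslt message="Creating SVG">
    <p:with-input port="source" pipe="result@enriched"/>
    <p:with-input port="stylesheet" href="movies-to-svg.xsl"/>
  </p:xslt>
  <p:store href="movies.svg"/>
</p:declare-step>
```

The pipe attribute on the second <p:xslt> is what matters here: without it, that step would implicitly read the result of the preceding <p:store>, which is the XHTML rather than the XML.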

Appendix: Visualizing XProc (optional)

XML Calabash is capable of rendering graphic representations of XProc pipelines. This appendix is optional, since you can develop and use XProc without it, but we find the visualizations especially useful when our XProc does something unexpected, so we’d encourage you at least to read through the appendix to learn how to read a graphic representation of an XProc pipeline.

The following image was generated from a sample XProc script that implements all of the steps listed and described above. We used XML Calabash to produce the image, appending --graphs:. (note the dot) to the command line to tell it to create graphic representations of the pipeline in the current directory. We then added the color manually.

Here’s how to read the pieces of the image:

The squares with pink backgrounds represent the primary flow of information, most often from the result port of one step to the source port of the next. The squares with blue backgrounds, in the <p:store> steps at the bottom, represent where we write information to disk. Information does flow out of the result ports of those steps, but we ignore it, so that saving to disk is the last step that we care about on the XHTML and SVG branches of the pipeline.
The diamonds at the top are the principal input and output. The output is empty because we set it up that way; we store all of the output we care about to disk with <p:store> steps and ultimately replace the main output with nothing.
Each step has three rows:
- A middle row with the name of the step, e.g., p:invisible-xml toward the upper left.
- An upper row that shows the inputs. In the case of invisible XML there are two: the grammar (from the box above labeled movies.ixml) and the plain text document to be tagged (from the source diamond).
- A lower row that represents the result, which typically flows into one or more following steps.
Some steps have just one input, labeled source, such as the <p:delete> step, which we use to remove the first record element from our XML. The content of that element isn’t a real film; it’s the column labels (title, year, country, runtime), so it isn’t real data. Most steps have two inputs, e.g., a <p:xslt> step has the XML it’s transforming (which flows into a port labeled source) and the XSLT stylesheet used for the transformation (which flows into a port labeled stylesheet). Similarly, the two validation steps have two inputs each, the XML being validated (which flows into the port labeled source) and the Relax NG or Schematron schema (which flows into a port labeled schema).
Where documents like XSLT stylesheets or schemas are introduced into the pipeline, the middle line gives the file identifier (e.g., href="movies-countries.xsl"), the upper row says cx:document, and the lower row says result.
Some steps produce both a main result (typically passed on as input into the next step) and a secondary result, labeled secondary for XSLT steps and result-uri for <p:store> steps. In this pipeline we ignore the secondary results. A result (primary or secondary) may flow into another step or end at a dot, which means that it isn’t processed further. In a <p:store> step the primary result is written to disk and the result-uri is discarded. We similarly discard the validation reports (XProc will raise an error if the validation fails, so we don’t need to inspect the report) and the secondary output of XSLT steps.

What the image as a whole shows is that information flows from one step to another, with supplementary information (XSLT stylesheets to transform the primary information, schemas to validate it, etc.) introduced where needed. Notice that the primary result of the Schematron validation has two out-arrows because the validated XML is the input into two <p:xslt> steps, one that creates HTML and another that creates SVG. Each row of the diagram has one step (a box with a middle line that begins with p:; a row may also have boxes with cx:document in the top row for additional input) except that the penultimate row has two XSLT steps. That’s because of the split in the output from the Schematron validation: the validated XML flows into two branches of the pipeline, which proceed independently of each other, each storing a result (one HTML, one SVG).
