Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2025-03-01T17:42:30+0000


XProc assignment 1

The task

This assignment builds on ixml Assignment 1. In the ixml assignment you created an ixml grammar that could convert a plain-text file of movie data to well-formed XML. In this XProc assignment you’ll embed your ixml processing in an XProc script and use additional XProc features to improve the final XML. Possible pipeline steps include:

  1. Use an XProc step to remove the first record element, which contains column headings rather than real file data.

  2. If your ixml grammar did not remove quotation marks from around movie titles, use an XSLT or XQuery step to do that.

  3. Use an XSLT or XQuery step to modify the representation of countries. Currently the country value for every film looks like <country>UK</country> or, if there are two or more countries, <country>"UK, USA"</country>. In our solution we used XSLT to transform those to:

    <countries>
      <country>UK</country>
    </countries>

    and

    <countries>
      <country>UK</country>
      <country>USA</country>
    </countries>

    That is, a) we changed the main element for country information from singular <country> to plural <countries>; b) where there were two or more countries, we split them and removed the intervening comma plus space and the quotation marks; and c) we made each country a <country> child of the <countries> parent.

    We opted not to sort the countries alphabetically because we don’t know whether the order is informational or random. If it is informational, we should leave it as is. If it is random, it would be better to alphabetize the contents.

  4. Remove the space plus min from the runtimes, so that the timed values are just a number (for example, 9 min becomes 9), while N/A remains unchanged.

  5. Create a Relax NG grammar that matches what you want your final XML markup to look like and add an XProc step to validate the XML against the Relax NG.

  6. Create a Schematron schema to validate features of your XML markup that are amenable to Schematron validation. For example, with respect to countries you can verify that every film is associated with 1–7 countries (inclusive; 7 happens to be the largest number of countries associated with any single film in the dataset). With respect to runtime, you can verify that the runtime values are all either the string N/A or a positive integer.

  7. Create an XSLT stylesheet to transform the enriched XML into valid XHTML 5. Save the XHTML document to disk. If you want to link your XHTML to CSS, author a CSS stylesheet separately and let the XSLT transformation create the HTML <link> element that points to the CSS.

  8. Design an SVG visualization that tells an interesting story about the data, perhaps concerning some relationship involving some combination of year, country, and duration. Create an XSLT stylesheet to transform the enriched XML (not the XHTML; you’ll want to derive both the XHTML and the SVG directly from the XML) into your desired SVG. Save the SVG document to disk.
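As a rough illustration of tasks 3 and 4 above, the following XSLT 3.0 sketch splits country values and trims runtimes. It assumes that your ixml output uses elements named <country> and <runtime> in no namespace; those names are assumptions made for this example, so substitute whatever your own grammar produces.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <!-- Copy everything unchanged unless a more specific template matches -->
    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Task 3: wrap the country values in a plural parent.
         The element names country and countries are assumptions. -->
    <xsl:template match="country">
        <countries>
            <!-- Strip quotation marks, then split on a comma plus optional space.
                 The original order is kept; nothing is sorted. -->
            <xsl:for-each select="tokenize(translate(., '&quot;', ''), ',\s*')">
                <country>
                    <xsl:value-of select="normalize-space(.)"/>
                </country>
            </xsl:for-each>
        </countries>
    </xsl:template>

    <!-- Task 4: "9 min" becomes "9". Values such as "N/A" do not satisfy
         the predicate, so the shallow-copy mode leaves them unchanged.
         The element name runtime is an assumption. -->
    <xsl:template match="runtime[matches(., '^\d+ min$')]">
        <xsl:copy>
            <xsl:value-of select="substring-before(., ' min')"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Because the stylesheet relies on shallow-copy as its default behavior, any element that you don’t mention explicitly passes through untouched, which keeps the stylesheet short.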

You don’t have to do all of the preceding, but we’d recommend completing at least the first three tasks. Feel free also to think of your own tasks.
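For the Schematron task, a minimal sketch of the two checks described in the list might look like the following. It assumes the enriched markup uses <countries>, <country>, and <runtime> elements in no namespace; those names are assumptions made for this example, so adjust the rule contexts to match your own markup.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Every film should be associated with 1 to 7 countries, inclusive.
             The element names here are assumptions. -->
        <sch:rule context="countries">
            <sch:assert test="count(country) = (1 to 7)">
                Expected between 1 and 7 countries, but found
                <sch:value-of select="count(country)"/>.
            </sch:assert>
        </sch:rule>
        <!-- Runtime should be either the string N/A or a positive integer
             (no leading zeros, no trailing " min") -->
        <sch:rule context="runtime">
            <sch:assert test=". eq 'N/A' or matches(., '^[1-9][0-9]*$')">
                Runtime must be N/A or a positive integer, but found
                "<sch:value-of select="."/>".
            </sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```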

XProc and pipelining

ixml is not intended to provide the sort of flexibility and control over detail that is available from XSLT or XQuery. A common use of ixml is to transform plain text input, with structure represented indirectly (e.g., with newlines and spaces and punctuation), into basic XML, with structure encoded explicitly through markup. Once your ixml process has created XML, the entire XML toolchain becomes available, which means that you can enrich the output of the ixml process using other resources, such as XSLT or XQuery.

XProc, as a pipelining language, is well suited to describing a processing chain declaratively, along the lines of: first ingest the plain text and turn it into basic XML with ixml, then pass the result along to XSLT for more refined modification, then pass that result along to a different XSLT process for still more modification, etc. With that said, not every XProc pipeline includes ixml: sometimes your starting point may already be XML, and when you do start with plain text, ixml isn’t the only way to get from there to basic XML. For example, XProc itself is able to add markup, and XPath includes functions that can parse plain text using regular expressions, so if you embed XPath within XSLT or XQuery, you can introduce basic markup into a plain text source document without using ixml. Because the movie data is so regular, though, it’s a good candidate for initial processing with ixml.

In addition to invoking ixml, XSLT, and XQuery processing, XProc includes steps (elements) that can perform some of the same operations as XSLT or XQuery. For example, if you want to delete elements of a certain type from the XML output of an ixml process, you can either 1) use the XProc <p:delete> step and specify the elements to remove with a @match attribute or, 2) within an XSLT transformation, either never apply templates to the elements you want to remove or apply templates to them that do nothing. Whether you use XProc steps that modify the XML directly or, instead, use XProc <p:xslt> and <p:xquery> steps to invoke XSLT stylesheets or XQuery modules that modify the XML is up to you. Our general preference is to use stand-alone XProc steps for simple modifications and rely on XSLT and XQuery for more complicated transformations.
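To make the two options concrete, here is a small sketch of a pipeline that removes one element. The pattern /*/*[1] (the first child of the root element) is only an assumption about where the unwanted element sits; you would replace it with a pattern that fits your own markup. The XSLT is supplied inline just to keep the example in one file; a separate stylesheet file would work the same way.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <p:input port="source" primary="true"/>
    <p:output port="result" primary="true"/>

    <!-- Option 1, a stand-alone XProc step, would be the single line:
         <p:delete match="/*/*[1]"/> -->

    <!-- Option 2: invoke XSLT. The empty template matches the unwanted
         element and does nothing, so that element is dropped, while
         shallow-copy passes everything else through unchanged. -->
    <p:xslt>
        <p:with-input port="stylesheet">
            <xsl:stylesheet version="3.0">
                <xsl:mode on-no-match="shallow-copy"/>
                <xsl:template match="/*/*[1]"/>
            </xsl:stylesheet>
        </p:with-input>
    </p:xslt>
</p:declare-step>
```

For a deletion this simple, the one-line <p:delete> is easier to read, which is the reasoning behind the preference stated above.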

How to proceed

We approached this assignment by following the order below:

  1. Create a new XProc 3.0 file in <oXygen/>. Configure the primary source and result ports as described in the XProc tutorial that you read earlier. To make sure that this part of your configuration works, set it up with just a single <p:identity> step. <p:identity> just copies the input to the output, with no change, which makes it a useful way to test whether your input and output are configured properly before you introduce real functionality. Your initial XProc script might look something like:

    [XProc listing not preserved in this copy]

    You can get your input either over the Internet (as above) or from a local file and you can use either the full data file (as above) or the abbreviated one.

  2. Once you’ve confirmed that your input and output ports are configured properly, remove the <p:identity> step and add real functionality. The first real functionality we added was a <p:invisible-xml> step, where we invoked the ixml grammar we wrote for the ixml assignment. You can look up how to configure a <p:invisible-xml> step at https://xprocref.org/3.1/p.invisible-xml.html.

  3. Continue to add other steps that implement the functionality described above. Those steps might include <p:delete> (to remove the header row), <p:xslt> (to enhance the basic markup introduced with ixml and, later, to create HTML or SVG output), <p:validate-with-relax-ng> (to verify that the markup you’ve added conforms to your intentions), <p:validate-with-schematron> (for additional validation), and <p:store> (to save XML, HTML, or SVG results).

    You’ll want to look up these steps at https://xprocref.org/ to learn how they work. We find the examples at the bottom of the individual pages especially helpful.

  4. In our implementation we used <p:store> to save everything we wanted to save, which means that we could tell the main result port not to create any output. We did that by changing the <p:output> element to:

    <p:output port="result" sequence="true">
      <p:empty/>
    </p:output>

    If you adopt this strategy, get at least one of your <p:store> steps working before you turn off the output from the principal output port. If you turn off the output port and don’t configure any <p:store> steps, you’ll get no output at all and you won’t be able to see what your code is doing.

  5. If you create just one output (e.g., XML, HTML, or SVG), your pipeline will pass the data along through implicit connections from each step to the immediately following one. That’s fine if all you want to do, for example, is get from plain text to an HTML reading view. If, though, you want to create multiple outputs and your pipeline isn’t completely linear, you’ll need to use explicit connections. For example, if you use linear steps to finalize your XML and then want to generate both HTML and SVG output from the XML, you can use linear, implicit connections to go from the XML to the HTML, but you don’t want to go from the HTML to the SVG; you want to go back to the XML, so that it will be the source of both the HTML and the SVG. You’ll find information about explicit, non-linear flow at “Rerouting and interrupting the data flow” and “XProc 3.0 - Connecting steps using ports” (section headed “Explicit connections”).

  6. For diagnostic purposes you can tell XProc to write messages to the screen while it processes the steps. You can do this by adding a @message attribute to any step, e.g.:

    [XProc listing not preserved in this copy]

    To emit messages that aren’t attached to real steps, you can throw in a <p:identity> step with just a @message attribute, e.g.:

    [XProc listing not preserved in this copy]

    Recall that <p:identity> just copies from the source (input) to the result (output) unchanged, so you can hang a message on it without interfering with your real pipeline logic.
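Putting the steps above together, a pipeline along these lines is one possible shape for a solution. Treat it as a sketch under stated assumptions, not as our actual script: the input file name (movies.txt), the output file names (movies.xhtml and movies.svg), the message texts, and the pattern used to find the header row are all placeholders chosen for this example.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
    <!-- Hypothetical input location; an http(s) URL works the same way -->
    <p:input port="source" primary="true">
        <p:document href="movies.txt" content-type="text/plain"/>
    </p:input>
    <!-- The primary result port emits nothing; p:store saves the outputs -->
    <p:output port="result" primary="true" sequence="true">
        <p:empty/>
    </p:output>

    <!-- Plain text to basic XML; the message is written while the step runs -->
    <p:invisible-xml message="Parsing plain text with ixml">
        <p:with-input port="grammar">
            <p:document href="movies.ixml" content-type="text/plain"/>
        </p:with-input>
    </p:invisible-xml>

    <!-- Remove the header row, assumed here to be the first child of the root -->
    <p:delete match="/*/*[1]"/>

    <p:xslt message="Restructuring countries">
        <p:with-input port="stylesheet" href="movies-countries.xsl"/>
    </p:xslt>

    <p:validate-with-relax-ng>
        <p:with-input port="schema">
            <p:document href="movies.rnc"
                content-type="application/relax-ng-compact-syntax"/>
        </p:with-input>
    </p:validate-with-relax-ng>

    <!-- Named so that a later step can connect back to its result -->
    <p:validate-with-schematron name="finalize-xml">
        <p:with-input port="schema" href="movies.sch"/>
    </p:validate-with-schematron>

    <!-- A message that is not attached to a real step -->
    <p:identity message="XML finalized; creating HTML and SVG"/>

    <!-- Branch 1: implicit connection from the preceding step -->
    <p:xslt>
        <p:with-input port="stylesheet" href="movies-to-html.xsl"/>
    </p:xslt>
    <p:store href="movies.xhtml"/>

    <!-- Branch 2: explicit connection back to the validated XML, because
         the implicit source at this point would be the HTML branch -->
    <p:xslt>
        <p:with-input port="source" pipe="result@finalize-xml"/>
        <p:with-input port="stylesheet" href="movies-to-svg.xsl"/>
    </p:xslt>
    <p:store href="movies.svg"/>
</p:declare-step>
```

The pipe attribute on the second <p:xslt> is what makes the flow non-linear: it names the port and the step (result@finalize-xml) whose output should be read, instead of accepting whatever the immediately preceding step produced.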

What to submit

Submit your XProc file and any auxiliary files that it invokes, such as ixml grammars, XSLT stylesheets, XQuery modules, or CSS stylesheets that you might link to XHTML output. Do not submit the tagged XML (or XHTML or SVG) that it creates; we’ll run it ourselves to examine the output.


Appendix: Visualizing XProc (optional)

XML Calabash is capable of rendering graphic representations of XProc pipelines. This appendix is optional, since you can develop and use XProc without it, but we find the visualizations especially useful when our XProc does something unexpected, so we’d encourage you at least to read through the appendix to learn how to read a graphic representation of an XProc pipeline.

The following image was generated from a sample XProc script that implements all of the steps listed and described above. We used XML Calabash to produce the image, appending --graphs:. (note the dot) to tell it to create graphic representations of the pipeline in the current directory. We then added the color manually.

[Image: graph of the pipeline. The source document flows through p:invisible-xml (grammar: movies.ixml), p:delete, p:xslt (stylesheet: movies-countries.xsl), p:validate-with-relax-ng (schema: movies.rnc), and p:validate-with-schematron, named finalize-xml (schema: movies.sch). The result of the Schematron validation feeds two p:xslt steps (stylesheets: movies-to-html.xsl and movies-to-svg.xsl), each followed by a p:store step. A cx:empty node is connected to the pipeline’s result port.]

Here’s how to read the pieces of the image:

What the image as a whole shows is that information flows from one step to another, with supplementary information (XSLT stylesheets to transform the primary information, schemas to validate it, etc.) introduced where needed. Notice that the primary result of the Schematron validation has two out-arrows because the validated XML is the input into two <p:xslt> steps, one that creates HTML and another that creates SVG. Each row of the diagram has one step (a box with a middle line that begins with p:; a row may also have boxes with cx:document in the top row for additional input) except that the penultimate row has two XSLT steps. That’s because of the split in the output from the Schematron validation: the validated XML flows into two branches of the pipeline, which proceed independently of each other, each storing a result (one HTML, one SVG).