Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2025-03-01T17:42:30+0000
This assignment builds on ixml Assignment 1. In the ixml assignment you created an ixml grammar that could convert a plain-text file of movie data to well-formed XML. In this XProc assignment you’ll embed your ixml processing in an XProc script and use additional XProc features to improve the final XML. Possible pipeline steps include:
Use an XProc step to remove the first element that your grammar creates for a row of the file; it contains column headings, not real file data.
If your ixml grammar did not remove quotation marks from around movie titles, use an XSLT or XQuery step to do that.
Use an XSLT or XQuery step to modify the representation of countries.
Currently the country value for every film looks like <country>UK</country> or, if there are two or more countries, <country>"UK, USA"</country>.
In our solution we used XSLT to transform those to:
<countries>
  <country>UK</country>
</countries>
and
<countries>
  <country>UK</country>
  <country>USA</country>
</countries>
That is, a) we changed the main element for country information from singular <country> to plural <countries>; b) where there were two or more countries, we split them and removed the intervening comma plus space and the quotation marks; and c) we made each country a <country> child of the <countries> parent.
We opted not to sort the countries alphabetically because we don’t know whether the order is informational or random. If it is informational, we should leave it as is. If it is random, it would be better to alphabetize the contents.
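The country restructuring described above could be sketched in XSLT 3.0 roughly as follows. This is an illustration, not our actual stylesheet: it assumes that your ixml output uses an element named country whose text is either a bare name or a quoted, comma-separated list, and it keeps the countries in their original order.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <!-- Copy everything that no other template changes -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumed input element name: country -->
  <xsl:template match="country">
    <countries>
      <!-- Remove quotation marks, then split on commas -->
      <xsl:for-each select="tokenize(translate(., '&quot;', ''), ',')">
        <country>
          <!-- normalize-space() drops the space that followed each comma -->
          <xsl:value-of select="normalize-space(.)"/>
        </country>
      </xsl:for-each>
    </countries>
  </xsl:template>
</xsl:stylesheet>
```

If your grammar uses different element names, only the match pattern and the two literal result elements need to change.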
Remove the space plus min from the runtimes, so that the timed values are just a number (for example, 9 min becomes 9), while N/A remains unchanged.
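A minimal XSLT 3.0 sketch of the runtime cleanup, assuming (hypothetically) that your ixml output stores the value in an element named runtime and that timed values always have the form of digits followed by a space and min:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
  <!-- Copy everything that no other template changes, including N/A runtimes -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- Assumed element name: runtime. Only values like "9 min" match this pattern. -->
  <xsl:template match="runtime[matches(., '^\d+ min$')]">
    <xsl:copy>
      <!-- Keep the digits that precede the space plus min -->
      <xsl:value-of select="substring-before(., ' min')"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because the predicate tests the whole string, a value such as N/A never matches the template and is copied unchanged.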
Create a Relax NG grammar that matches what you want your final XML markup to look like and add an XProc step to validate the XML against the Relax NG.
Create a Schematron schema to validate features of your XML markup that are
amenable to Schematron validation. For example, with respect to countries
you can verify that every film is associated with 1–7 countries (inclusive;
7 happens to be the largest number of countries associated with any single
film in the dataset). With respect to runtime, you can verify that the
runtime values are all either the string N/A
or a positive
integer.
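The two checks mentioned above might be expressed in Schematron along these lines. This is a sketch under assumptions: the element names countries, country, and runtime are hypothetical and depend on your own markup, and the regular expression is just one way to test for a positive integer.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <sch:pattern>
    <!-- Assumed element names: countries with country children -->
    <sch:rule context="countries">
      <sch:assert test="count(country) ge 1 and count(country) le 7">
        Every film must be associated with 1 to 7 countries.
      </sch:assert>
    </sch:rule>
    <!-- Assumed element name: runtime -->
    <sch:rule context="runtime">
      <sch:assert test=". = 'N/A' or matches(., '^[1-9][0-9]*$')">
        A runtime must be either the string N/A or a positive integer.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```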
Create an XSLT stylesheet to transform the enriched XML into valid XHTML 5.
Save the XHTML document to disk. If you want to link your XHTML to CSS,
author a CSS stylesheet separately and let the XSLT transformation create
the HTML <link> element that points to the CSS.
Design an SVG visualization that tells an interesting story about the data, perhaps concerning some relationship involving some combination of year, country, and duration. Create an XSLT stylesheet to transform the enriched XML (not the XHTML; you’ll want to derive both the XHTML and the SVG directly from the XML) into your desired SVG. Save the SVG document to disk.
You don’t have to do all of the preceding, but we’d recommend completing at least the first three tasks. Feel free also to think of your own tasks.
ixml is not intended to provide the sort of flexibility and control over detail that is available from XSLT or XQuery. A common use of ixml is to transform plain text input, with structure represented indirectly (e.g., with newlines and spaces and punctuation), into basic XML, with structure encoded explicitly through markup. Once your ixml process has created XML, the entire XML toolchain becomes available, which means that you can enrich the output of the ixml process using other resources, such as XSLT or XQuery.
XProc, as a pipelining language, is well suited to describing a processing chain
declaratively, along the lines of “first ingest the plain text and turn it into
basic XML with ixml, then pass the result along to XSLT for more refined
modification, then pass that result along to a different XSLT process for still
more modification”, etc. With that said, not every XProc pipeline includes
ixml; sometimes your starting point may already be XML, and when you do start with
plain text, ixml isn’t the only way to get from there to basic XML. For example,
XProc itself is able to add markup, and XPath includes functions that can parse
plain text using regular expressions; if you call those functions from XSLT or
XQuery, you can introduce basic markup into a plain-text source document without using ixml.
Because the movie data is so regular, though, it’s a good candidate for initial
processing with ixml.
In addition to invoking ixml, XSLT, and XQuery processing, XProc includes
steps (elements) that can perform some of the same operations as XSLT or
XQuery. For example, if you want to delete elements of a certain type from the XML
output of an ixml process, you can either 1) use the XProc <p:delete> step and
specify the elements to remove with a @match attribute or,
2) within an XSLT transformation, either never apply templates to the elements you
want to remove or apply templates to them that do nothing. Whether you use XProc
steps that modify the XML directly or, instead, use XProc <p:xslt> and <p:xquery>
steps to invoke XSLT stylesheets or XQuery modules that modify the XML is up to you.
Our general preference is to use stand-alone XProc steps for simple modifications and
rely on XSLT and XQuery for more complicated transformations.
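As a sketch of the first option, the pipeline below deletes one element with a stand-alone XProc step. The element name movie is hypothetical; substitute whatever name your grammar gives to each record.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- match is a pattern: movie[1] selects a movie that is the first movie child of its parent -->
  <p:delete match="movie[1]"/>
</p:declare-step>
```

The XSLT counterpart would be an empty template with the same match pattern inside a stylesheet that otherwise copies its input unchanged.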
We approached this assignment by following the order below:
Create a new XProc 3.0 file in <oXygen/>. Configure the primary source and
result ports as described in the XProc tutorial that you read earlier. To make sure that this part
of your configuration works, set it up with just a single <p:identity> step.
<p:identity> just copies the input to the output, with no change, which makes it
a useful way to test whether your input and output are configured properly before
you introduce real functionality. Your initial XProc script needs only the two port
declarations and that one step.
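A minimal starting pipeline might look like the sketch below. It is an illustration under assumptions: the file name movies.txt is hypothetical, and the sketch reads a local file, although an http(s) URL in @href would work the same way.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <!-- Primary input: the plain-text movie data (hypothetical file name) -->
  <p:input port="source" primary="true">
    <p:document href="movies.txt" content-type="text/plain"/>
  </p:input>

  <!-- Primary output: whatever the last step produces -->
  <p:output port="result" primary="true"/>

  <!-- Copies the input to the output unchanged -->
  <p:identity/>
</p:declare-step>
```

If running this shows the plain text of your input as the result, the ports are wired correctly.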
You can get your input either over the Internet or from a local file, and you can use either the full data file or the abbreviated one.
Once you’ve confirmed that your input and output ports are configured
properly, remove the <p:identity> step and add real
functionality. The first real functionality we added was a <p:invisible-xml> step, where
we invoked the ixml grammar we wrote for the ixml assignment. You can look up how
to configure a <p:invisible-xml> step at https://xprocref.org/3.1/p.invisible-xml.html.
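A sketch of how the ixml step might be wired in, assuming the grammar is stored next to the pipeline as movies.ixml and the data file is the hypothetical movies.txt:

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true">
    <p:document href="movies.txt" content-type="text/plain"/>
  </p:input>
  <p:output port="result" primary="true"/>

  <!-- The plain text arrives implicitly on the source port; the grammar needs an explicit connection -->
  <p:invisible-xml>
    <p:with-input port="grammar">
      <!-- Declare the grammar as text so that it is not parsed as XML -->
      <p:document href="movies.ixml" content-type="text/plain"/>
    </p:with-input>
  </p:invisible-xml>
</p:declare-step>
```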
Continue to add other steps that implement the functionality described above.
Those steps might include <p:delete> (to remove the header row), <p:xslt> (to
enhance the basic markup introduced with ixml and, later, to create HTML or
SVG output), <p:validate-with-relax-ng> (to verify that the markup you’ve added
conforms to your intentions), <p:validate-with-schematron> (for additional
validation), and <p:store> (to save XML, HTML, or SVG results).
You’ll want to look up these steps at https://xprocref.org/ to learn how they work. We find the examples at the bottom of the individual pages especially helpful.
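For instance, the two validation steps could be chained as in the sketch below. The schema file names movies.rng and movies.sch are hypothetical. Both steps pass the validated document through on their result ports.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- Structural validation against a Relax NG schema (hypothetical file name) -->
  <p:validate-with-relax-ng>
    <p:with-input port="schema" href="movies.rng"/>
  </p:validate-with-relax-ng>

  <!-- Rule-based validation against a Schematron schema (hypothetical file name) -->
  <p:validate-with-schematron>
    <p:with-input port="schema" href="movies.sch"/>
  </p:validate-with-schematron>
</p:declare-step>
```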
In our implementation we used <p:store> to save everything
we wanted to save, which means that we could tell the main result port not
to create any output. We did that with a small change to our pipeline’s output configuration.
If you adopt this strategy, get at least one of your <p:store> steps working before
you turn off the output from the principal output port. If you turn off the
output port and don’t configure any <p:store> steps, you’ll get no
output at all and you won’t be able to see what your code is doing.
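One way to get that effect, offered as a sketch rather than as the exact change in our pipeline, is to bind the result port to an empty sequence while a <p:store> step writes the document to disk. The output file name movies.xml is hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true"/>

  <!-- Bind the result port to nothing; sequence="true" permits zero documents -->
  <p:output port="result" primary="true" sequence="true">
    <p:empty/>
  </p:output>

  <!-- Write the document to disk (hypothetical file name) -->
  <p:store href="movies.xml"/>
</p:declare-step>
```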
If you create just one output (e.g., XML, HTML, or SVG), your pipeline will pass the data along through implicit connections from each step to the immediately following one. That’s fine if all you want to do, for example, is get from plain text to an HTML reading view. If, though, you want to create multiple outputs and your pipeline isn’t completely linear, you’ll need to use explicit connections. For example, if you use linear steps to finalize your XML and then want to generate both HTML and SVG output from the XML, you can use linear, implicit connections to go from the XML to the HTML, but you don’t want to go from the HTML to the SVG; you want to go back to the XML, so that it will be the source of both the HTML and the SVG. You’ll find information about explicit, non-linear flow at Rerouting and interrupting the data flow and XProc 3.0 - Connecting steps using ports (section headed Explicit connections).
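The branching described above might be sketched as follows. The step name final-xml and all file names are hypothetical, and the <p:identity> step merely stands in for whichever step finalizes your XML.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true" sequence="true">
    <p:empty/>
  </p:output>

  <!-- Stand-in for the last step that finalizes the XML; the name lets later steps point back to it -->
  <p:identity name="final-xml"/>

  <!-- HTML branch: implicit connection from the step immediately above -->
  <p:xslt>
    <p:with-input port="stylesheet" href="movies-to-xhtml.xsl"/>
  </p:xslt>
  <p:store href="movies.xhtml"/>

  <!-- SVG branch: explicit connection back to the XML, not to the HTML -->
  <p:xslt>
    <p:with-input port="source" pipe="result@final-xml"/>
    <p:with-input port="stylesheet" href="movies-to-svg.xsl"/>
  </p:xslt>
  <p:store href="movies.svg"/>
</p:declare-step>
```

The value of @pipe names a port and a step in the form port@step, which is what lets the second transformation skip over the HTML branch.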
For diagnostic purposes you can tell XProc to write messages to the screen
while it processes the steps. You can do this by adding a @message
attribute to any step.
To emit messages that aren’t attached to real steps, you can throw in a
<p:identity> step with just a @message attribute.
Recall that <p:identity> just
copies from the source (input) to the result (output) unchanged, so you can
hang a message on it without interfering with your real pipeline
logic.
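Both uses of @message might look like the sketch below. The message texts and the element name movie are hypothetical.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="3.0">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- A message that is not tied to any real processing -->
  <p:identity message="Starting the pipeline"/>

  <!-- A message attached to a step that does real work -->
  <p:delete match="movie[1]" message="Removing the header row"/>
</p:declare-step>
```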
Submit your XProc file and any auxiliary files that it invokes, such as ixml grammars, XSLT stylesheets, XQuery modules, or CSS stylesheets that you might link to XHTML output. Do not submit the tagged XML (or XHTML or SVG) that it creates; we’ll run it ourselves to examine the output.
XML Calabash is capable of rendering graphic representations of XProc pipelines. This appendix is optional, since you can develop and use XProc without it, but we find the visualizations especially useful when our XProc does something unexpected, so we’d encourage you at least to read through the appendix to learn how to read a graphic representation of an XProc pipeline.
The following image was generated from a sample XProc script that implements all of the steps listed and described above. We used XML Calabash to produce the image, appending --graphs:. (note the dot) to tell it to create graphic representations of the pipeline in the current directory. We then added the color manually.
Here’s how to read the pieces of the image:
The squares with pink backgrounds represent the primary flow of information,
most often from the result port of one step to the source port
of the next. The squares with blue backgrounds, in the <p:store> steps at the bottom,
represent where we write information to disk. Information does flow out of
the result ports of those steps, but we ignore it, so that saving to
disk is the last step that we care about on the XHTML and SVG branches of
the pipeline.
The diamonds at the top are the principal input and output. The output is
empty because we set it up that way; we store all of the output we care
about to disk with <p:store> steps and ultimately replace the main output with nothing.
Each step has three rows:
A middle row with the name of the step, e.g., p:invisible-xml
toward the upper left.
An upper row that shows the inputs. In the case of invisible XML
there are two: the grammar (from the box above labeled
movies.ixml
) and the plain text document to be tagged
(from the source
diamond).
A lower row that represents the result, which typically flows into one or more following steps.
Some steps have just one input, labeled source, such as the <p:delete> step,
which we use to remove the first per-film element from our XML. The content of
that element isn’t a real film; it’s the column labels (title, year, country,
runtime), so it isn’t real data.
Most steps have two inputs, e.g., a <p:xslt> step has the XML it’s
transforming (which flows into a port labeled source) and the XSLT
stylesheet used for the transformation (which flows into a port labeled
stylesheet). Similarly, the two validation steps have two inputs
each, the XML being validated (which flows into the port labeled
source) and the Relax NG or Schematron schema (which flows into a
port labeled schema).
Where documents like XSLT stylesheets or schemas are introduced into the
pipeline, the middle line gives the file identifier (e.g.,
href="movies-countries.xsl"), the upper row says cx:document, and the lower row
says result.
Some steps produce both a main result (typically passed on as input into the
next step) and a secondary result, labeled secondary for XSLT steps
and result-uri for <p:store> steps. In this
pipeline we ignore the secondary results. A result (primary or secondary)
may flow into another step or end at a dot, which means that it isn’t
processed further. In a <p:store> step the primary
result is written to disk and the result-uri is discarded. We similarly
discard the validation reports (XProc will raise an error if the validation
fails, so we don’t need to inspect the report) and the secondary output of
XSLT steps.
What the image as a whole shows is that information flows from one step to another,
with supplementary information (XSLT stylesheets to transform the primary
information, schemas to validate it, etc.) introduced where needed. Notice that the
primary result of the Schematron validation has two out-arrows because the validated
XML is the input into two <p:xslt> steps, one that creates HTML and another that
creates SVG. Each row of the diagram has one step (a box with a middle line that
begins with p:; a row may also have boxes with cx:document in the top row for
additional input), except for the penultimate row, which has two XSLT steps. That’s
because of the split in the output from the Schematron validation: the validated XML
flows into two branches of the pipeline, which proceed independently of each other,
each storing a result (one HTML, one SVG).