Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2025-02-23T21:26:30+0000
There’s a plain-text tab-delimited document at https://newtfire.org/courses/tutorials/movieData.txt that contains information about 25093 films released between 1930 and 2018. Each line of the file contains four pieces of information about one film, separated by tab characters: 1) title, 2) year, 3) country, and 4) runtime. The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in <oXygen/> and look around in order to familiarize yourself with it before you start the assignment.
The format of the data is mostly straightforward, but here are a few details:
Some titles are surrounded by quotation marks and others are not. The use of quotation marks around titles is arbitrary and not informational.
The country value includes between one and seven countries, inclusive. If there are two or more, the individual countries are separated by a comma and space character and the whole value is surrounded by quotation marks, e.g., "UK, USA". The quotation marks in the country field are used only when there are two or more countries.
The runtime is either a number of minutes followed by a space character and the abbreviation min (e.g., 93 min) or just the string N/A (which stands for “not available”). If a time is given, it is always given as a number of minutes (i.e., there is no reference to hours or any other unit of time except minutes).
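Before writing the grammar, it can help to confirm these details from the command line. The following is a minimal sketch, not part of the assignment: it builds a three-line sample file and uses awk to count the tab-separated fields on each line and to count the N/A runtimes. The header names and the second data row are hypothetical stand-ins that mimic the format described above; they are not copied from movieData.txt, and the function names are made up for illustration.

```shell
# Build a small sample that mimics the described format.
# The header names and the "Some Title" row are hypothetical.
sample=$(mktemp)
printf 'title\tyear\tcountry\truntime\n'          >  "$sample"
printf 'Goodbye to All That\t1930\tUK\t9 min\n'   >> "$sample"
printf '"Some Title"\t1931\t"UK, USA"\tN/A\n'     >> "$sample"

# List the distinct numbers of tab-separated fields found across all lines.
# If every line has four fields, the only value printed is 4.
field_counts() {
  awk -F '\t' '{ print NF }' "$1" | sort -u
}

# Count data lines (skipping the header) whose runtime is exactly N/A.
count_na() {
  awk -F '\t' 'NR > 1 && $4 == "N/A" { n++ } END { print n + 0 }' "$1"
}

field_counts "$sample"   # prints 4
count_na "$sample"       # prints 1
```

If you point these functions at a saved copy of a file that uses Windows line endings, the last field on each line will carry a trailing CR, so the N/A comparison will not match until the CRs are removed (for example with tr -d '\r').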
Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. For example, for the plain text line:
Goodbye to All That 1930 UK 9 min
your ixml grammar should create XML that tags each of the four values in its own element:
Goodbye to All That
1930
UK
9 min
For the moment you don’t have to worry about treating multiple countries specially. That is, where the input country value is something like "US, UK", it’s okay if your XML says "US, UK", quotation marks included. We’ll process that value further, later in this unit, using XSLT within XProc.
Invisible XML doesn’t create pretty-printed (wrapped and indented) output by default, and we’ll take care of that later, once we begin using XProc. If, in the meantime, you’d like to make the result more legible, you can save the output file, open it in <oXygen/>, and pretty-print it there.
CoffeePot and the online jωiXML workbench are unable to process an input file of
this size, so you’ll need to use Markup Blitz (or xmq). If you’d like to do your
initial development using CoffeePot or jωiXML, you can use a shorter file that
we created at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt,
which includes items that cover all of the variation described above: titles
both with and without quotation marks, both single and multiple countries, and
both real durations and N/A. If you do your initial development
with the abridged input file, verify with Markup Blitz or xmq that your ixml
grammar is able to tag the entire input file.
You can either read your plain-text input file directly from the URL above or save it and then read your local copy. Markup Blitz can read directly from a URL, so if you want to use the online text file instead of saving and using a local copy, you can parse the remote file against your ixml grammar with:
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt
If you run the code above and your ixml tagging succeeds, the output will race across the screen. Here are two ways to examine the result more carefully:
If you’re on MacOS or in GitBash in Windows, you can pipe the output into less, a pager that displays one screen of data at a time. The command to do that is:
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt | less
Once the first screen is displayed in less, pressing the space bar moves forward to the next page. Press q to quit.
To save the output to disk, you can redirect it to a file by appending > movies.xml to the command line, i.e.:
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt > movies.xml
This will create a file called movies.xml, which will contain the output of your ixml operation. If the file already exists, running the command will replace the old file with the new output.
You may find the following values useful as you construct your ixml grammar:
Character | Character code (Hex value) |
---|---|
tab | #9 |
quotation mark (") | #22 |
newline (CR?, LF) | #d?, #a |
See below for how to use the newline codes to accommodate plain-text input that observes different newline conventions.
Jon Postel, an Internet
pioneer and the developer of many of the protocols that make the Internet work
today, is the originator of what has come to be known as both the Robustness
Principle and Postel’s Law: “Be liberal in what you accept, and conservative
in what you send.” The point is that when exchanging information we
cannot control what comes in, so we should try to anticipate and handle input
that may not conform to our preferences, and that may not even be internally
consistent. Meanwhile, we should be considerate when we create input for others
by aiming for consistency and compliance with Best Practice.
Postel’s Law is relevant for this exercise in at least two areas: we cannot predict how ends of lines (EOL) will be encoded in the input we access on the Internet and we also cannot predict whether there will be an EOL character after the last line of the document. For that reason, you’ll want to write your ixml to accept either of the common types of EOL codes and to deal with either the presence or the absence of an EOL code after the last line of the file. We describe below how to do this.
Concerning the last item in the table above, you may already know that lines
end in a single LF character (#a) in Linux and MacOS and in a two-character
CRLF sequence (#d followed by #a) in Windows. You may not know, though, which
of these line endings has been used in a file you fetch over the Internet (as
it happens, the online movieData.txt file uses Windows EOL codes, but the
abbreviated movieData-short.txt uses Linux/MacOS EOL codes), and the
expression #d?, #a will match either. Here’s why: In ixml the comma means “the
thing on the left followed by the thing on the right” and the question mark
means “optional”, so the expression #d?, #a matches an LF that may or may not
be preceded by a CR. If the LF is preceded by a CR, it’s a Windows EOL; if the
LF is not preceded by a CR, it’s a Linux/MacOS EOL. In other words, the
pattern #d?, #a will match the EOL code in both Linux/MacOS and Windows
files. Using this pattern is a way of being liberal in what you accept.
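If you save a local copy of an input file and want to see which convention it uses, a quick check from the command line is possible. The following is a minimal sketch under stated assumptions: the function name eol_kind and the two sample files are hypothetical, and the function treats the presence of any CR character as a sign of Windows line endings rather than verifying that every CR is followed by an LF.

```shell
# Report "CRLF" if the file contains any CR (#d) characters, otherwise "LF".
# It works by deleting every CR and comparing the result with the original:
# if nothing changed, the file had no CRs.
eol_kind() {
  if tr -d '\r' < "$1" | cmp -s - "$1"; then
    echo "LF"
  else
    echo "CRLF"
  fi
}

# Two tiny sample files, one per convention (hypothetical content).
dir=$(mktemp -d)
printf 'a\tb\r\nc\td\r\n' > "$dir/windows.txt"
printf 'a\tb\nc\td\n'     > "$dir/unix.txt"

eol_kind "$dir/windows.txt"   # prints CRLF
eol_kind "$dir/unix.txt"      # prints LF
```

A grammar that uses #d?, #a does not need this check in order to work; the check is only a way to see for yourself what the input contains.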
You don’t need to know the following, but in case you’re curious: CR
originally meant “carriage return” and LF originally meant “line feed”, and
they were related to how a printing device (such as a typewriter) positioned
itself before writing a character. Both belong to a set of what are called
control codes or, more narrowly, C0 control codes. You can read more about
them at https://en.wikipedia.org/wiki/C0_and_C1_control_codes. The terms are
not very meaningful in a modern context, but they nonetheless continue to be
used as character names.
A plain text file may or may not have an EOL code at the end of the last
line. Editing applications like Nano typically add an EOL at the end of the
last line when they save a file, but a file you read from a remote site may
end on either an EOL code after the last visible character or the last
visible character itself, without a trailing EOL code. As it happens, the
main movieData.txt file does not have an EOL after the final visible
character, while the abbreviated movieData-short.txt file has a
trailing #a.
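To check a saved file yourself, you can look at its final byte. The following is a minimal sketch; the function name ends_with_eol and the sample files are hypothetical. It relies on the fact that shell command substitution strips trailing newlines, so the captured last byte is empty exactly when that byte is an LF (an empty file would also be reported as ending with an EOL).

```shell
# Succeed (exit status 0) if the last byte of the file is a line feed (#a).
# tail -c 1 outputs the final byte; command substitution drops a trailing LF,
# leaving an empty string when the file ends with an EOL.
ends_with_eol() {
  [ -z "$(tail -c 1 "$1")" ]
}

# Sample files with and without a trailing EOL (hypothetical content).
dir=$(mktemp -d)
printf 'first\nlast\n' > "$dir/with.txt"
printf 'first\nlast'   > "$dir/without.txt"

for f in "$dir/with.txt" "$dir/without.txt"; do
  if ends_with_eol "$f"; then
    echo "trailing EOL"
  else
    echo "no trailing EOL"
  fi
done
```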
As a way of being liberal in what you accept, your ixml grammar should match a document that either does or does not end with an EOL after the last line. For example, if your document consists of a sequence of lines you can match them with:
doc: line++newline, newline?.
This assumes that 1) you want the root element of the XML you’re creating to
be <doc>, 2) you’ve defined the pattern line elsewhere and your definition
does not include a trailing EOL, and 3) you’ve defined newline elsewhere. The
part of this pattern before the comma (line++newline) matches one or more
instances of a line pattern, with a single instance of a newline pattern
between each pair of line patterns. By itself that will match consecutive
lines, except that if there is an EOL after the last line the match will fail
because, although the definition of doc would require newlines between lines,
it would not allow a newline at the end of the document. The part after the
comma (newline?) says that a doc may or may not have a trailing EOL after the
final line. This approach is liberal in accepting a sequence of lines whether
or not there is an EOL code after the last one.
Submit only your ixml grammar. Do not submit the tagged XML that it creates; we’ll run it ourselves to examine the output.