Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2025-02-23T21:26:30+0000


Invisible XML assignment 1

The data

There’s a plain-text tab-delimited document at https://newtfire.org/courses/tutorials/movieData.txt that contains information about 25093 films released between 1930 and 2018. Each line of the file contains four pieces of information about one film, separated by tab characters: 1) title, 2) year, 3) country, and 4) runtime. The first line of the file contains column headers; the remaining lines are actual data. You may want to open the file in <oXygen/> and look around in order to familiarize yourself with it before you start the assignment.
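To make the four-field layout concrete, here is a minimal Python sketch (not ixml, and not part of the assignment) that splits one data line on tab characters. The names Film and parse_line are hypothetical, and the sketch assumes every line has exactly four tab-separated fields.

```python
from typing import NamedTuple


class Film(NamedTuple):
    """One row of the movie data; all values are kept as plain strings."""
    title: str
    year: str
    country: str
    runtime: str


def parse_line(line: str) -> Film:
    """Split one tab-delimited line into its four fields."""
    # Drop any trailing CR/LF before splitting on tabs
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}")
    return Film(*fields)


sample = "Goodbye to All That\t1930\tUK\t9 min"
film = parse_line(sample)
print(film.title, film.year, film.country, film.runtime)
```

Your ixml grammar has to express the same structure declaratively: four values separated by tabs, one film per line.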

The format of the data is mostly straightforward, but here are a few details:

The task

Your task is to create an ixml grammar that will convert the plain-text file to well-formed XML. For example, for the plain text line:

Goodbye to All That	1930	UK	9 min

your ixml grammar should create:


  Goodbye to All That
  1930
  UK
  9 min

For the moment you don’t have to worry about treating multiple countries specially. That is, where the input country value is something like "US, UK", it’s okay if your XML says "US, UK". We’ll process that value further, later in this unit, using XSLT within XProc.

Invisible XML doesn’t create pretty-printed (wrapped and indented) output by default; we’ll take care of that later, once we begin using XProc. If, in the meantime, you’d like to make the result more legible, you can save the output file, open it in <oXygen/>, and pretty-print it there.

How to proceed

Reading the plain-text input

CoffeePot and the online jωiXML workbench are unable to process an input file of this size, so you’ll need to use Markup Blitz (or xmq). If you’d like to do your initial development using CoffeePot or jωiXML, you can use a shorter file that we created at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt, which includes items that cover all of the variation described above: titles both with and without quotation marks, both single and multiple countries, and both real durations and N/A. If you do your initial development with the abridged input file, verify with Markup Blitz or xmq that your ixml grammar is able to tag the entire input file.

You can either read your plain-text input file directly from the URL above or save it and then read your local copy. Markup Blitz can read directly from a URL, so if you want to use the online text file instead of saving and using a local copy, you can parse the remote file against your ixml grammar with:

blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt

Examining or saving the XML output

If you run the code above and your ixml tagging succeeds, the output will race across the screen. Here are two ways to examine the result more carefully:

Useful character codes

You may find the following values useful as you construct your ixml grammar:

Character               Character code (hex value)
tab                     #9
quotation mark (")      #22
newline (CR?, LF)       #d?, #a

See below for how to use the newline codes to accommodate plain-text input that observes different newline conventions.

The Robustness Principle (Postel’s Law)

Jon Postel, an Internet pioneer and the developer of many of the protocols that make the Internet work today, is the originator of what has come to be known as both the Robustness Principle and Postel’s Law: Be liberal in what you accept, and conservative in what you send. The point is that when exchanging information we cannot control what comes in, so we should try to anticipate and handle input that may not conform to our preferences, and that may not even be internally consistent. Meanwhile, we should be considerate when we create input for others by aiming for consistency and compliance with Best Practice.

Postel’s Law is relevant for this exercise in at least two areas: we cannot predict how ends of lines (EOL) will be encoded in the input we access on the Internet and we also cannot predict whether there will be an EOL character after the last line of the document. For that reason, you’ll want to write your ixml to accept either of the common types of EOL codes and to deal with either the presence or the absence of an EOL code after the last line of the file. We describe below how to do this.

End of line (EOL, newline)

Concerning the last item in the table above, you may already know that lines end in a single LF character (#a) in Linux and macOS and in a two-character CRLF sequence (#d followed by #a) in Windows. You may not know, though, which of these line endings has been used in a file you fetch over the Internet. As it happens, the online movieData.txt file uses Windows EOL codes, but the abbreviated movieData-short.txt uses Linux/macOS EOL codes. The expression #d?, #a will match either. Here’s why: in ixml the comma means “the thing on the left followed by the thing on the right” and the question mark means “optional”, so the expression #d?, #a matches an LF that may or may not be preceded by a CR. If the LF is preceded by a CR, it’s a Windows EOL; if it is not, it’s a Linux/macOS EOL. Using this pattern is a way of being liberal in what you accept.
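As an analogy outside ixml, the same optional-CR idea can be sketched with a Python regular expression, where \r?\n plays the role of #d?, #a. The second line of each sample string is made up for illustration, and this is not part of the assignment solution.

```python
import re

# \r is CR (#d) and \n is LF (#a); the ? makes the CR optional,
# mirroring the ixml expression  #d?, #a
EOL = re.compile(r"\r?\n")

# Hypothetical two-line samples, one per newline convention
windows_text = "Goodbye to All That\t1930\r\nAnother Title\t1931"
unix_text = "Goodbye to All That\t1930\nAnother Title\t1931"

# Splitting on the pattern yields the same lines for both conventions
windows_lines = EOL.split(windows_text)
unix_lines = EOL.split(unix_text)

print(windows_lines == unix_lines)
```

Because the CR is consumed as part of the line ending when it is present, it never leaks into the line content.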

You don’t need to know the following, but in case you’re curious: CR originally meant carriage return and LF originally meant line feed, and they were related to how a printing device (such as a typewriter) positioned itself before writing a character. Both belong to a set of what are called control codes or, more narrowly, C0 control codes. You can read more about them at https://en.wikipedia.org/wiki/C0_and_C1_control_codes. The terms are not very meaningful in a modern context, but they nonetheless continue to be used as character names.

End of file

A plain text file may or may not have an EOL code at the end of the last line. Editing applications like Nano typically add an EOL at the end of the last line when they save a file, but a file you read from a remote site may end on either an EOL code after the last visible character or the last visible character itself, without a trailing EOL code. As it happens, the main movieData.txt file does not have an EOL after the final visible character, while the abbreviated movieData-short.txt file has a trailing #a.
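If you are curious how a particular file ends, one way to check is to look at its final bytes. The following Python sketch is only an illustration; the function name describe_ending is hypothetical, and the byte strings are made-up samples rather than the contents of the movie files.

```python
def describe_ending(data: bytes) -> str:
    """Report how a byte string ends: CRLF, bare LF, or no trailing EOL."""
    # Check CRLF first, because a CRLF ending also ends with LF
    if data.endswith(b"\r\n"):
        return "CRLF"
    if data.endswith(b"\n"):
        return "LF"
    return "none"


print(describe_ending(b"last line\r\n"))
print(describe_ending(b"last line\n"))
print(describe_ending(b"last line"))
```

To apply this to a saved copy of a file, you could read it in binary mode (for example, with open(path, "rb")) and pass the resulting bytes to the function.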

As a way of being liberal in what you accept, your ixml grammar should match a document that either does or does not end with an EOL after the last line. For example, if your document consists of a sequence of lines you can match them with:

doc: line++newline, newline?.

This assumes that 1) you want the root element of the XML you’re creating to be <doc>, 2) you’ve defined the pattern line elsewhere and your definition does not include a trailing EOL, and 3) you’ve defined newline elsewhere. The part of this pattern before the comma (line++newline) matches one or more instances of a line pattern, with a single instance of a newline pattern between each pair of line patterns. By itself that will match consecutive lines, but if there is an EOL after the last line the match will fail, because that part of the definition requires newlines between lines without allowing one at the end of the document. The part after the comma (newline?) says that a doc may or may not have a trailing EOL after the final line. This approach is liberal in accepting a sequence of lines whether or not there is an EOL code after the last one.
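The shape of that rule can be mimicked with a Python regular expression, which may help you see which inputs it accepts and rejects. This is a rough analogy, not ixml: the names LINE, NEWLINE, DOC, and is_doc are hypothetical, and the definition of a line as "one or more characters that are neither CR nor LF" is an assumption made only for this sketch.

```python
import re

# Assumption for illustration: a line is one or more characters
# that are neither CR nor LF. Your ixml definition of line will differ.
LINE = r"[^\r\n]+"
NEWLINE = r"\r?\n"

# line++newline  ->  LINE (NEWLINE LINE)*   (lines separated by newlines)
# newline?       ->  (NEWLINE)?             (optional trailing newline)
DOC = re.compile(rf"{LINE}(?:{NEWLINE}{LINE})*(?:{NEWLINE})?")


def is_doc(text: str) -> bool:
    """Return True if the whole text matches the doc pattern."""
    return DOC.fullmatch(text) is not None


print(is_doc("first\nsecond"))        # no trailing EOL
print(is_doc("first\nsecond\n"))      # trailing LF
print(is_doc("first\r\nsecond\r\n"))  # trailing CRLF
```

Under these assumptions all three calls accept their input, while a text ending in two consecutive newlines would be rejected, because only one optional trailing newline is allowed.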

What to submit

Submit only your ixml grammar. Do not submit the tagged XML that it creates; we’ll run it ourselves to examine the output.