Invisible XML assignment 1

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2025-02-23T21:26:30+0000

How to proceed

Reading the plain-text input

CoffeePot and the online jωiXML workbench are unable to process an input file of this size, so you’ll need to use Markup Blitz (or xmq). If you’d like to do your initial development using CoffeePot or jωiXML, you can use a shorter file that we created at https://raw.githubusercontent.com/djbpitt/ixml/refs/heads/main/movies/movieData-short.txt, which includes items that cover all of the variation described above: titles both with and without quotation marks, both single and multiple countries, and both real durations and N/A. If you do your initial development with the abridged input file, verify with Markup Blitz or xmq that your ixml grammar is able to tag the entire input file.

You can either read your plain-text input file directly from the URL above or save it and then read your local copy. Markup Blitz can read directly from a URL, so if you want to use the online text file instead of saving and using a local copy, you can parse the remote file against your ixml grammar with:

```
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt
```

Examining or saving the XML output

If you run the code above and your ixml tagging succeeds, the output will race across the screen. Here are two ways to examine the result more carefully:

If you’re on MacOS or in GitBash in Windows, you can pipe the output into less, a pager that displays one screen of data at a time. The command to do that is:
```
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt | less
```
Once the first screen is displayed in less, pressing the space bar moves forward to the next page. Press q to quit.
To save the output to disk, you can redirect it to a file by appending > movies.xml to the command line, i.e.:
```
blitz movies.ixml https://newtfire.org/courses/tutorials/movieData.txt > movies.xml
```
This will create a file called movies.xml, which will contain the output of your ixml operation. If the file already exists, running the command will replace the old file with the new output.

Useful character codes

You may find the following values useful as you construct your ixml grammar:

Character	Character code (Hex value)
tab	#9
quotation mark (")	#22
newline (CR?, LF)	#d?, #a

See below about how to use the newline codes to accommodate plain text input that observes different newline conventions.

The Robustness Principle (Postel’s Law)

Jon Postel, an Internet pioneer and the developer of many of the protocols that make the Internet work today, is the originator of what has come to be known as both the Robustness Principle and Postel’s Law: Be liberal in what you accept, and conservative in what you send. The point is that when exchanging information we cannot control what comes in, so we should try to anticipate and handle input that may not conform to our preferences, and that may not even be internally consistent. Meanwhile, we should be considerate when we create input for others by aiming for consistency and compliance with Best Practice.

Postel’s Law is relevant for this exercise in at least two areas: we cannot predict how ends of lines (EOL) will be encoded in the input we access on the Internet and we also cannot predict whether there will be an EOL character after the last line of the document. For that reason, you’ll want to write your ixml to accept either of the common types of EOL codes and to deal with either the presence or the absence of an EOL code after the last line of the file. We describe below how to do this.

End of line (EOL, newline)

Concerning the last item in the table above, you may already know that lines end in a single LF character (#a) in Linux and MacOS and in a two-character CRLF sequence (#d followed by #a) in Windows. You may not know, though, which of these line-endings has been used in a file you fetch over the Internet (as it happens, the online movieData.txt file uses Windows EOL codes, but the abbreviated movieData-short.txt uses Linux/MacOS EOL codes), and the expression #d?, #a will match either. Here’s why: In ixml the comma means the thing on the left followed by the thing on the right and the question mark means optional, so the expression #d?, #a matches an LF that may or may not be preceded by a CR. If the LF is preceded by a CR, it’s a Windows EOL; if the LF is not preceded by a CR, it’s a Linux/MacOS EOL. In other words, the pattern #d?, #a will match the EOL code in both Linux/MacOS and Windows files. Using this pattern is a way of being liberal in what you accept.
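
If you’d like to check for yourself which convention a file uses, one option is to count the lines that contain a CR character. The following is a minimal sketch, assuming a Bash-like shell with printf and grep available. The file names and the count_cr_lines function are made up for illustration, and the two small sample files stand in for a real download.

```shell
# Create two tiny sample files (hypothetical names) with known line endings.
printf 'one\r\ntwo\r\n' > crlf-sample.txt   # Windows-style CRLF
printf 'one\ntwo\n' > lf-sample.txt         # Linux/MacOS-style LF

# Print the number of lines that contain a carriage return (#d).
# A count above zero suggests Windows line endings.
# grep exits with a non-zero status when nothing matches, so "|| true"
# keeps the function from reporting failure for an LF-only file.
count_cr_lines() {
  grep -c "$(printf '\r')" "$1" || true
}

count_cr_lines crlf-sample.txt   # prints 2
count_cr_lines lf-sample.txt     # prints 0
```

To try this on real input, save the text file locally and pass its name to the function in place of the sample file names.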

You don’t need to know the following, but in case you’re curious: CR originally meant carriage return and LF originally meant line feed, and they were related to how a printing device (such as a typewriter) positioned itself before writing a character. Both belong to a set of what are called control codes or, more narrowly, C0 control codes. You can read more about them at https://en.wikipedia.org/wiki/C0_and_C1_control_codes. The terms are not very meaningful in a modern context, but they nonetheless continue to be used as character names.

End of file

A plain text file may or may not have an EOL code at the end of the last line. Editing applications like Nano typically add an EOL at the end of the last line when they save a file, but a file you read from a remote site may end on either an EOL code after the last visible character or the last visible character itself, without a trailing EOL code. As it happens, the main movieData.txt file does not have an EOL after the final visible character, while the abbreviated movieData-short.txt file has a trailing #a.
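
If you want to find out whether a particular file has a trailing EOL, you can look at its last byte. The following is a rough sketch, assuming a Bash-like shell with printf and tail available. The file names and the ends_with_eol function are invented for illustration.

```shell
# Two tiny sample files (hypothetical names): one ends with LF, one does not.
printf 'one\ntwo\n' > with-eol.txt
printf 'one\ntwo' > without-eol.txt

# tail -c 1 outputs the last byte of the file. Command substitution strips
# trailing newlines, so the result is empty exactly when that byte is LF.
# (An empty file also yields an empty result, so this sketch assumes
# that the file is not empty.)
ends_with_eol() {
  [ -z "$(tail -c 1 "$1")" ]
}

for f in with-eol.txt without-eol.txt; do
  if ends_with_eol "$f"; then
    echo "$f: trailing EOL"
  else
    echo "$f: no trailing EOL"
  fi
done
```

Because both LF and CRLF line endings finish with #a, the same check works for files that follow either convention.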

As a way of being liberal in what you accept, your ixml grammar should match a document that either does or does not end with an EOL after the last line. For example, if your document consists of a sequence of lines you can match them with:

```
doc: line++newline, newline?.
```

This assumes that 1) you want the root element of the XML you’re creating to be <doc>, 2) you’ve defined the pattern line elsewhere and your definition does not include a trailing EOL, and 3) you’ve defined newline elsewhere. The part of this pattern before the comma (line++newline) matches one or more instances of a line pattern, with a single instance of a newline pattern between each pair of line patterns. By itself that will match consecutive lines, except that if there is an EOL after the last line the match will fail: although that part of the definition requires newlines between lines, it does not allow a newline at the end of the document. The part after the comma (newline?) says that a doc may or may not have a trailing EOL after the final line. This approach is liberal in accepting a sequence of lines whether or not there is an EOL code after the last one.
