Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2025-03-17T21:15:25+0000

Invisible XML assignment 1: Answer key

See the Assignment page for a discussion of the data and the task. Your solution does not have to match ours as long as it does what you want, but one possibility is:

films = film++newline, newline?.
film = title, tab, year, tab, country, tab, runtime.
-tab = -#9.
-quote = -#22.
-newline = (-#d?, #a).
title = quote?, ~[#22;#9;#d;#a]+, quote?.
year = ~[#9;#d;#a]+.
country = ~[#9;#d;#a]+.
runtime = ~[#9;#d;#a]+.

Here’s how it works:

Line 1: film (which is defined on Line 2) does not include a newline at the end, so film++newline means one or more instances of film with a newline between each pair of adjacent instances. This part of the pattern, then, means that each film is on its own line with no blank lines between them. See also our Invisible XML and ambiguity.

By itself, the first part of the pattern, before the comma, does not allow a newline after the last film line. Since the last line of the input file may or may not end in a newline, we complete our pattern here by allowing, after all of the films have been seen and processed, an optional newline.
Line 2: A film has four fields with tabs between them. As noted above, the individual film lines, as defined here, do not include newlines at the end because the newlines are managed instead with the film++newline pattern above.
Line 3: The Unicode value for a tab character is #9. The hyphen before the left side means that no <tab> tags should be included in the result. The hyphen before the right side means that the tab character itself should be removed, so that the four tagged pieces of information will follow one another immediately.
Line 4: The Unicode value for a quote (") character is #22. As with tabs, we remove both the <quote> tags and the quote character itself from the output. We’ll use the quote pattern when we match titles on Line 6.
Line 5: See the assignment for a discussion of how this pattern matches both Windows and Linux/MacOS newline patterns. The hyphen before the left side removes <newline> tags from the output. On the right side we remove the #d but retain the #a. Keeping the #a means that each line of the raw XML output will be rendered on a separate line, which isn’t strictly necessary (we’ll pretty-print the XML at the end anyway), but it makes the markup easier to read when we’re checking our work.
Line 6: Since quotation marks around titles are arbitrary and not informational, we match them (as optional) at the beginning and end of a title. A tilde (~) before a character class (demarcated by square brackets) means anything except these characters, so the content of a title between the optional quotation marks is one or more instances of any character except a tab, a quotation mark, or one of the newline characters. Alternatively, you could leave the quotation marks in place, planning to get rid of them later with XSLT, much as you’ll use XSLT to clean up the country values. If you take that approach, you can match titles the same way you match years, countries, and runtimes in lines 7–9.
Lines 7–9: Since the pattern for film fully isolates the year, country, and runtime, our pattern for those three components can match one or more characters that are not tabs or newlines. This matches entire year, country, and runtime values. There’s nothing wrong with using a positive pattern (that is, one that specifies what you do match, rather than the negative pattern we use here, which specifies what you don’t match), but in this case the negative pattern is simpler and we like being able to use the same pattern for multiple fields.
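The two alternatives mentioned for Line 6 and Lines 7–9 can be sketched as follows. This is only an illustration, not part of our solution: the first rule keeps any quotation marks inside the title (to be removed later with XSLT), and the second shows what a positive pattern for year might look like. The digit rule is a hypothetical helper that we introduce here, and the four-digit year is an assumption about the data that you would need to check against the actual file.

```ixml
{ Variant for Line 6: match a title like the other fields,
  leaving any quotation marks in place for later cleanup with XSLT }
title = ~[#9;#d;#a]+.

{ Variant for Line 7: a positive pattern, ASSUMING every year is exactly four digits }
year = digit, digit, digit, digit.

{ Hypothetical helper: the hyphen suppresses <digit> tags but keeps the digit itself }
-digit = ["0"-"9"].
```

If you use the title variant, the quote rule on Line 4 is no longer referenced and can be removed. The positive year pattern is stricter than the negative one: any year value that is not exactly four digits would cause the parse to fail, which may or may not be what you want.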