Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2025-03-17T21:15:25+0000
See the Assignment page for a discussion of the data and the task. Your solution does not have to match ours as long as it does what you want, but one possibility is:
films = film++newline, newline?.
film = title, tab, year, tab, country, tab, runtime.
-tab = -#9.
-quote = -#22.
-newline = (-#d?, #a).
title = quote?, ~[#22; #9;#d;#a]+, quote?.
year = ~[#9;#d;#a]+.
country = ~[#9;#d;#a]+.
runtime = ~[#9;#d;#a]+.
Here’s how it works:
Line 1: film (which will be defined on Line 2) does not include a newline at the end, so film++newline means one or more instances of film with a newline between each two instances of film. This part of the pattern, then, means that each film is on its own line with no blank lines between them. See also our Invisible XML and ambiguity page.
By itself, the first part of the pattern, before the comma, does not allow a newline after the last film line. Since the last line of the input file may or may not end in a newline, we complete our pattern here by allowing, after all of the films have been seen and processed, an optional newline.
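As a sketch of what the ++ operator abbreviates (this expansion follows the general Invisible XML definition of ++ and is not part of our solution), the rule on Line 1 could be written out longhand as:

```ixml
{ Longhand equivalent of: films = film++newline, newline?. }
films = film, (newline, film)*, newline?.
```

Both forms accept the same input; the ++ version is shorter and makes the role of newline as a separator easier to see.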
Line 2: A film has four fields with tabs between them. As noted above, the individual film lines, as defined here, do not include newlines at the end because the newlines are managed instead with the film++newline pattern above.
Line 3: The Unicode value for a tab character is #9. The hyphen before the left side means that no <tab> tags should be included in the result. The hyphen before the right side means that the tab character itself should be removed, so that the four tagged pieces of information will follow one another immediately.
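To make the two hyphens easier to tell apart, here is a sketch of the three ways this rule could be marked. Only one definition of tab may be active in a grammar at a time, so the alternatives are shown inside ixml comments (curly braces), and the descriptions of the output are informal.

```ixml
{ No hyphens: each tab is wrapped in a tab element and the character is kept.
  tab = #9. }

{ Hyphen on the left only: no tab element, but the tab character stays
  in the output as text between the field elements.
  -tab = #9. }

{ Hyphens on both sides (our solution): no tab element and no tab character. }
-tab = -#9.
```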
Line 4: The Unicode value for a quote (") character is #22. As with tabs, we remove both the <quote> tags and the quote character itself from the output. We’ll use the quote pattern when we match titles on Line 6.
Line 5: See the assignment for a discussion of how this pattern matches both Windows and Linux/MacOS newline patterns. The hyphen before the left side removes <newline> tags from the output. On the right side we remove the #d but retain the #a. Keeping the #a means that each line of the raw XML output will be rendered on a separate line, which isn’t strictly necessary (we’ll pretty-print the XML at the end anyway), but it makes the markup easier to read when we’re checking our work.
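If you would rather not keep the line feed either, a variant (an alternative sketch, not our solution) marks both characters for deletion:

```ixml
{ Variant: delete both the optional carriage return and the line feed,
  so no newline characters from the input are copied into the raw XML. }
-newline = -#d?, -#a.
```

The raw output is then harder to read before pretty-printing, which is why our version keeps the #a.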
Line 6: Since quotation marks around titles are arbitrary and not informational, we match them (as optional) at the beginning and end of a title. A tilde (~) before a character class (demarcated by square brackets) means “anything except these characters”, so the content of a title between the optional quotation marks is one or more instances of any character except a tab, a quotation mark, or one of the newline characters. Alternatively, you could leave the quotation marks in place, planning to get rid of them later with XSLT, much as you’ll use XSLT to clean up the country values. If you take that approach, you can match titles the same way you match years, countries, and runtimes on Lines 7–9.
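A sketch of that alternative, assuming you will strip the quotation marks later with XSLT (the XSLT step is not shown here):

```ixml
{ Alternative title rule: quotation marks, if present, stay in the title text. }
title = ~[#9;#d;#a]+.
```

With this version the quote rule from Line 4 is no longer referenced and can be removed from the grammar.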
Lines 7–9: Since the pattern for film fully isolates the year, country, and runtime, our pattern for those three components can match one or more characters that are not tabs or newlines. This matches entire year, country, and runtime values. There’s nothing wrong with using a positive pattern (that is, one that specifies what you do match, rather than the negative pattern we use here, which specifies what you don’t match), but in this case the negative pattern is simpler and we like being able to use the same pattern for multiple fields.
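For comparison, a positive pattern for year might look like the sketch below. It assumes every year in the data consists only of ASCII digits, which you should check against the input file; digit is a helper name introduced here for illustration.

```ixml
{ Positive pattern: a year is one or more digits. }
year = digit+.
{ The hyphen suppresses digit tags; the digit characters themselves are kept. }
-digit = ["0"-"9"].
```

If any year value contained something else (a space, for example), this rule would fail to match where the negative pattern succeeds.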