Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2023-02-08T22:20:31+0000
Assume that we’ve been given a plain-text file like the Project Gutenberg EBook of The Blithedale
Romance, by Nathaniel Hawthorne, and we want to convert it to XML,
but we don’t want to type all of the angle brackets manually. (Note that this site
sometimes shows you a pop-up welcome screen and a list of different versions of the
file, instead of taking you to the plain-text one directly. If that happens, click
“OK” on the “Welcome” pop-up and then select the version labeled “Plain Text UTF-8”.)
In this case Project Gutenberg makes the same book available in
HTML, and in Real Life we’d probably convert from HTML to XML (using XSLT, which we’ll
learn later in the semester) rather than from plain text, but since there are situations
where all we have is plain text, we’ll ignore the HTML version on the Gutenberg site,
pretend that all they provide is plain text, and work with that. So what can we tag
automatically, with global find-and-replace operations? Some of the markup we might want
to introduce for analytical purposes might require us to touch every word of the text,
but, at a minimum, we can autotag chapters, chapter titles, paragraphs, and quotations
using regex tools, and that’s the goal of the present assignment.
Select the plain-text version of the document and open it in <oXygen/>. Then cut
out the front matter (before the main title) and the back matter (after the last
line of the text of the novel, which is “I--I myself--was in love--with--Priscilla!”).
In Real Life we might want to mark those parts up eventually and reintroduce them
into the XML as metadata, but for this assignment we’ll just delete everything that
isn’t part of the text of the novel.
There’s more than one way to accomplish this task, so it’s fine if your solution is different from ours as long as you use regex in a meaningful way to get the output you want, but one way to approach the problem is as follows:
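The plain-text file could, at least in principle, contain characters that have special meaning in XML: the ampersand and the angle brackets. You need to search for those and replace them with their corresponding XML entities (&amp; for the ampersand, &lt; and &gt; for the angle brackets), replacing the ampersands first so that you don’t re-escape the ampersands inside the entities you’ve just written. As it turns out, there aren’t any instances of these characters in this document.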
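The same check can be sketched outside <oXygen/>. The following Python is only an illustration, not part of the assignment; the function name escape_xml_specials and the sample string are our own inventions.

```python
import re

def escape_xml_specials(text: str) -> str:
    """Replace &, <, and > with XML entities (hypothetical helper)."""
    # The ampersand must be handled first; otherwise the "&" in "&lt;"
    # and "&gt;" would itself be escaped in a later step.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text.replace(">", "&gt;")

sample = "Zenobia & Priscilla <arrive>"  # made-up sample text

# List every special character found, to see whether escaping is needed.
print(re.findall(r"[&<>]", sample))
print(escape_xml_specials(sample))
```

If the first print shows an empty list for your file, as we'd expect for this novel, there is nothing to escape.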
The blank lines are pseudo-markup that tell us where titles and paragraphs begin and end,
but in some cases there are multiple blank lines in a row (for example, there are two
blank lines between the title and the word “by”). Those extra blank lines don’t
tell us anything useful, so we’ll start by getting rid of them. We search for
\n{3,} (three or more consecutive new line characters) and replace them with \n\n
(exactly two new line characters). Note that the numerical quantifiers work in
searches but not in replaces, so you need to write \n\n (rather than \n{2}) in the
replace box.
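For comparison, here is a minimal Python sketch of the same replacement. It assumes that the text uses plain \n line endings (a file with Windows-style \r\n endings would need to be normalized first); the function name and sample string are hypothetical.

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Reduce any run of three or more newlines to exactly two."""
    # The quantifier belongs in the search pattern only; the replacement
    # is written out as two literal newline characters.
    return re.sub(r"\n{3,}", "\n\n", text)

# Two blank lines (three newlines) between the title and "by".
sample = "THE BLITHEDALE ROMANCE\n\n\nby\n\nNathaniel Hawthorne"
print(collapse_blank_lines(sample))
```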
What’s left after deleting the beginning and ending metadata and extra blank lines is
mostly (except for the stuff at the top) a bunch of chapter titles and paragraphs,
separated from one another by single blank lines, and we can use a regex to find all
blank lines (\n{2}) and replace them with the sequence </p>\n<p>. For human legibility,
we insert a new line character between the tags, instead of just outputting the end tag
followed immediately by the start tag, so that each paragraph will start on a new line.
We then add the <p> start tag before the first paragraph and the </p> end tag after
the last one manually.
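A rough Python equivalent of this step might look like the following. The function name tag_paragraphs and the sample text are our own, and the code adds the first start tag and the last end tag programmatically instead of by hand.

```python
import re

def tag_paragraphs(text: str) -> str:
    """Turn blank-line-separated blocks into <p> elements."""
    # Trim leading and trailing newlines so we don't create empty paragraphs.
    body = re.sub(r"\n{2}", "</p>\n<p>", text.strip("\n"))
    # These two tags correspond to the ones we add manually in <oXygen/>.
    return "<p>" + body + "</p>"

# Made-up sample: a chapter title and two paragraphs.
sample = "I. OLD MOODIE\n\nFirst paragraph,\nstill the first.\n\nSecond paragraph."
print(tag_paragraphs(sample))
```

Note that the chapter title comes out tagged as a paragraph at this stage.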
We’ve erroneously tagged chapter titles as paragraphs, and we can fix that by searching
for <p>([IVX]+\..*)</p> and replacing it with <title>\1</title>. The parentheses in
the regex capture the information between them, which is everything except the
<p> tags, and we can write the captured content into the new <title> tags by using
\1 to retrieve the first (and, in this case, only) capture group. The <p> tags serve
to anchor the search so that it has to match everything between the start and end tag.
In our pattern, the first thing after the start tag is a character class consisting of
one or more instances of I, V, or X, that is, a Roman numeral. To avoid
matching one-line paragraphs that begin with the pronoun “I”, we require the Roman
numeral to be followed by a literal period, and since the dot in regex means “any
character except a new line”, we have to precede it with a backslash so that it
will lose that special meaning and match only a literal dot. The second dot, though,
does mean any character except a new line, and the asterisk tells the regex processor to
include in the match all characters on the line after the literal period following the
Roman numeral and up to the closing </p> tag.
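The title pattern can be tried out in Python with the sketch below, which uses the same search and replace expressions. The function name retag_titles and the sample lines are hypothetical.

```python
import re

def retag_titles(text: str) -> str:
    """Retag one-line paragraphs that start with a Roman numeral and a period."""
    # Without the DOTALL flag the unescaped dot stops at a new line,
    # so the match cannot run past the end of the title line.
    return re.sub(r"<p>([IVX]+\..*)</p>", r"<title>\1</title>", text)

sample = (
    "<p>I. OLD MOODIE</p>\n"
    "<p>I was returning home.</p>\n"   # begins with the pronoun, not a numeral
    "<p>XIV. A MADE-UP TITLE</p>"
)
print(retag_titles(sample))
```

The second line is left alone because the “I” is followed by a space and not by a period.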
If all we do here is retag the chapter titles correctly, we wind up with a sequence of
<title> and <p> elements, but nothing that records that the
chapters themselves are a hierarchical level in the structure of the document. That is,
the body of the novel does not contain just a combination of titles and paragraphs; it
contains chapters, and it’s the chapters that contain the titles and paragraphs. We can
fix that by searching for our <title> start tags
(since we know that they’re at the beginning of each chapter) and replacing them with a
</chapter> end tag, a new line, a <chapter> start tag, another new line, and then the
<title> start tag itself, so that we wind up with a structure like the following for
each chapter:
<chapter>
<title><!-- title goes here --></title>
<p><!-- text of first paragraph goes here --></p>
<!-- more paragraphs -->
</chapter>
We then have to fix the first and last chapters manually, removing the spurious
</chapter> end tag from before the first chapter and adding the missing </chapter>
end tag after the last chapter. This is similar to the way we tagged the sonnets in the
first assignment, where we used the sonnet numbers as milestones to mark the
boundary between a preceding and a following unit, and whenever we use that sort of
strategy, we have to fix the first and last units manually because there is no
“between” in their case. In this case it’s the titles that function as the
milestones, since the presence of a title signals the end of the preceding chapter and
the beginning of a new one.
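The milestone strategy, including the cleanup of the first and last chapters, can be sketched in Python as follows. This is an illustration under our own assumptions: the function name wrap_chapters is invented, and the cleanup that we do by hand in <oXygen/> is done here in code.

```python
def wrap_chapters(text: str) -> str:
    """Use each <title> start tag as a milestone between chapters."""
    if "<title>" not in text:
        return text  # nothing to wrap
    marked = text.replace("<title>", "</chapter>\n<chapter>\n<title>")
    # Remove the spurious end tag before the first chapter ...
    marked = marked.replace("</chapter>\n", "", 1)
    # ... and add the missing end tag after the last chapter.
    return marked + "\n</chapter>"

# Made-up sample with two very short chapters.
sample = "<title>I. ONE</title>\n<p>x</p>\n<title>II. TWO</title>\n<p>y</p>"
print(wrap_chapters(sample))
```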
To tag quotations, we have to check “Dot matches all” in the Find-and-replace dialog
in order to capture matches that might cross new line characters. If we do that and
then try ".+", it might look as if we would match a quotation
mark and then one or more characters, up to the next quotation mark. That doesn’t do
what we want, though, because by default regex matches are greedy, that is,
they select the longest possible match, so we wind up matching everything from the very
first quotation mark in the entire document up to the last, as if there were only one
very long quotation in the novel.
Here’s how greediness works: If you match a pattern like ".+", you’re asking to match
one or more consecutive characters between quotation marks, and if there are multiple
matches that conform to the pattern, the regex processor chooses the longest one. Since
quotation marks are characters, too, if your text reads
He said "hello" and she said "good bye", the greedy match ".+" will start at the
quotation mark before “hello”, but when it has to decide which of the three
following quotation marks should delimit the end of the quotation, it chooses the one
after “good bye”, as if there were one long quotation instead of two short ones
with a narrative interpolation between them. Regex patterns are greedy by default, but
you can make them non-greedy by putting a question mark after the repetition indicator,
so that ".+?" will match each quotation individually.
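The difference between the greedy and non-greedy patterns is easy to see in a short Python experiment that uses the sample sentence above (the variable names are our own):

```python
import re

sentence = 'He said "hello" and she said "good bye"'

# Greedy: one match, running from the first quotation mark to the last.
greedy = re.findall(r'".+"', sentence)

# Non-greedy: each quotation is matched separately.
lazy = re.findall(r'".+?"', sentence)

print(greedy)  # ['"hello" and she said "good bye"']
print(lazy)    # ['"hello"', '"good bye"']
```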
In other contexts a question mark is a repetition indicator that means “zero or one”
(that is, “optional”), but here it means “non-greedy”. How does a regex parser
know when a question mark means “zero or one” and when it means “non-greedy”?
The answer is that it only means “non-greedy” when it occurs immediately after another
repetition indicator, such as a plus sign or an asterisk. That’s the only context in
which non-greediness is an issue (the parser has to know
where to stop if there are multiple possible matches, of different lengths), and it
wouldn’t make sense to put a repetition indicator after another repetition indicator, so
it can’t mean “zero or one” there. The technical term for using the same symbol
with different meanings in different contexts is overloading. It’s an
efficient way to avoid reserving more symbols than you need for special functions, but
it can be confusing for humans until we realize that we need to look not only at the
symbol, but also at the context. It might make more sense if you remember that normal
English orthography has overloading, too. For example, a single curly apostrophe can
function as an apostrophe (in contractions like “don’t”, for example, or possessive
expressions like “Santa’s reindeer”), but it functions as a closing quotation mark
when there’s a single opening curly quotation mark earlier (e.g., He said ‘XML stands
for extensible markup language’).
When we tag the quotation by writing it between <q>
tags, we want to remove and discard the original quotation marks, since those were only
pseudo-markup, used to represent the beginning and end of a quotation in plain text,
and in the XML version we’re replacing them with real markup. We do that by using
parentheses to capture only the content between the quotation marks, and we then write
that capture group into the output. Our final search regex is "(.+?)" and we replace
it with <q>\1</q>.
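In Python the final quotation step might be sketched as follows. The re.DOTALL flag plays the role of the “Dot matches all” checkbox; the function name tag_quotations and the sample string are hypothetical.

```python
import re

def tag_quotations(text: str) -> str:
    """Replace straight double quotation marks with <q> tags."""
    # DOTALL lets the dot match new lines, so a quotation may span lines;
    # the non-greedy +? stops at the nearest closing quotation mark.
    return re.sub(r'"(.+?)"', r"<q>\1</q>", text, flags=re.DOTALL)

# Made-up sample; the second quotation crosses a line break.
sample = 'He said "hello" and she said "good\nbye"'
print(tag_quotations(sample))
```

This assumes that quotation marks are balanced and are all straight double quotation marks, so it is worth looking through the results before moving on.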
At this point we clean up the results by fixing the starting material (title, author,
table of contents) by hand and adding a root element. We then save the file as XML (with
a .xml filename extension), reopen it in <oXygen/>, and check it for
well-formedness. Once it’s well formed, you can also pretty-print it to improve the
indentation.
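If you want a quick well-formedness check outside <oXygen/>, the parser in the Python standard library can serve as a rough substitute. In this sketch the function name is_well_formed and the root element name novel are our own assumptions, not part of the assignment.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<novel><chapter><title>I. ONE</title><p>x</p></chapter></novel>"
bad = "<novel><chapter><p>x</chapter></novel>"  # the <p> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```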