Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2023-02-13T18:00:36+0000
Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.
Your task is to begin to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. As is appropriate for a play, we eventually want the XML to identify any of the following that it finds: acts, scenes, settings, speeches, speakers, and stage directions. Our goal is to use search and replace operations with regular expressions to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). We want to avoid manual tagging except in situations that occur so rarely that they don’t justify search and replace operations, such as tagging the title of the play or creating a root element.
Delete the Gutenberg boilerplate manually.
Use regex to search for and replace reserved characters. What are they, and how many did you find?
In this case there are no ampersands or less-than (or greater-than) signs, so nothing to replace. It’s nonetheless important to look for these characters at the beginning of any up-conversion, and to replace them with character entities before you add any real markup.
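The check for reserved characters can also be sketched outside the editor. The following Python is a minimal illustration, not part of the original exercise; the function names and the sample string are our own, hypothetical choices.

```python
import re


def count_reserved(text: str) -> dict[str, int]:
    """Count XML-reserved characters before any real markup is added."""
    return {ch: len(re.findall(re.escape(ch), text)) for ch in "&<>"}


def escape_reserved(text: str) -> str:
    """Replace reserved characters with entities.

    The ampersand goes first, so that the ampersands introduced by
    the other entities are not escaped a second time.
    """
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text.replace(">", "&gt;")


# Hypothetical sample; the play itself contains none of these characters.
sample = "Fish & chips < caviar"
counts = count_reserved(sample)
escaped = escape_reserved(sample)
print(counts)
print(escaped)
```

Running the count first tells you whether the replacement step is needed at all.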
Use regex to collapse multiple blank lines, leaving only one blank line between lines of the input document.
Search for \n{3,}
and replace with
\n\n
. Quantified expressions (inside
curly braces) work in search patterns but not in replacement patterns, so we
have to spell out the two newlines we want in the output. For this step it
doesn’t matter whether dot-matches-all is checked or not, since there are no
dots in the match pattern, but we normally leave it unchecked as a default,
and check it only when that behavior is required.
Use regex to remove multiple consecutive space characters to leave only single spaces.
Search for +
(that’s a literal space
character followed by a plus sign) and replace with a single space. As with
the preceding step, there is no dot in the match pattern, so whether
dot-matches-all is checked doesn’t affect the behavior.
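The two whitespace clean-up steps (blank lines, then spaces) can be sketched together in Python. This is an illustration under our own assumptions: the function name and the sample text are hypothetical, and Python's re module stands in for the editor's search-and-replace dialog.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of blank lines, then runs of spaces."""
    # Three or more newlines become exactly two (one blank line).
    text = re.sub(r"\n{3,}", "\n\n", text)
    # One or more literal spaces become a single space.
    return re.sub(r" +", " ", text)


# Hypothetical sample in the style of the Gutenberg text.
sample = "FIRST ACT\n\n\n\nSCENE\n\nLANE.  Yes,   sir."
normalized = normalize_whitespace(sample)
print(normalized)
```

Neither pattern contains a dot, so no dot-matches-all flag is involved in either call.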
Manually select and cut everything before the line that says FIRST ACT
and save it to a different file. You’ll tag that separately later, as part
of the test, and then paste it back in.
Italics are represented by underscores. There are seven instances, which seem to fall into three types:
Emphasis, e.g., I believe it _is_ a very pleasant state,
sir.
Newspaper title, e.g., … will appear in the _Morning Post_ on
Saturday …
Individual punctuation marks that seem to have been italicized for no
apparent reason, e.g., [Lane goes out_._]
Use regex to:
Strip out the underscores around the single punctuation marks without replacement.
Since this is the only one-character italicized pattern, you can use
_(.)_
and replace with
\1
. A stricter strategy would
require the single character to be a punctuation mark, which you can
do by matching _(\p{P})_
(see Unicode
regular expressions for a more detailed explanation of how
this works). Since there is a dot involved it’s good general
practice to uncheck dot-matches-all, but it won’t matter in this
case because there are no instances of two underscores surrounding
just a newline.
Assume that if the content begins with an upper-case letter it
represents a newspaper title and use regex to replace the
underscores with <title>
tags.
Search for _([A-Z].*?)_
and
replace with
<title>\1</title>
.
Dot-matches-all should be checked because a newspaper title could
span a newline, although that happens not to be the case in this
document. Not checking dot-matches-all and happening to get the
desired output because the document happens not to contain a
possible situation that your pattern doesn’t handle is called a
brittle solution; the opposite is robust.
It’s good practice to write robust patterns that will handle
situations that might reasonably occur, instead of hoping that they
don’t.
Assume that the rest are emphasis and replace the underscores with
<emph>
tags.
Since you have removed all underscores except the ones intended to
represent emphasis, you can now match
_(.+?)_
and replace with
<emph>\1</emph>
. You
should check dot-matches-all because an emphasized string could span
a newline, although that happens not to occur in this
document.
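The three italics passes depend on their order, which a short Python sketch can make concrete. The function name and the sample string are hypothetical. The sample deliberately lets the newspaper title span a newline, which does not happen in the play. As far as we know, Python's built-in re module does not support \p{P}, so the sketch uses the looser single-character pattern; the third-party regex module would be one way to try the stricter one.

```python
import re


def tag_italics(text: str) -> str:
    """Apply the three underscore passes in order."""
    # 1. Drop underscores around a single character. No DOTALL here,
    #    so the dot cannot match a newline.
    text = re.sub(r"_(.)_", r"\1", text)
    # 2. Content that starts with an upper-case letter is a title.
    #    DOTALL lets the title span a line break.
    text = re.sub(r"_([A-Z].*?)_", r"<title>\1</title>", text, flags=re.DOTALL)
    # 3. Whatever is still between underscores is emphasis.
    return re.sub(r"_(.+?)_", r"<emph>\1</emph>", text, flags=re.DOTALL)


# Hypothetical sample assembled from the three example types.
sample = (
    "I believe it _is_ a very pleasant state, sir. [Lane goes out_._] "
    "It will appear in the _Morning\nPost_ on Saturday."
)
tagged = tag_italics(sample)
print(tagged)
```

Running the emphasis pass before the title pass would tag the newspaper title as emphasis, which is why the most general pattern comes last.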
Stage directions (that is, actions performed by characters) are inside square
brackets, sometimes on their own line and sometimes inside a speech. A
speech may have more than one stage direction, and stage directions may
cross line boundaries. Nothing else is inside square brackets. Use regex to
tag all stage directions as <stage>
,
removing the square brackets.
Match \[(.+?)\]
and replace with
<stage>\1</stage>
.
Dot-matches-all must be checked, because stage directions may cross line
boundaries. Square brackets are special characters in regex (their normal
role is to delimit character classes, like the one you used to match Roman
numerals elsewhere), so to match a literal square bracket character you need
to escape it by preceding it with a backslash.
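To see why dot-matches-all matters for stage directions, here is a minimal Python sketch. The function name and the sample lines are hypothetical; re.DOTALL plays the role of the dot-matches-all checkbox.

```python
import re

STAGE = r"\[(.+?)\]"  # escaped brackets, lazy content


def tag_stage_directions(text: str) -> str:
    """Replace [ ... ] with <stage> ... </stage>, across line breaks."""
    return re.sub(STAGE, r"<stage>\1</stage>", text, flags=re.DOTALL)


# Hypothetical sample: the first direction crosses a line boundary.
sample = "LANE. Yes, sir. [Hands them on a\nsalver.]\n\n[Lane goes out.]"
tagged = tag_stage_directions(sample)

# Without DOTALL the dot stops at the newline, so the first
# direction is left untagged.
without_dotall = re.sub(STAGE, r"<stage>\1</stage>", sample)
print(tagged)
print(without_dotall)
```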
As a way of learning about how acts and scenes are labeled, search for every
line that contains no lower-case letters with
^[^a-z\n]+$
. (This is a negated
character class; see the discussion at https://www.regular-expressions.info/charclass.html.) Note that
acts begin with the act number, then SCENE
, and then they end with
ACT DROP
, except that the last one ends with TABLEAU
.
Since every act has just one scene, use a regex to remove the lines that say
SCENE
or ACT DROP
or TABLEAU
(including their
trailing newlines) by replacing them with nothing. You can do this with a
single regex.
Search for (SCENE|TABLEAU|ACT DROP)\n*
and replace with nothing. Because there is no dot in the match pattern it
doesn’t matter whether dot-matches-all is checked.
The parentheses create a capture group because parentheses always create a
capture group, but that’s not why we use them here. We use them here to
create a subpattern, so that we match any of the options in the
parenthesized or-group plus any immediately following newline characters.
The string TABLEAU
at the end doesn’t have any following newline
characters because it’s at the very end of the document, so we match zero or
more newlines to consume those that are present after the other matches, but
ensure that we’ll also match against the final TABLEAU
because the
newlines are optional.
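Both the survey of upper-case lines and the removal step can be tried on a small sample in Python. This is a sketch under assumptions: the sample text is a hypothetical, shortened stand-in for the play, and re.MULTILINE is our way of making ^ and $ match at line boundaries.

```python
import re

# Hypothetical miniature of the play's structure.
sample = (
    "FIRST ACT\n\nSCENE\n\nMorning-room in Algernon's flat.\n\n"
    "LANE. Yes, sir.\n\nACT DROP\n\n"
    "SECOND ACT\n\nSCENE\n\nGarden.\n\nTABLEAU"
)

# Lines that contain no lower-case letters (and are not empty).
uppercase_lines = re.findall(r"^[^a-z\n]+$", sample, flags=re.MULTILINE)

# Remove the three labels plus any newlines that follow them.
# The final TABLEAU has no trailing newline, which \n* tolerates.
cleaned = re.sub(r"(SCENE|TABLEAU|ACT DROP)\n*", "", sample)

print(uppercase_lines)
print(cleaned)
```

The parentheses here only group the alternatives; the captured text is never written back.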
Now tag all of the act labels (lines that read FIRST ACT
, etc.) as
<act>
elements. These aren’t really
acts; they’re just the headings for acts, and we’ll fix that later.
Search for ^.+?ACT$
and replace with
<act>\0</act>
.
Dot-matches-all should be unchecked, since you want to match only a single
line.
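A Python sketch of the act-label step follows. The function name and sample are hypothetical. Note one difference in notation: where the replacement above uses \0 for the entire match, Python's re module spells that reference \g<0>.

```python
import re


def tag_act_labels(text: str) -> str:
    """Wrap each line that ends in ACT in <act> tags."""
    # MULTILINE makes ^ and $ match at line boundaries; DOTALL is
    # deliberately off so that a match stays on one line.
    return re.sub(r"^.+?ACT$", r"<act>\g<0></act>", text, flags=re.MULTILINE)


# Hypothetical sample with two act headings.
sample = "FIRST ACT\n\nMorning-room.\n\nSECOND ACT\n\nGarden."
tagged = tag_act_labels(sample)
print(tagged)
```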
There appears to be a description of the setting, in the form of a plain-text
paragraph, at the beginning of each act. We can find it because it occurs
immediately after the "ACT" label, and those are the only places where the
string "ACT", all in upper case, occurs. Find these paragraphs using regex
and tag them as <setting>
.
With dot-matches-all checked, match
(</act>\n+)(.+?)\n\n
and replace
with
\1<setting>\2</setting>\n\n
.
The first capture group helps us find the paragraph after the act labels
that we just created. We capture that and then put it back unchanged; we’re
matching it only to find our way to the right place, and not because we need
to process it. (We could, alternatively, use look-behind here, but we find
it easier to capture and replace than to remember the syntax for a
look-behind pattern.) The second capture group matches any sequence of
characters, including newlines, up to the first instance of two consecutive
newlines. This is the setting paragraph, so we write it back into the output
with <setting>
tags around
it.
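The capture-and-put-back technique for the setting paragraph can be sketched in Python as follows. The function name and the sample text are hypothetical; the two capture groups correspond to the ones described above.

```python
import re


def tag_settings(text: str) -> str:
    """Tag the paragraph that follows each act label as <setting>."""
    return re.sub(
        r"(</act>\n+)(.+?)\n\n",           # group 1: label end, group 2: paragraph
        r"\1<setting>\2</setting>\n\n",    # group 1 goes back unchanged
        text,
        flags=re.DOTALL,                   # the paragraph may span lines
    )


# Hypothetical sample: a two-line setting followed by a stage direction.
sample = (
    "<act>FIRST ACT</act>\n\n"
    "Morning-room in Algernon's flat. The room is\nluxuriously furnished.\n\n"
    "<stage>Lane is arranging tea.</stage>\n\n"
)
tagged = tag_settings(sample)
print(tagged)
```

Because the second group is lazy, it stops at the first blank line rather than running on to a later one.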
Tagging with regular expressions relies on consistent patterns in the plain text, which means that if the plain text has any inconsistencies, the regular expression matching becomes less straightforward. To tag speeches we identified patterns that would help us recognize a speech, and then patterns that would help us recognize the speaker name within the speech, and that method misfires in three places because the patterns we relied on were not consistent. In Real Life we would fix the few exceptional cases by hand, but below we also describe how to adapt our regular expressions to deal with the exceptions more systematically.
We found it easier to define the beginnings and ends of speeches separately, so we tagged speeches in two steps:
Dot matches all doesn’t matter for the first step because there aren’t any
dots in our pattern. We match
\n\n([^<])
and replace it with
\n\n<speech>\1
. The match pattern takes
advantage of the fact that we’ve already tagged everything that begins after
two newlines that isn’t a speech: the act labels and stand-alone stage
directions. We use a negated character class to match two newlines followed
by anything that isn’t the beginning of a tag and write that back into the
replacement, except that we insert a
<speech>
start-tag after the
newlines.
With dot matches all checked (because speeches can cross multiple lines), we
match (<speech>.+?)(\n\n|\z)
and replace it
with \1</speech>\n\n
. This uses the
<speech>
start-tag that we inserted
in the previous step to find the beginnings of speeches and we match
everything from there through the first instance of two newlines, which is
the end of the speech. We write the first capture group back into the
replacement, except that we insert the </speech>
end-tag
before the newlines. The alternative \z (end of input) is there because the
last speech may not be followed by newlines. A bare \n* would not do that job:
because .+? is lazy and \n* can match nothing, the match would stop after the
first character that follows the start-tag.
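The two speech-tagging steps can be sketched in Python. The function name and the sample are hypothetical. The second pattern in this sketch ends with an explicit alternative (two newlines or end of input, spelled \Z in Python) rather than with \n*. That is our assumption about one way to keep the trailing newlines optional while still letting the lazy .+? run to the end of the speech.

```python
import re


def tag_speeches(text: str) -> str:
    """Insert <speech> start-tags, then the matching end-tags."""
    # Step 1: two newlines followed by anything that is not the
    # start of a tag mark the beginning of a speech.
    text = re.sub(r"\n\n([^<])", r"\n\n<speech>\1", text)
    # Step 2: a speech runs to the first blank line or, for the
    # last speech, to the end of the input (an assumption, see above).
    return re.sub(
        r"(<speech>.+?)(?:\n\n|\Z)",
        r"\1</speech>\n\n",
        text,
        flags=re.DOTALL,
    )


# Hypothetical sample: a stage direction, then two speeches.
sample = (
    "<stage>Lane is arranging tea.</stage>\n\n"
    "Algernon. Did you hear what I was playing,\nLane?\n\n"
    "Lane. I didn't think it polite to listen, sir."
)
tagged = tag_speeches(sample)
print(tagged)
```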
We’ve tagged the speeches and now need to tag the speakers within them as
<speaker>
and the spoken text as
<text>
. We assume (not entirely correctly)
that the speaker name begins immediately after the
<speech>
start-tag and continues until the
first literal dot. With dot matches all checked, we match the entire speech with
<speech>(.+?)\. (.+?)</speech>
, capture
the speaker and the spoken lines in two capture groups, and tag them with
<speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>
.
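As a minimal Python sketch of the speaker step, assuming a speech that follows the expected pattern (speaker name, dot, space, spoken text); the names SPEECH and tag_speakers and the sample are hypothetical:

```python
import re

# Speaker: everything up to the first dot followed by a space.
# Text: everything from there to the end-tag.
SPEECH = re.compile(r"<speech>(.+?)\. (.+?)</speech>", flags=re.DOTALL)


def tag_speakers(text: str) -> str:
    """Split each speech into <speaker> and <text> children."""
    return SPEECH.sub(
        r"<speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>", text
    )


# Hypothetical, well-behaved speech.
sample = "<speech>Algernon. Did you hear what I was playing,\nLane?</speech>"
tagged = tag_speakers(sample)
print(tagged)
```

The dot and space that separate the two groups are matched but not captured, so they drop out of the output.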
After we wrap a root element around our document and validate it as XML, <oXygen/> notifies us that there are two well-formedness errors, which are actually three tagging errors. The original plain text is:
Gwendolen and Cecily [Speaking together.] Your Christian names are still
an insuperable barrier. That is all!

Jack and Algernon [Speaking together.] Our Christian names! Is that
all? But we are going to be christened this afternoon.

Gwendolen. [To Jack.] For my sake you are prepared to do this terrible
thing?
Our tagging method yields incorrect results because the speaker names Gwendolen
and Cecily
and then Jack and Algernon
are not followed by a dot and,
immediately, a space, which we had relied on. With the speeches (but not yet the
speakers) tagged we see:
<speech>Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier. That is all!</speech>

<speech>Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>

<speech>Gwendolen. <stage>To Jack.</stage> For my sake you are prepared to do this terrible
thing?</speech>
When we tag speakers, we wind up with the incorrect:
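<speech>
<speaker>Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier</speaker>
<text>That is all!</text>
</speech>

<speech>
<speaker>Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>

<speech>Gwendolen</speaker>
<text><stage>To Jack.</stage> For my sake you are prepared to do this terrible
thing?</text>
</speech>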
Since we discovered this inconsistency only after running the find-and-replace
operation and it affects only three consecutive speeches, in Real Life we would fix
it manually. A reasonably robust alternative, had we known earlier, would have been
to tag the speeches, then fix the two places where speakers are not followed by a
dot, and then tag the speakers. Those two places are the only instances where a
stage direction reads [Speaking together.]
, so we could use that pattern to
find the moments that require repair. Specifically, we could interpose a search for
 <stage>Speaking together\.</stage>
(note the leading space character) and replace it with
.\0
. This inserts a literal dot after the
speaker names, which regularizes the plain text and enables our original find-and-replace
operations to match and tag all speakers correctly.
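The misfire and the repair can both be reproduced in a short Python sketch. The function names are hypothetical, and Python writes the whole-match reference as \g<0> where the replacement above uses \0. The sample is the three speeches as they stand after speech tagging.

```python
import re

SPEECH = re.compile(r"<speech>(.+?)\. (.+?)</speech>", flags=re.DOTALL)


def tag_speakers(text: str) -> str:
    """Speaker step, as described earlier in this walkthrough."""
    return SPEECH.sub(
        r"<speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>", text
    )


def repair_joint_speakers(text: str) -> str:
    """Insert a dot before ' <stage>Speaking together.</stage>'."""
    # Note the leading space in the pattern; the dot in the
    # replacement is literal.
    return re.sub(r" <stage>Speaking together\.</stage>", r".\g<0>", text)


def speaker_names(text: str) -> list[str]:
    """Helper for inspection: list the content of every <speaker>."""
    return re.findall(r"<speaker>(.+?)</speaker>", text, flags=re.DOTALL)


speeches = (
    "<speech>Gwendolen and Cecily <stage>Speaking together.</stage> "
    "Your Christian names are still\nan insuperable barrier. That is all!</speech>\n\n"
    "<speech>Jack and Algernon <stage>Speaking together.</stage> "
    "Our Christian names! Is that\nall? But we are going to be christened "
    "this afternoon.</speech>\n\n"
    "<speech>Gwendolen. <stage>To Jack.</stage> For my sake you are prepared "
    "to do this terrible\nthing?</speech>"
)

# Without the repair, three speeches yield only two speaker elements.
unrepaired = speaker_names(tag_speakers(speeches))
# With the repair interposed, each speech gets its own speaker.
repaired = speaker_names(tag_speakers(repair_joint_speakers(speeches)))
print(len(unrepaired))
print(repaired)
```

In the unrepaired run the second speaker element swallows a </speech> end-tag and the following <speech> start-tag, which is the overlap that makes the result not well formed.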