Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2021-02-15T17:53:48+0000
Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.
Your task is to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. The specific markup you use is up to you, but as is appropriate for a play, you will want your XML to identify at least acts, scenes, speeches, speakers, and stage directions. Note that your goal is to use search and replace operations, with or without regular expressions, to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). You should not use manual tagging except in situations that occur so rarely that they don’t justify search and replace operations or stylesheet transformations (such as tagging the title of the play or creating a root element).
When you have completed your tagging, you should upload the XML document you create along with a separate page describing any global search and replace operations you used (through the search and replace dialog box) to introduce markup.
There is no single target output for this assignment. Any well-formed markup you create that is appropriate and sensible for the play is fine.
Remove the Project Gutenberg boilerplate from the beginning and end manually, as described in the instructions. Move everything before the first act (title, cast, etc.) temporarily into another file, tag it separately, and then paste it back in.
Search for all instances of ampersand and angle brackets and replace them with the appropriate XML entities: &amp;, &lt;, and &gt;. As it happens, there aren’t any, but in Real Life the only way to find that out is to look.
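For readers who want to try this step outside the editor, here is a minimal Python sketch of the same check and replacement. The function names and the sample string are our own inventions for illustration. The one design point worth noting is the order: the ampersand is replaced first, because otherwise the ampersands inside the newly inserted entities would themselves be replaced.

```python
import re


def count_markup_characters(text: str) -> int:
    """Count characters that would be mistaken for XML markup."""
    return len(re.findall(r"[&<>]", text))


def escape_markup_characters(text: str) -> str:
    """Replace &, <, and > with the corresponding XML entities."""
    # The ampersand goes first: otherwise the "&" inside a freshly
    # inserted "&lt;" or "&gt;" would itself be turned into "&amp;".
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# Invented sample, not taken from the play.
sample = "Bread & butter <for tea>"
print(count_markup_characters(sample))
print(escape_markup_characters(sample))
```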
We care about blank lines (see below), but sequences of multiple blank lines aren’t especially useful, so let's replace them with single blank lines. Search for \n\n\n+ (or \n{3,}; the number in curly braces followed by a comma means 3 or more) and replace it with \n\n.
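The same operation can be sketched in Python, assuming the text has been read into a string. The function name and the sample are ours, invented for illustration.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce any run of three or more newlines to exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


# Invented sample with three blank lines between the first two blocks.
sample = "FIRST ACT\n\n\n\nSCENE\n\nMorning-room"
collapsed = collapse_blank_lines(sample)
print(repr(collapsed))
```

Two newline characters in a row correspond to one blank line, which is why the pattern looks for three or more and writes back two.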
There are multiple spaces after sentences, but since single spacing is now more common (see the discussion on Wikipedia for arguments and links), we’ll collapse our double spaces into singles. We do that by searching for  {2,} (that’s a literal space character before the curly braces) and replacing with a single literal space character. We search for a literal space character, and not for \s, because \s matches both spaces and new lines (and tabs and a few other whitespace characters), and we don’t want to lose new lines.
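To see why the literal space matters, here is a small Python comparison on an invented sample (the variable names are ours). The first replacement keeps the blank line between the two paragraphs; the second, which uses \s, swallows it.

```python
import re

# Invented sample: two spaces after the first sentence, then a blank line.
sample = "First sentence.  Second sentence.\n\nNext paragraph."

# Literal space followed by {2,}: only runs of spaces are collapsed.
spaces_only = re.sub(r" {2,}", " ", sample)

# \s{2,}: the two newlines also count as a run of whitespace and are lost.
any_whitespace = re.sub(r"\s{2,}", " ", sample)

print(repr(spaces_only))
print(repr(any_whitespace))
```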
Stage directions are in square brackets. With Dot matches all checked, match \[(.+?)\] and replace with <stage>\1</stage>. Turn off Dot matches all when this is done. We start with stage directions because we’re working from the inside out, and stage directions may appear inside speeches, but the reverse is not the case. The result will look like:
Jack. My own one!
Chasuble. <stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>
Miss Prism. <stage>Enthusiastically.</stage> Frederick! At last!
Algernon. Cecily! <stage>Embraces her.</stage> At last!
Jack. Gwendolen! <stage>Embraces her.</stage> At last!
Lady Bracknell. My nephew, you seem to be displaying signs of triviality.
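The stage-direction replacement can be sketched in Python as follows. We assume that re.DOTALL plays the role of the Dot matches all checkbox. The second sample is invented to show a stage direction that wraps onto a second line, which is the case that needs Dot matches all. The non-greedy .+? matters too: it stops at the first closing bracket, so two stage directions in one speech are tagged separately.

```python
import re

# re.DOTALL lets "." match newlines, like the "Dot matches all" checkbox.
STAGE = re.compile(r"\[(.+?)\]", re.DOTALL)


def tag_stage_directions(text: str) -> str:
    """Wrap each bracketed stage direction in <stage> tags."""
    return STAGE.sub(r"<stage>\1</stage>", text)


one_speech = "Chasuble. [To Miss Prism.] Laetitia! [Embraces her]"
wrapped = "[Enter a servant\nwith a tray.]"  # invented, spans two lines

print(tag_stage_directions(one_speech))
print(tag_stage_directions(wrapped))
```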
As a way of learning about how acts and scenes are labeled, search for every line that contains no lower-case letters with ^[^a-z\n]+$. (This is a negated character class; see the discussion at https://www.regular-expressions.info/charclass.html.) Note that acts begin with the act number, then SCENE, and then they end with ACT DROP, except that the last one ends with TABLEAU. Since every act has just one scene, we can remove the lines that say SCENE or ACT DROP or TABLEAU (including their trailing new lines) by replacing them with nothing. You can do this with a regex by searching for (SCENE|TABLEAU|ACT DROP)\n* and replacing it with nothing.
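Both steps can be tried in Python on a heavily abbreviated, invented stand-in for the play. One assumption to flag: in Python the anchors ^ and $ match at line boundaries only with re.MULTILINE, which we take to correspond to the editor's behavior.

```python
import re

# Invented, heavily abbreviated stand-in for the structure of the play.
sample = (
    "FIRST ACT\n\nSCENE\n\nA room.\n\nJack. Good morning.\n\nACT DROP\n\n"
    "SECOND ACT\n\nSCENE\n\nA garden.\n\nCecily. Good afternoon.\n\nTABLEAU"
)

# Lines with no lower-case letters; MULTILINE makes ^ and $ work per line.
upper_case_lines = re.findall(r"^[^a-z\n]+$", sample, flags=re.MULTILINE)
print(upper_case_lines)

# Remove the labels we do not need, together with their trailing newlines.
cleaned = re.sub(r"(SCENE|TABLEAU|ACT DROP)\n*", "", sample)
print(repr(cleaned))
```

The character class excludes \n as well as lower-case letters, so the pattern cannot match an empty line or run across a line break.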
There appears to be a description of the setting, in the form of a plain-text paragraph, at the beginning of each act, and we can find it because it occurs immediately after the ACT label, and those are the only places where the string ACT, all in upper case, occurs. Turn on Dot matches all. We can match the settings with ^(.+?ACT\n\n)(.+?)\n\n. This expression captures the act label, including the following two new line characters, and it then also captures the setting, which is all of the text until the next sequence of two new line characters. If there were settings that consisted of more than one paragraph, with blank lines between the paragraphs, we would need a different strategy, but in this play all settings are described in single paragraphs. Our replacement is \1<setting>\2</setting>\n\n, which writes the act label with its new lines (capture group #1) back into the file, followed by the setting, now wrapped in <setting> tags.
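Here is a Python sketch of the setting replacement on an invented two-act sample. We assume re.DOTALL and re.MULTILINE together approximate the editor settings. One thing the sketch makes visible: with Dot matches all on, the first capture group of the second match starts at the beginning of a line somewhere before the act label and runs through the label. That is harmless here, because the whole group is written back unchanged.

```python
import re

SETTING = re.compile(r"^(.+?ACT\n\n)(.+?)\n\n", re.DOTALL | re.MULTILINE)

# Invented sample, as it might look once SCENE and ACT DROP are gone.
sample = (
    "FIRST ACT\n\nA room.\n\nJack. Good morning.\n\n"
    "SECOND ACT\n\nA garden.\n\nCecily. Good afternoon.\n\n"
)

tagged = SETTING.sub(r"\1<setting>\2</setting>\n\n", sample)
print(tagged)

# Group 1 of each match: the second one holds more than the act label.
for match in SETTING.finditer(sample):
    print(repr(match.group(1)))
```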
Speeches (and stand-alone stage directions, and a few other things) are separated by blank lines, and non-speeches that follow blank lines have now mostly been tagged (settings, stage directions not embedded in speeches) or deleted (strings like SCENE, etc.). We might think, then, that any block of text that is separated from others by blank lines and that doesn’t begin with an angle bracket must be either a speech or an act label. We can match only the speeches because they appear to contain a period after the speaker name, while the act labels do not contain periods. In other words, any line-initial sequence of characters that doesn’t begin with an angle bracket, up until a period and its following space character, should be a speaker name, and the text after that until the next blank line should be the speech. (This turns out to be wrong, but in Real Life we didn’t discover that until later.)
Turn on Dot matches all. We can match \n\n([^<]+?)\. (.+?)\n\n, and replace it with \n\n<speech speaker="\1">\2</speech>\n\n. That is, we match a sequence of non-angle-bracket characters after a blank line up to the first period (this is the speaker name), then a period and space, and then everything until the next sequence of two new lines (that is, the next blank line). This lets us tag the speeches, specify the speakers as attributes, and throw away the trailing period after the speaker name, which was pseudo-markup in plain text to set off the speaker name from the speech. We don’t match the act labels because our match pattern requires a period character, which is not present in the act labels.
In fact, though, we tag only half of the speeches, and the reason is that each match consumes the blank lines after itself, which means that the next speech will not be preceded by a blank line. The easiest way to fix this is to process half of the speeches and then run it again to process the rest. The result should look like:
<speech speaker="Algernon">Thank you, Aunt Augusta.</speech>
<speech speaker="Lady Bracknell">Cecily, you may kiss me!</speech>
<speech speaker="Cecily"><stage>Kisses her.</stage> Thank you, Lady Bracknell.</speech>
<speech speaker="Lady Bracknell">You may also address me as Aunt Augusta for the future.</speech>
<speech speaker="Cecily">Thank you, Aunt Augusta.</speech>
This fails to tag the last speech because it doesn’t have new lines after it, so we fix that manually.
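The behavior described above can be reproduced in Python on four speeches from the sample output, with no new lines after the last one. The first two runs use the pattern from the text. The lookahead variant is our own assumption, not part of the procedure above: a lookahead checks for the following blank line (or the end of the text) without consuming it, so neighboring speeches can match in a single pass. Whether your editor supports lookahead is something to check before relying on it.

```python
import re

# The pattern and replacement from the text; DOTALL is "Dot matches all".
SPEECH = re.compile(r"\n\n([^<]+?)\. (.+?)\n\n", re.DOTALL)
REPLACEMENT = r'\n\n<speech speaker="\1">\2</speech>\n\n'

# Assumed variant: the lookahead tests for a blank line or the end of the
# text without consuming it, so the next speech still has its blank line.
SPEECH_LOOKAHEAD = re.compile(r"\n\n([^<]+?)\. (.+?)(?=\n\n|\Z)", re.DOTALL)
REPLACEMENT_LOOKAHEAD = r'\n\n<speech speaker="\1">\2</speech>'

sample = (
    "\n\nAlgernon. Thank you, Aunt Augusta."
    "\n\nLady Bracknell. Cecily, you may kiss me!"
    "\n\nCecily. <stage>Kisses her.</stage> Thank you, Lady Bracknell."
    "\n\nLady Bracknell. You may also address me as Aunt Augusta for the future."
)

# subn returns the new text and the number of replacements made.
first_pass, first_count = SPEECH.subn(REPLACEMENT, sample)
second_pass, second_count = SPEECH.subn(REPLACEMENT, first_pass)
one_pass, one_pass_count = SPEECH_LOOKAHEAD.subn(REPLACEMENT_LOOKAHEAD, sample)

print(first_count, second_count, one_pass_count)
print(second_pass)
```

The first run tags every other speech, the second run picks up the ones in between, and the final speech stays untagged in both because nothing follows it.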
Acts begin with strings like FIRST ACT and continue until the next act or until the end of the play, and if we’ve tagged everything else correctly, act labels should be the only contexts where a line after a blank line does not begin with an angle bracket or a new line. We can test this by searching for \n\n[^<\s] and when we do that, we see five results, even though there should be only two (the second and third acts; we don’t expect to see the first act because it isn’t preceded by a blank line). When we look at those lines, and then back at the original, we can see what went wrong. The three offending lines in our file, which occur together, are:
Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still an insuperable barrier. That is all!
Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that all? But we are going to be christened this afternoon.
The original said:
Gwendolen and Cecily [Speaking together.] Your Christian names are still an insuperable barrier. That is all!
Jack and Algernon [Speaking together.] Our Christian names! Is that all? But we are going to be christened this afternoon.
What happened, then, is that there were two speeches in the original where the speaker names were not followed by a period, which meant that our regular expression, which depended on the period, failed to recognize them as speeches. Data in the wild is often inconsistent, so this sort of situation is not uncommon. Since there are only two, we would fix them by repairing those speeches manually at this point. If, on the other hand, our regular expression had missed a large number of speeches, or if we intended to use the same process on a large number of typographically similar plays, it would be better to figure out how to autotag those, whether by improving our original find-and-replace operation or running an additional one to capture what the first one missed.
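As a sketch of what running an additional operation might look like, the Python below first lists the blocks that are still untagged and then applies a second, supplementary pattern. That pattern is our own assumption, tailored to these two speeches: a speaker name that contains no period and is followed by a space and a stage direction. The sample is abbreviated and rearranged for illustration and is not how the text appears in the play.

```python
import re

# First line of any block after a blank line that does not start with "<".
UNTAGGED_BLOCK = re.compile(r"\n\n([^<\s][^\n]*)")

# Assumed supplementary pattern: a speaker name without a period, then a
# space and a stage direction, up to the next blank line or end of text.
NO_PERIOD_SPEECH = re.compile(
    r"\n\n([^<\n.]+?) (<stage>.+?)(?=\n\n|\Z)", re.DOTALL
)

sample = (
    "\n\nTHIRD ACT"
    "\n\nGwendolen and Cecily <stage>Speaking together.</stage> "
    "Your Christian names are still an insuperable barrier. That is all!"
    "\n\nJack and Algernon <stage>Speaking together.</stage> "
    "Our Christian names! Is that all?"
)

before = UNTAGGED_BLOCK.findall(sample)
repaired, repaired_count = NO_PERIOD_SPEECH.subn(
    r'\n\n<speech speaker="\1">\2</speech>', sample
)
after = UNTAGGED_BLOCK.findall(repaired)

print(len(before), repaired_count, after)
```

After the second operation, only the act label is left untagged, which is the state the check was meant to confirm.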
Since there are only three acts, it is easiest to search for them and tag them manually. Remember, though, that an act is not an act label; the tags need to go around the entire act, both the label and all of the speeches. If there were a lot of acts (or, for example, a lot of chapters in a book or sonnets in the Shakespeare sonnet collection) you could search for them and replace them with an end-tag for the preceding act, a start-tag for the new one, and the label, properly tagged. You would then clean up spurious or missing tags at the beginning or end manually.
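The search-and-replace approach for a larger number of acts might be sketched in Python like this. The element names <act> and <label> are hypothetical choices of ours (the assignment leaves the markup up to you), and the sample is invented. The cleanup at the end does in code what the text suggests doing by hand: it drops the spurious end-tag before the first act and adds the missing one after the last.

```python
import re

# A line consisting of one upper-case word followed by " ACT".
ACT_LABEL = re.compile(r"^([A-Z]+ ACT)$", re.MULTILINE)

# Invented sample in which settings and speeches are already tagged.
sample = (
    "FIRST ACT\n\n<setting>A room.</setting>\n\n"
    '<speech speaker="Jack">Good morning.</speech>\n\n'
    "SECOND ACT\n\n<setting>A garden.</setting>\n\n"
    '<speech speaker="Cecily">Good afternoon.</speech>'
)

# End the preceding act, start a new one, and tag the label itself.
tagged = ACT_LABEL.sub(r"</act>\n<act>\n<label>\1</label>", sample)

# Clean up: no act precedes the first label, and the last act is unclosed.
spurious = "</act>\n"
if tagged.startswith(spurious):
    tagged = tagged[len(spurious):]
tagged = tagged + "\n</act>"

print(tagged)
```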
Add a root element, save as XML, reopen, pretty-print. If your XML is not well-formed when you reopen it, you’ll need to figure out what went wrong and how to fix it. That isn’t uncommon; in fact, we had to do it a few times when we were preparing this assignment. Once it is well-formed, though, you can tag the front matter manually and paste it back in. You can also use regular expressions to remove the blank lines between acts, speeches, etc. in your XML if you want.
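If you would like a second opinion on well-formedness outside the editor, a parser from Python's standard library can serve as a rough check. This is a minimal sketch: the function name is ours, the root element name <play> is a hypothetical choice, and the parser's error message only points to where the problem was noticed, which is not always where it was introduced.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(xml_text: str) -> Optional[str]:
    """Return None if the text parses as XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


good = '<play><act><speech speaker="Jack">My own one!</speech></act></play>'
# The same text with the </speech> end-tag missing.
bad = '<play><act><speech speaker="Jack">My own one!</act></play>'

print(well_formedness_error(good))
print(well_formedness_error(bad))
```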