Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2018-02-05T02:27:28+0000


Regex assignment #3: answers

The text

Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.

The task

Your task is to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. The specific markup you use is up to you, but as is appropriate for a play, you will want your XML to identify at least acts, scenes, speeches, speakers, and stage directions. Note that your goal is to use search and replace operations, with or without regular expressions, to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). You should not use manual tagging except in situations that occur so rarely that they don’t justify search and replace operations or stylesheet transformations (such as tagging the title of the play or creating a root element).

When you have completed your tagging, you should upload the XML document you create along with a separate page describing any global search and replace operations you used (through the search and replace dialog box) to introduce markup.

There is no single target output for this assignment. Any well-formed markup you create that is appropriate and sensible for the play is fine.

One solution

Remove the Project Gutenberg boilerplate from the beginning and end manually, as described in the instructions.

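Search for all instances of ampersand and angle brackets and replace them with the appropriate XML entities: &amp;, &lt;, and &gt;. Replace the ampersands first, so that you don’t go on to escape the ampersands inside the entities you have just inserted. As it happens, there aren’t any, but the only way to find that out is to look.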
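For readers who want to try this step outside the search and replace dialog, here is a minimal sketch in Python. The assignment itself does not use Python, and the function name escape_xml is just an illustrative choice.

```python
def escape_xml(text: str) -> str:
    """Replace the three characters that are special in XML text content."""
    # The ampersand must be handled first; otherwise the "&" that begins
    # "&lt;" and "&gt;" would itself be escaped by a later replacement.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


print(escape_xml("a & b < c > d"))  # prints: a &amp; b &lt; c &gt; d
```

Running this on text that contains none of the three characters returns it unchanged, which matches what we find for this play.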

We care about blank lines (see below), but sequences of multiple blank lines aren’t especially useful, so let’s replace them with single blank lines. Search for \n\n\n+ (or \n{3,}; the number in curly braces followed by a comma means 3 or more) and replace it with \n\n.
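The same replacement can be sketched with Python’s re module, assuming the whole play has been read into a single string. The function name is hypothetical.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce any run of two or more blank lines to a single blank line."""
    # Three or more consecutive newlines mean at least two blank lines;
    # two newlines (one blank line) are left alone.
    return re.sub(r"\n{3,}", "\n\n", text)


sample = "FIRST ACT\n\n\n\nSCENE\n\nA room.\n"
print(repr(collapse_blank_lines(sample)))
```

Single blank lines survive untouched, which matters because later steps rely on them to find the boundaries between speeches.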

There are multiple spaces after sentences. Collapse them by searching for  {2,} (a literal space character followed by {2,}, which means two or more spaces) and replacing with a single space character. We search for a literal space character, and not for \s, because \s matches both spaces and new lines (and tabs and a few other whitespace characters), and we don’t want to lose new lines.
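To see why the literal space matters, here is a small Python sketch that runs both patterns on the same made-up string. The function names are hypothetical, and the second function exists only to show the mistake.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of two or more literal spaces into one space."""
    return re.sub(r" {2,}", " ", text)


def collapse_whitespace_wrongly(text: str) -> str:
    """The tempting but wrong version: \\s also matches new lines."""
    return re.sub(r"\s{2,}", " ", text)


sample = "At last!  At last!\n\nMy own one!"
print(repr(collapse_spaces(sample)))              # the blank line survives
print(repr(collapse_whitespace_wrongly(sample)))  # the blank line is gone
```

The second call swallows the blank line along with the double space, which would destroy the speech boundaries we need later.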

Cut everything before the line that reads FIRST ACT (the front matter) and set it aside in a separate document. We’ll tag it there manually and paste it back in once the rest of the markup is finished (see below).

Stage directions are in square brackets. With Dot matches all checked, match \[(.+?)\] and replace with <stage>\1</stage>. Turn off Dot matches all when this is done. We start with stage directions because we’re working from the inside out, and stage directions may appear inside speeches, but the reverse is not the case. The result will look like:

Jack. My own one!

Chasuble. <stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>

Miss Prism. <stage>Enthusiastically.</stage> Frederick! At last!

Algernon. Cecily! <stage>Embraces her.</stage> At last!

Jack. Gwendolen! <stage>Embraces her.</stage> At last!

Lady Bracknell. My nephew, you seem to be displaying signs of
triviality.
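As a rough equivalent outside the editor, the stage direction step might look like this in Python, where the Dot matches all checkbox corresponds to the re.DOTALL flag. The names STAGE and tag_stage_directions are illustrative assumptions.

```python
import re

# re.DOTALL lets "." match new lines, so a stage direction that wraps
# across a line end is still matched. The lazy ".+?" stops at the first
# closing bracket instead of running on to the last one in the text.
STAGE = re.compile(r"\[(.+?)\]", re.DOTALL)


def tag_stage_directions(text: str) -> str:
    """Wrap each bracketed stage direction in a <stage> element."""
    return STAGE.sub(r"<stage>\1</stage>", text)


print(tag_stage_directions("Algernon. Cecily! [Embraces her.] At last!"))
# prints: Algernon. Cecily! <stage>Embraces her.</stage> At last!
```

With a greedy .+ instead of the lazy .+?, the first opening bracket would be paired with the last closing bracket in the whole play, so the question mark is doing real work here.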

As a way of learning about how acts and scenes are labeled, search for every line that contains no lower-case letters with ^[^a-z\n]+$. Note that acts begin with the act number, then SCENE, and then they end with ACT DROP, except that the last one ends with TABLEAU. Since every act has just one scene, we can remove the lines that say SCENE or ACT DROP or TABLEAU (including their trailing new lines) by replacing them with nothing. You can do this with a regex by searching for (SCENE|TABLEAU|ACT DROP)\n* and replacing it with nothing.
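Both halves of this step, the exploratory search and the deletion, can be sketched in Python. The sample string below is an invented miniature, not the real text of the play, and the function names are hypothetical. Note that Python needs the re.MULTILINE flag for ^ and $ to match at line boundaries.

```python
import re


def all_caps_lines(text: str) -> list[str]:
    """Return every non-empty line that contains no lower-case letters."""
    return re.findall(r"^[^a-z\n]+$", text, flags=re.MULTILINE)


def drop_scene_markers(text: str) -> str:
    """Delete SCENE, TABLEAU, and ACT DROP along with trailing new lines."""
    return re.sub(r"(SCENE|TABLEAU|ACT DROP)\n*", "", text)


sample = "FIRST ACT\n\nSCENE\n\nSetting goes here.\n\nACT DROP\n\nSECOND ACT\n"
print(all_caps_lines(sample))
# prints: ['FIRST ACT', 'SCENE', 'ACT DROP', 'SECOND ACT']
print(repr(drop_scene_markers(sample)))
```

Excluding \n inside the character class keeps a match from running across several lines, and the + keeps blank lines out of the results.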

Speeches (and stand-alone stage directions, and a few other things) are separated by blank lines. Assume they’re all speeches (that is, overgeneralize), match \n{2,}, and replace it with \n</speech>\n<speech>\n. Add a <speech> tag at the very beginning and remove the extra one at the end. The result should look like:

<speech>
Jack. My own one!
</speech>
<speech>
Chasuble. <stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>
</speech>
<speech>
Miss Prism. <stage>Enthusiastically.</stage> Frederick! At last!
</speech>
<speech>
Algernon. Cecily! <stage>Embraces her.</stage> At last!
</speech>
<speech>
Jack. Gwendolen! <stage>Embraces her.</stage> At last!
</speech>
<speech>
Lady Bracknell. My nephew, you seem to be displaying signs of
triviality.
</speech>
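A Python sketch of the same overgeneralizing step follows. It differs slightly from the manual procedure: it strips leading and trailing new lines first and then adds the outermost tags itself, so there is no extra tag to remove at the end. The function name is hypothetical.

```python
import re


def wrap_speeches(text: str) -> str:
    """Treat every blank-line-separated block as a speech."""
    # Strip new lines at the edges so that no empty speech is created
    # before the first block or after the last one.
    body = text.strip("\n")
    body = re.sub(r"\n{2,}", "\n</speech>\n<speech>\n", body)
    return "<speech>\n" + body + "\n</speech>"


sample = "Jack. My own one!\n\nAlgernon. Cecily! At last!\n"
print(wrap_speeches(sample))
```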

Speeches begin with the speaker’s name, followed by a period. Search for <speech>\n([A-Z].+?)\. (including the trailing space) and replace with <speech>\n<speaker>\1</speaker>\n. Note that we remove the period. The result will look like:

<speech>
<speaker>Jack</speaker>
My own one!
</speech>
<speech>
<speaker>Chasuble</speaker>
<stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>
</speech>
<speech>
<speaker>Miss Prism</speaker>
<stage>Enthusiastically.</stage> Frederick! At last!
</speech>
<speech>
<speaker>Algernon</speaker>
Cecily! <stage>Embraces her.</stage> At last!
</speech>
<speech>
<speaker>Jack</speaker>
Gwendolen! <stage>Embraces her.</stage> At last!
</speech>
<speech>
<speaker>Lady Bracknell</speaker>
My nephew, you seem to be displaying signs of
triviality.
</speech>
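The speaker step could be sketched in Python as follows, with hypothetical names. Dot matches all stays off here (no re.DOTALL), so the speaker name cannot run across a line end.

```python
import re

# A capital letter right after the <speech> start tag, then as little as
# possible up to the first period followed by a space.
SPEAKER = re.compile(r"<speech>\n([A-Z].+?)\. ")


def tag_speakers(text: str) -> str:
    """Move the leading speaker name into a <speaker> element."""
    return SPEAKER.sub(r"<speech>\n<speaker>\1</speaker>\n", text)


sample = "<speech>\nLady Bracknell. My nephew, you seem\n</speech>"
print(tag_speakers(sample))
```

A block that begins with a <stage> tag is left alone, because the character after the new line is an angle bracket and not a capital letter.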

The part of the <speech> that isn’t the speaker is the spoken text (possibly with stage directions mixed in). Turn on Dot matches all, since stage directions may cross line ends, search for </speaker>\n(.+?)</speech>, and replace it with </speaker>\n<lines>\1</lines>\n</speech>. The result will look like:

<speech>
<speaker>Jack</speaker>
<lines>My own one!
</lines>
</speech>
<speech>
<speaker>Chasuble</speaker>
<lines><stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>
</lines>
</speech>
<speech>
<speaker>Miss Prism</speaker>
<lines><stage>Enthusiastically.</stage> Frederick! At last!
</lines>
</speech>
<speech>
<speaker>Algernon</speaker>
<lines>Cecily! <stage>Embraces her.</stage> At last!
</lines>
</speech>
<speech>
<speaker>Jack</speaker>
<lines>Gwendolen! <stage>Embraces her.</stage> At last!
</lines>
</speech>
<speech>
<speaker>Lady Bracknell</speaker>
<lines>My nephew, you seem to be displaying signs of
triviality.
</lines>
</speech>

The new line before the </lines> end tag is annoying. We could refine our regex to get rid of it, but instead we’ll fix it later. Turn off Dot matches all!
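Here is a Python sketch of the <lines> step, again with hypothetical names. It reproduces the new line before the </lines> end tag rather than hiding it, so that the output matches what the search and replace dialog gives.

```python
import re

# re.DOTALL plays the role of Dot matches all: the spoken text may span
# several lines. The lazy ".+?" stops at the first </speech> end tag.
LINES = re.compile(r"</speaker>\n(.+?)</speech>", re.DOTALL)


def tag_lines(text: str) -> str:
    """Wrap everything after the speaker in a <lines> element."""
    return LINES.sub(r"</speaker>\n<lines>\1</lines>\n</speech>", text)


sample = "<speech>\n<speaker>Jack</speaker>\nMy own one!\n</speech>"
print(tag_lines(sample))
```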

The act labels are all incorrectly tagged as speeches. Match them with <speech>\n.+ACT\n</speech> and replace with </act>\n<act>. Fix the start tag for the first act and the end tag for the last one manually.
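A Python sketch of the act step, with hypothetical names, makes one consequence visible: the label text itself (FIRST ACT and so on) is discarded, and only the act boundary remains.

```python
import re

# No re.DOTALL here: ".+" must stay on the single line holding the label.
ACT_LABEL = re.compile(r"<speech>\n.+ACT\n</speech>")


def tag_acts(text: str) -> str:
    """Turn each mis-tagged act label into an act boundary."""
    return ACT_LABEL.sub("</act>\n<act>", text)


sample = "<speech>\nSECOND ACT\n</speech>\n<speech>\nSetting goes here.\n</speech>"
print(tag_acts(sample))
```

As in the manual procedure, the output begins with a stray </act> before the first act and lacks one after the last, and both still have to be fixed by hand.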

The first thing tagged as <speech> in each act is the initial setting. Change that manually to a stage direction.

Add a root element, save as XML, reopen, pretty-print. Tag the front matter manually and paste it back in.

You may have spurious leading and trailing white space at the beginnings and ends of elements. Remove that by replacing >\s+ with just >, and \s+< with just <. Add the space back before <stage> and after </stage>. Pretty-print again.
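The white space cleanup can be sketched in Python as below. The two lookaround patterns that put the spaces back are our own assumption about one reasonable way to do it: they add a space only where a stage direction touches text, not where it touches another tag. The function names are hypothetical.

```python
import re


def trim_element_whitespace(text: str) -> str:
    """Remove white space directly after ">" and directly before "<"."""
    text = re.sub(r">\s+", ">", text)
    return re.sub(r"\s+<", "<", text)


def restore_stage_spacing(text: str) -> str:
    """Re-insert a space where a stage direction touches spoken text."""
    # Before <stage>, unless the preceding character is ">" or white space.
    text = re.sub(r"(?<=[^>\s])<stage>", " <stage>", text)
    # After </stage>, unless the following character is "<" or white space.
    return re.sub(r"</stage>(?=[^<\s])", "</stage> ", text)


sample = "<lines>Cecily! <stage>Embraces her.</stage> At last!\n</lines>"
print(restore_stage_spacing(trim_element_whitespace(sample)))
# prints: <lines>Cecily! <stage>Embraces her.</stage> At last!</lines>
```

This also removes the annoying new line before the </lines> end tag that was left over from the earlier step.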

If you want to autotag the cast of characters, those lines all contain a colon, and there is only one other line inside a speech that contains a colon (Algernon says No: the appointment is in London.), so you can use a regex to match and tag the lines and then fix the one false hit manually. Given the brevity of the list, though, you may find it easier just to tag all of the characters manually. Alternatively, the <oXygen/> find-and-replace dialog has a Scope switch, which lets you choose whether to apply the find-and-replace operation to All or Only selected lines. This means that you can select the cast of characters and constrain a find-and-replace operation to only the selected lines, which means that the operation won’t even look at lines outside the selected area, so you don’t have to worry about matching something there.

There are 48 instances of <speech> elements that actually contain nothing but stage directions, e.g.:

<speech>
    <speaker>Lane</speaker>
    <lines>Mr. Ernest Worthing.</lines>
</speech>
<speech>
    <stage>Enter Jack.</stage>
</speech>
<speech>
    <stage>Lane goes out_._</stage>
</speech>
<speech>
    <speaker>Algernon</speaker>
    <lines>How are you, my dear Ernest? What brings you up to town?</lines>
</speech>

You can fix these by checking Dot matches all and then searching for <speech>\s*(<stage>[^<]+</stage>)\s*</speech> and replacing it with \1. This throws away the <speech> start and end tags and the extra white space.
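In Python the same repair might look like the sketch below, with hypothetical names. No re.DOTALL flag is needed, because the pattern contains no dot: \s and [^<] already match new lines.

```python
import re

STAGE_ONLY = re.compile(r"<speech>\s*(<stage>[^<]+</stage>)\s*</speech>")


def unwrap_stage_only_speeches(text: str) -> str:
    """Replace a speech holding only a stage direction with the direction."""
    return STAGE_ONLY.sub(r"\1", text)


sample = "<speech>\n    <stage>Enter Jack.</stage>\n</speech>"
print(unwrap_stage_only_speeches(sample))
# prints: <stage>Enter Jack.</stage>
```

A speech that has a <speaker> child does not match, because only white space is allowed between the <speech> start tag and <stage>.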

If your XML is well formed, you can pretty-print it to wrap and indent legibly. If it isn’t well formed, you’ll need to examine the errors, and there are three types of strategies for dealing with them:

You should also read (or, at least, skim) through your document to look for mistakes that don’t interfere with well-formedness, but that nonetheless give you bad results. The stage directions that we had mistakenly tagged as speeches, described above, are an example of that.
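If you would like to check well-formedness outside the editor, Python’s standard library can do it. This is only a sketch with a hypothetical function name. It tells you whether the document parses, and like any well-formedness check it says nothing about whether the markup is sensible.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The error message includes the line and column of the problem.
        print("Not well formed:", err)
        return False
    return True


print(is_well_formed("<play><act><stage>Enter Jack.</stage></act></play>"))
print(is_well_formed("<play><act><stage>Enter Jack.</act></play>"))
```

The first call reports a well-formed document. The second fails because the <stage> element is never closed.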