Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-02-15T17:53:48+0000


Regex assignment #3: answers

The text

Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.

The task

Your task is to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. The specific markup you use is up to you, but as is appropriate for a play, you will want your XML to identify at least acts, scenes, speeches, speakers, and stage directions. Note that your goal is to use search and replace operations, with or without regular expressions, to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). You should not use manual tagging except in situations that occur so rarely that they don’t justify search and replace operations or stylesheet transformations (such as tagging the title of the play or creating a root element).

When you have completed your tagging, you should upload the XML document you create along with a separate page describing any global search and replace operations you used (through the search and replace dialog box) to introduce markup.

There is no single target output for this assignment. Any well-formed markup you create that is appropriate and sensible for the play is fine.

One solution

Remove the Project Gutenberg boilerplate from the beginning and end manually, as described in the instructions. Remove everything before the first act (title, cast, etc.) temporarily into another file, tag it separately, and then paste it back in.

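Search for all instances of ampersand and angle brackets and replace them with the appropriate XML entities: &amp;, &lt;, and &gt;. Replace the ampersands first, so that the ampersands in the new entities are not themselves escaped. As it happens, there aren’t any, but in Real Life the only way to find that out is to look.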
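The same escaping step can be sketched in Python. This is only an illustration under assumptions: the function name escape_xml and the sample line are made up, and the assignment itself expects you to do this in the editor's search and replace dialog.

```python
def escape_xml(text: str) -> str:
    """Replace the three characters that are special in XML text content."""
    # The ampersand must go first; otherwise the ampersands introduced by
    # &lt; and &gt; would themselves be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# Made-up sample line, not quoted from the play.
sample = "Bread & butter <for> Lady Bracknell"
escaped = escape_xml(sample)
```

Text that contains none of the three characters comes back unchanged, which is what happens with this play.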

We care about blank lines (see below), but sequences of multiple blank lines aren’t especially useful, so let’s replace them with single blank lines. Search for \n\n\n+ (or \n{3,}; the number in curly braces followed by a comma means 3 or more) and replace it with \n\n.
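As a rough Python equivalent of this step, the sketch below applies the \n{3,} pattern with re.sub. The sample string is a placeholder rather than text from the play, and it assumes the file uses plain \n line endings.

```python
import re

# Placeholder text with runs of two and three blank lines.
sample = "FIRST ACT\n\n\n\nSCENE\n\n\nSetting text.\n\nFirst speech.\n"

# Three or more consecutive newlines mean two or more blank lines;
# collapse each such run to exactly one blank line.
collapsed = re.sub(r"\n{3,}", "\n\n", sample)
```

Single blank lines (exactly two newline characters) are left alone because the pattern requires at least three.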

There are multiple spaces after sentences, but since single spacing is now more common (see the discussion on Wikipedia for arguments and links), we’ll collapse our double spaces into singles. We do that by searching for {2,} (that’s a literal space character before the curly braces) and replacing with (a single literal space character). We search for a literal space character, and not for \s, because \s matches both spaces and new lines (and tabs and a few other whitespace characters), and we don’t want to lose new lines.
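The difference between matching a literal space and matching \s can be seen in a small Python sketch. The sample is adapted from two lines of the play quoted later on this page; the variable names are arbitrary.

```python
import re

sample = (
    "an insuperable barrier.  That is all!\n\n"
    "Jack and Algernon [Speaking together.]  Our Christian names!"
)

# A literal space before {2,}: only runs of spaces are collapsed.
with_space = re.sub(r" {2,}", " ", sample)

# \s also matches newlines, so the blank line between the speeches is lost.
with_s = re.sub(r"\s{2,}", " ", sample)
```

After the first replacement the blank line between the two speeches survives; after the second it has been turned into a single space, which is exactly what we want to avoid.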

Stage directions are in square brackets. With Dot matches all checked, match \[(.+?)\] and replace with <stage>\1</stage>. Turn off Dot matches all when this is done. We start with stage directions because we’re working from the inside out, and stage directions may appear inside speeches, but the reverse is not the case. The result will look like:

Jack. My own one!

Chasuble. <stage>To Miss Prism.</stage> Laetitia! <stage>Embraces her</stage>

Miss Prism. <stage>Enthusiastically.</stage> Frederick! At last!

Algernon. Cecily! <stage>Embraces her.</stage> At last!

Jack. Gwendolen! <stage>Embraces her.</stage> At last!

Lady Bracknell. My nephew, you seem to be displaying signs of
triviality.
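The stage direction step can be sketched in Python as follows, where the re.DOTALL flag plays the role of the Dot matches all checkbox. The first sample is reconstructed from the output shown above; the second, a stage direction that wraps onto a new line, is invented to show why the flag matters.

```python
import re

sample = (
    "Chasuble. [To Miss Prism.] Laetitia! [Embraces her]\n\n"
    "Miss Prism. [Enthusiastically.] Frederick! At last!"
)

# The lazy quantifier +? stops at the first closing bracket, so each
# stage direction is tagged separately.
stage_pattern = re.compile(r"\[(.+?)\]", flags=re.DOTALL)
tagged = stage_pattern.sub(r"<stage>\1</stage>", sample)

# Invented example of a stage direction that spans a line break.
wrapped = "[Enter Lane\nwith a tray.]"
without_dotall = re.sub(r"\[(.+?)\]", r"<stage>\1</stage>", wrapped)
with_dotall = stage_pattern.sub(r"<stage>\1</stage>", wrapped)
```

Without the flag, the dot cannot cross the line break, so the wrapped stage direction is left untagged.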

As a way of learning about how acts and scenes are labeled, search for every line that contains no lower-case letters with ^[^a-z\n]+$. (This is a negated character class; see the discussion at https://www.regular-expressions.info/charclass.html.) Note that acts begin with the act number, then SCENE, and then they end with ACT DROP, except that the last one ends with TABLEAU. Since every act has just one scene, we can remove the lines that say SCENE or ACT DROP or TABLEAU (including their trailing new lines) by replacing them with nothing. You can do this with a regex by searching for (SCENE|TABLEAU|ACT DROP)\n* and replacing it with nothing.
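Both parts of this step, the exploratory search and the deletion, can be sketched in Python. The sample is placeholder text that only imitates the layout of the play, and re.MULTILINE is assumed here so that ^ and $ match at line boundaries, as they do in the editor's search.

```python
import re

# Placeholder text that imitates the layout of an act.
sample = (
    "FIRST ACT\n\n"
    "SCENE\n\n"
    "Setting paragraph.\n\n"
    "Algernon. A speech.\n\n"
    "ACT DROP\n\n"
    "SECOND ACT\n"
)

# Lines made up entirely of characters that are neither lower-case
# letters nor newlines; blank lines are skipped because + needs one character.
upper_lines = re.findall(r"^[^a-z\n]+$", sample, flags=re.MULTILINE)

# Delete the structural labels together with any trailing newlines.
cleaned = re.sub(r"(SCENE|TABLEAU|ACT DROP)\n*", "", sample)
```

Note that the deletion pattern is not anchored to line boundaries, so it relies on those strings appearing nowhere else in the text, which the exploratory search lets you confirm first.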

There appears to be a description of the setting, in the form of a plain-text paragraph, at the beginning of each act, and we can find it because it occurs immediately after the ACT label, and those are the only places where the string ACT, all in upper case, occurs. Turn on Dot matches all. We can match the settings with ^(.+?ACT\n\n)(.+?)\n\n. This expression captures the act label, including the following two new line characters, and it then also captures the setting, which is all of the text until the next sequence of two new line characters. If there were settings that consisted of more than one paragraph, with blank lines between the paragraphs, we would need a different strategy, but in this play all settings are described in single paragraphs. Our replacement is \1<setting>\2</setting>\n\n, which writes the act label with its new lines (capture group #1) back into the file, followed by the setting, now wrapped in <setting> tags.
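A minimal Python sketch of the setting step, using placeholder text rather than the play itself. It assumes re.MULTILINE in addition to re.DOTALL, because in Python ^ matches only at the very start of the string unless that flag is set.

```python
import re

# Placeholder text, not quoted from the play.
sample = (
    "FIRST ACT\n\n"
    "Description of the first setting.\n\n"
    "Algernon. A speech.\n\n"
    "SECOND ACT\n\n"
    "Description of the second setting.\n\n"
    "Cecily. Another speech.\n\n"
)

setting_pattern = re.compile(
    r"^(.+?ACT\n\n)(.+?)\n\n", flags=re.MULTILINE | re.DOTALL
)

# Group 1 is written back unchanged; group 2 is wrapped in <setting> tags.
# For later acts, group 1 may also span the speeches that precede the act
# label, which is harmless because everything in it is written back as is.
tagged = setting_pattern.sub(r"\1<setting>\2</setting>\n\n", sample)
```

Stripping the new tags from the result gives back the original text, which is a quick way to confirm that the replacement added markup without losing anything.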

Speeches (and stand-alone stage directions, and a few other things) are separated by blank lines, and non-speeches that follow blank lines have now mostly been tagged (settings, stage directions not embedded in speeches) or deleted (strings like SCENE, etc.). We might think, then, that any block of text that is separated from others by blank lines and that doesn’t begin with an angle bracket must be either a speech or an act label. We can match only the speeches because they appear to contain a period after the speaker name, while the act labels do not contain periods. In other words, any line-initial sequence of characters that doesn’t begin with an angle bracket, up until a period and its following space character, should be a speaker name, and the text after that until the next blank line should be the speech. (This turns out to be wrong, but in Real Life we didn’t discover that until later.)

Turn on Dot matches all. We can match \n\n([^<]+?)\. (.+?)\n\n, and replace it with \n\n<speech speaker="\1">\2</speech>\n\n. That is, we match a sequence of non-angle-bracket characters after a blank line up to the first period (this is the speaker name), then a period and space, and then everything until the next sequence of two new lines (that is, the next blank line). This lets us tag the speeches, specify the speakers as attributes, and throw away the trailing period after the speaker name, which was pseudo-markup in plain text to set off the speaker name from the speech. We don’t match the act labels because our match pattern requires a period character, which is not present in the act labels.

In fact, though, we tag only half of the speeches, and the reason is that each match consumes the blank line after itself, which means that the next speech is no longer preceded by a blank line that the pattern can match. The easiest way to fix this is to run the replacement once, which processes half of the speeches, and then run the same replacement again to process the rest. The result should look like:

<speech speaker="Algernon">Thank you, Aunt Augusta.</speech>

<speech speaker="Lady Bracknell">Cecily, you may kiss me!</speech>

<speech speaker="Cecily"><stage>Kisses her.</stage> Thank you, Lady Bracknell.</speech>

<speech speaker="Lady Bracknell">You may also address me as Aunt Augusta for the future.</speech>

<speech speaker="Cecily">Thank you, Aunt Augusta.</speech>
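The two-pass behavior described above can be reproduced in a short Python sketch. The sample is reconstructed from the output shown above. The last part of the sketch is not the method used on this page: it is an alternative that swaps the trailing \n\n for a lookahead, (?=\n\n), so that the blank line is checked but not consumed, and it assumes your regex engine supports lookahead.

```python
import re

sample = (
    "FIRST ACT\n\n"
    "Algernon. Thank you, Aunt Augusta.\n\n"
    "Lady Bracknell. Cecily, you may kiss me!\n\n"
    "Cecily. <stage>Kisses her.</stage> Thank you, Lady Bracknell.\n\n"
    "Lady Bracknell. You may also address me as Aunt Augusta for the future.\n\n"
)

speech_pattern = re.compile(r"\n\n([^<]+?)\. (.+?)\n\n", flags=re.DOTALL)
replacement = r'\n\n<speech speaker="\1">\2</speech>\n\n'

# Each match consumes the blank line that follows it, so the first pass
# tags only every other speech; the second pass picks up the rest.
first_pass = speech_pattern.sub(replacement, sample)
second_pass = speech_pattern.sub(replacement, first_pass)

# Alternative (an assumption, not the page's method): a lookahead leaves
# the blank line in place for the next match, so one pass is enough.
lookahead_pattern = re.compile(r"\n\n([^<]+?)\. (.+?)(?=\n\n)", flags=re.DOTALL)
one_pass = lookahead_pattern.sub(r'\n\n<speech speaker="\1">\2</speech>', sample)
```

Already tagged speeches are skipped on the second pass because they begin with an angle bracket, which [^<] refuses to match.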

This fails to tag the last speech because it doesn’t have new lines after it, so we fix that manually.

Acts begin with strings like FIRST ACT and continue until the next act or until the end of the play, and if we’ve tagged everything else correctly, act labels should be the only contexts where a line after a blank line does not begin with an angle bracket or a new line. We can test this by searching for \n\n[^<\s] and when we do that, we see five results, even though there should be only two (the second and third acts; we don’t expect to see the first act because it isn’t preceded by a blank line). When we look at those lines, and then back at the original, we can see what went wrong. The three offending lines in our file, which occur together, are:

Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier. That is all!

Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that 
all? But we are going to be christened this afternoon.

The original said:

Gwendolen and Cecily [Speaking together.]  Your Christian names are still
an insuperable barrier.  That is all!

Jack and Algernon [Speaking together.]  Our Christian names!  Is that
all?  But we are going to be christened this afternoon.

What happened, then, is that there were two speeches in the original where the speaker names were not followed by a period, which meant that our regular expression, which depended on the period, failed to recognize them as speeches. Data in the wild is often inconsistent, so this sort of situation is not uncommon. Since there are only two, we fix them by repairing those speeches manually at this point. If, on the other hand, our regular expression had missed a large number of speeches, or if we intended to use the same process on a large number of typographically similar plays, it would be better to figure out how to autotag those, whether by improving our original find-and-replace operation or running an additional one to capture what the first one missed.
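The check for leftover untagged blocks can be sketched in Python. The sample mixes one of the offending speeches quoted above with placeholder tagged content; extending the pattern with [^\n]* to capture the rest of the line, and filtering on the ACT suffix, are additions made here for convenience.

```python
import re

sample = (
    "FIRST ACT\n\n"
    '<speech speaker="Jack">A tagged speech.</speech>\n\n'
    "Gwendolen and Cecily <stage>Speaking together.</stage> "
    "Your Christian names are still\n"
    "an insuperable barrier. That is all!\n\n"
    "SECOND ACT\n\n"
    "<setting>A tagged setting.</setting>\n\n"
)

# After a blank line, anything that starts with neither an angle bracket
# nor whitespace is untagged; capture the rest of that line for inspection.
leftovers = [m.group(1) for m in re.finditer(r"\n\n([^<\s][^\n]*)", sample)]

# Act labels are expected here; everything else needs attention.
unexpected = [line for line in leftovers if not line.endswith(" ACT")]
```

As in the editor, FIRST ACT does not show up because it is not preceded by a blank line.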

Since there are only three acts, it is easiest to search for them and tag them manually. Remember, though, that an act is not an act label; the tags need to go around the entire act, both the label and all of the speeches. If there were a lot of acts (or, for example, a lot of chapters in a book or sonnets in the Shakespeare sonnet collection) you could search for them and replace them with an end-tag for the preceding act, a start-tag for the new one, and the label, properly tagged. You would then clean up spurious or missing tags at the beginning or end manually.
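The end-tag plus start-tag strategy for a text with many acts can be sketched in Python. Everything specific here is an assumption made for illustration: the element names act, label, and play, the pattern for recognizing act labels, and the placeholder sample. The manual cleanup of the first and last tags is done in code.

```python
import re
import xml.etree.ElementTree as ET

# Placeholder text with speeches that have already been tagged.
sample = (
    "FIRST ACT\n\n"
    '<speech speaker="Jack">A tagged speech.</speech>\n\n'
    "SECOND ACT\n\n"
    '<speech speaker="Cecily">Another tagged speech.</speech>\n'
)

# Close the preceding act, open a new one, and tag the label.
tagged = re.sub(
    r"^([A-Z]+ ACT)$",
    r"</act>\n<act>\n<label>\1</label>",
    sample,
    flags=re.MULTILINE,
)

# Cleanup: drop the spurious end-tag before the first act ...
end_tag = "</act>\n"
if tagged.startswith(end_tag):
    tagged = tagged[len(end_tag):]
# ... and add the end-tag that is missing after the last act.
tagged = tagged.rstrip("\n") + "\n</act>\n"

# Wrap in a temporary root (hypothetical name) to confirm the result parses.
root = ET.fromstring(f"<play>{tagged}</play>")
acts = root.findall("act")
```

Each act element now contains both its label and its speeches, which is the point made above: the tags go around the whole act, not just the label.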

Add a root element, save as XML, reopen, pretty-print. If when you reopen your XML is not well-formed, you’ll need to figure out what went wrong and how to fix it. That isn’t uncommon; in fact, we had to do it a few times when we were preparing this assignment. Once it is well formed, though, you can tag the front matter manually and paste it back in. You can also use regular expressions to remove the blank lines between acts, speeches, etc. in your XML if you want.
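If you prefer to check well-formedness outside the editor, a small Python sketch with the standard library's xml.etree.ElementTree parser will do. The function name and the two sample documents (including the play root element) are invented for illustration.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text: str):
    """Return None if xml_text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return str(error)
    return None


good = '<play><act><speech speaker="Jack">My own one!</speech></act></play>'
# The same document with the </speech> end-tag missing.
bad = '<play><act><speech speaker="Jack">My own one!</act></play>'

good_result = well_formedness_error(good)
bad_result = well_formedness_error(bad)
```

The error message reported for the second document includes a line and column number, which helps locate the tag that went wrong.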