Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-02-13T18:00:36+0000


Regex assignment #3: answers

The text

Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.

The task

Your task is to begin to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. As is appropriate for a play, we eventually want the XML to identify any of the following that it finds: acts, scenes, settings, speeches, speakers, and stage directions. Our goal is to use search and replace operations with regular expressions to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). We want to avoid manual tagging except in situations that occur so rarely that they don’t justify search and replace operations, such as tagging the title of the play or creating a root element.

Steps to complete

  1. Delete the Gutenberg boilerplate manually.

  2. Use regex to search for and replace reserved characters. What are they, and how many did you find?

    In this case there are no ampersands or less-than (or greater-than) signs, so nothing to replace. It’s nonetheless important to look for these characters at the beginning of any up-conversion, and to replace them with character entities before you add any real markup.

  3. Use regex to collapse multiple blank lines, leaving only one blank line between lines of the input document.

    Search for \n{3,} and replace with \n\n. Quantified expressions (inside curly braces) work in search patterns but not in replacement patterns, so we have to spell out the two newlines we want in the output. For this step it doesn’t matter whether dot-matches-all is checked or not, since there are no dots in the match pattern, but we normally leave it unchecked as a default, and check it only when that behavior is required.

  4. Use regex to remove multiple consecutive space characters to leave only single spaces.

    Search for + (that’s a literal space character followed by a plus sign) and replace with a single space. As with the preceding step, there is no dot in the match pattern, so whether dot-matches-all is checked doesn’t affect the behavior.

  5. Manually select and cut everything before the line that says FIRST ACT and save it to a different file. You’ll tag that separately later, as part of the test, and then paste it back in.

  6. Italics are represented by underscores. There are seven instances, which seem to fall into three types:

    • Emphasis, e.g., I believe it _is_ a very pleasant state, sir.

    • Newspaper title, e.g., … will appear in the _Morning Post_ on Saturday …

    • Individual punctuation marks that seem to have been italicized for no apparent reason, e.g., [Lane goes out_._]

    Use regex to:

    • Strip out the underscores around the single punctuation marks without replacement.

      Since this is the only one-character italicized pattern, you can use _(.)_ and replace with \1. A stricter strategy would require the single character to be a punctuation mark, which you can do by matching _(\p{P})_ (see Unicode regular expressions for a more detailed explanation of how this works). Since there is a dot involved it’s good general practice to uncheck dot-matches-all, but it won’t matter in this case because there are no instances of two underscores surrounding just a newline.

    • Assume that if the content begins with an upper-case letter it represents a newspaper title and use regex to replace the underscores with <title> tags.

      Search for _([A-Z].*?)_ and replace with <title>\1</title>. Dot-matches-all should be checked because a newspaper title could span a newline, although that happens not to be the case in this document. Leaving dot-matches-all unchecked would also produce the desired output here, but only because the document happens not to contain a situation that the pattern would fail to handle. A solution that works only by that kind of luck is called brittle; the opposite is robust. It’s good practice to write robust patterns that will handle situations that might reasonably occur, instead of hoping that they don’t.

    • Assume that the rest are emphasis and replace the underscores with <emph> tags.

      Since you have removed all underscores except the ones intended to represent emphasis, you can now match _(.+?)_ and replace with <emph>\1</emph>. You should check dot-matches-all because an emphasized string could span a newline, although that happens not to occur in this document.

  7. Stage directions (that is, actions performed by characters) are inside square brackets, sometimes on their own line and sometimes inside a speech. A speech may have more than one stage direction, and stage directions may cross line boundaries. Nothing else is inside square brackets. Use regex to tag all stage directions as <stage>, removing the square brackets.

    Match \[(.+?)\] and replace with <stage>\1</stage>. Dot-matches-all must be checked because stage directions may cross line boundaries. Square brackets are special characters in regex (their normal role is to delimit character classes, like the one you used to match Roman numerals elsewhere), so to match a literal square bracket character you need to precede it with a backslash.

  8. As a way of learning about how acts and scenes are labeled, search for every line that contains no lower-case letters with ^[^a-z\n]+$. (This is a negated character class; see the discussion at https://www.regular-expressions.info/charclass.html.) Note that acts begin with the act number, then SCENE, and then they end with ACT DROP, except that the last one ends with TABLEAU. Since every act has just one scene, use a regex to remove the lines that say SCENE or ACT DROP or TABLEAU (including their trailing newlines) by replacing them with nothing. You can do this with a single regex.

    Search for (SCENE|TABLEAU|ACT DROP)\n* and replace with nothing. Because there is no dot in the match pattern it doesn’t matter whether dot-matches-all is checked.

    The parentheses create a capture group because parentheses always create a capture group, but that’s not why we use them here. We use them here to create a subpattern, so that we match any of the options in the parenthesized or-group plus any immediately following newline characters. The string TABLEAU at the end doesn’t have any following newline characters because it’s at the very end of the document, so we match zero or more newlines to consume those that are present after the other matches, but ensure that we’ll also match against the final TABLEAU because the newlines are optional.

  9. Now tag all of the act labels (lines that read FIRST ACT, etc.) as <act> elements. These aren’t really acts; they’re just the headings for acts, and we’ll fix that later.

    Search for ^.+?ACT$ and replace with <act>\0</act>. Dot-matches-all should be unchecked, since you want to match only a single line.

  10. There appears to be a description of the setting, in the form of a plain-text paragraph, at the beginning of each act. We can find it because it occurs immediately after the "ACT" label, and those are the only places where the string "ACT", all in upper case, occurs. Find these paragraphs using regex and tag them as <setting>.

    With dot-matches-all checked, match (</act>\n+)(.+?)\n\n and replace with \1<setting>\2</setting>\n\n. The first capture group helps us find the paragraph after the act labels that we just created. We capture that and then put it back unchanged; we’re matching it only to find our way to the right place, and not because we need to process it. (We could, alternatively, use look-behind here, but we find it easier to capture and replace than to remember the syntax for a look-behind pattern.) The second capture group matches any sequence of characters, including newlines, up to the first instance of two consecutive newlines. This is the setting paragraph, so we write it back into the output with <setting> tags around it.
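The numbered steps above can also be scripted. What follows is a minimal sketch in Python rather than in the <oXygen/> find-and-replace dialog, so a few things differ and are assumptions of the sketch: dot-matches-all corresponds to the re.DOTALL flag, the line-based ^ and $ need re.MULTILINE, Python writes the whole match as \g<0> rather than \0, and the standard re module does not support \p{P}, so the looser _(.)_ pattern is used for step 6. The function name up_convert and the miniature SAMPLE text are hypothetical and are not part of the assignment.

```python
import re

# Hypothetical miniature input that imitates the layout of the plain text.
SAMPLE = (
    "FIRST ACT\n\n\n"
    "SCENE\n\n\n"
    "A room.  It has\ntwo doors.\n\n"
    "[Lane goes out_._]\n\n"
    "Lane.  I believe it _is_ a very pleasant state, sir.\n\n"
    "Algernon.  It will appear in the _Morning Post_ on Saturday.  [Sits\ndown.]\n\n"
    "ACT DROP"
)


def up_convert(text: str) -> str:
    """Apply steps 2-10 in order and return the partially tagged text."""
    # Step 2: escape reserved characters before any real markup is added.
    text = re.sub(r"&", "&amp;", text)
    text = re.sub(r"<", "&lt;", text)
    # Step 3: collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Step 4: collapse runs of spaces into a single space.
    text = re.sub(r" +", " ", text)
    # Step 6: single italicized characters, then titles, then emphasis.
    text = re.sub(r"_(.)_", r"\1", text)
    text = re.sub(r"_([A-Z].*?)_", r"<title>\1</title>", text, flags=re.DOTALL)
    text = re.sub(r"_(.+?)_", r"<emph>\1</emph>", text, flags=re.DOTALL)
    # Step 7: stage directions may span lines, so DOTALL is required.
    text = re.sub(r"\[(.+?)\]", r"<stage>\1</stage>", text, flags=re.DOTALL)
    # Step 8: drop SCENE, TABLEAU, and ACT DROP plus any trailing newlines.
    text = re.sub(r"(SCENE|TABLEAU|ACT DROP)\n*", "", text)
    # Step 9: wrap each act heading; MULTILINE makes ^ and $ match per line.
    text = re.sub(r"^.+?ACT$", r"<act>\g<0></act>", text, flags=re.MULTILINE)
    # Step 10: the paragraph right after an act heading is the setting.
    text = re.sub(
        r"(</act>\n+)(.+?)\n\n",
        r"\1<setting>\2</setting>\n\n",
        text,
        flags=re.DOTALL,
    )
    return text


if __name__ == "__main__":
    print(up_convert(SAMPLE))
```

The order of the substitutions matters in the same way it does in the manual procedure: reserved characters are escaped before any tags exist, and the title pattern runs before the catch-all emphasis pattern.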

Optional bonus tasks

Tagging with regular expressions relies on consistent patterns in the plain text, which means that if the plain text has any inconsistencies, the regular expression matching becomes less straightforward. To tag speeches we identified patterns that would help us recognize a speech, and then patterns that would help us recognize the speaker name within the speech, and that method misfires in three places because the patterns we relied on were not consistent. In Real Life we would fix the few exceptional cases by hand, but below we also describe how to adapt our regular expressions to deal with the exceptions more systematically.

We found it easier to define the beginnings and ends of speeches separately, so we tagged speeches in two steps:

  1. Dot matches all doesn’t matter for the first step because there aren’t any dots in our pattern. We match \n\n([^<]) and replace it with \n\n<speech>\1. The match pattern takes advantage of the fact that we’ve already tagged everything that begins after two newlines that isn’t a speech: the act labels, the settings, and stand-alone stage directions. We use a negated character class to match two newlines followed by anything that isn’t the beginning of a tag and write that back into the replacement, except that we insert a <speech> start-tag after the newlines.

  2. With dot matches all checked (because speeches can cross multiple lines), we match (<speech>.+?)(\n\n|\z) and replace it with \1</speech>\n\n. This uses the <speech> start-tag that we inserted in the previous step to find the beginning of each speech, and we match everything from there through the first instance of two newlines, which is the end of the speech. We write the speech back into the replacement, except that we insert the </speech> end-tag before the newlines. The \z alternative (end of input) is there because the last speech may not be followed by newlines. Simply making the newlines optional with \n* would not work: a lazy .+? followed by something that is allowed to match nothing stops after a single character.
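The two-step speech tagging can be sketched in Python as follows. This is an illustration under stated assumptions, not the dialog-based procedure itself: the input string is a hypothetical fragment that has already been through the earlier steps, and the sketch ends a speech at either two newlines or the end of the input (which Python spells \Z). The last line also shows why a lazy .+? followed by an optional \n* is not enough on its own: it closes the speech after one character.

```python
import re

# Hypothetical fragment with acts, settings, and stage directions already tagged.
tagged = (
    "<act>FIRST ACT</act>\n\n"
    "<setting>A room.</setting>\n\n"
    "<stage>Lane goes out.</stage>\n\n"
    "Lane. Yes, sir.\n\n"
    "Algernon. Good heavens! <stage>Sits\ndown.</stage>\nIs that all?"
)

# Step 1: anything after a blank line that does not start a tag opens a speech.
opened = re.sub(r"\n\n([^<])", r"\n\n<speech>\1", tagged)

# Step 2: a speech runs to the next blank line or to the end of the input.
result = re.sub(
    r"(<speech>.+?)(?:\n\n|\Z)",
    r"\1</speech>\n\n",
    opened,
    flags=re.DOTALL,
)

# For comparison: with an optional \n* the lazy group stops after one character.
naive = re.sub(r"(<speech>.+?)\n*", r"\1</speech>\n\n", opened, flags=re.DOTALL)

if __name__ == "__main__":
    print(result)
```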

We’ve tagged the speeches and now need to tag the speakers within them as <speaker> and the spoken text as <text>. We assume (not entirely correctly) that the speaker name begins immediately after the <speech> start-tag and continues until the first literal dot. With dot matches all checked, we match the entire speech with <speech>(.+?)\. (.+?)</speech>, capture the speaker and the spoken lines in two capture groups, and tag them with <speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>.
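As a small illustration of the speaker step on a speech that does follow the expected pattern, here is a Python sketch. The names SPEAKER and SPEAKER_REPL are hypothetical, and re.DOTALL stands in for the dot-matches-all checkbox.

```python
import re

# Speaker name runs up to the first dot-plus-space; the rest is the spoken text.
SPEAKER = re.compile(r"<speech>(.+?)\. (.+?)</speech>", re.DOTALL)
SPEAKER_REPL = r"<speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>"

speech = (
    "<speech>Gwendolen. <stage>To Jack.</stage> For my sake you are prepared "
    "to do this terrible\nthing?</speech>"
)

tagged_speech = SPEAKER.sub(SPEAKER_REPL, speech)

if __name__ == "__main__":
    print(tagged_speech)
```

Note that the dot and space that separate the speaker from the speech are matched but not captured, so they are dropped from the output.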

After we wrap a root element around our document and validate it as XML, <oXygen/> notifies us that there are two well-formedness errors, which are actually three tagging errors. The original plain text is:

Gwendolen and Cecily [Speaking together.]  Your Christian names are still
an insuperable barrier.  That is all!

Jack and Algernon [Speaking together.]  Our Christian names!  Is that
all?  But we are going to be christened this afternoon.

Gwendolen.  [To Jack.]  For my sake you are prepared to do this terrible
thing?

Our tagging method yields incorrect results because the speaker names Gwendolen and Cecily and then Jack and Algernon are followed by just a space, rather than by a dot and a space, which is the pattern we had relied on. With the speeches (but not yet the speakers) tagged we see:

<speech>Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier. That is all!</speech>

<speech>Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>

<speech>Gwendolen. <stage>To Jack.</stage> For my sake you are prepared to do this terrible
thing?</speech>

When we tag speakers, we wind up with the incorrect:


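<speech>
<speaker>Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier</speaker>
<text>That is all!</text>
</speech>

<speech>
<speaker>Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>

<speech>Gwendolen</speaker>
<text><stage>To Jack.</stage> For my sake you are prepared to do this terrible
thing?</text>
</speech>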

Since we discovered this inconsistency only after running the find-and-replace operation and it affects only three consecutive speeches, in Real Life we would fix it manually. A reasonably robust alternative, had we known earlier, would have been to tag the speeches, then fix the two places where speakers are not followed by a dot, and then tag the speakers. Those two places are the only instances where a stage direction reads [Speaking together.], so we could use that pattern to find the moments that require repair. Specifically, we could interpose a search for  <stage>Speaking together\.</stage> (note the leading space character) and replace it with .\0. This inserts a literal dot after the speaker names, so that each is now followed by a dot and a space. That regularizes the text and enables our original find-and-replace operations to match and tag all speakers correctly.
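The misfire and the repair can be reproduced with a short Python sketch. It is an illustration only: the helper is_well_formed and the root element name play are hypothetical, \g<0> is Python's spelling of \0, and the standard-library XML parser stands in for the well-formedness check that <oXygen/> performs.

```python
import re
import xml.etree.ElementTree as ET

SPEAKER = re.compile(r"<speech>(.+?)\. (.+?)</speech>", re.DOTALL)
SPEAKER_REPL = r"<speech>\n<speaker>\1</speaker>\n<text>\2</text>\n</speech>"

# The three speeches after speech tagging, before speaker tagging.
SPEECHES = (
    "<speech>Gwendolen and Cecily <stage>Speaking together.</stage> "
    "Your Christian names are still\nan insuperable barrier. That is all!</speech>\n\n"
    "<speech>Jack and Algernon <stage>Speaking together.</stage> "
    "Our Christian names! Is that\nall? But we are going to be christened "
    "this afternoon.</speech>\n\n"
    "<speech>Gwendolen. <stage>To Jack.</stage> For my sake you are prepared "
    "to do this terrible\nthing?</speech>"
)


def is_well_formed(fragment: str) -> bool:
    """Wrap the fragment in a hypothetical root element and try to parse it."""
    try:
        ET.fromstring("<play>" + fragment + "</play>")
        return True
    except ET.ParseError:
        return False


# Without the repair, the second match runs into the third speech.
broken = SPEAKER.sub(SPEAKER_REPL, SPEECHES)

# Repair: insert a dot before the space that precedes the stage direction.
repaired = re.sub(r" <stage>Speaking together\.</stage>", r".\g<0>", SPEECHES)
fixed = SPEAKER.sub(SPEAKER_REPL, repaired)

if __name__ == "__main__":
    print(is_well_formed(broken), is_well_formed(fixed))
    print(fixed)
```

Parsing is only a check for well-formedness. The first broken speech is well-formed but still wrong, which is why the three tagging errors surface as only two parser errors.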