Regex assignment #3

Before you begin

There’s a lot that we could do to autotag this document and it isn’t realistic to try to squeeze all of it into a homework assignment, so we’ll stop at some point and leave the rest as optional bonus tasks.

The text

Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg at http://www.gutenberg.org/cache/epub/844/pg844.txt. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.

The task

Your task is to begin to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. As is appropriate for a play, we eventually want the XML to identify any of the following that it finds: acts, scenes, settings, speeches, speakers, and stage directions. Our goal is to use search and replace operations with regular expressions to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). We want to avoid manual tagging except in situations that occur so rarely that they don’t justify search and replace operations, such as tagging the title of the play or creating a root element.

What to submit

For this homework assignment please complete just the tasks described below and submit a markdown document describing the steps you took. Include your exact regular expressions and, if you are matching and replacing something, the exact replacement expression. For each regex operation, specify whether dot-matches-all was checked or unchecked. Note that we are not asking you to tag speeches or speakers yet, even though that’s most of the text; those are optional bonus tasks.
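To see why the dot-matches-all setting is worth recording, here is a minimal sketch in Python, where the `re.DOTALL` flag plays the role of the dot-matches-all checkbox in a search and replace dialog. The sample string is invented for illustration, and the pattern is only a demonstration, not a suggested answer to any step.

```python
import re

# Invented sample: a bracketed span that crosses a line boundary.
sample = "[Enter Lane\nwith a tray.]"

# Dot-matches-all unchecked: "." does not match a newline, so the
# pattern cannot get from "[" to "]" across the line break.
unchecked = re.search(r"\[.+\]", sample)

# Dot-matches-all checked: "." also matches newlines.
checked = re.search(r"\[.+\]", sample, flags=re.DOTALL)

print(unchecked)         # no match object
print(checked.group(0))  # the whole bracketed span
```

The same regex gives different results depending on the setting, which is why a step description is incomplete without it.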

Steps to complete

Delete the Gutenberg boilerplate manually.
Use regex to search for and replace reserved characters. What are they, and how many did you find?
Use regex to collapse multiple blank lines, leaving only one blank line between lines of the input document.
Use regex to remove multiple consecutive space characters to leave only single spaces.
Manually select and cut everything before the line that says FIRST ACT and save it to a different file. In Real Life you would tag that separately later and then paste it back in.
Italics are represented by underscores. There are seven instances, which seem to fall into three types:
- Emphasis, e.g., I believe it _is_ a very pleasant state, sir.
- Newspaper title, e.g., … will appear in the _Morning Post_ on Saturday …
- Individual punctuation marks that seem to have been italicized for no apparent reason, e.g., [Lane goes out_._]
Use regex to:
- Strip out the underscores around the single punctuation marks without replacement.
- Assume that if the content begins with an upper-case letter it represents a newspaper title and use regex to replace the underscores with <title> tags.
- Assume that the rest are emphasis and replace the underscores with <emph> tags.
Stage directions (that is, actions performed by characters) are inside square brackets, sometimes on their own line and sometimes inside a speech. A speech may have more than one stage direction, and stage directions may cross line boundaries. Nothing else is inside square brackets. Use regex to tag all stage directions as <stage>, removing the square brackets.
As a way of learning about how acts and scenes are labeled, search for every line that contains no lower-case letters with ^[^a-z\n]+$. (This is a negated character class; see the discussion at https://www.regular-expressions.info/charclass.html.) Note that acts begin with the act number, then SCENE, and then they end with ACT DROP, except that the last one ends with TABLEAU. Since every act has just one scene, use a regex to remove the lines that say SCENE or ACT DROP or TABLEAU (including their trailing new lines) by replacing them with nothing. You can do this with a single regex.
Now tag all of the act labels (lines that read FIRST ACT, etc.) as <act> elements. These aren’t really acts; they’re just the headings for acts, and in Real Life we would then use those labels to tag the actual acts.
There appears to be a description of the setting, in the form of a plain-text paragraph, at the beginning of each act. We can find it because it occurs immediately after the "ACT" label, and those are the only places where the string "ACT", all in upper case, occurs. Find these paragraphs using regex and tag them as <setting>.
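Several of the steps above can be sketched as a chain of substitutions. The sketch below uses Python's `re.sub` as a stand-in for an editor's search and replace dialog, runs on an invented sample rather than the real Gutenberg file, and assumes that `&` and `<` are the reserved characters to escape. The patterns are one possible approach, not the expected answers, and your editor's replacement syntax for captured groups may differ from Python's `\1`.

```python
import re

# Invented sample, not the actual Gutenberg text.
sample = (
    "FIRST ACT\n"
    "\n"
    "\n"
    "\n"
    "ALGERNON.  Tea & toast?  [Goes over\n"
    "to the sofa_._]  It _is_ in the _Morning Post_.\n"
)

text = sample
# Reserved characters (assumed here to be "&" and "<"): escape "&" first,
# so that the "&" in a newly written "&lt;" is not escaped a second time.
text = re.sub(r"&", "&amp;", text)
text = re.sub(r"<", "&lt;", text)
# Collapse blank lines: three or more newlines become exactly two.
text = re.sub(r"\n{3,}", "\n\n", text)
# Collapse runs of spaces to a single space.
text = re.sub(r" {2,}", " ", text)
# Italics, most specific case first.
text = re.sub(r"_([.,;:!?])_", r"\1", text)                   # lone punctuation mark
text = re.sub(r"_([A-Z][^_]*)_", r"<title>\1</title>", text)  # starts upper-case
text = re.sub(r"_([^_]+)_", r"<emph>\1</emph>", text)         # whatever is left
# Stage directions: [^\]] also matches newlines, so this pattern does not
# depend on the dot-matches-all setting.
text = re.sub(r"\[([^\]]+)\]", r"<stage>\1</stage>", text)
# Act labels: whole lines of the form "FIRST ACT".
text = re.sub(r"^([A-Z]+ ACT)$", r"<act>\1</act>", text, flags=re.MULTILINE)

print(text)
```

Order matters in two places: `&` has to be escaped before `<`, and the italics rules have to run from most specific to least specific, because the final catch-all pattern would otherwise claim the punctuation and title cases as well.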

Optional bonus activities

The bulk of this document is speeches, which are separated from one another by blank lines. You want to tag each speech, and each speaker at the beginning of each speech, along the lines of:


<speech>
  <speaker>Algernon</speaker>
  No cucumbers!
</speech>

You don’t have to tag lines because the speeches in this play are prose, so the lineation is not informational. Be careful, though! Some speaker names might include dots, some speeches might be spoken by more than one person in unison, some speeches might include stage directions in places where you don’t expect them, and not everything between blank lines is a speech.
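As a starting point only, the bonus task might be approached along the following lines. This is a rough Python sketch on an invented sample. It assumes that a speech is a block of non-blank lines beginning with an upper-case speaker name followed by a period and a space, and it uses `<speech>` and `<speaker>` as placeholder element names. The helper `tag_block` is hypothetical, and the sketch deliberately ignores the complications listed above (dots inside names, speeches in unison, stray stage directions).

```python
import re

# Invented sample; assumes whitespace has already been normalized.
sample = (
    "ALGERNON. No cucumbers!\n"
    "\n"
    "LANE. No, sir. Not even\n"
    "for ready money.\n"
)

# Speaker: upper-case letters and spaces only (no dots, by assumption),
# then a period and a space, then the speech, which may span lines.
pattern = re.compile(r"([A-Z][A-Z ]*)\. (\S.*)", re.DOTALL)

def tag_block(block: str) -> str:
    """Wrap one blank-line-delimited block if it looks like a speech."""
    m = pattern.fullmatch(block)
    if m is None:
        return block  # not a speech under these assumptions; leave as is
    speaker, speech = m.groups()
    return f"<speech>\n  <speaker>{speaker}</speaker>\n  {speech}\n</speech>"

blocks = sample.strip("\n").split("\n\n")
tagged = "\n\n".join(tag_block(b) for b in blocks)
print(tagged)
```

Blocks that do not fit the assumed shape, such as an act label, pass through unchanged, which makes it easy to search afterwards for whatever the pattern missed.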
