Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-09-25T12:31:31+0000

Regex assignment #3: answers

The text

Oscar Wilde’s The importance of being Earnest is available in plain text from Project Gutenberg. Download the text and manually remove the Project Gutenberg boilerplate from the beginning and end, so that all that remains is the text as Oscar Wilde wrote it.

The task

Your task is to prepare an XML-encoded digital edition of this play from the plain text using search and replace operations to introduce the markup. The specific markup you use is up to you, but as is appropriate for a play, you will want your XML to identify at least acts, scenes, speeches, speakers, and stage directions. Note that your goal is to use search and replace operations, with or without regular expressions, to create descriptive well-formed XML markup (rather than, for example, to create a presentational HTML edition). You should not use manual tagging except in situations that occur so rarely that they don’t justify search and replace operations or stylesheet transformations (such as tagging the title of the play or creating a root element).

When you have completed your tagging, you should upload the XML document you create along with a separate page describing any global search and replace operations you used (through the search and replace dialog box) to introduce markup.

There is no single target output for this assignment. Any well-formed markup you create that is appropriate and sensible for the play is fine.

One solution

Remove the Project Gutenberg boilerplate from the beginning and end manually, as described in the instructions.

Search for all instances of ampersands and angle brackets and replace them with the appropriate XML entities: &amp;, &lt;, and &gt;. As it happens, there aren’t any, but the only way to find that out is to look.
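If you wanted to script this check instead of running it in the search dialog, a minimal Python sketch might look like the following. The function name escape_xml and the sample string are our own inventions for illustration; they are not part of the assignment.

```python
def escape_xml(text: str) -> str:
    """Replace the three characters that are special in XML with entities."""
    # The ampersand has to be handled first; otherwise the "&" inside the
    # newly written "&lt;" and "&gt;" would itself be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# Hypothetical sample line (not from the play) that contains all three characters.
sample = "Tea & cake <at once>"
escaped = escape_xml(sample)
print(escaped)
```

The order of the replacements matters in the dialog box too: do the ampersand first, and only then the angle brackets.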

We care about blank lines (see below), but sequences of multiple blank lines aren’t especially useful, so let’s replace them with single blank lines. A single blank line is two new line characters in a row, so a run of multiple blank lines is three or more. Search for \n\n\n+ (or \n{3,}; the number in curly braces followed by a comma means 3 or more) and replace it with \n\n.
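The same normalization can be sketched in Python with the re module. The sample string below is a made-up fragment assembled from lines quoted elsewhere on this page, and collapse_blank_lines is a name we chose for illustration.

```python
import re


def collapse_blank_lines(text: str) -> str:
    # Three or more consecutive new lines mean two or more blank lines;
    # reduce each such run to exactly one blank line (two new lines).
    return re.sub(r"\n{3,}", "\n\n", text)


# Illustrative fragment with a run of four blank lines and a single blank line.
sample = "ACT DROP\n\n\n\n\nSECOND ACT\n\nGarden at the Manor House."
collapsed = collapse_blank_lines(sample)
print(collapsed)
```

Single blank lines are left alone because \n{3,} cannot match a run of only two new line characters.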

Most instances of blank lines separate speeches. To replace all blank lines with speech end and start tags, search for \n\n (or \n{2}) and replace it with \n</speech>\n<speech>\n. Here’s a sample of the state after doing that:

<speech>
Lady Bracknell.  Yes, I remember now that the General was called Ernest,
I knew I had some particular reason for disliking the name.
</speech>
<speech>
Gwendolen.  Ernest!  My own Ernest!  I felt from the first that you could
have no other name!
</speech>
<speech>
Jack.  Gwendolen, it is a terrible thing for a man to find out suddenly
that all his life he has been speaking nothing but the truth.  Can you
forgive me?
</speech>
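As a rough Python equivalent of that blank-line replacement, here is a sketch that uses two of the speeches quoted above as its input. It assumes the blank lines have already been normalized to single blank lines.

```python
import re

# Two speeches separated by one blank line, as in the plain-text source.
sample = (
    "Gwendolen.  Ernest!  My own Ernest!  I felt from the first that you could\n"
    "have no other name!\n"
    "\n"
    "Jack.  Gwendolen, it is a terrible thing for a man to find out suddenly\n"
    "that all his life he has been speaking nothing but the truth.  Can you\n"
    "forgive me?"
)

# Every blank line becomes an end tag for the preceding speech
# and a start tag for the following one.
tagged = re.sub(r"\n\n", "\n</speech>\n<speech>\n", sample)
print(tagged)
```

Note that the replacement only writes tags where there was a blank line, so the very first start tag and the very last end tag in the document are not created by this step and have to be added separately.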

Aside from the title, cast list, etc. at the beginning of the play, which we’ll have to fix manually at the end, we have tagged several instances of non-speeches as speeches. These include the identification of acts (e.g., FIRST ACT) and a few other details (e.g., TIME: The Present.). Some of these have lower-case letters or punctuation, but they all begin with two or more upper-case letters, which we can match with ^[A-Z]{2} (recall that the number in curly braces is an exact count, so the pattern matches exactly two upper-case letters at the beginning of a line). To match and patch the entire line, search for <speech>\n([A-Z]{2}.*)\n</speech> and replace it with <direction>\1</direction>. The result now looks something like:

<speech>
[Jack looks indignantly at him, and leaves the room.  Algernon lights a
cigarette, reads his shirt-cuff, and smiles.]
</speech>
<direction>ACT DROP</direction>
<direction>SECOND ACT</direction>
<speech>
Garden at the Manor House.  A flight of grey stone steps leads up to the
house.  The garden, an old-fashioned one, full of roses.  Time of year,
July.  Basket chairs, and a table covered with books, are set under a
large yew-tree.
</speech>

Note that this doesn’t fix the Garden at the Manor House … section, which is erroneously tagged as a speech (as are a few others), and we’d fix those manually.
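Here is a small Python sketch of the act-heading replacement, run against an abbreviated, hand-built sample (the Garden line is shortened for illustration). It shows both the lines that are retagged and the one that is left behind.

```python
import re

# Abbreviated sample in the state produced by the blank-line replacement.
sample = (
    "<speech>\nACT DROP\n</speech>\n"
    "<speech>\nSECOND ACT\n</speech>\n"
    "<speech>\nGarden at the Manor House.\n</speech>"
)

# A speech whose only line starts with two upper-case letters is retagged.
# Without a dot-matches-all flag, .* stops at the end of the line, so only
# one-line "speeches" can match.
retagged = re.sub(
    r"<speech>\n([A-Z]{2}.*)\n</speech>",
    r"<direction>\1</direction>",
    sample,
)
print(retagged)
```

The Garden line keeps its speech tags because its second character is a lower-case letter, which is exactly the leftover described above.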

You’ll want to do something similar to fix stage directions, which begin with square brackets. You can find all of the text between square brackets by searching for \[.*?\] with Dot matches all checked (since some stage directions may span multiple lines). You need to use the square brackets to find the stage directions, so they have to be part of your pattern, but you don’t want to include the brackets themselves in the output because they’re pseudo-markup, and you’re going to replace them with tags. To match the square brackets and whatever is between them but write only the stuff between them into the replacement, you can use parentheses to create a capture group: \[(.*?)\], and you can tag stage directions as <direction>, writing the captured text into the replacement string with <direction>\1</direction>. Note that we use the question mark after the asterisk to make the expression non-greedy (see below); if we fail to do that, we’ll mess up the markup in situations where we have multiple stage directions in close proximity to one another.
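In Python, the counterpart of the Dot matches all checkbox is the re.DOTALL flag. The sketch below uses the multi-line stage direction quoted earlier plus one invented line (the Jack line is hypothetical, not from the play) to contrast the non-greedy and greedy versions of the pattern.

```python
import re

sample = (
    "[Jack looks indignantly at him, and leaves the room.  Algernon lights a\n"
    "cigarette, reads his shirt-cuff, and smiles.]\n"
    # Hypothetical line with two directions close together:
    "Jack.  [Rising.]  Hello!  [Sits down.]"
)

# Non-greedy: each pair of brackets becomes its own <direction> element.
non_greedy = re.sub(
    r"\[(.*?)\]", r"<direction>\1</direction>", sample, flags=re.DOTALL
)

# Greedy, for comparison: one match runs from the first "[" to the last "]".
greedy = re.sub(
    r"\[(.*)\]", r"<direction>\1</direction>", sample, flags=re.DOTALL
)

print(non_greedy)
print(greedy.count("<direction>"))
```

With the greedy pattern the inner brackets survive inside a single oversized element, which is the kind of damage the question mark is there to prevent.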

Why do we put backslashes before the square bracket characters? Characters that have a special meaning in regular expressions, such as the square brackets, can be used to represent their literal meaning by preceding them with a backslash. We have to escape the square bracket characters with a backslash here because they are metacharacters in regex (they are used to define character classes, as in the Roman numeral [IVX]+ pattern we used for the Shakespeare sonnet exercise). Similarly, a dot means any character except a new line, and to match a literal dot, you need to precede the dot character with a backslash (\.).
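A few one-line experiments in Python can make the difference between a metacharacter and its escaped, literal form concrete. The test strings here are made up for the purpose.

```python
import re

# Unescaped dot: any character except a new line.
any_char = re.findall(r".", "a.b\nc")

# Escaped dot: only a literal period.
literal_dot = re.findall(r"\.", "a.b\nc")

# Unescaped square brackets define a character class.
roman = re.findall(r"[IVX]+", "Sonnet XVIII")

# Escaped square bracket: a literal opening bracket.
bracket = re.findall(r"\[", "[Exit.]")

print(any_char, literal_dot, roman, bracket)
```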

Speakers are identified on the first line of each speech, that is, on the line immediately after the opening <speech> tag. Their names may be more than one word (e.g., Lady Bracknell), but a period always occurs after their names. That lets us match an opening <speech> tag, a new line character, and a string of characters up to the first period and replace it by copying the tag and new line, but wrapping <speaker> tags around the name of the speaker. Search for <speech>\n(.+?)\.\s+ and replace it with <speech>\n<speaker>\1</speaker>. The search string matches the opening <speech> tag and new line character. It then matches one or more instances of any character non-greedily and captures them. It then matches a literal period and one or more white-space characters. It writes the matched tag and new line into the output, as well as the captured pattern, wrapping that pattern in new <speaker> tags. It throws away the period and white space, since those were pseudo-markup, and are no longer needed. This tags a few lines erroneously; we discuss those below.
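The speaker replacement can be tried out in Python on one of the speeches quoted earlier. This is only a sketch of the same search and replace, not a different method.

```python
import re

sample = (
    "<speech>\n"
    "Lady Bracknell.  Yes, I remember now that the General was called Ernest,\n"
    "I knew I had some particular reason for disliking the name.\n"
    "</speech>"
)

# Capture everything between the start tag and the first period, then drop
# the period and the white space that follows it.
tagged = re.sub(
    r"<speech>\n(.+?)\.\s+",
    r"<speech>\n<speaker>\1</speaker>",
    sample,
)
print(tagged)
```

Because the capture is non-greedy, a two-word name such as Lady Bracknell is captured whole, but the match still stops at the first period rather than at a later one in the speech.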

Some of you searched for each character name (followed by a period) individually. That’s manageable in a single play with a small cast, but it doesn’t scale, that is, it isn’t a procedure you could use easily to tag a large corpus. If we were going to take this approach, we’d probably use XSLT (which we’ll learn later in the semester) to extract the names of all characters from the play for us, since if we can do that programmatically, we can use the strategy with additional plays without requiring human examination to determine the names.

There are three instances where a stand-alone stage direction (inside square brackets) is erroneously tagged as a speech. You can find those by searching for a line that begins with a square bracket using ^\[. Since there are only three, we’d fix those manually.

Reminder about greedy and non-greedy matching: Regex pattern matching is usually greedy, which means that it matches the longest possible match. In a line like Jack. Gwendolen. I can. For I feel that you are sure to change. the pattern .+\.\s+ would find the longest match that is a string of characters, then a literal period and then a space, which means that it would match Jack. Gwendolen. I can. , but we want to match only Jack. . Putting a question mark after a repetition indicator, that is, changing .+ into .+?, tells the system to take the shortest match, which is what we want here. There are other ways to achieve this effect, but the non-greedy question mark modifier is common and easy to read. Note that a question mark by itself as a repetition indicator means zero or one, that is, it has the same meaning as it does in Relax NG. When a question mark is used immediately after another repetition indicator (+ or *), though, it wouldn’t make sense for it to mean zero or one, so it has been coopted to indicate non-greediness.
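You can confirm the difference yourself with a short Python check on the example line from the paragraph above.

```python
import re

line = "Jack. Gwendolen. I can. For I feel that you are sure to change."

# Greedy: .+ grabs as much as it can while still leaving a period
# followed by white space for the rest of the pattern.
greedy = re.match(r".+\.\s+", line).group()

# Non-greedy: .+? stops at the first period followed by white space.
non_greedy = re.match(r".+?\.\s+", line).group()

print(repr(greedy))
print(repr(non_greedy))
```

The final period in the line is not part of the greedy match because nothing follows it, so the \s+ at the end of the pattern cannot be satisfied there.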

If you want to autotag the cast of characters, those lines all contain a colon, and there is only one other line inside a speech that contains a colon (Algernon says No: the appointment is in London.), so you can use a regex to match and tag the lines and then fix the one false hit manually. Given the brevity of the list, though, you may find it easier just to tag all of the characters manually. Alternatively, the <oXygen/> find-and-replace dialog has a Scope switch, which lets you choose whether to apply the find-and-replace operation to All or Only selected lines. This means that you can select the cast of characters and constrain a find-and-replace operation to only the selected lines, which means that the operation won’t even look at lines outside the selected area, so you don’t have to worry about matching something there.

Because some stage directions contain periods, making them look like speaker names to the regex we used above, there are several places where directions and speeches have been confused, such as:

<speaker><direction>Enter Gwendolen</speaker>Lane goes out.</direction>

We need to fix the lines that start with a <speaker> tag followed immediately by a <direction> tag and then have a closing </speaker> before a closing </direction>. Not only is this not an accurate representation of the structure, but it also isn’t well formed because it has overlapping elements. Since the start and end <speaker> tags here are erroneous, we can strip them out by capturing everything around them, writing the capture groups into the output, and effectively deleting the erroneous tags. We also have to restore the period and space that were deleted when the <speaker> tags were inserted. Search for ^<speaker>(<direction>.*)</speaker>(.*</direction>)$ and replace it with \1. \2. This inserts the two capture groups, in their original order, with a period and a space between them.
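Here is the same repair sketched in Python, applied to the broken line shown above. We assume the re.MULTILINE flag so that ^ and $ match at the start and end of every line rather than only at the edges of the whole document, and we use the standard library XML parser purely as a quick well-formedness test.

```python
import re
import xml.etree.ElementTree as ET

broken = "<speaker><direction>Enter Gwendolen</speaker>Lane goes out.</direction>"

# Keep the two captured pieces, drop the erroneous <speaker> tags, and put
# back the period and space that the speaker replacement threw away.
fixed = re.sub(
    r"^<speaker>(<direction>.*)</speaker>(.*</direction>)$",
    r"\1. \2",
    broken,
    flags=re.MULTILINE,
)
print(fixed)

# The overlapping tags in the broken line make the parser raise an error.
try:
    ET.fromstring(broken)
    broken_parses = True
except ET.ParseError:
    broken_parses = False

# The repaired line parses as a single <direction> element.
repaired = ET.fromstring(fixed)
```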

More about the dot: The dot is a metacharacter, with the meaning any character except a new line, only for search purposes. When you include it in the replacement expression, it has its literal meaning, that is, it inserts a literal dot into the output. This makes sense: after all, how would the system know what to do if the replacement expression said something like write any character except a new line here!?

At this point, you should fix the title, etc. at the top manually, make sure that the end was tagged properly, add a root element, save your document as XML, and reopen it (<oXygen/> doesn’t understand that it has gone from being plain text to XML just by your renaming it; <oXygen/> has to reopen it under the new name to recognize the change in file type).

Once you’ve reopened it as XML, check for well-formedness. If it’s well formed, you can pretty-print it to wrap and indent legibly. If it isn’t well formed, you’ll need to examine the errors, and there are three types of strategies for dealing with them:

There will be annoying spurious white-space characters at the end of each speech, which you can fix by searching for </speech> (that tag is preceded by a space, which is invisible, confusingly, in this HTML explanation) and replacing it with </speech> (the same thing, but without the space).
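If you prefer to script the cleanup and the well-formedness check, a minimal Python sketch follows. The root element name play, the helper is_well_formed, and the one-speech sample are all assumptions made for illustration; your own markup may differ.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard library parser accepts the text as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical one-speech document with a stray space before the end tag.
sample = (
    "<play><speech><speaker>Jack</speaker>"
    "Can you forgive me? </speech></play>"
)

# Plain string replacement is enough here; no regex is needed.
cleaned = sample.replace(" </speech>", "</speech>")

print(cleaned)
print(is_well_formed(cleaned))
```

This only tests well-formedness, which is all the assignment asks for; it says nothing about whether the markup is a sensible description of the play.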

You can use regex matching to tag the individual acts, but we’ll learn an easier way to do that later in the course.