Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-03-10T19:14:16+0000


Test #3: regular expressions

The task

The task for this test is to complete the transformation of Oscar Wilde’s The Importance of Being Earnest from plain text to XML that you began in Regex homework #3. Do not start the test from the original plain text file; the test assumes that you are starting from the result of completing homework #3. You are welcome to use our posted solution to that assignment to get to the starting point for this test. Alternatively, the following file is our result after completing homework #3, but before any of the test actions, so you can use it as your starting point: ernest-pre-test.xml.

Our solution

There are multiple ways to finish tagging this play, and it’s fine if your solution differs from ours as long as you arrived at well-formed XML that tags all of the pieces you needed to tag and that makes reasonable use of regular expressions to add the markup. We start with the prepared interim file linked above, which already contains all of the markup that was part of homework #3. The steps we followed are then:

Fix the anomalies manually

Manually fix the two inconsistencies mentioned in the instructions.

Tag speeches

With dot-matches-all unchecked, search for \n\n and replace with </speech>\n\n<speech>.

This approach introduces two types of error, which we fix as follows:

  • It creates a spurious </speech> end-tag at the beginning and a blank line followed by an erroneous <speech> start-tag at the end. We remove these manually now.

  • It erroneously tags stand-alone stage directions and settings as speeches. We fix those over-generalizations below.
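The speech-tagging step can be sketched in Python as follows. This is a minimal illustration rather than the <oXygen/> workflow itself: it assumes the replacement string is </speech>\n\n<speech>, that the file begins and ends with a blank line, and that Python 3.9 or later is available for removeprefix and removesuffix. The sample text is a hypothetical miniature, not the real interim file.

```python
import re

# Hypothetical miniature of the interim file: blank lines separate the units,
# and the text both starts and ends with a blank line.
sample = (
    "\n\n"
    "Algernon. Did you hear what I was playing, Lane?\n\n"
    "Lane. I didn't think it polite to listen, sir."
    "\n\n"
)

# Every blank line closes one speech and opens the next.
# No re.DOTALL flag is passed, which corresponds to dot-matches-all unchecked.
tagged = re.sub(r"\n\n", r"</speech>\n\n<speech>", sample)

# The manual cleanup, mirrored in code: drop the spurious end-tag at the
# start and the blank line plus spurious start-tag at the end.
cleaned = tagged.removeprefix("</speech>\n\n").removesuffix("\n\n<speech>")
print(cleaned)
```

In the real file the same replacement also wraps stage directions, settings, and act titles, which is why the later steps are needed.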

Fix stand-alone stage directions

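With dot-matches-all unchecked, use the expression supplied in the prompt, which is built around the capture group ((.+\n?)+) and the end-of-line anchor $, to find the stand-alone stage directions that were wrapped as speeches, and replace each match with markup built around \1. The prompt explains how this works and why we took this approach.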

Fix settings

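You can adapt the method for stand-alone stage directions to match settings, but we ran two separate literal searches (with dot-matches-all unchecked): first for the <speech> start-tag immediately followed by the start-tag of a setting, which we replaced with the setting start-tag alone, and then for the end-tag of a setting immediately followed by </speech>, which we replaced with the setting end-tag alone.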

Fix acts, part 1

Act titles (strings like SECOND ACT) are erroneously tagged as <speech><act>SECOND ACT</act></speech>. We remove the <speech> wrappers by searching for <speech>(<act>.+?</act>)</speech> (with dot-matches-all unchecked) and replacing it with \1. The pattern doesn’t match the first act, which is fine, since that one does not have the erroneous <speech> wrapper.
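As a rough Python sketch of this unwrapping step, assuming the search pattern is <speech>(<act>.+?</act>)</speech> and using a hypothetical two-unit sample rather than the real file:

```python
import re

# Hypothetical sample: one wrongly wrapped act title, one genuine speech.
sample = (
    "<speech><act>SECOND ACT</act></speech>\n\n"
    "<speech>Cecily. I don't like German.</speech>"
)

# Capture the whole <act> element and keep only the capture,
# which discards the surrounding <speech> wrapper.
unwrapped = re.sub(r"<speech>(<act>.+?</act>)</speech>", r"\1", sample)
print(unwrapped)
```

The genuine speech is left alone because its content does not begin with an <act> start-tag.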

Fix acts, part 2

Act titles are still erroneously tagged, this time as <act>FIRST ACT</act>, etc., and we need to move the tags so that they surround the entire act, not just its label, while tagging the label as a <title> element. We do this by matching <act>(.+?)</act> (with dot-matches-all unchecked) and replacing it with </act>\n\n<act><title>\1</title>. This creates a spurious </act> end-tag at the beginning of the document, before the start-tag for the first act, and it fails to write a needed </act> end-tag at the end of the document, after the conclusion of the third act. We fix those manually.
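The act restructuring can be sketched in Python like this. It assumes the search pattern is <act>(.+?)</act> and the replacement is </act>\n\n<act><title>\1</title>; the sample is a hypothetical two-act miniature, and removeprefix requires Python 3.9 or later.

```python
import re

# Hypothetical miniature with two acts, each followed by one speech.
sample = (
    "<act>FIRST ACT</act>\n\n"
    "<speech>Algernon. One.</speech>\n\n"
    "<act>SECOND ACT</act>\n\n"
    "<speech>Cecily. Two.</speech>"
)

# Close the previous act, open a new one, and turn the label into a <title>.
moved = re.sub(r"<act>(.+?)</act>", r"</act>\n\n<act><title>\1</title>", sample)

# The two manual fixes, mirrored in code: remove the spurious end-tag at the
# start and append the missing end-tag after the last act.
fixed = moved.removeprefix("</act>\n\n") + "\n\n</act>"
print(fixed)
```

After the fixes each <act> element contains its <title> and all of its speeches.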

Tag speakers

To move speaker names from the beginning of the content of the <speech> element into an attribute we match <speech>(.+?)\. (with dot-matches-all unchecked; note the space character after the dot) and replace it with <speech speaker="\1">. This step fails to tag the two speeches where characters speak together, but we didn’t discover that until later (see below).
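A small Python sketch of the speaker step, using the pattern and replacement given above on a hypothetical two-speech sample:

```python
import re

# Hypothetical sample with a one-word and a two-word speaker name.
sample = (
    "<speech>Algernon. Did you hear what I was playing, Lane?</speech>\n\n"
    "<speech>Lady Bracknell. Good afternoon, dear Algernon.</speech>"
)

# Non-greedy (.+?) stops at the first dot that is followed by a space,
# so only the speaker name is captured and moved into the attribute.
with_speakers = re.sub(r"<speech>(.+?)\. ", r'<speech speaker="\1">', sample)
print(with_speakers)
```

Because the dot and the space are part of the match but not of the capture group, they are removed from the speech content.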

Add a root element

XML requires a root element to be well formed, so we add that by selecting the entire content and wrapping <play> tags around it. <oXygen/> shows a green square, but there are two errors, which we fix below.
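Outside <oXygen/>, the effect of the root element can be checked with a short Python sketch. The helper name is_well_formed and the sample content are hypothetical; the check relies only on the standard library parser raising ParseError for input that is not well formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical content: two sibling acts with no common parent.
body = "<act><title>FIRST ACT</title></act>\n\n<act><title>SECOND ACT</title></act>"

# Wrapping everything in <play> supplies the single root element XML requires.
wrapped = f"<play>\n{body}\n</play>"

print(is_well_formed(body))
print(is_well_formed(wrapped))
```

A well-formedness check like this says nothing about whether every speech has been tagged correctly, which is why a separate sanity check is still useful.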

Fix unison speech

There are two consecutive speeches where characters speak in unison and our strategy for adding a @speaker attribute failed because there isn’t a dot followed by a space in those lines, which read:

<speech>Gwendolen and Cecily <stage>Speaking together.</stage> Your Christian names are still
an insuperable barrier. That is all!</speech>

<speech>Jack and Algernon <stage>Speaking together.</stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>

These are well-formed, so <oXygen/> doesn’t recognize them as errors. We discovered them by running a sanity check, where we searched for <speech> elements without @speaker attributes by searching for just <speech>.
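The sanity check amounts to a literal search for the bare start-tag. A minimal Python sketch, using a hypothetical sample in which one speech has lost its speaker:

```python
import re

# Hypothetical sample: the second speech never received a speaker attribute.
sample = (
    '<speech speaker="Jack">I have lost both my parents.</speech>\n'
    "<speech>Jack and Algernon together. Our Christian names!</speech>"
)

# A start-tag with an attribute reads <speech speaker=...>, so the literal
# string <speech> matches only the untagged speeches.
missing = re.findall(r"<speech>.*", sample)
print(missing)
```

An empty result would mean that every speech carries a speaker attribute.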

If you tagged speakers without including a space after the dot in the pattern, you’ll see:

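<speech speaker="Gwendolen and Cecily <stage>Speaking together"></stage> Your Christian names are still
an insuperable barrier. That is all!</speech>

<speech speaker="Jack and Algernon <stage>Speaking together"></stage> Our Christian names! Is that
all? But we are going to be christened this afternoon.</speech>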

These lines are not well-formed, which means that <oXygen/> will report the problems without our having to look for them.
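The reason a parser catches this case is that a < character may not appear inside an attribute value. A short Python sketch, in which the element name stage for the stage direction is an assumption made for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of a mis-tagged line: the start-tag of the
# stage direction (assumed here to be called <stage>) ends up inside the
# speaker attribute value.
broken = (
    '<speech speaker="Gwendolen and Cecily <stage>Speaking together">'
    "</stage> Your Christian names are still an insuperable barrier.</speech>"
)

try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError as err:
    # The parser rejects the raw < inside the attribute value.
    well_formed = False
    print("not well formed:", err)
```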

The document is now well formed, with all of the tagging tasks completed.