Digital humanities

Maintained by: David J. Birnbaum. [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2019-02-16T15:39:07+0000

Test #3: Regular Expressions

The task

For this test, we are asking you to up-convert a play called The Post Office using regular expressions. Please copy and paste the linked file into a plain text file in <oXygen/> and use the regex search and replace function to add structural tags to the play. As was the case with the homework, we suggest that you first search for any reserved characters and normalize the newline characters in the play. Then proceed with your regex searches to tag important structural components in the work. Our solution tagged the following:
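The preparatory step can also be illustrated outside <oXygen/>. The Python snippet below is a minimal sketch, assuming that the reserved characters of concern are `&` and `<` and that the text may contain Windows or old Mac line endings. The function name `prepare` and the sample string are invented for the example and do not come from the play.

```python
import re

def prepare(text: str) -> str:
    """Escape XML-reserved characters and normalize line endings."""
    # Escape ampersands first so the entity added next is not re-escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    # Turn \r\n and bare \r into \n.
    return re.sub(r"\r\n?", "\n", text)

# Invented sample with one of each problem.
sample = "Tom & Jerry\r\nif a < b\rdone\n"
print(repr(prepare(sample)))
```

The order matters: escaping `<` before `&` would turn the new `&lt;` into `&amp;lt;`.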

Bonus up-conversion: When we tagged our acts, we did it in a way that did not require us to manually add an <act> tag at the very end of the play and remove one from the very beginning. Tagging in a way that does not require this sort of manual adjustment will receive bonus points.
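One way to avoid the manual adjustment is to match each act as a whole, from its heading to its closing marker, and wrap the match in a single replacement. The Python snippet below is only a sketch of that idea, not necessarily the approach used in our solution. It assumes that each act starts with a line such as `ACT I` and ends with a line reading `CURTAIN`; the heading format and the placeholder speeches are assumptions made for the example.

```python
import re

# Hypothetical miniature of the source; heading format and speeches are placeholders.
source = (
    "ACT I\n"
    "MADHAV. Placeholder speech one.\n"
    "CURTAIN\n"
    "ACT II\n"
    "AMAL. Placeholder speech two.\n"
    "CURTAIN\n"
)

# DOTALL lets . cross line breaks; the lazy .*? stops at the first CURTAIN line,
# so each match covers exactly one act.
act = re.compile(r"^(ACT [IVX]+\n.*?)^CURTAIN\n", re.DOTALL | re.MULTILINE)

# The replacement supplies both the start and the end tag and drops CURTAIN.
tagged = act.sub(r"<act>\n\1</act>\n", source)
print(tagged)
```

Because every match receives both tags at once, no stray tag is left at the beginning or the end of the play.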

In our solution, we successfully performed each of the operations above with a single search and replace, and while your solution does not need to match ours exactly, you should use regular expressions that are richer than literal string matches. For example, tagging each of the speakers by name separately, with one regex for MADHAV and a different one for PHYSICIAN, etc., is the wrong strategy; you’ll want to use regex to describe a single pattern that matches all speakers.
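The difference between a literal match and a pattern can be sketched in Python. The example assumes that a speaker name is a run of capital letters at the start of a line, followed by a period and a space; check that assumption against the source before relying on it. The two speeches are placeholders, not text from the play.

```python
import re

# Placeholder lines in the assumed "NAME. speech" format.
lines = (
    "MADHAV. First placeholder speech.\n"
    "PHYSICIAN. Second placeholder speech.\n"
)

# One pattern for every speaker: capitals at line start, then a period and a space.
speaker = re.compile(r"^([A-Z]+)\. ", re.MULTILINE)

# The period and space are matched but not captured, so they are removed.
print(speaker.sub(r"<speaker>\1</speaker>", lines))
```

A pattern this simple would also match a line of speech that happens to begin with a capitalized abbreviation followed by a period, so it is worth reviewing the matches before replacing.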

Before you type anything, we encourage you to familiarize yourself with the way the structure of the document is represented in the source, e.g., the division of acts, the format of speaker names, etc. As a further pointer in the right direction, below is an example of what a <speech> element and its nested child elements (<speaker>, <line>, and <stage>), followed by a stage direction that is not part of the speech, should look like in the XML output:

   <line>God bless my soul! So I'm already as bad as autumn wind and sun, eh! But, friend,
       I know something, too, of the game of keeping them indoors. When my day's work is
       over I am coming in to make friends with this child of yours.
<stage>AMAL enters</stage>

Additionally, take care to remove pseudo-markup whenever possible, such as the period that follows each speaker's name and the CURTAIN string that denotes the end of an act; their purpose as textual markers will have been taken over by the markup you introduce. We tagged the metadata at the very beginning separately from the body of the play, which is to say we tagged the title, author, translator, publication information, and cast list manually.

What to submit

For this exam, we only ask that you upload your step-by-step process to CourseWeb; we already have a copy of the plain text file that we will use to rerun the steps you describe. For that reason:

  1. Submit your test as a Markdown file, with proper Markdown formatting, and especially with all of your regular expressions contained within code delimiters. You can remind yourself of how to tag code in Markdown at; scroll down to Examples, click on Code, and examine the Markdown characters and their corresponding rendering. We recommend that you look at how your Markdown is interpreted in the formatted view of the file as you are editing it, and confirm that it looks the way it should.
  2. Specify your match and replace expressions exactly, because when we rerun your process, we are going to copy and paste those expressions from the formatted view of your Markdown description. We recommend that when you are done but before you submit your answer, you go through that process yourself to verify that you haven’t left anything unclear or unspecified.
  3. Specify where you use the "dot matches all" feature. We’ll assume that you are not using it except where you tell us explicitly that you are.
  4. Your up-converted file must be well-formed in XML. You can test this by saving it as XML, with an .xml filename extension, and then closing and reopening it in <oXygen/> and checking for the green square.
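If you would like a second opinion on well-formedness besides the green square in <oXygen/>, a parser from a standard library can provide one. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `is_well_formed` and the `<play>` root element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical fragments: the second one never closes <speaker>.
good = "<play><act><speech><speaker>AMAL</speaker></speech></act></play>"
bad = "<play><act><speech><speaker>AMAL</speech></act></play>"
print(is_well_formed(good), is_well_formed(bad))
```

This checks well-formedness only; it says nothing about whether the tags are the ones the task asks for.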

One last hint

One of our searches was facilitated by distinctive and consistent patterns in the use of whitespace in the play. There is a quirk in Markdown rendering where consecutive spaces between single backticks get squished to a single space, so that, for example, the Markdown:

I searched for `^  [IVX]+` and …

which includes a regex that might match a line that begins with two spaces followed by a Roman numeral, will be rendered as if you had typed:

I searched for `^ [IVX]+` and …

that is, with only one space at the beginning of the line. You can avoid this distortion by using code block formatting instead of inline code formatting, that is, by typing:

I searched for
```
^  [IVX]+
```
and …
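To see why the lost space matters when we copy an expression from the formatted view, compare the two versions of the pattern against the same line. The Python sketch below uses an invented heading line that begins with two spaces; it is not taken from the play.

```python
import re

# Invented line: two leading spaces followed by a Roman numeral.
heading = "  II\n"

two_spaces = re.compile(r"^  [IVX]+", re.MULTILINE)  # as typed in the Markdown
one_space = re.compile(r"^ [IVX]+", re.MULTILINE)    # as rendered inline

print(bool(two_spaces.search(heading)))
print(bool(one_space.search(heading)))
```

The pattern as typed finds the line, while the collapsed one does not, because after a single leading space it expects a Roman numeral and finds a second space instead.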