Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-02-18T14:19:05+0000


Test #3: regular expressions

The task

The task for this test is to complete the transformation of Oscar Wilde’s The Importance of Being Earnest from plain text to XML that you began in Regex homework #3. Do not start the test from the original plain text file; the test assumes that you are starting from the result of completing homework #3. You are welcome to use our posted solution to that assignment to get to the starting point for this test. Alternatively, the following file is our result after completing homework #3, but before any of the test actions, so you can just use it as your starting point: ernest-pre-test.xml.

What to submit

Submit just a markdown document that describes how you performed the transformation. For each operation that you performed:

We strongly recommend copying and pasting your expressions into the markdown write-up instead of retyping them. Upload only your markdown file.

Markup to add

The tasks we left unfinished at the end of the homework assignment were that we need to:

Before you begin tagging

Two places in the text of the play have the same inconsistency, and you should fix those places manually before you begin the regex operations for this test. The inconsistency is that normally there are blank lines only between speeches, but there are two speeches that have blank lines inside them, and you should remove those extraneous blank lines manually before you begin. After you complete homework assignment #3 the two manual adjustments are:

  1. Manually delete a blank line to replace:

    Algernon. I think it has been a great success. I'm in love with Cecily,
    and that is everything.
    
    <stage>Enter Cecily at the back of the garden. She picks up the can and begins
    to water the flowers.</stage> But I must see her before I go, and make
    arrangements for another Bunbury. Ah, there she is.

    with:

    Algernon. I think it has been a great success. I'm in love with Cecily,
    and that is everything.
    <stage>Enter Cecily at the back of the garden. She picks up the can and begins
    to water the flowers.</stage> But I must see her before I go, and make
    arrangements for another Bunbury. Ah, there she is.
  2. Manually delete a blank line to replace:

    Cecily. I have never met any really wicked person before. I feel rather
    frightened. I am so afraid he will look just like every one else.
    
    <stage>Enter Algernon, very gay and debonnair.</stage> He does!

    with:

    Cecily. I have never met any really wicked person before. I feel rather
    frightened. I am so afraid he will look just like every one else.
    <stage>Enter Algernon, very gay and debonnair.</stage> He does!

In Real Life we don’t know about these sorts of inconsistencies in advance, and we may discover them only after we have tagged everything the same way and wound up with XML that is not well-formed. When that happens we have to undo our find-and-replace operations, fix the problem, and then restart the transformation. That happened to us when we were preparing this test, and it’s one of the inconveniences of working with Other People’s Data that arises in real projects, but we’re telling you about it here so that you don’t have to go through the same kind-of-annoying discovery process that we did.
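One way to notice this kind of problem early is to check well-formedness after each find-and-replace step. The following is a minimal sketch in Python, not part of the test workflow; the helper name `is_well_formed` and the `<play>` wrapper element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# A fragment in which every start-tag has a matching end-tag.
good = "<play><speech>Cecily. <stage>Enter Algernon.</stage> He does!</speech></play>"

# The same fragment after a careless replacement removed only the <speech> start-tag.
bad = "<play><stage>Enter Algernon.</stage> He does!</speech></play>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Running a check like this after each operation narrows down which step introduced the problem, so there is less to undo.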

Below we describe the steps in the order in which we performed them, and some of them will not work properly if you change that order. You don’t have to follow our steps if an alternative approach makes more sense to you, but if you take a different approach, stay alert to the ordering constraints among your operations.

Our tagging process required one match-and-replace operation that would be unreasonably challenging for this test, so we provide the match and replace patterns for that step below for you to copy. Otherwise we describe what needs to be done, but coming up with the appropriate match and replace patterns is up to you.

Tagging speeches

You can tag speeches with the milestone method we first used in the sonnets homework assignment (feel free to revisit the posted solution to that assignment to refresh your memory). Since speeches are separated by blank lines, match the blank lines and add tags to indicate that a preceding speech ends before a blank line and new speech begins after one. As with the sonnet assignment (and with the milestone strategy in general), you will need to fix the beginning and end of the file manually.
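The milestone strategy itself can be sketched in Python on a toy text. This sketch uses hypothetical `<p>` tags and invented sample paragraphs rather than the play, and it assumes that paragraphs are separated by exactly one blank line.

```python
import re

# Three toy paragraphs separated by single blank lines.
toy = "First paragraph,\nsecond line.\n\nSecond paragraph.\n\nThird paragraph."

# Treat each blank line as a milestone: the previous unit ends before it
# and the next unit begins after it.
milestoned = re.sub(r"\n\n", "</p>\n\n<p>", toy)

# The very first start-tag and the very last end-tag have no blank line
# to hang on, so they are added separately (the "manual" fix).
wrapped = "<p>" + milestoned + "</p>"

print(wrapped)
```

The replacement inserts an end-tag and a start-tag at every boundary, which is why the two ends of the file are the only places left unbalanced.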

This step erroneously tags things as speeches that are not speeches, such as the settings (at the beginning of each act), stand-alone stage directions (the ones that occupy their own paragraph, and are not embedded in speeches), and act titles, but we find it easier to over-generalize and retag those components later than to try to write a narrower regular expression initially.

Fixing stand-alone stage directions

By stand-alone stage directions we mean stage directions that are their own paragraphs, with blank lines on either side. There are also inline stage-directions, and we don’t have to do anything further with those, but stand-alone stage directions are now tagged incorrectly as:

<speech><stage>Enter Lane.</stage></speech>

and we need to remove the erroneous <speech> tags. That task is challenging because:

We are providing the match and replace patterns for this step; make sure dot-matches-all is unchecked, after which you should look for:

^<speech><stage>((.+\n?)+)</stage></speech>$

and replace it with:

<stage>\1</stage>

Here’s how it works: the inner parentheses match a single line of text up to an optional single newline. The plus sign after the inner parentheses says that we can match one or more of these, which means that we match all of the lines between the tags. The outer parentheses capture that entire match, that is, everything between the old tags but not the tags themselves. We don’t capture the old tags, so we wind up deleting them, and we then write the text that we matched inside new <stage> tags in the replacement pattern. This approach works because it will match multiple lines but it won’t match two consecutive newline characters, so it won’t overrun the end of a stand-alone stage direction that is surrounded by blank lines.
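To watch the pattern at work outside the editor, it can be run through Python’s `re` module. This is only a sketch: the sample text is invented, and it assumes that `^` and `$` are meant to anchor at line boundaries, which corresponds to `re.MULTILINE` in Python. Leaving out `re.DOTALL` corresponds to leaving dot-matches-all unchecked.

```python
import re

# Match and replacement patterns copied from the text above.
PATTERN = r"^<speech><stage>((.+\n?)+)</stage></speech>$"
REPLACEMENT = r"<stage>\1</stage>"

# Invented sample: a speech, a two-line stand-alone stage direction,
# and a speech with an inline stage direction.
sample = (
    "<speech>Lane. Yes, sir.</speech>\n"
    "\n"
    "<speech><stage>Enter Lane, who crosses\n"
    "the stage.</stage></speech>\n"
    "\n"
    "<speech>Algernon. <stage>Rising.</stage> Thank you.</speech>\n"
)

# re.MULTILINE lets ^ and $ match at each line; the dot still stops at newlines.
result = re.sub(PATTERN, REPLACEMENT, sample, flags=re.MULTILINE)

print(result)
```

Only the stand-alone stage direction loses its <speech> wrapper. The speech with the inline stage direction is untouched because its first line does not begin with `<speech><stage>`.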

Capture groups are numbered from left to right according to their open parenthesis characters. That means that when we have nested parentheses, as we do here, the outer pair is capture group #1 (because its open parenthesis is further to the left) and the inner pair is capture group #2.
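The numbering can be checked with a short Python sketch that uses the same nested parentheses on an invented two-line paragraph. In Python’s `re` module a group that repeats keeps only its last repetition, which is why group 2 holds just the final line here.

```python
import re

# Two lines of text followed by a blank line and another paragraph.
text = "first line\nsecond line\n\nnext paragraph"

# Same nesting as in the pattern above: outer group 1, inner group 2.
m = re.search(r"((.+\n?)+)", text)

outer = m.group(1)  # everything matched by the outer parentheses
inner = m.group(2)  # the last repetition of the inner parentheses

print(repr(outer))
print(repr(inner))
```

The match stops at the blank line because the inner group needs at least one non-newline character to start another repetition.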

Fixing settings

Settings are erroneously wrapped in <speech> tags in the same way as stand-alone stage directions. You can fix them using the method above, but you can, alternatively, fix the start- and end-tags separately. For reasons described above, fixing the start- and end-tags separately would not work for stage directions, but it does for settings.

Fixing acts

We have to do three things with acts:

  1. Remove the erroneous <speech> tag wrappers, as with stand-alone stage directions and settings, above.

  2. Wrap the <act> tags around the entire act, and not just its heading.

  3. Tag the heading as <title>.

We did the first of those as one step and then the second and third together. When we used the milestone method for the first step, we had to fix the beginning and the end of the document manually.

Fixing speakers

Speeches are now tagged correctly as <speech> elements, but each speech contains the name of the speaker at the beginning as if it were part of the speech. For example:

<speech>Algernon. <stage>Inspects them, takes two, and sits down on the sofa.</stage> Oh! . . .
by the way, Lane, I see from your book that on Thursday night, when
Lord Shoreman and Mr. Worthing were dining with me, eight bottles of
champagne are entered as having been consumed.</speech>

We need to change that to:

<speech speaker="Algernon"><stage>Inspects them, takes two, and sits down on the sofa.</stage> Oh! . . .
by the way, Lane, I see from your book that on Thursday night, when
Lord Shoreman and Mr. Worthing were dining with me, eight bottles of
champagne are entered as having been consumed.</speech>

You can match the speaker name because it is at the beginning of the speech and continues until the first dot. When you capture the speaker name to rewrite it into an attribute, throw away both the dot and the space that originally followed the dot.

The assumption that the speaker name continues until the first dot would fail if there were speakers named, for example, Mrs. Prism (instead of Miss Prism). That doesn’t happen in this play, though.
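The effect of that assumption can be seen in a small Python sketch that takes everything before the first dot, using plain string methods rather than a regular expression. The helper name `speaker_before_first_dot` and the sample lines are invented for illustration.

```python
def speaker_before_first_dot(speech_text: str) -> str:
    """Return the text that precedes the first dot in speech_text."""
    name, _dot, _rest = speech_text.partition(".")
    return name


# A speaker name without an internal dot comes out whole.
ok = speaker_before_first_dot("Miss Prism. Cecily, Cecily!")

# A hypothetical speaker name with an abbreviation is cut short.
broken = speaker_before_first_dot("Mrs. Prism. Cecily, Cecily!")

print(ok)
print(broken)
```

A text that did contain abbreviated titles in speaker names would need a different rule for where the name ends.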

Bonus tasks

The beginning of the homework assignment asked you to remove the front part of the play, before the first act. For extra credit tell us how you could use regular expressions to tag this front matter. Some parts of it will have to be tagged manually, which isn’t very interesting, so tell us only about the places where you can use regular expressions to simplify the task. You will probably want to select parts of the front matter and perform your find-and-replace operations over only the selected lines; if you do that, tell us which section of the material you selected for each operation. As with the regular tasks, above, provide the exact match pattern and replacement pattern and tell us for each operation whether dot matches all is checked or unchecked.