Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-02-21T21:44:36+0000


Test #3: Regular Expressions

The task

To demonstrate your ability to work with regular expressions we are asking you to tag George Bernard Shaw’s Arms and the man. We downloaded this originally from Project Gutenberg and prepared a simplified version that you should use for the test, which is available at http://dh.obdurodon.org/arms-and-the-man.txt.

For test purposes we are asking you to tag the following basic structural elements by using regular-expression find-and-replace operations. You should describe the steps you took by creating a markdown document, which you should upload when you are done. Do not upload your resulting XML, but do verify in <oXygen/> that it is well-formed.

The find-and-replace operations below are listed in no particular order, which is to say that you may not want to perform them in the order in which we present them. Sometimes the order doesn’t matter, but at other times whether you tag from the inside out or the outside in can affect how easy it is to get to the results you want. It may be easiest to perform some of these operations in multiple passes, for example, by overgeneralizing and then repairing the overgeneralizations, as we did in our solution to the Shakespeare sonnet task in our first regex homework assignment. Here are the operations you’ll want to perform:

After you finish your autotagging, and before you upload your write-up, verify:

Eyeball your XML (you can pretty-print it inside <oXygen/> to make it easier to read) to check whether the markup looks correct. When you are done, upload the markdown file in which you document how you performed your regex replacement operations. Do not upload the XML that you created; we will rerun the steps you describe to create the XML ourselves. If you have any questions about how to represent something in markdown, please post them to the markdown channel in Slack and we’ll be happy to help.

Hint: When performing your regular expression operations, be wary of creating overlapping tags, which would violate well-formedness. Do your best to account for them without replacing them manually, either by not creating them in the first place or by using regular expressions to fix them.
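
To see concretely why overlapping tags are a problem, you can test well-formedness programmatically. The sketch below is not part of the test workflow, which uses <oXygen/>. It assumes Python’s standard xml.etree.ElementTree parser, and the helper name is_well_formed and both sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <em> opens and closes inside <stage>.
nested = "<speech><stage>with <em>sudden</em> anxiety</stage></speech>"

# Overlapping: <em> opens inside <stage> but closes after </stage>.
overlapping = "<speech><stage>with <em>sudden</stage> anxiety</em></speech>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

The parser rejects the second string because the </stage> end-tag arrives while <em> is still open, which is exactly the situation the hint warns about.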

Basic solution

There are several ways to approach the problem, and yours does not have to look like ours as long as you complete the items above using regular expressions and your XML is well-formed. Here is how we did it. For all of our steps we have checked Case-sensitive, although for some of them that isn’t strictly necessary.

  1. At the beginning, you should clean up the document, searching for reserved characters, as well as standardizing spacing between the lines. For our solution, we searched for \n{3,} and replaced with \n\n. We found no instances of ampersands or angle brackets.

  2. Because of the way stage directions contain underscores that we don’t want to convert to emphasis elements, we found it easiest to process the stage directions first. If we do that, all remaining underscores will then mark real instances of emphasis.

    You can find stage directions by searching for text surrounded by parentheses that are adjacent to underscores. The pattern you want to find is \(_(.+?)_\) and you want to replace it with <stage>\1</stage>, but before you do that … The challenge is that there may be more than one stage direction in a line and a single stage direction may start on one line and finish on another. Both of these features show up in one of Raina’s speeches, which is spread over two lines in the input document:

    (_with a cry of delight_). Ah! (_Rapturously._) Oh, mother! (_Then,
    with sudden anxiety_) Is father safe?

    To cope with this you need to check Dot matches all (so that you can match a stage direction that includes a newline character) and you need to use the question mark in your pattern, after the plus sign, to tell it to match non-greedily. The second challenge is that parentheses in a match pattern create a capture group, which means that they don’t match literal parentheses, so if you want to match literal parenthesis characters you need to escape them by preceding them with a backslash (\) character. Our pattern, then, begins with a literal open parenthesis and underscore (\(_), followed by a parenthesized capture group ((.+?)), followed by a literal underscore and close parenthesis (_\)). Our outermost parentheses are used to match literal parenthesis characters so they have to be escaped with preceding backslashes, while our innermost parentheses form a capture group and do not match literal parenthesis characters.

  3. To tag the emphasis: find _(.+?)_ and replace with <em>\1</em>. As with stage directions, the challenge here is that it is possible for more than one instance of emphasis to occur in a single line, and also for a single instance of emphasis to start on one line and finish on another. This means that you should check Dot matches all and you need to use the question mark after the plus sign to specify non-greedy matching. As it happens, there is only one instance in the entire document where underscores represent real emphasis, and they emphasize the one-letter word I, which means that if you fail to check Dot matches all and fail to use the question mark to specify non-greedy matching in this case you will nonetheless get the result you want. Unless you already know that, though, you want to use a regex that will also handle other possibilities.

  4. Tag the speeches with a speaker attribute: A speaker in the plain text is a line with all capital letters and a final period, some also with spaces (e.g., THE OFFICER.) and curly apostrophes (’; this occurs in A MAN’S VOICE.). Be sure that you’ve checked Case-sensitive and that Dot matches all is unchecked, and then find ^([A-Z’ ]+)\.$ and replace it with </speech>\n<speech speaker="\1">. Manually add a </speech> end-tag at the end; we’ll deal with other missing or misplaced <speech> start- or end-tags later.

  5. The setting is the text that immediately follows the name of the act (which is recognizable because the line begins with ACT) and immediately precedes the first speech of the act (which is now recognizable because you’ve tagged all the speeches). To tag it, check Dot matches all (because the setting may span multiple lines) and Case-sensitive (in case any lines happen to begin with act insensitive or something similar), find ^(ACT I+)(.+?)(</speech>) and replace it with \1\n<setting>\2</setting>. This captures the act label, with its Roman numeral, and writes it back into the output (we’ll tag the acts later) and then wraps <setting> tags around everything up to the spurious </speech> end-tag that we created before the first tagged speech in each act. This lets us both tag the setting and throw away the unwanted </speech> end-tags before the first real speech of each act.

    Extra credit: to tag paragraphs within setting elements, first select the lines (including blank ones) immediately after the <setting> start-tag and up to and including the </setting> end-tag. Then, with Dot matches all checked and the scope set to Only selected lines, search for (.+?)\n\n and replace with <p>\0</p>\n. Repeat the above steps for the setting of the other two acts.

  6. Acts begin on a new line with the word ACT followed by one or more I letters, indicating the act's number with a Roman numeral. To wrap each act in <act> tags with an attribute with the Roman numeral for the act, check Case-sensitive and find \nACT (I+) and replace with </speech>\n</act>\n<act n="\1">. Note that this expression begins by inserting the missing </speech> tag from the end of the previous act.

  7. Manually add an </act> end-tag to the end of the play and remove the spurious </speech></act> end-tags from the beginning, between the title and the first act.

  8. Finally, tag the title and wrap everything in a root element.
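
For readers who want to rerun the steps outside <oXygen/>, here is a minimal sketch of steps 1 through 8 using Python’s re module. It is an approximation under stated assumptions, not our actual workflow: we assume that Dot matches all corresponds to re.DOTALL and that the ^ and $ line anchors need re.MULTILINE in Python. The function names autotag and finish_by_hand, the root element name play, and the short SAMPLE text are all made up for illustration; the real input is the full text file.

```python
import re
import xml.etree.ElementTree as ET

# Made-up miniature input in the assumed layout (not a quotation of the test file).
SAMPLE = (
    "ARMS AND THE MAN\n"
    "\n"
    "ACT I\n"
    "\n"
    "Night. A lady's bedchamber.\n"
    "\n"
    "CATHERINE.\n"
    "(_entering hastily_) Raina! (_She\n"
    "pronounces it Rah-eena._)\n"
    "\n"
    "\n"
    "\n"
    "RAINA.\n"
    "(_dreamily_) _I_ am here.\n"
)


def autotag(text: str) -> str:
    """Apply the regex replacements from steps 1-6, in that order."""
    # Step 1: collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Step 2: stage directions; DOTALL lets one direction span lines,
    # and +? keeps each match from swallowing the next direction.
    text = re.sub(r"\(_(.+?)_\)", r"<stage>\1</stage>", text, flags=re.DOTALL)
    # Step 3: any underscores that remain mark real emphasis.
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text, flags=re.DOTALL)
    # Step 4: speaker lines; MULTILINE makes ^ and $ match at line boundaries.
    text = re.sub(r"^([A-Z’ ]+)\.$", r'</speech>\n<speech speaker="\1">',
                  text, flags=re.MULTILINE)
    text += "</speech>"  # the end-tag that step 4 adds manually
    # Step 5: wrap the setting and discard the spurious </speech> after it.
    text = re.sub(r"^(ACT I+)(.+?)(</speech>)", r"\1\n<setting>\2</setting>",
                  text, flags=re.DOTALL | re.MULTILINE)
    # Step 6: open each act, closing the previous speech and act first.
    text = re.sub(r"\nACT (I+)", r'</speech>\n</act>\n<act n="\1">', text)
    return text


def finish_by_hand(text: str) -> str:
    """Imitate the manual edits in steps 7 and 8 (root name is hypothetical)."""
    # Step 7: drop the spurious end-tags before the first act, close the last act.
    text = text.replace("</speech>\n</act>\n", "", 1) + "\n</act>"
    # Step 8: tag the title (the first line) and wrap everything in a root element.
    title, rest = text.split("\n", 1)
    return f"<play>\n<title>{title}</title>\n{rest}\n</play>"


tagged = finish_by_hand(autotag(SAMPLE))
ET.fromstring(tagged)  # raises ParseError if the result is not well-formed
print(tagged)
```

The order matters here for the same reasons as in the steps above: stage directions are tagged before emphasis so that their underscores are consumed first, and speeches are tagged before settings so that the first </speech> in each act can serve as the end of the setting.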

Beyond the basics

It’s reasonable not to have done the following, and especially the second one, because the most effective way to do it requires a feature we haven’t introduced yet. For that reason, anything you did about either of the following issues counts as extra credit, that is, points above what otherwise constitutes a perfect score.

White space

The strategy described above leaves unwanted space characters in several places. For example, we see a speech that reads, in its entirety:

<speech speaker="CATHERINE"> And what should I be able to say to your father, pray? </speech>

This is suboptimal because the speech doesn’t really begin and end with spaces. You can fix this by searching for whitespace (using the \s regex escape, which matches any space, tab, or newline) after certain start-tags and before certain end-tags and removing it.
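
As a rough illustration of that cleanup, here is a sketch in Python. Which tags count as “certain” start- and end-tags is left open above, so the default list of speech and setting is an assumption, as is the function name trim_element_whitespace.

```python
import re


def trim_element_whitespace(xml_text: str, names=("speech", "setting")) -> str:
    """Strip whitespace just inside the start- and end-tags of the named elements."""
    alternatives = "|".join(names)
    # Whitespace immediately after a start-tag (attributes are allowed).
    xml_text = re.sub(rf"(<(?:{alternatives})\b[^>]*>)\s+", r"\1", xml_text)
    # Whitespace immediately before an end-tag.
    xml_text = re.sub(rf"\s+(</(?:{alternatives})>)", r"\1", xml_text)
    return xml_text


speech = ('<speech speaker="CATHERINE"> And what should I be able '
          'to say to your father, pray? </speech>')
print(trim_element_whitespace(speech))
# <speech speaker="CATHERINE">And what should I be able to say to your father, pray?</speech>
```

Whitespace between one element’s end-tag and the next element’s start-tag is left alone, because neither pattern matches there.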

Paragraphs in settings

We invited you not to worry about paragraphs in settings because they are difficult to process with regex. You could select the sections and apply regex find-and-replace to the selected lines, but since there are three settings, you would have to do that three times, and because the settings are so short, it might be just as quick to tag the setting paragraphs inside them manually.

As an alternative, we haven’t discussed the box in the <oXygen/> find-and-replace dialog labeled XPath yet because we haven’t introduced XPath, but we’ll do that soon, and you’ll be able to use it to constrain find-and-replace operations by the location in the XML hierarchy. For example, it’s possible to write a global (not confined to selected lines) rule to tag paragraphs (however you might normally do that), but use the XPath box to constrain it to operate only inside <setting> elements.
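
Until we get to XPath, the idea of scoping a replacement to one kind of element can be approximated in a scripting language. The sketch below is a Python analogue, not the <oXygen/> XPath feature itself: it finds each <setting> element with a regex and tags paragraphs only inside it. The function name tag_setting_paragraphs and the sample string are made up, and we assume that paragraphs are separated by blank lines.

```python
import re


def tag_setting_paragraphs(xml_text: str) -> str:
    """Wrap blank-line-separated chunks in <p> tags, but only inside <setting>."""

    def wrap(match: re.Match) -> str:
        # Split the setting's content on blank lines and drop empty chunks.
        chunks = [c.strip() for c in re.split(r"\n\s*\n", match.group(1))]
        paragraphs = "\n".join(f"<p>{c}</p>" for c in chunks if c)
        return f"<setting>\n{paragraphs}\n</setting>"

    # DOTALL lets a setting span several lines; .*? stops at the first </setting>.
    return re.sub(r"<setting>(.*?)</setting>", wrap, xml_text, flags=re.DOTALL)


sample = (
    "<setting>\n\nNight. A bedchamber.\n\nThe window is open.\n\n</setting>\n"
    '<speech speaker="RAINA">Oh!\n\nMother!</speech>'
)
print(tag_setting_paragraphs(sample))
```

The blank line inside the speech is untouched, because the paragraph rule runs only on text captured between <setting> and </setting>.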