Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2020-03-06T01:26:14+0000


Regex test: answer key

The task

To demonstrate your ability to work with regular expressions we are asking you to tag Sophocles’s Oedipus the King. We originally downloaded the text from Project Gutenberg and prepared a simplified version, which you should use; it is available at http://dh.obdurodon.org/oedipus.txt.

For test purposes we are asking you to tag the following basic structural elements by using regular-expression find-and-replace operations. You should describe the steps you took by creating a markdown document, which you should upload when you are done.

The find-and-replace operations below are listed in no particular order, which is to say that you may not want to perform them in the order in which we present them. Sometimes the order doesn’t matter, but at other times whether you tag from the inside out or the outside in can affect how easy it is to get to the results you want. It may be easiest to perform some of these operations in multiple passes. Here are the operations you’ll want to perform:

When you are done, upload the markdown file that documents how you performed your regex replacement operations. Do not upload the XML that you created; we will rerun the steps you document to create the XML.

Solution

There are several ways to approach the problem. Here is one approach:

  1. At the beginning, you should clean up the document, deleting the extraneous material and searching for reserved characters, as well as standardizing spacing between the lines.
  2. Tag the lines: Find .+ and replace with <line>\0</line>. (Dot matches all should not be checked, since you want to find each line separately.)
  3. Tag the speeches and speakers: A speaker is a line with all capital letters and possibly also spaces (e.g., SECOND MESSENGER). Be sure that you’ve checked Case sensitive and then find ^<line>([A-Z\s]+)</line>$ and replace with </speech>\n<speech speaker="\1">. You’ll have to fix up the first and last speeches manually later.
  4. Delete the footnotes. Find \[[0-9]\] and replace with nothing. It’s easiest to delete the footnotes before tackling stage directions.
  5. Tag the stage directions: You can find almost all stage directions by finding text inside square brackets that contains at least two consecutive upper-case letters. That’s because stage directions normally contain the names of characters, and those are written entirely in upper-case letters. You can do that with \[(.*[A-Z]{2,}.*)\], using <stage>\1</stage> as the replacement. If you do that and then look for remaining text in square brackets with \[.+?\], you’ll find two hits, one of which is a stage direction without a character name ([Exeunt]) and the other of which is a line that Oedipus mutters ([None but a fool would credit such as thou.]). It’s easiest to deal with these manually: tag the first as a stage direction and leave the other untouched.
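  6. Tag emphasis: Find _(.+)_ and replace with <em>\1</em>, with Dot matches all unchecked. (Because .+ is greedy, a line with two emphasized phrases would be wrapped in a single <em>; the non-greedy _(.+?)_ avoids that.)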
  7. The hardest part of this assignment was the strophes and antistrophes, and we accepted solutions that only marked a single stanza after a strophe. Textually, though, a strophe (or antistrophe) continues until the next strophe or antistrophe or a different speaker, and it took us multiple passes to catch them all. Here’s one way:
    1. Uncheck Case sensitive and check Dot matches all. Then find <line>\(([a-z]{3})\.\s([0-9]+)\)</line>(.+?)(<line>\([a-z]{3}\.\s[0-9]+\)</line>|</speech>). This finds every line that consists of an opening parenthesis (a special character, so use a backslash to escape it), a three-character string of upper- or lower-case letters, a period, a space, a number, and a closing parenthesis (we wrote our solution in such a way that it could match any number of digits, although we didn’t strictly need the repetition indicator for this project). After that, we match everything that follows up to the next strophe/antistrophe heading, as above, or the end tag of the speech. What makes this work is the combination of Dot matches all (so that we can match multiple lines at once) and non-greedy matching (.+?), so that we match only until the first stopping point.
    2. Replace with <\1 number="\2">\3</\1>\4, which takes the Str. or Ant. string and makes it the element name, takes the number and makes it an attribute value, and keeps the rest of the text intact.
    3. Some of them won't get picked up on that sweep, because the heading that ends one match is consumed by that match and so cannot also begin the next one. Do it again with a slight modification: find <line>\(([a-z]{3})\.\s([0-9]+)\)</line>(.+?)(<Str|<Ant|</speech>) and replace with <\1 number="\2">\3</\1>\4, same as above.
    4. There are also some unnumbered strophes and antistrophes that require a different approach: find <line>\(([a-z]{3})\.\)</line>(.+?)(<Str|<Ant|</speech>) and replace with <\1 number="none">\2</\1>\3, which is very similar to the replacement above.
    5. One last small modification to catch the remainders: search for \(([a-z]{3})\.\)</line>(.+?)(<Str|<Ant|</speech>|\() using the same replacement as above.
  8. This may damage some of the line markup, which we fixed by searching for <line>(</[AS]) and replacing with \1.
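
Steps 2 through 6 above are written for an editor's find-and-replace dialog. As a rough sketch only, the same passes could be reproduced with Python's re module. The sample text below is invented for illustration and is not taken from the real file. The function name tag_basic_structure is hypothetical. Two things differ from the editor: Python writes the whole match as \g<0> rather than \0, and the speaker pattern here is anchored outside the <line> tags, with re.MULTILINE so that ^ and $ match at each line.

```python
import re

# Invented excerpt in the style of the cleaned-up play text (not the real file).
sample = "\n".join([
    "OEDIPUS",
    "My children, latest born to Cadmus old,[1]",
    "Why sit ye _here_ as suppliants?",
    "",
    "[Enter CREON]",
    "",
    "CREON",
    "Good news, my lord.",
])


def tag_basic_structure(text: str) -> str:
    """Apply steps 2 through 6 in the order the list gives them."""
    # Step 2: wrap every non-empty line. "." stops at line ends by default.
    text = re.sub(r".+", r"<line>\g<0></line>", text)
    # Step 3: a line made only of capitals and whitespace is a speaker heading.
    text = re.sub(
        r"^<line>([A-Z\s]+)</line>$",
        r'</speech>\n<speech speaker="\1">',
        text,
        flags=re.MULTILINE,
    )
    # Step 4: delete single-digit footnote markers such as [1].
    text = re.sub(r"\[[0-9]\]", "", text)
    # Step 5: bracketed text with two or more consecutive capitals is a stage direction.
    text = re.sub(r"\[(.*[A-Z]{2,}.*)\]", r"<stage>\1</stage>", text)
    # Step 6: underscores mark emphasis.
    text = re.sub(r"_(.+)_", r"<em>\1</em>", text)
    return text


tagged = tag_basic_structure(sample)
print(tagged)
```

As in the editor, the output begins with a stray </speech> and lacks a final one, which is the manual fix-up that step 3 mentions.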
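
The need for a second strophe pass in step 7 can be seen on a small case. The following is a minimal sketch in Python, assuming a made-up chorus speech that has already been through line and speech tagging. The variable names and the sample lines are illustrative only. re.IGNORECASE and re.DOTALL stand in for unchecking Case sensitive and checking Dot matches all.

```python
import re

# Invented chorus speech, already tagged with <line> and <speech>.
speech = "\n".join([
    '<speech speaker="CHORUS">',
    "<line>(Str. 1)</line>",
    "<line>Sweet-voiced daughter of Zeus</line>",
    "<line>(Ant. 1)</line>",
    "<line>First on Athene I call</line>",
    "</speech>",
])

FLAGS = re.IGNORECASE | re.DOTALL
REPLACEMENT = r'<\1 number="\2">\3</\1>\4'

# First pass (step 7.1): stop at the next numbered heading or at </speech>.
first_pass = re.compile(
    r"<line>\(([a-z]{3})\.\s([0-9]+)\)</line>(.+?)"
    r"(<line>\([a-z]{3}\.\s[0-9]+\)</line>|</speech>)",
    FLAGS,
)
after_first = first_pass.sub(REPLACEMENT, speech)

# Second pass (step 7.3): stop at an already tagged <Str or <Ant, or at </speech>.
second_pass = re.compile(
    r"<line>\(([a-z]{3})\.\s([0-9]+)\)</line>(.+?)(<Str|<Ant|</speech>)",
    FLAGS,
)
after_second = second_pass.sub(REPLACEMENT, after_first)

print(after_first)
print("-----")
print(after_second)
```

After the first pass the (Ant. 1) heading is still untagged: it was consumed as the end of the strophe match, and matches cannot overlap, so it could not start a match of its own. The second pass picks it up.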