Digital humanities

Maintained by: David J. Birnbaum. [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-02-14T20:48:59+0000

Test #3: Regular Expressions

The task

To demonstrate your ability to work with regular expressions, we are asking you to tag George Bernard Shaw’s Arms and the Man. We originally downloaded the text from Project Gutenberg and prepared a simplified version that you should use for the test, which is available at

For test purposes we are asking you to tag the following basic structural elements by using regular-expression find-and-replace operations. You should describe the steps you took by creating a markdown document, which you should upload when you are done. Do not upload your resulting XML, but do verify in <oXygen/> that it is well-formed.

The find-and-replace operations below are listed in no particular order, which is to say that you may not want to perform them in the order in which we present them. Sometimes the order doesn’t matter, but at other times whether you tag from the inside out or the outside in can affect how easy it is to get to the results you want. It may be easiest to perform some of these operations in multiple passes, for example, by overgeneralizing and then repairing the overgeneralizations, as we did in our solution to the Shakespeare sonnet task in our first regex homework assignment. Here are the operations you’ll want to perform:

After you finish your autotagging, and before you upload your write-up, verify:

Eyeball your XML (you can pretty-print it inside <oXygen/> to make it easier to read) to check whether the markup looks correct. When you are done, upload the markdown file in which you document how you performed your regex replacement operations. Do not upload the XML that you created; we will rerun the steps you describe to create the XML ourselves. If you have any questions about how to represent something in markdown, please post them to the markdown channel in Slack and we’ll be happy to help.
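If you would like a second opinion alongside the check in <oXygen/>, well-formedness can also be tested with a few lines of Python. This is only an optional sketch: the function name `is_well_formed` and the element names `play`, `p`, and `stage` are invented for illustration and are not part of the assignment.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags parse; overlapping tags do not.
# (Element names here are hypothetical examples.)
nested = "<play><p>BETA. <stage>whispering</stage> Only me.</p></play>"
overlapping = "<play><p>BETA. <stage>whispering</p></stage></play>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

A check like this only confirms well-formedness; it says nothing about whether the tags are in the right places, so you still need to read through the result.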

Hint: When performing your regular expression operations, be wary of creating overlapping tags, which would violate well-formedness. Do your best to account for them without replacing them manually, either by not creating them in the first place or by using regular expressions to fix them.
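The idea of overgeneralizing, repairing, and avoiding overlap can be sketched in Python as follows. Everything here is an assumption made for illustration: the sample text is invented (it is not taken from the test file), the element names `play`, `head`, `p`, and `stage` are hypothetical, and the replacement syntax in your editor may differ from Python’s.

```python
import re

# Invented sample text, not from the actual test file.
sample = (
    "ACT I\n"
    "\n"
    "ALPHA. Is anyone there?\n"
    "\n"
    "BETA. (whispering) Only me."
)

# Pass 1 (overgeneralize): wrap every non-empty line in <p>, titles included.
text = re.sub(r"(?m)^(.+)$", r"<p>\1</p>", sample)

# Pass 2 (repair): turn the wrongly tagged act title into a heading.
text = re.sub(r"<p>(ACT [IVX]+)</p>", r"<head>\1</head>", text)

# Pass 3: tag parenthesized stage directions. Excluding < and > from the
# match keeps it from running across an existing tag and creating overlap.
text = re.sub(r"\(([^()<>]+)\)", r"<stage>\1</stage>", text)

# Wrap everything in a single root element.
text = "<play>\n" + text + "\n</play>"
print(text)
```

The character class `[^()<>]` in the third pass is the part that guards against overlap: because a match can never contain an angle bracket, a stray unmatched parenthesis cannot pull a closing `</p>` inside a `<stage>` element.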