Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2021-02-14T20:48:59+0000
To demonstrate your ability to work with regular expressions, we are asking you to tag George Bernard Shaw’s Arms and the Man. We originally downloaded the text from Project Gutenberg and prepared a simplified version for you to use for the test, which is available at http://dh.obdurodon.org/arms-and-the-man.txt.
For test purposes we are asking you to tag the following basic structural elements by using regular-expression find-and-replace operations. You should describe the steps you took by creating a markdown document, which you should upload when you are done. Do not upload your resulting XML, but do verify in <oXygen/> that it is well-formed.
The find-and-replace operations below are listed in no particular order, which is to say that you may not want to perform them in the order in which we present them. Sometimes the order doesn’t matter, but at other times whether you tag from the inside out or the outside in can affect how easy it is to get to the results you want. It may be easiest to perform some of these operations in multiple passes, for example, by overgeneralizing and then repairing the overgeneralizations, as we did in our solution to the Shakespeare sonnet task in our first regex homework assignment. Here are the operations you’ll want to perform:
<speech> elements with the name of the speaker as an attribute on the <speech> called speaker. For example: <speech speaker="RAINA">Have I not been generous?</speech> Note that you should remove the period from after the speaker names. The speeches are prose, so you do not have to tag the individual lines.
<stage> elements, removing the parentheses and underscores (which, after all, are pseudo-markup) in the process. As a simplification for test purposes, you may assume that all stage directions take place within speeches, even if some might more naturally be considered to fall before, between, or after speeches.
Text emphasized with underscores (_). Replace these with <em> tags, except where the underscores occur inside stage directions, since the italics in that case are present because they are stage directions, and not because they are emphasized.
<setting> elements. You are not required to preserve the paragraph structure within the setting descriptions, but feel free to use regular expressions to do that for extra credit if you wish.
An <act> element with an attribute containing the Roman numeral of the act. As you do that, remove the word ACT from the start of each act. Note that you need to tag the entire act, not just the single line at the beginning that gives the act number.
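To illustrate why the order of operations can matter, here is a minimal sketch in Python rather than in the <oXygen/> find-and-replace dialog (Python writes back-references as \1, where <oXygen/> uses $1). The sample string is made up for illustration and the real file may be laid out differently; the patterns are one possible approach, not a model answer. Tagging stage directions first consumes their underscores, so the second pass can treat every remaining underscore pair as emphasis.

```python
import re

# Hypothetical sample, invented for illustration; the real file may differ.
sample = "RAINA. (_Dreamily._) I _knew_ it!"

# Pass 1: stage directions, assumed here to look like (_..._).
# Replacing them first removes their underscores from the text.
staged = re.sub(r"\(_(.+?)_\)", r"<stage>\1</stage>", sample, flags=re.DOTALL)

# Pass 2: any underscore pairs still present are treated as emphasis.
emphasized = re.sub(r"_(.+?)_", r"<em>\1</em>", staged, flags=re.DOTALL)

print(emphasized)
# RAINA. <stage>Dreamily.</stage> I <em>knew</em> it!
```

Running the two passes in the opposite order would wrap the stage direction in <em> tags first, which would then have to be repaired in an additional pass.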
After you finish your autotagging, and before you upload your write-up, verify:
Dot matches all.
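As a reminder of what this setting changes, here is a small sketch using Python's re.DOTALL flag, which plays a role analogous to the Dot matches all checkbox in <oXygen/>. The sample string is hypothetical. Without the flag, the dot does not match a newline, so a pattern that has to cross a line break finds nothing.

```python
import re

# Hypothetical stage direction that is broken across two lines.
text = "(_She turns\naway._)"

pattern = r"\(_(.+?)_\)"

# Without DOTALL the dot stops at the newline, so there is no match.
without_dotall = re.findall(pattern, text)

# With DOTALL the dot also matches the newline, so the match succeeds.
with_dotall = re.findall(pattern, text, flags=re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['She turns\naway.']
```

Because the same pattern can behave differently depending on this setting, it is worth recording in your write-up which setting you used for each step.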
Eyeball your XML (you can pretty-print it inside <oXygen/> to make it easier to read) to check whether the markup looks correct. When you are done, upload the markdown file in which you document how you performed your regex replacement operations. Do not upload the XML that you created; we will rerun the steps you describe to create the XML ourselves. If you have any questions about how to represent something in markdown, please post them to the markdown channel in Slack and we’ll be happy to help.
Hint: When performing your regular expression operations, be wary of creating overlapping tags, which would violate well-formedness. Do your best to account for them without replacing them manually, either by not creating them in the first place or by using regular expressions to fix them.
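The difference between nested and overlapping tags can be sketched with Python's standard-library XML parser, which rejects overlapping tags in the same way a well-formedness check does. The helper name is_well_formed and both sample strings are our own illustrations, not part of the test materials.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <em> opens and closes inside <speech>.
nested = '<speech speaker="RAINA">I <em>knew</em> it!</speech>'

# Overlapping: <speech> closes while <em> is still open.
overlapping = '<speech speaker="RAINA">I <em>knew</speech></em>'

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

Overlap of the second kind typically arises when one pattern matches across the boundary of text that an earlier pass has already tagged.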