Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-02-17T19:04:24+0000


Test #3: regular expressions

The task

Your task is to describe how you would use regular expressions to mark up a plain-text document (see below). There are usually alternative ways to complete that sort of task, and your approach does not have to match ours as long as it 1) makes reasonable use of regular expressions to avoid unnecessary manual tagging, 2) tags as many parts of the document structure as you can sensibly tag by using regular expression matching, and 3) produces well-formed XML (possibly with a bit of manual cleanup). Well-formed XML is a baseline requirement in Real Life, and is therefore worth a lot of points on this test. You do not have to add the sort of analytic markup that requires primarily manual tagging.
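If you want a quick way to confirm well-formedness outside your XML editor, the following is a minimal sketch that assumes Python and its standard-library parser. The element names and sample strings are hypothetical and chosen only for illustration; they are not a suggested markup scheme for this test.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples, not taken from the test document.
good = "<chapter><p>Some text.</p></chapter>"
unclosed = "<chapter><p>Some text.</chapter>"  # <p> is never closed
raw_ampersand = "<p>bread & cheese</p>"        # & must be written as &amp;

print(is_well_formed(good))
print(is_well_formed(unclosed))
print(is_well_formed(raw_ampersand))
```

A check like this tells you only whether the result parses; it says nothing about whether the tagging is consistent or sensible.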

The document you’ll tag for this test is the first two chapters of Bram Stoker’s Dracula, which you can access at dracula.txt. We have extracted these chapters from the Project Gutenberg publication of the 1897 New York edition, but all you need for the text is our two chapters, so you don’t have to consult anything on the Gutenberg site unless you’re just eager to read more of the story. The italicized phrase Mem. is short for memorandum, that is, Jonathan’s reminder, as a note to himself within his diary, about a detail that requires follow-up. The lines that consist entirely of asterisks and space characters have a structural function that you’ll have to identify, by looking at the contexts in which they appear, in order to decide how that function should be reflected in your XML.
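As an illustration of what "lines that consist entirely of asterisks and space characters" means in regex terms, here is a hedged Python sketch. The sample text is invented rather than quoted from dracula.txt, and the pattern only locates such lines; deciding what they mean structurally is still up to you. Flags and anchors may behave differently in your own editor.

```python
import re

# Invented sample; the spacing of the asterisk line is an assumption.
rule = "     *     *     *     *     *"
sample = f"First paragraph.\n\n{rule}\n\nSecond paragraph with a * inside.\n"

# With re.MULTILINE, ^ and $ match at the start and end of each line.
# The line must contain at least one asterisk and nothing but asterisks and spaces.
asterisk_line = re.compile(r"^ *\*[ *]*$", re.MULTILINE)

matches = asterisk_line.findall(sample)
print(matches)
```

Requiring at least one asterisk keeps the pattern from matching blank lines, and anchoring both ends keeps it from matching a stray asterisk inside ordinary prose.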

Things to tag (in whatever order makes sense for you; some may require a bit of manual intervention):

Note: You might be tempted to autotag quotations by matching a pair of quotation marks plus whatever is between them, but that will run into trouble. This document uses a common older typographic convention in which a quotation that extends over more than one paragraph has an opening quotation mark at the beginning of each paragraph but a closing quotation mark only after the last one, which means that the quotation marks are not paired. Here’s an example:

He went, but immediately returned with a letter:--

"My Friend.--Welcome to the Carpathians. I am anxiously expecting you. Sleep well to-night. At three to-morrow the diligence will start for Bukovina; a place on it is kept for you. At the Borgo Pass my carriage will await you and will bring you to me. I trust that your journey from London has been a happy one, and that you will enjoy your stay in my beautiful land.

"Your friend,

"DRACULA."

For this test you may leave the quotation marks as they are in the original; you do not need to replace them with markup.
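To see the mismatch concretely, here is a small Python sketch that runs a naive paired-quotation pattern over an abridged version of the letter above. The pattern is only an illustration of the tempting approach, not a recommended step.

```python
import re

# Abridged from the example above: three opening marks, one closing mark.
letter = (
    '"My Friend.--Welcome to the Carpathians. I am anxiously expecting you.\n'
    '\n'
    '"Your friend,\n'
    '\n'
    '"DRACULA."\n'
)

# Naive pattern: a quotation mark, any run of non-quotation-mark characters,
# and another quotation mark.
naive = re.compile(r'"[^"]*"')

matches = [m.group() for m in naive.finditer(letter)]
for text in matches:
    print(repr(text))
```

The first match runs from the opening mark of the first paragraph to the opening mark of the second, so the pattern treats an opening mark as a closing one, and the words "Your friend," end up outside every match.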

Optional bonus task

You’re welcome to stop here, but if you’d like more practice: If we were undertaking this task in Real Life we would create a schema to validate our XML. Because the sort of regex-supported tagging we use is not XML-aware, it is possible to create XML that is not well formed or that is tagged inconsistently. <oXygen/> will always check XML for well-formedness, but we would use a schema to ensure that our markup consistently says what we want it to say. If you’d like, then, create a schema that models what your document analysis tells you about the structure of the document and verify that your markup is valid against the schema.

What to submit

Write up the process you follow in a markdown document (following our file-naming conventions and with an .md filename extension) as a sequence of steps. Inline code snippets must be surrounded by backticks because that’s how markdown formats code snippets, and, for the same reason, code blocks must be fenced, that is, surrounded by triple backticks on separate lines. You can remind yourself about markdown syntax at the GitHub three-minute guide to Mastering markdown that you read earlier. Your markdown must distinguish the individual steps clearly, and it must also format your code snippets and code blocks clearly and correctly. Be sure to check how it looks in the formatted view, and not just in the editing view where you type.

In your list of steps be sure to indicate clearly when you use the Dot matches all option. If you don’t say explicitly that you’re using it for a particular step, we’ll assume that it is unchecked for that step. You might want to create a little template for each step in your process that includes 1) match pattern, 2) replacement, 3) whether or not you checked Dot matches all, and 4) a bit of prose that explains what each step does, as well as anything else that seems necessary to explain why you are doing what you’re doing at a particular moment.
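The four-part template can be pictured as a small record, and the effect of Dot matches all can be shown by applying the same pattern with the option off and on. The sketch below assumes Python, where `re.DOTALL` plays the role of that checkbox; the `Step` class, the `apply_step` function, and the `<x>` element are hypothetical names used only for illustration, and Python’s replacement syntax (`\g<0>`) is not necessarily what your editor uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    pattern: str           # 1) match pattern
    replacement: str       # 2) replacement
    dot_matches_all: bool  # 3) whether Dot matches all is checked
    note: str              # 4) prose explanation of the step


def apply_step(text: str, step: Step) -> str:
    """Apply one find-and-replace step, honoring the dot-matches-all setting."""
    flags = re.DOTALL if step.dot_matches_all else 0
    return re.sub(step.pattern, step.replacement, text, flags=flags)


text = "line one\nline two"

unchecked = Step(r".+", r"<x>\g<0></x>", False, "dot stops at each newline")
checked = Step(r".+", r"<x>\g<0></x>", True, "dot also matches newlines")

print(apply_step(text, unchecked))
print(apply_step(text, checked))
```

With the option off, each line is wrapped separately; with it on, the single match spans both lines. That difference is why each step needs to state the setting explicitly.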

Submit just the markdown document, not the XML it produces. We will follow your instructions step by step to replicate your method and recreate your tagged XML ourselves. If you create the optional bonus Relax NG schema, submit that as well.

Feel free to ask questions in Slack (there are #regex and #markdown channels, and the #general channel for anything that doesn’t fit in any more specific one). We can’t tell you how to do something, but if you’re stuck on a particular detail we’ll be happy to do what we can to point you in the right direction without just telling you the answer.