Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2023-02-17T19:04:24+0000
Your task is to describe how you would use regular expressions to mark up a plain-text document (see below). There are usually alternative ways to complete that sort of task, and your approach does not have to match ours as long as it 1) makes reasonable use of regular expressions to avoid unnecessary manual tagging, 2) tags as many parts of the document structure as you can sensibly tag by using regular expression matching, and 3) produces well-formed XML (possibly with a bit of manual cleanup). Well-formed XML is a baseline requirement in Real Life, and is therefore worth a lot of points on this test. You do not have to add the sort of analytic markup that requires primarily manual tagging.
The document you’ll tag for this test is the first two chapters of Bram Stoker’s
Dracula, which you can access at dracula.txt. We have extracted these chapters from the Project Gutenberg
publication of the 1897 New York edition, but all you need for the text is
our two chapters, so you don’t have to consult anything on the Gutenberg site unless
you’re just eager to read more of the story. The italicized phrase Mem. is
short for memorandum
, that is, Jonathan’s reminder, as a note to himself
within his diary, about a detail that requires follow-up. The lines that consist
entirely of asterisks and space characters have a structural function that you’ll
have to identify, by looking at the contexts in which they appear, in order to
decide how that function should be reflected in your XML.
Things to tag (in whatever order makes sense for you; some may require a bit of manual intervention):
The entire document
The document title
Chapters
Chapter labels (e.g., CHAPTER I
), chapter titles, chapter subtitles
((_Kept in shorthand._)
in Chapter 1)
Dated diary entries (each chapter contains three of these, and the dated diary entries contain the paragraphs)
The date and place information at the beginning of a dated entry (e.g., 5
May. The Castle.
; only some contain places). You can tag these just
as a single element
5 May. The Castle.]]>
or, for extra credit, tag them as a group but also tag the date part and the place part separately inside them:
5 May.
The Castle.
]]>
Paragraphs
Italics (surrounded by underscores; if you can identify the reason for the italics, e.g., foreign word or emphasis, tag descriptively; otherwise tag presentationally)
Note: You might be tempted to autotag quotations by matching a pair of quotation marks plus whatever is between them, but that will run into trouble because this document uses a common older typographic convention where quotations that extend for more than one paragraph have an open quotation mark at the beginning of each paragraph but a closing quotation mark only after the last one, which means that the quotation marks are not paired. Here’s an example:
He went, but immediately returned with a letter:--
"My Friend.--Welcome to the Carpathians. I am anxiously expecting you. Sleep well to-night. At three to-morrow the diligence will start for Bukovina; a place on it is kept for you. At the Borgo Pass my carriage will await you and will bring you to me. I trust that your journey from London has been a happy one, and that you will enjoy your stay in my beautiful land.
"Your friend,
"DRACULA."
For this test you may leave the quotation marks as they are in the original and not remove them and replace them with markup.
You’re welcome to stop here, but if you’d like more practice: If we were undertaking this task in Real Life we would create a schema to validate our XML. Because the sort of regex-supported tagging we use is not XML-aware, it is possible to create XML that is not well formed and that is tagged inconsistently. <oXygen/> will always check XML for well-formedness, but we would use a schema to ensure that our markup consistently says what we want it say. If you’d like, then, create a schema that models what your document analysis tells you about the structure of the document and verify that your markup is valid against the schema.
Write up the process you follow in a markdown document (following our file-naming conventions and with an .md filename extension) as a sequence of steps. Inline code snippets must be surrounded in backticks because that’s how markdown formats code snippets, and, for the same reason, code blocks must be fenced, that is, surrounded by triple backticks on separate lines. You can remind yourself about markdown syntax at the GitHub three-minute guide to Mastering markdown that you read earlier. Your markdown must distinguish the individual steps clearly, and it must also format your code snippets and code blocks clearly and correctly. Be sure to check how it looks in the formatted view, and not just the editing view where you type.
In your list of steps be sure to indicate clearly when you use the Dot matches all option. If you don’t say explicitly that you’re using it for a particular step, we’ll assume that it is unchecked for that step. You might want to create a little template for each step in your process that includes 1) match pattern, 2) replacement, 3) whether or not you checked Dot matches all, and 4) a bit of prose that explains what each step does, as well as anything else that seems necessary to explain why you are doing what you’re doing at a particular moment.
Submit just the markdown document—not the XML it produces. We will follow your instructions step by step to replicate your method and recreate your tagged XML ourselves. If you create the optional bonus Relax NG schema, submit that, as well.
Feel free to ask questions in Slack (there are #regex and #markdown channels, and the #general channel for anything that doesn’t fit in any more specific one). We can’t tell you how to do something, but if you’re stuck on a particular detail we’ll be happy to do what we can to point you in the right direction without just telling you the answer.