Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2023-04-22T20:53:18+0000


Test #3: Regular expressions: answers

We can think of our markup for this particular document as having five phases. We begin with preparatory cleanup (Phase 1), after which we tag major structural components (Phase 2, blocks but not inline snippets). At that point we do a bit of manual cleanup (Interlude, Phase 3) and add a root element to create a well-formed document. We next tag italics (Phase 4) and, finally, clean up blank lines (Phase 5). Your solution does not have to be divided into phases and does not have to follow our order.

Phase 1: Preparation

Reserved characters

We search for ampersands, less-than signs, and greater-than signs in the original text, since if they were there we would have to replace them with character entities to keep them from looking like markup characters. There aren’t any, but we can’t know that without looking for them, so this is always the first thing we do with any up-conversion task.

Remove extraneous newlines

Match: \n{3,}

Replace with: \n\n

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion: We trim instances of multiple blank lines down to a single blank line. For example, initially there are four blank lines between the title of the novel, on the first line, and the label CHAPTER I, on the sixth. Our match pattern uses numerical quantification to match sequences of three or more newlines, but because numerical quantification works only in match patterns, and not in replacements, we have to write two literal newlines into the replacement. There are several alternatives ways to match two or more sequential newlines, so you’re welcome to use whatever you prefer. You don’t have to remove extra newlines at all, but we find that doing so makes the document easier to work with.

Phase 2: Structural markup

Tag paragraphs

Match: \n\n

Replace with: \n\n

]]>

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion: We use a blank line (two consecutive newlines) as a surrogate for the transition from one paragraph to the next, so we replace the two newlines with two newlines that have a paragraph end-tag before them and a paragraph start-tag after them. We then fix the first and last paragraphs manually since we’re matching transitions between paragraphs, and the first has no preceding paragraph and the last has no following paragraph. When we’re done with this step our document is a sequence of <p> elements with blank lines between them, and although it is well-balanced, it is not well-formed because it has no root element.

This method erroneously tags some non-paragraphs as paragraphs, most conspicuously the novel title and chapter labels (e.g., CHAPTER I) titles (e.g., JONATHAN HARKER'S JOURNAL, subtitles (e.g., (_Kept in shorthand._)). We find it simplest to let our pattern overgeneralize and then retag these items than to try to exclude them from the initial match pattern.

It is possible to match and replace the paragraphs themselves, rather than the blank lines between them, by using regex lookaround. This is more elegant than the approach we describe above because it does not require manual fix-up, but we nonetheless match the blank lines between paragraphs because we find that easier to understand.

Tag chapter headings

Match: (CHAPTER.+)

]]>

Replace with: \1]]>

Dot matches all status: Must be unchecked.

Discussion: We tag the chapter headings before the titles, subtitles, or the chapters as a whole because they are the easiest to match. Once we’ve tagged them distinctively, we can use our introduced <heading> tags to help find chapter beginnings. It’s fine if you chose to omit the heading as a separate element and instead used an attribute on your chapter tags, e.g., <chapter n="I">.

We’ve assumed that chapter headings, titles, and subtitles all occupy only a single line. A novel with very long chapter titles or subtitles might include newlines in these elements, and our pattern would fail to match in those cases. Whether you make the simplifying assumption that we do here or write a more robust pattern (which requires turning on Dot matches all and using non-greedy matching) is a matter of personal preference. In a short document where we can look at all the titles, such as here, we opt for the simpler pattern. If our corpus were so large that we couldn’t exclude the possibility of multi-line titles and we couldn’t realistically look at them all, we’d write the more robust regular expression.

Chapter titles

Match: \n\n)

(.+)

]]>

Replace with: \2]]>

Dot matches all status: Must be unchecked

Discussion: The first pseudo-paragraph after the <heading> element is the chapter title, so we match the </heading> end-tag, the two newlines that follow it, and then the <p> element after the newlines, which represents the chapter title. We capture everything from that match except the <p> tags, which we replace with <title> tags. We used the element name <title> instead of something like <chapter-title> because we’re going to tag the chapters, so we’ll be able to distinguish titles of chapters from the title of the novel (which we’ll also tag as <title>) by the context.

Chapter subtitles

Match: \n\n)

(\(_.+_\))

]]>

Replace with: \2]]>

Dot matches all status: Must be unchecked

Discussion: The first chapter has what we consider a subtitle and the second one doesn’t. To distinguish them we hypothesize that if a chapter title is followed by a 1) one-line paragraph that is 2) bracketed by parentheses and 3) italicized (represented by underscores inside the parentheses) it represents a subtitle. If the paragraph that follows the title is longer than one line or lacks the parentheses or lacks the italics, we assume that it’s a regular chapter paragraph, and not a subtitle.

Chapters

Match: ]]>

Replace with: \n\n\n\n\0]]>

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion: After running the find-and-replace we have to remove manually the spurious ]]> end-tag before the first chapter and we have to add the missing ]]> end-tag after the last chapter.

Chapters are tesselated, which means that each one starts where the last one ends, as was also the case with paragraphs. We can use that information to find the start position of the chapters by using the ]]> elements that we introduced and assuming that the start of any chapter marks the end of the preceding one. As with paragraphs, that assumption will fail for the first and last chapters because the first doesn’t follow any chapter, so we’ll have inserted a spurious ]]> end-tag that we’ll then need to remove, and the second doesn’t have a following chapter, so we’ll incorrectly have failed to insert its end-tag, which we’ll have to add manually. With only two chapters, both of which require manual fix-up, it would be reasonable to tag them both manually, but if we were tagging the entire novel the amount of correct tagging would be much greater than the two extremes that we need to adjust, so the regex-supported approach, even with the clean-up it requires, would simplify the job.

We sometimes refer to the approach that we adopt for our tesselated chapters and paragraphs as a milestone method. Milestones are single numbered mile markers on the highway (nowadays metal signs, but once upon a time stone pillars) that record simultaneous, with one statement, the beginning of a new mile (or mile segment) and the end of the preceding one. Similarly we assume that the transitions between paragraphs or between chapters are milestones that mark, at the same time, the end of the preceding element and the beginning of the new one.

Because milestones signify transitions, the very beginning and very end require special treatment, since the start of the first paragraph can’t also mark the end of the preceding paragraph because there isn’t one, etc. The milestone method isn’t the only way to tag tesselated structures (see our treatment of dated diary entries, below), but where we find it easier to think of finding the transition between items than of finding the items themselves, we have to remember that the first and last items will typically require manual fix-up.

Dated diary entries: start-tags

Match: _(\d.+)_--]]>

Replace with: \n\n\1\n\n

]]>

Dot matches all status: Must be unchecked

Discussion: Diary entries consist of multiple paragraphs. The first paragraph of an entry begins with a date and place surrounded by underscores (which represent italics) and followed by two hyphens. That brief date and place are not the entry (they are just the heading for an entry), and neither is the single paragraph that begins with the date and place (that is just the first paragraph of a multi-paragraph entry). Because the dated diary entry consists of multiple paragraphs, we need to wrap our diary-entry tags around all of them.

To enter the start-tag for all diary entries we extract the entry heading from the paragraph where it occurs, which must be the first paragraph of a diary entry. We tag the extracted date and place as a <date-place>, and add a start-tag for a <dated-entry> before it (not forgetting to restore the paragraph start-tag, which now comes after the new <date-place> element). We remove the double hyphen since it was present only to separate the date and place from the actual entry, and now that distinguish it with markup, we no longer need the pseudo-markup characters. We add the end-tags for diary entries in a separate step. We’ve chosen kebab case (hyphenation) for this element name for a reason, about which we’ll say more below. We make the same simplifying assumption about length here as with chapter headings, title, and subtitles, that is, we assume that all of these diary entry date and place records do not cross a newline.

Dates and places (optional)

Tagging the date and place reference at the beginning of a dated entry as just a single element is fine, but in Real Life we would tag the date and the (optional) place inside, as well. Because only two of the six entries contain a place after the date, and the rest contain only the date, we find it easier to tag the dates and places separately:

Dates

Match: (.+?\.)(.*)]]>

Replace with: \1\2]]>

Dot matches all status: Must be unchecked

Discussion: The date part of a <date-place> element will always be present, and consists of the content up to and including the first dot character. We match and wrap that in <date> tags. There may or may not be text, representing the place, after the date portion, and for now we just capture it and write it back as is, without markup.

Places

Match: (.+?)]]>

Replace with: \1]]>

Dot matches all status: Must be unchecked

Discussion: If there are any characters between the </date> end-tag that we created in the last step and the </date-place> end-tag, those must be the place, which we capture and wrap in <place> tags. If there is no text in that position, the match fails.

Dated diary entries: end-tags at the ends of chapters

Match: ]]>

Replace with: \n\n\0]]>

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion:

The end of a diary entry is marked in one of two ways. If the diary entry is at the end of the chapter (two occurrences, since we have two chapters) there is no special mark, and the end of the chapter implicitly ends the diary entry, and we’ll add those end-tags here first. We discuss the second type in the next step.

There are, then, two related aspects to tagging diary entries: 1) wrap the diary-entry tags around all of the paragraphs that together constitute the diary entry and 2) tag the label distinctively. We chose to remove the label from the first paragraph and treat it as a structural component of the entry that precedes all of the paragraphs, that is, that is not contained within the first paragraph (even though it is rendered that way in the plain text).

Dated diary entries: end-tags not at the end of chapters

Match: \n\n)()]]>

Replace with: \n\n\2]]>

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion: If the diary entry is preceded within the same chapter by another diary entry, it is preceded by a paragraph (four occurrences). On three of those four occurrences the preceding paragraph is a single line that contains only asterisks and space characters, but we don’t need to care about that because any paragraph that immediately precedes the beginning of a new diary entry is necessarily the last paragraph of the preceding entry.

Asterisks

Match: [* ]+

\n\n]]>

Replace with: [nothing]

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern)

Discussion: The pseudo-paragraphs that contain just asterisks and space characters are pseudo-markup that signals a separation of dated diary entries (although not consistently). Now that we have tagged the diary entries these lines carry no information that we need, so we delete them.

Phase 3: Interlude

By this point we have tagged all of the major structural components of the text except the novel title at the top, and we also do not yet have a root element. We fix that by changing the element name for the title at the top (currently DRACULA

]]>) to ]]> manually, and also by adding a root element ]]> around the entire content.

At this point our document is well-formed XML. If you have been working inside a document that you created as a new XML document, <oXygen/> knows that it’s XML and should show you a happy green square. If you created a plain text document to work in, at this point you need to save it as XML (File → Save as and give it a filename that ends in .xml) and then close and reopen it. Just saving it as XML isn’t enough for <oXygen/> to recognize that it is now an XML document that can be checked for well-formedness; you need to close and reopen it.

Once you’re looking at a well-formed XML document you should pretty-print it to improve the legibility. None of the steps below should interfere with well-formedness, so your document should remain well formed for the rest of the processing.

Phase 4: Italics

A regular-expression search for with Dot matches all checked reveals that there are eight instances of italics in this document, represented by underscores that surround small amounts of content. These instances have three different types of meaning: two are in a foreign language (boyar, Quatre Face), four are memorandum notations (Mem.), and two are parts of titles or subtitles and are italicized for no clear reason. Quatre Face has a newline between the two words, and we would fail to find it if we were to search for the simpler pattern _.+_ with Dot matches all unchecked. That is, we can’t safely assume, as we did with other structures earlier, that no italicized phrase will cross a newline boundary.

We can’t use regular expressions to find foreign words, which we’d like to tag as ]]>, but we can find the other two types with regular expressions, tag them as simply ]]> (presentational, because we cannot impute a meaning to the italics in these cases), and then tag whatever is left as ]]>.

Remember when we said that we used the element name <dated-entry> (kebab case, with a hyphen) for a reason? We could safely have used camel case (<datedEntry>), but the point is that we wanted to avoid snake case (<dated_entry>) because introducing underscores would have complicated searching for italics, which were represented in the original input document by underscores. We could have worked around the issue because the <oXygen/> find-and-replace dialog would let us exclude characters in element names from our search (check the Enable XML search options at the bottom of the dialog to see the interface), but since we didn’t have to create markup that would require us to fine-tune our search for italics in that way, we decided that it was simplest just to avoid introducing our own new underscores.

Memoranda

Match:

Replace with: Mem.]]>

Dot matches all status: Unchecked (doesn’t matter, since there is no dot in the match pattern except the literal, escaped one)

Discussion: We escape the dot in the match pattern because we want to match a literal dot, and without the backslash that part of the pattern would match any character except a newline. In practice it doesn’t matter because we know that the document contains no four-letter italic words with the shape MenX, where X stands for any character except a newline, except when X is a literal dot. It’s nonetheless useful to get in the habit of distinguishing literal dots from the dot that is a metacharacter.

Titles and subtitles

Match: (.*?)_(.+?)_(.*?)]]>

Replace with: \2\3\4]]>

Dot matches all status: Must be unchecked

Discussion: We know that the title and subtitle that contain italics don’t include any newline characters, so we uncheck Dot matches all to ensure that that we don’t accidentally overrun the end of a line. We chose to match the title and the subtitle with the same regex, but there’s nothing wrong with using separate expressions if you find that easier to work with. Our regular expression as the following parts, in order:

Notes: Our pattern will fail to tag any italic phrase inside a title or subtitle except the first one, but since we know that there can be at most one in this document, that isn’t an issue. Any title or subtitle that doesn’t contain any underscores won’t match the pattern, so it will be ignored.

Foreign words

Match:

Replace with: \1]]>

Dot matches all status: Must be checked in case foreign phrases span newlines, and that happens in our document with Quatre Face.

Discussion: The only remaining paired underscores are around the two foreign expressions, so we can run a relatively simple search for text between underscores, capture it, and replace it with itself, throwing away the underscores and wrapping it in ]]> tags instead.

Phase 5: Blank lines

The blank lines served a structural purpose in the original plain-text source and we used them to control some of our tagging, but now that our document is fully tagged we no longer need them. You can leave them if you find the XML more legible with them, but if you want to delete them, you can search for two consecutive newlines ( and replace them with a single newline.

Optional, extra-credit schema

The following schema matches the markup we’ve added to the document:

 chapter -> dated-entry -> p
novel = element novel { title, chapter+ }
chapter = element chapter { heading, title, subtitle?, dated-entry+ }
dated-entry = element dated-entry { date, p+ }
p = element p { content }
# 
# Titles and subtitles may have mixed content, like paragraphs
title = element title { content }
subtitle = element subtitle { content }
#
# Dates, italic, foreign, and heading elements have text content
date = element date { text }
italic = element italic { text }
foreign = element foreign { text }
heading = element heading { text }]]>

We’ve generalized a bit and assumed that elements that can contain italics or foreign phrases might (in other chapters) contain any combination of those two element types. For that reason we define a named pattern with the label content so that we can reuse it in multiple places without having to retype it. In that way, should we decide to add another inline element we could add it to the schema in one place and it would be available wherever we’ve used the pattern.