Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-01-31T21:44:28+0000


Regex assignment #2: answers

The task

Assume that we’ve been given a plain-text file like the Project Gutenberg EBook of The Blithedale Romance, by Nathaniel Hawthorne, and we want to convert it to XML, but we don’t want to type all of the angle brackets manually. (Note that this site sometimes shows you a pop-up welcome screen and a list of different versions of the file, instead of taking you to the plain text one directly. If that happens, click OK on the Welcome pop-up and then select the version labeled Plain Text UTF-8.) In this case Project Gutenberg makes the same book available in HTML, and in Real Life we’d probably convert from HTML to XML (using XSLT, which we’ll learn later in the semester) rather than from plain text, but since there are situations where all we have is plain text, we’ll ignore the HTML version on the Gutenberg site, pretend that all they provide is plain text, and work with that. So what can we tag automatically, with global find-and-replace operations? Some of the markup we might want to introduce for analytical purposes might require us to touch every word of the text, but, at a minimum, we can autotag chapters, chapter titles, paragraphs, and quotations using regex tools, and that’s the goal of the present assignment.

Preliminaries

Select the plain-text version of the document and open it in <oXygen/> as a plain text file. Then cut out the front matter (before the main title) and the back matter (after the last line of the text of the novel, which is I--I myself--was in love--with--Priscilla!). In Real Life we might want to mark those parts up eventually and reintroduce them into the XML as metadata, but for this assignment we’ll just delete everything that isn’t part of the text of the novel.

Step by step

There’s more than one way to accomplish this task, but one way to approach the problem is as follows:

Reserved characters

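The plain text file could, at least in principle, contain characters that have special meaning in XML: the ampersand and the angle brackets. You need to search for those and replace them with their corresponding XML entities (&amp; for the ampersand, &lt; and &gt; for the angle brackets), starting with the ampersand so that the ampersands in the new entities aren’t themselves replaced. As it turns out, there aren’t any instances of these characters in this document.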
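The same replacement can be scripted outside <oXygen/>. The following is a minimal sketch in Python, which is not the tool this assignment uses; the function name escape_reserved and the sample string are invented for illustration.

```python
def escape_reserved(text: str) -> str:
    """Replace the XML reserved characters with their entities."""
    # The ampersand must go first; otherwise the "&" inside "&lt;"
    # and "&gt;" would be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


# Invented sample line, not taken from the novel
sample = "Zenobia & Priscilla <aside>"
print(escape_reserved(sample))
```

Python's standard library also offers xml.sax.saxutils.escape, which performs the same three replacements.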

Extra blank lines

The blank lines are pseudo-markup that tells us where titles and paragraphs begin and end, but in some cases there are multiple blank lines in a row (for example, there are two blank lines between the title and the word by). Those extra blank lines don’t tell us anything useful, so we’ll start by getting rid of them. We search for \n{3,} (three or more new line characters) and replace them with \n\n (exactly two new line characters). Note that the numerical quantifiers work in searches but not replaces, so you need to write \n\n (rather than \n{2}) in the replace box.
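As a rough illustration of this step, here is a sketch using Python's re module rather than the <oXygen/> dialog. It assumes the file uses plain \n line endings; the function name and the sample string are made up for the example.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce any run of three or more new lines to exactly two."""
    # \n{3,} is the search pattern; the replacement is written out
    # literally because quantifiers have no meaning in a replacement.
    return re.sub(r"\n{3,}", "\n\n", text)


# Two blank lines (three new line characters) between the title and "by"
sample = "THE BLITHEDALE ROMANCE\n\n\nby\n\nNathaniel Hawthorne"
print(repr(collapse_blank_lines(sample)))
```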

Paragraphs

What’s left after deleting the beginning and ending metadata and extra blank lines is mostly (except for the stuff at the top) a bunch of chapter titles and paragraphs, separated from one another by a single blank line, and we can use a regex to find all blank lines and replace them with the sequence </p><p>. For human legibility, we insert a new line character between the tags, instead of just outputting the end tag followed immediately by the start tag, so that each paragraph will start on a new line. We do this by searching for \n{2} and replacing it with </p>\n<p>. We then add the <p> start tag before the first paragraph and the </p> end tag after the last one manually.
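The paragraph step can be sketched the same way. This Python version is an illustration under assumptions, not the procedure used in <oXygen/>: it presumes the extra blank lines are already gone, and it automates the otherwise manual step of adding the outermost <p> and </p> tags. The function name and sample text are hypothetical.

```python
import re


def tag_paragraphs(text: str) -> str:
    """Turn each blank line into a paragraph boundary."""
    # Drop leading and trailing new lines so they do not create
    # empty paragraphs at the edges of the text.
    body = text.strip("\n")
    # A blank line is two consecutive new line characters.
    body = re.sub(r"\n{2}", "</p>\n<p>", body)
    # The first start tag and the last end tag have no blank line
    # to replace, so they are added separately.
    return "<p>" + body + "</p>"


sample = "I. OLD MOODIE\n\nFirst paragraph,\nsecond line.\n\nSecond paragraph."
print(tag_paragraphs(sample))
```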

Chapters

We’ve erroneously tagged chapter titles as paragraphs, and we can fix that by searching for <p>([IVX]+\..*)</p> and replacing it with <title>\1</title>. The parentheses in the regex capture the information between them, which is everything except the <p> tags, and we can write the captured content into the new <title> tags by using \1 to retrieve the first (and, in this case, only) capture group. The <p> tags serve to anchor the search so that it has to match everything between the start and end tag. In our pattern, the first thing after the start tag is a character class consisting of one or more instances of I, V, or X, that is, a Roman numeral. To avoid matching one-line paragraphs that begin with the pronoun I, we require the Roman numeral to be followed by a literal period, and since the dot in regex means any character except a new line, we have to precede it with a backslash so that it will lose that special meaning and match only a literal dot. The second dot, though, does mean any character except a new line, and the asterisk tells the regex processor to include in the match all characters on the line after the literal period following the Roman numeral and up to the closing </p> tag.
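To see the pattern at work outside the Find-and-replace dialog, here is a small Python sketch. Python's re.sub accepts \1 in the replacement string, and by default its dot does not match a new line. The function name and the sample lines are invented for illustration.

```python
import re

# <p>, a Roman numeral, a literal period, the rest of the line, </p>
TITLE = re.compile(r"<p>([IVX]+\..*)</p>")


def retag_titles(text: str) -> str:
    """Rewrite one-line paragraphs that start with a Roman numeral and a period."""
    return TITLE.sub(r"<title>\1</title>", text)


sample = "<p>I. OLD MOODIE</p>\n<p>I went to bed.</p>\n<p>II. BLITHEDALE</p>"
print(retag_titles(sample))
```

The middle line is left alone because the pronoun I is followed by a space rather than a period.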

If all we do here is retag the chapter titles correctly, we wind up with a sequence of <title> and <p> elements, but nothing that records that the chapters themselves are a hierarchical level in the structure of the document. That is, the body of the novel does not contain just a combination of titles and paragraphs; it contains chapters, and it’s the chapters that contain the titles and paragraphs. We can fix that by searching for our <title> start tags (since we know that they’re at the beginning of each chapter) and replacing each one with a </chapter> end tag, a new line, a <chapter> start tag, another new line, and the <title> start tag itself (so that we don’t lose it), which gives us a structure like the following for each chapter:

<chapter>
    <title><!-- title goes here --></title>
    <p><!-- text of first paragraph goes here --></p>
    <!-- more paragraphs -->
</chapter>

We then have to fix the first and last chapters manually, removing the spurious </chapter> end tag from before the first chapter and adding the missing </chapter> end tag after the last chapter. This is similar to the way we tagged the sonnets in the first assignment, where we used the sonnet numbers as milestones to mark the boundary between a preceding and a following unit, and whenever we use that sort of strategy, we have to fix the first and last units manually because there is no between in their case. In this case it’s the titles that function as the milestones, since the presence of a title signals the end of the preceding chapter and the beginning of a new one.
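The milestone strategy, including the two manual repairs, can be sketched in Python as follows. This is an illustration under assumptions: it presumes the text contains at least one <title> start tag, and it writes the <title> start tag back into the replacement so that the tag is not lost. The function name and sample are hypothetical.

```python
def wrap_chapters(text: str) -> str:
    """Use each <title> start tag as a milestone between chapters."""
    # Close the preceding chapter, open a new one, keep the title tag.
    wrapped = text.replace("<title>", "</chapter>\n<chapter>\n<title>")
    # Nothing precedes the first chapter, so the first end tag is
    # spurious; remove only that first occurrence.
    wrapped = wrapped.replace("</chapter>\n", "", 1)
    # Nothing follows the last chapter, so its end tag is missing.
    return wrapped + "\n</chapter>"


sample = (
    "<title>I. OLD MOODIE</title>\n<p>First.</p>\n"
    "<title>II. BLITHEDALE</title>\n<p>Second.</p>"
)
print(wrap_chapters(sample))
```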

Quotes

We have to check Dot matches all in the Find-and-replace dialog in order to capture matches that might cross new line characters. If we do that and then try ".+", it looks as if it should match a quotation mark and then one or more characters, up to the next quotation mark. That doesn’t do what we want, though, because by default regex matches are greedy, that is, they select the longest possible match, so we wind up matching everything from the very first quotation mark in the entire document up to the last, as if there were only one very long quotation in the novel.

Here’s how greediness works: If you match a pattern like ".+", you’re asking to match one or more consecutive characters between quotation marks, and if there are multiple matches that conform to the pattern, the regex processor chooses the longest one. Since quotation marks are characters, too, if your text reads He said "hello" and she said "good bye", the greedy match ".+" will start at the quotation mark before hello, but when it has to decide which of the three following quotation marks should delimit the end of the quotation, it chooses the one after good bye, as if there were one long quotation instead of two short ones. Repetition indicators are greedy by default, but you can make one non-greedy by putting a question mark immediately after it, so that ".+?" will match each quotation individually.
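The difference is easy to verify with the example sentence. The following Python sketch (Python's re module standing in for the <oXygen/> dialog) runs both patterns against it:

```python
import re

text = 'He said "hello" and she said "good bye"'

# Greedy: .+ runs to the last quotation mark on the line
greedy = re.findall(r'".+"', text)
# Non-greedy: .+? stops at the first quotation mark that allows a match
lazy = re.findall(r'".+?"', text)

print(greedy)  # ['"hello" and she said "good bye"']
print(lazy)    # ['"hello"', '"good bye"']
```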

In other contexts a question mark is a repetition indicator that means zero or one (that is, optional), but here it means non-greedy. How does a regex parser know when a question mark means zero or one and when it means non-greedy? The answer is that it only means non-greedy when it occurs immediately after another repetition indicator, such as a plus sign or an asterisk. That’s the only context in which non-greediness is an issue (it has to know where to stop if there are multiple possible matches, of different lengths), and it wouldn’t make sense to put a repetition indicator after another repetition indicator, so it can’t mean zero or one there. The technical term for using the same symbol with different meanings in different contexts is overloading. It’s an efficient way to avoid reserving more symbols than you need for special functions, but it can be confusing for humans until we realize that we need to look not only at the symbol, but also at the context. It might make more sense if you remember that normal English orthography has overloading, too; a single curly apostrophe can function as an apostrophe (in contractions like don’t, for example), but it can also function as a closing quotation mark when there’s a single opening curly quotation mark earlier (e.g.: He said ‘XML stands for extensible markup language’).
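The two meanings of the question mark can be put side by side in a short Python sketch. The sample strings are invented, and the word pair color and colour is just a convenient illustration of an optional character.

```python
import re

# After an ordinary character, "?" means zero or one: the u is optional
optional = re.findall(r"colou?r", "color colour")

# After a repetition indicator (here the asterisk), "?" means non-greedy
non_greedy = re.findall(r'".*?"', '"" and "a"')

# The same pattern without the question mark is greedy
greedy = re.findall(r'".*"', '"" and "a"')

print(optional)    # ['color', 'colour']
print(non_greedy)  # ['""', '"a"']
print(greedy)      # ['"" and "a"']
```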

When we tag the quotation by writing it between <q> tags, we want to remove and discard the original quotation marks, since those were only pseudo-markup, used to represent the beginning and end of a quotation in plain text, and in the XML version we’re replacing them with real markup. We do that by using parentheses to capture only the content between the quotation marks, and we then write that capture group into the output. Our final search regex is "(.+?)" and we replace it with <q>\1</q>.
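A Python sketch of the final quotation step follows. Here the re.DOTALL flag plays the role of the Dot matches all checkbox; the sketch assumes that straight double quotation marks always come in matched pairs, and the function name and sample text are invented.

```python
import re

# Non-greedy capture of everything between two quotation marks;
# DOTALL lets the dot cross new line characters.
QUOTE = re.compile(r'"(.+?)"', re.DOTALL)


def tag_quotes(text: str) -> str:
    """Replace each pair of quotation marks with <q> tags."""
    # \1 writes back only the captured content, so the original
    # quotation marks are discarded.
    return QUOTE.sub(r"<q>\1</q>", text)


sample = '<p>"A quotation that\ncrosses a line," she said, "and another."</p>'
print(tag_quotes(sample))
```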

Cleanup and checking the results

At this point we clean up the results by fixing the starting material (title, author, table of contents) by hand and adding a root element. We then save the file as XML (with a .xml filename extension), reopen it in <oXygen/>, and check it for well-formedness. Once it’s well formed, you can also pretty-print it to improve the indentation.
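The well-formedness check can also be run from a script. This is a minimal sketch using Python's standard xml.etree.ElementTree parser instead of <oXygen/>; the function name and the root element name novel are hypothetical choices for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Report whether the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # Raised for problems such as mismatched or unclosed tags
        return False


good = "<novel><chapter><title>I.</title><p>Text</p></chapter></novel>"
bad = "<novel><chapter><p>Text</chapter></novel>"
print(is_well_formed(good), is_well_formed(bad))
```

A check like this reports only well-formedness; it says nothing about whether the markup follows a particular schema.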