Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2015-09-21T12:41:08+0000


Regex assignment #2

The task

Assume that we’ve been given a plain-text file like the Project Gutenberg EBook of The Blithedale Romance, by Nathaniel Hawthorne, and we want to convert it to XML, but we don’t want to type all of the angle brackets manually. (Note that this site sometimes shows you a pop-up welcome screen and a list of different versions of the file, instead of taking you to the plain text one directly. If that happens, click OK on the Welcome pop-up and then select the version labeled Plain Text UTF-8.) In this case Project Gutenberg makes the same book available in HTML, and in Real Life we’d probably convert from HTML to XML (using XSLT, which we’ll learn later in the semester) rather than from plain text, but since there are situations where all we have is plain text, we’ll ignore the HTML version on the Gutenberg site, pretend that all they provide is plain text, and work with that. So what can we tag automatically, with global find-and-replace operations? Some of the markup we might want to introduce for analytical purposes might require us to touch every word of the text, but, at a minimum, we can autotag chapters, chapter titles, paragraphs, and quotations using regex tools, and that’s the goal of the present assignment.

Preliminaries

Select the plain-text version of the document and open it in <oXygen/> as a plain text file. Then cut out the front matter (before the main title) and the back matter (after the last line of the text of the novel, which is I--I myself--was in love--with--Priscilla!). In Real Life we might want to mark those parts up eventually and reintroduce them into the XML as metadata, but for this assignment we’ll just delete everything that isn’t part of the text of the novel.

Step by step

There’s more than one way to accomplish this task, but one way to approach the problem is as follows:

Reserved characters

The plain text file could, at least in principle, contain characters that have special meaning in XML: the ampersand and the angle brackets. You need to search for those and replace them with their corresponding XML entities; if you don’t remember the entity strings, you can look them up in the Entities and numerical character references section of http://dh.obdurodon.org/what-is-xml.xhtml. Note that you need to process them in the correct order. What is that order, and why is it important?
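If you want to experiment with the ordering question outside <oXygen/>, the following is a minimal Python sketch of one possible sequence of replacements. Python is not part of this assignment, and the function name escape_reserved and the sample string are made up for illustration. Try swapping the order of the three replacements and compare the output.

```python
def escape_reserved(text: str) -> str:
    """Replace the XML reserved characters with entity references."""
    # The entity strings written by the later replacements contain
    # ampersands themselves, so the ampersand replacement runs first;
    # otherwise those new ampersands would be escaped a second time.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


sample = "a < b & c > d"  # hypothetical input, not from the novel
print(escape_reserved(sample))
```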

Extra blank lines

The blank lines are pseudo-markup that tell us where titles and paragraphs begin and end, but in some cases there are multiple blank lines in a row (for example, there are two blank lines between the title and the word by). Those extra blank lines don’t tell us anything useful, so we’ll start by getting rid of them. We want to retain one blank line between titles and paragraphs, etc., but not more than one.

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check the Regular expression box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet.

The regex escape code that matches a new line is \n, so you want to search for more than two of those in succession, and you want to replace them with exactly two. You can search for three blank lines and replace them with two and then keep repeating the process until there are no instances of three blank lines left, or, more elegantly and efficiently, you can search for \n{3,}, which matches three or more new line characters in succession (see the Limiting repetition section of http://www.regular-expressions.info/repeat.html) and replace them with \n\n (the quantifiers work only in matches, but not in replacements, so you have to write it this way).
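The same search and replacement can be tried on a small string in Python, whose re module understands the \n{3,} pattern. This is only a sketch: the sample text is a made-up stand-in for the top of the file, and in the assignment itself you would type the pattern into the <oXygen/> dialog instead.

```python
import re

# Hypothetical stand-in for the top of the file, with runs of
# three and four new line characters.
text = "TITLE\n\n\n\nby\n\n\nNathaniel Hawthorne\n\nFirst paragraph.\n"

# Three or more new lines in a row become exactly two, which leaves
# a single blank line between the pieces of text.
collapsed = re.sub(r"\n{3,}", "\n\n", text)

print(collapsed)
```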

Note that a transformation that searches for a sequence of two end-of-line characters depends on their being immediately adjacent to each other. If what looks like a blank line to you actually has (invisible) spaces or tabs, the pattern won’t match and the replacement won’t happen because there will be spaces or tabs between the end-of-line characters, which is to say that they won’t be adjacent. If you think that might be the case, you can make those characters visible by going into the <oXygen/> preferences (Preferences → Editor) and checking the boxes labeled Show TAB/NBSP/EOL/EOF marks and Show SPACE marks under Whitespaces. If you do have whitespace characters interfering with your ability to find a blank line (that is, two consecutive new line characters), you can use regex processing to replace them: the pattern \t matches a tab character, a space matches a space, and \s+ matches one or more white-space characters of any sort (including new lines). You can use the Find or Find all options in the find-and-replace dialog to explore the document and make sure that you’re matching what you want to match before you use Replace all to make the changes.
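One way to deal with invisible whitespace on an otherwise blank line is sketched below in Python. The character class [ \t]+ is our own choice for the illustration: unlike \s+, it matches spaces and tabs but not new lines, so the new line characters themselves survive. Note that it also trims spaces and tabs at the end of non-blank lines. The sample text is made up.

```python
import re

# The "blank" line in the middle actually contains a space and a tab.
text = "First paragraph.\n \t\nSecond paragraph.\n"

# Remove any run of spaces or tabs that sits immediately before a
# new line character, so that the two new lines become adjacent.
cleaned = re.sub(r"[ \t]+\n", "\n", text)

print(repr(cleaned))
```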

Paragraphs

What’s left after deleting the beginning and ending metadata and extra blank lines is mostly (except for the stuff at the top) a bunch of chapter titles and paragraphs, separated from one another by a single blank line, and we can use a regex to find all blank lines and replace them with the sequence </p><p>. XML doesn’t care about the following, but for human legibility, we’d suggest inserting a new line character between the tags, instead of just outputting the end tag followed immediately by the start tag, so that each paragraph will start on a new line. You’ll have to add the <p> start tag before the first paragraph and the </p> end tag after the last one manually, but you can enter all of the rest automatically with a single regex-aware find-and-replace operation. At this point the document looks like a bunch of <p> elements. Some may contain chapter titles, rather than paragraphs. We’ll fix that below. At the top of the file, the title, author, and list of chapter titles will need special handling. We’ll talk about those below, too.
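The blank-line replacement described above can be sketched in Python as follows. The paragraph text is placeholder text rather than Hawthorne's, and the last line of the code stands in for the manual step of adding the first start tag and the last end tag.

```python
import re

# Placeholder text: a chapter title and two paragraphs, each
# separated from the next by exactly one blank line.
text = "I. OLD MOODIE\n\nFirst paragraph.\n\nSecond paragraph,\nwrapped over two lines."

# Every blank line (two adjacent new lines) becomes an end tag, a
# new line for legibility, and a start tag.
tagged = re.sub(r"\n\n", "</p>\n<p>", text)

# The first start tag and last end tag are not produced by the
# replacement; in <oXygen/> you would type these by hand.
tagged = "<p>" + tagged + "</p>"

print(tagged)
```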

Chapter titles

The title of the first chapter within the body looks like:

<p>I. OLD MOODIE</p>

the second looks like:

<p>II. BLITHEDALE</p>

and we can see easily, from the list of chapter titles at the top, that there are twenty-nine chapter titles, each of which begins with a Roman numeral, then a period, and then a single space character, and each of which runs until the end of the line. No real textual paragraph looks like that, although some paragraphs could begin with the pronoun I, which looks like a Roman numeral, and some paragraphs might be only one line long. If we can write a regex that matches chapter titles and only chapter titles, then, we can replace the paragraph markup with title markup, retaining the part in the middle.

We’re not going to write that regex for you, but we will tell you the pieces we used. Try building a regex and running Find all to verify that it is matching all of the chapter titles and nothing else. When you can match what you need, then you can think about how to craft the replacement string. Here are the pieces:

Chapters

A book isn’t just a series of paragraphs with titles strewn among them; the book has logical chapters, which must begin with a title, and you want to represent this part of the logical document hierarchy in your markup by inserting <chapter> tags. Much as you used blank lines as milestone delimiters between paragraphs earlier, you can now use your <title> elements as delimiters between chapters. Use a find-and-replace operation to do this; you’ll have to clean up the markup before the first chapter and after the last one manually, since in those cases the <title> element doesn’t have the same milestone function as elsewhere.
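Here is a minimal Python sketch of the milestone idea, assuming the titles have already been tagged as <title> elements. The paragraph text is a placeholder, and the last two statements stand in for the manual cleanup before the first chapter and after the last one.

```python
import re

# Placeholder input: two titles, each followed by one paragraph.
text = (
    "<title>I. OLD MOODIE</title>\n<p>Paragraph one.</p>\n"
    "<title>II. BLITHEDALE</title>\n<p>Paragraph two.</p>"
)

# Each title start tag marks the end of one chapter and the
# beginning of the next.
marked = re.sub(r"<title>", "</chapter>\n<chapter>\n<title>", text)

# Manual cleanup: drop the stray end tag before the first chapter
# and add the missing end tag after the last one.
marked = marked[len("</chapter>\n"):]
marked = marked + "\n</chapter>"

print(marked)
```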

Quotes

How are quotations represented in the plain text? How would you find the text of a quotation, that is, how would you find where it starts, where it ends, and what goes between the start and the end? Files on the Internet sometimes have errors and inconsistencies; if you’re relying on cues in the text to identify the beginnings and ends of quotations, what can happen if you miss one?

If we assume that a quotation is text between opening and closing quotation marks (which are the same in this text, which has straight quotation marks, instead of the curly typographic ones where the opening and closing shapes are different), we have at least two concerns:

Let’s address the second problem first. There’s a line in the text that reads:

without further question, only," added she, "it would be a convenience

which represents the end of one quotation and the beginning of another. If we write ".+", the system will incorrectly think that the first quotation mark opens a quotation and the second closes one, and it will also fail to recognize that the material before and after that line is really part of a quotation. We can fix this by checking the Dot matches all box, which changes the meaning of the dot metacharacter from any character except a new line to any character including a new line. This means that we should be able to match quotations that cross line boundaries. Try it and notice the different results. Uh-oh!

So what went wrong? By default regular expressions are greedy, which means that they make the longest possible match. Turning on dot all mode causes the regex to match everything from the very first quotation mark in the entire text through the very last (since quotation mark characters are also characters, the dot in the regex ".+" matches the quotation marks between the first and last ones in the document, just like it matches any other character). Turning off dot all mode won’t fix this because some quotations do cross line boundaries, and we need to be able to recognize and match them.

We can resolve the problem by turning on dot-all mode (since we have to match quotations that span line breaks) but also specifying that the match should be non-greedy, that is, that we should make the shortest possible match (instead of the longest, which is the default), and we do this by following the repetition indicator (the plus sign) with a question mark. (Note that the question mark you met earlier is a repetition indicator that means zero or one instance of whatever it follows. Here it isn’t a repetition indicator, though; here it means don’t be greedy. So if the same symbol can have two such different meanings, how does a regex processor know which meaning to apply?) In other words, ".+?" will correctly treat two full quotations on the same line as separate quotations. Try it. You should now correctly be matching each quotation fully, regardless of whether it spans a new line character and regardless of the number of quotations on a line.
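The difference between the greedy and non-greedy versions can be seen on a small made-up sample in Python, where the re.DOTALL flag plays the role of the Dot matches all checkbox. This is a sketch for comparison only; the sample sentence is ours, not Hawthorne's.

```python
import re

# Made-up sample: the first quotation crosses a line break, and the
# second line contains the end of one quotation and all of another.
text = 'He said, "Good\nmorning," and she replied, "Hello."'

# Greedy with dot-all: one match, running from the first quotation
# mark in the text to the last.
greedy = re.findall(r'".+"', text, flags=re.DOTALL)

# Non-greedy with dot-all: each match stops at the nearest closing
# quotation mark.
lazy = re.findall(r'".+?"', text, flags=re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```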

Once you can do that, you can capture the text between the quotation marks and write it into the output between <quote> tags. Don’t include the quotation mark characters themselves in the capture group; those are plain-text pseudo-markup, and now that you’re going to be tagging quotations with real markup, you don’t want the quotation mark characters included.
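A capture group for the quotation text might look like the following Python sketch. Python writes the reference to the first capture group as \1 in the replacement string; other tools may use a different notation, so check how your own tool spells it. The sample sentence is made up.

```python
import re

text = 'He said, "Good\nmorning," and she replied, "Hello."'

# The parentheses capture only what is between the quotation marks;
# the marks themselves are outside the group, so they are dropped
# from the output.
tagged = re.sub(r'"(.+?)"', r"<quote>\1</quote>", text, flags=re.DOTALL)

print(tagged)
```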

Cleanup

At this point you can fix the title and author lines manually (we’d just delete the line that reads by, since the new <author> tags will make that explicit), as well as the table of contents, and you need to wrap the entire document in a root element (such as <book>). If you’d like a little more regex practice, instead of fixing the table of contents manually, you can use regex find-and-replace to tag it. If you select the table of contents and then open the find-and-replace dialog, you can check the Only selected lines box under Scope to say that instead of applying find-and-replace operations to the entire file, you’ll apply them only to the selected lines. You may want to start by stripping out the incorrect markup that your earlier global find-and-replace operations inserted into these lines, and of course you’ll want to do that with a regex that matches any tag and replaces it with nothing (that is, deletes it). Once you’ve done that, these lines look like title lines, except that they have space characters before them, and you can use a regex that matches one or more space characters to help match them. You can then capture each line (throwing away the leading white space by excluding it from the capture) and wrap it in <title> tags. You’ll want to get rid of the paragraph tags that are wrapping the whole table of contents, since it isn’t a paragraph, and replace them with something like <toc> (for table of contents).
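The table of contents cleanup can be sketched in Python as below. The sample is a simplified, hypothetical version of what the selected lines might look like after the earlier steps (the real layout may differ), and the patterns are one possibility among several. The re.MULTILINE flag makes ^ and $ match at the start and end of each line.

```python
import re

# Hypothetical selection: two indented contents lines that an
# earlier global replacement wrapped in paragraph tags.
toc = "<p> I. OLD MOODIE\n II. BLITHEDALE</p>"

# Delete every tag: an angle bracket, anything that is not a closing
# angle bracket, then the closing angle bracket.
stripped = re.sub(r"<[^>]+>", "", toc)

# Match the leading spaces outside the capture group so that they
# are thrown away, and wrap the rest of each line in title tags.
titled = re.sub(r"^ +(.+)$", r"<title>\1</title>", stripped, flags=re.MULTILINE)

# Wrap the whole list in a toc element (done by hand in <oXygen/>).
wrapped = "<toc>\n" + titled + "\n</toc>"

print(wrapped)
```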

Checking your results

Although you’ve added XML markup to the document, <oXygen/> remembers that you opened it as plain text, which means that you can’t check it for well-formedness. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows, Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar at the top and choose Check well-formedness.
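If you happen to have Python available, its standard library can also report whether a string is well formed. This is an optional cross-check rather than part of the assignment, and the function name is_well_formed and the sample strings are ours.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = "<book><chapter><title>I. OLD MOODIE</title><p>Text.</p></chapter></book>"
bad = "<book><p>Text.</book>"  # the p element is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```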

What to submit

We don’t need to see the XML that you produce as the output of your transformation because we’re going to recreate it ourselves anyway, but you do need to upload a step-by-step description of what you did. Your write-up can be brief and concise, but it should provide enough information to enable us to duplicate the procedure you followed.

If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As always, you’re encouraged to post any questions on the discussion boards, in this case in the regex forum.