Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-09-21T01:08:24+0000


Regex assignment #1: answers

The assignment

Produce an XML version from the plain-text version of Shakespeare’s sonnets by using the types of techniques we discussed in class. There are many different ways to approach this task, so if you did it differently than described below, your solution may well have been just as good as the one outlined here. Consider this answer sheet as a guide to how one might approach this type of task, rather than as The One True Answer.

Manually delete Gutenberg-specific header and footer information

The first step, as stated in the instructions, is to remove the excess metadata content at the beginning and end of the Gutenberg text. Manually highlight and remove lines 1 through 289, making the very first sonnet’s number, Roman numeral I, the first line of the document. Then, scroll to the end of the document and delete everything from line 2618 through the end (line 2635).

Replace reserved characters with XML entities

The next step we need to take when transforming raw text into XML is to accommodate special characters, such as ampersands and angle brackets, that, if retained unchanged, would interfere with the markup we need to insert into our document. In this document there aren’t any, so you don’t have to do any of the following, but you should get in the habit of always checking for them first.

Although you don’t have to replace any reserved characters in this particular text, you nonetheless need to know how to do it. Since the replacement strings we’ll be inserting all begin with an ampersand, we need to replace any existing ampersands first. If we were to replace ampersands after inserting other entities, we would also wind up replacing all of the ampersands we had inserted ourselves, which we don’t want to do, since those really are markup. In Real Life, we do this at the beginning of every up-translation, that is, any transformation from plain text to XML. If there aren’t any reserved characters in the input text, no harm is done, since the global find/replace operation finds nothing to replace and therefore changes nothing. If there are, though, this will get any existing reserved characters out of the way, so that we can begin inserting markup.

To replace reserved characters with their entity representations, open your find/replace dialog window in <oXygen/> (using Ctrl-f on Windows or Command-f on Mac) and enter & as the text to find, replacing it with its entity, &amp;. We do the same for left and right angle brackets: find < and replace it with &lt;, then find > and replace it with &gt;. If there aren’t any, so much the better. You’re searching for plain text characters here, so you don’t need to check the regular expressions box.
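Outside <oXygen/>, the same three replacements can be sketched in Python. This is only an illustration of why the order matters, not part of the assignment; the function names and the sample string are made up for the example.

```python
def escape_reserved(text: str) -> str:
    """Replace XML reserved characters with entities, ampersand first."""
    text = text.replace("&", "&amp;")  # must come first
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_wrong_order(text: str) -> str:
    """Same replacements, but with the ampersand last (shown only as a warning)."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("&", "&amp;")  # also hits the ampersands we just inserted
    return text


sample = "Fish & chips < 5 > 3"  # hypothetical input, not from the sonnets
print(escape_reserved(sample))     # Fish &amp; chips &lt; 5 &gt; 3
print(escape_wrong_order(sample))  # Fish &amp; chips &amp;lt; 5 &amp;gt; 3
```

The second function shows the problem described above: replacing ampersands last damages the entities that were just inserted.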

Get rid of extraneous white space

The plain text edition uses white space (blank lines, leading spaces at the beginning of a line) to demarcate parts of the edition, but since we’re going to use real markup in our XML edition, we want to get rid of any white space we don’t need. In some up-translations we might need to use the white space as part of the regex we try to match, but since we can find what we want to find in this particular edition without that white space, we get rid of it before introducing any markup.

To do that, in the find/replace dialog we check the box to turn on regular expression matching and replace \n+ (sequences of one or more new-line characters) with \n (just a single new-line character). This collapses multiple line breaks (that is, blank lines) all at once. You could, alternatively, search for just \n\n, and that will work here because there are no sequences of more than one blank line between lines of text. If there were, though, a single pass with \n\n would leave some blank lines behind, while the plus sign will match any number of consecutive new-line characters.

To get rid of leading spaces (two of them before every line, except that there are four before each line in the final couplets), we do a find/replace operation to replace ^ + with nothing. Note the leading caret; this regular expression finds sequences of one or more space characters only at the beginning of a line and replaces them with nothing, that is, erases them. Without the caret, we would delete spaces between words, and we don’t want to do that!
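For comparison, here is a minimal sketch of the same two clean-up steps in Python's re module. The function name and the three-line sample are invented for illustration. One difference from the <oXygen/> dialog is an assumption worth flagging: in Python the caret matches at the start of every line only when the re.MULTILINE flag is set.

```python
import re


def strip_extra_whitespace(text: str) -> str:
    """Collapse blank lines and delete leading spaces."""
    # One or more consecutive new-lines become a single new-line.
    text = re.sub(r"\n+", "\n", text)
    # re.MULTILINE makes ^ match at the start of each line, not just the string.
    text = re.sub(r"^ +", "", text, flags=re.MULTILINE)
    return text


# Hypothetical miniature input imitating the layout of the plain-text edition.
sample = (
    "  II\n"
    "\n"
    "  When forty winters shall besiege thy brow,\n"
    "    And see thy blood warm when thou feel'st it cold.\n"
)
print(strip_extra_whitespace(sample))
```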

Our file at this point looks roughly like:

II
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tatter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.
III
Look in thy glass and tell the face thou viewest
Now is the time that face should form another;
Whose fresh repair if now thou not renewest,
Thou dost beguile the world, unbless some mother.
For where is she so fair whose unear'd womb
Disdains the tillage of thy husbandry?
Or who is he so fond will be the tomb,
Of his self-love to stop posterity?
Thou art thy mother's glass and she in thee
Calls back the lovely April of her prime;
So thou through windows of thine age shalt see,
Despite of wrinkles this thy golden time.
But if thou live, remember'd not to be,
Die single and thine image dies with thee.

Note that there are no blank lines and no leading white space at the beginning of any line.

Where we want to wind up

Eventually we want to have each sonnet wrapped in <sonnet> tags and each line of a sonnet wrapped in <line> tags. We want to retain the Roman numerals, but they aren’t lines of poetry, so by the time we’re done we don’t want them to be encoded as <line> elements. We’ll say more below about how we do want them to look at the end of the conversion process. As a first step, though, it’s convenient to tag every line of text in the plain-text edition as a <line> element, whether it’s a line of poetry or the Roman numeral that identifies the sonnet. We’ll later change the markup for the Roman numerals, but for now it’s simplest to treat all lines identically.

Matching every line of the input file

We can wrap every line inside start and end <line> tags with a global search and replace operation, which means that we need to use a regular expression to say match each and every line, no matter what it contains. The regular expression we use for this purpose is .+. Here we exploit the fact that the dot (.) matches any character except a new line, which means that first we’ll match every character in the first line up to the (invisible) new-line at the end. After we’ve processed that, we’ll do the same with the next line, etc. We need to specify that we want to allow any number of characters, and so we use the plus sign (+) occurrence indicator to say match one or more instances of whatever precedes the plus sign, that is, match any sequence of one or more characters, but stop when you reach a new-line.

Note that you shouldn’t have Dot matches all selected. By default, the dot matches any character except a new line. If you select Dot matches all, dot will also match a new line. If you check this box, your match will span multiple lines, which isn’t what you want—in fact, you’ll match the entire document as if it were just one line! What you want is to match all of the characters between the beginning of a line and the end of the same line.
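The effect of Dot matches all can be reproduced in Python, where the corresponding switch is the re.DOTALL flag. The sketch below uses a three-line sample taken from the text above and simply counts matches.

```python
import re

sample = (
    "II\n"
    "When forty winters shall besiege thy brow,\n"
    "And dig deep trenches in thy beauty's field,"
)

# Default behavior: the dot stops at each new-line, so every line is its own match.
per_line = re.findall(r".+", sample)

# With re.DOTALL the dot also matches new-lines, so the whole text is one match.
whole_text = re.findall(r".+", sample, flags=re.DOTALL)

print(len(per_line))    # 3
print(len(whole_text))  # 1
```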

The replacement string

Before running a global search and replace we always click Find All in <oXygen/> to verify that we’re matching what we think we’re matching. Once you’ve done that and verified that you are matching each line, you have to decide what to use to replace the line. In this case, we want to replace each line with itself, plus surrounding <line> tags. In the find/replace dialog box, you can capture all or part of the match and reuse it in the replacement pattern, and that’s the easiest way to say replace the contents of the line with itself, but wrap that self in <line> tags. The string \0 (backslash followed by the digit zero) in the replacement pattern automatically inserts the entire matched string into the replacement. Since we’re capturing the entire line in this case, we could use <line>\0</line> as our replacement string, and it would do what we want.
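Replacement syntax varies between regex tools. As a hedged illustration in Python rather than <oXygen/>: Python's re module does not accept \0 for the whole match in a replacement string and spells it \g<0> instead. The function name below is made up for the example.

```python
import re


def wrap_lines(text: str) -> str:
    """Wrap every non-empty line in <line> tags."""
    # \g<0> is Python's way of writing "the entire match" in a replacement.
    return re.sub(r".+", r"<line>\g<0></line>", text)


sample = "II\nWhen forty winters shall besiege thy brow,"
print(wrap_lines(sample))
# <line>II</line>
# <line>When forty winters shall besiege thy brow,</line>
```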

What if we want to capture only part of a match? In that case we can put the part that we want to capture in parentheses and refer to it in the replacement string with a backslash followed by a number greater than zero. For example, \1 at a particular location in the replacement string would insert into that location whatever was matched by the first parenthesized expression (the first captured part of the match), \2 the second parenthesized (captured) sub-expression, etc., up to as many as we need. Note that parentheses in the match expression are not actually matched, that is, the line doesn’t have to include literal parenthesis characters for the match to succeed. The parentheses are metacharacters that serve to capture part of the match, so that we can reuse them in the replacement. (If you need to match a literal parenthesis character, how do you think you would do that?)
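Here is a small sketch of numbered capture groups, again in Python, where \1 and \2 work in replacement strings much as described above. The input string and the function name are hypothetical and have nothing to do with the sonnets; the point is only that each parenthesized part can be reused, in any order.

```python
import re


def swap_name(text: str) -> str:
    """Turn 'Surname, Forename' into 'Forename Surname'."""
    # Group 1 captures the word before the comma, group 2 the word after it.
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)


print(swap_name("Shakespeare, William"))  # William Shakespeare
```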

Once we’ve specified our search and replace strings, we click Replace All, which should wrap <line> tags around every line of input. If you make a mistake and don’t get the results you want, you can undo the global search and replace (or anything else) with Ctrl-z (Windows) or Command-z (Mac) and try again.

Adding a root element

At this point, the text in our document is consistent enough to do regular expression pattern-matching to distinguish the <line> elements we’ve created that represent lines of poetry from those that contain Roman numerals, and we can transform them accordingly. Before we do that, though, this is as good a time as any to insert a root element manually by putting a start tag at the top and an end tag at the very bottom. We called this root element <sonnets>. At this point our document looks like:

<sonnets>
…            
<line>II</line>
<line>When forty winters shall besiege thy brow,</line>
<line>And dig deep trenches in thy beauty's field,</line>
<line>Thy youth's proud livery so gazed on now,</line>
<line>Will be a tatter'd weed of small worth held:</line>
<line>Then being asked, where all thy beauty lies,</line>
<line>Where all the treasure of thy lusty days;</line>
<line>To say, within thine own deep sunken eyes,</line>
<line>Were an all-eating shame, and thriftless praise.</line>
<line>How much more praise deserv'd thy beauty's use,</line>
<line>If thou couldst answer 'This fair child of mine</line>
<line>Shall sum my count, and make my old excuse,'</line>
<line>Proving his beauty by succession thine!</line>
<line>This were to be new made when thou art old,</line>
<line>And see thy blood warm when thou feel'st it cold.</line>
<line>III</line>
<line>Look in thy glass and tell the face thou viewest</line>
<line>Now is the time that face should form another;</line>
<line>Whose fresh repair if now thou not renewest,</line>
<line>Thou dost beguile the world, unbless some mother.</line>
<line>For where is she so fair whose unear'd womb</line>
<line>Disdains the tillage of thy husbandry?</line>
<line>Or who is he so fond will be the tomb,</line>
<line>Of his self-love to stop posterity?</line>
<line>Thou art thy mother's glass and she in thee</line>
<line>Calls back the lovely April of her prime;</line>
<line>So thou through windows of thine age shalt see,</line>
<line>Despite of wrinkles this thy golden time.</line>
<line>But if thou live, remember'd not to be,</line>
<line>Die single and thine image dies with thee.</line>
…
</sonnets>

Fixing the sonnet numbers

Lines that contain Roman numerals have a sequence of I, V, X, L, and C characters in any order and nothing else, so we can match that kind of line with the regular expression <line>([IVXLC]+)</line>. Reading from the inside out, the letters inside the square brackets form a character class, which matches any single character from that class. Since we want to match a sequence of one or more characters from that class, we put a plus sign after the closing square bracket.

So why do we then wrap the character class and the plus sign in parentheses? Remember our discussion above of how to capture just part of a regex for reuse in the replacement? We’ve used parentheses to capture the Roman numeral portion of the pattern (separately from the start and end <line> tags) so that we can write it into the replacement string as \1.

We want to write the Roman numeral into the replacement as the attribute value, which means that we have to write it between the quotation marks that demarcate that value. Our full replacement string, then, is </sonnet>\n<sonnet number="\1">. The way to read this is that we match the line that contains the Roman numeral, capture only the numeral itself, and then replace the entire line that we just matched with a </sonnet> end tag, a new line, and then a <sonnet> start tag with a number attribute, the value of which is the Roman numeral that we captured during the match. After we run a Replace All, we have a document that looks like this:

<sonnets>
…            
</sonnet>
<sonnet number="II">
<line>When forty winters shall besiege thy brow,</line>
<line>And dig deep trenches in thy beauty's field,</line>
<line>Thy youth's proud livery so gazed on now,</line>
<line>Will be a tatter'd weed of small worth held:</line>
<line>Then being asked, where all thy beauty lies,</line>
<line>Where all the treasure of thy lusty days;</line>
<line>To say, within thine own deep sunken eyes,</line>
<line>Were an all-eating shame, and thriftless praise.</line>
<line>How much more praise deserv'd thy beauty's use,</line>
<line>If thou couldst answer 'This fair child of mine</line>
<line>Shall sum my count, and make my old excuse,'</line>
<line>Proving his beauty by succession thine!</line>
<line>This were to be new made when thou art old,</line>
<line>And see thy blood warm when thou feel'st it cold.</line>
</sonnet>
<sonnet number="III">
<line>Look in thy glass and tell the face thou viewest</line>
<line>Now is the time that face should form another;</line>
<line>Whose fresh repair if now thou not renewest,</line>
<line>Thou dost beguile the world, unbless some mother.</line>
<line>For where is she so fair whose unear'd womb</line>
<line>Disdains the tillage of thy husbandry?</line>
<line>Or who is he so fond will be the tomb,</line>
<line>Of his self-love to stop posterity?</line>
<line>Thou art thy mother's glass and she in thee</line>
<line>Calls back the lovely April of her prime;</line>
<line>So thou through windows of thine age shalt see,</line>
<line>Despite of wrinkles this thy golden time.</line>
<line>But if thou live, remember'd not to be,</line>
<line>Die single and thine image dies with thee.</line>
…
</sonnets>
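The same find and replace can be sketched in Python to check the pattern outside <oXygen/>. The function name is an assumption made for this example; the regular expression and the replacement string are the ones given above.

```python
import re


def tag_sonnet_numbers(text: str) -> str:
    """Turn <line> elements that hold only a Roman numeral into sonnet boundaries."""
    pattern = r"<line>([IVXLC]+)</line>"
    # \1 inserts the captured numeral; \n in the replacement becomes a new-line.
    replacement = r'</sonnet>\n<sonnet number="\1">'
    return re.sub(pattern, replacement, text)


sample = (
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>"
)
print(tag_sonnet_numbers(sample))
# </sonnet>
# <sonnet number="II">
# <line>When forty winters shall besiege thy brow,</line>
```

Lines of poetry are left alone because they contain characters outside the [IVXLC] class, so the pattern cannot match them from the start tag to the end tag.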

Last details

There will be a spurious </sonnet> end tag before the first sonnet and a missing one after the last sonnet, so you can fix those manually. We already added the tags for our root element above, but if you didn’t do it there, you can do it now.
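Fixing these two tags by hand is the simplest approach. If you ever wanted to script it, one possible sketch in Python follows. It assumes the <sonnets> root tags are already in place and that each tag sits on its own line, and the function name is invented for the example.

```python
def fix_sonnet_boundaries(text: str) -> str:
    """Drop the spurious first </sonnet> and add the missing last one."""
    # The count argument of 1 removes only the first occurrence,
    # which is the end tag that precedes the first sonnet.
    text = text.replace("</sonnet>\n", "", 1)
    # Close the last sonnet just before the root end tag.
    text = text.replace("</sonnets>", "</sonnet>\n</sonnets>")
    return text


sample = (
    "<sonnets>\n"
    "</sonnet>\n"
    '<sonnet number="II">\n'
    "<line>When forty winters shall besiege thy brow,</line>\n"
    "</sonnet>\n"
    '<sonnet number="III">\n'
    "<line>Look in thy glass and tell the face thou viewest</line>\n"
    "</sonnets>"
)
print(fix_sonnet_boundaries(sample))
```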

Once you’re done, save the document as XML (that is, with a .xml filename extension), close it, and open it again in <oXygen/>. (You have to close and open it so that <oXygen/> will recognize that it’s XML. Because you were editing it as plain text before, just saving it as XML isn’t enough, and <oXygen/> will incorrectly think it’s still plain text despite the new filename. But if you close it and reopen it, <oXygen/> will recognize it correctly as XML.) If it’s well formed, you’ll get a green square. If not, you’ll need to figure out what went wrong and fix it. Sometimes you can use regex to clean up regex errors, but often when we mess up this type of autotagging, we find it easier to start anew with a clean copy of the plain text.
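The green square in <oXygen/> is the check we rely on, but well-formedness can also be tested with any XML parser. As a rough sketch, Python's standard xml.etree.ElementTree module raises a ParseError when the input is not well formed. The function name and both sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = (
    "<sonnets>\n"
    '<sonnet number="II">\n'
    "<line>When forty winters shall besiege thy brow,</line>\n"
    "</sonnet>\n"
    "</sonnets>"
)
# Still has the spurious end tag and lacks the final one.
bad = (
    "<sonnets>\n"
    "</sonnet>\n"
    '<sonnet number="II">\n'
    "<line>When forty winters shall besiege thy brow,</line>\n"
    "</sonnets>"
)
print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```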