Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified:
2023-02-03T17:15:37+0000
Produce an XML version from the plain-text version of Shakespeare’s sonnets by using find-and-replace operations with regular expressions. There are many different ways to approach this task, so if you did it differently than described below, your solution may well have been just as good as the one outlined here. Consider this answer sheet as a guide to how one might approach this type of task, rather than as The One True Answer. It is not necessary to get all the way to a complete solution; the point of the assignment is to gain experience with using regex to up-convert (that is, add markup to) documents.
The first step is to remove the excess metadata content at the beginning and end of the Gutenberg text. Manually highlight and remove lines 1 through 289, making the very first sonnet’s number, Roman numeral I, the first line of the document. Then, scroll to the end of the document and delete everything from line 2618 through the end (line 2635).
The next step we need to take when transforming raw text into XML is to accommodate special characters, such as ampersands and angle brackets, that, if retained unchanged, would interfere with the markup we need to insert into our document. In this document there aren’t any, so nothing bad will happen if you don’t do any of the following, but you should nonetheless get in the habit of always checking for them first.
Although you don’t have to replace any reserved characters in this particular text, you nonetheless need to know how to do it. Since the replacement strings we’ll be inserting in this step all begin with an ampersand, we need to replace any existing ampersands first; otherwise, if we were to replace ampersands after inserting other entities, we would also wind up replacing all of the ampersands we had inserted ourselves, which we don’t want to do, since those really are markup. In Real Life, we do this at the beginning of every up transformation, that is, any transformation from plain text to XML. If there aren’t any reserved characters in the input text, no harm is done, since the global find/replace operation finds nothing to replace and therefore changes nothing. If there are, though, this will get any existing reserved characters out of the way, so that we can begin inserting markup.
To replace reserved characters with their entity representations, open your find/replace
dialog window in <oXygen/> (using Ctrl-f on Windows or Command-f on Mac) and enter
&
as the text to find, replacing it with its entity,
&amp;
. We do the same for left and right angle brackets: find
<
and replace it with
&lt;
, then find
>
and replace it with
&gt;
. If there aren’t any, so much the better.
You’re searching for plain text characters here, so you don’t need to check the
Regular expression
box.
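The ordering argument above can be sketched outside of <oXygen/>. The following is a minimal illustration in Python, not part of the assignment; the function name `escape_reserved` and the sample string are made up for the example.

```python
def escape_reserved(text: str) -> str:
    """Replace XML reserved characters with entity references."""
    # Ampersands go first; otherwise the "&" that we insert as part of
    # "&lt;" and "&gt;" would itself be escaped by a later pass.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


sample = "Venus & Adonis <1593>"  # hypothetical input, not from the sonnets
escaped = escape_reserved(sample)
print(escaped)

# Doing the ampersand last damages the entity we just inserted.
wrong_order = sample.replace("<", "&lt;").replace("&", "&amp;")
print(wrong_order)
```

In the second result the inserted `&lt;` has turned into `&amp;lt;`, which is the problem that replacing ampersands first avoids.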
The plain text edition uses white space (blank lines, leading spaces at the beginning of a line) to demarcate parts of the edition, but since we’re going to use real markup in our XML edition, we want to get rid of any white space we don’t need. In some up-translations we might need to use the white space (indentation, blank lines) as part of the regex we try to match, but since we can find what we want to find in this particular edition without that white space, we get rid of it before introducing any markup.
To do that, in the find/replace dialog we check the box to turn on regular expression
matching and replace
\n+
(sequences of one or more new-line characters) with
\n
(just a single new-line character). This collapses multiple line breaks (that is, blank
lines) all at once. You could, alternatively, search for just
\n\n
, and that will work here because there are no sequences of more than one blank line
between lines of text. If there were, though,
\n\n
would match only two new-line characters at a time, and you’d have to keep rerunning the
operation until all of the blank lines were gone. The plus sign, though, will match any
number of consecutive new-line characters.
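As a rough illustration of the difference, here is the same pair of replacements in Python’s `re` module (an assumption for the sake of the example; the assignment itself uses the <oXygen/> dialog). The sample text deliberately contains a run of three blank lines, which the real sonnet file does not.

```python
import re

# Hypothetical sample: one blank line, then three blank lines in a row.
text = "I\n\nFrom fairest creatures\n\n\n\nThat thereby"

# Replacing fixed pairs of new-lines leaves a blank line behind
# wherever there were more than two new-lines in a row.
once = text.replace("\n\n", "\n")

# The plus sign consumes the whole run in a single pass.
collapsed = re.sub(r"\n+", "\n", text)

print(repr(once))
print(repr(collapsed))
```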
To get rid of leading spaces (two of them before every line, except that there are four
before each line in the final couplets), we do a find/replace operation to replace
^ +
with nothing. Note the leading caret; this
regular expression finds sequences of one or more space characters only at the
beginning of a line and replaces them with nothing, that is, erases them.
Without the caret, we would delete spaces between words, and we don’t want to do
that!
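A small sketch of the effect of the caret, again using Python’s `re` module rather than <oXygen/>. One assumption to note: in Python the caret matches at the start of every line only when the `re.MULTILINE` flag is set, so the flag is part of the example.

```python
import re

text = (
    "  From fairest creatures we desire increase,\n"
    "    To eat the world's due, by the grave and thee."
)

# With the caret: only spaces at the beginning of each line are removed.
stripped = re.sub(r"^ +", "", text, flags=re.MULTILINE)

# Without the caret: every run of spaces is removed, including
# the spaces between words.
no_caret = re.sub(r" +", "", text)

print(stripped)
print(no_caret)
```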
Our file at this point looks roughly like:
II
When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a tatter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.
III
Look in thy glass and tell the face thou viewest
Now is the time that face should form another;
Whose fresh repair if now thou not renewest,
Thou dost beguile the world, unbless some mother.
For where is she so fair whose unear'd womb
Disdains the tillage of thy husbandry?
Or who is he so fond will be the tomb,
Of his self-love to stop posterity?
Thou art thy mother's glass and she in thee
Calls back the lovely April of her prime;
So thou through windows of thine age shalt see,
Despite of wrinkles this thy golden time.
But if thou live, remember'd not to be,
Die single and thine image dies with thee.
Note that there are no blank lines and no leading white space at the beginning of any line.
Eventually we want to have each sonnet wrapped in
<sonnet>
tags and each line of a sonnet wrapped
in <line>
tags. We want to retain the Roman
numerals, but they aren’t lines of poetry, so by the time we’re done we don’t want them
to be encoded as <line>
elements. We’ll say more
below about how we do want them to look at the end of the conversion process.
As a first step, though, it’s convenient to tag every line of text in the
plain-text edition as a <line>
element, whether
it’s a line of poetry or the Roman numeral that identifies the sonnet. We’ll later
change the markup for the Roman numerals, but for now it’s simplest to treat all lines
identically. Over-generalizing and then correcting the over-generalizations separately
is a common strategy in up-conversion, since it is often simpler than trying to write
more exact and specific replacement rules.
We can wrap every line with <line>
start- and
end-tags with a global search and replace operation, which means that we need to use a
regular expression to say match each and every line, no matter what it contains.
The regular expression we use for this purpose is
.+
. Here we exploit the fact that the dot
(.
) matches any character except a new line, which
means that first we’ll match every character in the first line up to the (invisible)
new-line at the end, and after we’ve processed that we’ll do the same with next line,
etc. We need to specify that we want to allow any number of characters, and so we use
the plus sign (+
) occurrence indicator to say
match one or more instances of whatever precedes the plus sign, that is, match
any sequence of one or more characters, but stop when you reach a new-line.
Note that you shouldn’t have Dot matches all
selected. By default, the dot matches
any character except a new line. If you select Dot matches all
, dot
will also match a new line. If you check this box, your match will span multiple lines,
which isn’t what you want—in fact, you’ll match the entire document as if it were just
one line! What you want is to match all of the characters between the beginning of a
line and the end of the same line.
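The effect of that checkbox can be approximated in Python, where the corresponding option is the `re.DOTALL` flag (this mapping is an assumption about equivalent behavior, not a description of how <oXygen/> is implemented).

```python
import re

text = "I\nFrom fairest creatures we desire increase,\nThat thereby beauty's rose might never die,"

# Default behavior: the dot stops at each new-line, so we get one match per line.
per_line = re.findall(r".+", text)

# With DOTALL the dot also matches new-lines, so the whole text is one match.
whole = re.findall(r".+", text, flags=re.DOTALL)

print(len(per_line))
print(len(whole))
```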
Before running a global search and replace we always click Find All
in
<oXygen/> to verify that we’re matching what we think we’re matching. Once you’ve
done that and verified that you are matching each line, now you have to decide what to
use to replace the line. In this case, we want to replace each line with itself, plus
surrounding
<line>
tags. More specifically, we want to capture each entire match (each entire individual
line) and reuse it in the replacement pattern, and that’s the easiest way to say replace
the contents of the line with itself, but wrap that self in
<line>
tags. The string
\0
(backslash followed by the digit zero) in the
replacement pattern automatically inserts the entire matched string into the
replacement. Since we’re capturing the entire line in this case, we could use
<line>\0</line>
as our replacement string,
and it would do what we want.
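For comparison, here is a sketch of the same wrap-every-line replacement in Python. Note one difference that is specific to Python and is not a statement about <oXygen/>: in Python’s `re.sub` the whole match is written `\g<0>` in the replacement, because `\0` is read as an octal escape there.

```python
import re

text = (
    "II\n"
    "When forty winters shall besiege thy brow,\n"
    "And dig deep trenches in thy beauty's field,"
)

# ".+" matches each non-empty line; "\g<0>" re-inserts the entire match.
wrapped = re.sub(r".+", r"<line>\g<0></line>", text)
print(wrapped)
```

Every line, including the Roman numeral, ends up inside its own pair of tags, which is the deliberate over-generalization described above.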
What if we want to capture only part of a match? In that case we can put the part that we
want to capture in parentheses and refer to it in the replacement string with a
backslash followed by a number greater than zero. For example,
\1
at a particular location in the replacement
string would insert into that location whatever was matched by the first parenthesized
sub-expression (the first captured part of the match),
\2
the second parenthesized (captured)
sub-expression, etc., up to as many as we need. Note that parentheses in the match
expression are not actually matched, that is, the line doesn’t have to include literal
parenthesis characters for the match to succeed. The parentheses are
metacharacters that serve to capture part of the match, so that we can
reuse them in the replacement. (If you need to match a literal parenthesis character,
how do you think you would do that?)
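A brief sketch of numbered capture groups, using Python’s `re` module and a made-up input string that is not part of the sonnet file. The first pair of parentheses captures the surname and the second captures the given name, and the replacement writes them back in the opposite order.

```python
import re

name = "Shakespeare, William"  # hypothetical input for illustration

# \1 is whatever the first parenthesized group matched, \2 the second.
# The comma and space are matched but not captured, so they are dropped.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", name)
print(swapped)
```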
Once we’ve specified our search and replace strings, we click Replace All, which should
wrap <line>
tags around every line of input. If
you make a mistake and don’t get the results you want, you can undo the global search
and replace (or anything else) with Ctrl-z (Windows) or Command-z (Mac) and try
again.
At this point, the text in our document is consistent enough to do regular expression
pattern-matching to distinguish the <line>
elements we’ve created that represent lines of poetry from those that contain Roman
numerals, and we can transform them accordingly. Before we do that, though, this is as
good a time as any to insert a root element manually by putting a start-tag at the top
and an end-tag at the very bottom. We called this root element
<sonnets>
. At this point our document looks
like:
<sonnets>
…
<line>II</line>
<line>When forty winters shall besiege thy brow,</line>
<line>And dig deep trenches in thy beauty's field,</line>
<line>Thy youth's proud livery so gazed on now,</line>
<line>Will be a tatter'd weed of small worth held:</line>
<line>Then being asked, where all thy beauty lies,</line>
<line>Where all the treasure of thy lusty days;</line>
<line>To say, within thine own deep sunken eyes,</line>
<line>Were an all-eating shame, and thriftless praise.</line>
<line>How much more praise deserv'd thy beauty's use,</line>
<line>If thou couldst answer 'This fair child of mine</line>
<line>Shall sum my count, and make my old excuse,'</line>
<line>Proving his beauty by succession thine!</line>
<line>This were to be new made when thou art old,</line>
<line>And see thy blood warm when thou feel'st it cold.</line>
<line>III</line>
<line>Look in thy glass and tell the face thou viewest</line>
<line>Now is the time that face should form another;</line>
<line>Whose fresh repair if now thou not renewest,</line>
<line>Thou dost beguile the world, unbless some mother.</line>
<line>For where is she so fair whose unear'd womb</line>
<line>Disdains the tillage of thy husbandry?</line>
<line>Or who is he so fond will be the tomb,</line>
<line>Of his self-love to stop posterity?</line>
<line>Thou art thy mother's glass and she in thee</line>
<line>Calls back the lovely April of her prime;</line>
<line>So thou through windows of thine age shalt see,</line>
<line>Despite of wrinkles this thy golden time.</line>
<line>But if thou live, remember'd not to be,</line>
<line>Die single and thine image dies with thee.</line>
…
</sonnets>
Lines that contain Roman numerals have a sequence of I
, V
, X
,
L
, and C
characters in any order and nothing else, and no real line of
poetry has content that matches that pattern, so we can match a
<line>
that contains only a Roman numeral with the regular expression
<line>([IVXLC]+)</line>
. Reading from the
inside out, the letters inside the square brackets form a character class,
which matches any single character from that class. Since we want to match a sequence of
one or more characters from that class, we put a plus sign after the closing square
bracket. The character class means choose any one of these
and the plus sign
after the entire class (that is, after the closing square bracket) makes the act of
choosing repeatable, and therefore means make that choice at least once up to as many
times as you want
.
So why do we then wrap the character class and the plus sign in parentheses? Remember our
discussion above of how to capture just part of a regex for reuse in the replacements?
We’ve used parentheses to capture the Roman numeral portion of the pattern (separately
from the <line>
start- and end-tags) so that we
can write it into the replacement string as \1
.
We want to write the Roman numeral into the replacement as the attribute value, which means
that we have to write it between the quotation marks that demarcate that value. Our full
replacement string, then, is
</sonnet>\n<sonnet number="\1">
. The way to read this is that we match the line
that contains the Roman numeral, capture only the numeral itself, and then replace the
entire line that we just matched with a
</sonnet>
end-tag (because we’re at the end of the preceding sonnet), a new line (for human
legibility), and then a
<sonnet>
start-tag with a
number
attribute, the value of which is the Roman numeral
that we captured during the match. Because we don’t capture the original
<line>
tags, we throw them away. After we run a
replace all
, we have a document that looks like this:
<sonnets>
…
</sonnet>
<sonnet number="II">
<line>When forty winters shall besiege thy brow,</line>
<line>And dig deep trenches in thy beauty's field,</line>
<line>Thy youth's proud livery so gazed on now,</line>
<line>Will be a tatter'd weed of small worth held:</line>
<line>Then being asked, where all thy beauty lies,</line>
<line>Where all the treasure of thy lusty days;</line>
<line>To say, within thine own deep sunken eyes,</line>
<line>Were an all-eating shame, and thriftless praise.</line>
<line>How much more praise deserv'd thy beauty's use,</line>
<line>If thou couldst answer 'This fair child of mine</line>
<line>Shall sum my count, and make my old excuse,'</line>
<line>Proving his beauty by succession thine!</line>
<line>This were to be new made when thou art old,</line>
<line>And see thy blood warm when thou feel'st it cold.</line>
</sonnet>
<sonnet number="III">
<line>Look in thy glass and tell the face thou viewest</line>
<line>Now is the time that face should form another;</line>
<line>Whose fresh repair if now thou not renewest,</line>
<line>Thou dost beguile the world, unbless some mother.</line>
<line>For where is she so fair whose unear'd womb</line>
<line>Disdains the tillage of thy husbandry?</line>
<line>Or who is he so fond will be the tomb,</line>
<line>Of his self-love to stop posterity?</line>
<line>Thou art thy mother's glass and she in thee</line>
<line>Calls back the lovely April of her prime;</line>
<line>So thou through windows of thine age shalt see,</line>
<line>Despite of wrinkles this thy golden time.</line>
<line>But if thou live, remember'd not to be,</line>
<line>Die single and thine image dies with thee.</line>
…
</sonnets>
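The Roman-numeral replacement can also be tried on a short excerpt in Python. This is only a sketch under the assumption that Python’s `re.sub` treats `\1` and `\n` in the replacement the way the <oXygen/> dialog does; the excerpt is shortened to one line of poetry per sonnet.

```python
import re

text = (
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>\n"
    "<line>III</line>\n"
    "<line>Look in thy glass and tell the face thou viewest</line>"
)

# Match a <line> whose only content is Roman-numeral letters, capture the
# numeral, and replace the whole line with an end-tag plus a new start-tag.
tagged = re.sub(
    r"<line>([IVXLC]+)</line>",
    r'</sonnet>\n<sonnet number="\1">',
    text,
)
print(tagged)
```

The result begins with a stray end-tag and lacks a final one, which is the same first-and-last problem that shows up in the full document.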
You will have to fix the tags around your first and last sonnets manually, because those are not between two sonnets, which is what our regex matched. In this description we have already added the tags for our root element above, but if you didn’t do it there, you can do it now.
The sonnets are tessellated, which means that the beginning of each sonnet simultaneously marks the end of the preceding sonnet. The tessellated structure is why we don’t have to match the ends of the sonnets at all; when we match something that we expect to find at the beginning of every sonnet (a Roman-numeral line), we know that we have found both the beginning of a new sonnet and the end of the preceding one, so we can introduce both the start-tag for the new sonnet and the end-tag for the preceding one. Except …
The first and last items in a tessellated sequence are special. The start of the first sonnet does not mark the end of a preceding one, because there is no preceding one. And because we find the end of a sonnet only by finding the beginning of the next one, our pattern doesn’t find the end of the last sonnet. This is why we need to fix the markup of the first and last sonnets manually.
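The manual clean-up amounts to two small edits, sketched here in Python on a shortened excerpt. The sketch assumes that the root element has not yet been added, so the stray end-tag is the very first thing in the text; the variable names are made up for the example.

```python
# Output of the Roman-numeral replacement, shortened to two one-line sonnets.
tagged = (
    '</sonnet>\n<sonnet number="I">\n'
    "<line>From fairest creatures we desire increase,</line>\n"
    '</sonnet>\n<sonnet number="II">\n'
    "<line>When forty winters shall besiege thy brow,</line>"
)

# 1. Remove the end-tag that precedes the first sonnet.
stray = "</sonnet>\n"
if tagged.startswith(stray):
    tagged = tagged[len(stray):]

# 2. Add the end-tag that is missing after the last sonnet.
tagged = tagged + "\n</sonnet>"

# Wrap everything in the root element.
document = "<sonnets>\n" + tagged + "\n</sonnets>"
print(document)
```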
Some versions of regular expressions, including the version that works in the <oXygen/> find-and-replace dialog, have look-ahead and look-behind features, which you can use to match each entire sonnet, including the first and last ones. We don’t teach look-ahead and look-behind because our regular-expression time is limited, they are advanced features, and we can live without them, but they do offer an appealing alternative approach here. If you’re curious, feel free to read up on them at Lookahead and Lookbehind Zero-Length Assertions and try using them to tag the sonnets in a way that doesn’t require manual clean-up of the first and last ones.
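One possible look-ahead approach is sketched below in Python’s `re` module; it is an illustration under stated assumptions, not the course’s own solution, and the syntax may need adjusting for the <oXygen/> dialog. It assumes every line is already wrapped in `<line>` tags and that there is no trailing new-line (hence the `strip()`). The look-ahead `(?=…)` checks that the next thing is either another Roman-numeral line or the end of the text, without consuming it.

```python
import re

text = (
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>\n"
    "<line>And dig deep trenches in thy beauty's field,</line>\n"
    "<line>III</line>\n"
    "<line>Look in thy glass and tell the face thou viewest</line>\n"
).strip()

# Group 1: the Roman numeral. Group 2: everything after the numeral line,
# matched lazily up to the point where the look-ahead succeeds.
pattern = re.compile(
    r"<line>([IVXLC]+)</line>\n(.+?)(?=\n<line>[IVXLC]+</line>|\Z)",
    flags=re.DOTALL,  # let the dot cross line boundaries inside a sonnet
)

tagged = pattern.sub(r'<sonnet number="\1">\n\2\n</sonnet>', text)
print(tagged)
```

Because each match covers a whole sonnet, the first and last sonnets get their start- and end-tags in the same pass as all the others.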
Once you’re done, if you told <oXygen/> to create an XML document, it should do well-formedness checking automatically, so you can make sure that you have a green square. If you told <oXygen/> to create a plain text document, though, you have to tell it that it is no longer plain text, and should now be treated as XML. In order to do that you need to use the Save as menu option (under File) to save the document as XML (that is, with an .xml filename extension), close it, and open it again in <oXygen/>. Perhaps confusingly, you have to close and open it so that <oXygen/> will recognize that it’s XML; just saving it as XML isn’t enough, and <oXygen/> will incorrectly think it’s still plain text despite the new filename, but if you close it and reopen it, <oXygen/> will recognize it correctly as XML. If it’s well formed, you’ll get a green square. If not, you’ll need to figure out what went wrong and fix it. Sometimes you can use regex to clean up regex errors, but often when we mess up this type of autotagging, we find it easier to start anew with a clean copy of the plain text.
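If you want a second opinion outside of <oXygen/>, well-formedness can also be checked with Python’s standard `xml.etree.ElementTree` parser. This is a minimal sketch; the helper name `is_well_formed` and the two sample strings are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = (
    '<sonnets><sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
    "</sonnet></sonnets>"
)
# Same content, but with the stray end-tag before the first sonnet
# and no end-tag after the last one.
bad = (
    '<sonnets></sonnet><sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
    "</sonnets>"
)

print(is_well_formed(good))
print(is_well_formed(bad))
```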