Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2018-02-23T00:00:51+0000


Test #3: Regular Expressions: Answers

For this test, we asked you to up-convert the text The Bicyclers and Three Other Farces using regular expressions. One solution, with commentary, is below.

Many of you started your up-conversion from the outside in; that is to say, you began by wrapping entire scenes, then titles, then characters, and so on down the hierarchy until you reached the smallest units. This isn’t necessarily a mistake, but we approached the task from the inside out because we usually find that to be more robust. In the explanation below, we describe what we matched, what we replaced it with, and how the change moved us closer to our desired result. Take special note of where we eliminated pseudo-markup, of which there were four different varieties in this play. And although you didn’t need to do this, we temporarily deleted both the play title and the table of contents, and manually tagged and reintroduced them at the end of our up-conversion.

After we normalized all our blank lines by searching for \n{3,} and replacing it with \n\n, we began our up-conversion by tagging every block of text as <speech>, as follows:

Find (DMA): (.+?)(\n\n)
Replace: <speech>\1</speech>\2

Note: Any time we write (DMA) next to our Find string, it means we used dot matches all.

The Find string tells the regex engine to find every instance of a string of text that continues until it runs into two consecutive new-line characters. Note that when we use dot matches all, we need to make our naturally greedy repetition indicator a lazy one, and we accomplish that with the question mark. In other words, when the dot followed by the plus sign matches all characters (including new-lines), it will treat the entire document as a single match unless we give it a stopping point for the match. The non-greedy matching, signaled by adding the question mark after the plus sign, says to stop at the first sequence of two new-line characters, instead of the last. Here’s a snippet of what our text looks like after we complete this step:

<speech>Yardsley.  Oh, never mind the bell! Let her down.</speech>

<speech>Perkins.  I beg your pardon, but I positively refuse. I believe in
doing things right. I'm not going to monkey. Ring that bell, and
down she comes; otherwise--</speech>

<speech>Yardsley.  Tut! You are very tiresome this afternoon, Thaddeus.
Mrs. Perkins, we'll go ahead without dropping the curtain. Now take
your place.</speech>

Because we captured the new-line characters and wrote them back into the output, we retain the original line spacing in the text as we continue our up-conversion.
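For readers who want to try this step outside an editor, here is a minimal sketch using Python's re module. That choice is our assumption, not part of the original solution, which used an editor's find-and-replace; re.DOTALL plays the role of dot matches all. The sample text and variable names are hypothetical. The sketch also runs the greedy version for comparison, to show why the question mark matters.

```python
import re

# Hypothetical sample input: two speeches separated by too many blank lines.
raw = (
    "Yardsley.  Oh, never mind the bell! Let her down.\n\n\n\n"
    "Perkins.  I beg your pardon, but I positively refuse. I believe in\n"
    "doing things right.\n\n"
)

# Collapse every run of three or more new-lines to exactly two.
normalized = re.sub(r"\n{3,}", "\n\n", raw)

# Lazy repetition with DOTALL: each match stops at the first pair of new-lines.
lazy = re.sub(r"(.+?)(\n\n)", r"<speech>\1</speech>\2", normalized, flags=re.DOTALL)

# Greedy repetition for comparison: the match runs to the last pair of new-lines.
greedy = re.sub(r"(.+)(\n\n)", r"<speech>\1</speech>\2", normalized, flags=re.DOTALL)

print(lazy)
print(lazy.count("<speech>"), "speech elements with lazy matching")
print(greedy.count("<speech>"), "speech element with greedy matching")
```

With the greedy pattern both blocks end up inside a single <speech> element, which is the whole-document match described above.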

Now that we have every block of text wrapped in <speech> tags (including everything that is not a speech, like titles, stage directions, and character names), we add both <speaker> and <lines> tags to each speech with a single find-and-replace operation, as follows:

Find (DMA): (<speech>)([A-Za-z \.]+)  (.+?)(</speech>)
Replace: \1\n<speaker>\2</speaker>\n<lines>\3</lines>\n\4

This isn’t the only way to tag speakers and spoken lines, but we found it the easiest way to manage those two components of our new <speech> elements without error. This pattern tells the engine to find every <speech> start tag that is followed immediately by any combination of uppercase or lowercase letters, space characters, and literal dots, followed by exactly two space characters, and then any characters at all until it reaches and matches a </speech> tag. It uses four capture groups to retain for reuse everything except the two space characters that separate the speaker from the spoken lines in the original. We exploit the fact that only speaker names are followed by two space characters to distinguish them from other text before dots and single spaces. There were speaker names that had literal dots within them, so if you did not include dots in your character class, your regex probably split a speaker’s name after the dot (stranding, for example, Mrs. separately from Perkins). Notice that we introduce new-line characters into our replacement string to separate the parts of the speech. This is optional, but it helps with legibility. At this point our markup looks as follows:

<speech>
<speaker>Mrs. Perkins.</speaker>
<lines>[appearing in doorway] We have a patent laundry table.</lines>
</speech>

<speech>
<speaker>Barlow.</speaker>
<lines>Just the thing.</lines>
</speech>

Since the dot is included in the character class with the letters and space characters, character names with titles, like Mrs. Perkins, above, are handled properly.
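The speaker-and-lines step can be sketched the same way. This is an illustration under the assumption that Python's re module stands in for the editor's find-and-replace; the input string is a hypothetical single speech taken from the sample above.

```python
import re

# Hypothetical input: one speech as it looks after the first tagging step.
speech = (
    "<speech>Mrs. Perkins.  [appearing in doorway] "
    "We have a patent laundry table.</speech>"
)

# Group 2 is the speaker (letters, spaces, dots); the two literal spaces
# that follow it are matched but not captured, so they are dropped.
pattern = r"(<speech>)([A-Za-z \.]+)  (.+?)(</speech>)"
replacement = r"\1\n<speaker>\2</speaker>\n<lines>\3</lines>\n\4"

split_speech = re.sub(pattern, replacement, speech, flags=re.DOTALL)
print(split_speech)
```

The character class is greedy and includes the space, so it first swallows the two separator spaces as well; the engine then backtracks until exactly two spaces remain for the literal part of the pattern, which leaves Mrs. Perkins. in the speaker group.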

Next we strip the pseudo-markup dot from the end of each speaker’s name by searching for \.(</speaker>) and replacing it with \1.

Next we tag all stage directions, while also removing the pseudo markup square brackets, which serve no purpose once we add real tags:

Find (DMA): \[(.+?)\]
Replace: <stage>\1</stage>
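A minimal sketch of these two clean-up steps, again assuming Python's re module in place of the editor and using a hypothetical one-speech input:

```python
import re

# Hypothetical input: one speech after the speaker/lines step.
speech = (
    "<speech>\n<speaker>Mrs. Perkins.</speaker>\n"
    "<lines>[appearing in doorway] We have a patent laundry table.</lines>\n"
    "</speech>"
)

# Remove only the dot that sits directly before </speaker>;
# the dot inside "Mrs." is left alone.
no_dot = re.sub(r"\.(</speaker>)", r"\1", speech)

# Replace the square brackets with <stage> tags. DOTALL lets a stage
# direction span several lines; the lazy quantifier stops at the first "]".
staged = re.sub(r"\[(.+?)\]", r"<stage>\1</stage>", no_dot, flags=re.DOTALL)
print(staged)
```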

Now, we have all speeches and stage directions tagged, but there are blocks of text that contain only stage directions but that are tagged erroneously as speeches, such as:

<speech><stage>Yardsley shakes him by the hand, and Barlow goes out. As he
disappears through the portieres Yardsley follows, and, holding the
curtain aside, looks after him until the front door is heard closing.
Then he turns about. Dorothy looks demurely around at him, and as he
starts to go to her side the curtain falls.</stage></speech>

Since these stage directions are not actually speeches, we should remove the outer <speech> tags and retain only the inner <stage> ones, along with whatever text is between them. We do this as follows:

Find (DMA): <speech>(<stage>.+?</stage>)</speech>
Replace: \1
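As a sketch of this unwrapping step (assuming Python's re module and a hypothetical two-block input), note that a real speech is left untouched because, after the speaker step, a new-line character follows its <speech> start tag:

```python
import re

# Hypothetical input: a stage-direction block wrongly tagged as a speech,
# followed by a genuine speech.
text = (
    "<speech><stage>Yardsley shakes him by the hand,\n"
    "and Barlow goes out.</stage></speech>\n\n"
    "<speech>\n<speaker>Barlow</speaker>\n"
    "<lines>Just the thing.</lines>\n</speech>\n\n"
)

# Keep only the inner <stage> element and its content.
unwrapped = re.sub(
    r"<speech>(<stage>.+?</stage>)</speech>", r"\1", text, flags=re.DOTALL
)
print(unwrapped)
```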

Now that we have all the speech-level elements tagged, we can move on to tagging scenes. First, we wrap all our scenes in <scene> tags with:

Find (DMA): <speech>([A-Z ]+)</speech>(.+?)<stage>CURTAIN</stage>
Replace: <scene>\n<title>\1</title>\2</scene>

We do this in one step, but it can also be broken into smaller individual sub-steps. First, we recognize that scene titles and only scene titles are represented by entire lines of nothing but uppercase letters and spaces, wrapped in <speech> tags (since we over-generalized at the beginning when we first tagged speeches). We begin our match there, which finds the beginning of every scene. We then match both the characters and the scene itself with our second capture group. Last, we terminate the match once the engine finds <stage>CURTAIN</stage>. We reinsert the capture groups and include <scene> tags, along with <title> tags for the titles of each scene. The word CURTAIN that terminates each scene is another instance of pseudo markup, so we replace it in its entirety with a </scene> end tag. The beginnings of our scenes now look like:

<scene>
<title>THE BICYCLERS</title>

<speech>CHARACTERS:
MR. ROBERT YARDSLEY, an expert.
MR. JACK BARLOW, another.
MR. THADDEUS PERKINS, a beginner.
MR. EDWARD BRADLEY, a scoffer.
MRS. THADDEUS PERKINS, a resistant.
MRS. EDWARD BRADLEY, an enthusiast.
JENNIE, a maid.</speech>

<speech>The scene is laid in the drawing-room of Mr. and Mrs. Thaddeus
Perkins, at No. --- Gramercy Square. It is late October; the action
begins at 8.30 o'clock on a moonlight evening. The curtain rising
discloses Mr. and Mrs. Perkins sitting together. At right is large
window facing on square. At rear is entrance to drawing-room.
Leaning against doorway is a safety bicycle. Perkins is clad in
bicycle garb.</speech>
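The scene-wrapping step can be tried on a small scale as follows. This is a sketch that assumes Python's re module; the two abbreviated scenes, including the second scene's title, are hypothetical stand-ins for the play.

```python
import re

# Hypothetical, heavily abbreviated input with two scenes.
text = (
    "<speech>THE BICYCLERS</speech>\n\n"
    "<speech>CHARACTERS:\nJENNIE, a maid.</speech>\n\n"
    "<speech>\n<speaker>Barlow</speaker>\n<lines>Just the thing.</lines>\n</speech>\n\n"
    "<stage>CURTAIN</stage>\n\n"
    "<speech>A SECOND FARCE</speech>\n\n"
    "<speech>\n<speaker>Perkins</speaker>\n<lines>Well?</lines>\n</speech>\n\n"
    "<stage>CURTAIN</stage>\n\n"
)

# Group 1 is the all-caps title; group 2 is everything up to the first
# CURTAIN, which is consumed and replaced by the </scene> end tag.
pattern = r"<speech>([A-Z ]+)</speech>(.+?)<stage>CURTAIN</stage>"
replacement = r"<scene>\n<title>\1</title>\2</scene>"

scenes, count = re.subn(pattern, replacement, text, flags=re.DOTALL)
print(count, "scenes wrapped")
print(scenes)
```

The CHARACTERS: block is not mistaken for a title because its content includes a colon, new-lines, and lowercase letters, none of which are in the character class.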

We can identify our character lists all at once by searching (with dot matches all, since each list spans several lines) for <speech>CHARACTERS:(.+?)</speech> and replacing it with <characters>\1\n</characters>. We drop both the speech tags and the CHARACTERS: string; the tags were not correct and the string was pseudo-markup.
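A short sketch of the character-list step, assuming Python's re module and a hypothetical two-entry list:

```python
import re

# Hypothetical input: an abbreviated character list, still tagged as a speech.
text = "<speech>CHARACTERS:\nMR. ROBERT YARDSLEY, an expert.\nJENNIE, a maid.</speech>\n\n"

# The list spans several lines, so the dot must match new-lines (DOTALL).
# The CHARACTERS: label is matched outside the capture group and dropped.
char_list = re.sub(
    r"<speech>CHARACTERS:(.+?)</speech>",
    r"<characters>\1\n</characters>",
    text,
    flags=re.DOTALL,
)
print(char_list)
```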

We find all our scene descriptions with a similar strategy. We know that all scene descriptions directly follow the list of characters and two new-line characters, so we use that to anchor ourselves in the correct places in the play, as follows:

Find (DMA): (</characters>\n\n)<speech>(.+?)</speech>
Replace: \1<sceneDescription>\2</sceneDescription>

We capture and retain everything except the <speech> tags that surrounded the scene descriptions, which we replace with <sceneDescription> tags. Our preservation of new-line characters keeps the text looking as it had before this find-and-replace operation.
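The anchoring on </characters> can be illustrated with a small sketch (assuming Python's re module; the input is a hypothetical, abbreviated scene opening). Only the block immediately after the character list is retagged; the speech that follows it is untouched.

```python
import re

# Hypothetical input: character list, scene description, then a real speech.
text = (
    "<characters>\nJENNIE, a maid.\n</characters>\n\n"
    "<speech>The scene is laid in the drawing-room of Mr. and Mrs. Thaddeus\n"
    "Perkins.</speech>\n\n"
    "<speech>\n<speaker>Barlow</speaker>\n<lines>Just the thing.</lines>\n</speech>\n\n"
)

# Group 1 keeps the end tag and the two new-lines that anchor the match.
described = re.sub(
    r"(</characters>\n\n)<speech>(.+?)</speech>",
    r"\1<sceneDescription>\2</sceneDescription>",
    text,
    flags=re.DOTALL,
)
print(described)
```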

Finally, for the bonus, we find and replace all of the character names and descriptions at once with:

Find: ^([A-Z\. ]+), (.+)
Replace: <character>\n<name>\1</name>\n<desc>\2</desc>\n</character>

This regex matches all 28 character names, and only them, because those are the only strings remaining in the play that begin a line with all capital letters, periods and spaces, have no accompanying markup, and are followed by a comma. That’s enough to match each line, which we then subdivide and separate with new-line characters to add some spacing between elements. Notice that we first capture all the capital letters, spaces, and periods together (which comprise the name), match but do not keep the comma and space character, and then, finally, capture all the text that follows the space character (which comprises the description of the character). With the spacing we include above, our character lists now look like:

<characters>
<character>
<name>MR. ROBERT YARDSLEY</name>
<desc>an expert.</desc>
</character>
<character>
<name>MR. JACK BARLOW</name>
<desc>another.</desc>
</character>
</characters>

Note: The snippet above is not a full list of characters. In the actual results, there are other <character> elements before the closing </characters> tag.
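The bonus step can be sketched as follows, assuming Python's re module, where re.MULTILINE makes ^ match at the start of every line. Dot matches all is deliberately off here, so (.+) stops at the end of the line. The input is hypothetical and includes one ordinary spoken line to show that mixed-case text is not matched.

```python
import re

# Hypothetical input: an abbreviated character list plus one spoken line.
text = (
    "<characters>\n"
    "MR. ROBERT YARDSLEY, an expert.\n"
    "MR. JACK BARLOW, another.\n"
    "</characters>\n\n"
    "<lines>Tut! You are very tiresome.\n"
    "Mrs. Perkins, we'll go ahead.</lines>\n"
)

# Group 1: capitals, dots, and spaces (the name); the comma and space are
# matched but discarded; group 2: the rest of the line (the description).
pattern = r"^([A-Z\. ]+), (.+)"
replacement = r"<character>\n<name>\1</name>\n<desc>\2</desc>\n</character>"

tagged_chars, count = re.subn(pattern, replacement, text, flags=re.MULTILINE)
print(count, "characters tagged")
print(tagged_chars)
```

The line beginning Mrs. Perkins, is skipped because the lowercase r falls outside the character class before the comma is reached.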

At this point we restore our play title and contents section and tag them manually.

This is only one of many ways to accomplish this assignment, and an entirely different approach is just as correct as our solution as long as it makes consistent and meaningful use of regular expressions.