Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2019-04-16T16:31:36+0000


Test #3: Regular Expressions solution

The task

For this test, we asked you to up-convert a play called The Post Office (http://dh.obdurodon.org/2194_regex-text.txt) using regular expressions. While there are several approaches to up-conversions like this, we will show you only the steps we took and comment on why we chose to do what we did. If you used a different method to produce the same output, that’s perfectly fine! All the same, we encourage you to follow along with this solution, copying and pasting our searches and replacements into your own local file, to see how our approach looked throughout the process and to get a fresh perspective on regular expression operations.

The most pressing question to address is: Where do I start? Generally speaking, there’s no categorically correct answer to that question. However, we typically find it easier to start from the inside and work our way out when up-converting files like these. That means we overgeneralize our searches at the beginning and fix those overgeneralizations later on with subsequent searches, rather than trying to find, for instance, all speaker lines and only speaker lines at a given point.


Our solution

After removing the metadata from the beginning of the document, we checked first for any reserved characters. There weren’t any in this document, so we then normalized all the whitespace by searching for \n{3,} and replacing it with \n\n. Thus, any instance of three or more consecutive newLine characters was replaced by only two newLines.
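If you would rather script this step than run it in an editor, here is a minimal sketch in Python, assuming the standard re module. The function name normalize_blank_lines and the sample string are our own illustration (the sample reuses lines from the play excerpts shown below); they are not part of the original exercise.

```python
import re


def normalize_blank_lines(text: str) -> str:
    """Collapse any run of three or more newlines into exactly two."""
    return re.sub(r"\n{3,}", "\n\n", text)


# Illustrative sample: blocks separated by uneven numbers of newlines.
sample = "GAFFER.  Silence, unbeliever.\n\n\n\n[SUDHA enters]\n\n\nSUDHA.  Amal!\n\n"
normalized = normalize_blank_lines(sample)
print(repr(normalized))
```

Runs of exactly two newlines are left untouched, because the pattern only matches three or more.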

<speech>

Now that we have our whitespace normalized, there are exactly two newLine characters between every block of text in the entire file -- which includes act titles, block-level stage directions, and speaker lines. We then wrapped every single block of text in its own <speech> tags, including the blocks that weren’t speeches (we discuss how we later removed those spurious <speech> tags below). With dot matches all enabled, we searched for (.+?)(\n\n) and replaced it with <speech>\1</speech>\2. In plain English, we searched for the shortest possible sequence of any characters (including newLine) that is followed by two consecutive newLine characters. Since every block of text in the document is followed by two newLines, every block is captured and wrapped in <speech> tags. Our text output now looks like this:

<speech>GAFFER.  Silence, unbeliever.</speech>

<speech>[SUDHA enters]</speech>

<speech>SUDHA.  Amal!</speech>
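The same wrapping step can be sketched in Python, where the re.DOTALL flag plays the role of the dot matches all setting. This is only an illustration under that assumption; the variable names are ours, and the sample is the three-block excerpt above, with each block followed by two newLines.

```python
import re

# Each block, including the last one, is followed by two newlines.
text = "GAFFER.  Silence, unbeliever.\n\n[SUDHA enters]\n\nSUDHA.  Amal!\n\n"

# re.DOTALL lets the dot match newlines; the lazy +? stops at the
# first pair of newlines instead of running to the last pair.
wrapped = re.sub(r"(.+?)(\n\n)", r"<speech>\1</speech>\2", text, flags=re.DOTALL)
print(wrapped)
```

Note that a final block not followed by two newLines would be left unwrapped by this pattern.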
            

<speaker> and <line>

We now want to reformat all our speeches, so that we wrap our speaker names in <speaker> tags and any lines they have in <line> tags. A unique feature of this document’s format is that every speaker’s name is followed by a period and two consecutive space characters, so we make sure to include those space characters in our search so as to be maximally accurate in our search for speaker names and lines. With dot matches all enabled, our next search and replace were the following:

Find: (<speech>)([A-Z ]+)\.  (.+?)(</speech>)
Replace: \1\n<speaker>\2</speaker>\n<line>\3</line>\n\4

We used four capture groups here: the first captures our opening <speech> tag; the second, the full text of our speaker name (without the extraneous period); the third, the full text of the speaker’s lines (without the extraneous leading space characters); and the fourth, our <speech> closing tag. Notice that, in our second capture group, we included a space character in the character class that matches any capital letters; there are several instances where a speaker’s name is two separate words, like with the STATE PHYSICIAN. Unless we include that space character in the character class, the Regex Finder won’t match that character’s speech.

When we write the replacement, we wrap our speaker name inside <speaker> tags and the spoken lines (and any inline stage directions) in <line> tags. The newLine characters in the replacement were introduced only to improve legibility. That is to say, they have no bearing on the ultimate well-formedness of the document, but their presence makes the document easier to read. Here’s what our output looks like at this point:

<speech>
<speaker>GAFFER</speaker>
<line>[Winking hard] I am the fakir.</line>
</speech>

<speech>
<speaker>MADHAV</speaker>
<line>It beats my reckoning what you’re not.</line>
</speech>
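For readers who want to try this search outside an editor, here is a hedged Python sketch of the same find and replace, again using re.DOTALL for dot matches all. The constant names FIND and REPLACE and the assembled sample are our own; the sample lines come from the excerpts in this write-up, not from a contiguous passage of the play.

```python
import re

FIND = r"(<speech>)([A-Z ]+)\.  (.+?)(</speech>)"
REPLACE = r"\1\n<speaker>\2</speaker>\n<line>\3</line>\n\4"

text = (
    "<speech>GAFFER.  [Winking hard] I am the fakir.</speech>\n\n"
    "<speech>[SUDHA enters]</speech>\n\n"
    "<speech>MADHAV.  It beats my reckoning what you’re not.</speech>\n\n"
)

# Blocks that do not start with a capitalized name, a period, and
# two spaces (such as the stage direction) are left unchanged.
tagged = re.sub(FIND, REPLACE, text, flags=re.DOTALL)
print(tagged)

# The space inside the character class is what lets a two-word
# name match as a whole.
with_space = re.fullmatch(r"[A-Z ]+", "STATE PHYSICIAN")
without_space = re.fullmatch(r"[A-Z]+", "STATE PHYSICIAN")
print(with_space is not None, without_space is not None)
```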

<stage>

We next handled all the stage directions in the play by searching for the pseudo-markup that contained them and replacing it with <stage> tags. We did this in one step, but it’s okay to run this find and replace in two steps, since two distinct characters are used to denote the beginning and end of a stage direction. With dot matches all enabled, we searched for \[(.+?)\] and replaced it with <stage>\1</stage>. In Regex, square brackets already have a reserved meaning -- they indicate character classes. Since we wanted to find a literal square bracket in our text, we needed to escape the brackets by preceding them with a backslash. We also needed to enable dot matches all with this strategy to account for any stage directions that fall on multiple lines. We could have accomplished this step just as easily by searching individually for each square bracket, replacing \[ with <stage> and \] with </stage>, which wouldn’t have required us to activate dot matches all.

We also removed the pseudo-markup with this search and replace. Since the function of the square brackets, as textual indicators of stage directions, was replaced by our actual markup, the square brackets no longer served a purpose. Here’s how our output looks now:

<speech>
<speaker>MADHAV</speaker>
<line><stage>Peering out of the window</stage> I should think the noise has
ceased. They’ve smashed the door.</line>
</speech>

<speech><stage>THE KING’S HERALD enters</stage></speech>
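Both ways of tagging stage directions can be compared in a short Python sketch. It assumes re.DOTALL as the equivalent of dot matches all, and it reuses the two excerpts above in their bracketed form; the variable names are ours.

```python
import re

text = (
    "<speech>\n<speaker>MADHAV</speaker>\n"
    "<line>[Peering out of the window] I should think the noise has\n"
    "ceased. They’ve smashed the door.</line>\n</speech>\n\n"
    "<speech>[THE KING’S HERALD enters]</speech>\n\n"
)

# One step: escaped brackets around a lazy capture group.
one_step = re.sub(r"\[(.+?)\]", r"<stage>\1</stage>", text, flags=re.DOTALL)

# Two steps: plain string replacement of each bracket, no regex needed.
two_step = text.replace("[", "<stage>").replace("]", "</stage>")

print(one_step)
print(one_step == two_step)
```

The two-step version relies on every opening bracket having a matching closing bracket, which is also what the one-step pattern assumes.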
            

Our inline stage directions look good, but our block-level stage directions, i.e., those that don’t fall within a character’s lines, are erroneously wrapped in <speech> tags. Since they don’t indicate actual speech, we want to remove those tags. We accomplish this by searching for <speech>(<stage>.+</stage>)</speech> and simply replacing it with \1. We didn’t use dot matches all here, since all the block-level stage directions occur on one line.

If we wanted, instead, to account for any block-level stage directions that fall on multiple lines by activating dot matches all we would make our otherwise greedy dot lazy by following our + repetition indicator with a ?. In this particular file, all block-level stage directions fall on a single line, so our search above locates them all.

Here’s what our output looks like at this point:

<stage>SUDHA enters</stage>

<speech>
<speaker>SUDHA</speaker>
<line>Amal!</line>
</speech>
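As a rough check of this unwrapping step, the Python sketch below runs both the single-line search we used and the lazy, dot-matches-all variant described above (re.DOTALL in Python). The sample is the excerpt shown here, with the stage direction still wrapped in <speech> tags; the variable names are our own.

```python
import re

text = (
    "<speech><stage>SUDHA enters</stage></speech>\n\n"
    "<speech>\n<speaker>SUDHA</speaker>\n<line>Amal!</line>\n</speech>\n\n"
)

# Without DOTALL the greedy dot cannot cross a newline, so each match
# stays on a single line.
single_line = re.sub(r"<speech>(<stage>.+</stage>)</speech>", r"\1", text)

# With DOTALL the dot can cross newlines, so we make it lazy with +?.
multi_line = re.sub(
    r"<speech>(<stage>.+?</stage>)</speech>", r"\1", text, flags=re.DOTALL
)

print(single_line)
print(single_line == multi_line)
```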

<q>

Next, we need to eliminate any quotation marks (" ") from the text and replace them with <q> tags, since they, like the brackets that surrounded stage directions, are pseudo-markup. Unlike the square brackets, we can’t find and replace these double quotation marks one-at-a-time, since both start and end quotes are straight, i.e., they aren’t the curly quotes we’re used to in reading texts. Here, we must use the dot matches all feature, because a number of quotations in this text fall on multiple lines. With dot matches all enabled, we searched for "(.+?)" and replaced it with <q>\1</q> to wrap each quotation in its own proper tags.

What does the question mark after the repetition indicator mean? As we said earlier, a dot followed by a repetition indicator is naturally greedy, i.e., it will capture as much text as it possibly can at one time. For example, if you enabled dot matches all to find these quotes and didn’t include the question mark, your Regex engine would return only one match, which would begin at the very first quotation mark and continue until the very last quotation mark in the whole document. When we add the question mark, we make our greedy dot lazy, which means we say: start at a double quotation mark, match any character at all (including newLines), and continue only until you find the next quotation mark in the text. Try this out yourself before moving on to see exactly how the question mark works in the context of greedy characters!
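One way to try this out is the small Python sketch below, which assumes re.DOTALL as the equivalent of dot matches all. The sample sentence is invented purely for illustration and does not come from the play.

```python
import re

# Invented sample with two quotations, the first spanning two lines.
sample = 'A "first\nquote" and a "second quote" here.'

# Greedy: one match, from the first quotation mark to the last.
greedy = re.findall(r'"(.+)"', sample, flags=re.DOTALL)

# Lazy: one match per quotation.
lazy = re.findall(r'"(.+?)"', sample, flags=re.DOTALL)

tagged = re.sub(r'"(.+?)"', r"<q>\1</q>", sample, flags=re.DOTALL)

print(greedy)
print(lazy)
print(tagged)
```

The greedy search returns a single item that swallows the text between the two quotations, while the lazy search returns each quotation separately.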


<act>

We’re near the end at this point, and all that remains is the tagging of both our acts. One important structural feature of the play is that each act begins with <speech>ACT {number}</speech> and ends with <speech>CURTAIN</speech>. In other words, we already have landmarks that indicate the boundaries of acts, so all we need to do is replace those tags with our own <act> tags. With dot matches all enabled, we searched for <speech>ACT ([I]+)</speech>(.+?)<speech>CURTAIN</speech> and replaced it with <act n="\1">\2</act>. Of all the text that we found in this search, we only wanted to keep the Roman numeral that indicated the act number (\1) and the textual content of the act (\2). Everything else we matched was either pseudo-markup (the ACT and CURTAIN labels) or <speech> tags that did not describe the text they contained.
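Here is a hedged Python sketch of the act-tagging search, with re.DOTALL standing in for dot matches all. The two-act sample is a toy arrangement of fragments from the excerpts above, not the real contents of either act, and the variable names are ours.

```python
import re

text = (
    "<speech>ACT I</speech>\n\n"
    "<stage>SUDHA enters</stage>\n\n"
    "<speech>CURTAIN</speech>\n\n"
    "<speech>ACT II</speech>\n\n"
    "<speech>\n<speaker>SUDHA</speaker>\n<line>Amal!</line>\n</speech>\n\n"
    "<speech>CURTAIN</speech>\n\n"
)

# Group 1 keeps the Roman numeral; group 2 lazily keeps everything up
# to the first CURTAIN that follows, so each act is matched separately.
acts = re.sub(
    r"<speech>ACT ([I]+)</speech>(.+?)<speech>CURTAIN</speech>",
    r'<act n="\1">\2</act>',
    text,
    flags=re.DOTALL,
)
print(acts)
```

If the + inside the second group were greedy, the single match would run from ACT I to the final CURTAIN and produce only one <act> element.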

Our method was a one-step solution that required no manual maintenance, but there were two alternative strategies that would get you the same solution with a bit more work. One alternative is to find and replace the beginning of each act, then separately find and replace the end of the act. This alternative is almost exactly like ours, except it would bypass all the textual content of the play and focus only on the tags.

The second alternative would be the most like examples that we used in class. In identifying where one portion of a file starts, we immediately know where the preceding portion of the file ended. In the context of this play, since we know where act two starts, we also know where act one ends. If we searched for <speech>ACT ([I]+)</speech> and replaced it with </act>\n<act n="\1">, we would correctly tag every act boundary except the start of the first act and the end of the last act. Since this is a two-act play, that means we would need to manually fix both of our acts in order to restore well-formedness in this document, by removing the extraneous </act> tag from the beginning of the first act and adding the missing </act> tag to the end of the second act.
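This second alternative, including the manual repair, might look like the following in Python. It is a sketch on a toy two-act sample of our own, and the repair lines simply imitate what you would do by hand. The CURTAIN blocks are left in place, since this alternative only touches the act titles.

```python
import re

text = (
    "<speech>ACT I</speech>\n\n<stage>SUDHA enters</stage>\n\n"
    "<speech>CURTAIN</speech>\n\n"
    "<speech>ACT II</speech>\n\n<speech>CURTAIN</speech>\n\n"
)

# Every act title closes the previous act and opens a new one.
milestones = re.sub(r"<speech>ACT ([I]+)</speech>", r'</act>\n<act n="\1">', text)

# Manual repair, done here in code: drop the first (extraneous) end tag
# and append the missing end tag after the last act.
fixed = milestones.replace("</act>\n", "", 1).rstrip("\n") + "\n</act>\n"

print(milestones)
print(fixed)
```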


Touching up

Our document’s structure is fully tagged at this point, with the exception of the metadata that we removed at the outset of our work. Now, we simply reintroduce and tag that metadata, wrap our whole file in a root element, and we are done!