Digital humanities

Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:53+0000

Regular expressions (regex)

What they are and why we use them

A regular expression (regex) is a pattern that can be used to match a string of text. Regular expressions are a standard feature of many programming languages that are used for text processing (and they have the inconvenient habit of being implemented ever-so-slightly differently in different languages). Fortunately, much of the core functionality is consistent, and the XPath peculiarities are described in Kay.

Regular expressions are often used in an XML environment in the following situations:

Autotagging. If you need to up-convert plain text to XML, you can often use regex syntax to identify patterns in the source document and replace them with markup.
XML-to-XML transformation. Given elements marked up the same way but with different content, you can often use regex syntax to treat them differently according to their content, and to change the markup accordingly.

Regex syntax is supported by three XPath functions: matches(), replace(), and tokenize(). It is also used by the XSLT element <xsl:analyze-string>. See Kay for details about how these are used, including examples. In the example below we use only matches(), but we make extensive use of all of these in our real work.
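
As a quick orientation, here is a minimal sketch of the three functions applied to literal strings. The sample strings are invented for illustration, and the stylesheet ignores its input, so it can be run against any XML document:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <!-- Runs against any input document; the input itself is ignored -->
    <xsl:template match="/">
        <!-- matches() tests whether the string contains the pattern: true -->
        <xsl:value-of select="matches('Chapter 12', '\d+')"/>
        <xsl:text>&#10;</xsl:text>
        <!-- replace() substitutes every match of the pattern: a b c -->
        <xsl:value-of select="replace('a  b   c', '\s+', ' ')"/>
        <xsl:text>&#10;</xsl:text>
        <!-- tokenize() splits the string at every match: one|two|three -->
        <xsl:value-of select="tokenize('one, two, three', ',\s*')" separator="|"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```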

Regex in up-conversion

Assume that we’re given a plain-text file like the Project Gutenberg EBook of The Blithedale Romance, by Nathaniel Hawthorne. In this case Project Gutenberg makes the same book available in HTML, and in Real Life we’d probably convert from HTML to XML rather than from plain text, but since there are situations where all we have is plain text, we’ll pretend that’s the case here. A lot of the markup we might introduce for analytical purposes will require us to touch every word of the text, but we can autotag chapter titles, paragraphs, and quotations using regex tools. We can also autotag entire chapters using an XSLT-based XML-to-XML transformation (which we won’t illustrate here).

Preliminaries

Open the file in <oXygen/> as a plain text file. You can either open it from the <oXygen/> menu or create a new text file, copy the text from your browser, and paste it in. We then begin by cutting out the front matter (before the first chapter title) and the back matter (after the last line of the text of the novel). We might want to mark those up eventually and reintroduce them into the XML as metadata, so we can save them to a separate file, but we’ll have to do that manually. Since all we intend to autotag is the actual text of the novel, we start by stripping everything else out of the file manually.

Autotagging paragraphs

What’s left is a bunch of chapter titles and paragraphs, separated from one another by a blank line, and we can use a regex to find all blank lines and replace them with the sequence </p><p>. To perform a regex full-text search-and-replace in <oXygen/>, hit control-f (Windows) or command-f (MacOS) and check Regular expression, Dot matches all, and Wrap around. In the Text to find field, enter \n\n. \n is the regex for a new-line character, so this expression will find two new-line characters in a row, that is, a blank line. In the Replace with field, enter </p><p>. Hit the Replace all button to run the transformation. Inserting this markup has the effect of treating the blank line as signaling the end of the preceding XML paragraph and beginning of the next.
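
The same blank-line logic can also be sketched in XSLT 2.0 rather than in the Find/Replace dialog. This is an alternative for comparison, not the procedure described above: instead of replacing blank lines with tags, it splits the text at blank lines with tokenize() and wraps each piece in a <p>. The file name blithedale.txt and the root element <novel> are assumptions made for illustration:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes"/>
    <!-- Hypothetical file name; adjust to wherever you saved the plain text -->
    <xsl:variable name="text" select="unparsed-text('blithedale.txt')"/>
    <!-- Runs against any input document; the input itself is ignored -->
    <xsl:template match="/">
        <!-- Hypothetical root element, needed so that the output is well formed -->
        <novel>
            <!-- Split at blank lines; \r? tolerates Windows line endings -->
            <xsl:for-each select="tokenize($text, '\r?\n\r?\n')">
                <p>
                    <xsl:value-of select="."/>
                </p>
            </xsl:for-each>
        </novel>
    </xsl:template>
</xsl:stylesheet>
```

Because every piece is wrapped as it is created, this approach does not require adding the first start tag and the last end tag by hand.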

Note that this transformation depends on the two end-of-line characters being immediately adjacent to each other. If what looks like a blank line to you actually has (invisible) spaces or tabs, the pattern won’t match and the replacement won’t happen. If you think that might be the case, you can make those characters visible by going into the <oXygen/> preferences (Tools → Preferences → Editor → Edit modes → Text) and checking the boxes labeled Show TAB/NBSP/EOL/EOF marks and Show SPACE marks. If you do have whitespace characters getting in the way, you can use regex processing to replace them: the pattern \s+ matches one or more white-space characters. Because \s also matches new-line characters, use [ \t]+ if you want to match only spaces and tabs.
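
A minimal XPath sketch of that cleanup, assuming the stray characters are spaces and tabs at the ends of lines. The sample string is invented: it holds the word one, then a line containing only a space and a tab, then the word two.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- &#10; is a new-line character and &#9; is a tab -->
        <xsl:variable name="sample" select="'one&#10; &#9;&#10;two'"/>
        <!-- Drop spaces and tabs that sit directly before a line ending -->
        <xsl:variable name="cleaned" select="replace($sample, '[ \t]+(\r?\n)', '$1')"/>
        <!-- The blank line is now truly empty, so \n\n finds it: true -->
        <xsl:value-of select="matches($cleaned, '\n\n')"/>
    </xsl:template>
</xsl:stylesheet>
```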

You’ll have to add the <p> start tag before the first paragraph and the </p> end tag after the last one manually, but you can enter all of the rest automatically with a single regex-aware search-and-replace operation. At this point the document looks like a bunch of <p> elements. Some are empty and some contain chapter titles, rather than paragraphs. We’ll fix that below.

Autotagging quotes

Quotes in this text are delimited by straight double quotation marks (the " character). This means that a quotation can be defined as the text that occurs between two double-quote characters, starting from the first. This is a fragile strategy because plain text files found on the Internet may be missing occasional punctuation marks because of careless typing or proofreading, and since the strategy depends on odd-numbered quotation mark characters falling at the beginning of a quotation and even-numbered ones falling at the end, a single missing mark can throw off the count for the rest of the file. The text we’re using in this exercise doesn’t have that problem, but should you encounter it in the wild, you’ll need to run the search-and-replace described below, find where the count goes off, fix it manually in the input, and then rerun the search-and-replace.

The regex for matching a quotation is "(.*?)". This matches a double quotation mark character followed by something else followed by another double quotation mark character. The something else is parenthesized for a reason we’ll see shortly; what's important now is that the parentheses are part of the regex syntax, and the processor does not look for parenthesis characters in the document. Inside the parentheses the dot means any character, the asterisk means zero or more instances of whatever precedes it, and the question mark means don’t be greedy. Let’s explore these:

The dot in regex syntax usually means any character except a new line. Since a quotation may span multiple lines (that is, may include a new-line character), we checked Dot matches all in the Find/Replace dialog box. Once we’ve done that, dot matches absolutely any character.
The asterisk has the same meaning in regex in this position as it has in Relax NG: it means zero or more instances of whatever precedes it. The sequence .*, then, means zero or more characters, whatever they may be.
By default, regex patterns are greedy, which means that they match the longest possible string. Without the question mark, the pattern "(.*)" would match all of the text in the entire book between the very first quotation mark and the very last. The question mark makes the .* lazy: it matches the shortest sequence of characters that still lets the rest of the pattern match. That is, this part of the pattern will stop as soon as it meets another quotation mark. You can test this by entering the two versions in the Text to find box and clicking the Find button. The one with the question mark will find each individual quotation. The one without will find one very long item, starting at the very first double quotation mark in the text and ending at the last.
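
The difference between the greedy and the lazy pattern can also be seen with the XPath replace() function. This is a sketch for experimentation, not part of the <oXygen/> procedure: the "s" flag is the XPath counterpart of Dot matches all, XPath writes the captured group as $1, and the square brackets in the replacement are only there to make the boundaries of each match visible.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <!-- Defined as element content so that the quotation marks need no escaping -->
    <xsl:variable name="sample">"Mr. Coverdale," said he softly, "can I speak with you a moment?"</xsl:variable>
    <xsl:variable name="greedy">"(.*)"</xsl:variable>
    <xsl:variable name="lazy">"(.*?)"</xsl:variable>
    <xsl:template match="/">
        <!-- Greedy, one long match:
             [Mr. Coverdale," said he softly, "can I speak with you a moment?] -->
        <xsl:value-of select="replace($sample, $greedy, '[$1]', 's')"/>
        <xsl:text>&#10;</xsl:text>
        <!-- Lazy, two short matches:
             [Mr. Coverdale,] said he softly, [can I speak with you a moment?] -->
        <xsl:value-of select="replace($sample, $lazy, '[$1]', 's')"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```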

The Text to find, then, is "(.*?)". The Replace with text you should enter is <q>\1</q>. The tags will be entered literally into the output, just as happened with the paragraph tags during the first search-and-replace operation. The \1 means insert whatever text matched the part of the match pattern between the first set of parentheses. In this case we have only one set of parentheses, so it matches the text that was between the two double quotation marks in the source text, and we copy that and insert it into the output. What we’re doing, then, is copying that text from the input to the output, but where we had quotation marks wrapped around it in the input, we’re replacing those with <q> tags in the output. You can have as many parenthesized expressions as you’d like, and you can use \0 to insert the entire matched pattern (which in this case would include the opening and closing quotation marks plus all the text between them).
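
If the text is already XML (for example, after the paragraph tagging above), the same replacement can be sketched with <xsl:analyze-string>, which creates real <q> elements instead of typing tags into a text file. This is an alternative under stated assumptions, not the procedure described here: it assumes the quotations sit in the text of <p> elements, and because it works one text node at a time, it will not catch a quotation that continues across a paragraph boundary.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <!-- Copy everything that has no more specific rule -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Look for quotations inside the text of each paragraph;
         &quot; stands for the straight double quotation mark -->
    <xsl:template match="p/text()">
        <xsl:analyze-string select="." regex="&quot;(.*?)&quot;" flags="s">
            <xsl:matching-substring>
                <!-- regex-group(1) plays the role of \1 in the dialog -->
                <q>
                    <xsl:value-of select="regex-group(1)"/>
                </q>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <!-- Text outside quotation marks is passed through unchanged -->
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

Note that the @regex attribute is an attribute value template, so a pattern that contains curly braces would need them doubled.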

The input text includes some quotations that are logically divided, such as

"Mr. Coverdale," said he softly, "can I speak with you a moment?"

From a linguistic or rhetorical perspective this is one utterance, but our transformation will treat it as two, creating two <q> elements, one for each pair of quotation marks. If you want to encode that these are associated with each other, you’ll need to do that manually. Alternatively, you could use a regular expression strategy to catch all apparent quotations that begin with a lower-case letter, infer that they represent continuations of the immediately preceding quotation, and insert the markup automatically. This wouldn’t catch all of the broken quotations (for example, a continuation might begin with an upper-case letter that represents someone’s name), but it could still save a large amount of time when compared to manually uniting all split quotations.
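
One way to sketch the lower-case heuristic, assuming the quotations have already been tagged as <q> elements: a template rule that picks out every <q> whose text begins with a lower-case letter and flags it. The attribute name type and the value continuation are hypothetical, chosen only for illustration, and the rule inherits the limitations just described.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <!-- Copy everything that has no more specific rule -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- A quotation that starts with a lower-case letter is probably a continuation -->
    <xsl:template match="q[matches(., '^[a-z]')]">
        <!-- Hypothetical attribute name and value -->
        <q type="continuation">
            <xsl:apply-templates/>
        </q>
    </xsl:template>
</xsl:stylesheet>
```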

We now have paragraphs and quotations tagged automatically. Let’s get rid of empty paragraphs and change the markup of chapter titles, since they aren’t really paragraphs.

The XSLT identity transformation

Transformation from XML to XML is a common component of preparing texts for publication and analysis in digital humanities. The identity transformation is an XSLT transformation that converts an XML document to itself by writing back out exactly what it reads. It looks like:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

The identity transformation uses a single template rule that matches all attributes (@*) and everything on the child axis (node()), of which we usually care most about elements and text() nodes. We have to specify attributes separately because they are on the attribute axis, and not the child axis, which means that node() doesn’t match them because it is short for child::node(). The <xsl:copy> element makes a shallow copy of whatever it matches, which means that it copies the node, but not its contents. When it matches a <p> element, for example, it copies it (creates a <p> node in the output document), but it doesn’t automatically copy the attributes of the original <p> or any text or other elements inside it. Instead, those components are processed by the <xsl:apply-templates select="@*|node()"/> inside the newly-created copy of the original node.

If all one did with an identity transformation was run it as is, it would serve no purpose, since one could copy the original document in simpler ways. The point of the identity transformation is that it can serve as a default, letting you make small changes only where you want them. We’re going to use an identity transformation to output exactly what we input, except that we’ll treat empty paragraphs and chapter titles differently.

Empty paragraphs

We can match an empty paragraph by using the XPath matches() function. This function takes two arguments: the string that is being searched for the match and the pattern for which you’re searching. It is similar to the contains() function, except that contains can search only for a string, while matches() can search for a regex pattern. We could, for example, easily find all paragraphs that contain the name Cecilia using contains(), but to find, say, all paragraphs that contain a three-digit number we’d want to use a regex pattern.
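
The contrast between the two functions can be sketched as follows. The stylesheet only counts paragraphs, and the pattern for a three-digit number is one possible formulation: it requires that the three digits not be part of a longer number.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:template match="/">
        <!-- contains() looks for a fixed string -->
        <xsl:value-of select="count(//p[contains(., 'Cecilia')])"/>
        <xsl:text>&#10;</xsl:text>
        <!-- matches() looks for a pattern: exactly three digits, with a
             non-digit (or the start or end of the string) on either side -->
        <xsl:value-of select="count(//p[matches(., '(^|\D)\d{3}(\D|$)')])"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```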

When used as the value of the @match attribute in an <xsl:template> rule, the XPath pattern p[matches(.,"^$")] will catch any <p> element that matches the empty string, that is, any empty paragraph. The regex character caret (^) matches the beginning of a string and the dollar sign ($) matches the end. Since there is nothing between them, this regex will match any string that has nothing between its beginning and its end, that is, any empty string. We could, of course, have found empty paragraphs with p[string-length(.) eq 0], so in this case we could have gotten by without a regex.

We can now augment our identity transformation by adding a second template rule, just for empty paragraphs (the last template rule below):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
</xsl:stylesheet>

The new template matches empty paragraphs, and because it itself is an empty element, it instructs the XSLT transformation engine to do nothing with whatever it matches. This has the effect of consuming all empty paragraphs and throwing them away.

It might appear as if both template rules, the new one and the original identity one, match empty paragraphs. After all, empty paragraphs are also nodes, so they fit the XPath pattern in the @match attribute of the identity template rule. It turns out that XSLT has built-in priorities to resolve potentially ambiguous matches. The details are complicated, but the short version is that the more specific pattern has priority, and since the new rule targets empty paragraphs much more specifically than the identity one, it gets to handle the empty paragraphs. An XSLT processor will report an ambiguity in priority that it can’t resolve on its own, and you can use the @priority attribute to resolve an ambiguity or to override the default if it doesn’t give the behavior you want.
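
A minimal sketch of setting the priority explicitly. It isn’t needed in this stylesheet, because the defaults already do what we want, and the value 2 is an arbitrary choice for illustration; any number higher than the priorities of the competing rules would do.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <!-- Default priority of this pattern: -0.5 -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Default priority of a pattern with a predicate: 0.5.
         The explicit @priority overrides that default. -->
    <xsl:template match="p[matches(.,'^$')]" priority="2"/>
</xsl:stylesheet>
```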

Chapter titles

Chapter titles are currently tagged the same way as paragraphs, but their textual content has certain properties that can be distinguished with a regular expression. All chapter titles in this text happen to begin with an upper-case roman numeral, followed by a single period, a single space, and then a string of upper-case letters mixed with hyphens, commas, straight apostrophes, and spaces. No true paragraph matches that pattern; they all have lower-case letters, and many have other punctuation.

The pattern in question can be matched by p[matches(.,"^[IVX]+\. [A-Z \-,']+$")]. Here’s how it works:

The caret and dollar sign mean that the match starts at the beginning of the string and ends at the end. Without those anchors, if the pattern happened to occur in the middle of the paragraph, the match would succeed, even though a paragraph that contains something that looks like a chapter title in the middle of other text probably isn’t really a chapter title. With the anchors, it succeeds only if the match constitutes the entirety of the paragraph.
Characters in square brackets are a character group, which means any character in that group. There are twenty-nine chapters in this novel, which can be expressed with the roman numeral digits I, V, and X. The plus sign has the same meaning as in Relax NG; it means match a string that consists of at least one instance of any of these three characters. That, then, takes care of the roman numeral; [IVX]+ will match any sequence of at least one of those three characters in any combination and any order.
The roman numeral is followed in the text by exactly one period and then exactly one space. As was noted above, the dot in regex has a special meaning (any character), so to have it mean a literal period character we escape it by preceding it with a backslash (\). The space character that follows means a literal space.
The textual part of a chapter title in this text is all upper case and may contain spaces and certain punctuation. We can represent this set of characters with another character group, and this time we use a character range as a short cut. The A-Z means any character between A and Z, inclusive. The group, then, inside the usual square brackets, contains all of the uppercase characters, the space, the hyphen (escaped with a backslash so that it won’t be mistaken for part of a range), the comma, and the single straight apostrophe. The plus sign after the square brackets means that one or more of the characters in the group must be found for the pattern to match.
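
Before building the pattern into a transformation, it can be tried against a few strings. In the sketch below the first two strings are chapter titles from the novel and the last two are invented counterexamples. The pattern is stored in a variable only to avoid having to escape the apostrophe inside an attribute value.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="text"/>
    <xsl:variable name="regex">^[IVX]+\. [A-Z \-,']+$</xsl:variable>
    <xsl:template match="/">
        <!-- Inside an XPath string literal, a doubled apostrophe stands for one apostrophe -->
        <xsl:for-each select="('I. OLD MOODIE',
                               'XXIX. MILES COVERDALE''S CONFESSION',
                               'I. Old Moodie',
                               'See chapter I. OLD MOODIE')">
            <!-- Reports true, true, false, false, in that order -->
            <xsl:value-of select="concat(., ': ', matches(., $regex), '&#10;')"/>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

The third string fails because it contains lower-case letters, and the fourth fails because of the anchors: the title-like text is there, but it isn’t the entire string.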

If we’ve analyzed the character properties of chapter titles correctly, then, a <p> element that matches the pattern will be a chapter title, and we can modify our XSLT stylesheet to change the markup to an <h2> by adding a new template rule for paragraphs that match this pattern, as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
    <xsl:template match="p[matches(.,&quot;^[IVX]+\. [A-Z \-,&apos;]+$&quot;)]">
        <h2>
            <xsl:apply-templates select="@*|node()"/>
        </h2>
    </xsl:template>
</xsl:stylesheet>

When we poked our new regex into the value of the @match attribute of the template rule, we found ourselves short of quotation marks. XPath and XSLT don’t care about the difference between single and double quotation marks when used as delimiters, as long as whenever we use a pair the two parts match, so if we need to wrap some sort of quotation marks around a string that contains a single quotation mark, we can use the double ones for that purpose, and vice versa. The problem in this case is that the value of the @match attribute requires a set of quotation marks (all attribute values must be quoted in XSLT), the entire regex requires a set of quotation marks (it is part of the syntax of the matches() function that the regex pattern must be a string, and without the quotation marks it wouldn’t be a string), and we need a single quotation mark inside the regex. That is, we need quotation marks for three purposes and the character set just gives us two (we aren’t allowed to use curly quotes). XML (and therefore XPath and XSLT) work around this by permitting us to use entity representations of the single and double straight quotation marks, and these can be recognized as different from the literal characters. We’ve seen entities before; we used &lt;, &gt;, and &amp; to represent <, >, and &, respectively. XML also provides &quot; for the double straight quotation mark (") and &apos; for the single straight quotation mark, or apostrophe (').

As an alternative, we could also have defined the regex pattern as an XSLT variable separately from the matches() function, using <xsl:variable> (see Kay for details), and then used the variable as the second argument to the matches() function:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:variable name="regex">^[IVX]+\. [A-Z \-,']+$</xsl:variable>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
    <xsl:template match="p[matches(.,$regex)]">
        <h2>
            <xsl:apply-templates select="@*|node()"/>
        </h2>
    </xsl:template>
</xsl:stylesheet>

Here we define the pattern as a variable to which we assign the name $regex. Variable names are assigned without a dollar sign (in the <xsl:variable> element), but references to them include a dollar sign (in this case, as an argument to the matches() function). Because the variable can be defined without the wrapper quotation marks required by an attribute value, with this strategy we don’t run out of quotation marks. See Kay for details. Which strategy you use is a matter of personal preference.
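
A third option, sketched here as a template rule that could take the place of the chapter-title rule above, relies on a feature of XPath 2.0 string literals: inside a string delimited by apostrophes, two apostrophes in a row stand for one literal apostrophe. This works only with an XPath 2.0 (or later) processor.

```xml
<!-- The doubled apostrophe inside the character group is read as a single one -->
<xsl:template match="p[matches(., '^[IVX]+\. [A-Z \-,'']+$')]">
    <h2>
        <xsl:apply-templates select="@*|node()"/>
    </h2>
</xsl:template>
```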

Conclusion

Regular expressions may seem more complicated than they really are for a couple of reasons. First, they are very powerful, with a large number of features, each of which has a notation that must be learned. The good news is that for most purposes you need only a small subset of the available features. We don’t have them all memorized, either; we know the ones we use frequently, and we look up the others when needed. Second, the notation is cryptic. There’s no getting around the fact that learning the meaning of, say, caret and dollar sign is pretty much a matter of brute-force memorization. But the silver lining here, again, is that you don’t have to memorize very much because you can look up the details as you need them.