Digital humanities


Authors: Andrew Nitz and Simon Brown
Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License]
Last modified: 2016-01-24T18:36:31+0000


Regex tips

Regular expressions (regex) are a powerful tool for autotagging texts, that is, identifying and using patterns to find and replace strings in order to add markup to texts efficiently. See below for some regular expressions commonly used in our projects.

Important regex features

Repetition indicators: +, ?, *

Regex uses the same repetition indicators that we use when writing schemas in Relax NG, and they have the same meaning in both.

To override the repetition-indicator meaning of the plus sign, question mark, or asterisk and match a literal plus sign, question mark, or asterisk, precede it with a backslash. (Overriding the meaning of a special character in this way is called escaping the character.) The regex Great\? matches the literal string Great followed by a literal question mark.
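The repetition indicators and the effect of escaping can also be tried outside <oXygen/>. What follows is a minimal sketch that assumes Python's re module (our choice for illustration, not part of the course toolchain); the sample strings are invented.

```python
import re

# ? means zero or one, * zero or more, + one or more of the preceding item.
optional_u = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]
star_b = bool(re.fullmatch(r"ab*c", "ac"))   # * accepts zero b characters
plus_b = bool(re.fullmatch(r"ab+c", "ac"))   # + requires at least one b

# Escaping the question mark makes it a literal character.
sample = "Great? Great! Great?"
escaped = re.findall(r"Great\?", sample)
unescaped = re.findall(r"Great?", sample)    # here ? makes the final t optional

print(optional_u)  # [True, True, False]
print(star_b, plus_b)  # True False
print(escaped)     # ['Great?', 'Great?']
print(unescaped)   # ['Great', 'Great', 'Great']
```

Without the backslash, the pattern never matches the question mark itself; it simply treats the t as optional.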

The dot: .

In regex, the dot corresponds to any character except a newline character. This exception can be overridden in <oXygen/> by checking Dot matches all in the search-and-replace dialog, in which case the dot will match absolutely all characters, including newlines. To see the difference, open a document in <oXygen/> and open the search-and-replace dialog. Type the expression .* in the Text to find window, and the string obdurodon in the Replace with window. Then hit Replace all. You’ll notice that every line of the document now simply reads obdurodon. Each line is replaced separately because the strings of matching text (characters that are not newlines) are separated by newlines, which aren’t part of the match. This means that each string of text between newline characters is a separate successful match, and each is replaced individually by the replacement string.

Undo this change with control-z in Windows or command-z in MacOS and check the Dot matches all box in the Options section of the menu. Now run Replace all again. This time, the entire document has been replaced with exactly one obdurodon string. This is because the dot is now matching newline characters as well, which means that the expression .* will match everything in the document.
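The same contrast can be reproduced in a short script. This is only a sketch, assuming Python's re module, where the re.DOTALL flag plays the role of the Dot matches all checkbox. It uses .+ rather than .* because Python also reports empty matches for .*, which would obscure the point.

```python
import re

text = "first line\nsecond line\nthird line"

# Default: the dot stops at each newline, so every line is a separate match.
per_line = re.findall(r".+", text)

# With re.DOTALL the dot also matches newlines, so there is a single match.
whole = re.findall(r".+", text, flags=re.DOTALL)

print(per_line)  # ['first line', 'second line', 'third line']
print(whole)     # ['first line\nsecond line\nthird line']
```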

Escaping special characters

Regex has two types of reserved, special characters, or metacharacters, which do not always have their literal string-value meaning, and are instead used to define patterns within expressions.

The first type of metacharacter is one that has a special meaning unless you escape it by preceding it with a backslash. For example, the plus sign (+) doesn’t match a literal plus sign; what it does instead is tell the expression to match whatever subpattern precedes the plus sign one or more times (that is, it has the same repetition-indicator meaning as in Relax NG). The pattern ab+c will match a literal a character followed immediately by one or more b characters and then one c character. The fact that the plus sign is a metacharacter means, among other things, that it cannot be used by itself to match a literal plus sign character. To use the plus sign or similar metacharacters as literal string characters (that is, to match them in a string rather than use them for their syntactic metacharacter purpose within regex), escape them by preceding them with a backslash (\). For example, the pattern ab\+c will match the literal sequence of four characters that corresponds to a, then b, then a plus sign, and then c.

The second type of metacharacter is one that by itself has its literal meaning, but that acquires a special meaning when preceded by a backslash. For example, a d in a regex usually matches a literal d, but if you precede it with a backslash, it matches a digit (0–9). Thus, because + is the first type of metacharacter and d is the second type, + is a repetition indicator while \+ matches a literal plus sign, and d matches a literal d while \d matches a digit.
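The four cases (plus sign and d, each with and without a backslash) can be compared side by side. The sketch below assumes Python's re module and an invented sample string.

```python
import re

sample = "ab+c abbc d7 dd"

plain_plus = re.findall(r"ab+c", sample)     # + as a repetition indicator
escaped_plus = re.findall(r"ab\+c", sample)  # \+ as a literal plus sign
plain_d = re.findall(r"d", sample)           # d as a literal letter
escaped_d = re.findall(r"\d", sample)        # \d as any digit

print(plain_plus)    # ['abbc']
print(escaped_plus)  # ['ab+c']
print(plain_d)       # ['d', 'd', 'd']
print(escaped_d)     # ['7']
```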

Backreferencing

A particularly useful feature of regex, especially for autotagging, is the ability to backreference matched strings. This is done by wrapping parts of or entire expressions in parentheses (( )), which captures whatever is matched between the parentheses, and writing the captured backreference into the replacement string by using a numerical value preceded by a backslash. The numbers associated with backreferences in an expression begin at one, and are ordered by appearance of the opening parenthesis. For instance, the expression ^.*$ will match an entire line (as long as dot-all mode is not checked). If wrapped in parentheses as ^(.*)$, it can be backreferenced in the Replace with window, using the expression \1. This means, for example, that you can wrap each line of your document in <p> tags by capturing the contents of the line with parentheses and using <p>\1</p> as the replacement value.

A regex processor automatically captures the entire matched pattern, even without parentheses, and you can insert it into a replacement string with \0. For example, replacing ^.*$ with <line>\0</line> will wrap <line> tags around each line of the input document, which can be handy for autotagging line-oriented input, such as poetry.
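Both kinds of backreference can be sketched in Python, with two differences from the <oXygen/> dialog that are worth labeling: ^ and $ anchor at each line only when the re.MULTILINE flag is set, and Python writes the entire match as \g<0> rather than \0. The two-line sample is invented.

```python
import re

text = "Who's there?\nLong live the king!"

# \1 inserts whatever the first pair of parentheses captured.
paragraphs = re.sub(r"^(.*)$", r"<p>\1</p>", text, flags=re.MULTILINE)

# \g<0> inserts the entire match, so no parentheses are needed.
lines = re.sub(r"^.*$", r"<line>\g<0></line>", text, flags=re.MULTILINE)

print(paragraphs)
print(lines)
```

Each input line comes out wrapped in its own pair of tags, first as <p> and then as <line>.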

Adjusting for XML reserved characters (<, >, &)

There are certain characters that are not permitted as plain text characters within your XML markup, and replacing these with their entity representation is a crucial first step in autotagging a text.

Replace & with &amp;

In XML the ampersand is a reserved character that is used to indicate the beginning of a character entity (such as &lt; for <) or a numeric character reference. A literal ampersand is not permitted in XML text; if you want to represent an ampersand, you have to replace it with the &amp; entity. You must replace the ampersand before anything else; otherwise any ampersands that you introduce while replacing < and > (see below) will themselves be replaced, which isn’t what you want.

Replace < with &lt;

The angle brackets are used to delimit XML tags, and therefore cannot appear as literal characters, which means that all literal < characters must be replaced by &lt;. XML parsers know that a > isn’t a tag delimiter when it isn’t preceded by a <, but we nonetheless usually replace all > characters with &gt;.
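Why the order matters is easiest to see by running the replacements both ways. The following is a minimal sketch in Python; the function names and the sample string are hypothetical, chosen only for illustration.

```python
def escape_xml(text: str) -> str:
    """Replace XML reserved characters, ampersand first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_xml_wrong_order(text: str) -> str:
    """The same replacements with the ampersand last, to show the problem."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("&", "&amp;")
    return text


sample = "AT&T: 3 < 5"
print(escape_xml(sample))              # AT&amp;T: 3 &lt; 5
print(escape_xml_wrong_order(sample))  # AT&amp;T: 3 &amp;lt; 5
```

In the second result the ampersand that was introduced as part of &lt; has itself been escaped, so the text no longer represents a less-than sign.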

Convenient match patterns

\n\n

Matches two consecutive newline characters, that is, locates blank lines. Useful for identifying and delimiting groups of text that are divided structurally using newlines, such as paragraphs in prose or stanzas in a poem that may be separated by a blank line in plain text representations. Be aware that if there are space characters on the blank line, this pattern won’t match, since it looks only for two immediately consecutive new-line characters.
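The caveat about spaces on the blank line can be seen in a small sketch, assuming Python's re module and invented sample text. The more tolerant second pattern is our own addition, not one the course prescribes.

```python
import re

text = "First paragraph,\nsecond line.\n\nSecond paragraph.\n \nThird paragraph."

# Strict: only two immediately consecutive newlines count as a boundary.
strict = re.split(r"\n\n", text)

# Tolerant variant: also allow spaces or tabs on the otherwise blank line.
tolerant = re.split(r"\n[ \t]*\n", text)

print(len(strict))    # 2
print(len(tolerant))  # 3
```

The strict pattern misses the second boundary because the line between the last two paragraphs contains a space.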

^.*$

Matches an entire line from initial to final character (the leading ^ means anchor the beginning of the match to the beginning of the line and the trailing $ means anchor the end of the match to the end of the line). This pattern can be used as a building block to find lines that contain a specific substring by using .* twice with the substring in between. For example, ^.*Hamlet.*$ will match a line containing the exact substring Hamlet no matter what (if anything) precedes or follows it on the line.

Note that this is different from matching Hamlet, which will match the string Hamlet wherever it occurs, but it won’t match the entire line that contains that string. It’s easy for humans to overlook the difference, but to a computer, matching a string and matching an entire line that contains that string are not at all the same thing.

^string.*$ and ^.*string$

Matches a line beginning or ending, respectively, with a specific string (shown in the pattern as the placeholder string). For example, ^Hamlet.*$ will find a line beginning with the string Hamlet, and ^.*Hamlet$ will find a line ending with it.
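The difference between matching a string, matching the line that contains it, and matching a line that begins with it can be sketched as follows. The example assumes Python's re module (with re.MULTILINE so that ^ and $ anchor at each line) and three invented lines.

```python
import re

text = "Enter Hamlet and Horatio\nExit Ghost\nHamlet, thou art slain"

substring_only = re.findall(r"Hamlet", text)
whole_lines = re.findall(r"^.*Hamlet.*$", text, flags=re.MULTILINE)
starts_with = re.findall(r"^Hamlet.*$", text, flags=re.MULTILINE)

print(substring_only)  # ['Hamlet', 'Hamlet']
print(whole_lines)     # ['Enter Hamlet and Horatio', 'Hamlet, thou art slain']
print(starts_with)     # ['Hamlet, thou art slain']
```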

".*?"

Matches everything inside a pair of quotation marks. Remember that by default the dot doesn’t match a new-line character, so this pattern will match quotes only if they start and end in the same line, that is, if there is no new-line character between the opening and closing quotation-mark character. If you want to match quotes that may span multiple lines, as may happen in a prose text, you will need to check Dot matches all. The .* means match zero or more characters, except for new-lines (unless Dot matches all is checked). The trailing question mark makes the match non-greedy; instead of matching the longest possible stretch between quotation marks (possibly gobbling up multiple independent quoted phrases) it will take the shortest match that fits the pattern. This is important in situations where there may be two separate quotations in a single line. The non-greedy match will match each quotation separately, which is what you want. A greedy match would assume that there was just one quotation, and would capture all of the text from the beginning of the first quotation through the end of the second.

When you replace literal quotation marks in your input with tags (such as the HTML <q> tag), don’t forget to remove the original quotation marks. The original quotation marks are pseudo-markup, and serve to indicate a quotation in a context where markup isn’t available. Once you have real markup, you typically don’t want to retain the literal quotation marks. You can effect the change by matching "(.*?)" and replacing it with <q>\1</q>. You use the parentheses to capture the text between the original quotation marks, and when you write that captured text into the replacement output, you effectively throw away the original literal quotation marks.
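Greedy and non-greedy matching give visibly different results on a line with two quotations. The sketch below assumes Python's re module and an invented sentence.

```python
import re

line = 'He said "yes" and she said "no" at once.'

# Greedy: .* runs to the last quotation mark on the line.
greedy = re.sub(r'"(.*)"', r"<q>\1</q>", line)

# Non-greedy: .*? stops at the first closing quotation mark it can.
non_greedy = re.sub(r'"(.*?)"', r"<q>\1</q>", line)

print(greedy)      # He said <q>yes" and she said "no</q> at once.
print(non_greedy)  # He said <q>yes</q> and she said <q>no</q> at once.
```

Because the quotation marks sit outside the parentheses, they are matched but not written back into the output.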

\d{4}

Matches four consecutive digits, such as a year. The number inside the curly braces is the exact number of times the preceding item must be repeated in order for the match to succeed. To add a greater degree of constraint, specific digits can be used, followed by a specific number of variable digits to match the pattern. For instance, if you’d like to tag all years in a text without mistakenly tagging non-year strings of four digits, and you know that all years will fall in the twentieth century (and thus begin with 19), you can use 19\d{2} as your match pattern. This might match a four-digit number that isn’t a twentieth-century year and that nonetheless happens to begin with 19, so it isn’t absolute protection against false positives, but at least it can limit them. Note that the pattern \d{4} will also match the first four digits of a five-digit number, etc., and to avoid that you can constrain your pattern still further by specifying that the fourth digit cannot be followed by a digit. There are a few ways to do this and it depends on where you’re doing the processing, so ask us if the situation arises and we’ll show you how.
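The false positives described above are easy to reproduce. In the sketch below, which assumes Python's re module and an invented sentence, the third pattern uses lookarounds to insist that the four digits are neither preceded nor followed by another digit. That is only one possible approach, and whether it is available depends on the environment in which you run your regex.

```python
import re

text = "Born in 1923, catalogued as item 19485, reprinted in 2004."

any_four = re.findall(r"\d{4}", text)       # any four consecutive digits
twentieth = re.findall(r"19\d{2}", text)    # four digits beginning with 19
# Assumption: lookbehind and lookahead are supported (they are in Python).
isolated = re.findall(r"(?<!\d)19\d{2}(?!\d)", text)

print(any_four)   # ['1923', '1948', '2004']
print(twentieth)  # ['1923', '1948']
print(isolated)   # ['1923']
```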

\d{1,2}[./-]\d{1,2}[./-]\d{2,4}

Matches dates in the dd-mm-yyyy or mm-dd-yyyy format, allowing single- or double-digit numbers for days and months, and numeric strings of length two to four for the year. The delimiter can be a period (.), forward slash (/), or dash (-). You don’t need to escape the dot here because the processor knows that a dot inside a character class (that is, inside square brackets) is a literal dot.
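A quick test shows both what the date pattern finds and one thing to watch for. The sketch assumes Python's re module; the sentence is invented.

```python
import re

text = "Letters dated 3/14/1865, 14.03.65 and 1865-03-14 survive."

dates = re.findall(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}", text)

print(dates)  # ['3/14/1865', '14.03.65', '65-03-14']
```

The last item is a partial match: the pattern does not cover the yyyy-mm-dd format, but it still finds the tail of such a date, so results should be proofread when a text mixes date formats.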

\s[XIVLCDM]+\s

Matches Roman numerals by matching a sequence of one or more of the capitalized letters in the brackets, in any order, delimited by white space on either side. This expression will match the pronoun I, which you don’t want but can’t avoid, so you’ll need to proofread. If you’re lucky, in Real Life you may find that Roman numerals in your text are followed by periods (as in a table of contents), while the personal pronoun I typically is not followed by a period. You may also find that Roman numerals occur in restricted positions, such as chapter titles; if a chapter title must begin with a Roman numeral and you can identify a chapter title in some other way, you know that an initial lone I there is a Roman numeral, and not a pronoun.
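The pronoun problem, and the way a following period can help, can be sketched like this. The example assumes Python's re module; the sample strings are invented, and parentheses are added to the pattern so that only the numeral, not the surrounding white space, is reported.

```python
import re

prose = "Chapter IV is where I turn to section XII of the text."
candidates = re.findall(r"\s([XIVLCDM]+)\s", prose)

# Variant for numerals followed by a period, as in a table of contents.
contents = "Contents: I. Old Moodie, II. Blithedale, and what I saw"
with_period = re.findall(r"\s([XIVLCDM]+)\.", contents)

print(candidates)   # ['IV', 'I', 'XII']
print(with_period)  # ['I', 'II']
```

The first result includes the pronoun I; the second does not, because the pronoun is not followed by a period.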

\s[A-Za-z]*string.*?\s

Matches any word that contains a specific string (shown in the pattern as the placeholder string), together with its preceding and following spaces. This can be helpful for morphological analysis, when you may be looking for whole words that contain specific morphemes. If you are looking for words with the string ing, for instance, you can find every word that contains that string in any location within the word. If, though, you want to find words that end in ing but not those that contain ing in any other position (to find, for instance, words like running), you can use \s[A-Za-z]+ing\s. When you tag that word, just remember to wrap only the word, not the spaces, in tags. To do that, see the section above on Backreferencing.
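Wrapping only the word, and not the spaces around it, can be done by capturing the three parts separately. The sketch assumes Python's re module; the <w> tag name and the sentence are hypothetical.

```python
import re

text = "She kept running and singing all day."

# Group 1 and group 3 hold the spaces; group 2 holds the word itself.
tagged = re.sub(r"(\s)([A-Za-z]+ing)(\s)", r"\1<w>\2</w>\3", text)

print(tagged)  # She kept <w>running</w> and <w>singing</w> all day.
```

Note that because each match consumes its trailing space, two ing words separated by a single space would not both be found in one pass.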

yes|no

Matches either of the lowercase strings "yes" and "no".

A space character followed by {2,}

Matches two or more consecutive space characters. Replace with a single space character to collapse excess whitespace.
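The last two patterns are short enough to check together. This sketch assumes Python's re module and invented strings.

```python
import re

# Alternation is case-sensitive and also matches inside longer words.
answers = re.findall(r"yes|no", "yes, no, Yes, nobody")

# A space followed by {2,} matches runs of two or more spaces.
collapsed = re.sub(r" {2,}", " ", "too   many  spaces")

print(answers)    # ['yes', 'no', 'no']
print(collapsed)  # too many spaces
```

The capitalized Yes is skipped, while the no at the start of nobody is matched.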

Using regex in XSLT

There are three XPath functions that use regex: matches(), tokenize(), and replace(), and <xsl:analyze-string> also uses regex. Here are a few details:

//p[matches(.,"^[IVX]+\. [A-Z \-,']+$")]
This XPath expression might be used in an XSLT template to match all <p> elements that begin with a Roman numeral built from I, V, and X (and thus less than 40) followed by a period and a space and then only uppercase letters, spaces, hyphens, commas, and apostrophes. We used this in the Blithedale conversion to identify chapter titles that we had tagged initially as paragraphs, so that we could alter their markup.
//p[matches(.,"^$")]
Sometimes when autotagging lines you may wind up tagging an empty line as a paragraph. This regex matches all <p> elements that have nothing between the beginning and end of the element, that is, that are empty. We used it in the Blithedale conversion to delete blank lines.
replace('abc123', '([a-z])', '$1-')
In most regex processing, a captured pattern is inserted into the replacement by preceding a number with a backslash, so that, for example, \1 refers to the first captured parenthesized pattern. In these XPath functions, though, the number must be preceded not by a backslash, but by a dollar sign; this is an XPath peculiarity. The expression above matches a single lower-case letter (the square brackets define a character class from a to z and the parentheses capture whatever is matched) and replaces it by writing it into the output (the $1 inserts the first [and, in this case, only] captured pattern) followed by a literal hyphen. Because replace() applies to every match in the input, the output of the expression above is a-b-c-123.
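The patterns themselves can be checked outside XSLT. The following sketch assumes Python's re module rather than an XPath processor, so the replacement uses \1 where XPath uses $1, and re.search stands in for matches(). The sample titles are illustrative.

```python
import re

# Counterpart of replace('abc123', '([a-z])', '$1-').
hyphenated = re.sub(r"([a-z])", r"\1-", "abc123")

# Counterpart of matches(., "^[IVX]+\. [A-Z \-,']+$").
title_pattern = re.compile(r"^[IVX]+\. [A-Z \-,']+$")
uppercase_title = bool(title_pattern.search("IV. THE SUPPER-TABLE"))
mixed_case_title = bool(title_pattern.search("IV. The supper-table"))

# Counterpart of matches(., "^$"): true only for an empty string.
empty = bool(re.search(r"^$", ""))

print(hyphenated)        # a-b-c-123
print(uppercase_title)   # True
print(mixed_case_title)  # False
print(empty)             # True
```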

In context

Let’s take a look at a brief section of Hamlet to illustrate the capabilities of regex in a more structured context. Suppose we have the following text:

BERNARDO
Who's there?
FRANCISCO
Nay, answer me: stand, and unfold yourself.
BERNARDO
Long live the king!
FRANCISCO
Bernardo?
BERNARDO
He.
FRANCISCO
You come most carefully upon your hour.
BERNARDO
'Tis now struck twelve; get thee to bed, Francisco.
FRANCISCO
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
BERNARDO
Have you had quiet guard?
FRANCISCO
Not a mouse stirring.

and we’d like to tag speeches, speakers, and lines. The simplest way of approaching an auto-tagging problem using regex is often to break it down into individual parts, rather than trying to write the entire expression in one step. One approach is to begin by tagging the speakers by matching ^[A-Z]+$ and replacing it with <speaker>\0</speaker>. This produces:

<speaker>BERNARDO</speaker>
Who's there?
<speaker>FRANCISCO</speaker>
Nay, answer me: stand, and unfold yourself.
<speaker>BERNARDO</speaker>
Long live the king!
<speaker>FRANCISCO</speaker>
Bernardo?
<speaker>BERNARDO</speaker>
He.
<speaker>FRANCISCO</speaker>
You come most carefully upon your hour.
<speaker>BERNARDO</speaker>
'Tis now struck twelve; get thee to bed, Francisco.
<speaker>FRANCISCO</speaker>
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
<speaker>BERNARDO</speaker>
Have you had quiet guard?
<speaker>FRANCISCO</speaker>
Not a mouse stirring.

We can then match all lines that don’t begin with < (that is, all lines of speech) with ^[^<].*$. The square brackets delimit a character class, and putting a caret (^) at the beginning of the class (inside the square brackets) makes this a negative character class, which means that the pattern will match all lines that don’t begin with a character in the class (in this case the only character in the class is <). We can replace the match with <line>\0</line>, which produces:

<speaker>BERNARDO</speaker>
<line>Who's there?</line>
<speaker>FRANCISCO</speaker>
<line>Nay, answer me: stand, and unfold yourself.</line>
<speaker>BERNARDO</speaker>
<line>Long live the king!</line>
<speaker>FRANCISCO</speaker>
<line>Bernardo?</line>
<speaker>BERNARDO</speaker>
<line>He.</line>
<speaker>FRANCISCO</speaker>
<line>You come most carefully upon your hour.</line>
<speaker>BERNARDO</speaker>
<line>'Tis now struck twelve; get thee to bed, Francisco.</line>
<speaker>FRANCISCO</speaker>
<line>For this relief much thanks: 'tis bitter cold,</line>
<line>And I am sick at heart.</line>
<speaker>BERNARDO</speaker>
<line>Have you had quiet guard?</line>
<speaker>FRANCISCO</speaker>
<line>Not a mouse stirring.</line>

We can now identify the boundary between speeches by using the tagged speaker names to identify the beginning of a new speech. (In Real Life we’d switch to XSLT and use <xsl:for-each-group group-starting-with="speaker"> at this point, but since you may be autotagging plain text before you’ve learned about the XSLT strategy, we’ll stick to the <oXygen/> find-and-replace dialog for now.) We can match </line>\n<speaker> and replace it with </line>\n</speech>\n<speech>\n<speaker>, which will produce:

<speaker>BERNARDO</speaker>
<line>Who's there?</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Nay, answer me: stand, and unfold yourself.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>Long live the king!</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Bernardo?</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>He.</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>You come most carefully upon your hour.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>'Tis now struck twelve; get thee to bed, Francisco.</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>For this relief much thanks: 'tis bitter cold,</line>
<line>And I am sick at heart.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>Have you had quiet guard?</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Not a mouse stirring.</line>

As is often the case with this sort of strategy, it’s necessary to patch up the first and last items in a series manually (here, by adding the opening <speech> tag before the first speaker and the closing </speech> tag after the last line), but at least we’ve been able to tag the rest of the elements with global find-and-replace operations. We’ve added most of our major structural markup, and by breaking the process down into discrete, simple steps, we were able to solve a problem that would have been difficult to express as a single regex. The actual play is more complex, of course, especially with respect to stage directions, but as long as we can identify a structural object through plain-text pseudo-markup (e.g., in many editions of plays, stage directions will be in square brackets, which won’t be used for any other purpose), we can usually translate at least some of that structure automatically to XML markup.
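The three find-and-replace steps can also be chained in a script. The following is a minimal sketch, assuming Python's re module instead of the <oXygen/> dialog and a shortened excerpt of the text; Python writes \0 as \g<0>, and the final patch-up of the first and last speech, done by hand above, is a simple string concatenation here.

```python
import re

play = """BERNARDO
Who's there?
FRANCISCO
Nay, answer me: stand, and unfold yourself.
BERNARDO
Long live the king!"""

# Step 1: tag speaker lines, which consist only of capital letters.
step1 = re.sub(r"^[A-Z]+$", r"<speaker>\g<0></speaker>", play, flags=re.MULTILINE)

# Step 2: tag every line that does not begin with < as a line of speech.
step2 = re.sub(r"^[^<].*$", r"<line>\g<0></line>", step1, flags=re.MULTILINE)

# Step 3: at each line-to-speaker boundary, close one speech and open the next.
step3 = re.sub(r"</line>\n<speaker>", "</line>\n</speech>\n<speech>\n<speaker>", step2)

# Patch-up: open the first speech and close the last one.
result = "<speech>\n" + step3 + "\n</speech>"

print(result)
```

The order of the steps matters: step 2 relies on the speaker lines already beginning with <, and step 3 relies on both kinds of tags being in place.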

Final notes

As stated above, the most common use of regex in this course is autotagging, that is, converting plain text to XML by using global find-and-replace operations to avoid having to type all of the tags manually. There are often multiple ways of solving the same problem, so be creative; when approaching a problem, think about what it is you want to do, what patterns are available in the plain-text structure, what specific characters in the text represent pseudo-markup that you can exploit, and how you can use these patterns in your regex to identify and tag the parts of the document. It is common to overgeneralize and then fix the errors after the fact; for example, we found it convenient when tagging a novel with chapter titles to tag everything between blank lines as a paragraph, including the titles, and then use a different regex to retag the titles. The point of the use of regex in this context is to automate those markup tasks that can be automated, so that more of your focus can be devoted to other tasks. The time saved using regex to auto-tag texts usually greatly outweighs the time it takes to write the expressions.

For further information we recommend the tutorials at http://www.regular-expressions.info/, the regexpal on-line expression tester, and Elisa Beshero-Bondar’s Autotagging with regular expressions (regex).