Authors: Andrew Nitz and Simon Brown Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-02-04T22:20:36+0000
Regular expressions (regex) are a powerful tool for autotagging texts, that is, identifying and using patterns to find and replace strings in order to add markup to texts efficiently. See below for some regular expressions commonly used in our projects.
+, ?, *
Regex uses the same repetition indicators that we use when writing schemas in Relax NG, and they have the same meaning in both.
The plus sign (+) immediately following a character means match a sequence of one or more instances of this character. So if you write Great!+ as a regular expression, it will match all strings that consist of the five letters in Great followed immediately by one or more exclamation marks. If you apply this regex to Great, it will fail to match anything because the regex requires at least one exclamation mark.
The question mark (?) immediately following a character means match zero or one instance of this character. This means that the regex Great!? will match Great or Great!. It will not match a question mark; the question mark is a repetition indicator applied to the exclamation mark, which is a literal character, and the question mark itself is not a literal character. Because Great!? will match Great followed by zero or one exclamation mark, if you apply that regex to Great!!!, it will match only the Great! part, and leave the last two exclamation marks unmatched.
The asterisk (*) immediately following a character means match zero or more instances of this character. This means that Great!* will match all instances of Great together with any immediately following exclamation marks. If there are no immediately following exclamation marks, it will nonetheless match Great because the asterisk lets the match succeed even if there are zero instances of the preceding character, that is, the exclamation mark.
To override the repetition-indicator meaning of the plus sign, question mark, or asterisk and match a literal plus sign, question mark, or asterisk, precede it with a backslash. (The technical term for overriding the meaning of a special character is escaping the character.) The regex Great\? matches the literal string Great followed by a literal question mark.
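The behavior of the three repetition indicators, and of escaping, can be checked outside <oXygen/> as well. The following is a minimal sketch in Python, whose re module is assumed here to treat +, ?, and * the same way for these simple patterns; the variable names are ours and purely illustrative.

```python
import re

# re.match() tries the pattern at the beginning of the string.
plus = re.match(r"Great!+", "Great!!!").group()      # one or more "!"
optional = re.match(r"Great!?", "Great!!!").group()  # zero or one "!"
star = re.match(r"Great!*", "Great").group()         # zero or more "!"
no_match = re.match(r"Great!+", "Great")             # needs at least one "!"
escaped = re.match(r"Great\?", "Great?").group()     # backslash makes "?" literal

print(plus)      # Great!!!
print(optional)  # Great!
print(star)      # Great
print(no_match)  # None
print(escaped)   # Great?
```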
.
In regex, the dot corresponds to any character except a newline character. This exception can be overridden in <oXygen/> by checking Dot matches all in the Find-and-replace dialog, in which case the dot will match absolutely all characters, including newlines. To see the difference, open a document in <oXygen/> and open the search-and-replace dialog. Type the regex .* in the Text to find window, and the string obdurodon in the Replace with window. Then hit Replace all. You’ll notice that every line of the document now simply reads obdurodon. Each line is replaced separately because the strings of matching text (characters that are not newlines) are separated by newlines, which aren’t part of the match. This means that each string of text between newline characters is a separate successful match, and each is replaced individually by the replacement string.
Undo this change with control-z in Windows or command-z in macOS and check the Dot matches all box in the Options section of the menu. Now run Replace all again. This time, the entire document has been replaced with exactly one obdurodon string. This is because the dot is now matching newline characters as well, which means that the expression .* will match everything in the document.
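The same experiment can be approximated in Python, where the re.DOTALL flag plays the role of the Dot matches all checkbox. This is only a sketch: it uses .+ rather than .* because Python's re.sub() would otherwise also replace the empty match at the end of each line, which is a difference between regex engines and not something the <oXygen/> dialog is claimed to do.

```python
import re

text = "first line\nsecond line\nthird line"

# Default: the dot stops at newlines, so each line is a separate match.
per_line = re.sub(r".+", "obdurodon", text)

# With re.DOTALL the dot also matches newlines: one match spans everything.
whole = re.sub(r".+", "obdurodon", text, flags=re.DOTALL)

print(per_line)  # three lines, each reading obdurodon
print(whole)     # a single obdurodon
```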
Regex has two types of reserved, special characters, or metacharacters, which do not always have their literal string-value meaning, and are instead used to define patterns within expressions.
The first type of metacharacter is one that has a special meaning unless you escape it by preceding it with a backslash. For example, the plus sign (+) doesn’t match a literal plus sign; what it does instead is tell the expression to match whatever subpattern precedes the plus sign one or more times (that is, it has the same repetition-indicator meaning as in Relax NG). The regex pattern ab+c will match a literal a character followed immediately by one or more b characters and then one c character. The fact that the plus sign is a metacharacter means, among other things, that it cannot be used by itself to match a literal plus sign character. To use the plus sign or similar metacharacters as literal string characters (that is, to match them in a string rather than use them for their syntactic metacharacter purpose within regex), escape them by preceding them with a backslash (\). For example, the pattern ab\+c will match the literal sequence of four characters that corresponds to a, then b, then a plus sign, and then c.
The second type of metacharacter is one that by itself has its literal meaning, but that acquires a special meaning when preceded by a backslash. For example, a d in a regex pattern usually matches a literal d character, but if you precede it with a backslash, it matches a digit (0–9). Thus, because + is the first type of metacharacter (normally non-literal, escape to make literal) and d is the second type (normally literal, escape to make non-literal):
\d+ will match all strings of one or more digits in a text
\d\+ will match one digit followed by a literal plus sign
d+ will match one or more consecutive literal d characters
d\+ will match exactly one literal d character followed by a literal plus sign
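The four combinations above can be compared side by side. The following Python sketch runs each pattern against one made-up sample string (the string and variable names are illustrative assumptions, not part of our projects).

```python
import re

sample = "add 42+ ddd d+ 7+"

digits = re.findall(r"\d+", sample)       # runs of one or more digits
digit_plus = re.findall(r"\d\+", sample)  # one digit, then a literal plus sign
d_runs = re.findall(r"d+", sample)        # runs of one or more literal d
d_plus = re.findall(r"d\+", sample)       # one literal d, then a literal plus sign

print(digits)      # ['42', '7']
print(digit_plus)  # ['2+', '7+']
print(d_runs)      # ['dd', 'ddd', 'd']
print(d_plus)      # ['d+']
```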
A particularly useful feature of regex, especially for autotagging, is the ability to backreference matched strings. This is done by wrapping parts of or entire expressions in parentheses (( )), which captures (remembers) whatever is matched between the parentheses, so that you can then recall the captured match fragment and write it into the replacement string by using a numerical value preceded by a backslash. The numbers associated with backreferences in an expression begin at one, and are ordered by appearance of the opening parenthesis. For example, the expression ^.*$ will match an entire line (as long as dot-all mode is not checked). If wrapped in parentheses as ^(.*)$, it can be backreferenced in the Replace with window, using the expression \1. This means, for example, that you can wrap each line of your document in <p> tags by capturing the contents of the line with parentheses and using <p>\1</p> as the replacement value.
A regex processor automatically captures the entire matched pattern, even without parentheses, and you can insert it into a replacement string with \0. For example, replacing ^.*$ with <line>\0</line> will wrap <line> tags around each line of the input document, which can be handy for autotagging line-oriented input, such as poetry. Note that no parentheses are needed here because the entire match is automatically available with the \0 backreference.
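For readers who want to script these replacements, here is a hedged Python sketch of both kinds of backreference. Two details differ from the <oXygen/> dialog and are properties of Python's re module: the re.MULTILINE flag is needed so that ^ and $ anchor at line boundaries, and the whole match is written as \g<0> in the replacement (Python does not read \0 as a backreference). The sample text is illustrative.

```python
import re

text = "Who's there?\nLong live the king!"

# \1 inserts whatever the first pair of parentheses captured.
paragraphs = re.sub(r"^(.*)$", r"<p>\1</p>", text, flags=re.MULTILINE)

# \g<0> inserts the entire match, so no parentheses are needed.
lines = re.sub(r"^.*$", r"<line>\g<0></line>", text, flags=re.MULTILINE)

print(paragraphs)
print(lines)
```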
<, >, &
There are certain characters that are not permitted as plain text characters within your XML markup, and replacing these with their entity representation is a crucial first step in autotagging a text.
& with &amp;
In XML the ampersand is a reserved character that is used to indicate the beginning of a character entity (such as &lt; for <) or numerical character references. A literal ampersand is not permitted in XML text; if you want to represent an ampersand, you have to replace it with the &amp; entity. You must replace the ampersand before anything else; otherwise any ampersands that you introduce while replacing < and > (see below) will themselves be replaced, which isn’t what you want.
< with &lt;
The angle brackets are used to delimit XML tags, and therefore cannot appear as literal characters, which means that all literal < characters must be replaced by &lt;. XML parsers know that a > isn’t a tag delimiter when it isn’t preceded by a <, but we nonetheless usually replace all > characters with &gt;.
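The importance of replacing the ampersand first can be seen in a small Python sketch. The two function names are hypothetical, and plain string replacement is used because no pattern matching is needed for single characters.

```python
def escape_xml(text: str) -> str:
    """Replace reserved characters, ampersand first."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_wrong_order(text: str) -> str:
    """Deliberately wrong: the & introduced by &lt; gets escaped again."""
    text = text.replace("<", "&lt;")
    text = text.replace("&", "&amp;")
    return text


print(escape_xml("AT&T <b>"))       # AT&amp;T &lt;b&gt;
print(escape_wrong_order("a < b"))  # a &amp;lt; b
```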
\n\n
Matches two consecutive newline characters, that is, locates blank lines. Useful for identifying and delimiting groups of text that are divided structurally using newlines, such as paragraphs in prose or stanzas in a poem that may be separated by a blank line in plain text representations. Be aware that if there are space characters on the blank line, this pattern won’t match, since it looks only for two immediately consecutive new-line characters.
^.*$
Matches an entire line from initial to final character (the leading ^ means anchor the beginning of the match to the beginning of the line and the trailing $ means anchor the end of the match to the end of the line). This pattern can be used as a building block to find lines that contain a specific substring by using .* twice with the substring in between. For example, ^.*Hamlet.*$ will match a line containing the exact substring Hamlet no matter what (if anything) precedes or follows it on the line.
Note that this is different from matching Hamlet, which will match the string Hamlet wherever it occurs, but it won’t match the entire line that contains that string. It’s easy for humans to overlook the difference, but to a computer, matching a string and matching an entire line that contains that string are not at all the same thing.
^string.*$ and ^.*string$
Matches a line beginning or ending, respectively, with a specific string string. For example, ^Hamlet.*$ will find a line beginning with the string Hamlet.
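The difference between matching a substring, matching the whole line that contains it, and matching a line that begins with it can be made concrete. The following Python sketch uses invented sample lines; re.MULTILINE is Python's way of making ^ and $ anchor at every line and not only at the start and end of the whole text.

```python
import re

text = "Enter Hamlet and Horatio\nExit Ghost\nHamlet speaks"

# The bare substring: two short matches, one per occurrence.
substring_hits = re.findall(r"Hamlet", text)

# The whole line, wherever Hamlet occurs on it.
line_hits = re.findall(r"^.*Hamlet.*$", text, flags=re.MULTILINE)

# Only lines that begin with Hamlet.
initial_hits = re.findall(r"^Hamlet.*$", text, flags=re.MULTILINE)

print(substring_hits)  # ['Hamlet', 'Hamlet']
print(line_hits)       # ['Enter Hamlet and Horatio', 'Hamlet speaks']
print(initial_hits)    # ['Hamlet speaks']
```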
".*?"
Matches everything inside a pair of (straight) quotation marks. Remember that by default the dot doesn’t match a new-line character, so this pattern will match quotes only if they start and end in the same line, that is, if there is no new-line character between the opening and closing quotation-mark character. If you want to match quotes that may span multiple lines, as may happen in a prose text, you will need to check Dot matches all, but that introduces a new catch, so you’ll need to do one more thing, as well.
The ".*" means match zero or more characters, except for new-lines (unless Dot matches all is checked), between quotation marks. Changing this to ".*?" by adding the trailing question mark makes the match non-greedy; instead of matching the longest possible stretch between quotation marks (possibly gobbling up multiple independent quoted phrases), it will take the shortest match that fits the pattern. This is important in situations where there may be two separate quotations in a single line. The non-greedy match will match each quotation separately, which is what you want. A greedy match would assume that there was just one quotation, and would capture all of the text from the beginning of the first quotation through the end of the second.
When you replace literal quotation marks in your input with tags (such as the HTML <q> tag), don’t forget to remove the original quotation marks. The original quotation marks are pseudo-markup, and serve to indicate a quotation in a context where markup isn’t available. Once you have real markup, you typically don’t want to retain the literal quotation marks. You can effect the change by matching "(.*?)" and replacing it with <q>\1</q>. You use the parentheses to capture the text between the original quotation marks, and when you write that captured text into the replacement output, you effectively throw away the original literal quotation marks.
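Greedy and non-greedy matching are easiest to tell apart on a line with two quotations. The following Python sketch uses an invented sentence and shows both behaviors, followed by the <q> replacement described above.

```python
import re

text = 'He said "yes" and she said "no".'

greedy = re.findall(r'".*"', text)  # longest stretch between quotation marks
lazy = re.findall(r'".*?"', text)   # shortest stretch, one match per quotation

# Capture the quoted text and discard the literal quotation marks.
tagged = re.sub(r'"(.*?)"', r"<q>\1</q>", text)

print(greedy)  # ['"yes" and she said "no"']
print(lazy)    # ['"yes"', '"no"']
print(tagged)  # He said <q>yes</q> and she said <q>no</q>.
```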
\d{4}
Matches four consecutive digits, such as a year. The number inside the curly braces is the exact number of times the preceding item must be repeated in order for the match to succeed. To add a greater degree of constraint, specific digits can be used, followed by a specific number of variable digits to match the pattern. For instance, if you’d like to tag all years in a text without mistakenly tagging non-year strings of four digits, and you know that all years will fall in the twentieth century (and thus begin with 19), you can use 19\d{2} as your match pattern. This might match a four-digit number that isn’t a twentieth-century year and that nonetheless happens to begin with 19, so it doesn’t afford absolute protection against false positives, but at least it can limit them. Note that the pattern \d{4} will also match the first four digits of a five-digit number, etc., and to avoid that you can constrain your pattern still further by specifying that the fourth digit cannot be followed by a digit.
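The progressively tighter year patterns can be compared in Python. In this sketch the sample sentence is invented, and the negative lookahead (?!\d) is one possible way, chosen by us, of saying that the fourth digit cannot be followed by a digit.

```python
import re

text = "Born in 1923, ID 19234, died 1987; founded 1850."

any_four = re.findall(r"\d{4}", text)        # any four consecutive digits
twentieth = re.findall(r"19\d{2}", text)     # must begin with 19
strict = re.findall(r"19\d{2}(?!\d)", text)  # and not be followed by a digit

print(any_four)   # ['1923', '1923', '1987', '1850']
print(twentieth)  # ['1923', '1923', '1987']
print(strict)     # ['1923', '1987']
```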
\d{1,2}[./-]\d{1,2}[./-]\d{2,4}
Matches dates in the dd-mm-yyyy or mm-dd-yyyy format, allowing single- or double-digit numbers for days and months, and numeric strings of length two to four for the year. The delimiter can be a period (.), forward slash (/), or dash (-). You don’t need to escape the dot here because the processor knows that a dot inside a character class (that is, inside square brackets) is a literal dot.
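A quick check of the date pattern in Python, using an invented sentence that mixes the three delimiters with one string that looks date-like but is too short to match:

```python
import re

text = "Letters dated 3/4/1921, 12-25-99 and 01.02.2003; see also 1.2."

# Inside the square brackets the dot is literal, so it needs no backslash.
dates = re.findall(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}", text)

print(dates)  # ['3/4/1921', '12-25-99', '01.02.2003']
```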
\b[XIVLCDM]+\b
Matches Roman numerals by matching a sequence of one or more of the capitalized letters in the brackets, in any order, with a word boundary on either side. A word boundary is the position between a word character and a non-word character (such as whitespace or punctuation), or the beginning or end of the text; see https://www.regular-expressions.info/wordboundaries.html for details. This expression will match the pronoun I, which you don’t want but can’t avoid, so you’ll need to proofread. If you’re lucky, in Real Life you may find that Roman numerals in your text are followed by periods (as in a table of contents), while the personal pronoun I typically is not followed by a period. You may also find that Roman numerals occur in restricted positions, such as chapter titles; if a chapter title must begin with a Roman numeral and you can identify a chapter title in some other way, you know that an initial lone I there is a Roman numeral, and not a pronoun.
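The false positive on the pronoun, and the effect of requiring a following period, can be seen in a short Python sketch. The sample sentence is invented, and the second pattern is only one illustration of the "followed by a period" heuristic described above.

```python
import re

text = "I. The Storm. I saw Louis XVI there."

# Matches the chapter number, the pronoun, and the regnal number alike.
numerals = re.findall(r"\b[XIVLCDM]+\b", text)

# Requiring a literal period afterwards keeps only the chapter number here.
with_period = re.findall(r"\b[XIVLCDM]+\.", text)

print(numerals)     # ['I', 'I', 'XVI']
print(with_period)  # ['I.']
```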
\s[A-Za-z]*string.*?\s
Matches any word that contains the string string with its preceding and following spaces. This can be helpful for morphological analysis, when you may be looking for whole words that contain specific morphemes. If you are looking for words with the string ing, for instance, you can search every word that contains that string in any location within the word. If, though, you want to find words that end in ing but not those that contain ing in any other position (to find, for instance, words like running), you can use \s[A-Za-z]+ing\s. When you tag that word, just remember to wrap only the word, not the spaces, in tags. To do that, see the section above on Backreferencing.
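One way to wrap only the word, and not the surrounding spaces, is to capture the two whitespace characters and the word in three separate groups and write the tags around the middle one. In this Python sketch the <w> element name and the sample sentence are hypothetical.

```python
import re

text = "He was running and kept singing loudly, ringed by bells."

# Group 1: leading whitespace; group 2: the word; group 3: trailing whitespace.
tagged = re.sub(r"(\s)([A-Za-z]+ing)(\s)", r"\1<w>\2</w>\3", text)

print(tagged)
# He was <w>running</w> and kept <w>singing</w> loudly, ringed by bells.
```

Note that ringed is left alone because its ing is not at the end of the word. Because the pattern consumes the whitespace on both sides, two qualifying words in a row would need a second pass or a different pattern.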
yes|no
Matches either of the lowercase strings "yes" and "no".
 {2,}
Matches two or more consecutive space characters. The pattern is a single space character followed by {2,}, so the leading space is significant. Replace with a single space character to collapse excess whitespace.
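As a small Python sketch of the whitespace cleanup (the sample string is invented):

```python
import re

messy = "too    many   spaces  here"

# The pattern is one literal space followed by the quantifier {2,}.
collapsed = re.sub(r" {2,}", " ", messy)

print(collapsed)  # too many spaces here
```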
There are four XPath functions that use regex: matches(), tokenize(), replace(), and analyze-string(). The XSLT <xsl:analyze-string> element also uses regex. Here are a few details:
//p[matches(.,"^[IVX]+\. [A-Z \-,']+$")]
Matches <p> elements that begin with a Roman numeral less than 40 (that is, one written using only I, V, and X) followed by a period and a space and then only uppercase letters, spaces, hyphens, commas, and apostrophes. We used this in the Blithedale conversion to identify chapter titles that we had tagged initially as paragraphs, so that we could alter their markup.
//p[matches(.,"^$")]
Matches <p> elements that have nothing between the beginning and end of the element, that is, that are empty. We used it in the Blithedale conversion to delete blank lines.
replace('abc123', '([a-z])', '$1-')
In the Find-and-replace dialog, \1 refers to the first captured parenthesized pattern. In the XPath functions described here, though, the number must be preceded not by a backslash, but by a dollar sign; this is an XPath peculiarity. The expression above matches a single lower-case letter (the square brackets define a character class from a to z and the parentheses capture whatever is matched) and replaces it by writing it into the output (the $1 inserts the first [and, in this case, only] captured pattern) followed by a literal hyphen. The output of the expression above is thus a-b-c-123.
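The regex parts of these XPath expressions can be tried without an XML document. The following Python sketch is an approximation, not XPath: re.search() stands in for matches(), re.sub() stands in for replace(), and the replacement uses Python's \1 where XPath uses $1. The sample strings are invented for illustration.

```python
import re

title_pattern = r"^[IVX]+\. [A-Z \-,']+$"

# A chapter-title-like string matches; an ordinary sentence does not.
is_title = re.search(title_pattern, "I. OLD MOODIE") is not None
is_sentence = re.search(title_pattern, "It was late.") is not None

# Counterpart of replace('abc123', '([a-z])', '$1-').
hyphenated = re.sub(r"([a-z])", r"\1-", "abc123")

print(is_title)     # True
print(is_sentence)  # False
print(hyphenated)   # a-b-c-123
```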
Let’s take a look at a brief section of Hamlet to illustrate the capabilities of regex in a more structured context. Suppose we have the following text:
BERNARDO
Who's there?
FRANCISCO
Nay, answer me: stand, and unfold yourself.
BERNARDO
Long live the king!
FRANCISCO
Bernardo?
BERNARDO
He.
FRANCISCO
You come most carefully upon your hour.
BERNARDO
'Tis now struck twelve; get thee to bed, Francisco.
FRANCISCO
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
BERNARDO
Have you had quiet guard?
FRANCISCO
Not a mouse stirring.
and we’d like to tag speeches, speakers, and lines. The simplest way of approaching an auto-tagging problem using regex is often to break it down into individual parts, rather than trying to write the entire expression in one step. One approach is to begin by tagging the speakers (any entire line consisting of only uppercase letters is a speaker name) by matching ^[A-Z]+$ and replacing it with <speaker>\0</speaker>. This produces:
<speaker>BERNARDO</speaker>
Who's there?
<speaker>FRANCISCO</speaker>
Nay, answer me: stand, and unfold yourself.
<speaker>BERNARDO</speaker>
Long live the king!
<speaker>FRANCISCO</speaker>
Bernardo?
<speaker>BERNARDO</speaker>
He.
<speaker>FRANCISCO</speaker>
You come most carefully upon your hour.
<speaker>BERNARDO</speaker>
'Tis now struck twelve; get thee to bed, Francisco.
<speaker>FRANCISCO</speaker>
For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
<speaker>BERNARDO</speaker>
Have you had quiet guard?
<speaker>FRANCISCO</speaker>
Not a mouse stirring.
We can then match all lines that don’t begin with < (that is, all lines of speech) with ^[^<].*$. The square brackets delimit a character class, and putting a caret (^) at the beginning of the class (inside the square brackets) makes this a negative character class, which means that the pattern will match all lines that don’t begin with a character in the class (in this case the only character in the class is <). We can replace the match with <line>\0</line>, which produces:
<speaker>BERNARDO</speaker>
<line>Who's there?</line>
<speaker>FRANCISCO</speaker>
<line>Nay, answer me: stand, and unfold yourself.</line>
<speaker>BERNARDO</speaker>
<line>Long live the king!</line>
<speaker>FRANCISCO</speaker>
<line>Bernardo?</line>
<speaker>BERNARDO</speaker>
<line>He.</line>
<speaker>FRANCISCO</speaker>
<line>You come most carefully upon your hour.</line>
<speaker>BERNARDO</speaker>
<line>'Tis now struck twelve; get thee to bed, Francisco.</line>
<speaker>FRANCISCO</speaker>
<line>For this relief much thanks: 'tis bitter cold,</line>
<line>And I am sick at heart.</line>
<speaker>BERNARDO</speaker>
<line>Have you had quiet guard?</line>
<speaker>FRANCISCO</speaker>
<line>Not a mouse stirring.</line>
We can now identify the boundary between speeches by using the tagged speaker names to identify the beginning of a new speech. (In Real Life we’d switch to XSLT and use <xsl:for-each-group group-starting-with="speaker"> at this point, but since you may be autotagging plain text before you’ve learned about the XSLT strategy, we’ll stick to the <oXygen/> Find-and-replace dialog for now.) We can match </line>\n<speaker> and replace it with </line>\n</speech>\n<speech>\n<speaker>, which will produce:
<speaker>BERNARDO</speaker>
<line>Who's there?</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Nay, answer me: stand, and unfold yourself.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>Long live the king!</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Bernardo?</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>He.</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>You come most carefully upon your hour.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>'Tis now struck twelve; get thee to bed, Francisco.</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>For this relief much thanks: 'tis bitter cold,</line>
<line>And I am sick at heart.</line>
</speech>
<speech>
<speaker>BERNARDO</speaker>
<line>Have you had quiet guard?</line>
</speech>
<speech>
<speaker>FRANCISCO</speaker>
<line>Not a mouse stirring.</line>
Since this strategy finds the boundary between speaker+lines combinations by looking at the end of a line and the beginning of a speaker, it may be necessary to patch up the first and last items in a series manually, but at least we’ve been able to tag the rest of the elements with global find-and-replace operations. We’ve added most of our major structural markup, of different types, by breaking the process down into discrete, simple steps. The actual play is more complex, of course, especially with respect to stage directions, but as long as we can identify a structural object through plain-text pseudo-markup (e.g., in many editions of plays, stage directions will be in square brackets, which won’t be used for any other purpose), we can usually translate at least some of that structure automatically to XML markup.
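The three find-and-replace steps above can also be sketched as a short Python script, which may be handy for repeating the process on several files. This is an approximation of what the <oXygen/> dialog does, under two assumptions about Python's re module: re.MULTILINE is needed for line-by-line anchoring, and \g<0> is Python's spelling of the \0 whole-match backreference. The final line patches the first and last speech, the step that the text above says may need to be done manually.

```python
import re

raw = """BERNARDO
Who's there?
FRANCISCO
Nay, answer me: stand, and unfold yourself.
BERNARDO
Long live the king!"""

# Step 1: a line of only uppercase letters is a speaker name.
step1 = re.sub(r"^[A-Z]+$", r"<speaker>\g<0></speaker>", raw, flags=re.MULTILINE)

# Step 2: every line that does not begin with < is a line of speech.
step2 = re.sub(r"^[^<].*$", r"<line>\g<0></line>", step1, flags=re.MULTILINE)

# Step 3: a line followed by a speaker marks the boundary between speeches.
step3 = re.sub(
    r"</line>\n<speaker>",
    r"</line>\n</speech>\n<speech>\n<speaker>",
    step2,
)

# Patch the unmatched first and last speech tags.
patched = "<speech>\n" + step3 + "\n</speech>"

print(patched)
```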
As stated above, the most common use of regex in this course is autotagging, that is, converting plain text to XML by using global find-and-replace operations to avoid having to type all of the tags manually. There are often multiple ways of solving the same problem, so be creative; when approaching a problem, think about what it is you want to do, what patterns are available in the plain-text structure, what specific characters in the text represent pseudo-markup that you can exploit, and how you can use these patterns in your regex to identify and tag the parts of the document.
It is common to overgeneralize and then fix the errors after the fact. For example, we found it convenient when tagging a novel with chapter titles to tag everything between blank lines as a paragraph, including the titles, and then use a different regex to retag the titles. The point of using regex in this context is to automate those markup tasks that can be automated, so that more of your focus can be devoted to other tasks. The time saved using regex to auto-tag texts usually greatly outweighs the time it takes to write the expressions.
For further information we recommend the tutorials at http://www.regular-expressions.info/, the online regex testers regexpal and ExtendsClass, and Elisa Beshero-Bondar’s Autotagging with regular expressions (regex).