Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2021-12-27T22:03:53+0000
A regular expression (regex) is a pattern that can be used to match a string of text. Regular expressions are a standard feature of many programming languages used for text processing (and they have the inconvenient habit of being implemented ever-so-slightly differently in different languages). Fortunately, much of the core functionality is consistent, and the XPath peculiarities are described in Kay.
Regular expressions are often used in an XML environment in the following situations:
Regex syntax is supported by three XPath functions: matches(), replace(), and tokenize(). It is also used by the XSLT element <xsl:analyze-string>. See Kay for details about how these are used, including examples. In the example below we use only matches(), but we make extensive use of all of these in our real work.
Assume that we’re given a plain-text file like the Project Gutenberg EBook of The Blithedale Romance, by Nathaniel Hawthorne. In this case Project Gutenberg makes the same book available in HTML, and in Real Life we’d probably convert from HTML to XML rather than from plain text, but since there are situations where all we have is plain text, we’ll pretend that’s the case here. A lot of the markup we might introduce for analytical purposes will require us to touch every word of the text, but we can autotag chapter titles, paragraphs, and quotations using regex tools. We can also autotag entire chapters using an XSLT-based XML-to-XML transformation (which we won’t illustrate here).
Open the file in <oXygen/> as a plain text file. You can either open it from the <oXygen/> menu or create a new text file, copy the text from your browser, and paste it in. We then begin by cutting out the front matter (before the first chapter title) and the back matter (after the last line of the text of the novel). We might want to mark those up eventually and reintroduce them into the XML as metadata, so we can save them to a separate file, but we’ll have to do that manually. Since all we intend to autotag is the actual text of the novel, we start by stripping everything else out of the file manually.
What’s left is a bunch of chapter titles and paragraphs, separated from one another by a blank line, and we can use a regex to find all blank lines and replace them with the sequence </p><p>. To perform a regex full-text search-and-replace in <oXygen/>, hit control-f (Windows) or command-f (MacOS) and check “Regular expression,” “Dot matches all,” and “Wrap around.” In the “Text to find” field, enter \n\n. \n is the regex for a new-line character, so this expression will find two new-line characters in a row, or a blank line. In the “Replace with” field, enter </p><p>. Hit the “Replace all” button to run the transformation. Inserting this markup has the effect of treating the blank line as signaling the end of the preceding XML paragraph and beginning of the next.
Note that this transformation depends on the two end-of-line characters being immediately adjacent to each other. If what looks like a blank line to you actually has (invisible) spaces or tabs, the pattern won’t match and the replacement won’t happen. If you think that might be the case, you can make those characters visible by going into the <oXygen/> preferences (Tools → Preferences → Editor → Edit modes → Text) and checking the boxes labeled “Show TAB/NBSP/EOL/EOF marks” and “Show SPACE marks.” If you do have whitespace characters getting in the way, you can use regex processing to replace them: the pattern \s+ matches one or more white-space characters.
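As a rough illustration outside <oXygen/>, here is a minimal Python sketch of that cleanup. It is not the <oXygen/> dialog itself, and the function name and sample text are hypothetical. Because \s also matches new-line characters, the sketch uses the narrower group [ \t] so that the blank lines themselves survive.

```python
import re

def strip_blank_line_whitespace(text: str) -> str:
    """Remove spaces and tabs from lines that contain nothing else."""
    # (?m) lets ^ and $ match at line boundaries; [ \t]+ never touches new-lines.
    return re.sub(r"(?m)^[ \t]+$", "", text)

# Hypothetical sample: the "blank" line between the paragraphs hides a space and a tab.
sample = "First paragraph.\n \t\nSecond paragraph."
cleaned = strip_blank_line_whitespace(sample)
print(repr(cleaned))
```

After this step the two new-line characters are adjacent again, so the \n\n pattern can find the blank line.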
You’ll have to add the <p> start tag before the first paragraph and the </p> end tag after the last one manually, but you can enter all of the rest automatically with a single regex-aware search-and-replace operation. At this point the document looks like a bunch of <p> elements. Some are empty and some contain chapter titles, rather than paragraphs. We’ll fix that below.
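The paragraph-tagging step can be sketched in Python as follows. This is only an analogue of the search-and-replace described above, under the assumption that the text uses bare \n line endings; the function name and sample strings are hypothetical.

```python
import re

def tag_paragraphs(text: str) -> str:
    """Replace each blank line with </p><p>, then add the outer tags."""
    # Trim leading and trailing new-lines so they don't produce stray markup.
    body = re.sub(r"\n\n", "</p><p>", text.strip("\n"))
    # The first start tag and last end tag are added separately,
    # mirroring the manual step in the editor.
    return "<p>" + body + "</p>"

print(tag_paragraphs("I. A SAMPLE TITLE\n\nFirst paragraph.\n\nSecond paragraph."))
# Two blank lines in a row yield an empty paragraph:
print(tag_paragraphs("a\n\n\n\nb"))
```

The second call shows one way empty <p> elements can arise: two consecutive blank lines are replaced twice.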
Quotes in this text are delimited by straight double quotation marks (the " character). This means that a quotation can be defined as the text that occurs between two double-quote characters, starting from the first. This is a fragile strategy because plain text files found on the Internet may be missing occasional punctuation marks because of careless typing or proofreading, and since the strategy depends on odd-numbered quotation mark characters falling at the beginning of a quotation and even-numbered ones falling at the end, a single missing mark can throw off the count for the rest of the file. The text we’re using in this exercise doesn’t have that problem, but should you encounter it in the wild, you’ll need to run the search-and-replace described below, find where the count goes off, fix it manually in the input, and then rerun the search-and-replace.
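One way to narrow down where the count goes off is to look for paragraphs with an odd number of quotation marks. The Python sketch below is a hypothetical helper, not part of the workflow above, and it assumes that a quotation normally opens and closes within a single paragraph. Where a speech runs over several paragraphs that assumption fails, so treat the results as candidates for inspection rather than as confirmed errors.

```python
def paragraphs_with_odd_quotes(text: str) -> list[int]:
    """Return 1-based positions of blank-line-separated paragraphs
    that contain an odd number of straight double quotation marks."""
    paragraphs = text.split("\n\n")
    return [i for i, p in enumerate(paragraphs, start=1) if p.count('"') % 2 == 1]

# Hypothetical sample: the second paragraph is missing its closing mark.
sample = '"Hello," he said.\n\n"An unclosed quotation.\n\nNo quotation here.'
print(paragraphs_with_odd_quotes(sample))
```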
The regex for matching a quotation is "(.*?)". This matches a double quotation mark character followed by something else followed by another double quotation mark character. The something else is parenthesized for a reason we’ll see shortly; what’s important now is that the parentheses are part of the regex syntax, and the processor does not look for parenthesis characters in the document. Inside the parentheses the dot means “any character,” the asterisk means “zero or more instances of whatever precedes it,” and the question mark means “don’t be greedy.” Let’s explore these:
By default the dot matches any character except a new-line, which is why we checked “Dot matches all” in the Find/Replace dialog box. Once we’ve done that, dot matches absolutely any character.
.*, then, means zero or more characters, whatever they may be.
Regex matching is greedy by default, so "(.*)" would match all of the text in the entire book between the very first quotation mark and the very last. The question mark has the effect of modifying the .* to mean “any sequence of characters except the one that immediately follows this part of the pattern.” That is, this part of the pattern will stop as soon as it meets another quotation mark. You can test this by entering the two versions in the “Text to find” box and clicking the “Find” button. The one with the question mark will find each individual quotation. The one without will find one very long item, starting with the very first double quotation mark in the text and ending with the last.
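The difference between the greedy and non-greedy versions can also be seen in a small Python sketch. This uses Python’s re module rather than the <oXygen/> dialog, with re.DOTALL standing in for the “Dot matches all” checkbox, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample with two separate quotations.
sample = '"Yes," said she. "No," said he.'

# Greedy: .* runs to the last quotation mark it can reach.
greedy = re.findall(r'"(.*)"', sample, flags=re.DOTALL)

# Non-greedy: .*? stops at the first quotation mark that follows.
lazy = re.findall(r'"(.*?)"', sample, flags=re.DOTALL)

print(greedy)
print(lazy)
```

The greedy pattern returns a single item that swallows everything between the first and last marks, while the non-greedy one returns each quotation separately.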
The “Text to find,” then, is "(.*?)". The “Replace with” text you should enter is <q>\1</q>. The tags will be entered literally into the output, just as happened with the paragraph tags during the first search-and-replace operation. The \1 means “insert whatever text matched the part of the match pattern between the first set of parentheses.” In this case we have only one set of parentheses, so it matches the text that was between the two double quotation marks in the source text, and we copy that and insert it into the output. What we’re doing, then, is copying that text from the input to the output, but where we had quotation marks wrapped around it in the input, we’re replacing those with <q> tags in the output. You can have as many parenthesized expressions as you’d like, and you can use \0 to insert the entire matched pattern (which in this case would include the opening and closing quotation marks plus all the text between them).
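For comparison, the same replacement can be sketched in Python. This is an analogue, not the <oXygen/> dialog: Python’s re.sub() also writes the first group as \1, but it spells the entire match as \g<0> rather than \0. The function name and sample sentence are hypothetical.

```python
import re

def tag_quotations(text: str) -> str:
    """Replace each pair of straight double quotation marks with <q> tags."""
    # \1 copies whatever the parenthesized part of the pattern matched.
    return re.sub(r'"(.*?)"', r"<q>\1</q>", text, flags=re.DOTALL)

sample = 'She said, "Good morning."'
print(tag_quotations(sample))

# \g<0> inserts the whole match, quotation marks included.
whole = re.sub(r'"(.*?)"', r"[\g<0>]", sample, flags=re.DOTALL)
print(whole)
```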
The input text includes some quotations that are logically divided, such as
"Mr. Coverdale," said he softly, "can I speak with you a moment?"
From a linguistic or rhetorical perspective this is one utterance, but our transformation
will treat it as two, creating two <q> elements, one for each pair of
quotation marks. If you want to encode that these are associated with each other, you’ll
need to do that manually. Alternatively, you could use a regular expression strategy to
catch all apparent quotations that begin with a lower-case letter, infer that they
represent continuations of the immediately preceding quotation, and insert the markup
automatically. This wouldn’t catch all of the broken quotations (for example, a
continuation might begin with an upper-case letter that represents someone’s name), but
it could still save a large amount of time when compared to manually uniting all split
quotations.
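The lower-case heuristic might be sketched like this in Python. The type="continuation" attribute is a hypothetical way of recording the inference, not markup prescribed by this tutorial, and the sketch assumes the quotations have already been tagged with plain <q> start tags.

```python
import re

def mark_continuations(tagged: str) -> str:
    """Flag <q> elements whose text begins with a lower-case letter."""
    # (?=[a-z]) looks ahead for a lower-case letter without consuming it.
    return re.sub(r"<q>(?=[a-z])", '<q type="continuation">', tagged)

sample = ("<q>Mr. Coverdale,</q> said he softly, "
          "<q>can I speak with you a moment?</q>")
print(mark_continuations(sample))
```

Only the second <q> is flagged, because the first begins with an upper-case letter.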
We now have paragraphs and quotations tagged automatically. Let’s get rid of empty paragraphs and change the markup of chapter titles, since they aren’t really paragraphs.
Transformation from XML to XML is a common component of preparing texts for publication and analysis in digital humanities. The identity transformation is an XSLT transformation that converts an XML document to itself by writing back out exactly what it reads. It looks like:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
The identity transformation uses a single template rule that matches all attributes (@*) and everything on the child axis (node()), of which we usually care most about elements and text() nodes. We have to specify attributes separately because they are on the attribute axis, and not the child axis, which means that node() doesn’t match them because it is short for child::node(). The <xsl:copy> element makes a shallow copy of whatever it matches, which means that it copies the node, but not its contents. When it matches a <p> element, for example, it copies it (creates a <p> node in the output document), but it doesn’t automatically copy the attributes of the original <p> or any text or other elements inside it. Instead, those components are processed by the <xsl:apply-templates select="@*|node()"/> inside the newly-created copy of the original node.
If all one did with an identity transformation was run it as is, it would serve no purpose, since one could copy the original document in simpler ways. The point of the identity transformation is that it can serve as a default, letting you make small changes only where you want them. We’re going to use an identity transformation to output exactly what we input, except that we’ll treat empty paragraphs and chapter titles differently.
We can match an empty paragraph by using the XPath matches() function. This function takes two arguments: the string that is being searched for the match and the pattern for which you’re searching. It is similar to the contains() function, except that contains() can search only for a string, while matches() can search for a regex pattern. We could, for example, easily find all paragraphs that contain the name “Cecilia” using contains(), but to find, say, all paragraphs that contain a three-digit number we’d want to use a regex pattern.
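The contrast between a plain substring test and a regex test can be sketched in Python, where the in operator plays roughly the role of contains() and re.search() roughly the role of matches(). The paragraphs below are invented for illustration, and the pattern looks for a run of three digits.

```python
import re

# Hypothetical paragraphs, not taken from the novel.
paragraphs = [
    "Cecilia arrived at noon.",
    "The farm had 125 acres.",
    "No names or numbers here.",
]

# Substring test: enough when we are looking for a fixed string.
with_name = [p for p in paragraphs if "Cecilia" in p]

# Regex test: needed when we are looking for a pattern, here three digits in a row.
with_three_digits = [p for p in paragraphs if re.search(r"[0-9]{3}", p)]

print(with_name)
print(with_three_digits)
```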
When used as the value of the @match attribute in an <xsl:template> rule, the XPath pattern p[matches(.,"^$")] will catch any <p> element that matches the empty string, that is, any empty paragraph. The regex character caret (^) matches the beginning of a string and the dollar sign ($) matches the end. Since there is nothing between them, this regex will match any string that has nothing between its beginning and its end, that is, any empty string. We could, of course, have found empty paragraphs with p[string-length(.) eq 0], so in this case we could have gotten by without a regex.
We can now augment our identity transformation by adding a second template rule, just for empty paragraphs (the last template rule in the stylesheet below):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
</xsl:stylesheet>
The new template matches empty paragraphs, and because it itself is an empty element, it instructs the XSLT transformation engine to do nothing with whatever it matches. This has the effect of consuming all empty paragraphs and throwing them away.
It might appear as if both template rules, the new one and the original identity one, match empty paragraphs. After all, empty paragraphs are also nodes, so they fit the XPath pattern in the @match attribute of the identity template rule. It turns out that XSLT has built-in priorities to resolve potentially ambiguous matches. The details are complicated, but the short version is that the more specific pattern has priority, and since the new rule targets empty paragraphs much more specifically than the identity one, it gets to handle the empty paragraphs. An XSLT processor will report an ambiguity in priority that it can’t resolve on its own, and you can use the @priority attribute to resolve an ambiguity or to override the default if it doesn’t give the behavior you want.
Chapter titles are currently tagged the same way as paragraphs, but their textual content has certain properties that can be distinguished with a regular expression. All chapter titles in this text happen to begin with an upper-case roman numeral, followed by a single period, a single space, and then a string of upper-case letters mixed with hyphens, commas, straight apostrophes, and spaces. No true paragraph matches that pattern; they all have lower-case letters, and many have other punctuation.
The pattern in question can be matched by p[matches(.,"^[IVX]+\. [A-Z \-,']+$")]. Here’s how it works:
The caret (^) anchors the match at the beginning of the string. Square brackets define a group of characters and mean “any character in that group.” There are twenty-nine chapters in this novel, which can be expressed with the roman numeral digits “I,” “V,” and “X.” The plus sign has the same meaning as in Relax NG; it means “match a string that consists of at least one instance of any of these three characters.” That, then, takes care of the roman numeral; [IVX]+ will match any sequence of at least one of those three characters in any combination and any order.
The period is normally a regex metacharacter (it means “any character”), so to have it mean a literal period character we escape it by preceding it with a backslash (\). The space character that follows means a literal space.
Inside square brackets, A-Z means “any character between A and Z, inclusive.” The group, then, inside the usual square brackets, contains all of the uppercase characters, the space, the hyphen (escaped with a backslash so that it won’t be mistaken for part of a range), the comma, and the single straight apostrophe. The plus sign after the square brackets means that one or more of the characters in the group must be found for the pattern to match. The final dollar sign ($) anchors the match at the end of the string, as in the empty-paragraph pattern above.
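The chapter-title pattern can be tried out in a short Python sketch. Python’s regex flavor differs from XPath’s in places, but this particular pattern uses only features the two share. The function name and the test strings are hypothetical; they are not titles from the novel.

```python
import re

# Roman numeral, period, space, then upper-case letters mixed with
# spaces, hyphens, commas, and straight apostrophes, and nothing else.
CHAPTER_TITLE = re.compile(r"^[IVX]+\. [A-Z \-,']+$")

def is_chapter_title(text: str) -> bool:
    """Report whether the whole string looks like a chapter title."""
    return CHAPTER_TITLE.search(text) is not None

print(is_chapter_title("IV. A SAMPLE TITLE, WITH HYPHEN-ATED WORDS"))
print(is_chapter_title("IV. Not a title"))       # lower-case letters
print(is_chapter_title("IV A MISSING PERIOD"))   # no period after the numeral
```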
If we’ve analyzed the character properties of chapter titles correctly, then, a <p> element that matches the pattern will be a chapter title, and we can modify our XSLT stylesheet to change the markup to an <h2> by adding a new template rule for paragraphs that match this pattern, as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
    <xsl:template match="p[matches(.,&quot;^[IVX]+\. [A-Z \-,&apos;]+$&quot;)]">
        <h2>
            <xsl:apply-templates select="@*|node()"/>
        </h2>
    </xsl:template>
</xsl:stylesheet>
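For readers who want to see the same logic outside XSLT, here is a rough Python analogue using the standard-library ElementTree module. It is not how an XSLT processor works internally, and it assumes that the <p> elements are direct children of a wrapper element, here a hypothetical <text>. Elements that match neither test are left alone, which plays the role of the identity template.

```python
import re
import xml.etree.ElementTree as ET

CHAPTER_TITLE = re.compile(r"^[IVX]+\. [A-Z \-,']+$")

def transform(root: ET.Element) -> ET.Element:
    """Drop empty <p> children and turn title-like <p> children into <h2>, in place."""
    for p in list(root.findall("p")):
        value = "".join(p.itertext())   # string value, like "." in the XPath pattern
        if value == "":
            root.remove(p)              # empty paragraph: throw it away
        elif CHAPTER_TITLE.search(value):
            p.tag = "h2"                # chapter title: change the markup
    return root

# Hypothetical input with a title, an empty paragraph, and a real paragraph.
doc = ET.fromstring(
    "<text><p>I. A SAMPLE TITLE</p><p></p>"
    "<p>An <q>ordinary</q> paragraph.</p></text>"
)
result = ET.tostring(transform(doc), encoding="unicode")
print(result)
```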
When we poked our new regex into the value of the @match attribute of the template rule, we found ourselves short of quotation marks. XPath and XSLT don’t care about the difference between single and double quotation marks when used as delimiters, as long as whenever we use a pair the two parts match, so if we need to wrap some sort of quotation marks around a string that contains a single quotation mark, we can use the double ones for that purpose, and vice versa. The problem in this case is that the value of the @match attribute requires a set of quotation marks (all attribute values must be quoted in XML), the entire regex requires a set of quotation marks (it is part of the syntax of the matches() function that the regex pattern must be a string, and without the quotation marks it wouldn’t be a string), and we need a single quotation mark inside the regex. That is, we need quotation marks for three purposes and the character set just gives us two (we aren’t allowed to use curly quotes). XML (and therefore XPath and XSLT) works around this by permitting us to use entity representations of the single and double straight quotation marks, and these can be recognized as different from the literal characters. We’ve seen entities before; we used &lt;, &gt;, and &amp; to represent <, >, and &, respectively. XML also provides &quot; for the double straight quotation mark (") and &apos; for the single straight quotation mark, or apostrophe (').
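To see that an XML parser turns these entities back into ordinary quotation marks before XPath ever sees the attribute value, here is a small Python sketch using the standard-library ElementTree parser. The element name template (without the xsl: prefix) is a hypothetical stand-in that keeps the fragment parseable without a namespace declaration.

```python
import xml.etree.ElementTree as ET

# The attribute is delimited by double quotation marks, so the quotation
# marks that belong to the XPath expression are written as entities.
source = r'<template match="p[matches(.,&quot;^[IVX]+\. [A-Z \-,&apos;]+$&quot;)]"/>'

# After parsing, the entities have become literal " and ' characters.
match_value = ET.fromstring(source).get("match")
print(match_value)
```

The printed value is the XPath expression with literal quotation marks restored, which is what the XSLT processor evaluates.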
As an alternative, we could also have defined the regex pattern as an XSLT variable separately from the matches() function, using <xsl:variable> (see Kay for details), and then used the variable as the second argument to the matches() function:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:variable name="regex">^[IVX]+\. [A-Z \-,']+$</xsl:variable>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p[matches(.,'^$')]"/>
    <xsl:template match="p[matches(.,$regex)]">
        <h2>
            <xsl:apply-templates select="@*|node()"/>
        </h2>
    </xsl:template>
</xsl:stylesheet>
Here we define the pattern as a variable to which we assign the name $regex. Variable names are assigned without a dollar sign (in the <xsl:variable> element), but references to them include a dollar sign (in this case, as an argument to the matches() function). Because the variable can be defined without the wrapper quotation marks required by an attribute value, with this strategy we don’t run out of quotation marks. See Kay for details. Which strategy you use is a matter of personal preference.
Regular expressions may seem more complicated than they really are for a couple of reasons. First, they are very powerful, with a large number of features, each of which has a notation that must be learned. The good news is that for most purposes you need only a small subset of the available features. We don’t have them all memorized, either; we know the ones we use frequently, and we look up the others when needed. Second, the notation is cryptic. There’s no getting around the fact that learning the meaning of, say, caret and dollar sign is pretty much a matter of brute-force memorization. But the silver lining here, again, is that you don’t have to memorize very much because you can look up the details as you need them.