Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2022-08-10T20:44:28+0000
The <xsl:analyze-string> element uses regular expressions to parse a string of text and
identify substrings that match a particular regex pattern. Kay writes: “It is useful where
the source document contains text whose structure is not fully marked up using XML
elements and attributes.”
Consider the following XHTML document (adapted from a page that no longer exists, but that we found a few years ago at http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm):
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>The History of Oxford University Press</title>
<!-- from http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm -->
</head>
<body>
<p>The History of Oxford University Press</p>
<p>This major national and international scholarly project, which will be inaugurated on 1
January 2006, is a co-operative venture between Oxford University Press and the
Institute of English Studies. Its General Editor is Professor Simon Eliot who holds the
newly-created chair in the History of the Book in the Institute.</p>
<p>The History will consist of four volumes which will cover the following periods:</p>
<div>
<ul>
<li>Volume I 1478-1780s</li>
<li>Volume II 1780s-1890s</li>
<li>Volume III 1890s-1960s</li>
<li>Volume IV 1960s-2000</li>
</ul>
</div>
<p>Each volume will be edited by a distinguished scholar in the field and will consist of
chapters written by that scholar and specialists in book history, social and economic
history, history of scholarship and the history of science and technology.</p>
<p>Oxford University Press will be funding the equivalent of six years of postdoctoral
fellowships in order to provide the fundamental research on which the History will be
based. These fellowships will most likely be divided up in the following way:</p>
<div>
<ol>
<li>A three-year postdoctoral fellowship on the economic and business history of
the Press.</li>
<li>A one-year postdoctoral fellowship on the impact of technological and
communications revolutions on the Press.</li>
<li>A one-year postdoctoral fellowship on the origins and development of OUP's
branches in the USA and Canada.</li>
<li>A one-year postdoctoral fellowship on the origins and development of OUP's
branches in South East Asia.</li>
</ol>
</div>
<p>It is intended that appointments will be made to these fellowships in 2006 and 2007.</p>
<p>In addition, a major Book History research seminar series focusing, though not
exclusively so, on the History will be established. It is hoped that this will involve
members of the History and English faculties at Oxford, and members of the Institute of
Historical Research and the Institute of English Studies in the School of Advanced Study
in the University of London. Monthly meetings will be held alternately in Oxford and
London and will be open to all.</p>
<p>Updates and progress reports on this ambitious and exciting project will be posted on the
Institute web site from time to time.</p>
</body>
</html>
This document contains years, which are four-digit numbers, but they haven’t been tagged
as years. If, for example, we want to make the years clickable links that will take us
to a place where we can look up what happened in that year, we’ll need to insert the
markup. This is the sort of not-fully-marked-up text that Kay had in mind, and we can
add the markup we want by using a modified identity transformation and
<xsl:analyze-string>. For the purpose of this exercise, we’re going to use a resource at
http://www.historyorb.com/dates-by-year.php that allows us to look up what happened in a
particular year by going to, for example, http://www.historyorb.com/events/date/1960
(replacing the 1960 in the example with whatever year we care about). What we want, then,
is for each year in the input document to become a link in the output document that will
let us click on the year and look it up at this site. To simplify our task, we’ll cut a
few corners: we’ll treat every year reference as a single year (for example, when the text
says “1960s” we’ll just look up 1960), we won’t check for missing years (which means that
we might get an error message should we happen to look up a year that isn’t represented at
http://www.historyorb.com because nothing of interest happened then), and we’ll assume
that all four-digit numbers are years and all years are later than the year 999, that is,
that all years are four-digit years. In Real Life we’d have to evaluate whether those were
sensible assumptions given our data, and if not, we’d have to decide how to cope.
Here’s our stylesheet (discussion follows):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xpath-default-namespace="http://www.w3.org/1999/xhtml"
xmlns="http://www.w3.org/1999/xhtml"
version="3.0">
<xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
include-content-type="no" indent="yes"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:template match="text()">
<xsl:analyze-string select="." regex="\d{{4}}">
<xsl:matching-substring>
<a href="http://www.historyorb.com/events/date/{.}">
<xsl:value-of select="."/>
</a>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
We are transforming XHTML input into XHTML output, so our @xpath-default-namespace value
(which refers to the input document) is the HTML namespace (line 2), and we also set the
default output namespace to HTML (line 3).
We begin with an identity transformation, using the simplified syntax available in XSLT 3.0 (line 7):
<xsl:mode on-no-match="shallow-copy"/>
This says that all input nodes should be copied unmodified into the output, so if this were the only declaration in the stylesheet, it would just transform the input document into an informationally identical output document. This new default overrides the usual built-in behavior, which would throw away the tags of elements that don’t have an explicit template instead of copying them. Once it is in place, we have to write templates only for the nodes we want to change. In this case that’s just text nodes.
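For comparison, here is a sketch of the long-hand identity template that was commonly written before XSLT 3.0. It is not part of the stylesheet above; it is shown only on the assumption that seeing the explicit version helps clarify what the one-line shallow-copy declaration does on our behalf.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <!-- Long-hand identity template: match any attribute or child node,
         copy it shallowly, then recurse into its attributes and children -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Either way, any more specific template (such as one that matches text()) takes priority over the copy-everything default for the nodes it matches.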
Our only other template, then, processes text nodes. When we match a text node, we invoke
the <xsl:analyze-string> element, using the @select attribute to tell it to analyze the
current context item, represented by the dot (in this case the current context item is the
text node we just matched). The regular expression that will be used to perform the
analysis is specified as the value of the @regex attribute. The regular expression in this
case, regex="\d{{4}}", is designed to match any four-digit number. Let’s look at the
pieces:
The \d matches any single digit.
You may recall from our regex unit that {4} (single curly braces; stay tuned!) as a
repetition indicator means “exactly four instances of whatever precedes”, so \d{4} matches
exactly four consecutive digits.
Regex syntax requires that the repetition count appear in curly braces, but in the
<xsl:analyze-string> context, the @regex attribute is an attribute value template (AVT),
which means that anything that appears in curly braces would be interpreted not as part of
regex syntax, but as an XPath expression. We need, therefore, to escape the curly braces
by doubling them; this tells the AVT analyzer that it should pass real curly braces to the
regex analyzer, which then recognizes them as saying “match a digit exactly four times in
sequence”, that is, “match a four-digit number”.
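If the doubled braces feel error-prone, one possible alternative is to keep the regex in a variable and refer to that variable from the AVT. The sketch below assumes a variable named $year-regex (a hypothetical name, not part of the original stylesheet). Because @select on <xsl:variable> is an ordinary XPath expression rather than an AVT, the regex can be written there with single curly braces.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <!-- Hypothetical variable: @select is not an AVT, so single braces are fine -->
    <xsl:variable name="year-regex" as="xs:string" select="'\d{4}'"/>
    <xsl:template match="text()">
        <!-- Here the braces delimit an XPath expression, which is what we want -->
        <xsl:analyze-string select="." regex="{$year-regex}">
            <xsl:matching-substring>
                <a href="http://www.historyorb.com/events/date/{.}">
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

This should behave like the version with doubled braces; which one reads better is largely a matter of taste.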
The <xsl:analyze-string> element normally takes two child elements,
<xsl:matching-substring> and <xsl:non-matching-substring>. As the element names imply,
once the <xsl:analyze-string> parser breaks the string into parts that match the regex and
parts that don’t, one or the other of these child elements handles each of those parts. In
our stylesheet, for a non-matching substring (any string of text inside the text() node
that isn’t a four-digit number), we just output the value of that substring using the
<xsl:value-of> element. For any substring that matches, we create an HTML link (<a>) and
insert the appropriate value for the @href attribute, using an AVT to plug in the
four-digit number that we matched. Note that within <xsl:matching-substring> and
<xsl:non-matching-substring>, a dot refers to the individual matched or not-matched
substring being processed at the moment, and not to the entire text node.
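When only part of the matched substring is wanted, the regex-group() function can retrieve a parenthesized piece of the match. The following is a sketch of a variation, not the approach taken in this exercise: it assumes we would like a decade reference such as “1960s” to be clickable as a whole while still looking up just the year. The dot supplies the full matched text for the link, and regex-group(1) supplies the four digits for the URL.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <!-- Group 1 captures the four digits; the optional "s" stays outside the group -->
        <xsl:analyze-string select="." regex="(\d{{4}})s?">
            <xsl:matching-substring>
                <!-- regex-group(1) is the year alone; the dot is the whole match -->
                <a href="http://www.historyorb.com/events/date/{regex-group(1)}">
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```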
Here’s a snippet of the output:
<p>This major national and international scholarly project, which will be inaugurated on 1 January
<a href="http://www.historyorb.com/events/date/2006">2006</a>, is a co-operative venture between
Oxford University Press and the Institute of English Studies. Its General Editor is Professor Simon
Eliot who holds the newly-created chair in the History of the Book in the Institute.</p>
Note that the four-digit year is marked up as a link to the http://www.historyorb.com site,
but the one-digit part of “1 January 2006” is not. Similarly:
<li>Volume I <a href="http://www.historyorb.com/events/date/1478">1478</a>-
<a href="http://www.historyorb.com/events/date/1780">1780</a>s </li>
Here the four-digit years are tagged even when they are part of an expression like “1780s”.
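The earlier decision to treat every four-digit number as a year also means that \d{4} will match the first four digits of a longer number. Here is a minimal sketch of one way to guard against that, assuming it matters for your data: match whole runs of digits with \d+ and create a link only when the run is exactly four digits long. This is an illustrative variation, not part of the stylesheet discussed above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <!-- \d+ consumes an entire run of digits, however long it is -->
        <xsl:analyze-string select="." regex="\d+">
            <xsl:matching-substring>
                <xsl:choose>
                    <!-- Only runs of exactly four digits are treated as years -->
                    <xsl:when test="string-length(.) eq 4">
                        <a href="http://www.historyorb.com/events/date/{.}">
                            <xsl:value-of select="."/>
                        </a>
                    </xsl:when>
                    <!-- Shorter or longer numbers are output unchanged -->
                    <xsl:otherwise>
                        <xsl:value-of select="."/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

With this version a number like 12345 would pass through as plain text, while 1780 in “1780s” would still be linked.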
analyze-string() function
In addition to the XSLT <xsl:analyze-string> element, there is also an XPath
analyze-string() function that works in a similar way. It can be a little trickier to use
because it doesn’t have exact counterparts to <xsl:matching-substring> and
<xsl:non-matching-substring>. Because the XPath analyze-string() function was introduced
only in XPath 3.0, you won’t find it in Michael Kay’s book, but it is described in the
XPath functions spec and the Saxonica documentation.
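As a rough sketch of how the function might be used for the same task: analyze-string() returns an element whose children are <match> and <non-match> elements in the http://www.w3.org/2005/xpath-functions namespace, so one option is to apply templates to those children. The fn: prefix binding and the mode name year below are choices made for this illustration. Note that the regex is a plain XPath string here, not an AVT, so the curly braces are not doubled.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="#all"
    version="3.0">
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <!-- The function returns a wrapper element; /* selects its
             fn:match and fn:non-match children in document order -->
        <xsl:apply-templates select="analyze-string(., '\d{4}')/*" mode="year"/>
    </xsl:template>
    <!-- Plays the role of xsl:matching-substring -->
    <xsl:template match="fn:match" mode="year">
        <a href="http://www.historyorb.com/events/date/{.}">
            <xsl:value-of select="."/>
        </a>
    </xsl:template>
    <!-- Plays the role of xsl:non-matching-substring -->
    <xsl:template match="fn:non-match" mode="year">
        <xsl:value-of select="."/>
    </xsl:template>
</xsl:stylesheet>
```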