Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-08-10T20:44:28+0000


Using <xsl:analyze-string>

The <xsl:analyze-string> element uses regular expressions to parse a string of text and identify substrings that match a particular regex pattern. Kay writes: “It is useful where the source document contains text whose structure is not fully marked up using XML elements and attributes.”

Consider the following XHTML document (adapted from a page that no longer exists, but that we found a few years ago at http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm):

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
       <title>The History of Oxford University Press</title>
       <!-- from http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm -->
   </head>
   <body>
       <p>The History of Oxford University Press</p>
       <p>This major national and international scholarly project, which will be inaugurated on 1
            January 2006, is a co-operative venture between Oxford University Press and the
            Institute of English Studies. Its General Editor is Professor Simon Eliot who holds the
            newly-created chair in the History of the Book in the Institute.</p>
       <p>The History will consist of four volumes which will cover the following periods:</p>
       <div>
           <ul>
               <li>Volume I 1478-1780s</li>
               <li>Volume II 1780s-1890s</li>
               <li>Volume III 1890s-1960s</li>
               <li>Volume IV 1960s-2000</li>
           </ul>
       </div>
       <p>Each volume will be edited by a distinguished scholar in the field and will consist of
            chapters written by that scholar and specialists in book history, social and economic
            history, history of scholarship and the history of science and technology.</p>
       <p>Oxford University Press will be funding the equivalent of six years of postdoctoral
            fellowships in order to provide the fundamental research on which the History will be
            based. These fellowships will most likely be divided up in the following way:</p>
       <div>
           <ol>
               <li>A three-year postdoctoral fellowship on the economic and business history of
                    the Press.</li>
               <li>A one-year postdoctoral fellowship on the impact of technological and
                    communications revolutions on the Press.</li>
               <li>A one-year postdoctoral fellowship on the origins and development of OUP's
                    branches in the USA and Canada.</li>
               <li>A one-year postdoctoral fellowship on the origins and development of OUP's
                    branches in South East Asia.</li>
           </ol>
       </div>
       <p>It is intended that appointments will be made to these fellowships in 2006 and 2007.</p>
       <p>In addition, a major Book History research seminar series focusing, though not
            exclusively so, on the History will be established. It is hoped that this will involve
            members of the History and English faculties at Oxford, and members of the Institute of
            Historical Research and the Institute of English Studies in the School of Advanced Study
            in the University of London. Monthly meetings will be held alternately in Oxford and
            London and will be open to all.</p>
       <p>Updates and progress reports on this ambitious and exciting project will be posted on the
            Institute web site from time to time.</p>
   </body>
</html>

This document contains years, which are four-digit numbers, but they haven’t been tagged as years. If, for example, we want to make the years clickable links that will take us to a place where we can look up what happened in that year, we’ll need to insert the markup. This is the sort of not-fully-marked-up text that Kay had in mind, and we can add the markup we want by using a modified identity transformation and <xsl:analyze-string>. For the purpose of this exercise, we’re going to use a resource at http://www.historyorb.com/dates-by-year.php that allows us to look up whatever happened in a particular year by going to, for example, http://www.historyorb.com/events/date/1960 (replacing the 1960 in the example with whatever year we care about). What we want, then, is for each year in the input document to create a link in the output document that will let us click on the year and look it up at this site. To simplify our task, we’ll cut a few corners: we’ll treat every year reference as a single year (for example, when the text says 1960s we’ll just look up 1960), we won’t check for missing years (which means that we might get an error message should we happen to look up a year that isn’t represented at http://www.historyorb.com because nothing of interest happened then), and we’ll assume that all four-digit numbers are years and all years are later than the year 999, that is, that all years are four-digit years. In Real Life we’d have to evaluate whether those were sensible assumptions given our data, and if not, we’d have to decide how to cope.

Here’s our stylesheet (discussion follows):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no" 
        include-content-type="no" indent="yes"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <xsl:analyze-string select="." regex="\d{{4}}">
            <xsl:matching-substring>
                <a href="http://www.historyorb.com/events/date/{.}">
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>

We are transforming XHTML input into XHTML output, so our @xpath-default-namespace value (which refers to the input document) is the HTML namespace (line 2), and we also set the default output namespace to HTML (line 3).

We begin with an identity transformation, using the simplified syntax available in XSLT 3.0 (line 7):

<xsl:mode on-no-match="shallow-copy"/>

This says that any input node that isn’t matched by an explicit template should be copied into the output, with processing then continuing into its children. If the stylesheet contained nothing else, it would just transform the input document into an informationally identical output document. This new default replaces the usual built-in behavior, which would throw away the tags of elements that didn’t have an explicit template and keep only their text; we want to copy them instead. Once we’ve established the new default, we have to write templates only for the nodes we want to change. In this case that’s just text nodes.
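For comparison, before XSLT 3.0 the same default was usually established with an explicit identity template. The following is a minimal sketch of that older idiom; it is not part of the stylesheet above, and for the purposes of this exercise it should behave roughly like the <xsl:mode> declaration:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <!-- Identity template: matches any attribute, element, text node, -->
    <!-- comment, or processing instruction, makes a shallow copy of   -->
    <!-- it, and then processes its attributes and children in turn.   -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

As with the <xsl:mode> version, any more specific template we add (such as one that matches text()) takes precedence over this general one for the nodes it matches.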

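Our only other template, then, processes text nodes. When we match a text node, we invoke the <xsl:analyze-string> element. Its @select attribute tells it to analyze the current context item, represented by the dot (in this case the text node we just matched), and its @regex attribute specifies the regular expression that will be used to perform the analysis. The regular expression in this case, regex="\d{{4}}", is designed to match any four-digit number. Let’s look at the pieces: \d matches any single digit, and the quantifier {4} means exactly four of them in a row. The curly braces are doubled because @regex is an attribute value template (AVT), where single curly braces would be read as delimiting an embedded XPath expression; the doubled braces stand for literal ones, so the regex that is actually applied is \d{4}.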
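The brace doubling is easiest to see by putting the same pattern in two different kinds of attribute. The following is a small sketch of our own, not part of the stylesheet above: the $sample variable is a hypothetical test string, and the stylesheet is meant to be run without an input document by invoking the template named xsl:initial-template.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
    <xsl:output method="text"/>
    <xsl:template name="xsl:initial-template">
        <!-- Hypothetical sample string, for illustration only -->
        <xsl:variable name="sample" select="'1 January 2006'"/>
        <!-- @regex is an attribute value template, so the braces are doubled -->
        <xsl:analyze-string select="$sample" regex="\d{{4}}">
            <xsl:matching-substring>
                <xsl:value-of select="."/>
                <xsl:text>&#10;</xsl:text>
            </xsl:matching-substring>
        </xsl:analyze-string>
        <!-- @select holds a plain XPath expression, so the braces are single -->
        <xsl:value-of select="matches($sample, '\d{4}')"/>
    </xsl:template>
</xsl:stylesheet>
```

Under those assumptions this should output 2006 on one line and true on the next. Because there is no <xsl:non-matching-substring> here, the parts of the string that don’t match are simply dropped.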

The <xsl:analyze-string> element normally takes two child elements, <xsl:matching-substring> and <xsl:non-matching-substring>. As the element names imply, once <xsl:analyze-string> breaks the string into parts that match the regex and parts that don’t, one or the other of these child elements handles each of those parts. In our stylesheet, for a non-matching substring (any stretch of text inside the text() node that isn’t a four-digit number), we just output the value of that substring using the <xsl:value-of> element. For any substring that matches, we create an HTML link (<a>) and insert the appropriate value for the @href attribute, using an AVT to plug in the four-digit number that we matched. Note that within <xsl:matching-substring> and <xsl:non-matching-substring>, a dot refers to the individual matched or not-matched substring being processed at the moment, and not to the entire text node.

Here’s a snippet of the output:

<p>This major national and international scholarly project, which will be inaugurated on 1 January 
<a href="http://www.historyorb.com/events/date/2006">2006</a>, is a co-operative venture between
Oxford University Press and the Institute of English Studies. Its General Editor is Professor Simon 
Eliot who holds the newly-created chair in the History of the Book in the Institute.</p>

Note that the four-digit year is marked up as a link to the http://www.historyorb.com site, but the one-digit part of 1 January 2006 is not. Similarly:

<li>Volume I <a href="http://www.historyorb.com/events/date/1478">1478</a>-
<a href="http://www.historyorb.com/events/date/1780">1780</a>s </li>

Here the four-digit years are tagged even when they are part of an expression like 1780s.
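One of the corners we cut was assuming that every run of four digits is a year. With \d{4}, a longer number such as 12345 would have its first four digits turned into a link. The regex dialect used by XPath has no lookahead, lookbehind, or \b word-boundary assertions, so one possible workaround, sketched here as an assumption about what stricter behavior might look like rather than as part of the original exercise, is to match whole runs of digits and test their length:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <!-- Match every complete run of digits, whatever its length -->
        <xsl:analyze-string select="." regex="\d+">
            <xsl:matching-substring>
                <xsl:choose>
                    <!-- Exactly four digits: treat the run as a year and link it -->
                    <xsl:when test="string-length(.) eq 4">
                        <a href="http://www.historyorb.com/events/date/{.}">
                            <xsl:value-of select="."/>
                        </a>
                    </xsl:when>
                    <!-- Any other length: output the digits unchanged -->
                    <xsl:otherwise>
                        <xsl:value-of select="."/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

For our sample document the result should be the same as before, since 1780s still contains a run of exactly four digits and the 1 in 1 January 2006 is still left alone. The difference would show up only with input that contains longer numbers.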

The XPath analyze-string() function

In addition to the XSLT <xsl:analyze-string> element, there is also an XPath analyze-string() function that performs the same kind of analysis. It can be a little trickier to use because it doesn’t have exact counterparts to <xsl:matching-substring> and <xsl:non-matching-substring>; instead, it returns an XML structure that reports the matching and non-matching substrings, which you then have to process yourself. Because the XPath analyze-string() function was introduced only in XPath 3.0, you won’t find it in Michael Kay’s book, but it is described in the XPath functions spec and the Saxonica documentation.
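As a rough sketch of how the function might be used for the same task (our own adaptation of the stylesheet above, not taken from Kay or from the spec): analyze-string() returns an <analyze-string-result> element whose children are <match> and <non-match> elements in the http://www.w3.org/2005/xpath-functions namespace, bound here to the prefix fn, which is our choice. We can iterate over those children and branch on which kind each one is:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xpath-default-namespace="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="fn"
    version="3.0">
    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>
    <xsl:mode on-no-match="shallow-copy"/>
    <xsl:template match="text()">
        <!-- The regex sits inside an XPath string literal in @select, -->
        <!-- which is not an AVT, so the braces are not doubled here.  -->
        <xsl:for-each select="analyze-string(., '\d{4}')/*">
            <xsl:choose>
                <!-- fn:match plays the role of xsl:matching-substring -->
                <xsl:when test="self::fn:match">
                    <a href="http://www.historyorb.com/events/date/{.}">
                        <xsl:value-of select="."/>
                    </a>
                </xsl:when>
                <!-- fn:non-match plays the role of xsl:non-matching-substring -->
                <xsl:otherwise>
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Inside the <xsl:for-each>, the dot refers to the current <fn:match> or <fn:non-match> element, whose string value is the corresponding substring, so the body can stay almost identical to the <xsl:analyze-string> version.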