Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2016-08-28T20:35:34+0000


Using <xsl:analyze-string>

The <xsl:analyze-string> element uses regular expressions to parse a string of text and identify substrings that match a particular regex pattern. Kay writes: “It is useful where the source document contains text whose structure is not fully marked up using XML elements and attributes.”

Consider the following XHTML document (adapted from a page that no longer exists, but that we found a few years ago at http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm):

<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
       <title>The History of Oxford University Press</title>
       <!-- from http://ies.sas.ac.uk/cmps/Projects/OUP/index.htm -->
   </head>
   <body>
       <p>The History of Oxford University Press</p>
       <p>This major national and international scholarly project, which will be inaugurated on 1
            January 2006, is a co-operative venture between Oxford University Press and the
            Institute of English Studies. Its General Editor is Professor Simon Eliot who holds the
            newly-created chair in the History of the Book in the Institute.</p>
       <p>The History will consist of four volumes which will cover the following periods:</p>
       <div>
           <ul>
               <li>Volume I 1478-1780s</li>
               <li>Volume II 1780s-1890s</li>
               <li>Volume III 1890s-1960s</li>
               <li>Volume IV 1960s-2000</li>
           </ul>
       </div>
       <p>Each volume will be edited by a distinguished scholar in the field and will consist of
            chapters written by that scholar and specialists in book history, social and economic
            history, history of scholarship and the history of science and technology.</p>
       <p>Oxford University Press will be funding the equivalent of six years of postdoctoral
            fellowships in order to provide the fundamental research on which the History will be
            based. These fellowships will most likely be divided up in the following way:</p>
       <div>
           <ol>
               <li>A three-year postdoctoral fellowship on the economic and business history of
                    the Press.</li>
               <li>A one-year postdoctoral fellowship on the impact of technological and
                    communications revolutions on the Press.</li>
               <li>A one-year postdoctoral fellowship on the origins and development of OUP's
                    branches in the USA and Canada.</li>
               <li>A one-year postdoctoral fellowship on the origins and development of OUP's
                    branches in South East Asia.</li>
           </ol>
       </div>
       <p>It is intended that appointments will be made to these fellowships in 2006 and 2007.</p>
       <p>In addition, a major Book History research seminar series focusing, though not
            exclusively so, on the History will be established. It is hoped that this will involve
            members of the History and English faculties at Oxford, and members of the Institute of
            Historical Research and the Institute of English Studies in the School of Advanced Study
            in the University of London. Monthly meetings will be held alternately in Oxford and
            London and will be open to all.</p>
       <p>Updates and progress reports on this ambitious and exciting project will be posted on the
            Institute web site from time to time.</p>
   </body>
</html>

This document contains years, which are four-digit numbers, but they haven’t been tagged as years. If, for example, we want to make the years clickable links that will take us to a place where we can look up what happened in that year, we’ll need to insert the markup. This is the sort of not-fully-marked-up text that Kay had in mind, and we can add the markup we want by using a modified identity transformation and <xsl:analyze-string>.

For the purpose of this exercise, we’re going to use a resource at http://www.historyorb.com/dates-by-year.php that allows us to look up whatever happened in a particular year by going to, for example, http://www.historyorb.com/events/date/1960 (replacing the 1960 in the example with whatever year we care about). What we want, then, is for each year in the input document to create a link in the output document that will let us click on the year and look it up at this site.

To simplify our task, we’ll cut a few corners: we’ll treat every year reference as a single year (for example, when the text says 1960s we’ll just look up 1960), we won’t check for missing years (which means that we might get an error message should we happen to look up a year that isn’t represented at http://www.historyorb.com because nothing of interest happened then), and we’ll assume that all four-digit numbers are years and all years are later than the year 999, that is, that all years are four-digit years. In Real Life we’d have to evaluate whether those were sensible assumptions given our data, and if not, we’d have to decide how to cope.

Here’s our stylesheet (discussion follows):

<xsl:stylesheet xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.w3.org/1999/xhtml" version="2.0">
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="text()">
        <xsl:analyze-string select="." regex="\d{{4}}">
            <xsl:matching-substring>
                <a href="http://www.historyorb.com/events/date/{.}">
                    <xsl:value-of select="."/>
                </a>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>

We begin with the identity transformation, and because our input document has no attributes, we’re using a simplified template that doesn’t need to match attributes (in the @match attribute of the <xsl:template> element) or process them (in the @select attribute of the <xsl:apply-templates> element). Because we’re writing a separate rule for text() nodes (since we need to parse them to look for dates), our basic identity template has to match only elements, which we can do with an asterisk, which means any element.
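If the input did contain attributes, the simplified template would silently drop them, because <xsl:copy> applied to an element does not copy its attributes and <xsl:apply-templates> without a @select attribute visits only child nodes. A minimal sketch of a variant that keeps attributes, assuming we still want the separate text() rule, might look like the following (it is an illustration, not part of the stylesheet above):

```xml
<!-- Sketch: identity template extended to copy attributes as well as elements.
     Text nodes are still left to the separate text() template. -->
<xsl:template match="@* | *">
    <xsl:copy>
        <!-- @* selects the attributes; node() selects the child nodes -->
        <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
</xsl:template>
```

Comments and processing instructions are matched by neither template in this sketch, so the built-in rules discard them, just as they do in the simplified version.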

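Our other template rule processes text() nodes. When we match a text() node, we invoke the <xsl:analyze-string> element, passing it something to parse (the text() node we just matched, represented by the dot, since it’s the current context node) and a regular expression that will be used to parse it. The regular expression in this case, regex="\d{{4}}", is designed to match any four-digit number. Let’s look at the pieces: \d matches any single digit, and the quantifier {4} requires exactly four of them in a row. The curly braces are doubled because the @regex attribute is an attribute value template (AVT), in which single curly braces would delimit an XPath expression; writing {{ and }} gives us the literal braces that the regex needs.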
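Because @regex is read as an attribute value template, one possible alternative to doubling the braces is to keep the pattern in a variable and refer to it from the attribute. The sketch below assumes a hypothetical top-level variable named year-regex; it should behave the same way as the template in our stylesheet:

```xml
<!-- Sketch: year-regex is a hypothetical variable name. @select holds an XPath
     expression, not an AVT, so the braces in the string literal are not doubled. -->
<xsl:variable name="year-regex" select="'\d{4}'"/>

<xsl:template match="text()">
    <!-- The AVT {$year-regex} evaluates to the string \d{4} -->
    <xsl:analyze-string select="." regex="{$year-regex}">
        <xsl:matching-substring>
            <a href="http://www.historyorb.com/events/date/{.}">
                <xsl:value-of select="."/>
            </a>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>
```

Whether this is clearer than the doubled braces is a matter of taste; it mostly pays off when the same pattern is used in more than one place.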

The <xsl:analyze-string> element normally takes two child elements, <xsl:matching-substring> and <xsl:non-matching-substring>. As the element names imply, once the <xsl:analyze-string> parser breaks the string into parts that match the regex and parts that don’t, one or the other of these child elements handles each of those parts. In our stylesheet, for a non-matching substring (any string of text inside the text() node that isn’t a four-digit number), we just output the value of that substring using the <xsl:value-of> element. For any substring that matches, we create an HTML link (<a>) and insert the appropriate value for the @href attribute, using an AVT to plug in the four-digit number that we matched. Note that within <xsl:matching-substring> and <xsl:non-matching-substring>, a dot refers to the specific substring, and not to the entire text() node.
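Inside <xsl:matching-substring> we can also refer to just part of the match by using parentheses in the regex and the regex-group() function. As a hedged illustration (a variation on our stylesheet, not a replacement for it), the sketch below lets an optional trailing s become part of the link text, so that 1780s is clickable as a whole, while the @href still uses only the four digits:

```xml
<xsl:template match="text()">
    <!-- (\d{4}) captures the year as group 1; s? optionally consumes a trailing "s".
         The braces are doubled because @regex is an attribute value template. -->
    <xsl:analyze-string select="." regex="(\d{{4}})s?">
        <xsl:matching-substring>
            <!-- regex-group(1) is just the four digits; the dot is the whole match -->
            <a href="http://www.historyorb.com/events/date/{regex-group(1)}">
                <xsl:value-of select="."/>
            </a>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>
```

This keeps the same corner-cutting assumption as before: a decade such as 1780s is still looked up as the single year 1780.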

Here’s a snippet of the output:

<p>This major national and international scholarly project, which will be inaugurated on 1 January 
<a href="http://www.historyorb.com/events/date/2006">2006</a>, is a co-operative venture between
Oxford University Press and the Institute of English Studies. Its General Editor is Professor Simon Eliot
who holds the newly-created chair in the History of the Book in the Institute.</p>

Note that the four-digit year is marked up as a link to the http://www.historyorb.com site, but the single digit in 1 January 2006 is not, because it doesn’t match the four-digit pattern. Similarly:

<li>Volume I <a href="http://www.historyorb.com/events/date/1478">1478</a>-
<a href="http://www.historyorb.com/events/date/1780">1780</a>s </li>

Here the four-digit years are tagged even when they are part of an expression like 1780s.
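Because \d{4} matches any run of four digits, it would also tag the first four digits of a longer number: a hypothetical 12345 in the input would yield a link for 1234 followed by a bare 5. Our sample document contains no such numbers, but if our data did, one way to cope might be to match whole runs of digits and create a link only when the run is exactly four digits long. The following is a sketch under that assumption:

```xml
<xsl:template match="text()">
    <!-- \d+ matches an entire run of digits, however long it is -->
    <xsl:analyze-string select="." regex="\d+">
        <xsl:matching-substring>
            <xsl:choose>
                <!-- Only a run of exactly four digits is treated as a year -->
                <xsl:when test="string-length(.) eq 4">
                    <a href="http://www.historyorb.com/events/date/{.}">
                        <xsl:value-of select="."/>
                    </a>
                </xsl:when>
                <!-- Any other number is copied through unchanged -->
                <xsl:otherwise>
                    <xsl:value-of select="."/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>
```

This version still assumes that every four-digit number is a year; it only stops longer numbers from being split into a spurious year plus leftover digits.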