Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2015-08-23T21:02:02+0000


Thinking in algorithms

The problem

Definitions of the term algorithm vary, but what they all have in common is that an algorithm is a formal series of steps that can be used to accomplish a task. Before you write a line of code, you’ll want to think about whether you can come up with an algorithm for solving a programming task. Suppose your task is to create a table of contents for Hamlet, along the lines of:

Forgetting about XSLT for the moment, how can you think about this task? Here are two very different perspectives:

  1. Read through the play yourself and notice that there are five acts, that Act 1 has five scenes, that Act 2 has two scenes, etc. After you’ve tracked all of this down by looking through the text yourself, have your XSLT create, all at once, the entire HTML document above. Where you want it to say Act 4, Scene 2, for example, you type those exact characters into your XSLT, and you do that manually for all of the acts and all of the scenes.
  2. Have the XSLT do all of the work, so that you don’t even have to know how many acts there are, or how many scenes each act has. Have your XSLT find all of the acts in the play, no matter how many there are and no matter where they are. Print the header for each one. Then, before you go on to the next act, have the XSLT find all of the scenes in that act and print their headers.

Here’s another way to think about the difference between those two models. The first says (we’ll simplify the problem for the moment and talk just about the acts, as if there weren’t any scenes) print a title for Act 1, then for Act 2, then for Act 3, then for Act 4, then for Act 5, and then stop. The second says find all the acts, no matter how many there are, and process them in the order in which they occur. When you run out of acts, you’re done.

The first approach requires you to do all the work. If you’re going to type up, manually, each of the headers, you might as well just write the HTML manually. Furthermore, the first approach doesn’t even need to look at the play; the human has done that already and created the output, instead of letting the XSLT create the output in response to what it finds in the play. A big problem with having the human write the output separately and explicitly for each act and scene is that when that’s done, all you have is the table of contents for that one play. If you want to get a table of contents for another play (which may have a different number of acts, with a different number of scenes in each act), you have to build that manually, as well. The second approach, on the other hand, will work for any play, no matter how many acts or scenes it has, and it (the program, not you) has to look at the play to determine how many acts and scenes there are.

Evaluating algorithms

The preceding two strategies can both be considered algorithms, or step-by-step instructions for completing the task, although the first is so brute force that some programmers would consider it a non-algorithmic solution because it really says don’t even look at the play; just print this stuff that I’ve typed up manually. The second algorithm is clearly superior for two reasons: 1) it lets the computer do more of the work (the human doesn’t even have to know the act and scene structure of the play) and 2) it works for any play with any number of acts or scenes.

A brute-force solution

Here’s a solution that requires the human to do all the work:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0" xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xhtml" indent="yes"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Hamlet</title>
            </head>
            <body>
                <ul>
                    <li>Act 1 <ul>
                            <li>Act 1, Scene 1</li>
                            <li>Act 1, Scene 2</li>
                            <li>Act 1, Scene 3</li>
                            <li>Act 1, Scene 4</li>
                            <li>Act 1, Scene 5</li>
                        </ul>
                    </li>
                    <li>Act 2 <ul>
                            <li>Act 2, Scene 1</li>
                            <li>Act 2, Scene 2</li>
                        </ul>
                    </li>
                    <li>Act 3 <ul>
                            <li>Act 3, Scene 1</li>
                            <li>Act 3, Scene 2</li>
                            <li>Act 3, Scene 3</li>
                            <li>Act 3, Scene 4</li>
                        </ul>
                    </li>
                    <li>Act 4 <ul>
                            <li>Act 4, Scene 1</li>
                            <li>Act 4, Scene 2</li>
                            <li>Act 4, Scene 3</li>
                            <li>Act 4, Scene 4</li>
                            <li>Act 4, Scene 5</li>
                            <li>Act 4, Scene 6</li>
                            <li>Act 4, Scene 7</li>
                        </ul>
                    </li>
                    <li>Act 5 <ul>
                            <li>Act 5, Scene 1</li>
                            <li>Act 5, Scene 2</li>
                        </ul>
                    </li>
                </ul>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

This creates the correct output, but it doesn’t extract any information at all from the input XML document. The human has written the entire output HTML by hand, and the XSLT just produces it all at once. The XSLT doesn’t figure out how many acts or scenes there are (the human had to do that manually), and it also doesn’t know what the <head> elements say because it never looks at them. The human has counted the acts and scenes, hard-coded where each one should be cited, and has manually duplicated the information that could have been extracted automatically from the <head> elements in the input XML document.

A slightly less brute-force solution

Here’s a compromise approach:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0" xmlns="http://www.w3.org/1999/xhtml">
    <xsl:output method="xhtml" indent="yes"/>
    <xsl:template match="/">
        <html>
            <head>
                <body>
                    <ul>                        
                        <li>
                            <xsl:apply-templates select="//body/div[1]/head"/>
                            <ul>
                                <li>
                                    <xsl:apply-templates select="//body/div[1]/div[1]/head"/>
                                </li>
                                
                                <li>
                                    <xsl:apply-templates select="//body/div[1]/div[2]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[1]/div[3]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[1]/div[4]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[1]/div[5]/head"/>
                                </li>
                            </ul>
                        </li>
                        <li>
                            <xsl:apply-templates select="//body/div[2]/head"/>
                            <ul>
                                <li>
                                    <xsl:apply-templates select="//body/div[2]/div[1]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[2]/div[2]/head"/>
                                </li>
                            </ul>                           
                        </li>
                        <li>
                            <xsl:apply-templates select="//body/div[3]/head"/>
                            <ul>
                                <li>
                                    <xsl:apply-templates select="//body/div[3]/div[1]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[3]/div[2]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[3]/div[3]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[3]/div[4]/head"/>
                                </li>
                            </ul>                           
                        </li>
                        <li>
                            <xsl:apply-templates select="//body/div[4]/head"/>
                            <ul>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[1]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[2]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[3]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[4]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[5]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[6]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[4]/div[7]/head"/>
                                </li>
                            </ul> 
                        </li>
                        <li>
                            <xsl:apply-templates select="//body/div[5]/head"/>
                            <ul>
                                <li>
                                    <xsl:apply-templates select="//body/div[5]/div[1]/head"/>
                                </li>
                                <li>
                                    <xsl:apply-templates select="//body/div[5]/div[2]/head"/>
                                </li>
                            </ul>
                        </li>
                    </ul>
                </body>
            </head>
        </html>
    </xsl:template>
</xsl:stylesheet>

This solution also requires the human to figure out exactly how many acts and scenes there are, and it manually places each one where has to be. If the programmer mistypes a number in a predicate or miscounts the number of scenes or types something out of order, it gets the wrong answer. It also can’t create a table of contents for any play that has a different number of acts or scenes. Unlike the preceding example, which doesn’t even look into the input XML document, this one does require access to the input XML version of the play because it is extracting the value of the <head> elements from there. Nonetheless, this version requires the human, rather than the computer, to determine how many acts and scenes there are and where they should be placed in the output HTML.

A good solution

Here is one solution based on the second approach, that is, a solution that 1) lets the computer figure out how many acts and scenes there are and 2) will work with any play with any number of acts and scenes:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0" xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs" version="2.0">
    <xsl:output method="xhtml" doctype-system="about:legacy-compat"/>
    <xsl:template match="/">
        <html>
            <head>
                <title>Play structure</title>
            </head>
            <body>
                <ul>
                    <xsl:apply-templates select="//body/div"/>
                </ul>
            </body>
        </html>
    </xsl:template>
    <xsl:template match="body/div">
        <li>
            <xsl:apply-templates select="head"/>
            <ul>
                <xsl:apply-templates select="div"/>
            </ul>
        </li>
    </xsl:template>
    <xsl:template match="div">
        <li>
            <xsl:apply-templates select="head"/>
        </li>
    </xsl:template>
</xsl:stylesheet>

The first template rule, for the document node (the entire document) says create an HTML document for the output, create an unordered list (<ul>) inside the body of that HTML document, and then do something for every act in the play no matter how many there are (//body/div). The second template processes each act by creating a list item (<li>) for it and, inside that item, outputting the value of the <head> for that act (Act 1, Act 2, etc.), creating a new <ul> to list the scenes in that act, and then processing each scene. The third template creates a list item for each scene.

Note that this program doesn’t have to know how many acts and scenes there are, and it doesn’t have to know exactly what each <head> says. It follows the second algorithm, above, and lets the computer count the acts and scenes, copy the output text from the input XML document, and know when to stop because it has run out of acts and scenes.

Taking stock

XSLT is a transformation language, and it is designed to take an XML input document and transform it into something else. An XSLT stylesheet that doesn’t extract any information at all from the input XML isn’t really performing a transformation. And an XSLT stylesheet that requires the human to do things that the computer can do much more easily and reliably, such as count the acts and scenes and get them in the right order, isn’t taking advantage of the algorithmic facilities that computational tools provide.

When you approach an XSLT transformation problem, before you write a line of code, think about how much of the answer is already latent in your input XML. In this case, the number of act and scenes, the order in which they appear, and the value of their <head> elements were all already present in the input XML. All that information should be extracted from the input XML by the XSLT stylesheet, and not recreated manually by the human. What the human has to do is supply information not present in the input XML (such as, for example, the fact that the output should be in the form of HTML unordered lists) and specify when specific bits of information in the input XML should should be fetched and where they should be written in the output. As a rule of thumb, if the computer can do it for you, don’t do it yourself.

Another example: numbering

If you think back to the first XML encoding assignments you did in this course, a lot of you created numbered output, along the lines of the following hypothetical recipe for preparing a cup of coffee:

<recipe>
    <step n="1">Brew coffee</step>
    <step n="2">Pour coffee</step>
    <step n="3">Add cream and sugar, if desired</step>
</recipe>

We told you then that it was usually a Bad Idea to number items because you could mistype and make a mistake, and because the computer can count for you. Now that you know XPath, you know that you could have written:

<recipe>
    <step>Brew coffee</step>
    <step>Pour coffee</step>
    <step>Add cream and sugar, if desired</step>
</recipe>

and you could then use the XPath position() function to tell you the number (ordinal position) of each step in the series. For example, transforming that XML with the following XLST:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes"/>
    <xsl:template match="/">
        <output_recipe>
            <xsl:apply-templates select="//step"/>
        </output_recipe>
    </xsl:template>
    <xsl:template match="step">
        <output_step>
            <xsl:text>Step #</xsl:text>
            <xsl:value-of select="position()"/>
            <xsl:text>: </xsl:text>
            <xsl:apply-templates/>
        </output_step>
    </xsl:template>
</xsl:stylesheet>

will produce

<?xml version="1.0" encoding="UTF-8"?>
<output_recipe>
    <output_step>Step #1: Brew coffee</output_step>
    <output_step>Step #2: Pour coffee</output_step>
    <output_step>Step #3: Add cream and sugar, if desired</output_step>
</output_recipe>

Note that the steps were not numbered explicitly in the input XML. The numbers were added by the XSLT stylesheet during the transformation. The big payoff for letting the computer do the numbering is that if you later decide to add a step (e.g., Grind coffee beans), you don’t have to change the step numbers manually.

Development strategies that can deal gracefully with different types of input are considered robust, which means that they don’t break easily (the opposite of robust is fragile or brittle). Manual numbering, where the computer could have handled it for you, is fragile because you can mistype, and also because if you modify your content, you may have to modify all of the numbering, including the numbers you’ve already entered for other items. There are times when you might need to enter numbers manually, though. For example, if you’re transcribing a document that was originally misnumbered and you want to record that original numbering, with the error, you’ll have to enter that yourself. In most cases, though, for more robust results, you’ll want to rely on XPath and XSLT to take care of this type of task for you.