Regex assignment #1

Maintained by: David J. Birnbaum (djbpitt@gmail.com)

Last modified: 2022-09-20T22:22:46+0000

The task

Your goal is to produce an XML version of a plain text edition of Shakespeare’s sonnets by using search-and-replace techniques with regular expressions and submit a write-up of your conversion process (not the XML output) as a markdown document. The XML you are aiming to create should look something like http://dh.obdurodon.org/shakespeare-sonnets.xml. That is, each sonnet should be its own element, each line should be tagged separately, and the roman numerals should be encoded in a useful way (we’ve used attributes, but you could also put them in a child element).

What to submit

Your write-up can be brief, but it should provide enough information to enable us to duplicate the procedure you followed. It must give your exact regular expressions, and those must be surrounded by backticks (` characters), so that they will look like code snippets when the markdown is interpreted and rendered.

Short version

Submit a prose description of your regex conversion process as a markdown (not word processor, not plain text) file. The filename extension for markdown files is .md. We know that this is your first submission in markdown, so if you get stuck, post an inquiry in the #markdown channel in our Slack workspace.

Longer version

We describe below how to use regex to transform plain text into XML, and following along with the description and performing the transformation is the goal of this assignment. We don’t need to see your XML, but we do need to see a step-by-step explanation of how you used regex to create it. That explanation must be a markdown document (not a word-processor or plain-text document), for two reasons. A word processor might convert your straight quotation marks to curly ones or make other typographic changes that will corrupt your regex (curly quotation marks are not legal replacements for straight ones in regex). Plain text does not distinguish clearly when you are writing a regular expression and when you are writing normal text.

Okay, so how do I create a markdown document to submit for homework?

You can create your markdown explanation of your regex conversion process in <oXygen/>. Create a new document in <oXygen/>, selecting “Markdown” as the document type. <oXygen/> will open two windows side-by-side; you enter your markdown in the one on the left, and the one on the right (which will say DITA XDITA HTML along the bottom, with HTML selected) will interpret and display your markdown, similarly to the Preview view in GitHub Issues, as you type. You must surround your regex snippets with backticks (` characters), which will cause them to be formatted as code (in the right side of the <oXygen/> interface they will be in a monospaced font, and the backticks will not be rendered as literal characters). Using backticks to format your code snippets lets you (and us) distinguish clearly when characters are prose and when they are intended to represent the regular expressions you are creating.

Write up your process on the left side of the <oXygen/> display, hitting the “Enter” or “Return” key at the end of each line to keep the line length manageable (<oXygen/> does not wrap long lines automatically, and pretty-printing doesn’t work with markdown documents). Save your document with an .md filename extension, which is the conventional extension for markdown documents.

If you are already comfortable with a different markdown editor you are welcome to use it, as long as it lets you verify that you have written your markdown correctly, that is, that your markdown is being interpreted in a way that renders your content correctly. If you are a Mac user, we are partial to an open-source macOS editor called MacDown (https://macdown.uranusjr.com/) because we find the formatted view easier to read than the formatted view provided inside <oXygen/>. Other popular editors for creating markdown include Visual Studio Code, MarkText, and Obsidian.

How to proceed

There are several ways to get to the target output, but here is how we might approach the task:

Reserved characters

The plain text file could, at least in principle, contain characters that have special meaning in XML: the ampersand and the angle brackets. You need to search for those and replace them with their corresponding XML entities; if you don’t remember the entity strings, you can look them up in the Entities and numerical character references section of http://dh.obdurodon.org/what-is-xml.xhtml. Note that if they are all present in a document, you will need to process them in the correct order. What is that order, and why is it important?

Title and author

The title and author at the top are going to have to be tagged manually. You can either remove them now and then paste them back in later, after you’ve tagged the sonnets, or you can leave them in place and fix them up at the end. You’ll use global find-and-replace to tag the sonnets, and if you leave the title and author in place while you do that, you’ll wind up tagging them incorrectly. That isn’t a problem as long as you remember to fix them manually at the end.

To perform regex searching, you need to check the box labeled Regular expression at the bottom of the <oXygen/> find-and-replace dialog box, which you open with Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/> will just search for what you type literally, and it won’t recognize that some characters in regex have special meaning. You don’t have to check anything else yet. Be sure that Dot matches all is unchecked, though; we’ll explain why below.

Leading space characters

The non-blank lines all begin with space characters: there are two spaces before most lines (the Roman numerals and the first twelve lines of each sonnet) and four spaces before the last two lines of every sonnet. Those spaces are presentational formatting, and not part of the content of the text, and since we don’t need them in order to tag the text, we’ll start by deleting them. The regex to match a space character is just a space character, and you can match one or more space characters by using the plus sign repetition indicator. To match one or more instances of the letter X, you would use a regex like X+. To match one or more instances of a space character, just replace the X with a space.

You don’t want to remove all space characters, though; you just want to remove the ones at the beginning of a line. You can do that by using the caret metacharacter, which anchors a match so that it succeeds only at the beginning of a line. For example, if the regex X+ matches one or more instances of X, the regex ^X+ matches one or more instances of X only at the beginning of a line. You can use this information to match one or more space characters at the beginning of a line and replace them with nothing, that is, delete them.
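If you would like to see the effect of the caret outside <oXygen/>, the following is a minimal Python sketch that uses the same X example as the paragraph above. The sample text is made up for illustration, and one detail is specific to Python rather than to <oXygen/>: Python needs the re.MULTILINE flag before the caret will match at the start of every line instead of only at the start of the whole text.

```python
import re

# Made-up sample text: some lines start with X, one has X only at the end.
text = "XXalpha\nbetaXX\nXgamma"

# Without an anchor, X+ matches runs of X anywhere in the text.
anywhere = re.findall(r"X+", text)

# With ^ (plus MULTILINE, a Python-specific requirement), only runs of X
# at the beginning of a line match.
leading = re.findall(r"^X+", text, flags=re.MULTILINE)

# Replacing those leading runs with an empty string deletes them.
stripped = re.sub(r"^X+", "", text, flags=re.MULTILINE)

print(anywhere)
print(leading)
print(stripped)
```

The X characters at the end of the second line survive the replacement because the caret restricts the match to the beginning of a line.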

We aren’t going to use the blank lines in this approach, so you can delete those if you’d like, or you can leave them in place if you find that they enhance the legibility. To delete them, you need to match a blank line, and the easiest way to do that is to match two new line characters in a row and replace them with a single new line character. The regex for a new line character is \n. Try it.
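The same replacement can be sketched in Python, again with made-up sample text. This is only an illustration of the technique, not a required part of the assignment.

```python
import re

# Made-up sample text with a blank line between each pair of content lines.
text = "line one\n\nline two\n\nline three\n"

# Two new line characters in a row mean there is a blank line between them;
# replacing the pair with a single new line removes the blank line.
collapsed = re.sub(r"\n\n", "\n", text)

print(collapsed)
```

One thing to watch for: if a file contains two or more blank lines in a row, a single pass leaves some of them behind, so you may need to repeat the replacement until nothing more is matched.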

Inside out or outside in

We can create our markup either from the outside in (document, then sonnet, then divide the sonnet into Roman numeral and lines) or from the inside out (lines and Roman numeral, then wrap those in a sonnet, then wrap all of the sonnets in a document). Either strategy can be made to work, but we generally find it easier to work from the inside out because when we work from outside in, it’s easy to wind up incorrectly wrapping <line> tags around the <sonnet> start and end tags, etc.

Lines

We’ll start by tagging every line as a <line>, which we’ll do with a find-and-replace operation that finds each line individually and wraps tags around it. This will erroneously tag the Roman numerals as if they were lines of poetry, which they aren’t, but it’s easier to let the first find-and-replace overgeneralize and then go back and retag the Roman numerals than to try to write a more constrained regex that won’t overgeneralize. We don't want to tag blank lines (if we left them in), though, so we need a regex that matches only lines that have characters in them.

Remember where we told you above to make sure that Dot matches all was unchecked? Normally the dot (.) matches any character except a new line, which means that we can use the plus sign repetition indicator to match one or more instances of any character except a new line (that is, .+). By default regex selects the longest possible match (the technical term is that it matches greedily by default). One or more instances of any character can be satisfied by one or two or three or more characters, up to the end of the line, so each match extends to the end of its line but no further, and the result is a separate match for each entire line. Try it and examine the results. Now check Dot matches all, run Find all, and look at those results. Notice that the match no longer stops at the end of the line, so you get just one big, run-on match, and since you want to tag each line individually, you need to uncheck that box to revert to the normal, default behavior. (You can undo a find-and-replace operation in <oXygen/> with control-z in Windows and command-z in macOS, so you can try something and then undo it if you’re not happy with the results.)
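The difference between the two settings can also be reproduced in Python, where the re.DOTALL flag plays roughly the role of the Dot matches all checkbox. The sketch below uses made-up sample text and is meant only to illustrate the behavior described above.

```python
import re

# Made-up three-line sample text.
text = "first line\nsecond line\nthird line"

# Default behavior: the dot does not match a new line, so the greedy .+
# stops at the end of each line and we get one match per line.
per_line = re.findall(r".+", text)

# With DOTALL the dot also matches new lines, so the greedy .+ runs on
# to the end of the text and we get a single match.
run_on = re.findall(r".+", text, flags=re.DOTALL)

print(len(per_line), per_line)
print(len(run_on), run_on)
```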

A human might think of our task as wrap every line in <line> tags, but regex has a find-and-replace view of the world, so a regex way to think about it would be match every line, remove it, and replace it with itself wrapped in <line> tags. That is, regex doesn’t think about leaving the line in place and inserting something before and after it; it thinks about matching the line, deleting it, and then putting it back, but with the addition of the desired tags. The regex selects and matches each full line, but that will be different for every line, so how do we write what we selected into the replacement string? The answer is that the sequence \0 in the replacement pattern means the entire regex match, and you can use that to write the matched line back into the replacement, but wrapped in <line> tags. Try it.
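Here is a minimal Python sketch of the match-and-put-back idea, using made-up sample text. Note one difference in notation, which applies to Python and not to <oXygen/>: in a Python replacement string the entire match is written as \g<0> rather than \0.

```python
import re

# Made-up two-line sample text.
text = "apples\npears"

# Match each non-empty line and replace it with itself wrapped in tags.
# \g<0> is the Python spelling of "the entire regex match".
wrapped = re.sub(r".+", r"<line>\g<0></line>", text)

print(wrapped)
```

Because .+ requires at least one character, a blank line would not be matched and so would not be wrapped in tags.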

Roman numerals

The Roman numerals are now erroneously tagged as if they were lines of poetry, and in our sample output at http://dh.obdurodon.org/shakespeare-sonnets.xml we want them to be attribute values. To start that process we need to think about how to distinguish a Roman numeral line from a real line of poetry. Since there are 154 sonnets, a Roman numeral line is a line that contains one or more instances of I, V, X, L, and C in any order and nothing else, and no real line of poetry matches that pattern. That means that we can match that pattern by using a regex character class, which you can read about at http://www.regular-expressions.info/charclass.html. This approach would, in principle, also match sequences that aren’t valid Roman numerals, like XVX, but those don’t occur, so we don’t have to worry about them. This illustrates a useful strategy: a simple regex that overgeneralizes vacuously may be more useful than a complex one that avoids matching things that won’t occur anyway. You can use the character class (wrapped in square brackets, as described at the link above) followed by a plus sign (meaning one or more) to complete your regex so that it matches only <line> elements that contain a Roman numeral and nothing but a Roman numeral. Try it.
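To see how a character class followed by a plus sign behaves, here is a small Python sketch that tests a few strings against the five letters listed above. It checks bare strings only, so adapting the pattern to lines that are already wrapped in <line> tags is left to you. The candidate strings are chosen for illustration.

```python
import re

# One or more of the letters I, V, X, L, C and nothing else.
numeral = re.compile(r"[IVXLC]+")

candidates = [
    "XIV",
    "CLIV",
    "XVX",  # not a valid Roman numeral, but the simple pattern accepts it
    "From fairest creatures we desire increase,",
]

# fullmatch succeeds only if the whole string fits the pattern.
results = {s: bool(numeral.fullmatch(s)) for s in candidates}

for s, ok in results.items():
    print(ok, s)
```

The result for XVX shows the harmless overgeneralization discussed above: the pattern accepts it, but no such line occurs in the sonnets.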

In this case you want to write the Roman numeral into the replacement string, but you want to get rid of the spurious <line> tags and replace them with other markup. \0 will write the entire match into the replacement, but that would include the original <line> tags that you want to remove. To capture just part of a regex match for reuse in the replacement, you wrap it in parentheses; this doesn’t match parenthesis characters, but it does make the part of the regex that’s between the parentheses available for reuse in the replacement string. For example, a(b)c would match the sequence abc and capture the b in the middle, so that it could be written into the replacement. Capturing a single literal character value isn’t very useful because you could have just written the b into the replacement literally, but you can also capture wildcard matches. For example, a(.)c matches a sequence of a literal a character followed by any single character except a new line followed by a literal c character. You can use that type of approach to capture everything between the <line> tags in the matched string: write a regex that matches the entire line with the Roman numeral, including the <line> tags, but put parentheses around the stuff between the <line> tags.
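The two small examples from the paragraph above can be tried out in Python as follows. The input strings are made up, and the sketch only inspects what was captured; it does not perform a replacement.

```python
import re

# a(b)c matches "abc" and captures the literal b.
m1 = re.search(r"a(b)c", "xxabcxx")
whole1, captured1 = m1.group(0), m1.group(1)

# a(.)c matches a, then any single character except a new line, then c,
# and captures whatever that middle character happens to be.
m2 = re.search(r"a(.)c", "xxa7cxx")
whole2, captured2 = m2.group(0), m2.group(1)

# The dot does not match a new line, so this search finds nothing.
no_match = re.search(r"a(.)c", "a\nc") is None

print(whole1, captured1)
print(whole2, captured2)
print(no_match)
```

In both cases group 0 is the entire match and group 1 is only the part between the parentheses.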

Okay, you’ve captured the Roman numeral, but how do you write it into the replacement? To write a captured pattern into the replacement, use a backslash followed by a digit, where \1 means the first captured group, \2 means the second, etc. Since in this case we’re capturing only one group (we have only one set of parentheses), wherever we write \1 in our replacement string, we’ll insert the Roman numeral that we captured. For this task we’d build a replacement string that starts with a </sonnet> end tag (since the Roman numeral appears after the end of the preceding sonnet), then a new line, and then a <sonnet> start tag, and inside that start tag we’d include the number attribute and use the captured string (that is, \1) as its value, etc. Try it.
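As a toy illustration of writing a captured group into a replacement, the Python sketch below moves the content of an element into an attribute value. The element names b and item and the attribute name n are hypothetical and have nothing to do with the sonnets, so you will still need to work out the replacement string for the assignment yourself.

```python
import re

# Made-up input: two elements whose content we want to reuse.
text = "<b>one</b>\n<b>two</b>"

# The parentheses capture the content between the tags; \1 in the
# replacement writes that captured content into the attribute value.
# The original <b> tags are matched but not written back.
renamed = re.sub(r"<b>(.+)</b>", r'<item n="\1"/>', text)

print(renamed)
```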

Clean up

Your XML won’t be entirely well-formed because you are entering <sonnet> start- and end-tags between sonnets, but the first and last sonnets are not between anything, so the first sonnet will be missing its start-tag and the last sonnet will be missing its end-tag. Add those manually. You may also have to clean up the beginning and end of the document manually, including the title and author, and you’ll also need to add a root element.

Checking your results

Although you’ve added XML markup to the document, if you opened it as plain text, <oXygen/> remembers that, which means that you can’t check it for well-formedness right away. To fix that, save it as XML with File → Save as and give it the extension .xml. Even that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML, so it should perform real-time well-formedness checking when you open it with the new filename extension. You can also verify that it’s well formed with keyboard shortcuts (Control+Shift+W on Windows, Command+Shift+W on Mac) or by clicking on the drop-down arrow next to the red check mark in the icon bar at the top and choosing Check well-formedness.
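If you would also like to check well-formedness outside <oXygen/>, one possible approach is to try parsing the text with Python's standard xml.etree.ElementTree module, which raises an error when the input is not well formed. This is a sketch under our own assumptions: the helper name is_well_formed and the root element name sonnets are hypothetical, and the two sample strings are made up to show a passing and a failing case.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical root element name; every start tag has a matching end tag.
good = '<sonnets><sonnet number="I"><line>text</line></sonnet></sonnets>'

# The first sonnet is missing its start tag and the last is missing its
# end tag, which is the situation described under Clean up.
bad = '<sonnets><line>text</line></sonnet><sonnet number="II"><line>text</line></sonnets>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

To check a saved file rather than a string, ET.parse with the file path can be used in the same try/except pattern.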
