Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2017-02-14T18:19:00+0000


Test #3 (regex) answers

General directions

If you want to test your regex as you develop, you might find Regexpal helpful.

Your regex expressions in Part 2 should do the following:

Part 1

Explain in prose what the following regular expressions match:

  1. [1-9]\.[A-Za-z]

    Any digit from 1 through 9, inclusive, followed by a period, followed by any single upper- or lower-case letter (from a to z, inclusive). The square brackets create a character class, a group of characters, any one of which can be part of the match. Two characters separated by a hyphen create a range, so [1-9] is equivalent to [123456789] and [A-Z] is a concise way of writing [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. The backslash before the dot causes the dot to match a literal period (instead of its default metacharacter meaning of any character other than a newline). Because there is no repetition indicator, each character class matches only a single character.
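    If you want to check this reading outside Regexpal, here is a minimal sketch using Python's re module. The sample sentence is invented for the illustration; it is not part of the test.

```python
import re

# Pattern from item 1: a digit 1-9, a literal period, one English letter.
pattern = re.compile(r"[1-9]\.[A-Za-z]")

# Hypothetical sample text, made up for this illustration.
sample = "See 3.a and 7.B, but not 0.c, 4.5, or 2xd."

matches = pattern.findall(sample)
print(matches)  # → ['3.a', '7.B']
```

    The string 0.c fails because 0 is outside the range 1-9, 4.5 fails because 5 is not a letter, and 2xd fails because the escaped dot matches only a literal period.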

  2. "(.*?)"

    A quotation, that is, a pair of quotation marks plus everything between them. The quotation marks at the beginning and end are matched literally. The parentheses capture whatever appears between the quotation marks so that it can be used in a replacement string (for example, to replace quotation marks in the plain-text input with HTML <q> tags in the output, inserting the text of the quote between the new tags). Because regex matching is greedy, if you had only "(.*)", you would match the longest possible string, so in a line like:

    "I wonder," he thought, "what I should say next?"

    you would match everything from the very first quotation mark to the very last, where what you really want to do is match each quotation separately. The question mark in the regex keeps the match from being greedy, and tells it to get the shortest possible match, instead of the longest. With the question mark, the regex will match each quotation in the line separately.
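    The difference between the greedy and non-greedy versions can be sketched in Python with the sample line above. The variable names and the use of re.sub to insert <q> tags are choices made for this illustration.

```python
import re

line = '"I wonder," he thought, "what I should say next?"'

# Greedy: .* runs to the last quotation mark on the line.
greedy = re.findall(r'"(.*)"', line)

# Non-greedy: .*? stops at the nearest closing quotation mark.
lazy = re.findall(r'"(.*?)"', line)

# The captured group (\1) is reused in the replacement string.
tagged = re.sub(r'"(.*?)"', r"<q>\1</q>", line)

print(greedy)  # → ['I wonder," he thought, "what I should say next?']
print(lazy)    # → ['I wonder,', 'what I should say next?']
print(tagged)  # → <q>I wonder,</q> he thought, <q>what I should say next?</q>
```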

  3. \d{4,}

    Four or more digits. \d matches any single digit, so it can be considered shorthand for [0123456789] or [0-9]. A number in curly braces after a match component (in this case, a digit) specifies the number of consecutive instances of the component required for the match to succeed. It would be useful if we could do this in Relax NG, which otherwise has repetition indicators similar to those in regex, but, unfortunately, Relax NG doesn’t support this curly-brace notation. The comma with no number after it indicates that four is the minimum and that there is no upper limit to the number of digits to be matched. Regex patterns are greedy, so this pattern will match the longest possible string of consecutive digits as long as there are at least four of them. That means, among other things, that numbers separated by commas (like 10,000) will not be matched at all, since although a human thinks of this as a five-digit number, from a regex perspective it’s a sequence of two digits, then a comma (not a digit), and then a sequence of three digits.
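    A short Python sketch of the same point, using an invented sample sentence:

```python
import re

# Hypothetical sample text, made up for this illustration.
sample = "In 1066, about 10,000 men; ZIP 15260; phone 5551234; room 123."

# Four or more consecutive digits; greedy, so each run is taken whole.
matches = re.findall(r"\d{4,}", sample)
print(matches)  # → ['1066', '15260', '5551234']
```

    Neither 10,000 nor 123 appears in the result: the comma splits the first into runs of two and three digits, and the second is one digit short.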

  4. ^[IVXLC]+$

    A line that consists of a sequence of one or more of the characters inside the square brackets, that is, a Roman numeral. The square brackets define a character class, so [IVXLC] by itself would match exactly one instance of any of those five letters. The plus sign afterwards is a repetition indicator, so the pattern now matches one or more instances of any of those characters (whether the repetition is the same character or a different character from the class). The ^ at the beginning and $ at the end are anchors; they don’t match anything themselves, but they make the match succeed only if it begins at the beginning of a line and ends at the end of the line. The anchors protect us from accidentally matching the pronoun I in the middle of a line.

    We used this pattern in the Shakespeare sonnet exercise to tag the Roman numerals at the beginning of each sonnet. It would also match sequences of these letters that do not form a legal Roman numeral, e.g., something like VVXIL. Since that’s unlikely to occur, we don’t have to worry about being overly permissive. It is possible to construct a regex that matches only legal Roman numerals, but it’s tough going; see the discussion at http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression.
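    The effect of the anchors can be sketched in Python. One assumption to flag: in Python's re module, ^ and $ anchor to the start and end of the whole string unless you pass re.MULTILINE, which makes them line anchors as described above. The sample text is assembled for the illustration.

```python
import re

# Sample lines: two sonnet numbers with first lines, one line that merely
# contains the pronoun I, and one illegal "numeral".
lines = [
    "I",
    "From fairest creatures we desire increase,",
    "II",
    "When forty winters shall besiege thy brow,",
    "I think VVXIL is not a numeral",
    "VVXIL",
]
text = "\n".join(lines)

# re.MULTILINE makes ^ and $ match at the start and end of every line.
per_line = re.findall(r"^[IVXLC]+$", text, flags=re.MULTILINE)

# Without the flag, the whole string would have to be a numeral.
whole_string = re.findall(r"^[IVXLC]+$", text)

print(per_line)      # → ['I', 'II', 'VVXIL']
print(whole_string)  # → []
```

    The line that begins with the pronoun I is not matched, but the illegal sequence VVXIL is, which illustrates both the protection the anchors give and the permissiveness of the character class.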

  5. [A-Z][a-z]{1,2}\.

    This pattern matches common abbreviated titles, such as Mr., Mrs., Dr., Esq., etc. It uses two character classes, one for upper-case letters and one for lower-case letters, followed by an escaped period, meaning that it will match a literal period (".") rather than any character other than a newline, which is what a plain (unescaped) period matches. In order to be more constrained and to limit the possibility of extraneous matches, the numerical bounds established inside the curly braces limit the number of lower-case letters to one or two, meaning a total length of two or three letters for the abbreviation, not counting the period.

    It wasn’t necessary, of course, to write abbreviated titles as part of your answer as long as you described the type of string that the pattern will match: exactly one upper-case letter followed by one or two lower-case letters followed by a period. For what it’s worth, in addition to the abbreviated titles mentioned above, this expression would also match several other string patterns, such as one-word sentences ending with a period where the word has two or three letters (such as Yes.) or the conclusion of sentences that end in a short proper name, such as Wu. The pattern would miss longer abbreviated titles, such as Messrs., as well as one-letter titles, such as French M. (Monsieur).
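    The matches and misses described above can be checked with a small Python sketch. The sample sentence is invented for the illustration.

```python
import re

# One upper-case letter, one or two lower-case letters, a literal period.
pattern = re.compile(r"[A-Z][a-z]{1,2}\.")

# Hypothetical sample text, made up for this illustration.
sample = ("Mr. Lee met Dr. Oh, Mrs. Kim, Messrs. Smith, and M. Dupont. "
          "Yes. She called Wu.")

matches = pattern.findall(sample)
print(matches)  # → ['Mr.', 'Dr.', 'Mrs.', 'Yes.', 'Wu.']
```

    Messrs. and M. are missed because they have too many or too few lower-case letters, while Yes. and Wu. are matched even though they are not titles.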

Part 2

Give a regex you might use to match:

  1. Any standard US social security number, e.g., 212-12-1212

    \d{3}-\d{2}-\d{4}. See the explanation of Part 1, Item 3, above. The hyphen here is a literal hyphen character; it’s a separator in a range expression (as in Part 1, Item 1, above) only inside a character class, that is, only inside square brackets.
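    A minimal Python check of this pattern follows. The candidate strings are invented, and the use of fullmatch (which requires the entire string to match) is a choice made for this sketch; in a find-and-replace interface the pattern would instead find the number anywhere in the text.

```python
import re

# Three digits, a literal hyphen, two digits, a literal hyphen, four digits.
pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical candidates: only the first has the 3-2-4 shape.
candidates = ["212-12-1212", "212121212", "21-212-1212", "212-12-121"]

results = [bool(pattern.fullmatch(c)) for c in candidates]
print(results)  # → [True, False, False, False]
```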

  2. Any standard filename, with a main component that contains just lower-case letters, digits, hyphens, and underscores; a period; and then a three-letter extension, e.g., filename.xml

    [a-z\-0-9_]+\.[A-Za-z]{3}. The square brackets at the beginning contain a character class that matches lower-case letters, a literal hyphen, digits, and a literal underscore. To keep the hyphen from being confused with a range separator, which would be its usual meaning inside a character class (i.e., inside square brackets), we escape it by preceding it with a backslash. This is the general practice with metacharacters; if a character in a regex normally has a non-literal meaning, you can give it its literal meaning by preceding it with a backslash. This pattern doesn’t match anything that includes a literal backslash; the backslash here is just an escape. You should also note that in this case, as with almost all regex searching you do, you should have the case-sensitive box checked.

    Note: An alternative way to match a literal hyphen inside a character class is to put it first, e.g. [-a-z0-9_]+\.[A-Za-z]{3}. Since the range separator must fall between two items (e.g., between A as the start of the range of upper-case letters and Z as the end in [A-Z]), a hyphen at the beginning of the character class cannot be a range separator. For that reason, you don’t need to escape it there; a regex parser will recognize that it must be a literal hyphen. Escaping a hyphen where you don’t have to escape it doesn’t affect the match, but it’s nonetheless a mistake because it’s clutter and because someone reading your code will expect that it must be there for a reason.

    The \. matches a literal period (a dot by itself is a metacharacter that matches any character except a newline), and the final part matches exactly three letters, any of which can be upper or lower case. We could extend this to allow other characters: [A-Za-z] matches only the twenty-six letters of the English alphabet in upper and lower case; it doesn’t include 1) Latin-alphabet characters used in other writing systems, such as the accented letters in some European languages, 2) non-Latin-alphabet characters, or 3) digits (e.g., .mp3), although all of these are legal in filenames.

    The regex expression \w matches any word character, which is defined as at least any of the twenty-six letters in the English writing system (upper or lower case) plus any digit and the underscore. Different regex flavors (annoyingly, regex is implemented differently in different systems and programming languages) may include characters from other writing systems; see the discussion at Shorthand character classes. You could, then, write [\w\-]+ for the portion before the dot, using \w to represent all of the letters and digits and the underscore without having to spell them out. (We do have to add the hyphen.) Keep in mind that because \w includes upper-case letters, this version is more permissive than the task requires.
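    The three variants discussed above can be compared in a short Python sketch. The filenames are invented, fullmatch is used so that the whole name has to fit the pattern, and the behavior of \w shown here is that of Python 3 string patterns, where \w also matches letters from other writing systems; other flavors may differ.

```python
import re

escaped = re.compile(r"[a-z\-0-9_]+\.[A-Za-z]{3}")  # hyphen escaped
leading = re.compile(r"[-a-z0-9_]+\.[A-Za-z]{3}")   # hyphen placed first
wordy = re.compile(r"[\w\-]+\.[A-Za-z]{3}")         # \w shorthand

# Hypothetical filenames; \u00e9 is the precomposed letter e-acute.
names = ["filename.xml", "my-file_2.TXT", "Report.pdf",
         "r\u00e9sum\u00e9.doc", "notes.md"]

escaped_ok = [n for n in names if escaped.fullmatch(n)]
leading_ok = [n for n in names if leading.fullmatch(n)]
wordy_ok = [n for n in names if wordy.fullmatch(n)]

print(escaped_ok)  # → ['filename.xml', 'my-file_2.TXT']
print(leading_ok)  # → ['filename.xml', 'my-file_2.TXT']
print(wordy_ok)    # also accepts Report.pdf and the accented name
```

    The escaped and leading-hyphen versions accept exactly the same names. The \w version additionally accepts the upper-case R and, in Python 3, the accented letters. None of the three accepts the two-letter extension in notes.md.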

  3. Any word ending in ier that starts with an upper-case letter and is at least six letters long, e.g., Happier

    [A-Z][a-z]{2,}ier. The first portion, in square brackets, matches exactly one upper-case letter. The next part matches two or more lower-case letters. Just as a single number inside curly braces indicates the exact number of repetitions (e.g., [a-z]{2} would match exactly two lower-case letters; see Part 1, Item 3, above), you can also specify a minimum or maximum number of repetitions, or both. If there’s a comma inside the curly braces, the number before it is the minimum number of repetitions and the number after it is the maximum. If you omit the maximum value, it’s unlimited. Omitting the minimum doesn’t make sense (and may raise an error message, depending on the regex implementation); if you mean zero, you can say so directly. Since we want words of at least six letters, and the upper-case first letter and the ier at the end account for four of them, we need at least two additional letters in between. If you’re doing this in the <oXygen/> find-and-replace interface, be sure you’ve checked the Case sensitive box.
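    The letter count is easy to get wrong, so here is a small Python sketch that checks it: one upper-case letter plus n lower-case letters plus ier gives n + 4 letters, so six letters requires n to be at least 2. The word list is invented, and a lower bound of 3 is included only for comparison.

```python
import re

two_or_more = re.compile(r"[A-Z][a-z]{2,}ier")    # words of 6+ letters
three_or_more = re.compile(r"[A-Z][a-z]{3,}ier")  # words of 7+ letters

# Hypothetical test words of 7, 6, 4, 7, and 7 letters.
words = ["Happier", "Zanier", "Tier", "happier", "Earlier"]

with_two = [w for w in words if two_or_more.fullmatch(w)]
with_three = [w for w in words if three_or_more.fullmatch(w)]

print(with_two)    # → ['Happier', 'Zanier', 'Earlier']
print(with_three)  # → ['Happier', 'Earlier']
```

    The six-letter word Zanier is matched only when the minimum is 2. Python's matching is case sensitive by default, which is why the lower-case happier is rejected by both patterns.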

  4. Any hyphenated word, which may include any number of hyphens, e.g., up-to-date

    ([A-Za-z]+-)+[A-Za-z]+. We use parentheses here to group some parts of the regex together not so that we can use them in a replacement string (although we could), but so that we can use a repetition indicator and have it apply to all of the parts together, in order. We conceptualize the pattern as a string of letters followed by a hyphen, and then perhaps more strings of letters followed by hyphens, but the word has to end with a string of letters not followed by a hyphen. The part in parentheses defines a string of letters followed by a hyphen, and we make that part repeatable by using the plus sign as a repetition indicator. If we take true-to-life as a sample match, that part alone would match true-to-, since there are two repetitions of the pattern. The regex ends with a string of letters, which would match the final life part of that example.
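    A Python sketch of the grouped repetition, using an invented sample sentence. One detail specific to Python: when a pattern contains a capturing group, re.findall returns the group rather than the whole match, so this sketch uses finditer and group(0) instead.

```python
import re

pattern = re.compile(r"([A-Za-z]+-)+[A-Za-z]+")

# Hypothetical sample text, made up for this illustration.
sample = "An up-to-date, true-to-life account of a well-known mother-in-law."

# group(0) is the entire match, regardless of the capturing group.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # → ['up-to-date', 'true-to-life', 'well-known', 'mother-in-law']

# The repeated group on its own matches only the hyphen-terminated pieces.
prefix = re.match(r"([A-Za-z]+-)+", "true-to-life").group(0)
print(prefix)  # → true-to-
```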