Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-10-31T22:00:32+0000


Test #3: Regular expressions test answer key

The task

To demonstrate your ability to work with regular expressions we are asking you to answer the following test questions. Regex has different flavors depending on where you use it/for what purpose, and because we use <oXygen/> for our development in this course, you should develop and test your answers in <oXygen/>. Please write your answers as a correctly formatted markdown (.md) document and upload it to Canvas.

Like all of our tests, this is open-book, which means that you can look things up, but you cannot ask for or receive help from another person. If you have questions, please put them in Slack. We can’t give away the answers, of course, but we’ll be happy to help clear up any confusion.

If you use dot-matches-all, you need to say explicitly when you check or uncheck it. If you don’t mention it, that means that it is not checked, which is the default behavior with regex.

Part 1: Understanding regular expressions

Explain what each of the following regular expressions matches.

  1. [^0-9]{3,}

    Match a run of three or more consecutive characters, none of which is a digit. A negated character class such as [^0-9] matches newlines regardless of the dot-matches-all setting, which affects only the dot.

  2. ^ +

    Match one or more leading spaces at the beginning of a line.

  3. \(.+?\)

    Match an opening parenthesis, followed by one or more of any character except a closing parenthesis (and, if dot-matches-all is not checked, also except a newline), followed by a closing parenthesis.

    The question mark means that the search is non-greedy, that is, that the pattern matches the shortest possible string. This means that the .+ part of the expression will stop matching when it encounters the first closing parenthesis. If dot-matches-all is not checked, the .+ portion of the expression cannot cross a newline; if dot-matches-all is checked, that portion of the expression can cross newline characters.
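The three patterns above can be tried outside <oXygen/> as well. The following is a minimal sketch that assumes Python's re module, whose flavor differs from <oXygen/> in some details; the sample string is invented for illustration, and re.MULTILINE is needed so that ^ matches at the start of every line.

```python
import re

# Invented sample text: two lines, the second starting with three spaces.
sample = "abc 12 (first) and (second)\n   indented line"

# 1. Runs of three or more non-digits; the negated class also matches the newline.
non_digit_runs = re.findall(r"[^0-9]{3,}", sample)

# 2. One or more spaces at the start of a line.
leading_spaces = re.findall(r"^ +", sample, flags=re.MULTILINE)

# 3. Non-greedy versus greedy matching of parenthesized text.
lazy = re.findall(r"\(.+?\)", sample)
greedy = re.findall(r"\(.+\)", sample)

print(non_digit_runs)
print(leading_spaces)
print(lazy)
print(greedy)
```

The non-greedy version stops at the first closing parenthesis, while the greedy version runs to the last one on the line.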

Part 2: Writing regular expressions

  1. Write a regular expression to match the following:
    1. Any word ending in ly

      [a-z]+ly is a good basic answer, but …

      The first piece of the regular expression, in square brackets, is a character class that matches any lower-case letter; adding the plus sign matches strings of one or more lower-case characters (that is, we’ll match the entire word up to the ly, no matter how long it is). We decided not to include the upper-case letters in the character class, but matching upper-case letters wouldn’t be a mistake; our reason was that if we were to write [A-Za-z]+ly, we wouldn’t match, for example, TERRIBLY (unless we also catered to situations where the LY at the end could be upper-case). One other option is the \w metacharacter, which matches any word character. The definition of word character is different in different flavors of regular expressions, but a common interpretation is that \w is equivalent to [A-Za-z0-9_].

      All of these patterns would fail to match hyphenated words, so if you want to account for them, you’ll need to include a hyphen in your character class. Since a hyphen inside a character class usually describes a range (e.g., from a to z) and does not match a literal hyphen, you can either escape the hyphen by preceding it with a backslash or put it first inside the character class (where a regular expression processor knows that it isn’t part of a range because nothing precedes it). All of these patterns also match only words that use the twenty-six letters of the English alphabet, so they won’t match words that include letters with diacritics or letters from other writing systems. You can work around that by either including letters from other writing systems in your character class or using an appropriate Unicode character class (see Regular expression Unicode syntax reference and Unicode categories for more information).

      There’s a trap lurking in our expression above: it will also match, e.g., flyer, that is, words where the ly isn’t word-final. A robust solution would anticipate and avoid the risk by using the \b word-boundary pattern (see Word boundaries) and writing \b[a-z]+ly\b. The \b anchor is better than using \s or just a space character at the beginning and end because \b will match both adjacent to whitespace and at the beginning and end of the document, where there won’t be any leading or trailing whitespace. Because \b is an anchor (like ^ and $) it doesn’t match any characters; it just means that the match succeeds only at the beginning or end of a word.
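The flyer trap is easy to reproduce. This sketch assumes Python's re module rather than <oXygen/>, and the sample sentence is invented for illustration.

```python
import re

# Invented sample sentence containing "flyer", where ly is not word-final.
text = "The flyer was quickly and terribly late"

# Without word boundaries the pattern matches the "fly" inside "flyer".
unanchored = re.findall(r"[a-z]+ly", text)

# With \b on both sides only whole words ending in ly are matched.
anchored = re.findall(r"\b[a-z]+ly\b", text)

print(unanchored)
print(anchored)
```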

    2. Any word ending in ly that starts with an upper-case letter, e.g., Happily

      [A-Z][a-z]+ly

      This expression looks similar to the one for 1a, above, but rather than having one character class that accounts for upper-case and lower-case letters, we have separated them into two classes. Since the first letter in these words must be capitalized, but it can be any letter in the alphabet, we use only upper-case letters in the first class. Notice that there is no repetition indicator after this character class because we want only the first letter to be capitalized. The second character class contains only lower-case letters, followed by the plus sign. Then we again add ly. Since case is important for this task, the Case sensitive box must be checked. Otherwise, the pattern will also match -ly words that do not start with an upper-case letter, i.e., the same results as 1a.
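The effect of the Case sensitive box can be approximated as follows. This is a sketch that assumes Python's re module, where matching is case-sensitive by default and the re.IGNORECASE flag plays roughly the role of unchecking the box; the sample sentence is invented.

```python
import re

# Invented sample sentence with capitalized and lower-case -ly words.
text = "Happily, she slowly left. Sadly so."

# Case-sensitive (the default): only capitalized -ly words match.
case_sensitive = re.findall(r"[A-Z][a-z]+ly", text)

# Case-insensitive: [A-Z] now matches lower-case letters too.
case_insensitive = re.findall(r"[A-Z][a-z]+ly", text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```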

    3. Any word ending in ly that is at least 10 letters long

      [a-z]{8,}ly

      As in 1a, we’ve included a character class that matches any lower-case letter. Because we are looking for a word of particular length, we use curly braces to match a particular number of characters. Since the total minimum length specified is ten characters and the last two letters have to be ly, we specify that there must be at least eight other letters.

      A single number inside curly braces indicates an exact number of repetitions. If there’s a comma inside the curly braces, the number before it is the minimum number of repetitions and the number after the comma is the maximum. If you omit the maximum value, it’s unlimited. Omitting the minimum doesn’t make sense (and may raise an error message, depending on the regex implementation); if you mean zero or one, you can say so directly with the question-mark repetition indicator.

      Whether you do or do not check the Case sensitive box affects the results, and in the artificial test context either approach is fine, but in Real Life you’ll want to let your desired results guide your decision.
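The length arithmetic (eight letters plus ly makes ten) can be checked with a short sketch. It assumes Python's re module, and the word list is invented for illustration; re.fullmatch requires the whole word to match the pattern.

```python
import re

# Invented word list: lengths 6, 10, 12, and 8.
words = ["slowly", "remarkably", "unbelievably", "friendly"]

# At least eight lower-case letters followed by ly, i.e. ten letters or more.
long_ly_words = [w for w in words if re.fullmatch(r"[a-z]{8,}ly", w)]

print(long_ly_words)
```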

  2. Regular expressions are often used to validate that the format of data inputted by users is correct and consistent. Consider, for example, the following name, date, and US telephone number drawn from a mythical address book:
    • John Doe
    • Jan 1, 1960
    • (555) 555-5555

    Write a regular expression for each type of data (name, date, telephone) that will match the examples above and other entries similar to them. For the name, assume just forename and surname (no middle initial, Jr., mononym, etc.). For the month assume a three-letter abbreviation without a period (Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, and Dec), a one- or two-digit date, and a four-digit year. You don’t have to check for a valid date, that is, you don’t have to verify whether months have the correct number of days or whether a particular year is a leap year. Assume that all telephone numbers will be formatted exactly like the one above; the only thing that will change will be the specific digits.

    Name: [A-Z][a-z]* [A-Z][a-z]*

    This pattern matches name parts that start with an upper-case letter and continue with zero or more lower-case letters. There is then a literal space (separating the forename and surname) and then another pattern like the first. You might reasonably have made different assumptions about where upper- and lower-case letters may appear or about apostrophes or hyphens in names (or more).

    Date of birth: (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4}

    This pattern matches the three-letter codes for each month with an or-group, followed by a space, followed by one or two digits for the day, then a comma, and then four digits for the year. You could make your expression more robust by restricting the first digit of the date pattern to just the digits 1, 2, and 3, and in other ways, none of which we expected in this simplified context.

    You need the parentheses around the month options even if you aren’t planning to capture and reuse that part of the match. Try it with and without at https://www.regexpal.com and see if you can figure out why.

    Phone number: \(\d{3}\) \d{3}-\d{4}

    This pattern searches for a literal open parenthesis, then three digits followed by a literal closing parenthesis, then a space, then three digits followed by a hyphen, and then, finally, four more digits.
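As a rough illustration of using the three patterns for validation, here is a sketch that assumes Python's re module; the helper name is_valid is hypothetical. It uses fullmatch so that the whole value, not just a substring, has to fit the pattern.

```python
import re

NAME = re.compile(r"[A-Z][a-z]* [A-Z][a-z]*")
DATE = re.compile(
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{1,2}, \d{4}"
)
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def is_valid(pattern, value):
    """Hypothetical helper: True if the entire value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(is_valid(NAME, "John Doe"))
print(is_valid(DATE, "Jan 1, 1960"))
print(is_valid(PHONE, "(555) 555-5555"))
print(is_valid(PHONE, "555-555-5555"))   # wrong format
print(is_valid(DATE, "Feb 31, 1960"))    # format only, no calendar check
```

The last call returns True because, as the question allows, the pattern checks only the format and not whether the date exists.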

  3. Write a regular expression that will validate that a (not terribly secure, alas!) hypothetical user password conforms to the following specifications:
    • must begin with a number
    • the initial digit must be followed by at least eight characters (any combination of digits and uppercase and lowercase letters, but no other characters)
    • must end in one of the following punctuation marks: !, @, #, $, or %

    ^[0-9][A-Za-z0-9]{8,}[!@#$%]$

    The caret (^) at the beginning of the expression anchors the match at the start of a line; because it stands outside the square brackets, it is an anchor and not the negation marker of a character class. The first character class, [0-9], matches a single digit. (You could, alternatively, use \d, which also matches any digit.) This is followed by a character class that will match uppercase letters, lowercase letters, and digits. The {8,} means that there must be at least eight of these characters after the initial digit. The final character class matches any one of the specified punctuation characters (inside square brackets the dollar sign is a literal character, not an anchor). We end with a dollar sign to anchor the match at the end of the line.

    We’ve used leading and trailing line anchors because we’ve assumed that the value we would be validating would occupy a separate line, and we’d do something like that in Real Life to avoid the substring matching issue we describe above in the case of flyer. In the artificial test context it’s fine if you didn’t make that same assumption.
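The password pattern can be exercised against a few candidates. This sketch assumes Python's re module, and the candidate passwords are invented for illustration.

```python
import re

PASSWORD = re.compile(r"^[0-9][A-Za-z0-9]{8,}[!@#$%]$")

# Invented candidates, mapped to whether the pattern accepts them.
candidates = ["1abcdefgh!", "1abcdefg!", "aabcdefgh1!", "1abc_efgh!", "1abcdefgh"]
results = {c: PASSWORD.search(c) is not None for c in candidates}

for candidate, ok in results.items():
    print(candidate, ok)
```

Only the first candidate passes: the second has seven characters after the initial digit, the third starts with a letter, the fourth contains an underscore, and the fifth lacks the final punctuation mark.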

Part 3: Matching and replacing

  1. Reading views of texts often favor curly (sometimes called typographic) quotation marks (“ ”) over straight ones (" "). Write a regular expression find-and-replace operation that matches all single words that are inside straight quotation marks (e.g., "XML") and replaces the straight quotation marks with curly ones (e.g., “XML”). (Feel free to copy and paste literal curly quotation mark characters from the test into your expressions if you find that easier than entering them from the keyboard.)

    Find: "(.+?)", Replace: “\1”

    This regular expression captures content between straight quotation marks using non-greedy repetition of the dot, which matches any character except a newline (or including a newline, if dot-matches-all is checked). It uses a back reference to preserve what was captured and place it between curly quotation marks.

    Because we specified quoted single words in the question you don’t have to worry about matches that cross newline characters, so you don’t have to check Dot matches all (although checking it would do no harm). We want to remove and discard the original (straight) quotation marks, since we’re going to replace them with curly ones.

    You might reasonably have matched quoted single words in other ways, for example as "([A-Za-z]+)" or as "(\w+)". If you take one of these approaches you don’t have to worry about greedy vs non-greedy matching, since they don’t match quotation marks, which means that you won’t overrun the end of the word.
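The difference between the greedy, non-greedy, and \w+ versions shows up as soon as a line contains two quoted words. This is a sketch assuming Python's re module, with an invented sample sentence.

```python
import re

# Invented sample with two quoted single words on one line.
text = 'The "XML" and "HTML" standards'

lazy = re.sub(r'"(.+?)"', r"“\1”", text)      # non-greedy dot
greedy = re.sub(r'"(.+)"', r"“\1”", text)     # greedy dot overruns the first word
word_only = re.sub(r'"(\w+)"', r"“\1”", text)  # \w cannot match a quotation mark

print(lazy)
print(greedy)
print(word_only)
```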

  2. In much of Europe the comma is used to separate the integral part of a number from the decimal part, e.g., a European might write 12,5 where someone in the US would write 12.5. Suppose a European company is looking to open a retail branch in the United States and they need to update their catalog from Euros to US dollars. Assume also that the currency conversion rate between the Euro and the US Dollar is 1:1 (as it once was, long ago!).

    Write a regular expression find-and-replace operation that will match a European price (beginning with the Euro symbol, which you can copy and paste from here: €) and replace it with the corresponding American price, beginning with the dollar sign. Assume that prices for the products of this hypothetical company range from €0,01 (that is, one Euro cent) to €999,99 (that is, 999 Euros and 99 Euro cents). (Curious fact: European countries differ in whether they write the Euro symbol before or after the digits and whether it is or is not separated from the digits by a space. Assume that this company writes the Euro sign immediately before the digits, with no intervening space. Similarly, write the dollar sign in your replacement immediately before the digits, also with no intervening space.)

    Find: €(\d+),(\d+), Replace: \$\1.\2

    The above expression begins by matching a Euro character, then one or more digits, followed by a comma, followed by one or more digits. The replace expression begins by entering a literal dollar sign. It then uses a back reference to input the captured integer part of the price, followed by a literal dot (to replace the comma), and then a second back reference for the cents. We escape the literal dollar sign in the replacement so that it won’t be mistaken for a back reference (back references can begin with either dollar signs or backslashes, depending on the regex flavor). We don’t have to escape the dot because although it is a metacharacter in a match pattern, it is always a literal character in a replacement string.

    We’ve made a simplifying assumption that there will always be both an integral part and a decimal part, even if they are equal to zero. A more sophisticated strategy, not expected in the context of this test, might also allow for prices that are only whole numbers or only cents.
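The escaping of the dollar sign is one of the places where flavors differ. The sketch below assumes Python's re module, where back references in a replacement are written with backslashes and a dollar sign is always literal, so it is left unescaped there (unlike in the <oXygen/> replacement above). The catalog line is invented for illustration.

```python
import re

# Invented catalog line with three European-style prices.
catalog = "Pen €0,99, notebook €12,50, lamp €999,99"

# In Python the replacement "$" needs no backslash; \1 and \2 are back references.
converted = re.sub(r"€(\d+),(\d+)", r"$\1.\2", catalog)

print(converted)
```

The commas that separate the catalog entries are untouched because they are not preceded by a Euro sign and digits.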

  3. Consider the following text, which represents a speech from a play:

    CATHERINE. 
    Of course: he sent me the news. 
    Sergius is the hero of the hour, the idol of the regiment.

    Assume that speeches are separated from one another by a blank line. Write a regular expression find-and-replace operation to tag the entire speech as <speech> with a speaker attribute whose value is the speaker’s name, e.g.:

    
    <speech speaker="CATHERINE">
    Of course: he sent me the news. 
    Sergius is the hero of the hour, the idol of the regiment.
    </speech>

    Assume all the speeches in this hypothetical document begin with the speaker’s name in uppercase letters on a separate line that ends with a dot. Please remove the dot, as in the example above, when you create the output.

    You may use either multiple find-and-replace operations or a single find-and-replace operation to complete this task.

    Solution

    A speaker line consists entirely of capital letters and ends with a period, a space, and a newline. For a solution in a single find-and-replace operation, be sure that you’ve checked case-sensitive and dot-matches-all and then find

    ^([A-Z]+)\.(.+?)\n{2}

    and replace with

    <speech speaker="\1">\2</speech>\n

    The first part is straightforward: match (first capture group) a sequence of upper-case letters that starts at the beginning of a line and is followed immediately by a literal dot (but don’t include the dot in the capture group). Then match (second capture group) all following characters until but not including a blank line (two newline characters in a row, and don’t include the final newline characters in the capture group). We now have the character name in one capture group and the speech in a second capture group. The replacement pokes the name into the attribute value and uses the text as the content of the new <speech> element. We write a literal newline after the element so that each speech will begin on a new line; you could, alternatively, write two literal newlines if you want a blank line between the newly tagged speeches. You might have to fix the last speech manually, since our match depends on a following blank line and you may not have one after the last speech.
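The single-step operation can be sketched in Python's re module (an assumption, since the answer above was developed in <oXygen/>). Here re.DOTALL stands in for dot-matches-all and re.MULTILINE lets ^ match at the start of each line; the second speech is invented to show that the operation repeats.

```python
import re

play = (
    "CATHERINE. \n"
    "Of course: he sent me the news. \n"
    "Sergius is the hero of the hour, the idol of the regiment.\n"
    "\n"
    "RAINA. \n"
    "(a second speech, invented here for illustration)\n"
    "\n"
)

# Group 1 is the speaker name; group 2 is everything up to the blank line.
tagged = re.sub(
    r"^([A-Z]+)\.(.+?)\n{2}",
    r'<speech speaker="\1">\2</speech>\n',
    play,
    flags=re.DOTALL | re.MULTILINE,
)

print(tagged)
```

The space and newline that follow the speaker's dot end up inside the second capture group, which is why this approach does not need to match them explicitly.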

    A solution with multiple find-and-replace operations might be easier to develop because you have to concentrate on only one thing at a time. You could, for example, do something along the lines of:

    1. With dot-matches-all unchecked, match and tag all of the speaker lines (temporarily) by matching:

      ^([A-Z]+)\. $

      and replacing with:

      <speaker>\1</speaker>

      This approach requires catching (or otherwise allowing for) the space character at the end of the character name, which gets handled automatically with the earlier approach. A speech now looks like:

      <speaker>CATHERINE</speaker>
      Of course: he sent me the news. 
      Sergius is the hero of the hour, the idol of the regiment.
    2. With dot-matches-all checked, use a non-greedy match to match everything between a </speaker> end-tag and two newlines, capturing the part between the </speaker> end-tag and the newlines, which represents the speech content. Replace by writing back the tag you matched, followed by the captured speech inside temporary tags with a name like <content>. Add at least one newline, so that each speech will begin on a new line, or two if you want a blank line between speeches. Match:

      </speaker>\n(.+?)\n{2}

      and replace with:

      </speaker>\n<content>\1</content>\n\n

      A speech now looks like:

      <speaker>CATHERINE</speaker>
      <content>Of course: he sent me the news. 
      Sergius is the hero of the hour, the idol of the regiment.</content>
    3. With dot-matches-all still checked, match an entire <speaker> element plus a following entire <content> element (there’s a newline character between them), capturing the content of each (but not the tags) separately. Replace with <speech> start- and end-tags, including a speaker attribute in the start-tag. Write the first capture group as the value of that speaker attribute and the second capture group as the content of the element. Match:

      <speaker>(.+?)</speaker>\n<content>(.+?)</content>

      and replace with:

      <speech speaker="\1">\2</speech>

      A speech now looks like:

      <speech speaker="CATHERINE">Of course: he sent me the news. 
      Sergius is the hero of the hour, the idol of the regiment.</speech>

      As with the one-step approach, you may have to fix the last speech manually.
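The three-step approach described above can be chained as follows. This is a sketch assuming Python's re module, with re.MULTILINE so that ^ and $ match at line boundaries in step 1 and re.DOTALL standing in for dot-matches-all in steps 2 and 3; the temporary tag names <speaker> and <content> follow the description in the steps.

```python
import re

speech = (
    "CATHERINE. \n"
    "Of course: he sent me the news. \n"
    "Sergius is the hero of the hour, the idol of the regiment.\n"
    "\n"
)

# Step 1 (dot-matches-all off): tag the speaker line, dropping the dot and space.
step1 = re.sub(r"^([A-Z]+)\. $", r"<speaker>\1</speaker>", speech,
               flags=re.MULTILINE)

# Step 2 (dot-matches-all on): wrap the speech text in temporary <content> tags.
step2 = re.sub(r"</speaker>\n(.+?)\n{2}",
               r"</speaker>\n<content>\1</content>\n\n", step1,
               flags=re.DOTALL)

# Step 3 (dot-matches-all on): merge the two temporary elements into <speech>.
step3 = re.sub(r"<speaker>(.+?)</speaker>\n<content>(.+?)</content>",
               r'<speech speaker="\1">\2</speech>', step2,
               flags=re.DOTALL)

print(step3)
```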

Bonus tasks

  1. Write a regular expression to match an XML filename that conforms to the requirements for XML files in this course, with a main component that begins with your surname in lower-case letters (use your own surname); followed by any combination of letters, digits, hyphens, and underscores, but nothing else; then a period; and then a filename extension that reads xml. e.g., biden_xml-assignment-1.xml

    ^biden[-A-Za-z0-9_]*\.xml$

    The above expression matches your surname (if you happen to be the President of the United States) followed by zero or more permitted characters and ending with a literal period and the literal characters xml. We write the literal hyphen in the character class first so that it has its literal meaning (and is not misread as part of a range expression); alternatively you can put it anywhere if you escape it with a preceding backslash.
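A quick check of the filename pattern, sketched with Python's re module (an assumption); apart from the example from the question, the filenames are invented.

```python
import re

FILENAME = re.compile(r"^biden[-A-Za-z0-9_]*\.xml$")

# Example from the question plus invented names, mapped to accept/reject.
names = ["biden_xml-assignment-1.xml", "biden.xml", "biden test.xml",
         "Biden_test.xml", "biden_test.xmlx"]
accepted = {n: FILENAME.search(n) is not None for n in names}

for name, ok in accepted.items():
    print(name, ok)
```

The second name is accepted because the character class is followed by an asterisk, so zero additional characters are allowed after the surname.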

  2. Any hyphenated word, which may include any number of hyphens, e.g., up-to-date

    ([A-Za-z]+-)+[A-Za-z]+

    We use parentheses here to group some parts of the regex together not so that we can use them in a replacement string (although we could), but so that we can use a repetition indicator and have it apply to all of the parts together, in order. We conceptualize the pattern as a string of letters followed by a hyphen, and then perhaps more strings of letters followed by hyphens, but the word has to end with a string of letters not followed by a hyphen. The part in parentheses defines a string of letters followed by a hyphen, and we make that part repeatable by using the plus sign as a repetition indicator. With a sample text like out-of-the-loop, that part alone would match out-of-the-, since there are three repetitions of the pattern. The regex ends with a string of letters, which would match the final loop part of that example.
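The repeated group can be observed directly. This sketch assumes Python's re module and an invented sample sentence; it uses finditer and group(0) because Python's findall would return only the contents of the capture group rather than the whole match.

```python
import re

HYPHENATED = re.compile(r"([A-Za-z]+-)+[A-Za-z]+")

# Invented sample sentence with two hyphenated words.
text = "She is out-of-the-loop and up-to-date."

matches = list(HYPHENATED.finditer(text))
whole_words = [m.group(0) for m in matches]

# After repetition, the group holds only its last iteration.
last_repetition = matches[0].group(1)

print(whole_words)
print(last_repetition)
```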

  3. Rather than replacing all straight quotation marks with curly ones, as in the required question above, write a regular expression find-and-replace operation to match quotations and tag them all with <q> tags. Here's an example text you can use to test your expressions:

    "You had my note?" he asked with a deep harsh voice and a strongly
    marked German accent. "I told you that I would call." He looked from
    one to the other of us, as if uncertain which to address.
    
    "Pray take a seat," said Holmes. "This is my friend and colleague, Dr.
    Watson, who is occasionally good enough to help me in my cases. Whom
    have I the honour to address?"
    
    "You may address me as the Count Von Kramm, a Bohemian nobleman. I
    understand that this gentleman, your friend, is a man of honour and
    discretion, whom I may trust with a matter of the most extreme
    importance. If not, I should much prefer to communicate with you
    alone."
    
    I rose to go, but Holmes caught me by the wrist and pushed me back into
    my chair. "It is both, or none," said he. "You may say before this
    gentleman anything which you may say to me."

    Find: "(.+?)", Replace: <q>\1</q>

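    The same considerations when it comes to greediness apply as in the solution to Part 3, Question 1 above. Unlike the single quoted words in that question, the quotations in this sample text can cross line breaks, so dot-matches-all must be checked for the dot to match the newlines inside them.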
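The role of newlines inside quotations can be seen on an excerpt of the sample text. This is a sketch assuming Python's re module, where re.DOTALL corresponds to checking dot-matches-all.

```python
import re

excerpt = (
    '"Pray take a seat," said Holmes. "This is my friend and colleague, Dr.\n'
    'Watson, who is occasionally good enough to help me in my cases. Whom\n'
    'have I the honour to address?"\n'
)

# With re.DOTALL the dot can cross the line breaks inside the second quotation.
with_dotall = re.sub(r'"(.+?)"', r"<q>\1</q>", excerpt, flags=re.DOTALL)

# Without it, only the quotation that fits on one line is tagged.
without_dotall = re.sub(r'"(.+?)"', r"<q>\1</q>", excerpt)

print(with_dotall)
print(without_dotall)
```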