Digital humanities

Maintained by: David J. Birnbaum. [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2021-10-01T13:57:01+0000

Test #3: Regular Expressions

The task

To demonstrate your ability to work with regular expressions, we are asking you to answer the following test questions. Regex comes in different flavors, depending on the tool you use and the purpose you use it for, and because we use <oXygen/> for our development in this course, you should develop and test your answers in <oXygen/>. Please write your answers as a correctly formatted markdown (.md) document and upload it to Canvas.

Like all of our tests, this is open-book, which means that you can look things up, but you cannot ask for or receive help from another person. If you have questions, please put them in Slack. We can’t give away the answers, of course, but we’ll be happy to help clear up any confusion.

If you use dot-matches-all, you need to say explicitly when you check or uncheck it. If you don’t mention it, that means that it is not checked, which is the default behavior with regex.
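
To make the dot-matches-all distinction concrete, here is a minimal sketch in Python rather than in <oXygen/>. It assumes that Python's re.DOTALL flag behaves like checking dot-matches-all, and the sample string is invented for illustration. Flavors differ, so confirm the behavior in <oXygen/> before relying on it.

```python
import re

# Invented two-line sample; not taken from the test questions.
sample = "first line\nsecond line"

# Default behavior: the dot matches any character except a newline,
# so the match stops at the end of the first line.
default_match = re.search(r"f.+", sample).group()

# With DOTALL (assumed here to correspond to checking dot-matches-all),
# the dot also matches the newline, so the match runs to the end of the string.
dotall_match = re.search(r"f.+", sample, flags=re.DOTALL).group()

print(repr(default_match))
print(repr(dotall_match))
```

The same pattern therefore matches different amounts of text depending on the setting, which is why your answers need to state when it is checked.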

Part 1: Understanding regular expressions

Explain what each of the following regular expressions matches.

  1. [^0-9]{3,}
  2. ^ +
  3. \(.+?\)

Part 2: Writing regular expressions

  1. Write a regular expression to match each of the following:
    1. Any word ending in ly
    2. Any word ending in ly that starts with an upper-case letter, e.g., Happily
    3. Any word ending in ly that is at least 10 letters long (total, including the ly)
  2. Regular expressions are often used to validate that the format of data entered by users is correct and consistent. Consider, for example, the following name, date, and US telephone number drawn from a mythical address book:
    • John Doe
    • Jan 1, 1960
    • (555) 555-5555

    Write a regular expression for each type of data (name, date, telephone) that will match the examples above and other entries similar to them. For the name, assume just forename and surname (no middle initial, Jr., mononym, etc.). For the date, assume a three-letter month abbreviation without a period (Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, and Dec), a one- or two-digit day, and a four-digit year. You don’t have to check for a valid date, that is, you don’t have to verify whether months have the correct number of days or whether a particular year is a leap year. Assume that all telephone numbers will be formatted exactly like the one above; the only thing that will change will be the specific digits.

  3. Write a regular expression that will validate that a (not terribly secure, alas!) hypothetical user password conforms to the following specifications:
    • must begin with a digit
    • the initial digit must be followed by at least eight characters (any combination of digits and uppercase and lowercase letters, but no other characters)
    • must end in one of the following punctuation marks: ! @ # $ %
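
Validation, as in the address-book and password questions above, differs from ordinary searching: a validating pattern has to account for the entire input, not just part of it. The following is a minimal sketch of that difference in Python, not in <oXygen/>. The five-digit ZIP code pattern and the helper names contains_match and is_valid are hypothetical, chosen for illustration and unrelated to the patterns the test asks for.

```python
import re

# Hypothetical pattern for a five-digit US ZIP code; it is unrelated to
# the name, date, telephone, and password patterns the test asks for.
zip_pattern = r"[0-9]{5}"


def contains_match(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere inside the text."""
    return re.search(pattern, text) is not None


def is_valid(pattern: str, text: str) -> bool:
    """Return True only if the pattern accounts for the entire text."""
    return re.fullmatch(pattern, text) is not None


# Print, for each candidate, whether it contains a match and whether it is valid.
for candidate in ["15260", "152601", "zip 15260"]:
    print(candidate, contains_match(zip_pattern, candidate), is_valid(zip_pattern, candidate))
```

Only the first candidate passes validation, although all three contain five consecutive digits. In a find dialog, which searches rather than validates, anchoring a pattern with ^ and $ has a comparable effect.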

Part 3: Matching and replacing

  1. Reading views of texts often favor curly (sometimes called typographic) quotation marks (“ ”) over straight ones (" "). Write a regular expression find-and-replace operation that matches all single words that are inside straight quotation marks (e.g., "XML") and replaces the straight quotation marks with curly ones (e.g., “XML”). (Feel free to copy and paste literal curly quotation mark characters from the test into your expressions if you find that easier than entering them from the keyboard.)
  2. In much of Europe the comma is used to separate the integral part of a number from the decimal part, e.g., a European might write 12,5 where someone in the US would write 12.5. Suppose a European company is looking to open a retail branch in the United States and they need to update their catalog from Euros to US dollars. Assume also that the currency conversion rate between the Euro and the US Dollar is 1:1 (as it once was, long ago!).

    Write a regular expression find-and-replace operation that will match a European price (beginning with the Euro symbol, which you can copy and paste from here: €) and replace it with the corresponding American price, beginning with the dollar sign. Assume that prices for the products of this hypothetical company range from €0,01 (that is, one Euro cent) to €999,99 (that is, 999 Euros and 99 Euro cents). (Curious fact: European countries differ in whether they write the Euro sign before or after the digits and whether it is or is not separated from the digits by a space. Assume that this company writes the Euro sign immediately before the digits, with no intervening space. Similarly, write the dollar sign in your replacement immediately before the digits, also with no intervening space.)

  3. Consider the following text, which represents a speech from a play:

    CATHERINE.
    Of course: he sent me the news. 
    Sergius is the hero of the hour, the idol of the regiment.

    Assume that speeches are separated from one another by a blank line. Write a regular expression find-and-replace operation to tag the entire speech as <speech> with a speaker attribute whose value is the speaker’s name, e.g.:

    <speech speaker="CATHERINE">Of course: he sent me the news. 
    Sergius is the hero of the hour, the idol of the regiment.</speech>

    Assume all the speeches in this hypothetical document begin with the speaker’s name in uppercase letters on a separate line that ends with a dot. Please remove the dot, as in the example above, when you create the output.

    You may use either multiple find-and-replace operations or a single find-and-replace operation to complete this task.
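
Find-and-replace operations like the ones in this part usually depend on capturing part of the match and reusing it in the replacement. Here is a minimal sketch of that general technique in Python, not in <oXygen/>. The sample names and the pattern are invented for illustration and are not an answer to any question above. The syntax for referring to a captured group in the replacement varies by flavor, so check which form <oXygen/> expects.

```python
import re

# Invented input in "Surname, Forename" order; not from the test.
names = "Shaw, Bernard\nDoyle, Arthur"

# Two parenthesized groups capture the surname and the forename.
pattern = r"(\w+), (\w+)"

# In Python's replacement syntax, \2 and \1 refer back to the
# second and first captured groups, respectively.
swapped = re.sub(pattern, r"\2 \1", names)
print(swapped)
```

Each line is rewritten in "Forename Surname" order, and the comma, which lies outside both groups, is dropped because the replacement does not repeat it.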

Bonus tasks

  1. Write a regular expression to match an XML filename that conforms to the requirements for XML files in this course, with a main component that begins with your surname in lower-case letters (use your own surname); followed by any combination of letters, digits, hyphens, and underscores, but nothing else; then a period; and then a filename extension that reads xml, e.g., biden_xml-assignment-1.xml
  2. Write a regular expression to match any hyphenated word, which may include any number of hyphens, e.g., up-to-date.
  3. Rather than replacing all straight quotation marks with curly ones, as in the required question above, write a regular expression find-and-replace operation to match quotations and tag them all with <q> tags. Here's an example text you can use to test your expressions:
    "You had my note?" he asked with a deep harsh voice and a strongly
    marked German accent. "I told you that I would call." He looked from
    one to the other of us, as if uncertain which to address.
    "Pray take a seat," said Holmes. "This is my friend and colleague, Dr.
    Watson, who is occasionally good enough to help me in my cases. Whom
    have I the honour to address?"
    "You may address me as the Count Von Kramm, a Bohemian nobleman. I
    understand that this gentleman, your friend, is a man of honour and
    discretion, whom I may trust with a matter of the most extreme
    importance. If not, I should much prefer to communicate with you alone."
    I rose to go, but Holmes caught me by the wrist and pushed me back into
    my chair. "It is both, or none," said he. "You may say before this
    gentleman anything which you may say to me."