Digital humanities

Maintained by: David J. Birnbaum [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-10-04T17:02:33+0000

Test #3: Regular expressions


Answer the following questions and upload your answers to Canvas. All tests in this course are open book, so you can look things up and you can try out your code in <oXygen/>, but you cannot receive help from another person.

Create your answers in <oXygen/> as a Markdown document, which you can do by creating a new document (using the File → New menu options) and typing markdown into the Type filter text box. If you prefer to use a platform other than <oXygen/> for your Markdown files, you may do so. The file you upload must follow our file-naming conventions, using .md as the filename extension (to indicate that you are submitting a Markdown file). When we ask you to explain what a pattern matches, your explanations should be concise (one or two sentences); be sure to address the most important aspects of the pattern in your description. Code snippets should be surrounded by backticks, as is the usual convention for citing code inside Markdown documents. If your pattern requires Dot matches all to be checked, please say so; otherwise we’ll assume that Dot matches all is unchecked.

Required questions

  1. What is the difference between a character class and a capturing group in a regex pattern? When and why would you use a character class? A capturing group? How can they be used together in a regex pattern? Please provide a brief description and an example of each (in the form of a regex pattern) and explain in one or two sentences what it matches.

  2. There are two possible forms of repetition in a regex pattern. For each type, provide a brief description of the term and an example (in the form of a regex pattern). In your descriptions, please also explain why you would use that repetition form instead of the other:

    1. Greedy repetition

    2. Lazy repetition

  3. Imagine you are using regex to tag chapter numbers in a novel. The chapters in this novel do not have titles, so each chapter is preceded by a line that contains only a Roman numeral to denote its number, and there are no spaces before each Roman numeral on the lines in which they appear. The Roman numerals are immediately followed by a period (I., II., III., etc.). If you use the pattern ^[IVXLC].$ to match the numerals, the chapter numbers are not all returned. What’s wrong with this pattern and how can you fix it? (There may be more than one thing to fix.)

  4. Describe the function of a backreference in a regex pattern. When and why would you use a backreference in a replacement pattern? Additionally, what is the difference between the backreference \0 and the other numbered backreferences (for example, \1 and \2)?

  5. What is upconversion? Why is it useful?
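
If you would like to reproduce the symptom described in required question 3 outside <oXygen/>, a minimal sketch along the following lines may help. It assumes Python's re module with the re.MULTILINE flag (so that ^ and $ match at the start and end of each line), and the sample text is invented for illustration. It only shows which lines the pattern from the question finds; it does not diagnose or fix the pattern.

```python
import re

# Hypothetical sample text: each chapter starts with a line that holds
# only a Roman numeral followed by a period, as the question describes.
sample = (
    "I.\nFirst chapter text.\n\n"
    "II.\nSecond chapter text.\n\n"
    "III.\nThird chapter text.\n\n"
    "IV.\nFourth chapter text.\n\n"
    "V.\nFifth chapter text.\n"
)

# The pattern exactly as given in question 3. re.MULTILINE is an assumption
# made here so that ^ and $ anchor to individual lines.
pattern = re.compile(r"^[IVXLC].$", re.MULTILINE)

matches = pattern.findall(sample)
print(matches)  # only some of the five chapter-number lines are returned
```

Comparing the printed list with the five chapter-number lines in the sample shows which chapter numbers are missed.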

Optional extra-credit questions

  1. Imagine you are using regex to convert a text document into well-formed XML. As an end result, you want there to be a root element that contains tagged chapters, chapter titles, paragraphs, and quotations. Note that there is a hierarchy to these elements (for example, a chapter contains a chapter title and paragraphs). From the beginning, describe each step that you would take to transform the plain text document into XML. Be sure to provide example regex patterns for each step. Note the following qualities of the original plain text document:

    1. The text is a novel.

    2. The novel is taken from Project Gutenberg.

    3. Two newlines (one blank line) separate each paragraph from other paragraphs.

    4. More than two newlines separate other sections of the text (e.g., chapters).

    5. Chapter numbers appear on the same line as chapter titles and are formatted as I. I am born.

  2. In <oXygen/>, what is the effect of selecting the Dot matches all checkbox when using regex? Provide an example regex pattern with a dot character that depends on checking Dot matches all to find the correct number of matches in a document and explain why you need to check Dot matches all in order to get the result you want.
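
For optional question 1, it can help to try your patterns on a very small input before running them on a whole novel. The sketch below builds a hypothetical miniature text with the spacing described in that question and counts the runs of consecutive newlines, so you can confirm the sample has the shape you expect. The sample wording, the spacing after chapter title lines, and the helper name newline_runs are all assumptions made for illustration; a real Project Gutenberg file will contain additional material that is not modeled here.

```python
# Hypothetical miniature input. The spacing follows the qualities listed in
# optional question 1; the spacing after title lines is an assumption.
SAMPLE = (
    "I. I am born.\n\n\n"
    "First paragraph of the first chapter.\n\n"
    "Second paragraph, with \"a quotation\" inside it.\n\n\n\n"
    "II. A second chapter.\n\n\n"
    "Only paragraph of the second chapter.\n"
)


def newline_runs(text):
    """Return the length of each run of consecutive newline characters."""
    runs = []
    count = 0
    for ch in text:
        if ch == "\n":
            count += 1
        elif count:
            # A non-newline character ends the current run.
            runs.append(count)
            count = 0
    if count:
        # Record a run that reaches the end of the text.
        runs.append(count)
    return runs


print(newline_runs(SAMPLE))
```

A run of exactly two newlines marks a paragraph boundary in this sample, and longer runs mark the larger divisions.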