Digital humanities


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2022-10-18T17:41:04+0000


Test #3: Regex answers

Instructions

Answer the following questions and upload your answers to Canvas. All tests in this course are open book, so you can look things up and you can try out your code in <oXygen/>, but you cannot receive help from another person.

Create your answers in <oXygen/> as a Markdown document, which you can do by creating a new document (using the File → New menu options) and typing markdown into the Type filter text box. If you prefer to use a platform other than <oXygen/> for your Markdown files, you may use that. The file you upload must follow our file-naming conventions, using .md as the filename extension (to indicate that you are submitting a Markdown file). When we ask you to explain what a pattern matches, your explanations should be concise (between one and two sentences); be sure to attend to the most important aspects of the pattern within your description. Code snippets should be surrounded by backticks, as is the usual convention for citing code inside markdown documents. If your pattern requires Dot matches all to be checked, please say so; otherwise we’ll assume that Dot matches all is unchecked.

Required questions

  1. Question: What is the difference between a character class and a capturing group in a regex pattern? When and why would you use a character class? A capturing group? How can they be used together in a regex pattern? Please provide a brief description and an example of each (in the form of a regex pattern) and explain in one or two sentences what it matches.

    Answer: A character class specifies a set of characters using square brackets, where the engine searches for one instance of any single character in that character class within a document. A character class matches any single character contained within it; however, by appending a repetition indicator you can match multiple consecutive instances of any of those characters. A capturing group groups characters together using parentheses. Capturing groups are most commonly utilized to 1) group characters together and set a repetition indicator, 2) group alternating values together, and 3) group characters together that will later be accessed through a backreference. Together, a character class and a capturing group can be used to search for a particular character, then access that character with a backreference.

    Example pattern(s):

    1. cre[ae]k (a pattern containing a character class, [ae], that will match either creek or creak).
    2. <em>([A-Za-z]+?)</em> (a pattern that captures any string of unaccented Latin-alphabet letters surrounded by <em> tags).

    Character classes do not separate their values with commas or pipes; they just write them one after another inside the square brackets. Correct way to match either a or e: [ae]. Incorrect way to match either a or e: [a,e] or [a|e].
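The two example patterns above can be tried outside <oXygen/> as well. The following is a minimal sketch using Python's re module; the sample sentence and the variable names are our own illustrations, not part of the test, and the third pattern is just one assumed way of combining a character class with a capturing group and a backreference.

```python
import re

text = "The creek made the old bridge creak. She wrote <em>very</em> quickly."

# Character class: [ae] matches exactly one character, either a or e.
class_matches = re.findall(r"cre[ae]k", text)

# Capturing group: the parentheses capture the letters between the <em> tags.
group_matches = re.findall(r"<em>([A-Za-z]+?)</em>", text)

# Together: a character class inside a capturing group captures one vowel,
# and the backreference \1 then requires that same vowel to appear again.
doubled = re.findall(r"([aeiou])\1", "creek creak book")

print(class_matches)  # ['creek', 'creak']
print(group_matches)  # ['very']
print(doubled)        # ['e', 'o']
```

Note that re.findall returns the contents of the capturing group, not the whole match, whenever the pattern contains exactly one group.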

  2. There are two possible forms of repetition in a regex pattern. For each type, provide a brief description of the term and an example (in the form of a regex pattern). In your descriptions, please also explain why you would use that repetition form instead of the other:

    1. Question: Greedy repetition

      Answer: As its name suggests, greedy repetition matches as many characters as possible. The repetition indicators * and + are greedy: the engine consumes as many characters as it can and gives characters back only when that is needed for the rest of the pattern to match, so the result is the longest possible match.

      Example pattern: ".+". This pattern looks like it’s searching for all of the quotes in a document (characters located in between an opening and closing quotation mark). However, what it really matches is all of the characters between the first opening quotation mark and the last closing quotation mark in a document (if Dot matches all is checked; otherwise it would be the first and last quotation marks in a line). This pattern keeps extending the match until it reaches its longest possible result (it’s greedy)!

    2. Question: Lazy repetition

      Answer: As its name suggests, lazy repetition matches as few characters as possible. Adding a ? immediately after a repetition indicator (as in *? or +?) makes it lazy: it tells the engine to extend the match only until the rest of the pattern can match. Unlike greedy repetition, lazy repetition stops the first time the pattern is satisfied, returning the shortest possible match.

      Example pattern: ".+?". This pattern modifies the example pattern for greedy repetition above, making it lazy by adding a ? after the + repetition indicator. This modification is important because now, instead of returning all of the characters between the first opening quotation mark and last closing quotation mark in a document (or line), this pattern matches and returns each quote. This example of lazy repetition tells the engine to stop matching the pattern each time it reaches a closing quotation mark that follows an opening quotation mark, with at least one character contained between them. It stops matching each time it reaches a shortest possible match for the pattern (it’s lazy)!
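To see the difference side by side, here is a small sketch in Python's re module that runs the greedy and lazy example patterns against the same line. The sample line is made up for illustration.

```python
import re

line = 'He said "yes" and she said "no" before leaving.'

# Greedy: .+ runs to the last quotation mark on the line.
greedy = re.findall(r'".+"', line)

# Lazy: .+? stops at the first quotation mark that completes the pattern.
lazy = re.findall(r'".+?"', line)

print(greedy)  # ['"yes" and she said "no"']
print(lazy)    # ['"yes"', '"no"']
```

The greedy pattern returns one long match that swallows the text between the two quotations, while the lazy pattern returns each quotation separately.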

  3. Question: Imagine you are using regex to tag chapter numbers in a novel. The chapters in this novel do not have titles, so each chapter is preceded by a line that contains only a Roman numeral to denote its number, and there are no spaces before each Roman numeral on the lines in which they appear. The Roman numerals are immediately followed by a period (I., II., III., etc.). If you use the pattern ^[IVXLC].$ to match the numerals, the chapter numbers are not all returned. What’s wrong with this pattern and how can you fix it? (There may be more than one thing to fix.)

    Answer: There are two mistakes in this example:

    • The character class does not allow for repetition. As a result it matches only Roman numerals that consist of a single character.

    • The period is interpreted by the engine as the regex dot character. As a result it matches a Roman-numeral character followed by any other character, and not only by a literal dot.

    A correct version might be: ^[IVXLC]+\.$. Here, we’ve added a + repetition indicator to search for one or more Roman numerals, followed by an escaped period, so that the engine interprets the . character not as the reserved regex character, but as the literal period that follows a Roman numeral in a chapter number.
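The broken and corrected patterns can be compared with a short Python sketch. The sample text is invented, and we assume the re.MULTILINE flag is the closest Python counterpart to the way ^ and $ match at line boundaries in the question.

```python
import re

novel = "I.\nSome text here.\nII.\nMore text.\nXIV.\nEven more text.\n"

broken_pattern = r"^[IVXLC].$"
fixed_pattern = r"^[IVXLC]+\.$"

# re.MULTILINE lets ^ and $ match at the start and end of every line.
broken = re.findall(broken_pattern, novel, flags=re.MULTILINE)
fixed = re.findall(fixed_pattern, novel, flags=re.MULTILINE)

# The unescaped dot also accepts a two-character line that is not a
# chapter number at all, because . matches any character.
false_positive = re.findall(broken_pattern, "Lo\n", flags=re.MULTILINE)

print(broken)          # ['I.']
print(fixed)           # ['I.', 'II.', 'XIV.']
print(false_positive)  # ['Lo']
```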

  4. Question: Describe the function of a backreference in a regex pattern. When and why would you use a backreference in a replacement pattern? Additionally, what is the difference between the backreference \0 and the other numbered backreferences (for example, \1 and \2)?

    Answer: A backreference points towards the values matched by a capturing group in a regex pattern. Backreferences are particularly useful in a replacement pattern when you want to keep information that you have matched, but that information varies from match to match within its container (for example, within quotation marks). The \0 backreference, which is always available (even when the pattern contains no parentheses), points towards the entirety of each match. The other numbered backreferences point towards the characters contained within a specific parenthesized capturing group, where the number of the backreference corresponds to the order in which the capturing groups are written.
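As a rough illustration in Python's re module, the sketch below uses a numbered backreference and a whole-match backreference in replacement patterns. One difference to keep in mind: Python writes the whole match as \g<0>, so the \0 spelling discussed above should be read as the <oXygen/> form. The sample sentence is our own.

```python
import re

text = 'She whispered "hello" and he shouted "goodbye".'

# \1 in the replacement reuses whatever group 1 captured in each match,
# so the quoted words are kept while the quotation marks are dropped.
tagged = re.sub(r'"(.+?)"', r"<q>\1</q>", text)

# \g<0> stands for the entire match, quotation marks included.
wrapped = re.sub(r'"(.+?)"', r"[\g<0>]", text)

print(tagged)   # She whispered <q>hello</q> and he shouted <q>goodbye</q>.
print(wrapped)  # She whispered ["hello"] and he shouted ["goodbye"].
```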

  5. Question: What is upconversion? Why is it useful?

    Answer: Upconversion refers to the process of making implicit structural information in a text document explicit. Implicit structural information includes paragraph spacing, chapters and chapter titles, quotation marks, and leading whitespace: these are all forms of information in a text that we, as humans, recognize as structural components. These typographic conventions are not explicit in XML (<oXygen/> does not recognize them as anything other than whitespace characters and strings of text), though, and if we want XML processing to act on them we need to convert them to markup. Using regex, we can automatically tag these structural components by writing patterns that match certain arrangements of spacing (paragraphs), Roman numerals (chapter numbers), and other typographic information (quotation marks, chapter titles, etc.). Once the patterns are found, we include XML elements in our replacement patterns, making the structural information explicit to the engine.

    Upconversion is useful because it shortens the time spent on structural markup. If we were to go through a text and manually tag each paragraph, this would understandably take us a long time. With upconversion through regex, we are able to tag large amounts of information in minutes, saving time that would have been spent on structural markup and allowing us to focus our attention on the analytical markup that relates to our specific research questions.

Optional extra-credit questions

  1. Question: Imagine you are using regex to convert a text document into well-formed XML. As an end result, you want there to be a root element that contains tagged chapters, chapter titles, paragraphs, and quotations. Note that there is a hierarchy to these elements (for example, a chapter contains a chapter title and paragraphs). From the beginning, describe each step that you would take to transform the plain text document into XML. Be sure to provide example regex patterns for each step. Note the following qualities of the original plain text document:

    1. The text is a novel.

    2. The novel is taken from Project Gutenberg.

    3. Two newlines (one blank line) separate each paragraph from other paragraphs.

    4. More than two newlines separate other sections of the text (e.g., chapters).

    5. Chapter numbers appear on the same line as chapter titles and are formatted as I. I am born.

    Answer: There are many ways to approach this question, but these are the steps that we would take to convert this text document into XML, utilizing the preliminary information above.

    1. Remove the Project Gutenberg information located at the start and end of the document.

    2. Search for reserved characters &, <, and > (ampersand first). If there are any, replace them with &amp;, &lt;, and &gt;.

    3. Standardize whitespace, so that there are two newlines between each block of text. Find: \n{3,}. Replace: \n\n.

    4. Each block of text is now separated by two newlines. We then find each pair of newlines (\n\n) and replace it with </p><p> for our paragraph elements. We manually add the opening paragraph tag at the start of the document and the closing paragraph tag at the end.

    5. With Dot matches all selected: search for quotations and wrap the quoted characters in a <q> element, which takes the place of the quotation marks. Find: "(.+?)". Replace: <q>\1</q>.

    6. Search for chapter titles and wrap them in a <title> element, which takes the place of the <p> tags we added in step four. Find: <p>([IVXLC]+\..+?)</p>. Replace: <title>\1</title>.

    7. Wrap the chapters in a <ch> element. Find: <title>. Replace: </ch>\n<ch>\n\0. For detailed rationale on this process, see our answer key to the second regex assignment, under Chapters. Manually fix the first and last chapters (as we did with the paragraphs).

    8. Finally, manually add the root element at the start and end of the document.
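Steps two through eight can be sketched as one Python function. This is only an illustration under several assumptions: the Project Gutenberg header and footer have already been removed (step one), the sample text is invented, the root element name novel and the function name upconvert are hypothetical, and Python's \g<0> stands in for the \0 backreference. The manual fixes from the steps are done in code here.

```python
import re
import xml.etree.ElementTree as ET

def upconvert(raw):
    # Step 2: escape reserved characters, ampersand first.
    text = raw.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    # Step 3: collapse runs of three or more newlines to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text.strip())
    # Step 4: turn blank lines into paragraph boundaries, then add the
    # outermost tags (the manual fix in the steps above).
    text = "<p>" + re.sub(r"\n\n", "</p>\n<p>", text) + "</p>"
    # Step 5: tag quotations; re.DOTALL plays the role of Dot matches all.
    text = re.sub(r'"(.+?)"', r"<q>\1</q>", text, flags=re.DOTALL)
    # Step 6: retag paragraphs that begin with a Roman numeral and a period.
    text = re.sub(r"<p>([IVXLC]+\..+?)</p>", r"<title>\1</title>", text)
    # Step 7: start a new chapter before every title.
    text = re.sub(r"<title>", r"</ch>\n<ch>\n\g<0>", text)
    # Fix the first and last chapters: drop the stray leading end tag
    # and close the final chapter.
    stray = "</ch>\n"
    if text.startswith(stray):
        text = text[len(stray):]
    text += "\n</ch>"
    # Step 8: wrap everything in a root element (hypothetical name).
    return "<novel>\n" + text + "\n</novel>"

sample = (
    "I. I am born\n\n\n\n"
    "Whether I shall turn out to be the hero, these pages must show.\n\n"
    'My mother said, "Bless the baby!" & smiled.\n\n\n\n'
    "II. I observe\n\n"
    "The first objects I recall are few."
)

xml = upconvert(sample)
root = ET.fromstring(xml)  # raises ParseError if the result is not well formed
print(xml)
```

Parsing the result with ElementTree is a quick well-formedness check, comparable to watching for the green square in <oXygen/>.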

  2. Question: In <oXygen/>, what is the effect of selecting the Dot matches all checkbox when using regex? Provide an example regex pattern with a dot character that depends on checking Dot matches all to find the correct number of matches in a document and explain why you need to check Dot matches all in order to get the result you want.

    Answer: By default, the regex dot character matches everything but newlines. Sometimes, we want to match things that span multiple lines in a document (e.g., quotations). Without Dot matches all selected, we would only match quotations that begin and end on the same line, as there are newline characters in multiline quotations.

    Example pattern: "(.+?)". Utilizing our example pattern for lazy repetition that matches quotations (characters sandwiched between quotation marks), we can see the importance of Dot matches all. Without Dot matches all selected, the engine will not match quotations that span multiple lines. This is very common in texts, and we want to make sure that when we automatically tag quotations, we are correctly selecting each one of them.
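In Python's re module the counterpart of the Dot matches all checkbox is the re.DOTALL flag, which makes for a compact demonstration. The sample text below is invented; it contains one quotation on a single line and one that spans two lines.

```python
import re

text = 'He said "fine."\nShe replied, "I will come\nback tomorrow."'

# Default behavior: the dot does not match a newline, so only the
# quotation that opens and closes on the same line is found.
without_flag = re.findall(r'"(.+?)"', text)

# With re.DOTALL the dot also matches newlines, so the quotation that
# spans two lines is found as well.
with_flag = re.findall(r'"(.+?)"', text, flags=re.DOTALL)

print(without_flag)  # ['fine.']
print(with_flag)     # ['fine.', 'I will come\nback tomorrow.']
```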