Maintained by: David J. Birnbaum (djbpitt@gmail.com)
Last modified: 2022-03-10T19:14:16+0000
The task for this test is to complete the transformation of Oscar Wilde’s The Importance of Being Earnest from plain text to XML that you began in Regex homework #3. Do not start the test from the original plain-text file; the test assumes that you are starting from the result of completing homework #3. You are welcome to use our posted solution to that assignment to reach the starting point for this test. Alternatively, the following file is our result after completing homework #3, but before any of the test actions, so you can use it as your starting point: ernest-pre-test.xml.
There are multiple ways to finish tagging this play, and it’s fine if your solution differs from ours as long as you arrive at well-formed XML that tags all of the pieces you needed to tag and that makes reasonable use of regular expressions to add the markup. We start with the prepared interim file linked above, which already contains all of the markup that was part of homework #3. The steps we followed are then:
Manually fix the two inconsistencies mentioned in the instructions.
With dot-matches-all unchecked, search for \n\n and replace with </speech>\n\n<speech>.
This approach introduces two types of error, which we fix as follows:
It creates a spurious </speech> end-tag at the beginning and a blank line followed by an erroneous <speech> start-tag at the end. We remove these manually now.
It erroneously tags stand-alone stage directions and settings as speeches. We fix those over-generalizations below.
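The blank-line replacement and its two side effects can be sketched in Python, where the dot does not match a newline unless re.DOTALL is set (the counterpart of leaving dot-matches-all unchecked). The sample text is a hypothetical miniature, not the play, and the replacement string </speech>\n\n<speech> is an assumption inferred from the errors the step is said to produce.

```python
import re

# Hypothetical miniature input (not the real play): two speeches separated
# by a blank line, with a blank line at the very start and the very end.
text = "\n\nJack. First speech.\n\nAlgernon. Second speech.\n\n"

# Assumed replacement: every blank line closes one speech and opens the next.
tagged = re.sub(r"\n\n", r"</speech>\n\n<speech>", text)

# The result begins with a stray </speech> and ends with a stray <speech>,
# which are the two artifacts that have to be removed by hand.
print(tagged)
```

Because the pattern knows nothing about the edges of the file, the first and last blank lines are treated like any other speech boundary, which is why the manual cleanup is needed.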
With dot-matches-all unchecked use the expression supplied in the prompt to
search for
and replace with
\1]]>
. The prompt
explains how this works and why we took this approach.
You can adapt the method for stand-alone stage directions to match settings,
but we searched separately (with dot-matches-all unchecked) first for
, which we
replaced with ]]>
, and
then for ]]>
,
which we replaced with
]]>
.
Act titles (strings like SECOND ACT) are erroneously tagged as <speech><act>SECOND ACT</act></speech>. We remove the <speech> wrappers by searching for
(
(with dot-matches-all unchecked) and replacing it with \1. The pattern doesn’t match the first act, which is fine, since that one does not have the erroneous <speech> wrapper.
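One way to picture the unwrapping step is the following Python sketch. The search pattern is our assumption, chosen only so that it captures an <act> element sitting directly inside <speech> tags; the sample lines are hypothetical placeholders.

```python
import re

# Hypothetical sample: the first act title is bare, the second is wrapped.
sample = (
    "<act>FIRST ACT</act>\n"
    "<speech>Jack. First speech.</speech>\n"
    "<speech><act>SECOND ACT</act></speech>\n"
    "<speech>Algernon. Second speech.</speech>\n"
)

# Assumed pattern: capture the whole <act> element and drop the <speech>
# tags around it. Ordinary speeches do not start with <act>, so they are
# left untouched, and so is the unwrapped first act title.
unwrapped = re.sub(r"<speech>(<act>.+?</act>)</speech>", r"\1", sample)
print(unwrapped)
```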
Act titles are still erroneously tagged, this time as <act>FIRST ACT</act>, etc., and we need to move the tags to surround the entire act, and not just its label, while tagging the label as a <title> element. We do this by matching <act>(.+?)</act> (with dot-matches-all unchecked) and replacing it with </act>\n<act>\n<title>\1</title>.
This creates a spurious </act> end-tag at the beginning of the document, before the start-tag for the first act, and it fails to write a needed </act> end-tag at the end of the document, after the conclusion of the third act. We fix those manually.
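The act-wrapping step and its manual repairs can be sketched in Python as follows. The replacement string is an assumption reconstructed from the description (end the previous act, start a new one, tag the label as <title>), and the sample speeches are hypothetical placeholders.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample with two acts, each followed by one speech.
sample = (
    "<act>FIRST ACT</act>\n"
    "<speech>Jack. First speech.</speech>\n"
    "<act>SECOND ACT</act>\n"
    "<speech>Algernon. Second speech.</speech>\n"
)

# Assumed replacement: close the previous act, open a new one, tag the label.
moved = re.sub(r"<act>(.+?)</act>", r"</act>\n<act>\n<title>\1</title>", sample)

# Emulate the manual repairs: drop the stray leading end-tag and
# append the end-tag that the last act is missing.
stray = "</act>\n"
if moved.startswith(stray):
    moved = moved[len(stray):]
repaired = moved + "</act>\n"

# A temporary <play> wrapper lets the parser confirm the nesting.
root = ET.fromstring("<play>" + repaired + "</play>")
titles = [act.findtext("title") for act in root.findall("act")]
speech_counts = [len(act.findall("speech")) for act in root.findall("act")]
print(titles, speech_counts)
```

The off-by-one at both ends arises because each match writes the end-tag of the preceding act, and the first act has no predecessor while the last act has no successor.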
To move speaker names from the beginning of the content of the <speech> element into an attribute, we match <speech>(.+?)\. (with dot-matches-all unchecked; note the space character after the dot) and replace it with <speech speaker="\1">. This step fails to tag the two speeches where characters speak together, but we didn’t discover that until later (see below).
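As a minimal Python sketch of this step, the pattern and replacement from the text can be applied to a couple of hypothetical speeches (the speech wording is a placeholder, not a quotation from the play).

```python
import re

# Hypothetical sample speeches; the first contains more than one sentence.
sample = (
    "<speech>Jack. First speech. It has two sentences.</speech>\n"
    "<speech>Algernon. Second speech.</speech>\n"
)

# The lazy (.+?) stops at the first dot followed by a space, so only the
# speaker name is captured; the dot and the space are consumed and dropped.
with_speakers = re.sub(r"<speech>(.+?)\. ", r'<speech speaker="\1">', sample)
print(with_speakers)
```

The lazy quantifier matters here: a greedy (.+) would run to the last dot-plus-space on the line and pull part of the speech into the attribute value.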
A well-formed XML document requires a single root element, so we add one by selecting the entire content and wrapping <play> tags around it.
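The effect of the root element can be checked outside <oXygen/> with a rough sketch like the one below, which uses Python's standard xml.etree.ElementTree parser as a stand-in for an editor's well-formedness check. The helper name is_well_formed and the sample body are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a single XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# Hypothetical body with two sibling acts and no common parent.
body = "<act><title>FIRST ACT</title></act>\n<act><title>SECOND ACT</title></act>\n"

without_root = is_well_formed(body)
with_root = is_well_formed("<play>\n" + body + "</play>")
print(without_root, with_root)
```

Without the wrapper the parser stops at the second <act>, because a document may have only one top-level element.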
<oXygen/> shows a green square, but there are two errors, which we fix below.
There are two consecutive speeches where characters speak in unison and our strategy for adding a @speaker attribute failed because there isn’t a dot followed by a space in those lines, which read:
Gwendolen and Cecily Speaking together. Your Christian names are still
an insuperable barrier. That is all!
Jack and Algernon Speaking together. Our Christian names! Is that
all? But we are going to be christened this afternoon.
These are well-formed, so <oXygen/> doesn’t recognize them as errors. We discovered them by running a sanity check, where we searched for <speech> elements without @speaker attributes by searching for just <speech>.
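The sanity check can be sketched in Python as follows. The sample is hypothetical, and <stage> is an assumed element name for the stage direction that separates the dot from the following space in a unison speech.

```python
import re

# Hypothetical sample; <stage> is an assumed name for stage-direction markup.
sample = (
    "<speech>Jack. First speech.</speech>\n"
    "<speech>Jack and Algernon <stage>Speaking together.</stage> Our names!</speech>\n"
)

# Apply the speaker pattern from the text (dot followed by a space).
tagged = re.sub(r"<speech>(.+?)\. ", r'<speech speaker="\1">', sample)

# Sanity check: a literal <speech> start-tag (no attribute) can only belong
# to a speech that the replacement skipped.
missed = [line for line in tagged.splitlines() if "<speech>" in line]
print(missed)
```

A check like this is worth running whenever a replacement is supposed to touch every instance of an element, because skipped instances remain well-formed and raise no error.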
If you tagged speakers without including a space after the dot in the pattern, you’ll see:
Speaking together"> Your Christian names are still
an insuperable barrier. That is all!
Our Christian names! Is that
all? But we are going to be christened this afternoon.
These lines are not well-formed, which means that <oXygen/> will report the problems without our having to look for them.
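To see why the variant without the space produces markup that a parser rejects, consider this Python sketch. The sample line is hypothetical, <stage> is an assumed element name for the stage direction, and xml.etree.ElementTree stands in for the editor's well-formedness check.

```python
import re
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a single XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# Hypothetical unison speech; <stage> is an assumed element name.
line = "<speech>Jack and Algernon <stage>Speaking together.</stage> Our names!</speech>"

# Speaker pattern with the space after the escaped dot left out: the capture
# now runs up to the dot inside the stage direction.
broken = re.sub(r"<speech>(.+?)\.", r'<speech speaker="\1">', line)

print(broken)
print(is_well_formed(line), is_well_formed(broken))
```

The start-tag of the stage direction ends up inside the attribute value, and a raw < is not allowed there, so the parser rejects the result.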
The document is now well formed, with all of the tagging tasks completed.