Spelling Correction
A robust morphological analyzer should not only recognize words that
are spelled correctly. It should also deal intelligently with common
misspellings. Words ending in -ible, -able, -ant, -ent are often misspelled:
irresistible ~ irresistable, indispensible ~ indispensable, dominant ~ dominent,
inadvertant ~ inadvertent. Doubled consonants provide many opportunities for
wrong spellings: misspelling ~ mispelling, occurence ~ ocurrence ~ occurrence,
embarasment ~ embarrasment ~ embarrassment. A Spelling Test
by Mindy McAdams gives you a chance to test your skills on 50 commonly
misspelled words.
The Task
Pick at least a dozen correctly spelled words from the McAdams list and
write one or several replace rules that produce one or more misspelled
variants for each of the words on your list. For example, to
produce irresistable from irresistible and dominent from dominant, you can use rules such as
{ible} -> {able}, {ant} -> {ent} ;
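To get a feel for which variants such rules generate, the substitutions can be tried out on plain strings first. The following is only a Python sketch, not xfst: the rule table RULES and the function misspellings are hypothetical names, and the sketch rewrites one matching site at a time, which is a simplification of what a replace rule does inside a transducer.

```python
# Hypothetical rule table mirroring {ible} -> {able}, {ant} -> {ent}.
RULES = [("ible", "able"), ("ant", "ent")]


def misspellings(word, rules=RULES):
    """Return the variants obtained by rewriting one matching site at a time."""
    variants = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            variants.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    variants.discard(word)  # a rule that changes nothing is not a misspelling
    return sorted(variants)


print(misspellings("irresistible"))  # ['irresistable']
print(misspellings("dominant"))      # ['dominent']
```

A word that matches none of the rules, such as banjo, yields an empty list, so it would appear in the transducer only with its correct spelling.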
The task is to make a
lexical transducer that has on the upper side only correctly spelled
words. On the lower side, it should have both the correct spelling and
the incorrect spelling produced by the misspelling rules. Each
misspelled word on the lower side is paired with the correctly
spelled form. That is, the commands
xfst[1]: apply up irresistable
xfst[1]: apply up irresistible
should both produce the result
irresistible
On the other hand, the command
xfst[1]: apply down irresistable
should not yield any output because it is an incorrect spelling.
And the command
xfst[1]: apply down irresistible
should only produce the output
irresistible
and not the misspelled variant. To get this behavior you need to have
the failure flag diacritic, @F@, on the
upper side of each path that contains
a misspelled variant of the lower side. For example,
Upper side: @F@ i r r e s i s t i b l e
Lower side:     i r r e s i s t a b l e
The position of the @F@ along
the path does not matter. It could also be where the error occurs or at
the end. What is important is that the @F@ flag is
mapped to an epsilon on the lower side. Therefore, it is not visible to
the apply up
routine but blocks the incorrect realization in the apply down
case.
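The effect of the flag can be modeled on a single path, following the description above rather than the actual xfst implementation. In this Python sketch a path is a list of (upper, lower) symbol pairs, epsilon is the empty string, and the names make_path, apply_up and apply_down are hypothetical helpers that only imitate the two xfst commands. It assumes the two spellings have the same length, which holds for this example but not in general.

```python
FLAG = "@F@"  # failure flag symbol, as named in the text
EPS = ""      # epsilon, modeled here as the empty string


def make_path(upper_word, lower_word, flag_at):
    """Pair the two words symbol by symbol and insert FLAG:EPS at index flag_at."""
    if len(upper_word) != len(lower_word):
        raise ValueError("this sketch only pairs words of equal length")
    path = list(zip(upper_word, lower_word))
    path.insert(flag_at, (FLAG, EPS))
    return path


def apply_up(path, surface):
    """Match the lower side; the flag pairs with epsilon, so it is never seen."""
    lower = "".join(lo for _, lo in path)
    if lower != surface:
        return None
    return "".join(up for up, _ in path if up != FLAG)


def apply_down(path, lexical):
    """Match the upper side; a path that carries the failure flag is blocked."""
    upper = "".join(up for up, _ in path if up != FLAG)
    if upper != lexical or any(up == FLAG for up, _ in path):
        return None
    return "".join(lo for _, lo in path)


# Flag at the beginning, at the error site, and at the end of the path.
for position in (0, 8, 12):
    path = make_path("irresistible", "irresistable", position)
    print(apply_up(path, "irresistable"), apply_down(path, "irresistible"))
    # each line prints: irresistible None
```

All three positions behave identically, which is the point made above: only the presence of the flag on the upper side matters, not where it sits.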
To make this exercise more realistic, let us throw in a few words such
as banjo that have two
possible legal spellings in the plural. The commands
xfst[1]: apply up banjoes
xfst[1]: apply up banjos
should both produce the output
banjos
and the command
xfst[1]: apply down banjos
should produce two outputs
banjoes
banjos
The purpose of this lexical transducer is twofold: to correct incorrect
spellings and to normalize variant spellings into a single canonical
form.
The xfst script that creates
it should leave the transducer on the stack for testing.
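As a checklist for testing the finished transducer, the required behavior can be written down as a small word-level model. This is a Python sketch with hypothetical names (PATHS, apply_up, apply_down), not an xfst script. Each entry stands for one path, and the boolean marks paths that would carry the failure flag on the upper side.

```python
# Each entry: (upper side, lower side, blocked on apply down).
PATHS = [
    ("irresistible", "irresistible", False),
    ("irresistible", "irresistable", True),   # misspelling: carries the flag
    ("banjos", "banjos", False),
    ("banjos", "banjoes", False),             # legal variant: no flag
]


def apply_up(surface):
    """Map a surface spelling to its canonical form(s)."""
    return sorted({up for up, lo, _ in PATHS if lo == surface})


def apply_down(lexical):
    """Map a canonical form to its legal spellings, skipping flagged paths."""
    return sorted({lo for up, lo, blocked in PATHS if up == lexical and not blocked})


print(apply_up("irresistable"))    # ['irresistible']
print(apply_down("irresistable"))  # []
print(apply_down("irresistible"))  # ['irresistible']
print(apply_up("banjoes"))         # ['banjos']
print(apply_down("banjos"))        # ['banjoes', 'banjos']
```

The two purposes of the transducer show up as the two kinds of non-identity entries: flagged ones correct a misspelling in one direction only, while unflagged ones normalize a legal variant and remain available in both directions.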