Tokenizing Transducers
Kenneth R. Beesley
Xerox Research Centre Europe
6, chemin de Maupertuis
38240 Meylan, France
Ken.Beesley@xrce.xerox.com
October 11, 2004
1 Introduction
Advanced tokenization for Xerox finite-state morphology can be done using the Xerox
tokenize command-line utility, included on the CD-ROM and in upgrades from
Lauri Karttunen (karttunen@parc.com), and a tokenizing transducer that you define
and pre-compile yourself. The definition of a tokenizing transducer is shown in detail
on pages 422-31 of the book Finite State Morphology. This handout contains just a bit
of extra mind-tuning.
The tokenizer expert at the Xerox Research Centre Europe is my colleague Anne
Schiller (schiller@xrce.xerox.com).
2 The tokenize Utility
As shown in Chapter 9, the tokenize utility reads from the standard input and writes
to the standard output. Because of this, it can use the Unix command-line re-direction
operators < and >. Let tokenizingfst be the path to your tokenizing transducer.
tokenize tokenizingfst < infile > outfile
The tokenize utility is commonly used in a pipe, where the output of one program
becomes the input to a subsequent program.
unixprompt> cat mycorpus.txt | \
tokenize tokenizingfst | \
lookup mylanguage.fst | \
grep '+?' | \
gawk '{print $1}' | \
sort | uniq -c | sort -rnb > sorted_failures.txt
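The counting stages at the end of this pipe can be tried without the Xerox tools. The sketch below assumes that lookup writes one tab-separated line per analysis, with the token in the first field and +? marking a failed lookup. The function name count_failures and the sample lines are invented for illustration, and plain awk stands in for gawk.

```shell
# Hypothetical helper: reads lookup-style output on standard input and
# prints each unanalyzed token with its count, most frequent first.
count_failures() {
  grep '+?' |            # keep only lines marked as failed lookups
    awk '{print $1}' |   # the first field is the token itself
    sort | uniq -c |     # count each distinct token
    sort -rnb            # sort by count, descending
}

# Made-up sample standing in for real lookup output (an assumption).
printf 'dog\tdog+Noun+Sg\nblorf\tblorf\t+?\nzzyx\tzzyx\t+?\nblorf\tblorf\t+?\n' |
  count_failures
```

With this sample, blorf is listed before zzyx because it fails twice. The resulting list shows which missing words are worth adding to the lexicon first.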
The tokenize utility, with a properly written tokenizingfst, should produce an
output file with one token per line. That is precisely the input format expected by the
lookup utility.