FAQ: UTF-8 and Xerox Finite-State Software

ISO-8859-1 or Unicode in UTF-8 Encoding

The new versions of the Xerox Finite-State utilities xfst, lexc, tokenize and lookup can handle either

    1. ISO-8859-1 (Official ISO 8-bit Latin-1), or
    2. Unicode UTF-8

ISO-8859-1, also known as Latin-1, is still the default encoding in all cases. Windows and Mac users are warned that they should stick to the Real Official ISO Latin-1 and not stray into additions allowed by Windows Latin-1 (CP 1252) and Macintosh Roman. See the section Beware Windows "Latin-1" below.

We encourage users to move to Unicode UTF-8 if they need any characters beyond the 7-bit ASCII set. Unicode is the Future. Regional 8-bit encodings such as ISO-8859-2 and mutants such as CP 1252 are the Past.

The treatment of the Euro symbol is a good example of why it is best to avoid 8-bit encodings other than standard ISO-8859-1. There is no Euro symbol in the part of Unicode that corresponds to ISO-8859-1. The proper Unicode code point for € [this may or may not display correctly as the Euro sign in your browser] is decimal 8364 (0x20AC). In Windows CP 1252, € has the code 128 (0x80); in ISO-8859-15 (also known as Latin-9), it is 164 (0xA4); in Macintosh Roman, it is 219 (0xDB). These incompatible 8-bit encoding standards breed confusion. The best way out is to adopt the Unicode standard in the common UTF-8 encoding, which is supported on all modern operating systems.
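The disagreement among these encodings is easy to check outside the Xerox tools. The following is a minimal sketch in Python 3 (not part of the Xerox software), assuming Python's standard codec names latin_1, cp1252, iso8859_15 and mac_roman; the helper name euro_bytes is hypothetical.

```python
# Illustration only: uses Python 3's built-in codecs, not the Xerox utilities.
EURO = "\u20ac"  # Unicode code point decimal 8364 (0x20AC)


def euro_bytes(encoding):
    """Return the bytes that encode the Euro sign, or None if the
    encoding has no Euro sign at all (hypothetical helper)."""
    try:
        return EURO.encode(encoding)
    except UnicodeEncodeError:
        return None


# Print the byte sequence for each encoding discussed above.
for name in ("latin_1", "cp1252", "iso8859_15", "mac_roman", "utf-8"):
    print(name, euro_bytes(name))
```

Each 8-bit encoding that has a Euro sign puts it at a different single byte, true ISO-8859-1 has none, and UTF-8 uses a three-byte sequence.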

xfst

xfst-8.3.2 and later can handle Unicode UTF-8. By default, xfst assumes that scripts and the terminal itself are in ISO-8859-1. To change into UTF-8 mode, invoke the command

xfst[]: set char-encoding UTF-8

To set it back to ISO-8859-1 mode, invoke

xfst[]: set char-encoding ISO-8859-1

You can launch xfst in UTF-8 mode with an optional -utf8 flag on the Unix command line (here the dollar sign represents the Unix prompt):

$ xfst -utf8

This is equivalent to

$ xfst
xfst[]: set char-encoding UTF-8

lexc

lexc assumes ISO-8859-1 by default. The command utf8-mode toggles between ISO-8859-1 mode and UTF-8 mode:

lexc> utf8-mode

To toggle back to ISO-8859-1 mode, simply invoke the command utf8-mode again.

You can launch lexc in UTF-8 with an optional -utf8 flag on the Unix command line (the dollar sign here represents the Unix prompt):

$ lexc -utf8

This is equivalent to the command sequence

$ lexc
lexc> utf8-mode

tokenize

By default, tokenize assumes that its input is in ISO-8859-1. If the input file is in UTF-8, then the -utf8 flag must be added. For example, if the input file myfile.utf8 is in UTF-8, and your tokenizer FST is in mytokenizer.fst, then you could type the following at the command line:

cat myfile.utf8 | tokenize mytokenizer.fst -utf8 | ...
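If you are not sure whether a file is in UTF-8, one rough check is to see whether its bytes decode as UTF-8. This is a sketch in Python 3, not a feature of tokenize or lookup; the function name looks_like_utf8 is hypothetical. A file in ISO-8859-1 that contains accented letters will almost always fail the check, while a pure 7-bit ASCII file passes because ASCII is valid in both encodings.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence.

    Hypothetical helper for illustration. Pure ASCII input also
    returns True, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word encoded two ways: only the first is valid UTF-8.
print(looks_like_utf8("caf\u00e9".encode("utf-8")))
print(looks_like_utf8("caf\u00e9".encode("latin_1")))
```

To test a real file, read it in binary mode, for example with open("myfile.utf8", "rb"), and pass the bytes to the function.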

lookup

By default, lookup assumes that its input is in ISO-8859-1. If the input is in UTF-8, then the -utf8 flag must be added. For example, if the input is in UTF-8, and your analyzer FST is in myanalyzer.fst, then the flag is added as shown below:

... | lookup myanalyzer.fst -utf8 | ...

or

lookup myanalyzer.fst -utf8 < tokenizedinputfile.utf8 > myout.utf8

or

... | lookup -flags L"=>"LTT my.fst -utf8 > myout.utf8

etc.

Beware Windows "Latin-1"

When using Latin-1, Windows (and Mac) users should stick to Official ISO Latin-1 and not use the Windows CP 1252 codepage, which is (lamentably) sometimes called "Latin-1". In real ISO Latin-1, character codes in the range 127-159 are undefined. Microsoft's CP 1252 ("Windows Latin-1") assigns glyphs to most of the codes from 128 to 159. For example, in Windows Latin-1, the Euro symbol has the code 128. As long as users create and apply networks on their own machine or some other Windows machine, everything seems to work fine, but the networks cannot be shared with users on other platforms and cannot be used in Xerox utf8-mode. In Latin-1 mode, xfst does not map the Microsoft Euro symbol to its proper Unicode representation \u20AC. Users whose environment is ISO-8859-15 (also known as Latin-9) run into the same problem.

Bottom Line: Users of the Xerox finite-state software need to understand that ISO-8859-1 in xfst and the other applications means the REAL TRUE ISO-8859-1 STANDARD and not some altered variant such as Latin-9 or CP 1252 ("Windows Latin-1"). For any user who needs symbols that are not in the 7-bit ASCII set, our recommendation is to move to Unicode UTF-8. That is the only encoding that is the same across all platforms and operating systems that support it.
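One possible way to move existing CP 1252 data to UTF-8 is to transcode it before handing it to the Xerox tools. The sketch below uses Python 3 and is not part of the Xerox software; it assumes the input really is CP 1252, and the function name cp1252_to_utf8 is hypothetical.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Transcode CP 1252 bytes to UTF-8 (hypothetical helper).

    Python's cp1252 codec raises UnicodeDecodeError for the few byte
    values that CP 1252 leaves unassigned (for example 0x81).
    """
    return data.decode("cp1252").encode("utf-8")


# The CP 1252 Euro byte 0x80 becomes the UTF-8 sequence E2 82 AC.
print(cp1252_to_utf8(b"\x80"))

# Read as true ISO-8859-1, the same byte is only the control code
# U+0080 and is never turned into the Euro sign U+20AC.
print(repr(b"\x80".decode("latin_1")))
```

The second print shows, in Python rather than in xfst, why a CP 1252 Euro byte read as ISO-8859-1 does not end up as \u20AC.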

Warning: Some UTF-8 editors insert an optional BOM (Byte Order Mark) at the beginning of the file. UTF-8 files that start with a BOM cannot be processed without removing this mark (a sequence of three bytes: 0xEF 0xBB 0xBF). Please read "UTF-8 problem: BOM" for information about how to deal with this problem. This limitation will go away in the next release.
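Removing the mark amounts to dropping the first three bytes when they match. The following Python 3 sketch is one way to do it and is not taken from the Xerox documentation; the function names strip_utf8_bom and strip_bom_file are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (0xEF 0xBB 0xBF).

    Hypothetical helper; data without a BOM is returned unchanged.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping a leading BOM if present."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))
```

Only a BOM at the very start of the data is removed; the rest of the file is copied byte for byte.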



Last Modified: Tuesday, 07-Dec-2004 09:19:25 PST