xfst-8.3.2 and later can handle Unicode UTF-8. By default, xfst assumes that scripts and the terminal itself are in ISO-8859-1. To change into UTF-8 mode, invoke the command
xfst[]: set char-encoding UTF-8
To set it back to ISO-8859-1 mode, invoke
xfst[]: set char-encoding ISO-8859-1
You can launch xfst in UTF-8 mode with an optional -utf8 flag on the Unix command line (here the dollar sign represents the Unix prompt):
$ xfst -utf8
This is equivalent to
$ xfst
xfst[]: set char-encoding UTF-8
lexc assumes ISO-8859-1 by default. The command utf8-mode toggles between ISO-8859-1 mode and UTF-8 mode:
lexc> utf8-mode
To toggle back to ISO-8859-1 mode, simply invoke the command utf8-mode again.
You can launch lexc in UTF-8 with an optional -utf8 flag on the Unix command line (the dollar sign here represents the Unix prompt):
$ lexc -utf8
This is equivalent to the command sequence
$ lexc
lexc> utf8-mode
By default tokenize assumes that its input is in ISO-8859-1. If the input file is in UTF-8, then the -utf8 flag must be added. For example, if the input file myfile.utf8 is in UTF-8, and your tokenizer FST is in mytokenizer.fst then you could type the following at the command line:
cat myfile.utf8 | tokenize mytokenizer.fst -utf8 | ...
By default lookup assumes that its input is in ISO-8859-1. If the input is in UTF-8, then the -utf8 flag must be added. For example, if the input is in UTF-8, and your analyzer FST is in myanalyzer.fst then the flag is added as shown below:
... | lookup myanalyzer.fst -utf8 | ...
or
lookup myanalyzer.fst -utf8 < tokenizedinputfile.utf8 > myout.utf8
or
... | lookup -flags L"=>"LTT my.fst -utf8 > myout.utf8
etc.
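Because tokenize and lookup rely on the -utf8 flag rather than detecting the encoding themselves, it can help to confirm that an input file really is UTF-8 before adding the flag. The following Python sketch is not part of the Xerox tools; the function name is_valid_utf8 is a hypothetical helper chosen for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Example: the same character in two encodings.
utf8_bytes = "é".encode("utf-8")         # two bytes: 0xC3 0xA9
latin1_bytes = "é".encode("iso-8859-1")  # one byte: 0xE9

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

Note that pure 7-bit ASCII input passes this check as well, since ASCII is a subset of UTF-8; such input is read identically in either mode.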
When using Latin-1, Windows and Mac users should stick to official ISO Latin-1 (ISO-8859-1) and not use the Windows CP 1252 codepage, which is (lamentably) sometimes called "Latin-1". In real ISO Latin-1, character codes in the range 127-159 are undefined. Microsoft CP 1252 ("Windows Latin-1") assigns codes in this range to glyphs listed on its codepage. For example, in Windows Latin-1, the Euro symbol has the code 128. As long as users create and apply networks on their own machine or some other Windows machine, everything seems to work fine, but the networks cannot be shared with users on other platforms and cannot be used in Xerox utf8-mode. In Latin-1 mode, xfst does not map the Microsoft euro symbol to its proper Unicode representation \u20AC. The same problem affects users whose environment is ISO-8859-15 (also known as Latin-9).
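One way to spot a file that is really CP 1252 rather than ISO-8859-1 is to look for bytes in the range 128-159, which is where CP 1252 places its extra glyphs. The following Python sketch is only an illustration under that assumption; find_c1_bytes is a hypothetical helper, not a Xerox utility.

```python
def find_c1_bytes(data: bytes) -> list:
    """Return (offset, byte value) pairs for every byte in 0x80-0x9F (128-159).

    A file that is meant to be real ISO-8859-1 text should normally
    yield an empty list; hits suggest CP 1252 or some other encoding.
    """
    return [(offset, value) for offset, value in enumerate(data)
            if 0x80 <= value <= 0x9F]


# Byte 0x80 (128) is the Euro symbol in CP 1252.
suspicious = b"10 \x80"
clean = "café".encode("iso-8859-1")  # 0xE9 is outside the flagged range

print(find_c1_bytes(suspicious))
print(find_c1_bytes(clean))
```

An empty result does not prove that a file is ISO-8859-1; it only shows that none of the codes that differ between the two encodings are present.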
Bottom Line: Users of the Xerox finite-state software need to understand that ISO-8859-1 in xfst and the other applications means the REAL TRUE ISO-8859-1 STANDARD and not some altered variant such as Latin-9 or CP 1252 ("Windows Latin-1"). For any user who needs symbols that are not in the 7-bit ASCII set, our recommendation is to move to Unicode UTF-8. That is the only encoding that is the same across all platforms and operating systems that support it.
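Moving existing source files to UTF-8 amounts to decoding them from their actual encoding and re-encoding the result. The following Python sketch shows the idea; to_utf8 is a hypothetical helper, and the caller is assumed to know the true source encoding of the data.

```python
def to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from source_encoding into UTF-8.

    Caution: Python's iso-8859-1 codec accepts every byte value, so
    CP 1252 data mislabelled as ISO-8859-1 converts without an error
    but yields control characters instead of the intended glyphs.
    """
    return data.decode(source_encoding).encode("utf-8")


# The Euro symbol lives at a different code in each legacy encoding,
# but all of them map to the same UTF-8 bytes for U+20AC.
from_cp1252 = to_utf8(b"\x80", "cp1252")
from_latin9 = to_utf8(b"\xa4", "iso-8859-15")

print(from_cp1252 == from_latin9)
```

Passing the wrong source_encoding is the main risk here, which is why the encoding is an explicit argument rather than something the sketch tries to guess.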
Warning: Some UTF-8 editors insert an optional BOM (Byte Order Mark) into the beginning of the file. UTF-8 files that start with a BOM cannot be processed without removing this mark (a sequence of three bytes: 0xEF 0xBB 0xBF). Please read "UTF-8 problem: BOM" for information about how to deal with this problem. It will go away in the next release.
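Removing the mark means dropping the three bytes 0xEF 0xBB 0xBF when they appear at the very start of the file. The following Python sketch illustrates this; strip_utf8_bom is a hypothetical helper and is not necessarily the procedure described in the referenced page.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (0xEF 0xBB 0xBF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "LEXICON Root".encode("utf-8")
print(strip_utf8_bom(with_bom))
```

Only a BOM at the start of the data is removed; the same three bytes occurring later in the file are left untouched, and data without a BOM is returned unchanged.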