The current version of xfst prefers Unicode in UTF-8 encoding. By default, xfst assumes that scripts and the terminal itself are in UTF-8. To change into ISO-8859-1 mode, invoke the command
xfst[]: set char-encoding latin-1
To set it back to UTF-8 mode, invoke
xfst[]: set char-encoding utf-8
You can launch xfst in ISO-8859-1 mode with an optional -latin1 flag on the Unix command line (here the dollar sign represents the Unix prompt):
$ xfst -latin1
This is equivalent to
$ xfst
xfst[]: set char-encoding latin-1
The current version of lexc assumes UTF-8 by default. The command utf8-mode toggles between the two modes; from the default UTF-8 mode, it switches to latin-1 mode:
lexc> utf8-mode
To toggle back to UTF-8 mode, simply invoke the command utf8-mode again.
You can launch lexc in ISO-8859-1 with an optional -latin1 flag on the Unix command line (the dollar sign here represents the Unix prompt):
$ lexc -latin1
This is equivalent to the command sequence
$ lexc
lexc> utf8-mode
By default, the current version of tokenize assumes that its input is in UTF-8. If the input file is in ISO-8859-1, the -latin1 flag must be added. For example, if the input file myfile.txt is in ISO-8859-1 and your tokenizer FST is in mytokenizer.fst, you could type the following at the command line:
cat myfile.txt | tokenize mytokenizer.fst -latin1 | ...
By default, the current version of lookup assumes that its input is in UTF-8. If the input is in ISO-8859-1, the -latin1 flag must be added. For example, if the input is in ISO-8859-1 and your analyzer FST is in myanalyzer.fst, the flag is added as shown below:
... | lookup myanalyzer.fst -latin1 | ...
or
lookup myanalyzer.fst -latin1 < tokenizedinputfile.txt > myout.txt
or
... | lookup -flags L"=>"LTT my.fst -latin1 > myout.txt
etc.
When using Latin-1, Windows and Mac users should stick to official ISO Latin-1 (ISO-8859-1) and not use the Windows CP 1252 codepage, which is (lamentably) sometimes called "Latin-1". In real ISO Latin-1, character codes in the range 127-159 are undefined. The Microsoft CP 1252 codepage ("Windows Latin-1") assigns glyphs to codes in this undefined range. For example, in Windows Latin-1 the Euro symbol has the code 128. As long as users create and apply networks on their own machine or on some other Windows machine, everything seems to work fine, but the networks cannot be shared with users on other platforms and cannot be used in Xerox utf8-mode. In Latin-1 mode, xfst does not map the Microsoft Euro symbol to its proper Unicode representation \u20AC. Users whose environment is ISO-8859-15 (also known as Latin-9) run into the same problem.
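The check described above can be sketched in a few lines of Python. This is only an illustration and not part of the Xerox tools: the function name c1_range_offsets and the sample data are hypothetical. It reports the offsets of bytes in the range 128-159, which real ISO-8859-1 text does not use but CP 1252 text may. It only makes sense for files in a single-byte encoding, because UTF-8 multi-byte sequences also use byte values in that range.

```python
from typing import List


def c1_range_offsets(data: bytes) -> List[int]:
    """Return the offsets of all bytes in the range 0x80-0x9F (128-159).

    Real ISO-8859-1 text does not use these codes, so any hit in a file
    that is supposed to be Latin-1 suggests CP 1252 content.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# In CP 1252 the Euro symbol is the single byte 128 (0x80).
euro_cp1252 = "\u20ac".encode("cp1252")

# Hypothetical sample: ASCII text followed by a CP 1252 Euro symbol.
sample = b"price: 5 " + euro_cp1252
suspect = c1_range_offsets(sample)

# Genuine ISO-8859-1 text (e-acute is 0xE9) produces no hits.
clean = c1_range_offsets("caf\u00e9".encode("iso-8859-1"))
```

An empty result does not prove that a file is real ISO-8859-1; it only shows that none of the codes in the undefined range occur.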
Bottom Line: Users of the Xerox finite-state software need to understand that ISO-8859-1 in xfst and the other applications means the REAL TRUE ISO-8859-1 STANDARD and not some altered variant such as Latin-9 or CP 1252 ("Windows Latin-1"). For any user who needs symbols that are not in the 7-bit ASCII set, our recommendation is to move to Unicode UTF-8. That is the only encoding that is the same across all platforms and operating systems that support it.
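One way to follow this recommendation is to transcode existing files to UTF-8 before giving them to the tools. The Python sketch below is an illustration under stated assumptions, not a Xerox utility: the helper name to_utf8 is hypothetical, and the caller must know the true source encoding of the data.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from a known single-byte encoding to UTF-8.

    Decoding is strict, so bytes that are undefined in the source
    encoding raise UnicodeDecodeError instead of being guessed.
    """
    return data.decode(source_encoding).encode("utf-8")


# Real ISO-8859-1: e-acute is the single byte 0xE9.
utf8_from_latin1 = to_utf8(b"caf\xe9", "iso-8859-1")

# The Euro symbol has a different code in each legacy encoding,
# but the same UTF-8 representation after conversion.
euro_from_cp1252 = to_utf8(b"\x80", "cp1252")
euro_from_latin9 = to_utf8(b"\xa4", "iso-8859-15")
```

Naming the wrong source encoding fails silently in some cases: Python's iso-8859-1 codec accepts byte 0x80 and maps it to the control character U+0080, not to the Euro symbol, so CP 1252 data has to be declared as cp1252.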
Warning: Some UTF-8 editors insert an optional BOM (Byte Order Mark) at the beginning of the file. UTF-8 files that start with a BOM (a sequence of three bytes: 0xEF 0xBB 0xBF) can be processed without removing the mark. The BOM is not required, and it is harmless unless it conflicts with the file encoding declaration that xfst and other c-fsm applications look for on the first line of a text input file:
#-*- coding: utf-8 -*-
or
#-*- coding: iso-8859-1 -*-
Please read "UTF-8 problem: BOM" for information about how to deal with this problem if it affects you.
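To find out whether a file starts with a BOM, you can inspect its first three bytes. The following Python sketch is a hedged illustration only: the function name split_bom is hypothetical, it says nothing about how xfst itself treats the mark, and the note referenced above remains the authority on the recommended fix.

```python
import codecs
from typing import Tuple


def split_bom(data: bytes) -> Tuple[bool, bytes]:
    """Report whether data starts with the UTF-8 BOM and return the rest.

    codecs.BOM_UTF8 is the three-byte sequence 0xEF 0xBB 0xBF.
    """
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


# Hypothetical file content: a BOM followed by an encoding declaration.
raw = codecs.BOM_UTF8 + b"#-*- coding: utf-8 -*-\n"
had_bom, body = split_bom(raw)

# Content without a BOM is returned unchanged.
no_bom, same = split_bom(b"#-*- coding: iso-8859-1 -*-\n")
```

In practice the bytes would come from a file opened in binary mode, and the stripped content would be written back only if the BOM actually causes a problem.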