Byte Order Mark
The implementation of UTF-8 mode in version 8.4.3 cannot handle the
initial Byte Order Mark (BOM) that some UTF-8 editors automatically insert
at the beginning of a file. This is a bug that will be corrected in the
next release.
What is a Byte Order Mark?
Besides UTF-8, there are other UTFs (Unicode Transformation Formats), and
in some of them bytes are interpreted differently depending on whether the
machine is "big-endian" (Sun, Apple) or "little-endian" (Windows and
most Linux machines). For this reason, the Unicode standard specifies
that a file may begin with a BOM, a sequence of reserved bytes that
indicate byte order as well as the type of UTF encoding. However, the
UTF-8 standard (by far the most common UTF) is the same for both
big-endian and little-endian machines. No BOMs are needed in UTF-8 files.
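To make the byte-order point concrete, here is a minimal sketch in Python (used only for illustration; it is not part of xfst). It encodes one character in UTF-16 with each byte order and in UTF-8, and prints the raw bytes. The variable names are arbitrary.

```python
# Encode the same character three ways and compare the raw bytes.
char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf16_be = char.encode("utf-16-be")  # UTF-16, big-endian byte order
utf16_le = char.encode("utf-16-le")  # UTF-16, little-endian byte order
utf8 = char.encode("utf-8")          # UTF-8 has no byte-order variants

print(utf16_be.hex())  # 00e9
print(utf16_le.hex())  # e900
print(utf8.hex())      # c3a9
```

The two UTF-16 results contain the same bytes in opposite order, which is why a reader needs to know the byte order. The UTF-8 result is a fixed byte sequence on any machine.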
Unfortunately, the UTF-8 standard allows but does not require a
BOM at the beginning of a file. It does not indicate byte order;
it just serves to indicate that the encoding is UTF-8 rather than
something else. Some UTF-8 editors insert a BOM, some don't.
We were not aware of this when version 8.4.3 was created.
If you run into this problem, there is a workaround. If you look at
your UTF-8 file in a text editor such as Emacs, you will see that the
first character is a strange blob. Remove it with a Control-D and save
the file. Try your file again in UTF-8 mode in xfst. Everything should
work fine.
A hex editor will show that the UTF-8 BOM consists of three bytes:
0xEF 0xBB 0xBF
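If removing the BOM by hand is inconvenient, the same three bytes can be stripped with a short script. The following is a minimal sketch in Python, not a tool shipped with xfst; the function names strip_utf8_bom and strip_bom_from_file are hypothetical, and the script assumes the file is small enough to read into memory.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file in place without its BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if len(cleaned) == len(data):
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

For example, strip_bom_from_file("lexicon.txt") would rewrite a file of that (made-up) name only if it starts with the BOM. Working on raw bytes rather than decoded text keeps the rest of the file exactly as it was.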
The next version of xfst will ignore this sequence at the beginning of
a UTF-8 file and will signal an error if it sees some other BOM.
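The behavior described above can be sketched as follows. This is an illustration in Python under our own assumptions, not the xfst source code: the function name read_utf8_text, the OTHER_BOMS table, and the error message are all hypothetical.

```python
import codecs

# BOMs of other encodings. The UTF-32 entries come first because the
# UTF-32 little-endian BOM (FF FE 00 00) starts with the UTF-16
# little-endian BOM (FF FE).
OTHER_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16 big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 little-endian"),
]


def read_utf8_text(data: bytes) -> str:
    """Decode data as UTF-8, skipping a UTF-8 BOM and rejecting any other BOM."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]  # ignore the UTF-8 BOM
    else:
        for bom, name in OTHER_BOMS:
            if data.startswith(bom):
                raise ValueError(f"expected UTF-8 but found a {name} BOM")
    return data.decode("utf-8")
```

A file with no BOM at all passes through unchanged, so input that works today would keep working under this scheme.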
Thanks to Mohammed Attia for flushing out this bug.