Byte Order Mark


Older versions of xfst in UTF-8 mode could not accommodate the initial Byte Order Mark (BOM) that some UTF-8 editors automatically insert at the beginning of a file. Earlier versions of the c-fsm code could not decipher the BOM.

What is a Byte Order Mark?

Besides UTF-8, there are other UTFs (Unicode Transformation Formats), and in some of them bytes are interpreted differently depending on whether the machine is "big-endian" (Sun, Apple) or "little-endian" (Windows and most Linux machines). For this reason, the Unicode standard specifies that a file may begin with a BOM, a sequence of reserved bytes that indicates byte order as well as the type of UTF encoding. However, UTF-8 (by far the most common UTF) is the same on both big-endian and little-endian machines, so UTF-8 files do not need a BOM. Unfortunately, the standard allows but does not require a BOM at the beginning of a UTF-8 file. There it does not indicate byte order; it just serves to indicate that the encoding is UTF-8 rather than something else. Some UTF-8 editors insert a BOM, some don't.
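As a rough illustration of how a BOM identifies both the encoding and the byte order, here is a minimal Python sketch. The function name sniff_bom and the table _BOMS are made up for this example and have nothing to do with the xfst or c-fsm code; the BOM constants come from Python's standard codecs module.

```python
import codecs

# Hypothetical lookup table for this example only.
# The UTF-32 entries come first because the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding announced by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM yields None, which says nothing about its encoding; the caller still has to decide what to assume in that case.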
If your UTF-8 file begins with a BOM and you look at it in a text editor such as Emacs, you may see that the first character is a strange blob. You can remove it with a Control-D and save the file. Everything should work fine without the BOM. In UTF-8, the BOM is the three-byte sequence
    0xEF 0xBB 0xBF
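If you would rather remove the BOM with a script than by hand, something along the following lines should work. This is only a sketch in Python; the function name strip_utf8_bom is invented for the example and is not part of xfst.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (0xEF 0xBB 0xBF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

To clean a file, read it in binary mode, pass the bytes through strip_utf8_bom, and write the result back. Python's built-in "utf-8-sig" codec has a similar effect when decoding: it skips a leading BOM if there is one and otherwise behaves like plain UTF-8.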

The current version of xfst ignores a BOM at the beginning of a UTF-8 file and will signal an error if it sees a conflicting character encoding directive.

Thanks to Mohammed Attia for flushing out this bug.