sgfutils

The package sgfutils contains a few command-line utilities for working with SGF files that describe go (igo, weiqi, baduk) games. This page is about sgfcharset.

sgfcharset

% sgfcharset [-q] [-v] [-na] [-nu] [-nok] [--] [files]
% sgfcharset -toutf8 [-from CHARSET] [-replace] [-f] [--] [files]

The program sgfcharset reads SGF files and tries to guess their character set. If desired, the files are converted to UTF-8.

Guessing the character set of an SGF file is not easy: almost all of the file is ASCII, and perhaps only a small part is written in some local character set. Strictly speaking, an SGF file does not have a character set. It is written in ASCII, with text fields in some character set CS as specified by the CA[] property. But when occurrences of '\' and ']' are escaped, a '\' can end up in the middle of a multibyte character, and the text field may no longer be valid CS text after escaping.
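To make the escaping problem concrete, here is a small Python sketch. It is only an illustration, not code from sgfcharset: the function naive_escape is a hypothetical bytewise escaper of the kind that causes the trouble. It uses the SJIS katakana ソ (bytes 0x83 0x5C) and ゾ (bytes 0x83 0x5D), whose second bytes coincide with ASCII '\' and ']'.

```python
def naive_escape(raw: bytes) -> bytes:
    """Hypothetical escaper: prefixes every 0x5C and 0x5D byte with a
    backslash, without looking at multibyte character boundaries."""
    return raw.replace(b"\\", b"\\\\").replace(b"]", b"\\]")


so = "ソ".encode("shift_jis")  # second byte is 0x5C, the ASCII '\'
zo = "ゾ".encode("shift_jis")  # second byte is 0x5D, the ASCII ']'

# Escaping ゾ inserts 0x5C after the lead byte 0x83, so the result
# starts with the two bytes of ソ, followed by a bare ']'.
escaped_zo = naive_escape(zo)

# Escaping ソ doubles its second byte, leaving a stray '\' behind it.
escaped_so = naive_escape(so)
```

After escaping, the bytes of ゾ read as ソ followed by ']' when interpreted as SJIS, so the escaped field no longer contains the text that was written.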

It is quite common to find SGF files with a CA field that does not describe the character set of its text fields. Maybe the CA field was correct originally, but a general-purpose converter will not update it when it converts the file. That is why sgfcharset tries to determine the character set independently of what the CA field, if any, says. (The standard tells us that the default is Latin-1, i.e. ISO-8859-1, but I have not met many files that use it. Very common are ASCII, UTF-8, GB2312 or GB18030, and SJIS. Big5 and EUC-KR also occur often.)
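One simple way to see why guessing is hard is to try decoding the same bytes under each candidate character set. The Python sketch below does just that. It is not the method sgfcharset uses (this page does not describe its internals); the names CANDIDATES and plausible_charsets are made up for the illustration, and the codec names are those of Python's standard library.

```python
# Candidate character sets named on this page, as Python codec names.
CANDIDATES = ["ascii", "utf-8", "gb18030", "shift_jis", "big5", "euc_kr", "latin-1"]


def plausible_charsets(data: bytes) -> list:
    """Return the candidates under which data decodes without error."""
    ok = []
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate is ruled out
        ok.append(name)
    return ok


pure_ascii = plausible_charsets(b"(;GM[1]SZ[19])")
utf8_text = plausible_charsets("(;C[囲碁])".encode("utf-8"))
```

Pure ASCII input is accepted by every candidate, and Latin-1 accepts any byte sequence at all, so a clean decode only rules candidates out. Choosing among the survivors needs further evidence, such as how plausible the decoded characters are.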

The standard is not very clear on how escaping is supposed to work in multibyte characters, and very often one finds unescaped '\' and ']' bytes that are the second byte of a multibyte character. Therefore sgfcharset treats an unescaped ']' in the input as the end of a text field only when that is grammatically possible, i.e., when the next non-whitespace byte is ';' or '(' or ')' or '[' or an ASCII letter.
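The rule for ']' can be sketched in a few lines of Python. This is an illustration of the rule as stated above, not the actual sgfcharset source; the function name closes_text_field, the set of whitespace bytes, and the behaviour at end of input are assumptions.

```python
import string

# Bytes that may follow the end of a text field, per the rule above.
FOLLOWERS = frozenset(b";()[") | frozenset(string.ascii_letters.encode("ascii"))
WHITESPACE = frozenset(b" \t\r\n")  # assumption: which bytes count as whitespace


def closes_text_field(data: bytes, pos: int) -> bool:
    """Return True if the unescaped ']' at data[pos] can end a text field."""
    i = pos + 1
    while i < len(data) and data[i] in WHITESPACE:
        i += 1
    if i == len(data):
        return False  # assumption: the page does not say what happens at end of input
    return data[i] in FOLLOWERS


# C[ゾア];B[pd] in SJIS: the byte at index 3 is the 0x5D inside ゾ,
# the byte at index 6 is the real end of the comment.
sample = b"C[\x83\x5d\x83\x41];B[pd]"
inside_char = closes_text_field(sample, 3)
real_end = closes_text_field(sample, 6)
```

The check is a heuristic: a ']' byte inside a multibyte character that happens to be followed by an ASCII letter would still be taken for a terminator.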

When invoked without the -toutf8 flag, sgfcharset is informative only, and prints its report to stdout. The options below ask for more or less output.

Options:

-na: Don't mention ASCII.
-nu: Don't mention UTF-8 (implies -na).
-nok: Don't mention cases with one, confirmed, candidate character set (implies -na).
-q: Be more quiet.
-v: Be more verbose. With -v -v: even more verbose.

When invoked with the -toutf8 flag, sgfcharset converts the input files to UTF-8 (and adapts the CA[] field). When reading from stdin, the output goes to stdout. When reading from a file foo.sgf, the output is written to foo.sgf.utf8. The character set used as the starting point for the conversion is the one given by the user in the -from option. When no such option is given, it is the character set specified in the file's CA[] field, if that seems reasonable, and otherwise it is what sgfcharset guesses, if it has a good guess. If sgfcharset is uncertain or has no idea, no conversion is done. If one is optimistic, one can specify that the files must be overwritten in place by the converted files. Of course that means that the original contents are lost.
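The order of preference for the source character set can be written down as a short Python sketch. The names pick_source_charset and decodes_cleanly are hypothetical, and "seems reasonable" is approximated here by "decodes without error", which is an assumption; sgfcharset may apply a different test.

```python
from typing import Optional


def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Stand-in for 'seems reasonable': the bytes decode without error."""
    try:
        data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False  # invalid bytes, or a charset name Python does not know
    return True


def pick_source_charset(data: bytes,
                        from_option: Optional[str] = None,
                        ca_value: Optional[str] = None,
                        guess: Optional[str] = None) -> Optional[str]:
    """Return the charset to convert from, or None for 'do not convert'."""
    if from_option:                      # 1. the user's -from option wins
        return from_option
    if ca_value and decodes_cleanly(data, ca_value):
        return ca_value                  # 2. the CA[] field, if plausible
    return guess                         # 3. a good guess, else None


sjis = "囲碁".encode("shift_jis")
```

For example, pick_source_charset(sjis, ca_value="utf-8", guess="shift_jis") falls through to the guess, because the SJIS bytes are not valid UTF-8.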

Options:

-from CHARSET: Don't guess, but convert from CHARSET to UTF-8.
-replace: Don't append .utf8 to the filename, but replace the original file by the converted file.
-f: Force: do not abort, but replace bytes that are not understood by '?' (and report the number of such replacements).
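The conversion itself, including the effect of -f, can be approximated by the following Python sketch. It is a simplification under several assumptions, not the sgfcharset implementation: the function to_utf8 is hypothetical, the CA[] field is found with a plain regular expression (which could also match inside a comment), a file without a CA[] field is left without one, and SGF escaping is ignored.

```python
import re
from typing import Tuple


def to_utf8(data: bytes, charset: str, force: bool = False) -> Tuple[bytes, int]:
    """Convert data from charset to UTF-8 and rewrite the first CA[] field.

    Returns the converted bytes and the number of replaced bytes.
    Without force, invalid input raises UnicodeDecodeError.
    """
    replaced = 0
    if force:
        # Python marks undecodable input with U+FFFD; count the marks and
        # turn them into '?'. A genuine U+FFFD in the input is counted too.
        text = data.decode(charset, errors="replace")
        replaced = text.count("\ufffd")
        text = text.replace("\ufffd", "?")
    else:
        text = data.decode(charset)
    # Simplification: rewrite the first thing that looks like a CA property.
    text = re.sub(r"(?<![A-Za-z])CA\[[^\]]*\]", "CA[UTF-8]", text, count=1)
    return text.encode("utf-8"), replaced


sjis_game = "(;CA[Shift_JIS]C[囲碁])".encode("shift_jis")
converted, n_replaced = to_utf8(sjis_game, "shift_jis")
forced, n_forced = to_utf8(b"(;C[ab\xffcd])", "utf-8", force=True)
```

Writing the result to foo.sgf.utf8, or over foo.sgf for -replace, is then an ordinary file write and is left out here.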