Notes on the SGF format

For a byte-oriented parser, this escape mechanism is OK. But for a reader who understands the character set used, the escape mechanism produces 'mojibake'. For example, the text

For local use (say, among the users of GB18030) it is much more convenient not to do this escaping and unescaping, so that the SGF file is a real text file that can be handled by general utilities for text files and displayed without conversion. This is what one commonly sees. For instance, 金昞俊 (Kim Pyonjun) is spelled bdf0 955c bfa1, with an unescaped '\' (byte 5c), and 朴埈奭 (Park Junseok) is spelled c6d3 88ad 8a5d, with an unescaped ']' (byte 5d).
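What a byte-oriented reader makes of such unescaped names can be sketched as follows. This is only an illustration, not a real SGF parser: the function name read_value_bytewise is hypothetical, and the PB[...] wrapper around the byte sequences quoted above is an assumption made for the example.

```python
def read_value_bytewise(data: bytes, start: int):
    """Read a property value that begins at data[start] (just after '[').

    Escaping is applied at the byte level: 0x5c ('\\') makes the next byte
    literal, and the first unescaped 0x5d (']') ends the value.
    Returns (value_bytes, index_of_closing_bracket), or -1 as the index
    when no closing bracket is found.
    """
    out = bytearray()
    i = start
    while i < len(data):
        b = data[i]
        if b == 0x5C and i + 1 < len(data):
            out.append(data[i + 1])   # drop the '\', keep the next byte
            i += 2
        elif b == 0x5D:
            return bytes(out), i
        else:
            out.append(b)
            i += 1
    return bytes(out), -1


# The two names as GB18030 bytes, stored without any escaping.
kim = b"PB[" + bytes.fromhex("bdf0955cbfa1") + b"]"
park = b"PB[" + bytes.fromhex("c6d388ad8a5d") + b"]"

kim_value, kim_end = read_value_bytewise(kim, 3)
park_value, park_end = read_value_bytewise(park, 3)

print(kim_value.hex(), kim_end)    # the byte 5c has been swallowed
print(park_value.hex(), park_end)  # the value stops one byte early
```

In the first case the reader silently removes the 5c byte from the middle of a character; in the second it takes the 5d byte inside the last character for the end of the value, and the real closing bracket is left over.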

The grammar is not explicit about whether the escape mechanism is supposed to be used at the character level or at the byte level, but since parsers cannot be expected to know about all character sets in the world, one might assume that the escape mechanism is supposed to act at the byte level. However, people who follow this interpretation run into problems.

One consequence is that when a website is moved to another system and all files are converted to UTF-8, the SGF files suddenly become unparseable. The first problem is that converting 'CA[big5]' to UTF-8 does not change this string, so the converted file carries an incorrect character set indication. The second problem is that a65c converts to U+5412 (e5 90 92), which no longer contains a '\' at the byte level, so the following ']' is no longer escaped and the SGF file can no longer be interpreted.
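One way this can arise is sketched below, under the assumption that the original content was a Big5 character whose second byte is 5d and that the writer escaped it at the byte level. The helper escape_bytewise and the C[...] property used here are illustrative choices, not taken from any particular program.

```python
def escape_bytewise(value: bytes) -> bytes:
    """Byte-level escaping: insert 0x5c ('\\') before every 0x5c and 0x5d."""
    out = bytearray()
    for b in value:
        if b in (0x5C, 0x5D):
            out.append(0x5C)
        out.append(b)
    return bytes(out)


# A two-byte Big5 character whose second byte happens to be 0x5d (']').
content = bytes.fromhex("a65d")
stored = b"C[" + escape_bytewise(content) + b"]"   # 43 5b a6 5c 5d 5d

# A general-purpose converter knows nothing about SGF escapes. It pairs
# a6 with the inserted 5c (giving U+5412) and then sees a bare ']'.
as_text = stored.decode("big5")
converted = as_text.encode("utf-8")

print(stored.hex())
print(converted.hex())
```

After the conversion the value ends at the first ']', its content is a different character, and a stray ']' follows. The 'CA[big5]' property, if present, would pass through the converter unchanged.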

In short: SGF with character-level escaping requires knowledge of the character set. SGF with byte-level escaping must be regarded as a binary format. It does not have an associated character set, and cannot be converted to a different character set by a general-purpose converter.

Grammar

It is here the problems start. What is a PropValue? It is a ValueType, or two ValueTypes separated by ':', and a ValueType is? Nobody knows. There are several possibilities, and some of these are listed as 'game-specific'.

Among the possibilities for a ValueType, several seem subcases of others, so that the grammar is ambiguous. A Double looks like a special case of a Number, and a Number looks like a special case of a SimpleText, and a SimpleText looks like a special case of a Text. This is of some importance, since the escaping needed is described in terms of categories that cannot be distinguished.

The standard should have said that all occurrences of ']' and '\' inside a CValueType must be escaped by inserting a '\'. It half-heartedly makes some suggestions that this perhaps is what is meant.
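The rule as stated here is short enough to write down. The following is a minimal sketch at the character level, with hypothetical function names; it deliberately ignores the further FF[4] details (escaped ':' in composed values, and a '\' before a line break in Text), so it is not a complete implementation of the standard.

```python
def escape_value(text: str) -> str:
    """Escape a property value: put '\\' before every '\\' and every ']'."""
    # Backslashes must be doubled first, or the ones added for ']' would
    # themselves be doubled.
    return text.replace("\\", "\\\\").replace("]", "\\]")


def unescape_value(escaped: str) -> str:
    """Inverse of escape_value: a '\\' makes the next character literal."""
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == "\\" and i + 1 < len(escaped):
            i += 1                    # skip the escape character itself
        out.append(escaped[i])
        i += 1
    return "".join(out)


sample = "a]b\\c"
print(escape_value(sample))
print(unescape_value(escape_value(sample)) == sample)
```

Because both functions work on characters (str) rather than bytes, they require that the file has been decoded with the right character set first, which is exactly the dependency discussed in the notes on the character set.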

Colons

Goal

Variations

We see here that a parenthetical variation precedes the main line of the game: the first move inside the variation and the first move after the variation are siblings.

According to the SGF FF[4] standard, main line and variant must both be GameTrees, so we need two more parentheses. Worse, the main line must come first, so we must move the parenthetical part to the end. This means that variants occurring near move 200 are stored textually earlier in the file than variants occurring near move 60. It also means that comments about a move and about its sibling are stored far from each other. This is unfortunate, a mistake in my opinion, although one might maintain that the file is not for human consumption and that the internal format is irrelevant.

However, it is easy to allow the above as extension, letting the formerly ungrammatical

This means that the sequence of moves A B C, where for each of the moves there are 1-move deep variations A', A'', etc., which is presently written

I find that Jan van der Steen made the same proposal:

Variations
From: Jan van der Steen
There is a lot of confusion about where the exact location of a variation should be, same level or after a move. I've entered a lot of games using mgt and for humans it's much easier and natural to enter the variation when commenting on the move you want to give an alternative for. So I *always* create the diagram after the move has been played and dully remove the originally played move. I understand the problems how to interpret the resulting tree but that's inherent to the format not to the user using the interface. Maybe we should reconsider the meaning of the braces.
( C[Game start] ;B[point] ;W[point] ( C[Subgame based on initial position with two stones] ;B[point] # ... ) C[game continues] ;B[point] # ... )
So instead of splitting the main branche into two sub-branches (which one is the game, and which is the variation?) we let the main branche proceed untouched and just create a side branch (subgame). No confusion possible, right?
Comment MM: Oh no! RTFM

But Martin Müller is mistaken and Jan van der Steen was right. The convention chosen by SGF is unfortunate. This old choice cannot be undone, but it is easy to accept both formats.
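Accepting both formats can be done by normalizing on input. The sketch below assumes the interpretation given earlier: a parenthesized group that is followed by more nodes at the same level is a variation, and the nodes after it are the main line. The function names are hypothetical, brackets are scanned at the character level, and nodes are assumed to start with ';'.

```python
def parse_items(text):
    """Parse into nested lists: a node is a str starting with ';',
    a parenthesized group is a list of items."""
    stack = [[]]
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c == "(":
            group = []
            stack[-1].append(group)
            stack.append(group)
            i += 1
        elif c == ")":
            stack.pop()
            i += 1
        elif c == ";":
            j = i + 1
            in_value = False
            while j < n:                  # scan to the end of this node
                ch = text[j]
                if in_value:
                    if ch == "\\":        # escaped character inside [...]
                        j += 2
                        continue
                    if ch == "]":
                        in_value = False
                elif ch == "[":
                    in_value = True
                elif ch in ";()":
                    break
                j += 1
            stack[-1].append(text[i:j].strip())
            i = j
        else:
            i += 1                        # whitespace between nodes
    return stack[0]


def to_ff4(items):
    """Render one tree body in FF[4] order: nodes, main line, variations."""
    i = 0
    while i < len(items) and isinstance(items[i], str):
        i += 1
    nodes = items[:i]
    j = i
    while j < len(items) and isinstance(items[j], list):
        j += 1
    groups, rest = items[i:j], items[j:]
    out = "".join(nodes)
    if rest:                              # inline form: main line goes first
        out += "(" + to_ff4(rest) + ")"
    out += "".join("(" + to_ff4(g) + ")" for g in groups)
    return out


def convert(text):
    return "".join("(" + to_ff4(tree) + ")" for tree in parse_items(text))


inline = "(;B[aa];W[bb](;B[cc];W[dd]);B[ee];W[ff])"
print(convert(inline))
```

Input that is already in FF[4] form has nothing after its groups, so it comes out unchanged.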

I find that Rui Jiang made the same proposal:

... Or a structural reform of SGF, to allow nodes in the middle like:

;B[..]
;W[..]
(;B[];W[..] ... )
;B[..]
;W[..]
(;B[];W[..] ... )

which is more natual for game comment style SGF files. Go Assistant used to do this. But this will break almost all other applications. ...

Empty variations

UGF

Properties

The RE property

Of course this covers almost all cases. However, the Japanese rules describe situations where both players are deemed to have lost, and there are real examples of this happening:

There are further examples of results that cannot be given in the simple FF[4] scheme. For example, the famous 1928-10-10 game between Takahashi Shigeyuki and Segoe Kensaku resulted in "White wins but black does not lose". Since it is highly desirable to keep the formalized RE[] field whenever possible, but real life does not quite fit into any simple scheme, one could envisage a result RE[Special], with extended (simpletext) explanation in an REX[] field.

As a further comment: people interested in the rules of Go want to see the games with Void result as a consequence of the rules (say, due to multiple ko or some other unusual situation). That is a situation quite different from that of a game that was never finished. One needs the possible result "Unfinished" distinct from "Void". Many sites already use a label "Unfinished" (or "U" or "UF") or "Left unfinished".

Different from "Unfinished" is "Playing", for a game that is being played right now.

Experience shows that RE[0] is not sufficiently robust. Often a single 0 is interpreted as "none" or "no information present", and such properties are deleted by some programs. Therefore, it is better to avoid 0 and write Jigo instead.

Usually the real numbers in the score are integers or half-integers. But the Ing rules assign a fractional score to sekis, and results like "B wins by 1 5/6" occur. These do not really fit in FF[4].
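A tolerant reader for RE values along these lines might look as follows. This is a sketch under assumptions: the standard FF[4] forms (0, Draw, Void, ?, B+score, W+R and so on) are accepted together with the values discussed above (Jigo, Special, Unfinished, Playing, and mixed fractions such as "1 5/6"), none of which are part of FF[4]. The function names and the shape of the returned tuple are invented for the example.

```python
from fractions import Fraction

WORDS = {
    "0": "draw", "draw": "draw", "jigo": "draw",
    "void": "void", "?": "unknown",
    # Proposed values, not part of FF[4]:
    "special": "special", "unfinished": "unfinished", "playing": "playing",
}
REASONS = {
    "r": "resign", "resign": "resign",
    "t": "time", "time": "time",
    "f": "forfeit", "forfeit": "forfeit",
}


def parse_margin(text):
    """'2.5' -> 5/2, '1 5/6' -> 11/6 (whole part plus fraction)."""
    return sum(Fraction(part) for part in text.split())


def parse_result(value):
    """Return (kind, winner, detail) for an RE value.

    detail is a reason string, an exact Fraction margin, or None.
    """
    v = value.strip()
    if v.lower() in WORDS:
        return (WORDS[v.lower()], None, None)
    if len(v) >= 2 and v[0] in "BW" and v[1] == "+":
        rest = v[2:].strip()
        if rest == "":
            return ("win", v[0], None)
        if rest.lower() in REASONS:
            return ("win", v[0], REASONS[rest.lower()])
        return ("win", v[0], parse_margin(rest))
    raise ValueError("unrecognized RE value: %r" % value)


print(parse_result("B+1 5/6"))
print(parse_result("Jigo"))
```

Using Fraction rather than float keeps a margin like 1 5/6 exact.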

The DT property

For older games one often has a Japanese date using the lunar calendar. The FF[4] requirement to give dates in ISO standard form has led people to write dates like "the 1st day of the 9th intercalary month 1843" as 1843-09-01, with the result that today such old dates are usually given incorrectly. This type of date-like information can also be stored in a DTX[] field.

Where old game records perhaps only have an approximate year, and more recent games a day, games played on a server often have a starting date and time. Probably the grammar of DT should be generalized to allow 2004-06-14 20:50:06 and something like 2013-08-15 23:54..2013-08-16 00:09. Again an extended field can give nonformalized time periods, such as "From the hour of the Dragon to the lower hour of the Monkey" and "From the hour of the Snake to the hour of the Sheep".
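Reading the two generalized forms suggested here is straightforward. The sketch below assumes exactly the syntax of the two examples (an optional time after the date, and ".." between the start and end of a period); it does not attempt the other forms that FF[4] allows for DT, and the function names are hypothetical.

```python
from datetime import datetime

# Tried in order; the first format that fits wins.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def parse_moment(text):
    """Parse one date, with or without a time of day."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError("unrecognized date/time: %r" % text)


def parse_dt(value):
    """Return (start, end); end is None when value is a single moment."""
    if ".." in value:
        first, second = value.split("..", 1)
        return parse_moment(first), parse_moment(second)
    return parse_moment(value), None


start, end = parse_dt("2013-08-15 23:54..2013-08-16 00:09")
print(end - start)
```

Nonformalized periods such as "From the hour of the Dragon to the lower hour of the Monkey" raise ValueError here, which is the cue to keep them in the extended field instead.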

The TM property

Concerning the meaning: TM gives the time allotted to each player. One often encounters TM[60m each], written to stress that 60m is not the total time but the time per player.
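In FF[4] the TM value is a real number of seconds, so a value like "60m each" needs a tolerant reader. A minimal sketch, assuming only the unit letters s, m and h and an optional trailing "each"; both the function name and the accepted spellings are assumptions of this example, not something any standard prescribes.

```python
import re

UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}
TM_PATTERN = re.compile(
    r"\s*(\d+(?:\.\d+)?)\s*([smh]?)(?:\s+each)?\s*", re.IGNORECASE
)


def parse_tm(value):
    """Return the time per player in seconds, or None if not understood."""
    m = TM_PATTERN.fullmatch(value)
    if m is None:
        return None
    return float(m.group(1)) * UNIT_SECONDS[m.group(2).lower()]


print(parse_tm("60m each"))
print(parse_tm("3600"))
```

Anything the pattern does not cover is reported as None rather than guessed at.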

The PB property

A possibility is to make PB[] and BR[] into &-separated lists of the same length:
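Reading such lists could look like the sketch below. The names and ranks are invented placeholders, and the function name split_team is hypothetical; the only thing taken from the proposal is that the two lists are separated by '&' and must have the same length.

```python
def split_team(pb, br):
    """Pair the names in a PB value with the ranks in a BR value."""
    names = [name.strip() for name in pb.split("&")]
    ranks = [rank.strip() for rank in br.split("&")]
    if len(names) != len(ranks):
        raise ValueError("PB and BR lists differ in length")
    return list(zip(names, ranks))


# Placeholder data, not taken from a real game record.
print(split_team("Player A & Player B", "9p & 7p"))
```

The same pairing would apply to PW[] and WR[].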

Sometimes the players are known, but it is unknown who had black. In such a case it would be more natural to have P1 and P2 for player 1 and player 2. In particular this happens when the game was not actually played. If it is known who won, this must be noted down without use of a color letter.

SGF

Character set