SGF

SGF is the standard format for go (igo, weiqi, baduk) game records, and is also used for several other games. The official specification is due to Arno Hollosi. Roughly speaking, it is well-defined and easy to parse, but there are a few flaws.

Character set
Grammar
Colons
Goal
Variations
Properties
SGFC

Character set

An SGF file does not have a character set, but the text fields inside may have one. The SGF file itself is a byte sequence that is interpreted as ASCII, except inside a text field, where it is interpreted as text in the character set indicated in the CA[] property, or, failing that, as a string in ISO-8859-1. The text field ends at the closing ']'. If there are occurrences of '\' or ']' (or perhaps ':') in the text field, they must be escaped by prefixing them with '\'.

For a byte-oriented parser, this escape mechanism is OK. But for a reader who understands the character set used, the escape mechanism produces 'mojibake'. For example, the text

(;CA[big5]AP[MultiGo:4.4.3]SZ[19]C[第64屆日本本吒]坊戰
七番棋挑戰賽第5局

黑:高尾紳路 九段  貼6目半
白:羽根直樹 本吒]坊

黑中押勝]
...
was supposed to say
(;CA[big5]AP[MultiGo:4.4.3]SZ[19]C[第64屆日本本因坊戰
七番棋挑戰賽第5局

黑:高尾紳路 九段  貼6目半
白:羽根直樹 本因坊

黑中押勝]
...
but the word 本因坊 (Honinbo) is spelled in big5 with character values (hex, big-endian) a5bb a65d a77b, and since '\' and ']' are 5c and 5d (hex) and are escaped, this a5bb a65d a77b is turned into a5bb a65c 5d a77b, explaining the occurrence of a strange character followed by ']' twice in the middle of this text.
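The effect is easy to reproduce. The following is a minimal Python sketch, assuming Python's built-in big5 codec; the helper name escape_bytes is made up for this illustration and is not part of any SGF library.

```python
def escape_bytes(raw: bytes) -> bytes:
    """Byte-level SGF escaping: put 0x5c before every 0x5c (backslash) and 0x5d (']')."""
    out = bytearray()
    for b in raw:
        if b in (0x5C, 0x5D):
            out.append(0x5C)
        out.append(b)
    return bytes(out)


honinbo = "本因坊".encode("big5")       # the three double-byte characters
escaped = escape_bytes(honinbo)         # what a byte-oriented writer stores
mojibake = escaped.decode("big5")       # what a big5-aware reader displays

print(honinbo.hex(), "->", escaped.hex())
print(mojibake)
```

The second byte of 因 is 5d, so the escape byte 5c is inserted in the middle of the character, and a reader that decodes big5 before unescaping sees a different character followed by a stray ']'.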

For local use (say, among the users of GB18030) it is much more convenient not to do this escaping and unescaping, so that the SGF file is a real text file that can be handled by general utilities for text files and can be displayed without conversion. This is what one commonly sees: for example, 金昞俊 (Kim Pyonjun) spelled bdf0 955c bfa1 with an unescaped '\', and 朴埈奭 (Park Junseok) spelled c6d3 88ad 8a5d with an unescaped ']'.

The grammar is not explicit about whether the escape mechanism is supposed to be used at the character level or at the byte level, but since parsers cannot be expected to know about all character sets in the world, one might assume that the escape mechanism is supposed to act at the byte level. However, people who follow this interpretation run into problems.

One consequence is that when a website is moved to another system, and all files are converted to UTF-8, the SGF files suddenly become unparseable. The first problem is that converting 'CA[big5]' to UTF-8 does not change this string, so that the converted file has an incorrect character set indication. The second problem is that a65c converts to U+5412 (e5 90 92), which no longer contains a '\' at the byte level. The following ']' is then no longer escaped, and the SGF file can no longer be interpreted.
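The second problem can be made concrete with a small Python sketch. It assumes Python's built-in big5 codec; value_end is a hypothetical helper that does the byte-level scan for the closing ']' that a byte-oriented parser would do.

```python
def value_end(data: bytes, start: int) -> int:
    """Index of the first unescaped ']' at or after start (byte-level scan), or -1."""
    i = start
    while i < len(data):
        if data[i] == 0x5C:        # backslash: the next byte is escaped, skip it
            i += 2
        elif data[i] == 0x5D:      # unescaped ']' closes the value
            return i
        else:
            i += 1
    return -1


# C[...] holding byte-escaped big5 text, as in the example above.
big5_file = b"C[" + bytes.fromhex("a5bba65c5da77b") + b"]"
# What a general-purpose converter produces when the site moves to UTF-8.
utf8_file = big5_file.decode("big5").encode("utf-8")

end_big5 = value_end(big5_file, 2)
end_utf8 = value_end(utf8_file, 2)
print(big5_file[2:end_big5].hex())               # the complete escaped value
print(utf8_file[2:end_utf8].decode("utf-8"))     # value now ends too early
```

Before the conversion the scan skips the escaped 5d and finds the real end of the value. After the conversion there is no 5c byte left, so the scan stops at the ']' that used to be half of a character.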

In short: SGF with character-level escaping requires knowledge of the character set. SGF with byte-level escaping must be regarded as a binary format. It does not have an associated character set, and cannot be converted to a different character set by a general-purpose converter.

Of course such conversions are done anyway, and one obtains files with broken escapes, where unescaping would destroy the text.
For example, I found a file where PW[余正麒] (Yu Zhengqi) was given as PW[予]正麒], with a missing escape. What happened, of course, was that the SJIS character 余, 975d, contains the byte 5d = ']', and therefore was written 97 5c 5d with 5c = '\'. Upon conversion to UTF-8 the 975c yields 予, and the ']' stays.
As another example, I found a file 21mrp30.sgf with hard-to-understand EV[] information. It said (in hex)
8b 79 32 31 8f 8c 9a 5f 8a 66 89 4c 9b ce 8d 9a 92 9b 94 c9 9b 5c 9a a2 99 40 94 fa 8b 79 99 b9 91 d4
and the 5c here is not a '\' but the second half of 9b5c. But what character set could it be? Some puzzling shows that it was obtained by applying an EUC-JP -> SJIS conversion to a file that was not in EUC-JP, but in GB2312. Converting back shows that the text was EV[第21届永城杯中国名人战预选赛第一轮] (21st Mingren preliminary tournament).
See also this example, where an incorrect conversion was applied four times.
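The recovery of the EV[] text above can be replayed in Python. This is a sketch under the assumption that Python's shift_jis, euc_jp and gb2312 codecs agree with the converters that were originally involved; for the characters in this string they appear to.

```python
ev_hex = ("8b 79 32 31 8f 8c 9a 5f 8a 66 89 4c 9b ce 8d 9a 92 9b 94 c9 "
          "9b 5c 9a a2 99 40 94 fa 8b 79 99 b9 91 d4")
garbled = bytes.fromhex(ev_hex)        # fromhex ignores the spaces

# Undo the wrong EUC-JP -> SJIS step: read the bytes as SJIS, write them as EUC-JP.
euc_bytes = garbled.decode("shift_jis").encode("euc_jp")

# The bytes were never EUC-JP in the first place; interpret them as GB2312.
recovered = euc_bytes.decode("gb2312")
print(recovered)
```

Note that the 5c in 9b 5c is consumed here as the second byte of a double-byte SJIS character, so no unescaping must be attempted before the conversion is undone.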

Today it would probably be best to have UTF-8 as the default character set.

Grammar

The grammar of an SGF file is:
  Collection = GameTree+
  GameTree   = '(' Sequence GameTree* ')'
  Sequence   = Node+
  Node       = ';' Property*
  Property   = PropIdent PropValue+
  PropIdent  = UcLetter+
  PropValue  = '[' CValueType ']'
  CValueType = (ValueType | Compose)
  ValueType  = (None | Number | Real | Double | Color | SimpleText | Text | Point  | Move | Stone)
  UcLetter   = 'A'..'Z'
  Compose    = ValueType ':' ValueType
where '*' denotes zero or more copies, and '+' denotes one or more copies. We see that each Collection and each GameTree starts with a '(', that each Sequence and each Node starts with a ';', and that each Property and each PropIdent starts with an upper-case letter (in the range 'A'..'Z'). We can parse an SGF file provided we can find the end of a PropValue.
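Because every construct announces itself by its first character, a recursive-descent parser is short. The following Python sketch works at the character level on an already decoded string, keeps property values raw (still escaped), and does no error recovery; the function name parse_collection and the nested-tuple result format are choices made for this illustration only.

```python
def parse_collection(s: str):
    """Parse SGF text. A GameTree becomes (nodes, subtrees); a node is a list
    of (ident, [raw values]) pairs. Truncated input raises IndexError."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(s) and s[pos].isspace():
            pos += 1

    def parse_value():
        nonlocal pos
        pos += 1                                  # skip '['
        start = pos
        while s[pos] != ']':
            pos += 2 if s[pos] == '\\' else 1     # an escaped char cannot close the value
        pos += 1                                  # skip ']'
        return s[start:pos - 1]

    def parse_node():
        nonlocal pos
        pos += 1                                  # skip ';'
        props = []
        skip_ws()
        while pos < len(s) and 'A' <= s[pos] <= 'Z':
            start = pos
            while pos < len(s) and 'A' <= s[pos] <= 'Z':
                pos += 1
            ident, values = s[start:pos], []
            skip_ws()
            while pos < len(s) and s[pos] == '[':
                values.append(parse_value())
                skip_ws()
            props.append((ident, values))
        return props

    def parse_tree():
        nonlocal pos
        pos += 1                                  # skip '('
        nodes, subtrees = [], []
        skip_ws()
        while s[pos] == ';':
            nodes.append(parse_node())
            skip_ws()
        while s[pos] == '(':
            subtrees.append(parse_tree())
            skip_ws()
        if s[pos] != ')':
            raise ValueError(f"unexpected {s[pos]!r} at offset {pos}")
        pos += 1
        return nodes, subtrees

    trees = []
    skip_ws()
    while pos < len(s) and s[pos] == '(':
        trees.append(parse_tree())
        skip_ws()
    return trees


nodes, subtrees = parse_collection(r"(;GM[1]SZ[19]C[a\]b];B[pd](;W[dd])(;W[dp]))")[0]
print(nodes)
print(subtrees)
```

The only place where the parser has to look inside a value is parse_value, which is exactly where the questions about escaping arise.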

Here the problems start. What is inside a PropValue? A ValueType, or two ValueTypes separated by ':'. And what is a ValueType? Nobody knows. There are several possibilities, and some of these are listed as 'game-specific'.

Among the possibilities for a ValueType, several seem subcases of others, so that the grammar is ambiguous. A Double looks like a special case of a Number, and a Number looks like a special case of a SimpleText, and a SimpleText looks like a special case of a Text. This is of some importance, since the escaping needed is described in terms of categories that cannot be distinguished.

The standard should have said that all occurrences of ']' and '\' inside a CValueType must be escaped by inserting a '\'. It half-heartedly makes some suggestions that this perhaps is what is meant.
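Under that uniform rule, escaping and unescaping need no knowledge of the property type. A minimal Python sketch at the character level follows; the function names are made up for this illustration, and FF[4]'s soft line breaks (a '\' directly before a newline) are deliberately ignored.

```python
def escape_value(text: str) -> str:
    """Escape every backslash and every ']' by prefixing a backslash."""
    return text.replace("\\", "\\\\").replace("]", "\\]")


def unescape_value(raw: str) -> str:
    """Inverse of escape_value: a backslash makes the next character literal."""
    out, i = [], 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            i += 1                 # drop the backslash, keep what follows
        out.append(raw[i])
        i += 1
    return "".join(out)


original = r"a]b\c"
stored = escape_value(original)
print(stored)
print(unescape_value(stored) == original)
```

The backslash must be replaced first in escape_value; otherwise the backslashes inserted in front of ']' would themselves be doubled.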

Colons

Should a ':' be escaped? The easy answer is "No, never". The standard says that ':' is to be escaped only when "used in compose data type". No doubt what is meant is that escaping is needed when a ':' could be mistaken for the ':' that is the separator in a compose data type. Nobody knows which private properties use a compose data type, and hence whether escaping ':' is needed for them. As we saw above, converting a file to a different character set can introduce bytes that were not present before and that might need escaping, so a utility that handles SGF files may need to understand the structure of a CValueType even when it does not know about the Property it belongs to. In the properties where FF4 prescribes or allows a compose data type (namely, AP, AR, FG, LB, LN, SZ), the colon, if any, is the first in the value, and so can be recognized unambiguously, except perhaps in the case of AP, where the structure is AP[programname:version] and the programname might contain a colon. A general grammar that allows parsing without detailed information about possibly private properties has to make sure that no prescribed ':' is preceded by a (simple)text field, so that escaping is never necessary. Thus, the description of AP should be tightened.

Some generality was missed by not writing

  Compose    = ValueType ':' CValueType
so that also constructs like x:y:z could occur.
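Splitting at the first unescaped colon is all a generic reader needs for either form of the Compose rule. A small Python sketch (split_compose is a hypothetical helper operating on the raw, still escaped value):

```python
def split_compose(raw: str):
    """Split a raw value at its first unescaped ':'.
    Returns (left, right), or (raw, None) when there is no such colon."""
    i = 0
    while i < len(raw):
        if raw[i] == "\\":
            i += 2                 # skip the escaped character
        elif raw[i] == ":":
            return raw[:i], raw[i + 1:]
        else:
            i += 1
    return raw, None


print(split_compose("MultiGo:4.4.3"))
print(split_compose("x:y:z"))
```

With the generalized rule, the right-hand part "y:z" of the second call is simply fed to split_compose again.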

Goal

The goal of an SGF specification can only be a description of the form and meaning of an SGF file. It is inappropriate to specify what a reader of the file must do. Sentences like
  "Whitespaces other than space must be converted to space, i.e. there's no newline!"
are mysterious. "Must be converted"? By whom? When? The author of the specification thinks he is addressing the author of a viewer. But SGF files are also handled by programs other than viewers, and what a viewer does is up to the author of the viewer.

Variations

The text below describes a game between Kato Masao and Cho Chikun.
(;
GM[1]
PB[Kato Masao]
PW[Cho Chikun]
SZ[19]
;B[pd];W[dd];B[pq];W[cp];B[eq];W[hp];B[eo];W[dn];B[cq];W[bq]
;B[cr];W[dp];B[ep];W[ck];B[jp];W[hn];B[fm];W[hl];B[do];W[co]
;B[dm];W[cm];B[en];W[cn];B[dk];W[dj];B[ek];W[po];B[jn];W[oq]
;B[or];W[op];B[nr];W[qq];B[pr];W[qp];B[mp];W[qk];B[im];W[hm]
;B[cj];W[bk];B[ik];W[jq];B[kp];W[hk];B[ij];W[hj];B[hi];W[fj]
;B[ej];W[gi];B[ii];W[il];B[jl];W[in];B[jm];W[fd];B[qi];W[ok]
;B[ic];W[qf](;
N[Diagram 1]
;B[pf];W[pg];B[qg];W[pe];B[of];W[qd];B[oe];W[qe];B[qc];W[rg]
;B[qh];W[pc];B[od];W[qb];B[rc];W[rb];B[pb];W[oc];B[nc];W[ob]
;B[nb];W[pa];B[oi])
;B[pm];W[om];B[pf];W[pg];B[qg];W[pe];B[of];W[qd]
;B[oe];W[qe];B[qc];W[rg];B[qh];W[pc];B[od];W[qb];B[rc];W[rb]
;B[pl];W[pk];B[rl];W[rk];B[ol];W[ql];B[qm];W[nl];B[on];W[nm]
;B[pn];W[nn];B[oo];W[no];B[qo];W[pp];B[rn];W[sl];B[sm];W[rm]
;B[oc];W[pb];B[rl];W[sk];B[rp];W[rm];B[rd];W[se];B[rl];W[rq]
;B[rm];W[sp];B[so];W[ro];B[re];W[rf];B[rp];W[oh];B[sq];W[ni]
;B[np];W[gg];B[ie];W[hd];B[id];W[ng];B[eg];W[gq];B[iq];W[ip]
;B[jr];W[dq];B[dr];W[fr];B[er];W[cg];B[ch];W[dg];B[eh];W[ir]
;B[fi];W[kq];B[kr];W[lq];B[lr];W[fh];B[ei];W[gf];B[fk];W[gj]
;B[bi];W[mc];B[md];W[nb];B[lc];W[lb];B[nc];W[mb];B[ld];W[ce]
;B[br];W[ap];B[fb];W[db];B[ob];W[oa];B[kb];W[mq];B[nq];W[gb]
;B[hb];W[gc];B[na];W[pa];B[mr];W[hq];B[la];W[ha];B[ia];W[ga]
;B[bg];W[bf];B[mh];W[nh];B[mi];W[mj];B[lj];W[mk];B[mg];W[nf]
;B[he];W[ge];B[hh];W[gh];B[mf];W[lk];B[ne];W[og];B[hc];W[ef]
;B[kk];W[kj];B[li];W[kl];B[jj];W[km];B[lp];W[iq];B[gd];W[fc]
;B[bj];W[dl];B[el];W[cl];B[ff];W[fg];B[dh];W[fe];B[ar];W[ak]
;B[ln];W[lm];B[mn];W[mm])
Halfway through there is a variation with a diagram.

We see here that a parenthetical variation precedes the main line of the game: the first move inside the variation and the first move after the variation are siblings.

According to the SGF FF[4] standard the main line and the variation must both be GameTrees, so we need two more parentheses. Worse, the main line must come first, so we must move the parenthetical part to the end. This means that variations occurring near move 200 are stored textually earlier in the file than variations occurring near move 60. It also means that comments about a move and about its sibling are stored far from each other. This is unfortunate, a mistake in my opinion, although one might maintain that the file is not for human consumption and that the internal format is irrelevant.

However, it is easy to allow the above as an extension, letting the formerly ungrammatical

Sequence-A (GameTree-B)+ Sequence-C (GameTree-D)*
stand for the current
Sequence-A '(' Sequence-C (GameTree-D)* ')' (GameTree-B)+

This means that the sequence of moves A B C, where for each of the moves there are 1-move deep variations A', A'', etc., which is presently written

(;rootnode(;A(;B(;C)(;C')(;C''))(;B')(;B''))(;A')(;A''))
can also be written
(;rootnode(;A')(;A'');A(;B')(;B'');B(;C')(;C'');C)
which is much easier for a human to read (because a short variation is textually where it belongs), and equivalent for a machine. Both systems can be mixed arbitrarily. This locality also means that combining the parts of a game that was published in two installments is much easier: one does not have to move the variations discussed in the first installment to follow the end of the actual game.
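The rewriting rule is mechanical. The following Python sketch converts the extended notation into standard FF[4] order and leaves standard input unchanged; it treats node text as opaque, only takes care that parentheses inside property values are not structural, and all names in it are made up for this illustration.

```python
def tokenize(s):
    """Split into '(' and ')' tokens and chunks of node text."""
    tokens, chunk, in_value, i = [], [], False, 0
    while i < len(s):
        c = s[i]
        if in_value:
            chunk.append(c)
            if c == "\\" and i + 1 < len(s):
                i += 1
                chunk.append(s[i])        # escaped char, never closes the value
            elif c == "]":
                in_value = False
        elif c in "()":
            if "".join(chunk).strip():
                tokens.append("".join(chunk))
            chunk = []
            tokens.append(c)
        else:
            chunk.append(c)
            in_value = (c == "[")
        i += 1
    return tokens


def parse_tree(tokens, pos):
    """tokens[pos] is '('. Return (items, next_pos); an item is a str
    (a Sequence) or a list (a nested GameTree)."""
    pos += 1
    items = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            sub, pos = parse_tree(tokens, pos)
            items.append(sub)
        else:
            items.append(tokens[pos])
            pos += 1
    return items, pos + 1


def render(items):
    """Sequence-A (B)+ Sequence-C (D)*  ->  Sequence-A (Sequence-C (D)*) (B)+"""
    i, out, trees = 0, [], []
    while i < len(items) and isinstance(items[i], str):
        out.append(items[i])
        i += 1
    while i < len(items) and isinstance(items[i], list):
        trees.append(items[i])
        i += 1
    if i < len(items):                    # a Sequence follows: it is the main line
        out.append("(" + render(items[i:]) + ")")
    out.extend("(" + render(t) + ")" for t in trees)
    return "".join(out)


def to_ff4(s):
    tokens, pos, out = tokenize(s), 0, []
    while pos < len(tokens):
        items, pos = parse_tree(tokens, pos)
        out.append("(" + render(items) + ")")
    return "".join(out)


print(to_ff4("(;rootnode(;A')(;A'');A(;B')(;B'');B(;C')(;C'');C)"))
```

Applied to the second of the two notations above, this reproduces the first one.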

I find that Jan van der Steen made the same proposal:
  • Variations
    From: Jan van der Steen
    There is a lot of confusion about where the exact location of a variation should be, same level or after a move. I've entered a lot of games using mgt and for humans it's much easier and natural to enter the variation when commenting on the move you want to give an alternative for. So I *always* create the diagram after the move has been played and dully remove the originally played move. I understand the problems how to interpret the resulting tree but that's inherent to the format not to the user using the interface. Maybe we should reconsider the meaning of the braces.
            (
            C[Game start]
            ;B[point]
            ;W[point]
                (
                C[Subgame based on initial position with two stones]
                ;B[point]
                # ...
                )
            C[game continues]
            ;B[point]
            # ...
            )
    
    So instead of splitting the main branche into two sub-branches (which one is the game, and which is the variation?) we let the main branche proceed untouched and just create a side branch (subgame). No confusion possible, right?

    Comment MM: Oh no! RTFM

    But Martin Müller is mistaken and Jan van der Steen was right. The convention chosen by SGF is unfortunate. This old choice cannot be undone, but it is easy to accept both formats.

    I find that Rui Jiang made the same proposal:
    ... Or a structural reform of SGF, to allow nodes in the middle like:

    ;B[..]
    ;W[..]
    (;B[];W[..] ... )
    ;B[..]
    ;W[..]
    (;B[];W[..] ... )

    which is more natual for game comment style SGF files. Go Assistant used to do this. But this will break almost all other applications. ...

    Empty variations

    One often encounters empty variations. For example, if one wants to discuss what could have happened after the final move of a game, one desires a structure like
    (Actual_game (Possible_sequel))
    
    In FF[4] there is no good notation for this situation. The best one can do would be to write
    (Actual_game (;)(Possible_sequel))
    
    The unfortunate fact that the grammar requires an additional semicolon forces the introduction of an additional nonsense node.

    UGF

    In the UGF format one has .Fig subsections that work a bit like SGF's variations. However, each .Fig subsection is self-contained: it starts from an empty board, describes the starting position being commented on, and gives a sequence of moves and a comment. There is no natural way to represent such figures+comment in SGF. (One would need something like save-board, clear-board, setup-position, give-variation, comment, restore-board. The closest analog in SGF would use (VW[..]AE[...]AB[...]AW[...];B[..];W[..]...C[...]).) This greater generality allows one to say "If Black plays tenuki, then ...", or "if the stone at A had been at B, then ..." or "in a different game, the following happened". However, the actual implementation of this feature in UGF is very wasteful and inefficient.

    Properties

    The markup of information using properties is useful because it allows for automated retrieval of such information. One can write arbitrary information in a comment, but that may not be found by a database search. Below it is argued that on the one hand the RE[] property lacks a common category, "Unfinished", while on the other hand the RE[] and DT[] formats are a little too rigid and lead to users entering incorrect information. For cases that do not fit the RE[] or DT[] specification one needs an extended format, say REX[] and DTX[].

    The RE property

    The FF[4] standard says:
    Property:       RE
    Propvalue:      simpletext
    Propertytype:   game-info
    Function:       Provides the result of the game. It is MANDATORY to use the following format:
                    "0" (zero) or "Draw" for a draw (jigo),
                    "B+" ["score"] for a black win and
                    "W+" ["score"] for a white win
                    If the score is given it has to be given as a real value,
                    e.g. "B+0.5", "W+64", "B+12.5"
                    Use "B+R" or "B+Resign" and "W+R" or "W+Resign" for a win by resignation.
                    Applications must not write "Black resigns".
                    Use "B+T" or "B+Time" and "W+T" or "W+Time" for a win on time,
                    "B+F" or "B+Forfeit" and "W+F" or "W+Forfeit" for a win by forfeit,
                    "Void" for no result or suspended play and "?" for an unknown result.
    

    Of course this covers almost all cases. However, the Japanese rules describe situations where both players are deemed to have lost, and there are real examples of this happening:

    (;
    EV[Oteai]
    PB[Kitani Minoru]BR[5d]
    PW[Murashima Yoshikatsu]WR[5d]
    DT[1930-11-26~28]
    RE[Both lost]
    GC[Both players decided to take a break and have some sleep.
    That was against the rules, and it was decided that both lost.]
    ;B[qd];W[dd];B[od];W[qp];B[do];W[dq];B[cq];W[qj];B[oq];W[po]
    ...
    ;B[qh];W[ph];B[rg];W[rb];B[rj];W[rk];B[sd];W[rc];B[qk];W[pj]
    ;B[ri];W[rl];B[pg];W[mr])
    
    and
    (;
    EV[16th Tengen]RO[Round 2]
    PB[Haruyama Isamu]BR[9p]
    PW[Hane Yasumasa]WR[9p]
    KM[5.5]
    RE[Both lost]
    DT[1990-04-05]
    PC[Nihon Ki-in]
    GC[Both players lost this game. W played 242 where he had earlier
    played 46, but that stone had accidentally been moved
    and neither player had noticed.]
    ;B[qd];W[dd];B[pq];W[cp];B[oc];W[po];B[qo];W[pp];B[qp];W[oq]
    ;B[qq];W[pn];B[qm];W[ol];B[ql];W[np];B[fq];W[ep];B[dr];W[hp]
    ;B[fp];W[fo];B[go];W[gp];B[eo];W[fn];B[dp];W[do];B[eq];W[en]
    ;B[co];W[ep];B[bp];W[dq];B[cq];W[dp];B[dn];W[eo];B[br];W[cn]
    ;B[bo];W[dm];B[fc];W[df];B[id];W[ig];B[kd];W[fd];B[gd];W[ge]
    ...
    ;B[nm];W[ig])
    
    Strictly speaking, such games cannot be recorded using FF[4].
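The mandatory format quoted above is small enough to write down as a regular expression. The Python sketch below is one reading of that text (in particular, it takes "real value" to mean unsigned digits with an optional decimal part); RE_FF4 and is_ff4_result are names made up for this illustration.

```python
import re

# "0" | "Draw" | "Void" | "?" | ("B+" | "W+") [score | R | Resign | T | Time | F | Forfeit]
RE_FF4 = re.compile(r"(0|Draw|Void|\?|[BW]\+(R|Resign|T|Time|F|Forfeit|\d+(\.\d+)?)?)")


def is_ff4_result(value: str) -> bool:
    """True if value fits the RE[] format prescribed by FF[4], as read above."""
    return RE_FF4.fullmatch(value) is not None


for value in ["B+0.5", "W+Resign", "Void", "Both lost"]:
    print(value, is_ff4_result(value))
```

A checker built on such a pattern accepts the results from the specification's own examples and rejects RE[Both lost].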

    There are further examples of results that cannot be given in the simple FF[4] scheme. For example, the famous 1928-10-10 game between Takahashi Shigeyuki and Segoe Kensaku resulted in "White wins but black does not lose". Since it is highly desirable to keep the formalized RE[] field whenever possible, but real life does not quite fit into any simple scheme, one could envisage a result RE[Special], with extended (simpletext) explanation in an REX[] field.

    As a further comment: people interested in the rules of Go want to see the games with Void result as a consequence of the rules (say, due to multiple ko or some other unusual situation). That is a situation quite different from that of a game that was never finished. One needs the possible result "Unfinished" distinct from "Void". Many sites already use a label "Unfinished" (or "U" or "UF") or "Left unfinished".

    Different from "Unfinished" is "Playing", for a game that is being played right now.

    Experience shows that RE[0] is not sufficiently robust. Often a single 0 is interpreted as "none" or "no information present", and such properties are deleted by some programs. Therefore, it is better to avoid 0 and write Jigo instead.

    Usually the real numbers in the score are integers or half-integers. But the Ing rules assign a fractional score to sekis, and results like "B wins by 1 5/6" occur. These do not really fit in FF[4].

    The DT property

    The FF[4] standard says:
      Property:       DT
      Propvalue:      simpletext
      Propertytype:   game-info
      Function:       Provides the date when the game was played.
                      It is MANDATORY to use the ISO-standard format for DT.
                      Note: ISO format implies usage of the Gregorian calendar.
                      Syntax:
                      "YYYY-MM-DD" year (4 digits), month (2 digits), day (2 digits)
    
    As with RE[], the formalization is very useful here too, and the above (plus additional text for the case of games that last more than a single day, and for the case that only part of YYYY-MM-DD is known) takes care of most cases. But it happens that one has a Broadcast date, or a Published date (maybe together with the newspaper or magazine name and issue). Such a date is presumably close to the actual game date, but the game date itself may be unknown. For such date-like information one could use a DTX[] field ("extended date") in simpletext.

    For older games one often has a Japanese date using the lunar calendar. The FF[4] requirement to give dates in ISO standard form has led people to write dates like "the 1st day of the 9th intercalary month 1843" as 1843-09-01, and today such old dates are usually given incorrectly. This type of date-like information can also be stored in a DTX[] field.

    Where old game records perhaps only have an approximate year, and more recent games a day, games played on a server often have a starting date and time. Probably the grammar of DT should be generalized to allow 2004-06-14 20:50:06 and something like 2013-08-15 23:54..2013-08-16 00:09. Again an extended field can give nonformalized time periods, such as "From the hour of the Dragon to the lower hour of the Monkey" and "From the hour of the Snake to the hour of the Sheep".
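What such a generalized DT grammar might look like can be sketched as a regular expression. The pattern below is purely hypothetical (it is not part of FF[4] or of any proposal known to me): a year, optionally refined to month, day, and time of day, and optionally followed by '..' and an end stamp of the same form.

```python
import re

# Hypothetical stamp: YYYY[-MM[-DD[ hh:mm[:ss]]]]
STAMP = r"\d{4}(-\d{2}(-\d{2}( \d{2}:\d{2}(:\d{2})?)?)?)?"
DT_EXTENDED = re.compile(rf"{STAMP}(\.\.{STAMP})?")


def is_extended_dt(value: str) -> bool:
    """True if value is a stamp or a 'start..end' pair of stamps."""
    return DT_EXTENDED.fullmatch(value) is not None


for value in ["2004-06-14 20:50:06",
              "2013-08-15 23:54..2013-08-16 00:09",
              "From the hour of the Snake to the hour of the Sheep"]:
    print(is_extended_dt(value), value)
```

Anything that fails such a check, like the traditional hours, would go into the nonformalized extended field.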

    The TM property

    The FF[4] standard says:
      Property:       TM
      Propvalue:      real
      Propertytype:   game-info
      Function:       Provides the time limits of the game.
                      The time limit is given in seconds.
    
    A single value without units turns out not to be robust. Often the time is given in minutes instead of seconds. Also, big numbers are nonintuitive, and a human author of an SGF file does not immediately recognize TM[5400] as a mistake when she really meant TM[15h] (15 hours is 54000 seconds; 5400 seconds is only an hour and a half). It is preferable to use suffixes h, m, s for hours, minutes and seconds and write, e.g., TM[1h30m] instead of TM[5400].

    As for the meaning: what is meant is the time allotted per player. One often encounters TM[60m each] to stress that 60m is not the total time but the time per player.
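A reader that accepts both the FF[4] form and the suffixed form is easy to write. The following Python sketch assumes the suffixes appear in the order h, m, s with integer counts; tm_seconds is a name made up for this illustration.

```python
import re

TM_SUFFIXED = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def tm_seconds(value: str) -> float:
    """Time limit in seconds for TM[5400] (plain FF[4]) or TM[1h30m] (suffixed)."""
    m = TM_SUFFIXED.fullmatch(value)
    if value and m:
        hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
        return float(hours * 3600 + minutes * 60 + seconds)
    return float(value)      # plain real; raises ValueError for anything else


print(tm_seconds("1h30m"))   # 5400.0
print(tm_seconds("15h"))     # 54000.0
```

Values with free text, such as "60m each", are rejected with a ValueError rather than guessed at.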

    The PB property

    Normally one separates PB[] and BR[]. However, in the case of a (multi-player) relay game there is no natural way to split the information using the current FF[4] syntax.
    (;
    PB[Kubomatsu Katsukiyo 5d, Yoshida Misako 4d, Harima Kisaburo 1d and Taniguchi Fusazo]
    PW[Shusai Meijin, Fujita Toyojiro 4d, Nakagawa 1d and Narukami Magoshichi]
    RE[B+R]
    DT[1922-08-12]
    ...
    

    A possibility is to make PB[] and BR[] into &-separated lists of the same length:

    (;
    EV[10th Ricoh Pro Pair Go]
    PB[Inori Yoko & Cho Chikun]
    BR[5p & 25th Honinbo]
    PW[Okada Yumiko & Cho U]
    WR[Strongest Woman Player & Honinbo, Oza]
    DT[2003-12-06]
    ...
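Reading such parallel lists back is straightforward. A small Python sketch, assuming '&' never occurs inside a name or rank (pair_players is a hypothetical helper):

```python
def pair_players(names: str, ranks: str):
    """Pair the entries of an &-separated PB[]/BR[] (or PW[]/WR[]) value."""
    name_list = [n.strip() for n in names.split("&")]
    rank_list = [r.strip() for r in ranks.split("&")]
    if len(name_list) != len(rank_list):
        raise ValueError("name and rank lists differ in length")
    return list(zip(name_list, rank_list))


print(pair_players("Inori Yoko & Cho Chikun", "5p & 25th Honinbo"))
print(pair_players("Okada Yumiko & Cho U", "Strongest Woman Player & Honinbo, Oza"))
```

Using '&' rather than ',' as separator matters here, since a single rank such as "Honinbo, Oza" may itself contain a comma.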
    

    Sometimes the players are known, but it is unknown who had black. In such a case it would be more natural to have P1 and P2 for player 1 and player 2. In particular this happens when the game was not actually played. If it is known who won, this must be noted down without use of a color letter.

    (;
    EV[26th Gosei]
    P1[Kobayashi Satoru 9p]
    P2[Ishii Kunio 9p]
    RE[P2+F]
    
    )
    

    SGFC

    The SGF FF[4] format comes with a syntax checker called sgfc. Very useful. Some people blindly use this program not only as a checker but also as a converter. The results are (of course) syntactically correct, but funny.

    For example, sgfc converts

    RE[B+4.5  (moves beyond 195 not known; 314 played)]
    
    into
    RE[B+4.5195314]
    
    (this is an example from the 48585_Pro_Games archive).