Character set
Grammar
Colons
Goal
Variations
Properties
SGFC
For a byte-oriented parser, this escape mechanism is OK. But for a reader who understands the character set used, the escape mechanism produces 'mojibake'. For example, the text
(;CA[big5]AP[MultiGo:4.4.3]SZ[19]C[第64屆日本本吒]坊戰 七番棋挑戰賽第5局 黑:高尾紳路 九段 貼6目半 白:羽根直樹 本吒]坊 黑中押勝] ...
was supposed to say
(;CA[big5]AP[MultiGo:4.4.3]SZ[19]C[第64屆日本本因坊戰 七番棋挑戰賽第5局 黑:高尾紳路 九段 貼6目半 白:羽根直樹 本因坊 黑中押勝] ...
but the word 本因坊 (Honinbo) is spelled in big5 with character values (hex, big-endian) a5bb a65d a77b, and since '\' and ']' are 5c and 5d (hex) and are escaped, this a5bb a65d a77b is turned into a5bb a65c 5d a77b, explaining the occurrence of a strange character followed by ']' twice in the middle of this text.
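The effect can be reproduced in a few lines. The following is a minimal sketch, assuming Python's built-in big5 codec; the helper name escape_bytes is hypothetical and only illustrates what a byte-oriented SGF writer does.

```python
def escape_bytes(raw: bytes) -> bytes:
    """Byte-level escaping: insert 0x5c before every 0x5c or 0x5d byte."""
    out = bytearray()
    for b in raw:
        if b in (0x5C, 0x5D):   # backslash or closing bracket, seen as bytes
            out.append(0x5C)
        out.append(b)
    return bytes(out)

honinbo = "本因坊".encode("big5")      # three two-byte characters
escaped = escape_bytes(honinbo)        # the trail byte 5d of 因 gets "escaped"
garbled = escaped.decode("big5")       # what a big5-aware reader then sees

print(honinbo.hex(), escaped.hex(), garbled)
```

The escaping function never looks at character boundaries, which is exactly why the second byte of a two-byte character is mistaken for ']'.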
For local use (say, among the users of GB18030) it is much more convenient not to do this escaping and unescaping, so that the SGF file is a real text file that can be handled by general utilities for text files and displayed without conversion. This is what one commonly sees: for example, 金昞俊 (Kim Pyonjun) spelled bdf0 955c bfa1 with an unescaped '\', and 朴埈奭 (Park Junseok) spelled c6d3 88ad 8a5d with an unescaped ']'.
The grammar is not explicit about whether the escape mechanism is supposed to be used at the character level or at the byte level, but since parsers cannot be expected to know about all character sets in the world, one might assume that the escape mechanism is supposed to act at the byte level. However, people who follow this interpretation run into problems.
One consequence is that when a website is moved to another system, and all files are converted to UTF-8, suddenly the SGF files become unparseable. The first problem is that converting 'CA[big5]' to UTF-8 does not change this string, so that the converted file has an incorrect character set indication. The second problem is that a65c converts to U+5412 (e5 90 92) that no longer contains a '\' at byte level, and the following ']' is not escaped any longer, and the SGF file can no longer be interpreted.
In short: SGF with character-level escaping requires knowledge of the character set. SGF with byte-level escaping must be regarded as a binary format. It does not have an associated character set, and cannot be converted to a different character set by a general-purpose converter.
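A small sketch of the conversion problem, again assuming Python's built-in big5 codec: the byte-escaped spelling of 本因坊 is run through a general-purpose big5 to UTF-8 conversion, after which the escape byte has disappeared while the ']' is still there.

```python
# Byte-escaped big5 spelling of 本因坊: a5bb a65c 5d a77b
escaped_big5 = bytes.fromhex("a5bb a65c 5d a77b")

# What a general-purpose converter does: decode as big5, encode as UTF-8.
as_utf8 = escaped_big5.decode("big5").encode("utf-8")

print(as_utf8.hex())
print("backslash byte present:", b"\\" in as_utf8)
print("closing bracket present:", b"]" in as_utf8)
```

A parser reading the converted file sees a bare ']' in the middle of the value and ends the PropValue there.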
Of course such conversions happen anyway, and one obtains files with broken escapes, where unescaping would destroy the text.
For example, I found a file where PW[余正麒] (Yu Zhengqi) was given as PW[予]正麒] with missing escape. What happened of course was that the SJIS 余 character 975d contains the 5d = ']' byte, and therefore was written 97 5c 5d with 5c = '\'. Upon conversion to UTF-8 the 975c yields 予, and the ] stays.
For example, I found a file 21mrp30.sgf with difficult to understand EV[] information. It said (in hex)
8b 79 32 31 8f 8c 9a 5f 8a 66 89 4c 9b ce 8d 9a 92 9b 94 c9 9b 5c 9a a2 99 40 94 fa 8b 79 99 b9 91 d4
and the 5c here is not a '\' but the second half of 9b5c. But what character set could it be? Some puzzling shows that it was obtained by applying an EUC-JP -> SJIS conversion to a file that was not in EUC-JP, but in GB2312. Converting back shows that the text was EV[第21届永城杯中国名人战预选赛第一轮] (21st Mingren preliminary tournament).
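The "converting back" step can be sketched as follows, assuming Python's built-in shift_jis, euc_jp and gb2312 codecs: undo the mistaken EUC-JP -> SJIS conversion by going from SJIS back to EUC-JP, then read the resulting bytes as GB2312.

```python
ev_hex = ("8b 79 32 31 8f 8c 9a 5f 8a 66 89 4c 9b ce 8d 9a 92 9b 94 c9 "
          "9b 5c 9a a2 99 40 94 fa 8b 79 99 b9 91 d4")
damaged = bytes.fromhex(ev_hex)

# Undo the wrong conversion: SJIS -> EUC-JP restores the original bytes ...
original_bytes = damaged.decode("shift_jis").encode("euc_jp")
# ... which were GB2312 all along.
event = original_bytes.decode("gb2312")

print(event)
```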
See also this example, where an incorrect conversion was applied four times.
Today it would probably be best to have UTF-8 as default character set.
Collection = GameTree+
GameTree = '(' Sequence GameTree* ')'
Sequence = Node+
Node = ';' Property*
Property = PropIdent PropValue+
PropIdent = UcLetter+
PropValue = '[' CValueType ']'
CValueType = (ValueType | Compose)
ValueType = (None | Number | Real | Double | Color | SimpleText | Text | Point | Move | Stone)
UcLetter = 'A'..'Z'
Compose = ValueType ':' ValueType
where '*' denotes zero or more copies, and '+' denotes one or more copies. We see that each Collection, and each GameTree, starts with a '('. Each Sequence, and each Node, starts with a ';'. Each Property, and each PropIdent, starts with an upper case letter (in the range 'A'..'Z'). We can parse an SGF file provided we can find the end of a PropValue.
It is here the problems start. What is a PropValue? It is a ValueType, or two ValueTypes separated by ':', and a ValueType is? Nobody knows. There are several possibilities, and some of these are listed as 'game-specific'.
Among the possibilities for a ValueType, several seem subcases of others, so that the grammar is ambiguous. A Double looks like a special case of a Number, and a Number looks like a special case of a SimpleText, and a SimpleText looks like a special case of a Text. This is of some importance, since the escaping needed is described in terms of categories that cannot be distinguished.
The standard should have said that all occurrences of ']' and '\' inside a CValueType must be escaped by inserting a '\'. It half-heartedly makes some suggestions that this perhaps is what is meant.
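A minimal sketch of that rule, working on decoded characters rather than bytes. The function names escape_value and read_value are hypothetical, and the FF[4] soft line break (a backslash followed by a newline) is not treated specially here.

```python
def escape_value(text: str) -> str:
    """Escape every '\\' and ']' by putting a '\\' in front of it."""
    return text.replace("\\", "\\\\").replace("]", "\\]")

def read_value(sgf: str, start: int):
    """Read one PropValue beginning at sgf[start] == '['.

    Returns (unescaped_text, index_just_after_the_closing_bracket).
    """
    if sgf[start] != "[":
        raise ValueError("PropValue must start with '['")
    out = []
    i = start + 1
    while i < len(sgf):
        ch = sgf[i]
        if ch == "\\" and i + 1 < len(sgf):
            out.append(sgf[i + 1])   # keep the escaped character as is
            i += 2
        elif ch == "]":
            return "".join(out), i + 1
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated PropValue")

sample = "C[" + escape_value("a]b\\c") + "]PB[x]"
text, end = read_value(sample, 1)
print(text, sample[end:])
```

With a single escaping rule for all value types, finding the end of a PropValue does not require knowing which ValueType is being read.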
Some generality was missed by not writing
Compose = ValueType ':' CValueType
so that also constructs like x:y:z could occur.
Statements such as "Whitespaces other than space must be converted to space, i.e. there's no newline!" are mysterious. "Must be converted"? By whom? When? The author of the specification thinks he is addressing the author of a viewer. But SGF files are also handled by other programs than viewers, and what a viewer does is up to the author of the viewer.
(; GM[1] PB[Kato Masao] PW[Cho Chikun] SZ[19]
;B[pd];W[dd];B[pq];W[cp];B[eq];W[hp];B[eo];W[dn];B[cq];W[bq]
;B[cr];W[dp];B[ep];W[ck];B[jp];W[hn];B[fm];W[hl];B[do];W[co]
;B[dm];W[cm];B[en];W[cn];B[dk];W[dj];B[ek];W[po];B[jn];W[oq]
;B[or];W[op];B[nr];W[qq];B[pr];W[qp];B[mp];W[qk];B[im];W[hm]
;B[cj];W[bk];B[ik];W[jq];B[kp];W[hk];B[ij];W[hj];B[hi];W[fj]
;B[ej];W[gi];B[ii];W[il];B[jl];W[in];B[jm];W[fd];B[qi];W[ok]
;B[ic];W[qf]
(; N[Diagram 1]
;B[pf];W[pg];B[qg];W[pe];B[of];W[qd];B[oe];W[qe];B[qc];W[rg]
;B[qh];W[pc];B[od];W[qb];B[rc];W[rb];B[pb];W[oc];B[nc];W[ob]
;B[nb];W[pa];B[oi])
;B[pm];W[om];B[pf];W[pg];B[qg];W[pe];B[of];W[qd]
;B[oe];W[qe];B[qc];W[rg];B[qh];W[pc];B[od];W[qb];B[rc];W[rb]
;B[pl];W[pk];B[rl];W[rk];B[ol];W[ql];B[qm];W[nl];B[on];W[nm]
;B[pn];W[nn];B[oo];W[no];B[qo];W[pp];B[rn];W[sl];B[sm];W[rm]
;B[oc];W[pb];B[rl];W[sk];B[rp];W[rm];B[rd];W[se];B[rl];W[rq]
;B[rm];W[sp];B[so];W[ro];B[re];W[rf];B[rp];W[oh];B[sq];W[ni]
;B[np];W[gg];B[ie];W[hd];B[id];W[ng];B[eg];W[gq];B[iq];W[ip]
;B[jr];W[dq];B[dr];W[fr];B[er];W[cg];B[ch];W[dg];B[eh];W[ir]
;B[fi];W[kq];B[kr];W[lq];B[lr];W[fh];B[ei];W[gf];B[fk];W[gj]
;B[bi];W[mc];B[md];W[nb];B[lc];W[lb];B[nc];W[mb];B[ld];W[ce]
;B[br];W[ap];B[fb];W[db];B[ob];W[oa];B[kb];W[mq];B[nq];W[gb]
;B[hb];W[gc];B[na];W[pa];B[mr];W[hq];B[la];W[ha];B[ia];W[ga]
;B[bg];W[bf];B[mh];W[nh];B[mi];W[mj];B[lj];W[mk];B[mg];W[nf]
;B[he];W[ge];B[hh];W[gh];B[mf];W[lk];B[ne];W[og];B[hc];W[ef]
;B[kk];W[kj];B[li];W[kl];B[jj];W[km];B[lp];W[iq];B[gd];W[fc]
;B[bj];W[dl];B[el];W[cl];B[ff];W[fg];B[dh];W[fe];B[ar];W[ak]
;B[ln];W[lm];B[mn];W[mm])
Halfway there is a variation with diagram.
We see here that a parenthetical variation precedes the main line of the game: the first move inside the variation and the first move after the variation are siblings.
According to the SGF FF[4] standard main line and variant must both be GameTrees, so we need two more parentheses. And worse, the main line must come first, so we must move the parenthetical part to the end. This means that variants occurring near move 200 are stored textually earlier in the file than variants occurring near move 60. This also means that comments about a move and a sibling are stored far from each other. This is unfortunate, a mistake in my opinion, although one might maintain that the file is not for human consumption and that the internal format is irrelevant.
However, it is easy to allow the above as extension, letting the formerly ungrammatical
Sequence-A (GameTree-B)+ Sequence-C (GameTree-D)*
stand for the current
Sequence-A '(' Sequence-C (GameTree-D)* ')' (GameTree-B)+
This means that the sequence of moves A B C, where for each of the moves there are 1-move deep variations A', A'', etc., which is presently written
(;rootnode(;A(;B(;C)(;C')(;C''))(;B')(;B''))(;A')(;A''))
can also be written
(;rootnode(;A')(;A'');A(;B')(;B'');B(;C')(;C'');C)
that is much easier to read for a human (because a short variation is textually where it belongs), and equivalent for a machine. Both systems can be mixed arbitrarily. This locality also means that combining the parts of a game that was published in two installments is much easier: one does not have to move the variations discussed in the first installment to follow the end of the actual game.
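The equivalence of the two notations can be checked mechanically. The following is a sketch of a parser for the toy notation used above, in which a node is just ';' followed by a label and there are no bracketed property values; the names parse and _parse_body are hypothetical. Parenthesized subtrees seen before the next node of the same sequence are kept aside and attached as siblings after that node, so the main line always comes first.

```python
import re

def parse(text):
    """Parse a toy SGF game tree into nested (label, [children]) tuples."""
    tokens = re.findall(r"\(|\)|;[^;()]*", text)
    if not tokens or tokens[0] != "(":
        raise ValueError("expected '('")
    tree, _ = _parse_body(tokens, 1)
    return tree

def _parse_body(tokens, pos):
    first = last = None
    pending = []                      # variations waiting for their main line
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            sub, pos = _parse_body(tokens, pos + 1)
            pending.append(sub)
        else:
            node = (tokens[pos][1:].strip(), [])
            pos += 1
            if last is None:
                if pending:
                    raise ValueError("variation before first node")
                first = node
            else:
                last[1].append(node)      # main line first ...
                last[1].extend(pending)   # ... then the earlier variations
                pending = []
            last = node
    if pos >= len(tokens):
        raise ValueError("missing ')'")
    if last is None:
        raise ValueError("empty game tree")
    last[1].extend(pending)           # FF[4] style: variations at the end
    return first, pos + 1

ff4 = "(;rootnode(;A(;B(;C)(;C')(;C''))(;B')(;B''))(;A')(;A''))"
ext = "(;rootnode(;A')(;A'');A(;B')(;B'');B(;C')(;C'');C)"
print(parse(ff4) == parse(ext))
```

Because the same routine handles both orders, files mixing the two styles are accepted as well.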
Variations
From: Jan van der Steen
There is a lot of confusion about where the exact location of a variation should be, same level or after a move. I've entered a lot of games using mgt and for humans it's much easier and natural to enter the variation when commenting on the move you want to give an alternative for. So I *always* create the diagram after the move has been played and dully remove the originally played move. I understand the problems how to interpret the resulting tree but that's inherent to the format not to the user using the interface. Maybe we should reconsider the meaning of the braces.
( C[Game start]
;B[point]
;W[point]
( C[Subgame based on initial position with two stones]
;B[point]
# ...
)
C[game continues]
;B[point]
# ...
)
So instead of splitting the main branche into two sub-branches (which one is the game, and which is the variation?) we let the main branche proceed untouched and just create a side branch (subgame). No confusion possible, right?
Comment MM: Oh no! RTFM
But Martin Müller is mistaken and Jan van der Steen was right. The convention chosen by SGF is unfortunate. This old choice cannot be undone, but it is easy to accept both formats.
... Or a structural reform of SGF, to allow nodes in the middle like:
;B[..]
;W[..]
(;B[];W[..] ... )
;B[..]
;W[..]
(;B[];W[..] ... )
which is more natual for game comment style SGF files. Go Assistant used to do this. But this will break almost all other applications. ...
In FF[4] there is no good notation for this situation, a game record followed by a possible sequel:
(Actual_game (Possible_sequel))
The best one can do would be to write
(Actual_game (;)(Possible_sequel))
The unfortunate fact that the grammar requires an additional semicolon forces the introduction of an additional nonsense node.
Property: RE
Propvalue: simpletext
Propertytype: game-info
Function: Provides the result of the game. It is MANDATORY to use the following format:
"0" (zero) or "Draw" for a draw (jigo),
"B+" ["score"] for a black win and
"W+" ["score"] for a white win
If the score is given it has to be given as a real value, e.g. "B+0.5", "W+64", "B+12.5"
Use "B+R" or "B+Resign" and "W+R" or "W+Resign" for a win by resignation. Applications must not write "Black resigns".
Use "B+T" or "B+Time" and "W+T" or "W+Time" for a win on time,
"B+F" or "B+Forfeit" and "W+F" or "W+Forfeit" for a win by forfeit,
"Void" for no result or suspended play and
"?" for an unknown result.
Of course this covers almost all cases. However, the Japanese rules describe situations where both players are deemed to have lost, and there are real examples of this happening:
(; EV[Oteai] PB[Kitani Minoru]BR[5d] PW[Murashima Yoshikatsu]WR[5d]
DT[1930-11-26~28] RE[Both lost]
GC[Both players decided to take a break and have some sleep. That was against the rules, and it was decided that both lost.]
;B[qd];W[dd];B[od];W[qp];B[do];W[dq];B[cq];W[qj];B[oq];W[po]
...
;B[qh];W[ph];B[rg];W[rb];B[rj];W[rk];B[sd];W[rc];B[qk];W[pj]
;B[ri];W[rl];B[pg];W[mr])
and
(; EV[16th Tengen]RO[Round 2] PB[Haruyama Isamu]BR[9p] PW[Hane Yasumasa]WR[9p]
KM[5.5] RE[Both lost] DT[1990-04-05] PC[Nihon Ki-in]
GC[Both players lost this game. W played 242 where he had earlier played 46, but that stone had accidentally been moved and neither player had noticed.]
;B[qd];W[dd];B[pq];W[cp];B[oc];W[po];B[qo];W[pp];B[qp];W[oq]
;B[qq];W[pn];B[qm];W[ol];B[ql];W[np];B[fq];W[ep];B[dr];W[hp]
;B[fp];W[fo];B[go];W[gp];B[eo];W[fn];B[dp];W[do];B[eq];W[en]
;B[co];W[ep];B[bp];W[dq];B[cq];W[dp];B[dn];W[eo];B[br];W[cn]
;B[bo];W[dm];B[fc];W[df];B[id];W[ig];B[kd];W[fd];B[gd];W[ge]
...
;B[nm];W[ig])
Strictly speaking, such games cannot be recorded using FF[4].
There are further examples of results that cannot be given in the simple FF[4] scheme. For example, the famous 1928-10-10 game between Takahashi Shigeyuki and Segoe Kensaku resulted in "White wins but black does not lose". Since it is highly desirable to keep the formalized RE[] field whenever possible, but real life does not quite fit into any simple scheme, one could envisage a result RE[Special], with extended (simpletext) explanation in an REX[] field.
As a further comment: people interested in the rules of Go want to see the games with Void result as a consequence of the rules (say, due to multiple ko or some other unusual situation). That is a situation quite different from that of a game that was never finished. One needs the possible result "Unfinished" distinct from "Void". Many sites already use a label "Unfinished" (or "U" or "UF") or "Left unfinished".
Different from "Unfinished" is "Playing", for a game that is being played right now.
Experience shows that RE[0] is not sufficiently robust. Often a single 0 is interpreted as "none" or "no information present", and such properties are deleted by some programs. Therefore, it is better to avoid 0 and write Jigo instead.
Usually the real numbers in the score are integers or half-integers. But the Ing rules assign a fractional score to sekis, and results like "B wins by 1 5/6" occur. These do not really fit in FF[4].
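To see which results fall outside the formalized scheme, one can test RE values against the FF[4] format quoted above. This is a sketch only: the regular expression is my reading of that format, and the name is_ff4_result is hypothetical.

```python
import re

# "0", "Draw", "Void", "?", or B+/W+ optionally followed by a score,
# R/Resign, T/Time or F/Forfeit.
FF4_RESULT = re.compile(
    r"0|Draw|Void|\?|[BW]\+(?:R|Resign|T|Time|F|Forfeit|\d+(?:\.\d+)?)?"
)

def is_ff4_result(value: str) -> bool:
    """True if value matches the formalized FF[4] RE format."""
    return FF4_RESULT.fullmatch(value) is not None

for value in ["B+0.5", "W+64", "B+R", "0", "Both lost", "B+1 5/6", "Jigo"]:
    print(repr(value), is_ff4_result(value))
```

A checker of this kind flags exactly the cases discussed here ("Both lost", fractional Ing scores, "Jigo") as not expressible in the strict format.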
Property: DT
Propvalue: simpletext
Propertytype: game-info
Function: Provides the date when the game was played. It is MANDATORY to use the ISO-standard format for DT. Note: ISO format implies usage of the Gregorian calendar.
Syntax: "YYYY-MM-DD" year (4 digits), month (2 digits), day (2 digits)
As above for RE[] also here the formalization is very useful, and the above (plus additional text for the case of games that last more than a single day, and for the case that only part of YYYY-MM-DD is known) takes care of most cases. But it happens that one has a Broadcast date, or a Published date (maybe together with the newspaper or magazine name and issue). Presumably close to the actual game date, but the game date may be unknown. For such date-like information one could use a DTX[] field ("extended date") in simpletext.
For older games one often has a Japanese date using the lunar calendar. The FF[4] requirement to give dates in ISO standard form has led people to write dates like "the 1st day of the 9th intercalary month 1843" as 1843-09-01, and today such old dates are usually given incorrectly. Also this type of date-like information can be stored in a DTX[] field.
Where old game records perhaps only have an approximate year, and more recent games a day, games played on a server often have a starting date and time. Probably the grammar of DT should be generalized to allow 2004-06-14 20:50:06 and something like 2013-08-15 23:54..2013-08-16 00:09. Again an extended field can give nonformalized time periods, such as "From the hour of the Dragon to the lower hour of the Monkey" and "From the hour of the Snake to the hour of the Sheep".
Property: TM
Propvalue: real
Propertytype: game-info
Function: Provides the time limits of the game. The time limit is given in seconds.
A single value without units turns out not to be robust. Often the time is given in minutes instead of seconds. Also, big numbers are nonintuitive, and a human author of an SGF file does not immediately recognize TM[5400] as a mistake, when she really meant TM[15h]. It is preferable to use suffixes h,m,s for hours, minutes and seconds and write, e.g., TM[1h30m] instead of TM[5400].
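A reader that accepts both the plain FF[4] number of seconds and the suggested h,m,s suffixes could look as follows. This is a sketch; the name parse_tm and the exact syntax accepted (suffixes in the order h, m, s, integer amounts) are assumptions, not part of any standard.

```python
import re

_TM_SUFFIXED = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")

def parse_tm(value: str) -> float:
    """Return the time limit in seconds for '5400' or '1h30m' style values."""
    value = value.strip()
    if re.fullmatch(r"\d+(?:\.\d+)?", value):
        return float(value)               # plain FF[4] style: seconds
    m = _TM_SUFFIXED.fullmatch(value)
    if not m or not any(m.groups()):
        raise ValueError(f"unrecognized TM value: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return float(hours * 3600 + minutes * 60 + seconds)

print(parse_tm("1h30m"), parse_tm("5400"), parse_tm("15h"))
```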
Concerning the meaning: what is meant is the time allotted per player. One often encounters TM[60m each] to stress that 60m is not the total time but the time per player.
(; PB[Kubomatsu Katsukiyo 5d, Yoshida Misako 4d, Harima Kisaburo 1d and Taniguchi Fusazo] PW[Shusai Meijin, Fujita Toyojiro 4d, Nakagawa 1d and Narukami Magoshichi] RE[B+R] DT[1922-08-12] ...
A possibility is to make PB[] and BR[] into &-separated lists of the same length:
(; EV[10th Ricoh Pro Pair Go] PB[Inori Yoko & Cho Chikun] BR[5p & 25th Honinbo] PW[Okada Yumiko & Cho U] WR[Strongest Woman Player & Honinbo, Oza] DT[2003-12-06] ...
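Reading such lists back is straightforward. A minimal sketch, assuming '&' never occurs inside a name or rank; the function name split_players is hypothetical.

```python
def split_players(names: str, ranks: str):
    """Pair up &-separated player names and ranks of equal length."""
    players = [p.strip() for p in names.split("&")]
    levels = [r.strip() for r in ranks.split("&")]
    if len(players) != len(levels):
        raise ValueError("name list and rank list differ in length")
    return list(zip(players, levels))

black = split_players("Inori Yoko & Cho Chikun", "5p & 25th Honinbo")
print(black)
```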
Sometimes the players are known, but it is unknown who had black. In such a case it would be more natural to have P1 and P2 for player 1 and player 2. In particular this happens when the game was not actually played. If it is known who won, this must be noted down without use of a color letter.
(; EV[26th Gosei] P1[Kobayashi Satoru 9p] P2[Ishii Kunio 9p] RE[P2+F] )
For example, sgfc converts
RE[B+4.5 (moves beyond 195 not known; 314 played)]
into
RE[B+4.5195314]
(this is an example from the 48585_Pro_Games archive).