SGF mojibake

Sometimes one finds SGF files containing strange text, full of Â and Ã. For example:

(;GM[1]FF[4]SZ[19]CA[UTF-8]
PW[前田陈尔 六段]
PB[吴清源 六段]
WR[1p]
BR[1p]
KM[3.5]
TM[ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚ÂºÃƒÆ’Ã†â€™Ãƒâ€¦Ã‚Â¡: 02:37  ÃƒÆ’Ã¢â‚¬Å¡Ãƒâ€šÃ‚Â°ÃƒÆ’Ã†â€™ÃƒÂ¢Ã¢â€šÂ¬Ã¢â‚¬Â: 04:56]
RE[共129手 黑中盘胜]
DT[1935年(大阪朝日新闻9月17日至10月4日连载)]

;B[pd];W[ed];B[pp];W[fp];B[pj];W[kd];B[do];W[dp];B[cp];W[cq]
;B[bq];W[co];B[bp];W[cn];B[dq];W[ep];B[dr];W[ce];B[en];W[dn]
;B[em];W[eo];B[ej];W[mp];B[cj];W[fn];B[io];W[ek];B[dk];W[el]
;B[dl];W[fm];B[dm];W[gl];B[go];W[fo];B[pn];W[jn];B[fr];W[ch]
;B[dh];W[cg];B[ci];W[fj];B[ei];W[qf];B[qe];W[pf];B[nd];W[ng]
;B[nj];W[lg];B[ph];W[rf];B[rd];W[pq];B[qq];W[oq];B[qr];W[jq]
;B[kp];W[kq];B[rh];W[nb];B[nc];W[qc];B[pc];W[pb];B[qb];W[rb]
;B[rc];W[qa];B[lc];W[mb];B[ld];W[oa];B[ke];W[qi];B[qh];W[nf]
;B[oc];W[ob];B[id];W[pm];B[om];W[ol];B[pl];W[qm];B[ql];W[oi]
;B[pi];W[on];B[nm];W[po];B[qn];W[qo];B[rm];W[qp];B[no];W[np]
;B[dc];W[fc];B[dd];W[de];B[fb];W[gb];B[eb];W[bc];B[bb];W[cc]
;B[cb];W[hd];B[mh];W[le];B[lf];W[me];B[kc];W[kf];B[je];W[he]
;B[jf];W[kg];B[ee];W[ef];B[fe];W[ff];B[fd];W[jg];B[ec])

The PW, PB, RE and DT fields say

PW[Maeda Nobuaki 6d]
PB[Go Seigen 6d]
RE[129 moves, B+R]
DT[1935 (Osaka Asahi Shimbun September 17 to October 4 serial)]

But what is this strange TM field here?

Mojibake

Such text is typically the result of a Latin-1 to UTF-8 conversion applied to text that was UTF-8 already. Here the source character set is not quite Latin-1: the occurrences of ƒ, š and € show that these characters must have had single-byte codes, which they do not have in Latin-1 but do have in Windows-1252. So, here it was a Windows-1252 to UTF-8 conversion. Let us revert the conversion. We find

TM[Ãƒâ€šÃ‚ÂºÃƒÆ’Ã…Â¡: 02:37  Ãƒâ€šÃ‚Â°ÃƒÆ’Ã¢â‚¬â€: 04:56]

This still looks like the result of a Windows-1252 to UTF-8 conversion. Revert once more. We find

TM[Ã‚ÂºÃƒÅ¡: 02:37  Ã‚Â°Ãƒâ€”: 04:56]

This still looks like the result of a Windows-1252 to UTF-8 conversion. Revert once more. We find

TM[ÂºÃš: 02:37  Â°Ã—: 04:56]

Revert once more. We find a text that, read as GB2312, becomes

TM[黑: 02:37  白: 04:56]

So that is how our TM field arose: it was originally in GB2312, but went four times through a Windows-1252 to UTF-8 conversion.

(The 0x9d byte is not assigned in Windows-1252 and was left as U+009d, that is 0xc2 0x9d in UTF-8.)
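The reverting described above can be sketched in Python. This is only an illustration under some assumptions: the function names are made up for this sketch, and the five bytes that Windows-1252 leaves unassigned are assumed to pass through as the control characters with the same number, as described for 0x9d. Python's own cp1252 codec rejects those bytes, so they are handled by hand.

```python
# Bytes that Windows-1252 leaves unassigned. Assumption: the converter
# passed them through as the C1 control characters U+0081, U+008D, ...
UNASSIGNED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def cp1252_to_utf8(data: bytes) -> bytes:
    """One forward conversion: read bytes as Windows-1252, write UTF-8."""
    chars = [
        chr(b) if b in UNASSIGNED else bytes([b]).decode("cp1252")
        for b in data
    ]
    return "".join(chars).encode("utf-8")


def utf8_to_cp1252(data: bytes) -> bytes:
    """Undo one conversion.

    Raises UnicodeError if data is not valid UTF-8 or contains a
    character that has no Windows-1252 code.
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        if ord(ch) in UNASSIGNED:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return bytes(out)


def peel(data: bytes, limit: int = 10) -> tuple[bytes, int]:
    """Undo conversions until one no longer applies.

    Returns the remaining bytes and the number of rounds undone.
    """
    rounds = 0
    while rounds < limit:
        try:
            previous = utf8_to_cp1252(data)
        except UnicodeError:
            break
        if previous == data:  # plain ASCII: nothing left to undo
            break
        data, rounds = previous, rounds + 1
    return data, rounds
```

The bytes returned by peel() can then be tried in the suspected original encoding, for example peel(data)[0].decode("gb2312"). Note that the stopping rule is a heuristic: a text that is correct UTF-8 but uses only characters that exist in Windows-1252 (say, accented Latin letters) would be peeled one round too far, so the result should be checked by eye.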

Escapes

SGF has an unfortunate system of escapes, not well-defined and not consistently used. This causes damage to files. The PB field in a file like

(;
EV[2015利民杯本戦16強戦]
PB[申真ゾ
BR[3p]
...

looks as if the terminating ']' was lost. What happened is that the player is called 申真ソ (Shin Jinseo), which in Shift-JIS (SJIS) is 90 5c 90 5e 83 5c. The code 83 5c for ソ ends in 5c, the code for \. If the SGF escape mechanism is applied at byte level, the bytes 5c 5d are taken as the sequence \], that is, as an escaped literal ]. So 83 5c 5d turns into 83 5d, which is ゾ, and the closing ] is lost.
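The damage can be reproduced in a few lines of Python. This is a sketch under an assumption: the offending program is taken to replace the two bytes \] by ] without decoding the text first. The helper names are hypothetical, and the actual program that damaged the file is not known.

```python
def unescape_bytes(raw: bytes) -> bytes:
    # Byte-level unescaping (assumed behaviour of the faulty program):
    # every 5c 5d pair becomes 5d, even inside a two-byte character.
    return raw.replace(b"\\]", b"]")


def unescape_text(text: str) -> str:
    # Character-level unescaping: only a real backslash counts.
    return text.replace("\\]", "]")


field = "PB[申真ソ]"
raw = field.encode("shift_jis")

print("ソ".encode("shift_jis").hex())  # 835c
print("ゾ".encode("shift_jis").hex())  # 835d

# Unescape first, decode afterwards: the closing bracket is swallowed.
damaged = unescape_bytes(raw).decode("shift_jis")
print(damaged)  # PB[申真ゾ

# Decode first, unescape afterwards: the field survives.
intact = unescape_text(raw.decode("shift_jis"))
print(intact)  # PB[申真ソ]
```

Decoding to characters before looking for escapes avoids the problem, because the 5c byte inside ソ is then never seen as a backslash.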

FF[4]

Usually the FF[4] label is meaningless, and does not denote that the file follows FF[4] conventions.