How to do Unicode in ABC

The bad news is that ABC predates Unicode, and so only uses 8-bit characters. This means that you can't include Unicode characters in ABC programs, nor type Unicode characters in the editor.

The good news is that the UTF-8 encoding of Unicode uses only 8-bit characters, so it is possible to process UTF-8 encoded input and output with an ABC program.

To do this, you need to run the HOW TO called UNICODE (see below) at the start of a program to prepare the necessary data structures, and then you can use the following HOW TOs:

HOW TO RETURN u.1 s: \The first Unicode character of a string

>>> WRITE chinese
智取威虎山

>>> WRITE u.1 chinese
智

It returns the empty string if the parameter is empty, or if there is a Unicode encoding error in the first character of the parameter.

HOW TO RETURN u.items s: \ String s split into its constituent Unicode characters

>>> WRITE u.items chinese
{[1]: "智"; [2]: "取"; [3]: "威"; [4]: "虎"; [5]: "山"}

You can use this to find out how many Unicode characters are in a string:

>>> WRITE #u.items chinese
5

though there is a faster function u.size:

HOW TO RETURN u.size s: \ The number of characters in s

>>> WRITE u.size chinese
5
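As a rough illustration of why a dedicated size function can be cheaper than building the whole table of items, here is a sketch in Python (not ABC, and not the code in unicode.txt). It assumes well-formed UTF-8 and relies on the fact that every character contains exactly one byte that is not a continuation byte (128-191). The names lead_length, items and size are hypothetical stand-ins, and whether u.size really works this way is an assumption.

```python
# Sketch only: mimics u.items and u.size on raw UTF-8 bytes.
# Assumes valid UTF-8; the function names are illustrative, not from unicode.txt.

def lead_length(byte):
    """Number of bytes in the character that starts with this byte (0 if none)."""
    if byte < 0x80:
        return 1              # ASCII
    if 0xC0 <= byte <= 0xDF:
        return 2
    if 0xE0 <= byte <= 0xEF:
        return 3
    if 0xF0 <= byte <= 0xF7:
        return 4
    return 0                  # continuation byte or illegal value

def items(data):
    """Split UTF-8 bytes into a 1-based table {position: character bytes}."""
    table, pos = {}, 0
    while pos < len(data):
        n = lead_length(data[pos])
        if n == 0:
            break             # encoding error: stop splitting
        table[len(table) + 1] = data[pos:pos + n]
        pos += n
    return table

def size(data):
    """Count characters by counting the bytes that are not continuation bytes."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)

chinese = "智取威虎山".encode("utf-8")
print(len(chinese), len(items(chinese)), size(chinese))   # 15 5 5
```

The counting version never builds a table, which is one plausible reason for the speed difference.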

You can find out if a string consists of a single character:

HOW TO REPORT u.single s: \ Whether s consists of a single character

>>> IF u.single u.1 chinese: WRITE u.1 chinese
智

Identifying individual characters:

HOW TO RETURN u.abs c: \ The Unicode code point of c

>>> WRITE u.abs u.1 chinese
26234

HOW TO RETURN u.char i: \ The UTF-8 encoding of the char at codepoint i

>>> WRITE u.char 26234
智
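For comparison outside ABC, Python's built-in ord and chr play roughly the roles of u.abs and u.char. The difference is that chr returns a character rather than its UTF-8 bytes, so this sketch encodes explicitly. It is an analogy only, not the ABC implementation.

```python
# Python analogue of u.abs and u.char (illustration only, not the ABC source).
code_point = ord("智")                      # compare: u.abs u.1 chinese
utf8_bytes = chr(26234).encode("utf-8")    # compare: u.char 26234

print(code_point)          # 26234
print(list(utf8_bytes))    # [230, 153, 186], i.e. hexadecimal E6 99 BA
```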

Some ABC text functions and operators work as expected with Unicode texts, in particular split and ^:

>>> WRITE russian
Рукописи не горят!

>>> WRITE split russian
{[1]: "Рукописи"; [2]: "не"; [3]: "горят!"}

>>> PUT (u.char 197)^"ngstr"^(u.char 246)^"m" IN angstrom
>>> WRITE angstrom
Ångström

However, although a non-ASCII Unicode character may look like a single character, to ABC it consists of several, so operations that require a single (ABC) character will act differently. For instance, you can't use

IF (u.char 246) in angstrom:

because ö isn't a single ABC character; you have to say:

IF (u.char 246) in u.items angstrom:

or use

HOW TO REPORT c u.in s: \ Whether the character c occurs in the string s

>>> IF (u.char 246) u.in angstrom: WRITE "yes"
yes
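The byte-versus-character distinction behind this can be seen outside ABC too. The following Python sketch (not ABC code) treats each byte as one "ABC character", which is an assumption about how ABC sees the text, and shows that ö never matches any single byte but does match one of the characters once the text has been split.

```python
# Illustration in Python of bytes versus characters (not ABC code).
angstrom = "Ångström"
data = angstrom.encode("utf-8")

print(len(angstrom))   # 8 characters
print(len(data))       # 10 bytes: Å and ö take two bytes each

o_umlaut = chr(246).encode("utf-8")            # two bytes: C3 B6
single_bytes = [bytes([b]) for b in data]      # what a byte-wise loop would see
characters = [c.encode("utf-8") for c in angstrom]

print(o_umlaut in single_bytes)   # False: no single byte equals the pair
print(o_umlaut in characters)     # True: it is one of the split characters
```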

Similarly, if string contains non-ASCII characters, you can't use

FOR c IN string:

but have to use:

FOR c IN u.items string:

For the same reason the operators |, @, >>, ><, << and item will not behave as you want. As replacements, there are:

HOW TO RETURN s u.trim n: \ The first n characters

>>> WRITE chinese u.trim 3
智取威

HOW TO RETURN s u.at n: \ s from position n

>>> WRITE chinese u.at 3
威虎山

HOW TO RETURN s u.left n: \ s justified left in a space of n characters
HOW TO RETURN s u.right n: \ s justified right in a space of n characters
HOW TO RETURN s u.middle n: \ s centred in a space of n characters

>>> WRITE "|", chinese u.left 8, "|"/
|智取威虎山   |
>>> WRITE "|", chinese u.right 8, "|"/
|   智取威虎山|
>>> WRITE "|", chinese u.middle 8, "|"/
| 智取威虎山  |
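The point of these three functions is that padding is computed from the number of characters, not the number of bytes (the five-character text above is 15 bytes long, so byte-based padding to width 8 would add nothing). A minimal Python sketch of that idea follows. The function pad and its mode argument are hypothetical, and what the real HOW TOs do with texts longer than the width is not stated in the source, so the sketch simply returns them unchanged.

```python
# Sketch of character-based justification (Python, not the ABC source).

def pad(text, width, mode):
    """Pad text to width characters; mode is 'left', 'right' or 'middle'."""
    spare = max(width - len(text), 0)    # len() counts characters in Python
    if mode == "left":
        return text + " " * spare
    if mode == "right":
        return " " * spare + text
    before = spare // 2                  # an odd leftover space goes on the right
    return " " * before + text + " " * (spare - before)

chinese = "智取威虎山"
for mode in ("left", "right", "middle"):
    print("|" + pad(chinese, 8, mode) + "|")
```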

HOW TO RETURN s u.item n: \ The n'th character in s

>>> FOR i IN {0..1+u.size chinese}: WRITE i, chinese u.item i/
0
1 智
2 取
3 威
4 虎
5 山
6
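The behaviour shown for u.trim, u.at and u.item (1-based positions, empty result outside the text) can be mirrored in a few lines of Python. This is a sketch of the observable behaviour only: Python strings are already sequences of code points, so none of the byte handling that the ABC versions need appears here, and the function names are hypothetical.

```python
# Sketch of 1-based, character-wise trim / at / item (Python, not ABC).

def trim(text, n):
    """The first n characters (compare: s u.trim n)."""
    return text[:max(n, 0)]

def at(text, n):
    """The text from character position n onwards (compare: s u.at n)."""
    return text[max(n, 1) - 1:]

def item(text, n):
    """The n'th character, or the empty string if n is out of range."""
    return text[n - 1] if 1 <= n <= len(text) else ""

chinese = "智取威虎山"
print(trim(chinese, 3), at(chinese, 3), item(chinese, 1))   # 智取威 威虎山 智
```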

Finally, the ABC functions lower and upper won't work properly on non-ASCII Unicode characters. Use instead:

HOW TO RETURN u.upper s: \ s returned UPPERCASE
HOW TO RETURN u.lower s: \ s returned lowercase

>>> WRITE angstrom
Ångström
>>> WRITE u.upper angstrom
ÅNGSTRÖM
>>> WRITE u.lower angstrom
ångström
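A similar effect can be reproduced in Python, where case conversion on raw bytes only touches ASCII letters while conversion on decoded text handles every letter. This is an analogy, not a description of ABC's built-in upper, whose exact behaviour on non-ASCII bytes may differ.

```python
# Byte-wise versus character-wise case conversion (Python illustration).
angstrom = "Ångström"
data = angstrom.encode("utf-8")

print(data.upper().decode("utf-8"))   # ÅNGSTRöM  (only ASCII letters changed)
print(angstrom.upper())               # ÅNGSTRÖM
print(angstrom.lower())               # ångström
```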

One final useful function: Unicode defines a character class for each character. For instance, Lu is the class of uppercase letters, Ll of lowercase letters, and Zs of space characters:

HOW TO RETURN u.class c: \ The Unicode class of character c

>>> WRITE russian
Рукописи не горят!

>>> FOR c IN u.items russian: WRITE u.class c, " "
Lu Ll Ll Ll Ll Ll Ll Ll Zs Ll Ll Zs Ll Ll Ll Ll Ll Po
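The same two-letter classes are the Unicode General_Category values, so the result can be cross-checked with any Unicode-aware tool. For example, Python's standard unicodedata module exposes them through unicodedata.category; the sketch below is only a cross-check and says nothing about how u.class is implemented.

```python
# Cross-check of the class names using Python's standard library.
import unicodedata

russian = "Рукописи не горят!"
classes = [unicodedata.category(c) for c in russian]
print(" ".join(classes))
# Lu Ll Ll Ll Ll Ll Ll Ll Zs Ll Ll Zs Ll Ll Ll Ll Ll Po
```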

The classes, along with how many there are of each currently, are:

Cc Control 65
Cf Format 161
Co Private Use 137,468
Cn Unassigned 830,672*
Cs Surrogate 2,048
Ll Lowercase Letter 2,155
Lm Modifier Letter 260
Lo Other Letter 127,004
Lt Titlecase Letter 31
Lu Uppercase Letter 1,791
Mc Spacing Mark 443
Me Enclosing Mark 13
Mn Nonspacing Mark 1,839
Nd Decimal Number 650
Nl Letter Number 236
No Other Number 895
Pc Connector Punctuation 10
Pd Dash Punctuation 25
Pe Close Punctuation 73
Pf Final Punctuation 10
Pi Initial Punctuation 12
Po Other Punctuation 593
Ps Open Punctuation 75
Sc Currency Symbol 62
Sk Modifier Symbol 123
Sm Math Symbol 948
So Other Symbol 6,431
Zl Line Separator 1
Zp Paragraph Separator 1
Zs Space Separator 17

* This number was calculated by summing the numbers given here and subtracting from the total number of Unicode code-points possible.

>>> WRITE u.class u.char 888
Cn
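The footnoted calculation for Cn can be reproduced in a few lines. The Python sketch below uses the counts listed above for every class except Cn, takes the surrogate class as the 2,048 code points D800-DFFF, and takes the total as the 1,114,112 code points 0-10FFFF. The counts are a snapshot and will change with later versions of Unicode.

```python
# Re-deriving the Cn (unassigned) figure from the other class counts.
counts = {
    "Cc": 65, "Cf": 161, "Co": 137468, "Cs": 2048,
    "Ll": 2155, "Lm": 260, "Lo": 127004, "Lt": 31, "Lu": 1791,
    "Mc": 443, "Me": 13, "Mn": 1839,
    "Nd": 650, "Nl": 236, "No": 895,
    "Pc": 10, "Pd": 25, "Pe": 73, "Pf": 10, "Pi": 12, "Po": 593, "Ps": 75,
    "Sc": 62, "Sk": 123, "Sm": 948, "So": 6431,
    "Zl": 1, "Zp": 1, "Zs": 17,
}

TOTAL_CODE_POINTS = 0x110000          # 0 .. 10FFFF inclusive = 1,114,112
listed = sum(counts.values())
unassigned = TOTAL_CODE_POINTS - listed

print(listed, unassigned)             # 283440 830672
```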

Don't forget that there is a central workspace in ABC called 'abc'; any HOW TOs in that workspace are usable from all others. So put these HOW TOs in the abc workspace, and Unicode can be used in all others.

How it works

Eight-bit bytes can have a value from 0-255. The ASCII characters 0-127 are just themselves: that's the reason that UTF-8 exists. Of the other 128 byte values, some (192-223, hexadecimal C0-DF) are leading bytes of a two-byte character, some (224-239, hexadecimal E0-EF) of a three-byte character, and some (240-247, hexadecimal F0-F7) of a four-byte character.

Of the other values, 128-191 (80-BF) are continuation bytes of the multibyte characters, and 248-255 (F8-FF) are illegal.

So a table 'u.start' records, for each byte value, how many bytes long the character it starts is: 1 for ASCII, 2 for C0-DF, and so on, and 0 for continuation bytes, since a character can never start with a continuation byte. Consequently you take the first byte of a string, look up how many bytes it starts, and take that number of bytes to make up a single Unicode character. So the first Unicode character of a string is s|u.start[s|1].
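The table lookup described above can be sketched in Python as follows. START is a hypothetical stand-in for u.start, built from the byte ranges given earlier, and first_char corresponds to s|u.start[s|1]. The extra checks for a truncated or malformed character are an assumption about how "empty on encoding error" might be achieved; they are not taken from the ABC source.

```python
# Sketch of the u.start idea: byte value -> length of the character it starts.
START = [0] * 256                      # 0 = cannot start a character
for b in range(0x00, 0x80):
    START[b] = 1                       # ASCII
for b in range(0xC0, 0xE0):
    START[b] = 2
for b in range(0xE0, 0xF0):
    START[b] = 3
for b in range(0xF0, 0xF8):
    START[b] = 4
# 0x80-0xBF (continuation) and 0xF8-0xFF (illegal) stay 0.

def first_char(data):
    """The bytes of the first character, or b'' if empty or badly encoded."""
    if not data:
        return b""
    n = START[data[0]]
    head = data[:n]                    # like s|n in ABC
    if n == 0 or len(head) < n:
        return b""                     # bad first byte, or text ends too soon
    if any(not 0x80 <= b <= 0xBF for b in head[1:]):
        return b""                     # a continuation byte was expected
    return head

print(first_char("智取威虎山".encode("utf-8")).decode("utf-8"))   # 智
```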

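Multibyte UTF-8 encodings are just base-64 encodings of the codepoint of the character (base 64 in the positional sense: each continuation byte carries six bits, that is, one base-64 digit), so the table u.val records the base-64 value of each particular byte.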
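To make the base-64 reading concrete, here is a Python sketch that recovers a code point from the bytes of one character. The function val is a guess at what a table like u.val would hold (the payload bits of each byte, with the marker bits removed), and code_point is a hypothetical name; neither is copied from unicode.txt.

```python
# Sketch: a character's code point as a base-64 number built from its bytes.

def val(byte):
    """Assumed u.val-style value: the payload bits of a byte."""
    if byte < 0x80:
        return byte            # ASCII: the byte is the code point
    if byte < 0xC0:
        return byte & 0x3F     # continuation byte: 6 bits
    if byte < 0xE0:
        return byte & 0x1F     # lead of a two-byte character: 5 bits
    if byte < 0xF0:
        return byte & 0x0F     # lead of a three-byte character: 4 bits
    return byte & 0x07         # lead of a four-byte character: 3 bits

def code_point(char_bytes):
    """Fold the byte values together, most significant digit first."""
    total = 0
    for b in char_bytes:
        total = total * 64 + val(b)
    return total

# 智 is E6 99 BA: digits 6, 25, 58, so (6 * 64 + 25) * 64 + 58 = 26234
print(code_point("智".encode("utf-8")))   # 26234
```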

The table u.revstart is just the inverse table of u.start.
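The source does not spell out the contents of u.revstart. One plausible reading, assumed here, is that it maps a byte count back to the lowest leading byte for that count (2 to C0, 3 to E0, 4 to F0), which is what is needed to go from a code point to its UTF-8 bytes. The Python sketch below encodes under that assumption; REVSTART, byte_count and encode are hypothetical names, and no range checking (for example for surrogates) is attempted.

```python
# Sketch: encoding a code point as UTF-8 using an assumed revstart-style table.
REVSTART = {1: 0x00, 2: 0xC0, 3: 0xE0, 4: 0xF0}   # byte count -> lead marker

def byte_count(cp):
    """How many UTF-8 bytes a code point needs."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def encode(cp):
    """UTF-8 bytes for code point cp (no validation; sketch only)."""
    n = byte_count(cp)
    tail = []
    for _ in range(n - 1):
        cp, digit = divmod(cp, 64)     # peel off base-64 digits, lowest first
        tail.append(0x80 + digit)      # continuation bytes are 80-BF
    return bytes([REVSTART[n] + cp] + tail[::-1])

print(list(encode(26234)))             # [230, 153, 186], i.e. E6 99 BA
```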

There is one essential part of the HOW TO UNICODE: the target xchars must be installed correctly. It must be the compound

("^A","^?","\200","\367")

that is, it must contain the four characters with byte values 1, 127, 128 and 247 (written above in caret and octal notation).

If it isn't installed properly, you will have to fix it outside of ABC (look for the file called xchars.cts).

There is a simple C program called xchars.c available to produce exactly these characters. Compile it and run it to produce the required compound. For instance:

cc xchars.c
./a.out
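xchars.c itself is not reproduced here. As a small sanity check on the notation, this Python sketch spells out the four characters as byte values, assuming that ^A and ^? are the usual caret notation for control-A (1) and DEL (127), and that \200 and \367 are octal escapes. It does not write xchars.cts, whose file format is not described in the source.

```python
# The four xchars characters as byte values (notation check only).
xchars = (b"\x01", b"\x7f", b"\200", b"\367")

print([c[0] for c in xchars])          # [1, 127, 128, 247]
print([hex(c[0]) for c in xchars])     # ['0x1', '0x7f', '0x80', '0xf7']
```

Under that reading the four values are the lowest and highest ASCII characters other than 0, the first byte above ASCII, and the highest leading byte (F7) mentioned in "How it works".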

Download

Get it here: unicode.txt and save it somewhere.

Install it with:

abc -u -w unicode < unicode.txt

You can unpack it into any workspace: just change the first 'unicode' in the above line into the workspace name you want to use.

There is a HOW TO called UTEST that you can run to check that it has installed OK.

Summary

HOW TO UNICODE: \ Initialise data-structures for a program to use Unicode
HOW TO RETURN u.1 s: \ The first Unicode character of s
HOW TO RETURN s u.item n: \ The n'th Unicode character of s
HOW TO RETURN u.items s: \ String s split into its constituent Unicode characters
HOW TO RETURN u.size s: \ The number of Unicode characters in s
HOW TO RETURN u.abs c: \ The Unicode code point of character c
HOW TO RETURN u.char i: \ The UTF-8 encoding of the character at codepoint i
HOW TO RETURN s u.trim n: \ The first n Unicode characters of s
HOW TO RETURN s u.at n: \ s from position n
HOW TO RETURN s u.left n: \ s justified left in a space of n characters
HOW TO RETURN s u.right n: \ s justified right in a space of n characters
HOW TO RETURN s u.middle n: \ s centred in a space of n characters
HOW TO REPORT u.single s: \ Whether s consists of a single character
HOW TO REPORT c u.in s: \ Whether the character c occurs in the string s
HOW TO RETURN u.upper s: \ s returned UPPERCASE
HOW TO RETURN u.lower s: \ s returned lowercase
HOW TO RETURN u.class c: \ The Unicode class of character c