The bad news is that ABC predates Unicode, and so only uses 8-bit characters. This means that you can't include Unicode characters in ABC programs, nor type Unicode characters in the editor.
The good news is that the UTF-8 encoding of Unicode only uses 8-bit characters, so it is possible to process UTF-8 encoded input and output with an ABC program.
To do this, you need to run the HOW TO below called UNICODE to prepare the necessary data structures at the start of a program, and then you can use the following HOW TOs:
HOW TO RETURN u.1 s: \ The first Unicode character of a string
>>> WRITE chinese
智取威虎山
>>> WRITE u.1 chinese
智
It returns the empty string if the parameter is empty, or if there is a Unicode encoding error in the first character of the parameter.
HOW TO RETURN u.items s: \ String s split into its constituent Unicode characters
>>> WRITE u.items chinese
{[1]: "智"; [2]: "取"; [3]: "威"; [4]: "虎"; [5]: "山"}
You can use this to find out how many Unicode characters are in a string:
>>> WRITE #u.items chinese
5
though there is a faster function u.size:
HOW TO RETURN u.size s: \ The number of characters in s
>>> WRITE u.size chinese
5
You can find out if a string consists of a single character:
HOW TO REPORT u.single s: \ Whether s consists of a single character
>>> IF u.single u.1 chinese: WRITE u.1 chinese
智
Identifying individual characters:
HOW TO RETURN u.abs c: \ The Unicode code point of c
>>> WRITE u.abs u.1 chinese
26234
HOW TO RETURN u.char i: \ The UTF-8 encoding of the char at codepoint i
>>> WRITE u.char 26234
智
Some ABC text functions and operators work as expected with Unicode texts,
in particular split and ^:
>>> WRITE russian
Рукописи не горят!
>>> WRITE split russian
{[1]: "Рукописи"; [2]: "не"; [3]: "горят!"}
>>> PUT (u.char 197)^"ngstr"^(u.char 246)^"m" IN angstrom
>>> WRITE angstrom
Ångström
However, although Unicode characters may look like a single character, to ABC they consist of several, so that operations that require a single (ABC) character will act differently. For instance, you can't use
IF (u.char 246) in angstrom:
because ö isn't a single ABC character; you have to say:
IF (u.char 246) in u.items angstrom:
or use
HOW TO REPORT c u.in s: \ Whether the character c occurs in the string s
>>> IF (u.char 246) u.in angstrom: WRITE "yes"
yes
Similarly, if string contains non-ASCII characters, you can't
use
FOR c IN string:
but have to use:
FOR c IN u.items string:
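To see why, it helps to look at the bytes. The following is only an illustrative sketch in Python (not part of the ABC package), with a Python bytes value standing in for an ABC text of 8-bit characters. It shows that the UTF-8 form of Ångström has more bytes than characters, so a loop that visits one byte at a time never meets ö as a single item.

```python
# Illustration only: Python bytes stand in for an ABC text of 8-bit characters.
angstrom = "\u00c5ngstr\u00f6m".encode("utf-8")   # "Ångström" as UTF-8

byte_count = len(angstrom)                  # how many 8-bit characters
char_count = len(angstrom.decode("utf-8"))  # how many Unicode characters
print(byte_count, char_count)

o_umlaut = "\u00f6".encode("utf-8")         # ö is the two bytes C3 B6

# Visiting the text one byte at a time never yields the two-byte ö.
seen_as_one_item = any(bytes([b]) == o_umlaut for b in angstrom)
print(seen_as_one_item)
```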
For the same reason the operators |, @, >>, ><, << and item will not behave as you want. As replacements, there are:
HOW TO RETURN s u.trim n: \ The first n characters
>>> WRITE chinese u.trim 3
智取威
HOW TO RETURN s u.at n: \ s from position n
>>> WRITE chinese u.at 3
威虎山
HOW TO RETURN s u.left n: \ s justified left in a space of n characters
HOW TO RETURN s u.right n: \ s justified right in a space of n characters
HOW TO RETURN s u.middle n: \ s centred in a space of n characters
>>> WRITE "|", chinese u.left 8, "|"/
|智取威虎山   |
>>> WRITE "|", chinese u.right 8, "|"/
|   智取威虎山|
>>> WRITE "|", chinese u.middle 8, "|"/
| 智取威虎山 |
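The reason the built-in justification operators cannot be used is that padding has to be worked out from the number of characters, not the number of bytes. Here is a minimal sketch of the idea in Python, for illustration only. The names u_size, u_left and u_right are hypothetical stand-ins, and returning the text unchanged when it is already wider than n is an assumption, not a statement about the ABC HOW TOs.

```python
def u_size(s):
    """Number of Unicode characters in UTF-8 encoded bytes."""
    return len(s.decode("utf-8"))

def u_left(s, n):
    """Pad on the right so that s fills n characters (assumed: never truncates)."""
    return s + b" " * max(0, n - u_size(s))

def u_right(s, n):
    """Pad on the left so that s fills n characters (assumed: never truncates)."""
    return b" " * max(0, n - u_size(s)) + s

chinese = "智取威虎山".encode("utf-8")
print("|" + u_left(chinese, 8).decode("utf-8") + "|")
print("|" + u_right(chinese, 8).decode("utf-8") + "|")

# A byte-counting pad sees 15 bytes, so it adds no spaces at all for width 8.
print(len(chinese), chinese.ljust(8) == chinese)
```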
HOW TO RETURN s u.item n: \ The n'th character in s
>>> FOR i IN {0..1+u.size chinese}: WRITE i, chinese u.item i/
0
1 智
2 取
3 威
4 虎
5 山
6
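All three of these replacements can be thought of as ordinary list operations on the result of splitting the string into characters. The Python sketch below is an illustration under that assumption, not the ABC source: the helper names are hypothetical, positions are counted from 1 as in ABC, and an out-of-range position gives the empty string, as in the output above.

```python
def items(s):
    """Split UTF-8 bytes into a list of one-character byte strings."""
    return [c.encode("utf-8") for c in s.decode("utf-8")]

def trim(s, n):
    """The first n characters of s."""
    return b"".join(items(s)[:max(n, 0)])

def at(s, n):
    """s from character position n (1-based) onwards."""
    return b"".join(items(s)[max(n, 1) - 1:])

def item(s, n):
    """The n'th character of s (1-based), or empty if n is out of range."""
    chars = items(s)
    return chars[n - 1] if 1 <= n <= len(chars) else b""

chinese = "智取威虎山".encode("utf-8")
for i in range(0, len(items(chinese)) + 2):
    print(i, item(chinese, i).decode("utf-8"))
```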
Finally, the ABC functions lower and upper won't
work properly on Unicode characters. Use instead
HOW TO RETURN u.upper s: \ s returned UPPERCASE
HOW TO RETURN u.lower s: \ s returned lowercase
>>> WRITE angstrom
Ångström
>>> WRITE u.upper angstrom
ÅNGSTRÖM
>>> WRITE u.lower angstrom
ångström
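One plausible way a byte-oriented case conversion goes wrong can be shown in Python, used here only as an illustration. Python's bytes.upper changes ASCII letters only, which is an assumption about how an 8-bit implementation might behave rather than a description of ABC's upper. Converting character by character gives the expected result.

```python
angstrom = "\u00c5ngstr\u00f6m".encode("utf-8")   # "Ångström" as UTF-8

# Byte-wise: only the ASCII letters are changed, so ö is left in lowercase.
bytewise = angstrom.upper()

# Character-wise: decode to Unicode characters first, then convert.
charwise = angstrom.decode("utf-8").upper().encode("utf-8")

print(bytewise.decode("utf-8"))
print(charwise.decode("utf-8"))
```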
One final useful function. Unicode defines a character class for each
character, for instance Lu is the class of Uppercase Letters, and
Ll of lowercase letters, Zs are space characters:
HOW TO RETURN u.class c: \ The Unicode class of character c
>>> WRITE russian
Рукописи не горят!
>>> FOR c IN u.items russian: WRITE u.class c, " "
Lu Ll Ll Ll Ll Ll Ll Ll Zs Ll Ll Zs Ll Ll Ll Ll Ll Po
The classes, along with how many there are of each currently, are:
Cc  Control                      65
Cf  Format                      161
Co  Private Use             137,468
Cn  Unassigned              830,672*
Cs  Surrogate                 2,048
Ll  Lowercase Letter          2,155
Lm  Modifier Letter             260
Lo  Other Letter            127,004
Lt  Titlecase Letter             31
Lu  Uppercase Letter          1,791
Mc  Spacing Mark                443
Me  Enclosing Mark               13
Mn  Nonspacing Mark           1,839
Nd  Decimal Number              650
Nl  Letter Number               236
No  Other Number                895
Pc  Connector Punctuation        10
Pd  Dash Punctuation             25
Pe  Close Punctuation            73
Pf  Final Punctuation            10
Pi  Initial Punctuation          12
Po  Other Punctuation           593
Ps  Open Punctuation             75
Sc  Currency Symbol              62
Sk  Modifier Symbol             123
Sm  Math Symbol                 948
So  Other Symbol              6,431
Zl  Line Separator                1
Zp  Paragraph Separator           1
Zs  Space Separator              17

* This number was calculated by summing the other numbers given here and subtracting from the total number of Unicode code points possible (1,114,112).
>>> WRITE u.class u.char 888
Cn
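If you want to cross-check the classes, Python's standard unicodedata module reports the same two-letter general categories. This is offered only as an independent check, not as part of the ABC package. The Unicode version built into your Python may differ from the one the counts above were taken from, so the class of a newly assigned code point could differ.

```python
import unicodedata

russian = "Рукописи не горят!"

# unicodedata.category returns the two-letter general category of a character.
classes = [unicodedata.category(c) for c in russian]
print(" ".join(classes))

# Code point 888 (hexadecimal 378) is unassigned at the time of writing.
print(unicodedata.category(chr(888)))
```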
Don't forget that there is a central workspace in ABC called 'abc'; any HOW TOs in that workspace are usable from all others. So put these HOW TOs in the abc workspace, and Unicode can be used in all others.
Eight-bit bytes can have a value from 0 to 255. The ASCII characters 0-127 are just themselves: that's the reason that UTF-8 exists. Of the other 128 byte values, some (192-223, hexadecimal C0-DF) are leading bytes of a two-byte character, some (224-239, hexadecimal E0-EF) of a three-byte character, and some (240-247, hexadecimal F0-F7) of a four-byte character.
Of the remaining values, 128-191 (80-BF) are continuation bytes of multibyte characters, and 248-255 (F8-FF) are illegal.
So a table 'u.start' records, for a given byte value, the number of
bytes in the character that it starts: 1 for ASCII, 2 for C0-DF, and so on, and 0
for continuation bytes, since a character can never start with a continuation byte.
Consequently you take the first byte of a string, look up how many bytes this
starts, and take that number of bytes to make up a single Unicode character. So
the first Unicode character of a string is s|u.start[s|1].
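The lookup described above can be sketched in a few lines of Python, as an illustration of the technique rather than a transcription of the ABC HOW TOs. The names START, first_char and items are hypothetical. Giving illegal bytes (F8-FF) the value 0, and stopping the split at the first encoding error, are assumptions made for this sketch.

```python
def build_start():
    """start[b] = number of bytes in the character that byte b begins, or 0."""
    start = [0] * 256                 # continuation and illegal bytes stay 0
    for b in range(0x00, 0x80):
        start[b] = 1                  # ASCII
    for b in range(0xC0, 0xE0):
        start[b] = 2
    for b in range(0xE0, 0xF0):
        start[b] = 3
    for b in range(0xF0, 0xF8):
        start[b] = 4
    return start

START = build_start()

def first_char(s):
    """The first Unicode character of s, or empty on empty input or bad encoding."""
    if not s:
        return b""
    n = START[s[0]]
    char = s[:n]
    # Reject a bad leading byte, a truncated character, or a bad trailing byte.
    if n == 0 or len(char) < n or any(not 0x80 <= b <= 0xBF for b in char[1:]):
        return b""
    return char

def items(s):
    """Split s into characters by repeatedly taking the first one."""
    out = []
    while s:
        c = first_char(s)
        if not c:
            break                     # assumption: stop at an encoding error
        out.append(c)
        s = s[len(c):]
    return out

chinese = "智取威虎山".encode("utf-8")
print(first_char(chinese).decode("utf-8"), len(items(chinese)))
```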
Multibyte UTF-8 encodings are just base-64 encodings of the codepoint of the
character, so the table u.val records the base-64 value of each
particular byte.
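Here "base-64" is meant in the positional sense: each continuation byte carries six bits, so it acts as one digit in base 64 (this has nothing to do with the Base64 text encoding). A small Python sketch of the arithmetic follows, for illustration only. The function names mirror u.abs and u.char but are not the ABC source, and no validation of the input is attempted.

```python
LEAD_MARKER = {2: 0xC0, 3: 0xE0, 4: 0xF0}   # high bits of a leading byte

def u_abs(c):
    """Code point of a single UTF-8 encoded character (input assumed valid)."""
    if c[0] < 0x80:
        return c[0]                          # ASCII is just itself
    n = len(c)
    value = c[0] & (0xFF >> (n + 1))         # payload bits of the leading byte
    for b in c[1:]:
        value = value * 64 + (b & 0x3F)      # each continuation byte is a base-64 digit
    return value

def u_char(i):
    """UTF-8 encoding of code point i (assumed to be in range 0..0x10FFFF)."""
    if i < 0x80:
        return bytes([i])
    n = 2 if i < 0x800 else 3 if i < 0x10000 else 4
    digits = []
    for _ in range(n - 1):
        digits.append(0x80 | (i % 64))       # lowest base-64 digit first
        i //= 64
    return bytes([LEAD_MARKER[n] | i] + digits[::-1])

print(u_abs("智".encode("utf-8")))
print(u_char(26234).decode("utf-8"))
```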
The table u.revstart is just the inverse table of
u.start.
There is one essential part of the HOW TO UNICODE: that the
target xchars be installed correctly. It must be the compound
("^A", "^?", "\200", "\367")
that is, it must contain the four characters with byte values 1, 127, 128 and 247 (shown here in control-character and octal notation).
If it isn't installed properly, you will have to fix it outside of ABC (look
for the file called xchars.cts).
There is a simple C program called xchars.c available to produce exactly these characters. Compile it and run it to produce the required compound. For instance:
cc xchars.c
./a.out
Get the HOW TOs here: unicode.txt, and save the file somewhere.
Install it with:
abc -u -w unicode < unicode.txt
You can unpack it into any workspace: just change the first 'unicode' in the above line into the workspace name you want to use.
There is a HOW TO called UTEST that you can run to check that it has installed OK.
HOW TO UNICODE: \ Initialise data-structures for a program to use Unicode
HOW TO RETURN u.1 s: \ The first Unicode character of s
HOW TO RETURN s u.item n: \ The n'th Unicode character of s
HOW TO RETURN u.items s: \ String s split into its constituent Unicode characters
HOW TO RETURN u.size s: \ The number of Unicode characters in s
HOW TO RETURN u.abs c: \ The Unicode code point of character c
HOW TO RETURN u.char i: \ The UTF-8 encoding of the character at codepoint i
HOW TO RETURN s u.trim n: \ The first n Unicode characters of s
HOW TO RETURN s u.at n: \ s from position n
HOW TO RETURN s u.left n: \ s justified left in a space of n characters
HOW TO RETURN s u.right n: \ s justified right in a space of n characters
HOW TO RETURN s u.middle n: \ s centred in a space of n characters
HOW TO REPORT u.single s: \ Whether s consists of a single character
HOW TO REPORT c u.in s: \ Whether the character c occurs in the string s
HOW TO RETURN u.upper s: \ s returned UPPERCASE
HOW TO RETURN u.lower s: \ s returned lowercase