TIP 389: Full support for Unicode 10.0 and later (part 1)

Login
FlightAware bounty program for improvements to Tcl and certain Tcl packages.
Author:         Jan Nijtmans <jan.nijtmans@users.sf.net>
Author:         Jan Nijtmans <jan.nijtmans@gmail.com>
State:          Final
Type:           Project
Vote:           Done
Created:        23-Aug-2011
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     tip-389

Abstract

This TIP proposes to add full support for all characters in Unicode 10.0+, inclusive the characters >= U+010000.

Summary

In order to extend the range of the characters to more than 16 bits, the type Tcl_UniChar is not big enough any more to hold all possible characters. Changing the type of Tcl_UniChar to a 32-bit quantity is not an option, as it will result in a binary API incompatibility.

The solution proposed in this TIP is to keep Tcl_UniChar a 16-bit quantity, but to increase the value of TCL_UTF_MAX to 4 (from 3). Any conversions from UTF-8 to Tcl_UniChar will convert any valid 4-byte UTF-8 sequence to a sequence of two Surrogate characters. All conversions from UTF-16 to UTF-8 will make sure that any High Surrogate immediately followed by a Low Surrogate will result in a single 4-byte UTF-8 character.

This can be done in a binary compatible way: No source code of existing extensions need to be modified. As long as no characters >= U+010000 or Surrogates are used, all functions provided by the Tcl library will function as before. There are few functions which currently return a value of type Tcl_UniChar, those will be modified to return an int in stead.

Rationale

As Unicode 10.0, and future Unicode versions, will supply more and more characters outside the 16-bit range, it would be useful if Tcl supports that as well.

Specification

This document proposes:

  • Change the functions Tcl_UniCharToUtf and UnicodeToUtfProc such that when they are fed with a valid High Surrogate followed by a Low Surrogate, the result will be a single 4-byte UTF-8 character.

  • Change the functions Tcl_UtfToUniChar and UtfToUnicodeProc such that when they are fed with a valid 4-byte UTF-8 character, the first call will return a High Surrogate character, and the next call will return a Low Surrogate character.

  • The following functions, which currently return a Tcl_UniChar, will be changed to return an int instead:

    * Tcl_UniCharAtIndex

    * Tcl_UniCharToLower

    * Tcl_UniCharToTitle

    * Tcl_UniCharToUpper

    * Tcl_GetUniChar

    At first sight, this looks like a binary incompatibility, but in fact this is upwards compatible. Since in C, function calls generally transfer the result of a function call in a special register (the Accumulator). When compiling an extension using Tcl 8.6 headers, the caller expects the accumulator to contain a 16-bit result, while the remaining 48 bits (the Accumulator generally is 64-bit) are undefined. When the extension is run under Tcl 8.7, 16 more bits of the accumulator content are now defined (generally all zero's). The effect is that all characters >= U+010000 (which are not supported on Tcl 8.6) are now mapped to characters in the first unicode plane, but that's all. Re-compiling the extension using Tcl 8.7 headers might enable full Unicode support for the extension, if a 32-bit register is used to store the result.

  • Extend tclUniData.c to include all Unicode 10.0 characters up to U+02FA20. A special case will be made for the functions Tcl_UniCharIsGraph and Tcl_UniCharIsPrint for the characters in the range U+0E0100 - U+0E01EF, otherwise it would almost double the Unicode table size.

  • Extend the Tcl_UtfToUniChar function such that the invalid bytes 0x80 up to 0x9F are interpreted as the cp1252 characters € up to Ÿ instead of the (not-existing) Unicode characters U+0080 - U+009F. Since cp1252 is the mostly used Windows codepage, which is upwards compatible with ISO 8859-1, this adheres more to what's intuitively expected behavior. Many current UTF-8 converters nowadays use this. See also here

  • If Tcl is compiled with -DTCL_UTF_MAX=6, use a different TCL_STUB_MAGIC value. Since extensions compiled with -DTCL_UTF_MAX=6 are binary incompatible with normally-compiled Tcl, this causes extensions compiled with this same options no longer being loadable in normal Tcl and reverse. Note that TCL_UTF_MAX=6 compiles are still not officially supported, a lot of additional fixes are needed to make it work right.

Compatibility

As long as no Surrogates or characters >= U+010000 are used, all functions behave exactly the same as before. The only way that Tcl_UniCharToUtf can produce a 4-byte output is when Surrogates or characters >= U+010000 are used.

Extension that want to be compatible with any Tcl version, can include tcl.h as follows:

#define TCL_UTF_MAX 4
#include <tcl.h>

or they can call the C compiler with the additional argument -DTCL_UTF_MAX=4, in order to be sure that UTF-8 representations of length 4 can be handled. This way, the extension can be used with any Tcl version, whether it supports Surrogates or not.

Apart from this, it is advisable to initialize the variable where the chPtr argument from Tcl_UtfToUniChar points to, as this location is used to remember whether the High Surrogate is already produced or not. Not doing so when the first character of a string is a character > U+010000 might result in a Low Surrogate character only. This danger, however unlikely, only exists for the first character in a string, and it only occurs when the (random) value is exactly equal to the expected High Surrogate.

Caveats

In the current implementation "string length \U10000" will return 2 in stead of the expected 1. The reason for this is that many internal operations convert the representation into UTF-16 format, which occupies two surrogate characters. This should change in future Tcl implementations. Correcting this would involve many internal changes, and the risk of introducing crashes because of miscounting bytes (as have been reported in the past, during the tip-389 branch implementation development). Tcl is not the only language doing it this way: javascript does it as well.

This also means that functions line "string index", "string range" and "string reverse" might give unexpected results when characters > U+010000 are involved. Any index pointing to the middle of a 'double-length' Unicode character will be handled as if the index points to just after the character instead. So:

In Tcl 8.7 with TIP #389:

% string length "a\U100000b"
4
% scan %c \U100000
1048576  -> (this is the correct Unicode character)
% string length [string index "a\U100000b" 1]
2        -> (the Unicode character has length 2)
% string length [string index "a\U100000b" 2]
0        -> (So we cannot access the lower surrogate separately)

So, the "string length" of a Unicode character >= U+010000 is 2, and if you try to split it in two separate characters that won't work: It will then be split in a character with length 2 (the original one) and another character with length 0 (the empty string).

Also note that the regexp engine still cannot really handle Unicode characters >U+FFFF, it will handle those as if they consist of 2 separate characters. Most usage of regular expressions won't notice the difference.

Those caveats are planned to be handled in "part 2" (TIP #497)

Reference Implementation

A reference implementation is available in the tip-389 branch.

Rejected Alternatives

It would have been possible to give the new Tcl_GetUniChar and friends a new stub entry and to deprecate the original one, as was done with Tcl_Backslash. However, Tcl_Backslash originally only returned an ASCII character, which needed to be extended to UniChar. UniChar's < U+01000 common in Tcl, Unicode Characters >= U+010000 are rare and don't behave well in Tcl 8.6 anyway. Casts from Tcl_UniChar to int don't cause a warning because all Tcl_UniChar's fit in the 32-bit int range. On the other hand, casting "char" to Tcl_UniChar can result in surprising Unicode characters U+FF?? if char is a signed type (as in most platforms). That's why Tcl_Backslash had to be handled differently.

Copyright

This document has been placed in the public domain.