TIP 542:Support for switchable Full Unicode support

Login
Author:         Jan Nijtmans <[email protected]>
Author:         Jan Nijtmans <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        10-May-2019
Post-History:
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    8.7
Tcl-Branch:     utf-max
Vote-Summary:   Accepted 6/0/0
Votes-For:      FV, JN, KBK, KW, MC, SL
Votes-Against:  none
Votes-Present:  none

Abstract

This TIP proposes being able to switch Tcl between Full Unicode mode (TCL_UTF_MAX>3, almost compatible with Androwish) and current partial Unicode mode (as far as TIP #389 goes, using TCL_UTF_MAX=3)

Full Unicode mode means that [string length \U10000] will no longer return 2 but the expected 1, so Unicode characters larger than \uFFFF will no longer internally be split into two surrogate characters.

Rationale

Tcl currently can be compiled in 3 different modes: using TCL_UTF_MAX=3, TCL_UTF_MAX=4 or TCL_UTF_MAX=6. The first 2 are actually equal now in Tcl 8.7 (since TIP #389). Using TCL_UTF_MAX=6 is actually overkill, since no utf-8 character consists of more than 4 bytes.

Therefore it makes sense to reduce this to only two modes: TCL_UTF_MAX=3 means being fully compatible with Tcl 8.6, while TCL_UTF_MAX=4 means compatibility with the Androwish-version of Tcl. Defining TCL_UTF_MAX=6 results in a valid compilation as well (functioning the same as TCL_UTF_MAX=4), only some buffer-sizes will be 2 bytes larger than necessary.

Androwish made the choice to use an (at that time) un-supported Tcl mode: Changing the size of the Tcl_UniChar type using TCL_UTF_MAX=6. This causes a binary incompatibility which results that all extensions need to be re-compiled with TCL_UTF_MAX=6 as well. This TIP proposes to add a supported TCL_UTF_MAX=4 compilation mode to Tcl, which has the same effect as the earlier unsupported TCL_UTF_MAX=6, but without the need to re-compile all extensions.

There are 4 functions which expose the internal structure of Tcl strings to extensions, functions that are rarely used in extensions. Those are Tcl_UniCharCaseMatch(), Tcl_UniCharLen(), Tcl_UniCharNcmp() and Tcl_UniCharNcasecmp(). They all have a UTF-8 equivalent which can be used instead. Switching TCL_UTF_MAX to a different value causes a binary incompatibility, but it's not worth to do anything about it. That's why this TIP proposes to deprecate them, and have them removed completely in Tcl 9.0 (actually: they will become MODULE_SCOPE internal functions in 9.0).

The default compilation mode for Tcl 8.x will continue to be TCL_UTF_MAX=3, which is 100% upwards compatible with Tcl 8.6.

This TIP opens the room to be able to switch Tcl using TCL_UTF_MAX=4 eventually for Tcl 9.0. More work needs to be done for that, but that's stuff for TIP #497. This TIP describes how far we can go for Tcl 8.x without causing unacceptable binary incompatibilities.

Specification

This document proposes:

If Tcl is compiled with either -DTCL_UTF_MAX=4 or -DTCL_NO_DEPRECATED, those functions will no longer be available for extensions. In Tcl 9.0 those 5 functions will be completely gone.

If Tcl is compiled with a different TCL_UTF_MAX value than the extension, those 5 functions cannot be used. In order to keep binary compatibility with 8.6, those 5 functions are not supported for extensions compiled with -DTCL_UTF_MAX=4. You can use the functions in TIP #548 instead, for converting between Tcl_UniChar and UTF-8.

Compatibility

As long as Tcl is compiled with -DTCL_UTF_MAX=3, this is fully upwards compatible.

When Tcl is compiled with -DTCL_UTF_MAX=4, this is at the Tcl level, compatible with the Androwish-version of Tcl. At the C-API level, it's upwards compatible with Tcl 8.6 in TCL_UTF_MAX=6 mode, except for the functions mentioned above.

Later change

The original version of this TIP (as voted upon), proposed to fully deprecate the Tcl_AppendUnicodeToObj(). However the later TIP #622 changed that: Only deprecate Tcl_AppendUnicodeToObj() when Tcl is compiled in a different -DTCL_UTF_MAX mode than the extension.

Caveats

Remark: All extensions bundled with Tcl (and Tk 8.7a3 as well) are already modified not to use any of those deprecated functions any more. So they can be compiled with any TCL_UTF_MAX value, independent on the TCL_UTF_MAX value Tcl was compiled with.

Reference Implementation

A reference implementation is available in the utf-max branch. https://core.tcl-lang.org/tcl/timeline?r=utf-max

Copyright

This document has been placed in the public domain.