Tcl Source Code

Ticket UUID: 418035470b1306a17cff21dbb78dedfeee3d17f4
Title: tcl.h: #if TCL_UTF_MAX > 4
Type: Bug
Version: 8.6.1
Submitter: pointsman
Created on: 2013-12-22 23:12:32
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2013-12-23 10:13:38
Resolution: Rejected
Closed By: jan.nijtmans
Closed on: 2013-12-23 10:13:38
Description:
tcl.h has:

#if TCL_UTF_MAX > 4
    /*
     * unsigned int isn't 100% accurate as it should be a strict 4-byte value
     * (perhaps wchar_t). 64-bit systems may have troubles. The size of this
     * value must be reflected correctly in regcustom.h and
     * in tclEncoding.c.
     * XXX: Tcl is currently UCS-2 and planning UTF-16 for the Unicode
     * XXX: string rep that Tcl_UniChar represents.  Changing the size
     * XXX: of Tcl_UniChar is /not/ supported.
     */
typedef unsigned int Tcl_UniChar;
#else
typedef unsigned short Tcl_UniChar;
#endif

(See in context here: http://core.tcl.tk/tcl/artifact/f555d5aa61ea46ec31d833b33c52fb8788c8dcdf?ln=2208-2221)

That means that with TCL_UTF_MAX set to 4, Tcl_UniChar is still an unsigned short. This is obviously wrong, because a valid 4 byte utf-8 char is beyond the BMP, that means bigger than an unsigned short.

That define should read as 

#if TCL_UTF_MAX > 3

as it did before check-in http://core.tcl.tk/tcl/info/4d6af4f7a468b71a

Same in regcustom.h: http://core.tcl.tk/tcl/artifact/197b7849d4dfcb86b50f783fdc0fc45dde12125b?ln=100-108
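
To make the size concern concrete, here is a minimal sketch (not Tcl code) that decodes one 4-byte UTF-8 sequence and checks whether the result fits in a 16-bit unit. The helper names decode_utf8_4 and fits_in_16_bits are hypothetical and exist only for this illustration; uint16_t stands in for an unsigned short on platforms where that type is 16 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Tcl: decode one 4-byte UTF-8 sequence.
 * Returns the code point, or 0 if the bytes are not a well-formed
 * 4-byte sequence. */
static uint32_t decode_utf8_4(const unsigned char *s)
{
    /* Lead byte must be 11110xxx, the three trail bytes 10xxxxxx. */
    if ((s[0] & 0xF8) != 0xF0 || (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80 || (s[3] & 0xC0) != 0x80) {
        return 0;
    }
    uint32_t cp = ((uint32_t)(s[0] & 0x07) << 18) |
                  ((uint32_t)(s[1] & 0x3F) << 12) |
                  ((uint32_t)(s[2] & 0x3F) << 6)  |
                   (uint32_t)(s[3] & 0x3F);
    /* Reject overlong encodings and values above U+10FFFF. */
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return 0;
    }
    return cp;
}

/* Hypothetical helper: does the code point fit in one 16-bit unit? */
static int fits_in_16_bits(uint32_t cp)
{
    return cp <= 0xFFFF;
}
```

For the bytes F0 9F 98 80 (U+1F600), decode_utf8_4 yields 0x1F600, fits_in_16_bits reports false, and a cast to uint16_t keeps only the low 16 bits (0xF600).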
User Comments:
jan.nijtmans added on 2013-12-23 10:13:38:
> because a valid 4 byte utf-8 char is beyond the BMP,
> that means bigger than an unsigned short.

Unless surrogate pairs are used to represent characters beyond the BMP.

In TIP #388 you can read the reason for this:
> 4 Not supported. The same as 3, but allowing the use of
> Unicode surrogate pairs to represent the range \U010000 - \U10ffff

Implementation of this is ongoing in the "tip-389-impl" Tcl branch.
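
As an illustration of the surrogate-pair representation referred to above, the following is a minimal sketch of the standard UTF-16 arithmetic. It is not taken from the Tcl sources or from the branch mentioned; the function names to_surrogates and from_surrogates are hypothetical, and uint16_t stands in for a 16-bit Tcl_UniChar.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Tcl: split a code point in the range
 * U+10000..U+10FFFF into a high and a low surrogate.
 * Returns 1 on success, 0 if cp is outside that range. */
static int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return 0;
    }
    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* lower 10 bits */
    return 1;
}

/* Hypothetical helper: recombine a surrogate pair into one code point.
 * The caller is assumed to pass a valid high/low pair. */
static uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) |
                      (uint32_t)(lo - 0xDC00));
}
```

With this arithmetic, U+1F600 becomes the pair 0xD83D 0xDE00, so two 16-bit units can carry a character beyond the BMP without widening the unit type.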

Hope this helps.