Description: |
tcl.h has:
#if TCL_UTF_MAX > 4
/*
* unsigned int isn't 100% accurate as it should be a strict 4-byte value
* (perhaps wchar_t). 64-bit systems may have troubles. The size of this
* value must be reflected correctly in regcustom.h and
* in tclEncoding.c.
* XXX: Tcl is currently UCS-2 and planning UTF-16 for the Unicode
* XXX: string rep that Tcl_UniChar represents. Changing the size
* XXX: of Tcl_UniChar is /not/ supported.
*/
typedef unsigned int Tcl_UniChar;
#else
typedef unsigned short Tcl_UniChar;
#endif
(See in context here: http://core.tcl.tk/tcl/artifact/f555d5aa61ea46ec31d833b33c52fb8788c8dcdf?ln=2208-2221)
That means that with TCL_UTF_MAX set to 4, Tcl_UniChar is still an unsigned short. This is obviously wrong, because a valid 4 byte utf-8 char is beyond the BMP, that means bigger than an unsigned short.
That define should read as
#if TCL_UTF_MAX > 3
as before check-in http://core.tcl.tk/tcl/info/4d6af4f7a468b71a
Same in regcustom.h: http://core.tcl.tk/tcl/artifact/197b7849d4dfcb86b50f783fdc0fc45dde12125b?ln=100-108
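To illustrate the concern, here is a minimal sketch, assuming unsigned short is 16 bits wide (modelled with uint16_t). The names UniChar16 and fits_in_16 are hypothetical and not part of tcl.h.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit Tcl_UniChar (assumes a 16-bit unsigned short). */
typedef uint16_t UniChar16;

/* Returns 1 if the code point survives being stored in a single 16-bit unit, else 0. */
int fits_in_16(uint32_t cp)
{
    UniChar16 narrow = (UniChar16) cp;   /* high bits are silently dropped */
    return (uint32_t) narrow == cp;
}

/* Returns the value that is actually kept after the narrowing conversion. */
uint32_t stored_value(uint32_t cp)
{
    return (uint32_t) (UniChar16) cp;
}
```

For example, U+20AC (inside the BMP) fits in one unit, while U+1F600 (which needs 4 bytes in UTF-8) does not: only its low 16 bits, 0xF600, would be kept.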
|
User Comments: |
jan.nijtmans added on 2013-12-23 10:13:38:
> because a valid 4 byte utf-8 char is beyond the BMP,
> that means bigger than an unsigned short.
Unless surrogate pairs are used to represent characters beyond the BMP.
In TIP #388 you can read the reason for this:
> 4 Not supported. The same as 3, but allowing the use of
> Unicode surrogate pairs to represent the range \U010000 - \U10ffff
Implementation of this is ongoing in the "tip-389-impl" Tcl branch.
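As a rough sketch of how surrogate pairs let a 16-bit unit type cover the range beyond the BMP, here is the standard UTF-16 arithmetic. The function names to_surrogates and from_surrogates are hypothetical and are not taken from the Tcl sources or the "tip-389-impl" branch.

```c
#include <stdint.h>

/*
 * Splits a code point in U+10000..U+10FFFF into a high and a low surrogate.
 * Returns 0 on success, -1 if the code point is outside that range.
 */
int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return -1;
    }
    cp -= 0x10000;                             /* 20 bits remain */
    *hi = (uint16_t) (0xD800 + (cp >> 10));    /* upper 10 bits */
    *lo = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* lower 10 bits */
    return 0;
}

/* Recombines a valid high/low surrogate pair into the original code point. */
uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    return 0x10000
        + ((uint32_t) (hi - 0xD800) << 10)
        + (uint32_t) (lo - 0xDC00);
}
```

With this scheme U+1F600 becomes the two 16-bit units 0xD83D and 0xDE00, so each unit still fits in an unsigned short.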
Hope this helps.
|