Tcl Source Code

Ticket UUID: 1004065
Title: UTF-8 encoding crashes in UCS-4 mode
Type: Bug
Version: obsolete: 8.4.7
Submitter: loewis
Created on: 2004-08-05 16:51:32
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 7 High
Severity: Minor
Status: Closed
Last Modified: 2020-04-23 16:19:08
Resolution: Fixed
Closed By: dgp
Closed on: 2020-04-23 16:19:08
Description:
I built Tcl 8.4.7 with TCL_UTF_MAX set to 6 in tcl.h
(changed from 3). I then ran the command

set x [encoding convertfrom utf-8 \xf0\x9d\x99\xaf]

Tcl crashes with the traceback

#0  0x080994e7 in TableFromUtfProc (clientData=0x810acb0,
    src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications.  Defines\n# \"unknown\" procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"...,
    srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111,
    srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0)
    at ../generic/tclEncoding.c:2353
#1  0x08097a22 in Tcl_UtfToExternal (interp=0x0, encoding=0x810adc0,
    src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications.  Defines\n# \"unknown\" procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"...,
    srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111,
    srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0)
    at ../generic/tclEncoding.c:1091
#2  0x080b2848 in WriteChars (chanPtr=0x8115940, src=0x811a59c "", srcLen=0)
    at ../generic/tclIO.c:3170
#3  0x080b24b7 in Tcl_WriteObj (chan=0x8115940, objPtr=0x81140e8)
    at ../generic/tclIO.c:2960
#4  0x08055128 in Tcl_Main (argc=1, argv=0xbffff2f4, appInitProc=0x80549e6 <Tcl_AppInit>)
    at ../generic/tclMain.c:407
#5  0x080549dc in main (argc=1, argv=0xbffff2f4)
    at ../unix/tclAppInit.c:90

This is all on a Debian system.
User Comments:
dgp added on 2020-04-23 16:19:08:
These lines may also need attention.

https://core.tcl-lang.org/tcl/artifact/da1c0f1647cd7d2c?ln=2474-2478

dgp added on 2020-04-23 16:16:23:
After the fix, I still see the lines

https://core.tcl-lang.org/tcl/artifact/da1c0f1647cd7d2c?ln=2687-2696

If the bug is fixed, we should be able to remove them.

Also, maybe there should be a test case revealing they are still there when they should not be?

jan.nijtmans added on 2020-04-23 12:15:15:

Fixed in [ed551cd16cc8a97e]


dgp added on 2004-11-16 05:50:54:
The command in this report no longer crashes in 8.4.8 with
TCL_UTF_MAX==6. The test suite also runs to completion, with only
tests encoding-16.1 and utf-2.8 failing, which appears to be expected.

Likewise for the HEAD.

hobbs added on 2004-11-13 06:44:20:
I've applied the patch to avoid the crash for 8.4.8 and
8.5a2, but a full evaluation of the TCL_UTF_MAX==6
configuration is still required.

loewis added on 2004-08-06 15:50:59:
For the case of TCL_UTF_MAX==6 (which I refer to in this
report), there is no need for surrogates: non-BMP characters
can be represented directly in a four-byte integer.
For the two-byte Unicode case, you have two options:
- use surrogates, or
- explicitly support neither surrogates nor non-BMP characters

In the latter alternative, you essentially would tell people
that they need a four-byte Unicode installation if they want
non-BMP characters.

Of course, there will be additional issues externally, e.g.
when trying to pass non-BMP characters to the GUI platform
in Tk, when finding appropriate fonts, etc.
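
To make the two representations concrete, here is a minimal sketch in C of how a non-BMP code point maps to a UTF-16 surrogate pair and back. The function names are hypothetical and are not part of the Tcl C API; this only illustrates the arithmetic, not how Tcl would implement it.

```c
#include <stdint.h>

/*
 * Split a code point in the range U+10000..U+10FFFF into a UTF-16
 * surrogate pair. Returns 1 on success, 0 if cp is a BMP character
 * or lies beyond U+10FFFF (no pair exists in either case).
 * Hypothetical helper, not a Tcl API function.
 */
int to_surrogate_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return 0;
    }
    cp -= 0x10000;                            /* 20 significant bits remain */
    *hi = (uint16_t)(0xD800 + (cp >> 10));    /* upper 10 bits */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* lower 10 bits */
    return 1;
}

/* Recombine a valid high/low surrogate pair into one code point. */
uint32_t from_surrogate_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000
        + ((uint32_t)(hi - 0xD800) << 10)
        + (uint32_t)(lo - 0xDC00);
}
```

With a four-byte character type the value 0x1D66F is stored as is; with a two-byte type the same character would need the pair 0xD835 0xDE6F.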

dkf added on 2004-08-06 14:47:38:
The question is really how we should support Unicode characters
outside the BMP.

(I suspect the answer involves surrogates internally, and
goodness knows what externally.)

dgp added on 2004-08-06 01:51:17:

File Added - 96520: 1004065.patch

dgp added on 2004-08-06 01:51:16:
It looks like the encoding routines simply do not currently support the

#define TCL_UTF_MAX 6

variant.

The attached patch stops the reported crash, but a more thorough
review is in order to really address the issue.
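
For readers wondering how a table-driven encoder can crash on a character above U+FFFF, here is a minimal sketch in C. It is an assumption for illustration only: the CodeTable type and table_lookup function are hypothetical, and nothing here is taken from the attached patch or from tclEncoding.c. The idea is that a two-level table sized for 16-bit characters has 256 pages, so a lookup must reject wider values before indexing.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_COUNT 256   /* one page per high byte of a 16-bit character */

/*
 * Hypothetical two-level mapping from a 16-bit code point to an
 * external code word. A NULL page or a 0 entry means "no mapping".
 */
typedef struct {
    const uint16_t *pages[PAGE_COUNT];
} CodeTable;

/*
 * Look up ch in the table. Characters above U+FFFF cannot be in a
 * table of this shape, so they are reported as unmapped. Without the
 * range check, ch >> 8 would index past the end of pages[].
 */
uint16_t table_lookup(const CodeTable *t, uint32_t ch)
{
    const uint16_t *page;

    if (ch > 0xFFFF) {
        return 0;
    }
    page = t->pages[ch >> 8];
    if (page == NULL) {
        return 0;
    }
    return page[ch & 0xFF];
}
```

A caller would treat a 0 result as "unmappable" and substitute a fallback character, which is one way to avoid a crash without deciding the larger question of full TCL_UTF_MAX==6 support.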

dgp added on 2004-08-06 01:34:46:
Confirmed on the HEAD.

Note this bug only shows up in an interactive tclsh. The problem
happens when converting to the system encoding for writing
the result to stdout.

dgp added on 2004-08-06 01:25:23:
Sorry, I misspoke before. I got my "convertfrom" and
"convertto" mixed up. Your demo script is fine.

The following demo script should be equivalent:

set x [encoding convertfrom utf-8 \u00f0\u009d\u0099\u00af]

loewis added on 2004-08-06 00:50:21:
I believe it does what I want to do: explicitly invoke
decoding of a UTF-8 encoded string. I hope that the string
I specify contains four characters, which are then
interpreted as an octet string of four octets when passed to
"encoding convertfrom". This, in turn, should generate a
string with a single character.

Using the \u form is not possible, since it only supports
characters with numeric values up to U+FFFF. However, the
character above is U+0001D66F, MATHEMATICAL SANS-SERIF BOLD
ITALIC SMALL Z. Other languages support a \U notation for
non-BMP characters. Apparently, Tcl doesn't (which is not
really a problem if you can get the character by decoding
it from UTF-8).
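
As a cross-check of the expected result, here is a minimal sketch in C of decoding one UTF-8 sequence by hand. The helper utf8_decode_one is hypothetical and is not part of the Tcl C API. Applied to the four octets F0 9D 99 AF it yields the single code point 0x1D66F.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence of 1 to 4 bytes starting at src.
 * On success, stores the code point in *cp and returns the number of
 * bytes consumed. Returns 0 if the input is truncated or malformed.
 * Hypothetical helper, not a Tcl API function.
 */
size_t utf8_decode_one(const unsigned char *src, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t value;

    if (len == 0) {
        return 0;
    }
    if (src[0] < 0x80) {
        need = 1; value = src[0];
    } else if ((src[0] & 0xE0) == 0xC0) {
        need = 2; value = src[0] & 0x1F;
    } else if ((src[0] & 0xF0) == 0xE0) {
        need = 3; value = src[0] & 0x0F;
    } else if ((src[0] & 0xF8) == 0xF0) {
        need = 4; value = src[0] & 0x07;
    } else {
        return 0;   /* stray continuation byte or invalid lead byte */
    }
    if (len < need) {
        return 0;   /* truncated sequence */
    }
    for (i = 1; i < need; i++) {
        if ((src[i] & 0xC0) != 0x80) {
            return 0;   /* not a continuation byte */
        }
        value = (value << 6) | (uint32_t)(src[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (value < min_for_len[need]
            || (value >= 0xD800 && value <= 0xDFFF)
            || value > 0x10FFFF) {
        return 0;
    }
    *cp = value;
    return need;
}
```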

dgp added on 2004-08-06 00:23:14:
A crash is definitely a bug; thanks for the report.

That said, your snippet of code is likely
not doing what you intend.  Each \xHH
substitution produces one Unicode character,
not one byte.

As a general rule \xHH substitution should be
avoided in Tcl scripts.  Use \u instead.
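
The character-versus-byte distinction can be seen in a minimal sketch in C of a generic UTF-8 encoder. The helper utf8_encode_one is hypothetical and is not part of the Tcl C API. The character U+00F0 occupies two bytes (C3 B0) once encoded as UTF-8, whereas the raw byte F0 is only the lead byte of a longer sequence.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-8 into out, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies beyond U+10FFFF. Hypothetical helper, not a Tcl API function.
 */
size_t utf8_encode_one(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;   /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```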
