Ticket UUID: 1004065
Title: UTF-8 encoding crashes in UCS-4 mode
Type: Bug
Version: obsolete: 8.4.7
Submitter: loewis
Created on: 2004-08-05 16:51:32
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 7 High
Severity: Minor
Status: Closed
Last Modified: 2020-04-23 16:19:08
Resolution: Fixed
Closed By: dgp
Closed on: 2020-04-23 16:19:08
Description:
I built Tcl 8.4.7 by setting TCL_UTF_MAX to 6 in tcl.h (changed from 3). I then ran the command

    set x [encoding convertfrom utf-8 \xf0\x9d\x99\xaf]

Tcl crashes with the traceback

    #0 0x080994e7 in TableFromUtfProc (clientData=0x810acb0, src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications. Defines\n# \"unknown\" procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"..., srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111, srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0) at ../generic/tclEncoding.c:2353
    #1 0x08097a22 in Tcl_UtfToExternal (interp=0x0, encoding=0x810adc0, src=0x8115960 "ð\235\231¯ð\213\030@# Default system startup file for Tcl-based applications. Defines\n# \"unknown\" procedure and auto-load facilities.\n#\n# RCS: @(#) $Id: init.tcl,v 1.55.2.3 2004/05/03 14:28:59 dgp Exp $\n#\n# Copy"..., srcLen=4, flags=0, statePtr=0x8114b04, dst=0x8116988 "% ", dstLen=4111, srcReadPtr=0xbffff068, dstWrotePtr=0xbffff074, dstCharsPtr=0xbfffefe0) at ../generic/tclEncoding.c:1091
    #2 0x080b2848 in WriteChars (chanPtr=0x8115940, src=0x811a59c "", srcLen=0) at ../generic/tclIO.c:3170
    #3 0x080b24b7 in Tcl_WriteObj (chan=0x8115940, objPtr=0x81140e8) at ../generic/tclIO.c:2960
    #4 0x08055128 in Tcl_Main (argc=1, argv=0xbffff2f4, appInitProc=0x80549e6 <Tcl_AppInit>) at ../generic/tclMain.c:407
    #5 0x080549dc in main (argc=1, argv=0xbffff2f4) at ../unix/tclAppInit.c:90

This is all on a Debian system.
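For readers who want to see why this input exercises the non-BMP path, here is a minimal Python sketch that decodes the four octets from the report by hand. It is an illustration only, not Tcl's decoder: the function name `decode_utf8_sequence` is hypothetical, and the sketch skips checks a production decoder needs (overlong forms, surrogate code points, values above U+10FFFF).

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence into a code point (minimal sketch)."""
    lead = data[0]
    # The lead byte announces the sequence length and carries the top bits.
    if lead < 0x80:
        length, value = 1, lead
    elif lead >> 5 == 0b110:
        length, value = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:
        length, value = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("unexpected sequence length")
    # Each continuation byte (10xxxxxx) contributes six more bits.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


code_point = decode_utf8_sequence(b"\xf0\x9d\x99\xaf")
print(f"U+{code_point:04X}")   # U+1D66F
print(code_point > 0xFFFF)     # True: the value does not fit in 16 bits
```

The result lies above U+FFFF, so any code that assumes a 16-bit character value has to handle it specially.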
User Comments:
dgp added on 2020-04-23 16:19:08:
These lines may also need attention.
https://core.tcl-lang.org/tcl/artifact/da1c0f1647cd7d2c?ln=2474-2478

dgp added on 2020-04-23 16:16:23:
After the fix, I still see the lines
https://core.tcl-lang.org/tcl/artifact/da1c0f1647cd7d2c?ln=2687-2696
If the bug is fixed, we should be able to remove them. Also, maybe there should be a test case revealing they are still there when they should not be?

jan.nijtmans added on 2020-04-23 12:15:15:
Fixed in [ed551cd16cc8a97e]

dgp added on 2004-11-16 05:50:54:
Logged In: YES user_id=80530
The command in this report no longer crashes in 8.4.8 with TCL_UTF_MAX==6. The test suite also runs to completion with only tests encoding-16.1 and utf-2.8 failing, which appears to be expected. Likewise for the HEAD.

hobbs added on 2004-11-13 06:44:20:
Logged In: YES user_id=72656
I've applied the patch to avoid the crash for 8.4.8 and 8.5a2, but a full evaluation of the TCL_UTF_MAX==6 configuration is still required.

loewis added on 2004-08-06 15:50:59:
Logged In: YES user_id=21627
For the case of TCL_UTF_MAX==6 (which I refer to in this report), there is no need for surrogates: non-BMP characters can be represented using just a four-byte integer. For the two-byte Unicode case, you have two options:
- use surrogates, or
- explicitly don't support surrogates, and non-BMP characters
In the latter alternative, you essentially would tell people that they need a four-byte Unicode installation if they want non-BMP characters. Of course, there will be additional issues externally, e.g. when trying to pass non-BMP characters to the GUI platform in Tk, when finding appropriate fonts, etc.

dkf added on 2004-08-06 14:47:38:
Logged In: YES user_id=79902
The question is really how should we support Unicode chars outside the BMP? (I suspect the answer involves surrogates internally, and goodness knows what externally.)
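The two internal representations discussed in the comments above (a surrogate pair in a two-byte build versus a single four-byte integer) can be compared with a short Python sketch. The helper names `to_utf16_units` and `from_surrogates` are hypothetical and say nothing about how Tcl actually stores strings; they only show the standard UTF-16 arithmetic for the character from this report.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the 16-bit code units that represent code_point in UTF-16."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        return [code_point]            # BMP: one unit is enough
    offset = code_point - 0x10000      # 20 bits split into two halves
    return [0xD800 + (offset >> 10),   # high (lead) surrogate
            0xDC00 + (offset & 0x3FF)] # low (trail) surrogate


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


units = to_utf16_units(0x1D66F)
print([hex(u) for u in units])        # ['0xd835', '0xde6f']
print(hex(from_surrogates(*units)))   # 0x1d66f
```

With a four-byte character type the value 0x1D66F is stored directly, which is why surrogates are unnecessary in that configuration.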

dgp added on 2004-08-06 01:51:17:
File Added - 96520: 1004065.patch

dgp added on 2004-08-06 01:51:16:
Logged In: YES user_id=80530
It looks like the encoding routines simply do not support the #define TCL_UTF_MAX 6 variant currently. The attached patch stops the reported crash, but a more thorough review is in order to really address the issue.

dgp added on 2004-08-06 01:34:46:
Logged In: YES user_id=80530
Confirmed on the HEAD. Note this bug only shows up in an interactive tclsh. The problem happens when converting to the system encoding for writing the result to stdout.

dgp added on 2004-08-06 01:25:23:
Logged In: YES user_id=80530
Sorry, I misspoke before. I got my "convertfrom" and "convertto" mixed up. Your demo script is fine. The following demo script should be equivalent:

    set x [encoding convertfrom utf-8 \u00f0\u009d\u0099\u00af]

loewis added on 2004-08-06 00:50:21:
Logged In: YES user_id=21627
I believe it does what I want to do: explicitly invoke decoding of a UTF-8 encoded string. I hope that the string I specify contains four characters, which are then interpreted as an octet string of four octets when passed to "encoding convertfrom". This, in turn, should generate a string with a single character. Using the \u form is not possible, since it only supports characters with numeric values up to U+FFFF. However, the character above is U+0001D66F, MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL Z. Other languages support a \U notation for non-BMP characters. Apparently, Tcl doesn't (which is not really a problem if you could get the character by decoding it from UTF-8).

dgp added on 2004-08-06 00:23:14:
Logged In: YES user_id=80530
A crash is definitely a bug; thanks for the report. That said, your snippet of code is likely not doing what you intend. Each \xHH substitution produces one Unicode character, not one byte. As a general rule \xHH substitution should be avoided in Tcl scripts. Use \u instead.
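The characters-versus-octets point raised in the earliest comments can be illustrated with a Python analogy. This is a sketch under a stated assumption: the `latin-1` encoding step stands in for the way the four characters are treated as four octets before UTF-8 decoding, as loewis describes, and is not a claim about Tcl's internals.

```python
# Four *characters*, U+00F0 U+009D U+0099 U+00AF. In Python, as in the Tcl
# scripts quoted above, the \xHH and \u00HH spellings denote the same characters.
chars = "\u00f0\u009d\u0099\u00af"
assert chars == "\xf0\x9d\x99\xaf"

# Assumption for this sketch: each character below U+0100 maps to the single
# octet with the same numeric value (which is what latin-1 encoding does).
octets = chars.encode("latin-1")

# Decoding those four octets as UTF-8 yields one non-BMP character.
decoded = octets.decode("utf-8")

print(len(chars), len(octets), len(decoded))  # 4 4 1
print(hex(ord(decoded)))                      # 0x1d66f
```

Both spellings of the demo script therefore feed the same four octets to the decoder, which matches dgp's correction that the two scripts are equivalent.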
Attachments:
- 1004065.patch added by dgp on 2004-08-06 01:51:17.