Ticket UUID: 411825
Title: Passing list w/UTF-8 from C can fail
Type: Bug | Version: obsolete: 8.4.4
Submitter: arobert3434 | Created on: 2001-03-28 06:36:18
Subsystem: 10. Objects | Assigned To: dgp
Priority: 5 Medium | Severity:
Status: Closed | Last Modified: 2003-08-28 18:58:04
Resolution: Fixed | Closed By: dgp
Closed on: 2003-08-27 20:11:02
Description:

On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8 character-triplets ending in octal 240 (decimal 160, hex A0) interferes with list delimitation when Tcl_AppendElement is used to return a result from a C function. In particular, if a UTF-8 string ending in octal 240 is appended to the result, and then another UTF-8 string is appended afterwards, the octal 240 seems to be interpreted as a "forward delete" character of some kind, with the result that the separation between the two list elements is erased and they are interpreted as one.

The following C function, when called from Tcl, illustrates the problem.

    int sendCharList(ClientData clientData, Tcl_Interp *interp,
                     int argc, char **argv)
    {
        char s1[5], s2[5], s3[5], s4[5];

        strcpy(s1, "\345\220\240");
        strcpy(s2, "\345\214\240");
        strcpy(s3, "\351\235\240");
        strcpy(s4, "\347\264\240");

        Tcl_ResetResult(interp);
        Tcl_AppendElement(interp, s1);
        Tcl_AppendElement(interp, s2);
        Tcl_AppendElement(interp, s3);
        Tcl_AppendElement(interp, s4);

        return TCL_OK;
    }

The Tcl calls:

    set s6 [sendCharList]
    puts "[llength $s6] , [string length $s6]"

should output "4 , 7" (4 list elements, each a single UTF-8 composite character, plus 3 delimiters). On some systems it does. On others, however, the output is "1 , 4", resulting from deletion of the list delimiters somewhere during passage from C to Tcl.

A complete test program involving the above (plus some additional tests, and using wish rather than tclsh) may be accessed at:

    ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz

(it is also attached). A full application that exposes the bug (and led to its discovery) may be found at:

    http://freshmeat.net/projects/hanzim

Unfortunately, I have not been able to isolate why some installations exhibit the bug and some don't. A default SUSE 7.0 Linux installation of 8.3.1 had the problem, while a default Slackware 7.1 installation of the same Tcl/Tk version did not. Maybe it is a compilation flag difference... ?

I'm also not sure whether it persists in 8.3.2 or 8.4.
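
The symptom in the description can be reproduced outside Tcl with a small sketch. The code below is not Tcl's actual TclNeedSpace(); it is a simplified, hypothetical stand-in (the names byte_is_space_latin1 and naive_need_space are invented for illustration) that assumes a byte-wise whitespace test in which byte 0xA0 counts as a space, as a locale-dependent isspace() may do in ISO-8859-1 locales.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical byte-wise whitespace test modelled on an ISO-8859-1
 * locale, where byte 0xA0 (no-break space) is classified as a space. */
static int byte_is_space_latin1(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r') || c == 0xA0;
}

/* Simplified stand-in for a "do we need a separator before appending?"
 * check that looks only at the final byte of the existing result.
 * Returns 1 if a separating space is judged necessary, 0 otherwise. */
static int naive_need_space(const char *result)
{
    size_t n = strlen(result);

    if (n == 0) {
        return 0;               /* nothing to separate from */
    }
    return !byte_is_space_latin1((unsigned char)result[n - 1]);
}
```

With this check, naive_need_space("\345\220\240") returns 0: the final byte of the three-byte UTF-8 character is 0xA0, so the string is wrongly judged to already end in whitespace and no separator would be added before the next element.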
User Comments:

dossy added on 2003-08-28 18:58:04:
Logged In: YES user_id=21885

Donal -- yes, I see your point and now I agree. The rule is that list elements ending in a list delimiter get quoted, and since \302\240 is now no longer considered a list delimiter, it doesn't cause quoting to happen. Thanks.

Don -- I understand what's supposed to happen (at least, I thought I did) but then, explain this:

    % encoding system identity
    % fconfigure stdout -encoding binary -translation binary
    % TestCmd
    foo bar
    % string length [TestCmd]
    8
    % string bytelength [TestCmd]
    9

I would have expected to get "foo\302\240 bar" and not just "foo\240 bar". It's clear from string bytelength that the \302 is in there, but when I set stdout encoding to binary, it should give me the raw UTF-8 (9 bytes) and not the transcoded ISO-8859-1 representation (8 bytes), right? Or, am I misunderstanding what "-encoding binary" means and what "encoding system identity" does?

I mean, this actually does what I expect:

    % fconfigure stdout -encoding identity
    % TestCmd
    foo bar

Now it output "foo\302\240 bar" -- why will it do that on "-encoding identity" but not "-encoding binary"?

Perhaps we can take this discussion to the wiki or email since it's not directly related to this particular bug -- let me know what works best for you.

dkf added on 2003-08-28 06:40:59:
Logged In: YES user_id=79902

Behaviour is correct. UTF-8 sequence \302\240 corresponds to ISO8859-1 character \240 (i.e. non-breaking space.) Non-breaking space is (now, with DGP's patch) considered to not be a space character and hence not in need of quoting.

dgp added on 2003-08-28 06:39:06:
Logged In: YES user_id=80530

Let me advise you try again tomorrow. By then the anonymous CVS at SF will have caught up to all my commits.

The output you describe sounds correct to me. Before you file another bug report, be sure you understand that Tcl uses UTF-8 encoding internally and by default converts to your system encoding on output.
The two byte sequence \302\240 is the UTF-8 encoding for the character known in Tcl-Unicode notation as \u00a0, which is the non-breaking space. When you write that character to output on a system with system encoding of iso8859-1, it gets written as the single byte \240, which is the same character in that encoding. Likewise, if you were to read in the byte \240 on the same system, Tcl will convert it back to UTF-8, so by the time Tcl sees it again, it will be the 2-byte sequence \302\240.

When you work with an interactive tclsh, the results you see have actually been written to stdout, and are in the system encoding.

If you don't completely follow what I just said, do not file another bug report yet, but let's find another channel to straighten out any misunderstandings about how Tcl encodings are supposed to work.

dossy added on 2003-08-28 06:12:20:
Logged In: YES user_id=21885

Your patch only included tests util-8.5 and util-8.6. I just checked HEAD and core-8-4-branch and the util.test file stops at util-8.1. I'm showing the last checkin for tests/util.test as:

    revision 1.11
    date: 2003/07/24 16:05:24; author: dgp; state: Exp; lines: +37 -4

I assume this means you didn't get to check your change in, yet?

Either way, the C test case I provided on 2003-08-25 20:37 passes after applying the patch, kinda. [llength [TestCmd]] == 2, but now look at what TestCmd outputs:

    % encoding system iso8859-1
    % TestCmd
    foo bar

Pushing that through "od -xc", here's the actual bytes that get output:

    666f 6fa0 2062 6172
    f o o 240 b a r

Instead of \302\240 coming back out, only \240 came back. At least this is a *different* problem to solve, now. At least before it would return "foo\302\240bar" -- now, it's returning "foo\240 bar" -- I'm not exactly sure which is worse. :-)

However, the behavior I described on 2003-08-26 13:22 hasn't changed: list elements ending in \302\240 don't get wrapped with {}. Suppose I should file this as a new bug, now?
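
The relationship between the UTF-8 pair \302\240 and the single ISO-8859-1 byte \240 discussed in the comments above can be checked with a short sketch. This is not how Tcl's encoding subsystem is implemented; it is a minimal, self-contained conversion under the assumption that only code points up to U+00FF matter, and the function name utf8_to_latin1 is hypothetical.

```c
#include <stddef.h>

/* Convert a NUL-terminated UTF-8 string to ISO-8859-1 bytes.
 * dst must have room for at least strlen(src) bytes.
 * Code points above U+00FF, and malformed input, are written as '?'.
 * Returns the number of bytes written (no terminator is added). */
static size_t utf8_to_latin1(const char *src, unsigned char *dst)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (*p != 0) {
        if (*p < 0x80) {
            /* ASCII maps to itself. */
            dst[n++] = *p++;
        } else if ((*p & 0xFE) == 0xC2 && (p[1] & 0xC0) == 0x80) {
            /* Lead bytes 0xC2 and 0xC3 cover U+0080..U+00FF. */
            dst[n++] = (unsigned char)(((*p & 0x03) << 6) | (p[1] & 0x3F));
            p += 2;
        } else {
            /* Not representable here: emit '?' and skip the sequence. */
            dst[n++] = '?';
            p++;
            while ((*p & 0xC0) == 0x80) {
                p++;
            }
        }
    }
    return n;
}
```

Feeding it the 9-byte string "foo\302\240 bar" yields 8 bytes with 0xA0 (octal 240) in the fourth position, which matches the byte sequence shown in the od output above.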
dossy added on 2003-08-28 03:22:28:
Logged In: YES user_id=21885

Thank you so much, Don. We're going to apply the patch and do our tests. I'll let you know how it goes!

dgp added on 2003-08-28 02:59:29:
File Added - 59965: 411825.patch
Logged In: YES user_id=80530

Here's a copy of the patch I am committing to HEAD and to core-8-4-branch.

dgp added on 2003-08-28 00:58:05:
Logged In: YES user_id=80530

Committed new tests to the test suite:

    util-8.3 shows dossy's reported bug
    util-8.4 shows another TclNeedSpace bug

Fix on the way.

dgp added on 2003-08-27 04:47:39:
Logged In: YES user_id=80530

Sorry, but it's gonna be another day. Just as I was testing the patch, a big storm came through and knocked off power. Power's back, but the disk on which the patch is stored has not come back online yet. Will get back to this tomorrow.

dossy added on 2003-08-27 02:25:59:
Logged In: YES user_id=21885

Sounds good, Don. Thanks for the quick response on this!

dgp added on 2003-08-27 02:21:33:
Logged In: YES user_id=80530

Tell you what. Let me commit a fix to the re-opened bug (should be able to start work on that shortly; should not take long). Then after that fix is in, if you still find something not meeting your expectations, you can file a new bug report on that. Thanks.

dossy added on 2003-08-27 00:22:57:
Logged In: YES user_id=21885

I don't know if this should be entered as a separate bug, but it's related to this problem (a similar fix should address both):

    % set a [list "abc "]
    {abc }

This is correct -- since the list element ends in whitespace, it's wrapped with {} for its string representation.

    % encoding system iso8859-1
    % set a [list "abc\240\240"]
    abc

Here, the string is "abc\240\240" but it's not being wrapped by {}. But, if [string is space \240] is 1, shouldn't it be?

    % encoding system utf-8
    % set a [list [encoding convertfrom utf-8 "abc\302\240\302\240"]]
    abc

Here, [string is space [encoding convertfrom utf-8 \302\240]] = 1.
Again, the list element isn't being wrapped with {} -- why?

Parts of Tcl treat \240 or \302\240 as a space (and thus don't insert a list delimiter character) but others don't treat it as a space, so stringified list elements don't get {} wrapped around them.

-- Dossy

dkf added on 2003-08-26 17:12:25:
Logged In: YES user_id=79902

Sorry. I've no time to chase this today. :^/

dgp added on 2003-08-26 10:30:59:
Logged In: YES user_id=80530

I see the ChangeLog comment from dkf's patch:

    * generic/tclUtil.c (TclNeedSpace): Rewrote to be UTF-8 aware.
    [Bug 411825, but not that patch which would have added extra
    spaces if there was a real non-ASCII space involved.]

Trouble here is that Tcl_UniCharIsSpace() is the wrong test. It is not equivalent to Tcl_UniCharIsAListElementTerminator(), which is what we really need to test. In particular, the "non-breaking space" \u00A0 returns true from Tcl_UniCharIsSpace(), but is not recognized by the list parser in [llength] as a separator of list elements.

Looks like the prior fix did correct lots of errors. Prior to the fix, every UTF-8 sequence ending in the byte \xA0 (or \240) caused trouble with TclNeedSpace(). After the fix, only the UTF-8 sequence \xC2\xA0 is a problem.

Here's an interactive sequence in plain Tcl (no C coding required) that demos the remaining bug:

    % interp create \u00a0
    �
    % interp create [list \u00a0 foo]
    � foo
    % interp alias {} fooset [list \u00a0 foo] set
    fooset
    % interp target {} fooset
    �foo
    % # Just to be really clear...
    % llength [interp target {} fooset]
    1

Assigning to dkf. If he doesn't have it fixed by the time I get to work tomorrow, we'll get it done then.

dgp added on 2003-08-26 10:12:11:
Logged In: YES user_id=80530

Thank you! A good example clarifies a lot.

Certainly looks like dkf's patch failed to fix things, doesn't it? Not clear to me why the patch attached to this report wasn't accepted instead.

Re-opening.
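
The distinction drawn in the 2003-08-26 analysis above, between "is a space character" and "terminates a list element", can be sketched as two separate predicates. Both function names below are hypothetical, and neither is Tcl's code: the first assumes that list elements are separated only by the ASCII whitespace characters, and the second is a deliberately tiny stand-in for a broad Unicode whitespace test that covers only the code points relevant to this report.

```c
/* Assumed list-separator set: ASCII space, tab, newline, vertical tab,
 * form feed and carriage return. U+00A0 is deliberately excluded. */
static int is_list_separator(unsigned int cp)
{
    return cp == ' ' || (cp >= '\t' && cp <= '\r');
}

/* Tiny stand-in for a broad "is whitespace" test. A real Unicode test
 * covers many more code points; only U+00A0 is added here. */
static int is_broad_space(unsigned int cp)
{
    return is_list_separator(cp) || cp == 0x00A0;
}
```

For U+00A0 the two predicates disagree: is_broad_space() returns 1 while is_list_separator() returns 0. A check that decides whether a separator is needed would have to use the narrower predicate to stay consistent with a list parser that only splits on ASCII whitespace.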
dossy added on 2003-08-26 07:37:20:
Logged In: YES user_id=21885

====8<==== utfNbspTest.c ====8<====

    /*
     * utfNbspTest.c
     *
     * Appending an element to a previous element that ends with the
     * sequence 0xC2A0 (or \302\240), the UTF code for NO-BREAK SPACE,
     * results in an incorrect list.
     *
     * $ gcc -o utfNbspTest utfNbspTest.c -L/path/to/libtcl8.4.* -ltcl8.4
     *
     */

    #include <tcl.h>

    int
    TestCmd(ClientData clientData, Tcl_Interp *interp, int argc, char **argv)
    {
        Tcl_AppendElement(interp, "foo\302\240");
        Tcl_AppendElement(interp, "bar");

        return TCL_OK;
    }

    int
    My_AppInit(Tcl_Interp *interp)
    {
        Tcl_CreateCommand(interp, "TestCmd", (Tcl_CmdProc *) TestCmd,
                NULL, NULL);

        return TCL_OK;
    }

    int
    main(int argc, char **argv)
    {
        Tcl_Main(argc, argv, My_AppInit);
        Tcl_Exit(0);

        /* NOTREACHED */
        return 0;
    }

====8<==== utfNbspTest.c ====8<====

Here's the transcript showing the error:

    $ ./utfNbspTest
    % set tcl_patchLevel
    8.4.4
    % encoding system
    utf-8
    % encoding system utf-8
    % set x [TestCmd]
    foo bar
    % llength $x
    1
    % string length $x
    7
    % string bytelength $x
    8
    % exit

Of course, yes I know:

1) I should Tcl_Obj'ify everything.
2) Tcl_AppendElement is deprecated (supposedly!)

However, I'm dealing with a good amount of legacy code that will eventually get changed/modernized, but for now, it needs to work. If Tcl >8.1 isn't backward compatible, that's fine. But, to call Tcl_AppendElement "deprecated" when it isn't backward compatible, well ... that's just wrong.

-- Dossy

dgp added on 2003-08-26 05:06:59:
Logged In: YES user_id=80530

Provide the C code that calls Tcl_AppendElement() and that gives results that are incorrect in either Tcl 8.4.4 or the HEAD.

dossy added on 2003-08-26 04:42:18:
Logged In: YES user_id=21885

I'd really hate to pick at an old scab (this bug was closed back in 09/2001) but exactly what was "fixed" by dkf's commit? Against Tcl 8.4.4, using Tcl_AppendElement() which I know is deprecated, the problem is still occurring.
I guess it has to do with this behavior:

    $ string is space [encoding convertfrom utf-8 \302\240]
    1

What's annoying is if you do:

    > set a foo\302\240
    fooà
    > set a [encoding convertfrom utf-8 foo\302\240]
    foo
    > lappend a bar
    foo bar
    > llength $a
    2
    > string bytelength $a
    9

That does the right thing. But if you Tcl_AppendElement(), you'll get "foo\302\240bar", which is bad.

-- Dossy

dkf added on 2001-09-19 15:53:42:
Logged In: YES user_id=79902

Test and fix committed (SF seems to be working at the mo...)

dgp added on 2001-09-19 04:23:29:
Logged In: YES user_id=80530

Assigning to dkf, since he can't log in and assign it himself.

dgp added on 2001-09-19 00:10:50:
Logged In: YES user_id=80530

The bug is in TclNeedSpace(), in generic/tclUtil.c, part of the Objects Category.

Is there a reason not to accept the patch already attached to this report? Will it break TclNeedSpace for its existing callers?

dgp added on 2001-09-18 23:47:22:
Logged In: YES user_id=80530

Here's a sequence of Tcl commands broken by this bug.

    % interp create \u5420
    ?
    % interp create [list \u5420 foo]
    ? foo
    % interp alias {} fooset [list \u5420 foo] set
    fooset
    % interp target {} fooset
    ?foo

Re-opening the bug.

dgp added on 2001-09-18 23:17:50:
Logged In: YES user_id=80530

Self-explanatory and revealing.

I think you're missing the point, Jeff. Adrian's [sendCharList] command is trying to return the result

    [list \u5420 \u5320 \u9760 \u7d20]

but it's failing because Tcl_AppendElement is mangling his UTF-8 characters that he has encoded "by hand".

If I can manage it, I'll post a Tcl script that demos the bug. I think such a script is possible. Tcl_AppendElement calls haven't been entirely banished from the Tcl source code.
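
One way to make a trailing-character check of the kind located in TclNeedSpace() UTF-8 aware is to back up to the first byte of the last character before classifying it, instead of looking at the final byte alone. The sketch below is an illustration under stated assumptions, not the patch attached to this report or the fix that was committed: the function names are hypothetical, the input is assumed to be well-formed UTF-8, and only ASCII whitespace is treated as a separator.

```c
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first byte of the last UTF-8 character in s.
 * Continuation bytes have the bit pattern 10xxxxxx, so we step back
 * over them until a lead byte (or the start of the string) is reached.
 * For an empty string, s itself is returned. */
static const char *utf8_last_char(const char *s)
{
    size_t n = strlen(s);
    const char *p;

    if (n == 0) {
        return s;
    }
    p = s + n - 1;
    while (p > s && ((unsigned char)*p & 0xC0) == 0x80) {
        p--;
    }
    return p;
}

/* Returns 1 if the last character of s is ASCII whitespace, else 0.
 * A multi-byte character never qualifies, whatever its final byte. */
static int ends_with_ascii_space(const char *s)
{
    unsigned char c = (unsigned char)*utf8_last_char(s);

    return c == ' ' || (c >= '\t' && c <= '\r');
}
```

With this approach, "\345\220\240" and "foo\302\240" are both reported as not ending in a separator, because the character under test starts with a lead byte (0xE5 or 0xC2) rather than with the trailing 0xA0.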
hobbs added on 2001-09-18 22:03:01:
Logged In: YES user_id=72656

This should be self-explanatory:

    (hobbs) 50 % set var \345\220\240
    吠
    (hobbs) 51 % string length $var
    3
    (hobbs) 52 % string bytelength $var
    6

dkf added on 2001-09-18 21:34:28:
Logged In: YES user_id=79902

Jeff just happens to be wrong. :^)

The example code contains valid UTF-8 strings. The problem is that TclNeedSpace doesn't know anything about UTF-8 and therefore anything depending on it (Tcl_AppendElement, Tcl_DStringAppendElement and Tcl_DStringStartSublist, says a search with grep, plus goodness knows how much in extensions as the code is in the stub table) is *not* UTF-8 safe. Unfortunately, none of those three public functions (two of which are not deprecated at all) warns in its documentation that it is unsafe to pass UTF-8 strings to it. :^(

The problems in TclNeedSpace are really the 'end--', which is fundamentally wrong on UTF-8 strings, and the way it detects what character it is looking at, which needs to be much more careful when looking at bytes outside \000-\177. Plus isspace is not usually Unicode-aware...

dgp added on 2001-09-18 12:46:47:
Logged In: YES user_id=80530

Sorry if I'm being dense, but what is it about the strings in Adrian's example that makes them invalid UTF-8 strings? Is it the terminating null bytes? How would Tcl_ExternalToUtf be added to the reported example code to solve the problem?

hobbs added on 2001-09-18 07:06:01:
Logged In: YES user_id=72656

Ah, but you are making a fatal flaw in your argument - you are *not* passing UTF-8 strings - you are passing incorrectly formed strings through Tcl. If you converted these to UTF-8 first (with Tcl_ExternalToUtf), this would not have happened. That isn't to say this still doesn't need fixing - but it is one of those areas in the core where the distinction between using utf-8 and raw data became important.

arobert3434 added on 2001-09-18 06:58:33:
Logged In: YES user_id=146959

This is NOT a solution.
If you don't want to change any code, you should at least clarify the documentation so that people in the future don't waste their time. The documentation should state at the very least that List-related methods should NOT be used with UTF-8 strings for communications between C and Tcl.

Please see the comments submitted earlier for this bug for additional clarification. Thank you.

hobbs added on 2001-05-04 04:08:17:
Logged In: YES user_id=72656

The basic answer at this point is that if you want space chars to be thought of as space chars in Tcl, you should restrict yourself to the ascii 7-bit set, of which \240 isn't part. It works on some systems, where the locale isspace('\240') is 1, but that's not reliable.

arobert3434 added on 2001-04-02 08:27:58:
Logged In: YES user_id=146959

Yes, OK, the suggestion in Tip #20 just mentioned of adding a locale-independent isspace() to Tcl and using that would prevent the problem I had, which arises because 0240 is defined as a "no-break" space in a number of important character encodings, such as ISO-8859-1. This leads a great many locales, including en_US, to define 0240 as being in the whitespace category. Since many UTF characters have 0240 inside them, this can lead to problems...

msofer added on 2001-03-30 07:07:31:
Logged In: YES user_id=148712

This bug is related to bugs #408568 and #227512.
See TIP #20 at http://www.cs.man.ac.uk/fellowsd-bin/TIP/

dgp added on 2001-03-29 15:55:06:
File Added - 4717: tclUtil.c.patch
Logged In: YES user_id=80530

I was talking about the man page for Tcl_AppendElement():

    http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

Now, reading the I18N HOWTO, it looks like I was reading "deprecated" too strongly. Tcl_DStringAppendElement() and Tcl_DStringStartSublist() also rely on TclNeedSpace() and they have not been deprecated, so TclNeedSpace() needs to be fixed after all. This bug is re-opened.

Looking at TclNeedSpace() explains the mysterious platform dependence. The buggy symptoms you report will be present on those platforms/locales for which isspace(0240) returns true.

I've attached a patch that I think will correct the problem. It's possible that it has other undesirable side-effects, so I've assigned this report to one of the maintainers of generic/tclUtil.c for review.

Meanwhile you can use the workaround I posted in the first comment. Tcl_Merge() is safe for UTF-8 strings.

arobert3434 added on 2001-03-29 13:50:12:
Logged In: YES user_id=146959

Also, could you please post a pointer to the documentation you are referring to? It would help clear up other questions like whether Tcl_Merge is affected... For example, the docs at http://dev.scriptics.com/doc/howto/i18n.html do not so much as hint at the problem. They merely say that all the Tcl C APIs expect UTF-8 strings, and that everything should work perfectly if they get them...
arobert3434 added on 2001-03-29 12:11:24:
Logged In: YES user_id=146959

Thanks very much for a response and proposed solution, however the documentation in the man page unfortunately says nothing about this issue. It only says that it is best to use the object versions of the result-handling functions because it is "significantly more efficient". This is hardly incentive to go and learn a framework that is significantly more complex at first sight when all one wants to do is pass a string and everything has been running fast enough as it is. Since string-handling is said to be fully unicode-based in Tcl/Tk 8.1 and above, the default assumption on a developer's part is to assume that "string" means "internationalized, UTF-8, or what have you string", and that Tcl_AppendElement therefore does not present a problem.

The real solution it seems to me is to repair the deficiency in TclNeedSpace(), but there may be other constraints, performance among them, that argue against this. If this repair is not made, the documentation for Tcl_AppendElement, and "routines that call it" (how exactly is the typical Tcl/Tk end-developer supposed to know which those are) should be updated to reflect the fact that they should not be used for anything but ASCII. Maybe there is some other documentation that says something about these issues, but it should be in the man page as well.

dgp added on 2001-03-29 05:31:55:
Logged In: YES user_id=80530

TclNeedSpace() is not UTF-8 aware. That's why routines that call it, like Tcl_AppendElement(), are deprecated. (See the documentation.)

Rewrite your command procedure like so:

    Tcl_Obj *resultPtr;
    ...
    Tcl_ResetResult(interp);
    resultPtr = Tcl_GetObjResult(interp);
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s1, -1));
    ...
    Tcl_ListObjAppendElement(interp, resultPtr,
            Tcl_NewStringObj(s4, -1));
    return TCL_OK;

arobert3434 added on 2001-03-28 13:36:19:
File Added - 4692: testTclBug.tgz