Ticket UUID: 584603
Title: 'encoding convertfrom' causes error messages
Type: Bug | Version: obsolete: 8.5a3
Submitter: vincentdarley | Created on: 2002-07-21 19:31:48
Subsystem: 16. Commands A-H | Assigned To: andreas_kupries
Priority: 7 High | Severity:
Status: Closed | Last Modified: 2006-03-15 04:18:59
Resolution: Duplicate | Closed By: andreas_kupries
Closed on: 2006-03-14 21:18:59
Description:
The following command:
    encoding convertfrom identity \342
locks up Tcl in an infinite loop. This has been verified on Windows and MacOS X (unix) with Tcl 8.4b2. It does not appear to happen with Tcl 8.3.

User Comments:
andreas_kupries added on 2006-03-15 04:18:58:
Logged In: YES user_id=75003
Decided to close this as duplicate. Further discussion should be held at 624919, as Don (dgp) proposed.
As a last comment in this thread, I will not be in tears should [encoding convertfrom identity] go the way of the Dodo, except for a special command in tcltest. The public availability just gives us grief.

hobbs added on 2005-10-12 01:34:28:
Logged In: YES user_id=72656
See original change:
    2002-07-30 Andreas Kupries <[email protected]>
    * tests/io.test:
    * generic/tclIO.c (WriteChars): Added flag to break out of loop
      if nothing of the input is consumed at all, to prevent infinite
      looping if called with a non-UTF-8 string. Fixes Bug 584603
      (partially). Added new test "io-60.1". Might need additional
      changes to Tcl_Main so that unprintable results are printed
      as binary data.

dgp added on 2005-01-13 05:40:54:
Logged In: YES user_id=80530
This is the ongoing confusion about what encoding requirements exist for the bytes found at objPtr->bytes.
Originally, in Tcl 8.0, there were no requirements, as it was meant to be a counted string, with no constraints.
After 8.1, slowly more and more parts of the core, including Tcl_Write*(), started assuming the Tcl_Obj's passed to it had UTF-8 strings only at objPtr->bytes.
[encoding convertfrom identity] is the only script-level access remaining among the built-in Tcl commands that can create an objPtr->bytes that is not legal UTF-8.
Bug 624919 covers similar issues. Might want to close this and redirect discussion there.

dkf added on 2005-01-13 04:48:37:
Logged In: YES user_id=79902
The fact that this only causes a problem with the Tk console and nothing else (I tried tkcon and tclsh) indicates it's a console problem. I've no idea what to do about it. :^)

vincentdarley added on 2005-01-13 00:04:55:
Logged In: YES user_id=32170
This hasn't gone away. Type 'encoding convertfrom identity \342' + return into the Windows Tk wish console. You'll get a bgerror (!)
dialog popping up with these contents:
    error writing "stdout": invalid argument
    error writing "stdout": invalid argument
        while executing
    "puts $result"
        (procedure "tk::ConsoleInvoke" line 20)
        invoked from within
    "tk::ConsoleInvoke"
        (command bound to event)
(The infinite loop mentioned in the original report was fixed quickly, but this subsidiary bug still exists.)

dkf added on 2005-01-12 16:26:17:
Logged In: YES user_id=79902
Whatever exactly the problem was, it's gone away now.

vincentdarley added on 2003-09-15 17:40:36:
Logged In: YES user_id=32170
I believe the problem is in these lines in tclCmdAH.c:
    if ((enum options) index == ENC_CONVERTFROM) {
        /*
         * Treat the string as binary data.
         */

        string = (char *) Tcl_GetByteArrayFromObj(data, &length);
        Tcl_ExternalToUtfDString(encoding, string, length, &ds);

        /*
         * Note that we cannot use Tcl_DStringResult here because
         * it will truncate the string at the first null byte.
         */
The 'Tcl_ExternalToUtfDString' call can return an invalid utf string (when used with the identity encoding). I believe that function call should be replaced with a call to the more complex 'Tcl_ExternalToUtf', which is capable of telling us whether the returned string is valid utf or not. We can then take appropriate action for the case where it is invalid utf.
According to the comments below, 'appropriate action' appears to mean that we should generate a bytearray instead.

dkf added on 2003-02-20 22:29:28:
Logged In: YES user_id=79902
I don't understand what needs doing at all. The various Tcl_ExternalTo* functions always confused me. Attach a patch (with tests preferably!) and I'd be happy to review it...
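
The proposal in the 2003-09-15 comment depends on knowing whether the converted bytes form valid UTF-8. As a rough illustration only, here is a self-contained sketch that classifies a byte buffer as well-formed, truncated inside a multibyte sequence, or malformed. The names `utf8_check` and `utf8_status` are hypothetical and are not part of Tcl; a real patch would presumably inspect what Tcl_ExternalToUtf reports rather than re-scan the bytes. The check is deliberately simplified: it does not reject overlong three- and four-byte forms or surrogates, and it rejects the 0xC0 0x80 form that Tcl uses internally for embedded nulls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; these are NOT Tcl's TCL_CONVERT_* values. */
enum utf8_status { UTF8_OK = 0, UTF8_INCOMPLETE = 1, UTF8_INVALID = 2 };

/*
 * Classify a byte buffer as well-formed UTF-8 (UTF8_OK), as ending in
 * the middle of a multibyte sequence (UTF8_INCOMPLETE), or as
 * malformed (UTF8_INVALID). Simplified: overlong 3/4-byte forms and
 * surrogates are not detected.
 */
enum utf8_status utf8_check(const unsigned char *src, size_t len)
{
    size_t i = 0;

    assert(src != NULL || len == 0);
    while (i < len) {
        unsigned char c = src[i];
        size_t avail = len - i - 1;   /* bytes available after the lead byte */
        size_t need, k;

        /* Number of trail bytes implied by the lead byte. */
        if (c < 0x80) need = 0;
        else if (c >= 0xC2 && c <= 0xDF) need = 1;
        else if (c >= 0xE0 && c <= 0xEF) need = 2;
        else if (c >= 0xF0 && c <= 0xF4) need = 3;
        else return UTF8_INVALID;     /* stray trail byte, 0xC0/0xC1, or 0xF5+ */

        /* Every trail byte that is present must look like 10xxxxxx. */
        for (k = 1; k <= need && k <= avail; k++) {
            if ((src[i + k] & 0xC0) != 0x80) return UTF8_INVALID;
        }
        if (avail < need) return UTF8_INCOMPLETE;
        i += need + 1;
    }
    return UTF8_OK;
}
```

Under these assumptions, the single byte \342 (0xE2) is classified as UTF8_INCOMPLETE, while the two bytes 0xC3 0xA2 (the UTF-8 form of U+00E2) are classified as UTF8_OK. A caller could use such a result to choose between creating a string object and a bytearray object.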
vincentdarley added on 2003-02-20 16:05:06:
Logged In: YES user_id=32170
Looked at this some more, the problem seems to be in the encoding command, and this comment below is correct:
"Tcl_EncodingObjCmd should *not* use Tcl_ExternalToUtfDString, but should instead use the more complex Tcl_ExternalToUtf which can be used to the same effect, but returns the extra bits of information which are needed"
re-categorizing to cmdAH.

vincentdarley added on 2002-07-31 16:40:20:
Logged In: YES user_id=32170
Changed title of bug report. In the console this now gives an error message:
    error writing "stdout": invalid argument
        while executing
    "puts $result"
Which can be fixed either by changes to Tcl_Main or to Tcl_EncodingObjCmd. Probably some documentation needs changing as well to clarify what Tcl is supposed to do in these circumstances.

vincentdarley added on 2002-07-31 16:05:34:
Logged In: YES user_id=32170
It's still not clear to me if the remaining bug is a bug in tclCmdAH.c (i.e. Tcl_EncodingObjCmd should really use Tcl_ExternalToUtf and check the results flags), or if it is a bug in Tcl_Main, but either way it still needs further fixing.
Thanks for getting rid of the infinite loop!

andreas_kupries added on 2002-07-31 01:42:50:
Logged In: YES user_id=75003
So, should we reassign this to the encoding system ?
Lowering the priority. The priority-9 hang is gone with the patch just committed. The rest could be cleanup.

andreas_kupries added on 2002-07-31 01:40:11:
Logged In: YES user_id=75003
Patch to I/O system committed to head.

nobody added on 2002-07-24 23:46:34:
Logged In: NO
Aah! I think you have hit on one solution. Tcl_EncodingObjCmd should *not* use Tcl_ExternalToUtfDString, but should instead use the more complex Tcl_ExternalToUtf which can be used to the same effect, but returns the extra bits of information which are needed...

andreas_kupries added on 2002-07-24 23:22:31:
Logged In: YES user_id=75003
Yes, the result is not valid utf-8.
AFAIK [encoding convertfrom identity] is the only command/argument combo which is able to do this.

With regard to generation of a byte-array ... The code of [encoding convertfrom] invokes "Tcl_ExternalToUtfDString". This means that the caller has not enough information to be able to decide if the generated DString is valid utf-8 or not, thus unable to decide whether to generate a string or a bytearray object.

AFAIK [encoding convertfrom (identity)] is used in the testsuite to thoroughly test the encoding/utf-8 system. Outside of this I currently see no application for the identity encoding.

Regarding TclObj / UTF-8 ... Jeff told me yesterday that we have nearly true UTF-8, with the exception of embedded \0's.

Regarding Tcl_Main: IMHO we at least have to tell the user when a result could not be printed due to invalid utf-8 ... Invalid utf-8 should not happen in 99.999% of cases, only with [enc cf ident] as said above.

vincentdarley added on 2002-07-24 22:23:14:
Logged In: YES user_id=32170
Thinking more (but outside my area of expertise -- always fun!)... Perhaps the issue is that 'encoding convertfrom' is generating a bad string object which is not valid utf-8, and in this case it should really generate a bytearray...

dgp added on 2002-07-24 21:47:46:
Logged In: YES user_id=80530
Hmmm... seems there are conflicting specifications for Tcl_Obj ?
The docs for Tcl_NewObj and Tcl_RegisterObjType do not mention a particular encoding for the bytes field, but they explicitly mention the possibility of embedded nulls, which proper Tcl UTF-8 does not contain.
However, the docs for Tcl_WriteObj say that "the UTF-8 characters in writeObjPtr's string representation are converted..." indicating that it makes the UTF-8 assumption.
This should be resolved and spelled out more completely in the docs.

dgp added on 2002-07-24 21:38:34:
Logged In: YES user_id=80530
I don't really see a role for Tcl_Main here.
It's just playing the role of middle-man, taking a result from one Tcl function and passing it on to another Tcl function. If the two pieces are not compatible, I'd say there's a contract being broken somewhere.
Looks like the problem is that there's a Tcl_Obj with a string rep that is not proper UTF-8 encoded.

vincentdarley added on 2002-07-24 20:47:00:
Logged In: YES user_id=32170
The patch looks good; I think you're right -- something does need adding to Tcl_Main so that the error is caught and a binary representation (or whatever) is used. After all, the command itself (encoding convertfrom identity \342) is not an error, so the user shouldn't see an error.

andreas_kupries added on 2002-07-24 07:16:27:
File Added - 27615: 584603.diff
Logged In: YES user_id=75003
Here is a patch to the I/O system which prevents the looping. It throws an error instead.

andreas_kupries added on 2002-07-23 23:44:09:
Logged In: YES user_id=75003
Note: This time WriteChars does not block because of the encoding flags. It blocks because Tcl_UtfToExternal is told to convert \342, which is not a complete UTF-8 character. And it is given \342 to print because that is the result of the [encoding convertfrom] command.
When you do a [puts \342] the system converts this into proper UTF-8 when reading, and this proper UTF-8 is then converted back to \342 for printing, thus no problem.

vincentdarley added on 2002-07-23 23:32:50:
Logged In: YES user_id=32170
It is definitely when the system tries to output the result to console that the loop occurs, because this:
    encoding convertfrom identity \342 ; set a 1
just happily prints '1' in the console, no infinite loop, no error, no problem!
Interestingly, 'puts \342' and 'set a \342' work fine for me.
I'm interested to see in WriteChars this block, which shows there was a different bug which caused an infinite loop:
    /* Fix for SF #506297, reported by Martin Forssen
     * <[email protected]>.
     *
     * The encoding chosen in the script exposing the bug writes out
     * three intro characters when TCL_ENCODING_START is set, but does
     * not consume any input as TCL_ENCODING_END is cleared. As some
     * output was generated the enclosing loop calls UtfToExternal
     * again, again with START set. Three more characters in the out
     * and still no use of input ... To break this infinite loop we
     * remove TCL_ENCODING_START from the set of flags after the first
     * call (no condition is required, the later calls remove an unset
     * flag, which is a no-op). This causes the subsequent calls to
     * UtfToExternal to consume and convert the actual input.
     */

andreas_kupries added on 2002-07-23 10:12:06:
File Added - 27568: 584603.tar.gz
Logged In: YES user_id=75003
And back in the encoding subsystem ... WriteChars uses Tcl_UtfToExternal. In the ok case I get result 0, 1 char read, converted and written. In the failure case result is -1, 0 chars read, converted, written => infinite loop.
Another difference: In the ok case the system uses identity to write the result of the command, definitely something from the wrong init. In the failure case it tries to use iso-8859-1.
Ah, the result -1 means that the chars are the beginning of a multibyte sequence which cannot be converted without more characters. This could point back to the [encoding convertfrom], i.e. should it not have generated valid utf-8 ?
The result of [encoding convertfrom identity \xe2] was \xe2! Invalid utf-8.
Note: WriteChars expects that it is given a valid and complete UTF-8 sequence. Partial UTF-8 is AFAIK not allowed. IIRC Rolf Ade fell into that trap too one time when working on tDOM.
Attaching patch for my trace output and relevant logs.

andreas_kupries added on 2002-07-23 09:33:46:
Logged In: YES user_id=75003
More data ... The command in itself seems to be ok ... Now it makes sense that miguel comes to WriteChars ...
The interpreter is most likely trying to print the result of the conversion to the console and that hangs, not the command itself.

andreas_kupries added on 2002-07-23 09:24:13:
Logged In: YES user_id=75003
Ok. More data: Got the CVS head a few minutes ago, compiled --disable-shared --enable-symbols. Executed the command as described, using the shell in the __build directory__. I had not installed first. I got the message about tclsh being unable to find its script library. The execution of the command was **ok**.
I then installed tcl and executed the command using the installed tclsh. _Now_ I get an infinite loop.
My conclusion is that the loading of the encodings gets into trouble. The main steps of the command are: (1) Check arguments, (2) Load encoding, (3) Tcl_ExternalToUtfDString, (4) Set result ...
More testing, i.e. enclosing (2) into printf's ... The load step is in both cases ok. It is step 3 which hangs.
Full conclusion: Without encoding files the internal 'identity' is used, this is ok. Else 'identity' is loaded from an external file, and this encoding is bogus.

andreas_kupries added on 2002-07-23 08:37:06:
Logged In: YES user_id=75003
At first glance I have trouble believing this. [encoding convertfrom] is not a channel command.
Note line 3146ff in the implementation of 'WriteChars'. Could this be in Tcl_UtfToExternal ?

msofer added on 2002-07-23 07:43:27:
Logged In: YES user_id=148712
I've been following the tracks of this thing - the call that doesn't return is
    WriteChars(chanPtr, src, srcLen)
        called from tclIO.c line 2934
    Tcl_WriteObj(outChannel, resultPtr)
        called from tclMain.c line 407
I do not think this has to do with objects. Who's knowledgeable about the goings-on in there? aku?

dgp added on 2002-07-23 04:58:11:
Logged In: YES user_id=80530
    encoding convertfrom identity \xe2
shows the same thing.

msofer added on 2002-07-23 00:59:57:
Logged In: YES user_id=148712
It happens in 8.4b1 but not in 8.4a4.
msofer added on 2002-07-23 00:54:22:
Logged In: YES user_id=148712
verified on linux
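
The partial fix described in this thread (break out of the WriteChars loop when none of the input is consumed, and report an error instead) can be illustrated outside Tcl with a minimal sketch. Everything here is an assumption made for illustration: `toy_convert` and `write_chars` are hypothetical stand-ins, not the functions in generic/tclIO.c, and the toy converter only looks at lead bytes without validating trail bytes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a UTF-8-to-external converter. It consumes
 * one complete character per call. If the buffer starts with a
 * truncated multibyte sequence it consumes nothing and returns -1,
 * mirroring the "result -1, 0 chars read" case reported in the thread.
 */
static int toy_convert(const unsigned char *src, size_t srcLen, size_t *srcRead)
{
    unsigned char c = src[0];
    size_t need;

    if (c < 0x80) need = 1;                  /* ASCII */
    else if ((c & 0xE0) == 0xC0) need = 2;   /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) need = 3;   /* 1110xxxx */
    else need = 4;                           /* toy: treat the rest as 4-byte */

    if (srcLen < need) {
        *srcRead = 0;
        return -1;
    }
    *srcRead = need;
    return 0;
}

/*
 * Convert the whole buffer. Returns the number of bytes consumed, or
 * -1 if the converter makes no progress. Without the srcRead == 0
 * check, a lone 0xE2 would keep this loop spinning forever.
 */
long write_chars(const unsigned char *src, size_t srcLen)
{
    size_t total = 0;

    assert(src != NULL || srcLen == 0);
    while (total < srcLen) {
        size_t srcRead = 0;

        (void) toy_convert(src + total, srcLen - total, &srcRead);
        if (srcRead == 0) {
            return -1;   /* nothing consumed: bail out with an error */
        }
        total += srcRead;
    }
    return (long) total;
}
```

In this sketch a buffer holding only the byte 0xE2 makes `write_chars` return -1 rather than loop, which corresponds to the "error writing stdout" behaviour reported after the patch, as opposed to the original hang.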