Tcl Source Code

Ticket UUID: 584603
Title: 'encoding convertfrom' causes error messages
Type: Bug Version: obsolete: 8.5a3
Submitter: vincentdarley Created on: 2002-07-21 19:31:48
Subsystem: 16. Commands A-H Assigned To: andreas_kupries
Priority: 7 High Severity:
Status: Closed Last Modified: 2006-03-15 04:18:59
Resolution: Duplicate Closed By: andreas_kupries
    Closed on: 2006-03-14 21:18:59
Description:
The following command:

encoding convertfrom identity \342

locks up Tcl in an infinite loop.  This has been verified on 
Windows and MacOS X (unix) with Tcl 8.4b2.  It does not appear 
to happen with Tcl 8.3.
User Comments: andreas_kupries added on 2006-03-15 04:18:58:
Logged In: YES 
user_id=75003

Decided to close this as duplicate. Further discussion
should be held at 624919, as Don (dgp) proposed.

As a last comment in this thread, I will not be in tears
should [encoding convertfrom identity] go the way of the
Dodo, except for a special command in tcltest. The public
availability just gives us grief.

hobbs added on 2005-10-12 01:34:28:
Logged In: YES 
user_id=72656

See original change:

2002-07-30  Andreas Kupries <[email protected]>

* tests/io.test:
* generic/tclIO.c (WriteChars): Added flag to break out of loop if
  nothing of the input is consumed at all, to prevent infinite
  looping if called with a non-UTF-8 string. Fixes Bug 584603
  (partially). Added new test "io-60.1". Might need additional
  changes to Tcl_Main so that unprintable results are printed as
  binary data.

dgp added on 2005-01-13 05:40:54:
Logged In: YES 
user_id=80530


This is the ongoing confusion about
what encoding requirements exist
for the bytes found at
objPtr->bytes .

Originally, in Tcl 8.0,  there were no
requirements, as it was meant to be
a counted string, with no constraints.

After 8.1, slowly more and more parts
of the core, including Tcl_Write*()
started assuming the Tcl_Obj's 
passed to it had UTF-8 strings only
at objPtr->bytes.

[encoding convertfrom identity] is
the only script level access remaining
among the built-in Tcl commands
that can create an objPtr->bytes
that is not legal UTF-8.

Bug 624919 covers similar issues.
Might want to close this and
redirect discussion there.

dkf added on 2005-01-13 04:48:37:
Logged In: YES 
user_id=79902

The fact that this only causes a problem with the Tk console
and nothing else (I tried tkcon and tclsh) indicates it's a
console problem. I've no idea what to do about it. :^)

vincentdarley added on 2005-01-13 00:04:55:
Logged In: YES 
user_id=32170

This hasn't gone away.  Type 'encoding convertfrom identity
\342' + return into the Windows Tk wish console.  You'll get
a bgerror (!) dialog popping up with these contents:

error writing "stdout": invalid argument
error writing "stdout": invalid argument
    while executing
"puts $result"
    (procedure "tk::ConsoleInvoke" line 20)
    invoked from within
"tk::ConsoleInvoke"
    (command bound to event)

(the infinite loop mentioned in the original report was
fixed quickly, but this subsidiary bug still exists).

dkf added on 2005-01-12 16:26:17:
Logged In: YES 
user_id=79902

Whatever was exactly the problem, it's gone away now.

vincentdarley added on 2003-09-15 17:40:36:
Logged In: YES 
user_id=32170

I believe the problem is in these lines in tclCmdAH.c:

    if ((enum options) index == ENC_CONVERTFROM) {
        /*
         * Treat the string as binary data.
         */

        string = (char *) Tcl_GetByteArrayFromObj(data, &length);
        Tcl_ExternalToUtfDString(encoding, string, length, &ds);

        /*
         * Note that we cannot use Tcl_DStringResult here because
         * it will truncate the string at the first null byte.
         */

The 'Tcl_ExternalToUtfDString' can return an invalid utf
string (when used with the identity encoding). I believe
that function call should be replaced with a call to the
more complex 'Tcl_ExternalToUtf' which is capable of telling
us whether the returned string is valid utf or not.  We can
then take appropriate action for the case where it is
invalid utf.  According to the comments below, 'appropriate
action' appears to mean that we should generate a bytearray
instead.

dkf added on 2003-02-20 22:29:28:
Logged In: YES 
user_id=79902

I don't understand what needs doing at all.  The various
Tcl_ExternalTo* functions always confused me.  Attach a
patch (with tests preferably!) and I'd be happy to review
it...

vincentdarley added on 2003-02-20 16:05:06:
Logged In: YES 
user_id=32170

Looked at this some more, the problem seems to be in 
the encoding command, and this comment below is 
correct: "Tcl_EncodingObjCmd should *not* use 
Tcl_ExternalToUtfDString, but should instead use the 
more complex Tcl_ExternalToUtf which can be used to 
the same effect, but returns the extra bits of 
information which are needed"

re-categorizing to cmdAH.

vincentdarley added on 2002-07-31 16:40:20:
Logged In: YES 
user_id=32170

Changed title of bug report.  In the console this now 
gives an error message:

error writing "stdout": invalid argument
    while executing
"puts $result"

Which can be fixed either by changes to Tcl_Main or to 
Tcl_EncodingObjCmd.  Probably some documentation 
needs changing as well to clarify what Tcl is supposed 
to do in these circumstances.

vincentdarley added on 2002-07-31 16:05:34:
Logged In: YES 
user_id=32170

It's still not clear to me if the remaining bug is a bug in 
tclCmdAH.c (i.e. Tcl_EncodingObjCmd should really use 
Tcl_ExternalToUtf and check the results flags), or if it is 
a bug in Tcl_Main, but either way it still needs further 
fixing.  Thanks for getting rid of the infinite loop!

andreas_kupries added on 2002-07-31 01:42:50:
Logged In: YES 
user_id=75003

So, should we reassign this to the encoding system ?

Lowering the priority. The priority-9 hang is gone with
the patch just committed. The rest could be cleanup.

andreas_kupries added on 2002-07-31 01:40:11:
Logged In: YES 
user_id=75003

Patch to I/O system committed to head.

nobody added on 2002-07-24 23:46:34:
Logged In: NO 

Aah!  I think you have hit on one solution.  
Tcl_EncodingObjCmd should *not* use 
Tcl_ExternalToUtfDString, but should instead use the 
more complex Tcl_ExternalToUtf which can be used to 
the same effect, but returns the extra bits of 
information which are needed...

andreas_kupries added on 2002-07-24 23:22:31:
Logged In: YES 
user_id=75003

Yes, the result is not valid utf-8. AFAIK [encoding convertfrom 
identity] is the only command/argument combo which is able 
to do this. With regard to generation of a byte-array ... The 
code of [encoding convertfrom] 
invokes "Tcl_ExternalToUtfDString". This means that the 
caller has not enough information to be able to decide if the 
generated DString is valid utf-8 or not, thus unable to decide 
whether to generate a string or a bytearray object.

AFAIK [encoding convertfrom (identity)] is used in the 
testsuite to thoroughly test the encoding/utf-8 system. 
Outside of this I currently see no application for the identity 
encoding.

Regarding TclObj / UTF-8 ... Jeff told me yesterday that we 
have nearly true UTF-8, with the exception of embedded \0's.

Regarding Tcl_Main: IMHO we at least have to tell the user 
when a result could not be printed due to invalid utf-8 ... 
Invalid utf-8 should not happen in 99.999% of cases, only with 
[enc cf ident] as said above.

vincentdarley added on 2002-07-24 22:23:14:
Logged In: YES 
user_id=32170

Thinking more (but outside my area of expertise -- 
always fun!)...

Perhaps the issue is that 'encoding convertfrom' is 
generating a bad string object which is not valid utf-8, 
and in this case it should really generate a bytearray...

dgp added on 2002-07-24 21:47:46:
Logged In: YES 
user_id=80530

Hmmm... seems there are conflicting
specifications for Tcl_Obj ?

The docs for Tcl_NewObj and
Tcl_RegisterObjType do not
mention a particular encoding
for the bytes field, but they explicitly
mention the possibility of embedded
nulls, which proper Tcl UTF-8 does
not contain.

However, the docs for Tcl_WriteObj
say that "the UTF-8 characters in
writeObjPtr's string representation
are converted..." indicating that it
makes the UTF-8 assumption.

This should be resolved and spelled
out more completely in the docs.

dgp added on 2002-07-24 21:38:34:
Logged In: YES 
user_id=80530

I don't really see a role for Tcl_Main here.
It's just playing the role of middle-man,
taking a result from one Tcl function and
passing it on to another Tcl function.  If
the two pieces are not compatible, I'd
say there's a contract being broken
somewhere.

Looks like the problem is that there's
a Tcl_Obj with a string rep that is
not proper UTF-8 encoded.

vincentdarley added on 2002-07-24 20:47:00:
Logged In: YES 
user_id=32170

The patch looks good; I think you're right -- something 
does need adding to Tcl_Main so that the error is caught 
and a binary representation (or whatever) is used.  After 
all, the command itself (encoding convertfrom identity 
\342) is not an error, so the user shouldn't see an error.

andreas_kupries added on 2002-07-24 07:16:27:

File Added - 27615: 584603.diff

Logged In: YES 
user_id=75003

Here is a patch to the I/O system which prevents the looping. 
It throws an error instead.

andreas_kupries added on 2002-07-23 23:44:09:
Logged In: YES 
user_id=75003

Note: This time WriteChars does not block because of the 
encoding flags. It blocks because Tcl_UtfToExternal is told to 
convert \342, which is not a complete UTF-8 character.

And it is given \342 to print because that is the result of the 
[encoding convertfrom] command.

When you do a [puts \342] the system converts this into 
proper UTF-8 when reading, and this proper UTF-8 is then 
converted back to \342 for printing, thus no problem.
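
To make the "not a complete UTF-8 character" point concrete, here is a minimal, self-contained C sketch (not code from Tcl; the helper names utf8_expected_len and utf8_is_complete are hypothetical). It only checks that every lead byte is followed by the right number of continuation bytes; it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of bytes a UTF-8 sequence should occupy, judged from its lead
 * byte. Returns 0 for bytes that cannot start a sequence (continuation
 * bytes 0x80-0xBF and the lead values 0xF8-0xFF).
 */
static size_t utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/*
 * Return 1 if buf[0..len) consists only of complete sequences, else 0.
 * Simplified check: structure only, no overlong or surrogate tests.
 */
static int utf8_is_complete(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        size_t n = utf8_expected_len(buf[i]);
        size_t k;

        if (n == 0 || n > len - i) {
            return 0;               /* bad lead byte or truncated sequence */
        }
        for (k = 1; k < n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) {
                return 0;           /* expected a continuation byte */
            }
        }
        i += n;
    }
    return 1;
}
```

Under these assumptions, the lone byte 0xE2 is reported as incomplete (its bit pattern announces a three-byte sequence), while the pair 0xC3 0xA2, the UTF-8 form of U+00E2, passes.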

vincentdarley added on 2002-07-23 23:32:50:
Logged In: YES 
user_id=32170

It is definitely when the system tries to output the result 
to console that the loop occurs, because this:

encoding convertfrom identity \342 ; set a 1

just happily prints '1' in the console, no infinite loop, no 
error, no problem!  Interestingly, 'puts \342' and 'set a 
\342' work fine for me.  I'm interested to see in 
WriteChars this block, which shows there was a 
different bug which caused an infinite loop:

    /* Fix for SF #506297, reported by Martin Forssen
     * <[email protected]>.
     *
     * The encoding chosen in the script exposing the bug writes out
     * three intro characters when TCL_ENCODING_START is set, but does
     * not consume any input as TCL_ENCODING_END is cleared. As some
     * output was generated the enclosing loop calls UtfToExternal
     * again, again with START set. Three more characters in the out
     * and still no use of input ... To break this infinite loop we
     * remove TCL_ENCODING_START from the set of flags after the first
     * call (no condition is required, the later calls remove an unset
     * flag, which is a no-op). This causes the subsequent calls to
     * UtfToExternal to consume and convert the actual input.
     */

andreas_kupries added on 2002-07-23 10:12:06:

File Added - 27568: 584603.tar.gz

Logged In: YES 
user_id=75003

And back in the encoding subsystem ...
WriteChars uses Tcl_UtfToExternal. In
the ok case I get result 0, 1 char read,
converted and written. In the failure case
result is -1, 0 chars read, converted,
written => infinite loop.

Another difference: In the ok case the system
uses identity to write the result of the command,
definitely something from the wrong init. In
the failure case it tries to use iso-8859-1

Ah, the result -1 means that the chars are the
beginning of a multibyte sequence which cannot
be converted without more characters.
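
The hang described here (result -1, nothing read, nothing written, loop again) can be sketched in self-contained C. This is not the tclIO.c code or the committed patch: toy_utf8_to_latin1, write_chars and the status constants are hypothetical stand-ins, and the converter only handles ASCII and the two-byte sequences that map to Latin-1. The point is the guard that stops the loop when a call makes no progress at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes, loosely modelled on the results quoted above. */
enum { CONV_OK = 0, CONV_MULTIBYTE = -1 };
enum { WRITE_OK = 0, WRITE_STALLED = -1, WRITE_NOSPACE = -2 };

/* Toy converter: stops at any byte it cannot convert without more input. */
static int toy_utf8_to_latin1(const unsigned char *src, size_t srcLen,
                              unsigned char *dst, size_t dstLen,
                              size_t *srcRead, size_t *dstWrote)
{
    size_t in = 0, out = 0;
    int status = CONV_OK;

    while (in < srcLen && out < dstLen) {
        unsigned char c = src[in];

        if (c < 0x80) {
            dst[out++] = c;
            in += 1;
        } else if ((c == 0xC2 || c == 0xC3) && in + 1 < srcLen
                && (src[in + 1] & 0xC0) == 0x80) {
            dst[out++] = (unsigned char)(((c & 0x03) << 6)
                    | (src[in + 1] & 0x3F));
            in += 2;
        } else {
            status = CONV_MULTIBYTE;    /* no progress possible on this byte */
            break;
        }
    }
    *srcRead = in;
    *dstWrote = out;
    return status;
}

/* Convert src in chunks; bail out instead of spinning when a call stalls. */
static int write_chars(const unsigned char *src, size_t srcLen,
                       unsigned char *out, size_t outCap, size_t *written)
{
    size_t total = 0;

    assert(src != NULL && out != NULL && written != NULL);
    while (srcLen > 0) {
        unsigned char chunk[16];
        size_t srcRead = 0, dstWrote = 0;

        toy_utf8_to_latin1(src, srcLen, chunk, sizeof chunk,
                &srcRead, &dstWrote);
        if (srcRead == 0 && dstWrote == 0) {
            *written = total;
            return WRITE_STALLED;       /* the guard: without it, loop forever */
        }
        if (dstWrote > outCap - total) {
            *written = total;
            return WRITE_NOSPACE;
        }
        memcpy(out + total, chunk, dstWrote);
        total += dstWrote;
        src += srcRead;
        srcLen -= srcRead;
    }
    *written = total;
    return WRITE_OK;
}
```

With the single byte 0xE2 as input, the toy converter consumes and produces nothing, so write_chars returns WRITE_STALLED on the first pass; without that check the while loop would never terminate, which matches the symptom reported in this ticket.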

This could point back to the [encoding convertfrom],
i.e. should it not have generated valid utf-8 ? The
result of [encoding convertfrom identity \xe2] was
\xe2! Invalid utf-8.

Note: WriteChars expects that it is given a valid
and complete UTF-8 sequence. Partial UTF-8 is AFAIK
not allowed. IIRC Rolf Ade fell into that trap too
one time when working on tDOM.

Attaching patch for my trace output and relevant logs.

andreas_kupries added on 2002-07-23 09:33:46:
Logged In: YES 
user_id=75003

More data ... The command in itself seems to be ok ... Now
it makes sense that miguel comes to WriteChars ... The
interpreter is most likely trying to print the result of the
conversion to the console and that hangs, not the command
itself.

andreas_kupries added on 2002-07-23 09:24:13:
Logged In: YES 
user_id=75003

Ok. More data: Got the CVS head a few minutes ago,
compiled --disable-shared --enable-symbols. Executed
the command as described, using the shell in the
__build directory__. I had not installed first. It
got the message about tclsh being unable to find its
script library. The execution of the command was **ok**.
I then installed tcl and executed the command using
the installed tclsh. _Now_ I get an infinite loop.

My conclusion is that the loading of the encodings gets
into trouble.

The main steps of the command are:
(1) Check arguments,
(2) Load encoding
(3) Tcl_ExternalToUtfDString
(4) Set result

... More testing, i.e. enclosing (2) into printf's ... The
load step is in both cases ok. It is step 3 which hangs.

Full conclusion: Without encoding files the internal
'identity' is used, this is ok. Else 'identity' is loaded
from an external file, and this encoding is bogus.

andreas_kupries added on 2002-07-23 08:37:06:
Logged In: YES 
user_id=75003

At first glance I have trouble believing this.
[encoding convertfrom] is not a channel command.

Note line 3146ff in the implementation of 'WriteChars'.
Could this be in Tcl_UtfToExternal ?

msofer added on 2002-07-23 07:43:27:
Logged In: YES 
user_id=148712

I've been following the tracks of this thing - the call that
doesn't return is

WriteChars(chanPtr, src, srcLen)       called from tclIO.c line 2934
Tcl_WriteObj(outChannel, resultPtr)    called from tclMain.c line 407

I do not think this has to do with objects. Who's knowledgeable
about the goings-on in there? aku?

dgp added on 2002-07-23 04:58:11:
Logged In: YES 
user_id=80530


encoding convertfrom identity \xe2

shows the same thing.

msofer added on 2002-07-23 00:59:57:
Logged In: YES 
user_id=148712

It happens in 8.4b1 but not in 8.4a4.

msofer added on 2002-07-23 00:54:22:
Logged In: YES 
user_id=148712

verified on linux
