Tcl Source Code

Ticket UUID: 411825
Title: Passing list w/UTF-8 from C can fail
Type: Bug
Version: obsolete: 8.4.4
Submitter: arobert3434
Created on: 2001-03-28 06:36:18
Subsystem: 10. Objects
Assigned To: dgp
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2003-08-28 18:58:04
Resolution: Fixed
Closed By: dgp
Closed on: 2003-08-27 20:11:02
Description:
On certain installations of Tcl/Tk 8.3.1, the passing of UTF-8
character-triplets ending in octal 240 (decimal 160, hex A0)
interferes with list delimitation when Tcl_AppendElement is used
to return a result from a C function.  In particular, if a UTF-8
string ending in octal 240 is appended to the result, and then
another UTF-8 string is appended afterwards, the octal 240 seems
to be interpreted as a "forward delete" character of some kind,
with the result that the separation between the two list elements
is erased and they are interpreted as one.

The following C function, when called from Tcl, illustrates the
problem.

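#include <string.h>
#include <tcl.h>

/* Returns a four-element list; each element is a single three-byte
 * UTF-8 character whose final byte is octal 240. */
int sendCharList(ClientData clientData, Tcl_Interp *interp,
                 int argc, char **argv)
{
    char s1[5], s2[5], s3[5], s4[5];

    strcpy(s1, "\345\220\240");
    strcpy(s2, "\345\214\240");
    strcpy(s3, "\351\235\240");
    strcpy(s4, "\347\264\240");

    Tcl_ResetResult(interp);

    /* Each call should add one list element to the result. */
    Tcl_AppendElement(interp, s1);
    Tcl_AppendElement(interp, s2);
    Tcl_AppendElement(interp, s3);
    Tcl_AppendElement(interp, s4);

    return TCL_OK;
}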

The Tcl calls:

set s6 [sendCharList]
puts "[llength $s6] , [string length $s6]"

should output "4 , 7" (4 list elements, each a single UTF-8
composite character, plus 3 delimiters).  On some systems it does.
On others, however, the output is "1 , 4", resulting from
deletion of the list delimiters somewhere during passage from C
to Tcl.  A complete test program involving the above (plus some
additional tests and using wish not tclsh) may be accessed at:
ftp://zakros.ucsd.edu/arobert/Temp/testTclBug.tgz (it is also
attached).

A full application that exposes the bug (and led to its
discovery) may be found at:
http://freshmeat.net/projects/hanzim

Unfortunately, I have not been able to isolate why some
installations exhibit the bug and some don't.  A default SUSE 7.0
Linux installation of 8.3.1 had the problem, while a default
Slackware 7.1 installation of the same Tcl/Tk version did not.
Maybe it is a compilation flag difference... ?

I'm also not sure whether it persists in 8.3.2 or 8.4.

User Comments:

dossy added on 2003-08-28 18:58:04:
Logged In: YES 
user_id=21885

Donal -- yes, I see your point and now I agree.  The rule is 
that list elements ending in a list delimiter get quoted, and 
since \302\240 is now no longer considered a list delimiter, it 
doesn't cause quoting to happen.  Thanks.

Don -- I understand what's supposed to happen (at least, I 
thought I did) but then, explain this:

% encoding system identity
% fconfigure stdout -encoding binary -translation binary
% TestCmd
foo  bar
% string length [TestCmd]
8
% string bytelength [TestCmd]
9

I would have expected to get "foo\302\240 bar" and not 
just "foo\240 bar".  It's clear from string bytelength that the 
\302 is in there, but when I set stdout encoding to binary, it 
should give me the raw UTF-8 (9 bytes) and not the 
transcoded ISO-8859-1 representation (8 bytes), right?

Or, am I misunderstanding what "-encoding binary" means and 
what "encoding system identity" does?  I mean, this actually 
does what I expect:

% fconfigure stdout -encoding identity
% TestCmd
foo  bar

Now it outputs "foo\302\240 bar" -- why will it do that on
"-encoding identity" but not "-encoding binary"?

Perhaps we can take this discussion to the wiki or email since 
it's not directly related to this particular bug -- let me know 
what works best for you.

dkf added on 2003-08-28 06:40:59:
Logged In: YES 
user_id=79902

Behaviour is correct.  UTF-8 sequence \302\240 corresponds to ISO8859-1 character \240 (i.e. 
non-breaking space.)  Non-breaking space is (now, with DGP's patch) considered to not be a space 
character and hence not in need of quoting.

dgp added on 2003-08-28 06:39:06:
Logged In: YES 
user_id=80530


Let me advise you try again tomorrow.
By then the anonymous CVS at SF
will have caught up to all my commits.

the output you describe sounds
correct to me.  Before you file
another bug report, be  sure you
understand that Tcl uses UTF-8
encoding internally and by default
converts to your system encoding
on output.

The two byte sequence \302\240
is the UTF-8 encoding for the character
known in Tcl-Unicode notation as \u00a0
which is the non-breaking space.  When
you write that character to output on
a system with system encoding of
iso8859-1 it gets written as the single
byte \240 which is the same character
in that encoding.  Likewise, if you were
to read in the byte \240 on the same
system, Tcl will convert it back to UTF-8
so by the time Tcl sees it again, it will
be the 2-byte sequence \302\240 .

When you work with an interactive tclsh,
the results you see have actually been
written to stdout, and are in the system
encoding.

If you don't completely follow what I just
said, do not file another bug report yet,
but let's find another channel to straighten
out any misunderstandings about how Tcl
encodings are supposed to work.

dossy added on 2003-08-28 06:12:20:
Logged In: YES 
user_id=21885

Your patch only included tests util-8.5 and util-8.6.  I just 
checked HEAD and core-8-4-branch and the util.test file 
stops at util-8.1.

I'm showing the last checkin for tests/util.test as:

revision 1.11
date: 2003/07/24 16:05:24;  author: dgp;  state: Exp;  lines: +37 -4

I assume this means you didn't get to check your change in, 
yet?

Either way, the C test case I provided on 2003-08-25 20:37 
passes after applying the patch, kinda.  [llength [TestCmd]] 
== 2, but now look at what TestCmd outputs:

% encoding system
iso8859-1
% TestCmd
foo  bar

Pushing that through "od -xc", here's the actual bytes that 
get output:

666f 6fa0 2062 6172
f   o   o 240       b   a   r

Instead of \302\240 coming back out, only \240 came back.  
At least this is a *different* problem to solve, now.  At least 
before it would return "foo\302\240bar" -- now, it's 
returning "foo\240 bar" -- I'm not exactly sure which is 
worse.  :-)

However, the behavior I described on 2003-08-26 13:22 
hasn't changed, list elements ending in \302\240 don't get 
wrapped with {}.  Suppose I should file this as a new bug, 
now?

dossy added on 2003-08-28 03:22:28:
Logged In: YES 
user_id=21885

Thank you so much, Don.  We're going to apply the patch 
and do our tests.  I'll let you know how it goes!

dgp added on 2003-08-28 02:59:29:

File Added - 59965: 411825.patch

Logged In: YES 
user_id=80530


Here's a copy of the patch I am
committing to HEAD and to
core-8-4-branch.

dgp added on 2003-08-28 00:58:05:
Logged In: YES 
user_id=80530


committed new tests to test suite
   util-8.3 shows dossy's reported bug
   util-8.4 shows another TclNeedSpace bug
Fix on the way.

dgp added on 2003-08-27 04:47:39:
Logged In: YES 
user_id=80530


sorry, but it's gonna be another day.
Just as I was testing the patch, a big
storm came through and knocked off
power.  Power's back, but the disk
on which the patch is stored has
not come back online yet.  Will
get back to this tomorrow.

dossy added on 2003-08-27 02:25:59:
Logged In: YES 
user_id=21885

Sounds good, Don.  Thanks for the quick response on this!

dgp added on 2003-08-27 02:21:33:
Logged In: YES 
user_id=80530


Tell you what.  Let me commit
a fix to the re-opened bug (should
be able to start work on that shortly;
should not take long).  Then 
after that fix is in, if you still find
something not meeting your
expectations, you can file a new
bug report on that.  Thanks.

dossy added on 2003-08-27 00:22:57:
Logged In: YES 
user_id=21885

I don't know if this should be entered as a separate bug, but 
it's related to this problem (similar fix should address both):

% set a [list "abc  "]
{abc  }

This is correct -- since the list element ends in whitespace, 
it's wrapped with {} for its string representation.

% encoding system
iso8859-1

% set a [list "abc\240\240"]
abc  

Here, the string is "abc\240\240" but it's not being wrapped 
by {}.  But, if [string is space \240] is 1, shouldn't it be?

% encoding system utf-8
% set a [list [encoding convertfrom utf-8 "abc\302\240\302\240"]]
abc  

Here, [string is space [encoding convertfrom utf-8 \302\240]] 
= 1.  Again, the list element isn't being wrapped with {} -- 
why?

Parts of Tcl treat \240 or \302\240 as a space (and thus 
don't insert a list delimiter character) but others don't treat it 
as a space, so stringified list elements don't get {} wrapped 
around them.

-- Dossy

dkf added on 2003-08-26 17:12:25:
Logged In: YES 
user_id=79902

Sorry.  I've no time to chase this today.  :^/

dgp added on 2003-08-26 10:30:59:
Logged In: YES 
user_id=80530


I see the ChangeLog comment from dkf's patch:

        * generic/tclUtil.c (TclNeedSpace): Rewrote to be UTF-8 aware.
        [Bug 411825, but not that patch which would have added extra
        spaces if there was a real non-ASCII space involved. ]

Trouble here is that Tcl_UniCharIsSpace() is the
wrong test.  It is not equivalent to 
Tcl_UniCharIsAListElementTerminator()
which is what we really need to test.  In particular,
the "non-breaking space" \u00A0 returns true
from Tcl_UniCharIsSpace(), but is not recognized
by the list parser in [llength] as a separator of
list elements.

Looks like the prior fix did correct lots of errors.
Prior to the fix, every UTF-8 sequence ending
in the byte \xA0 (or \240) caused trouble with
TclNeedSpace().  After the fix, only the UTF-8
sequence \xC2\xA0 is a problem.

Here's an interactive sequence in plain Tcl
(no C coding required) that demos the remaining
bug:

% interp create \u00a0
�
% interp create [list \u00a0 foo]
� foo
% interp alias {} fooset [list \u00a0 foo] set
fooset
% interp target {} fooset
�foo
% # Just to be really clear...
% llength [interp target {} fooset]
1

Assigning to dkf.  If he doesn't
have it fixed by the time I get
to work tomorrow, we'll get it
done then.

dgp added on 2003-08-26 10:12:11:
Logged In: YES 
user_id=80530


Thank you!  A good example clarifies a lot.

Certainly looks like dkf's patch failed
to fix things, doesn't it?

Not clear to me why the patch attached
to this report wasn't accepted instead.

Re-opening.

dossy added on 2003-08-26 07:37:20:
Logged In: YES 
user_id=21885

====8<==== utfNbspTest.c ====8<====
/*
 * utfNbspTest.c
 *
 * Appending an element to a previous element that ends with the
 * sequence 0xC2A0 (or \302\240), the UTF code for NO-BREAK SPACE,
 * results in an incorrect list.
 *
 * $ gcc -o utfNbspTest utfNbspTest.c -L/path/to/libtcl8.4.* -ltcl8.4
 *
 */

#include <tcl.h>

int
TestCmd(ClientData clientData, Tcl_Interp *interp, int argc, char **argv)
{
    Tcl_AppendElement(interp, "foo\302\240");
    Tcl_AppendElement(interp, "bar");
    return TCL_OK;
}

int
My_AppInit(Tcl_Interp *interp)
{
    Tcl_CreateCommand(interp, "TestCmd", (Tcl_CmdProc *) TestCmd, NULL, NULL);
    return TCL_OK;
}

int
main(int argc, char **argv)
{
    Tcl_Main(argc, argv, My_AppInit);
    Tcl_Exit(0);
    /* NOTREACHED */
    return 0;
}
====8<==== utfNbspTest.c ====8<====

Here's the transcript showing the error:

$ ./utfNbspTest 
% set tcl_patchLevel
8.4.4
% encoding system utf-8
% encoding system
utf-8
% set x [TestCmd]
foo bar
% llength $x
1
% string length $x
7
% string bytelength $x
8
% exit


Of course, yes I know:

1) I should Tcl_Obj'ify everything.
2) Tcl_AppendElement is deprecated (supposedly!)

However, I'm dealing with a good amount of legacy code that 
will eventually get changed/modernized, but for now, it needs 
to work.  If Tcl >8.1 isn't backward compatible, that's fine.  
But, to call Tcl_AppendElement "deprecated" when it isn't 
backward compatible, well ... that's just wrong.

-- Dossy

dgp added on 2003-08-26 05:06:59:
Logged In: YES 
user_id=80530


Provide the C code that calls Tcl_AppendElement()
and that gives results that are incorrect in either
Tcl 8.4.4 or the HEAD.

dossy added on 2003-08-26 04:42:18:
Logged In: YES 
user_id=21885

I'd really hate to pick at an old scab (this bug was closed 
back in 09/2001) but exactly what was "fixed" by dkf's 
commit?

Against Tcl 8.4.4, using Tcl_AppendElement() which I know is 
deprecated, the problem is still occurring.  I guess it has to 
do with this behavior:

$ string is space [encoding convertfrom utf-8 \302\240]
1

What's annoying is if you do:

> set a foo\302\240
fooà
> set a [encoding convertfrom utf-8 foo\302\240]
foo 
> lappend a bar
foo  bar
> llength $a
2
> string bytelength $a
9

That does the right thing.  But if you Tcl_AppendElement(), 
you'll get "foo\302\240bar", which is bad.

-- Dossy

dkf added on 2001-09-19 15:53:42:
Logged In: YES 
user_id=79902

Test and fix committed (SF seems to be working at the mo...)

dgp added on 2001-09-19 04:23:29:
Logged In: YES 
user_id=80530

Assigning to dkf, since he can't log in and assign it 
himself.

dgp added on 2001-09-19 00:10:50:
Logged In: YES 
user_id=80530

The bug is in TclNeedSpace(), in generic/tclUtil.c,
part of the Objects Category.

Is there a reason not to accept the patch already
attached to this report?  Will it break 
TclNeedSpace for its existing callers?

dgp added on 2001-09-18 23:47:22:
Logged In: YES 
user_id=80530

Here's a sequence of Tcl commands broken by this bug.

% interp create \u5420
?
% interp create [list \u5420 foo]
? foo
% interp alias {} fooset [list \u5420 foo] set
fooset
% interp target {} fooset
?foo

Re-opening the bug.

dgp added on 2001-09-18 23:17:50:
Logged In: YES 
user_id=80530

Self-explanatory and revealing.  I think you're missing
the point, Jeff.  Adrian's [sendCharList] command
is trying to return the result

[list \u5420 \u5320 \u9760 \u7d20]

but it's failing because Tcl_AppendElement is
mangling his UTF-8 characters that he has
encoded "by hand".

If I can manage it, I'll post a Tcl script that
demos the bug.  I think such a script is possible.
Tcl_AppendElement calls haven't been entirely banished
from the Tcl source code.

hobbs added on 2001-09-18 22:03:01:
Logged In: YES 
user_id=72656

This should be self-explanatory:

(hobbs) 50 % set var \345\220\240
吠
(hobbs) 51 % string length $var
3
(hobbs) 52 % string bytelength $var
6

dkf added on 2001-09-18 21:34:28:
Logged In: YES 
user_id=79902

Jeff just happens to be wrong.  :^)

The example code contains valid UTF-8 strings.  The problem
is that TclNeedSpace doesn't know anything about UTF-8 and
therefore anything depending on it (Tcl_AppendElement,
Tcl_DStringAppendElement and Tcl_DStringStartSublist, according
to a search with grep, plus goodness knows how much in extensions,
as the code is in the stub table) is *not* UTF-8 safe.

Unfortunately, none of those three public functions (two of
which are not deprecated at all) warns in its documentation
that it is unsafe to pass UTF-8 strings to it.  :^(

The problems in TclNeedSpace are really the 'end--' which is
fundamentally wrong on UTF-8 strings, and the way it detects
what character it is looking at which needs to be much more
careful when looking at bytes outside \000-\177.  Plus
isspace is not usually Unicode-aware...

dgp added on 2001-09-18 12:46:47:
Logged In: YES 
user_id=80530

Sorry if I'm being dense, but what is it about the
strings in Adrian's example that makes them invalid
UTF-8 strings?  Is it the terminating null bytes?
How would Tcl_ExternalToUtf be added to the
reported example code to solve the problem?

hobbs added on 2001-09-18 07:06:01:
Logged In: YES 
user_id=72656

Ah, but you are making a fatal flaw in your argument - you 
are *not* passing UTF-8 strings - you are passing 
incorrectly formed strings through Tcl.  If you converted 
these to UTF-8 first (with Tcl_ExternalToUtf), this would 
not have happened.  That isn't to say this still doesn't 
need fixing - but it is one of those areas in the core 
where the distinction between using utf-8 and raw data 
became important.

arobert3434 added on 2001-09-18 06:58:33:
Logged In: YES 
user_id=146959

This is NOT a solution.  If you don't want to change any
code, you should at least clarify the documentation so that
people in the future don't waste their time.  The
documentation should state at the very least that
List-related methods should NOT be used with UTF-8 strings
for communications between C and Tcl.  Please see the
comments submitted earlier for this bug for additional
clarification.  Thank you.

hobbs added on 2001-05-04 04:08:17:
Logged In: YES 
user_id=72656

The basic answer at this point is that if you want space 
chars to be thought of as space chars in Tcl, you should 
restrict yourself to the ascii 7-bit set, of which \240 
isn't part.  It works on some systems, where the locale 
isspace('\240') is 1, but that's not reliable.

arobert3434 added on 2001-04-02 08:27:58:
Logged In: YES 
user_id=146959

Yes, OK, the suggestion in Tip #20 just mentioned of adding
a locale-independent
isspace() to Tcl and using that would prevent the problem I
had, which arises
because 0240 is defined as a "no-break" space in a number of
important character
encodings, such as  ISO-8859-1.  This leads a great many
locales, including
en_US, to define 0240 as being in the whitespace category. 
Since many UTF
characters have 0240 inside them, this can lead to
problems...

msofer added on 2001-03-30 07:07:31:
Logged In: YES 
user_id=148712

This bug is related to bugs #408568 and #227512.
See TIP #20 at 
    http://www.cs.man.ac.uk/fellowsd-bin/TIP/

dgp added on 2001-03-29 15:55:06:

File Added - 4717: tclUtil.c.patch

Logged In: YES 
user_id=80530

I was talking about the man page for Tcl_AppendElement():

http://dev.scriptics.com/man/tcl8.3.2/TclLib/SetResult.htm

Now, reading the I18N HOWTO, it looks like I was reading
"deprecated" too strongly.  Tcl_DStringAppendElement() and
Tcl_DStringStartSublist() also rely on TclNeedSpace() and
they have not been deprecated, so TclNeedSpace() needs to
be fixed after all.  This bug is re-opened.

Looking at TclNeedSpace() explains the mysterious platform
dependence.  The buggy symptoms you report will be present
on those platforms/locales for which isspace(0240) returns
true. 

I've attached a patch that I think will correct the problem.
It's possible that it has other undesirable side-effects, so
I've assigned this report to one of the maintainers of
generic/tclUtil.c for review.

Meanwhile you can use the workaround I posted in the first
comment.

Tcl_Merge() is safe for UTF-8 strings.

arobert3434 added on 2001-03-29 13:50:12:
Logged In: YES 
user_id=146959

Also, could you please post a pointer to the documentation
you are referring to?  It would help clear up other
questions like whether Tcl_Merge is affected...

For example, the docs at
http://dev.scriptics.com/doc/howto/i18n.html do not so much
as hint at the problem.  They merely say that all the Tcl C
APIs expect UTF-8 strings, and that everything should work
perfectly if they get them...

arobert3434 added on 2001-03-29 12:11:24:
Logged In: YES 
user_id=146959

Thanks very much for a response and proposed solution,
however the documentation in the man page unfortunately
says nothing about this issue.  It only says that it is
best to use the object versions of the result-handling
functions because it is "significantly more efficient".
This is hardly incentive to go and learn a framework that
is significantly more complex at first sight when all one
wants to do is pass a string and everything has been
running fast enough as it is.  Since string-handling is
said to be fully unicode-based in Tcl/Tk 8.1 and above,
the default assumption on a developer's part is to assume
that "string" means "internationalized, UTF-8, or what have
you string", and that Tcl_AppendElement therefore does not
present a problem.

The real solution it seems to me is to repair the deficiency
in TclNeedSpace(), but there may be other constraints,
performance among them, that argue against this.  If this
repair is not made, the documentation for Tcl_AppendElement,
and "routines that call it" (how exactly is the typical
Tcl/Tk end-developer supposed to know which those are)
should be updated to reflect the fact that they should not
be used for anything but ASCII.  Maybe there is some other
documentation that says something about these issues, but
it should be in the man page as well.

dgp added on 2001-03-29 05:31:55:
Logged In: YES 
user_id=80530

TclNeedSpace() is not UTF-8 aware.  That's why routines
that call it, like Tcl_AppendElement() are deprecated.
(See the documentation.)

Rewrite your command procedure like so:
Tcl_Obj *resultPtr;
...
Tcl_ResetResult(interp);
resultPtr = Tcl_GetObjResult(interp);
Tcl_ListObjAppendElement(interp, resultPtr,
Tcl_NewStringObj(s1, -1));
...
Tcl_ListObjAppendElement(interp, resultPtr,
Tcl_NewStringObj(s4, -1));
return TCL_OK;

arobert3434 added on 2001-03-28 13:36:19:

File Added - 4692: testTclBug.tgz
