Tcl Source Code

Ticket UUID: 768852
Title: binary format a* $utf8_string - needs u/U
Type: RFE
Version: None
Submitter: nobody
Created on: 2003-07-10 02:55:47
Subsystem: 12. ByteArray Object
Assigned To: dkf
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2004-06-16 05:21:24
Resolution: Rejected
Closed By: dkf
Closed on: 2004-06-15 22:21:24
Description:
I expected that
    set bar [binary format a* $foo]
would result in $foo == $bar,
but for a Chinese UTF-8 value of $foo it does not.

Please see attached test script.
User Comments: dkf added on 2004-06-16 05:21:23:

Reevaluating this FRQ indicates that the original code
should never have worked, but the following should:

  set tmp [encoding convertto utf-8 $foo]
  set tmp [binary format a* $tmp]
  set bar [encoding convertfrom utf-8 $tmp]
Following this, $foo and $bar should be identical, and this
will hold true for any encoding instead of utf-8 as long as
all the characters in $foo are representable within that
encoding.

Given that, I've updated the docs to be clearer and contain
an example that shows the handling of non-ASCII data. Future
problems are now "pilot error"...
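
A minimal sketch of the round trip described above, using the two-character test string that appears elsewhere in this ticket. The variable names bytes and packed are illustrative only, and the example assumes UTF-8 as the transfer encoding:

```tcl
# Test string from this ticket: two Chinese characters.
set foo \u6211\u56fd

# Step 1: turn the character string into a byte sequence explicitly.
set bytes [encoding convertto utf-8 $foo]

# Step 2: 'a*' now only ever sees values in the range 0-255.
set packed [binary format a* $bytes]

# Step 3: decode the bytes back into characters.
set bar [encoding convertfrom utf-8 $packed]

puts [string equal $foo $bar]   ;# prints 1
puts [string length $packed]    ;# prints 6 (three UTF-8 bytes per character)
```

The same pattern should hold for another encoding name in place of utf-8, provided every character of $foo is representable in it.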

samoc added on 2003-07-18 07:32:51:

It seems to me that there is a simple solution to this problem.

The purpose of binary format is to convert numeric values into
specified binary representations. "a" and "A" are the only two
types that don't convert some kind of numeric value. They are
also unique in that they in fact do no conversion at all, they just
dump a character string into the output.
Perhaps the reason that no elegant solution presents itself here is
that character strings have no place in binary format.

I suggest simply deprecating "a" and "A". As it stands they are not
backwards compatible anyway since they don't work with multi-byte
characters, and Tcl is supposed to support these
transparently.
If you have code like this
  set foo [binary format iai $x $s $y]
do this instead:
  set foo [binary format i $x]$s[binary format i $y]
and if the $s doesn't come out the way you want, use the usual
encoding commands to deal with it.

If "a" and "A" are to be kept I would suggest that they should
take the string verbatim in its _current_ encoding. (not the system
encoding) So if you want something else you could do:
  set foo [binary format iai $x [encoding convertto foo $s] $y]

Sam
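
As a sketch of the explicit-encoding variant suggested in this comment, the following packs a string between two integers and reads it back. It assumes UTF-8 as the target encoding and uses a* with an explicit length on the scan side; the names payload, rec, x2, y2 and s2 are illustrative only:

```tcl
set x 1
set y 2
set s \u6211\u56fd

# Encode the string first, so 'a*' receives plain bytes.
set payload [encoding convertto utf-8 $s]
set rec [binary format ia*i $x $payload $y]

# 4 bytes + 6 bytes + 4 bytes
puts [string length $rec]       ;# prints 14

# Scanning needs the payload length, since 'a*' would consume
# everything up to the end of the record.
set n [string length $payload]
binary scan $rec ia${n}i x2 raw y2
set s2 [encoding convertfrom utf-8 $raw]

puts [list $x2 $y2 [string equal $s $s2]]   ;# prints 1 2 1
```

In a real record format the payload length would normally be stored in the record as well, so that the reader does not need to know it in advance.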

dkf added on 2003-07-17 16:29:58:

The docs are more than a bit flawed, but then so is the 'A'
specifier in particular.

'u' and 'U' specifiers are a good idea (though I'd prefer a
way to insert strings of a specified encoding, with the
system encoding as default) but require a TIP.  Switching
this bug to the FRQ track.

hobbs added on 2003-07-17 05:27:04:

I'm conflicted as to whether that doc fix was correct.  I think 
this is a problem in binary that was just never resolved.  One 
may think that a == ascii, and that stripping the high bytes is 
correct.  But that seems to be the purpose of 'c', correct?  Of 
course, the fact that 'A' means space instead of null padding 
versus the LE vs BE of the other small/large reps shows that 
the designers never intended to support multi-byte chars 
with 'a'.  I think we need to leave this as is, with improved 
docs, and add 'u' and 'U' for unicode chars.

dkf added on 2003-07-12 04:23:17:

It is now documented that high bytes are *stripped*.  That
was always the way - the docs just weren't updated when
unicode was introduced with 8.1 - but it is now documented
as such.
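
To illustrate what "stripped" means for the test string in this ticket, the sketch below models the truncation arithmetically by keeping only the low 8 bits of each character code. It deliberately does not call binary format on the wide string, since how that call behaves may differ between Tcl versions; the variable name result is illustrative only:

```tcl
set foo \u6211\u56fd
set result {}

foreach ch [split $foo ""] {
    # Obtain the Unicode code point of the character.
    scan $ch %c code
    # Keep only the low byte, which models dropping the high bits.
    set low [expr {$code & 0xFF}]
    lappend result [format "U+%04X -> 0x%02X" $code $low]
}

puts [join $result \n]
# prints:
# U+6211 -> 0x11
# U+56FD -> 0xFD
```

Two unrelated bytes remain, which is why the original comparison of $foo and $bar could not succeed.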

Leaving this bug open as it is asking for something that
possibly ought to work but definitely doesn't right now...

samoc added on 2003-07-11 08:46:49:

The manual page says "a: Stores a character string..." and it says
"If count is *, then all of the bytes in arg will be formatted."
In Tcl, strings are UTF-8; if all the bytes are going to be formatted, 
then they all need to end up in the result string, without any 
conversion happening.
I think the UTF-8 string should be copied verbatim into the "binary 
format" result. This seems like a sensible behaviour since it 
preserves the current behaviour for plain ascii strings.
[binary format a* $x] == $x should hold for any value of x.

(I imagine the current behaviour for "a7" would be unpredictable 
as well given that the man page talks about bytes not characters.)

Sam

P.S. I am the original submitter.

P.P.S. We have worked around the bug by changing:
    set r  [binary format iia*i $i $j $s $k]
to
    set r [binary format ii $i $j]
    append r $s
    append r [binary format i $k]

I think that the "a" encoding should either be removed (since it 
does no formatting anyway) or be changed to behave as described 
above.

dkf added on 2003-07-10 22:31:28:

Behaviour in this case is undocumented (!) but since the 
primary purpose of the [binary] command is interaction with 
the system, I'm tempted to say that we will handle the a and 
A formats using [encoding system].  The alternative is to 
drop the high bytes (which is what happens now) but that's 
not nice at all.

So, proposed change:
  [binary format a* $str] to use [encoding system] when 
converting to a byte-stream (which is inserted in the 
resulting bytearray.)  Hence we also use [encoding system] 
when going the other way too.

Net result is that the attached script will not work.  However 
the following will:
  binary scan [binary format a* $foo] a* bar
and the strings will be equal IF ALL THE CHARACTERS ARE 
REPRESENTABLE IN THE SYSTEM ENCODING (which will be 
fine most of the time.)

Any comments Jeff?  If you approve, assign it back to me and 
I'll develop a fix.
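
The proposed behaviour can be approximated in script with a pair of wrapper procedures. This is only a sketch: packString and unpackString are hypothetical names, not part of Tcl, and the optional encoding argument (defaulting to [encoding system]) is an assumption added for illustration:

```tcl
# Hypothetical helper: encode a string, then store it with 'a*'.
proc packString {str {enc ""}} {
    if {$enc eq ""} { set enc [encoding system] }
    return [binary format a* [encoding convertto $enc $str]]
}

# Hypothetical helper: read the bytes back and decode them.
proc unpackString {bytes {enc ""}} {
    if {$enc eq ""} { set enc [encoding system] }
    binary scan $bytes a* raw
    return [encoding convertfrom $enc $raw]
}

set foo \u6211\u56fd
# utf-8 is passed explicitly so the result does not depend on the platform.
set bar [unpackString [packString $foo utf-8] utf-8]
puts [string equal $foo $bar]   ;# prints 1
```

With the default argument the round trip depends on the system encoding, and it can only preserve characters that encoding is able to represent.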

dkf added on 2003-07-10 21:29:57:

Hmm.  Converting to ASCII by adding unicode escapes (easier 
to port to different test platforms!) gives this:
  set foo \u6211\u56fd
  set bar [binary format a* $foo]
  if { $foo != $bar } {
     puts "foo != bar"
  }
Still not sure if those strings *should* be equal or not...

dkf added on 2003-07-10 15:10:30:

This is really the same issue as Bug 735364; leaving both
bugs open for now because I'm not yet quite sure what the
fix ought to be (something with encodings, probably.)

nobody added on 2003-07-10 09:55:47:

File Added - 55416: test.tcl

Attachments: