Ticket UUID: | 768852 |
Title: | binary format a* $utf8_string - needs u/U |
Type: | RFE | Version: | None |
Submitter: | nobody | Created on: | 2003-07-10 02:55:47 |
Subsystem: | 12. ByteArray Object | Assigned To: | dkf |
Priority: | 5 Medium | Severity: | |
Status: | Closed | Last Modified: | 2004-06-16 05:21:24 |
Resolution: | Rejected | Closed By: | dkf |
Closed on: | 2004-06-15 22:21:24 |
Description: |
I expected that set bar [binary format a* $foo] would result in $foo == $bar, but for a Chinese UTF-8 value of $foo it does not. Please see the attached test script. |
User Comments: |
dkf added on 2004-06-16 05:21:23:
Logged In: YES user_id=79902
Reevaluating this FRQ indicates that the original code should never have worked, but the following should:

    set tmp [encoding convertto utf-8 $foo]
    set tmp [binary format a* $tmp]
    set bar [encoding convertfrom utf-8 $tmp]

Following this, $foo and $bar should be identical, and this will hold true for any encoding instead of utf-8 as long as all the characters in $foo are representable within that encoding. Given that, I've updated the docs to be clearer and contain an example that shows the handling of non-ASCII data. Future problems are now "pilot error"...

samoc added on 2003-07-18 07:32:51:
Logged In: YES user_id=737265
It seems to me that there is a simple solution to this problem. The purpose of binary format is to convert numeric values into specified binary representations. "a" and "A" are the only two types that don't convert some kind of numeric value. They are also unique in that they in fact do no conversion at all; they just dump a character string into the output. Perhaps the reason that no elegant solution presents itself here is that character strings have no place in binary format. I suggest simply deprecating "a" and "A". As it stands they are not backwards compatible anyway, since they don't work with multi-byte characters, and Tcl is supposed to support these transparently. If you have code like this:

    set foo [binary format iai $x $s $y]

do this instead:

    set foo [binary format i $x]$s[binary format i $y]

and if the $s doesn't come out the way you want, use the usual encoding commands to deal with it. If "a" and "A" are to be kept, I would suggest that they should take the string verbatim in its _current_ encoding (not the system encoding). So if you want something else you could do:

    set foo [binary format iai $x [encoding convertto foo $a] $y]

Sam

dkf added on 2003-07-17 16:29:58:
Logged In: YES user_id=79902
The docs are more than a bit flawed, but then so is the 'A' specifier in particular. 'u' and 'U' specifiers are a good idea (though I'd prefer a way to insert strings of a specified encoding, with the system encoding as default) but require a TIP. Switching this bug to the FRQ track.

hobbs added on 2003-07-17 05:27:04:
Logged In: YES user_id=72656
I'm conflicted as to whether that doc fix was correct. I think this is a problem in binary that was just never resolved. One may think that a == ascii, and that stripping the high bytes is correct. But that seems to be the purpose of 'c', correct? Of course, the fact that 'A' means space instead of null padding, versus the LE vs BE of the other small/large reps, shows that the designers never intended to support multi-byte chars with 'a'. I think we need to leave this as is, with improved docs, and add 'u' and 'U' for unicode chars.

dkf added on 2003-07-12 04:23:17:
Logged In: YES user_id=79902
It is now documented that high bytes are *stripped*. That was always the way (the docs just weren't updated when unicode was introduced with 8.1) but it is now documented as such. Leaving this bug open as it is asking for something that possibly ought to work but definitely doesn't right now...

samoc added on 2003-07-11 08:46:49:
Logged In: YES user_id=737265
The manual page says "a: Stores a character string..." and it says "If count is *, then all of the bytes in arg will be formatted." In Tcl, strings are UTF-8; if all the bytes are going to be formatted, then they all need to end up in the result string, without any conversion happening. I think the UTF-8 string should be copied verbatim into the "binary format" result. This seems like a sensible behaviour since it preserves the current behaviour for plain ASCII strings. [binary format a* $x] == $x should hold for any value of x. (I imagine the current behaviour for "a7" would be unpredictable as well, given that the man page talks about bytes, not characters.)
Sam
P.S. I am the original submitter.
P.P.S. We have worked around the bug by changing:

    set r [binary format iia*i $i $j $s $k]

to

    set r [binary format ii $i $j]
    append r $s
    append r [binary format i $k]

I think that the "a" encoding should either be removed (since it does no formatting anyway) or be changed to behave as described above.

dkf added on 2003-07-10 22:31:28:
Logged In: YES user_id=79902
Behaviour in this case is undocumented (!) but since the primary purpose of the [binary] command is interaction with the system, I'm tempted to say that we will handle the a and A formats using [encoding system]. The alternative is to drop the high bytes (which is what happens now) but that's not nice at all. So, proposed change: [binary format a* $str] to use [encoding system] when converting to a byte-stream (which is inserted in the resulting bytearray). Hence we also use [encoding system] when going the other way. Net result is that the attached script will not work. However, the following will:

    binary scan [binary format a* $foo] a* bar

and the strings will be equal IF ALL THE CHARACTERS ARE REPRESENTABLE IN THE SYSTEM ENCODING (which will be fine most of the time). Any comments, Jeff? If you approve, assign it back to me and I'll develop a fix.

dkf added on 2003-07-10 21:29:57:
Logged In: YES user_id=79902
Hmm. Converting to ASCII by adding unicode escapes (easier to port to different test platforms!) gives this:

    set foo \u6211\u56fd
    set bar [binary format a* $foo]
    if { $foo != $bar } { puts "foo != bar" }

Still not sure if those strings *should* be equal or not...

dkf added on 2003-07-10 15:10:30:
Logged In: YES user_id=79902
This is really the same issue as Bug 735364; leaving both bugs open for now because I'm not yet quite sure what the fix ought to be (something with encodings, probably).

nobody added on 2003-07-10 09:55:47:
File Added - 55416: test.tcl |
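Editorial sketch, not part of the original thread: the resolution above amounts to encoding the text explicitly before it is handed to the "a" specifier. The script below is a minimal illustration of that idea, assuming UTF-8 as the chosen encoding. The procs pack_text and unpack_text, and the length-prefixed record layout they use, are hypothetical names and choices made for this example; none of them come from the ticket or from Tcl itself.

```tcl
# Sample value from the thread: two Chinese characters written as escapes.
set foo "\u6211\u56fd"

# Round trip: characters -> UTF-8 bytes -> "a*" field -> bytes -> characters.
set bytes  [encoding convertto utf-8 $foo]
set packed [binary format a* $bytes]
binary scan $packed a* raw
set bar    [encoding convertfrom utf-8 $raw]

puts "byte count: [string length $bytes]"
puts "round trip equal: [expr {$foo eq $bar}]"

# Hypothetical helper: pack int, byte length, encoded text, int.
# The length prefix is an assumption added here so the text can be read back.
proc pack_text {x s y} {
    set b [encoding convertto utf-8 $s]
    return [binary format iia*i $x [string length $b] $b $y]
}

# Hypothetical helper: inverse of pack_text.
proc unpack_text {record} {
    # Read the leading int and the byte length of the text field.
    binary scan $record ii x n
    # Skip the 8 header bytes, then read n text bytes and the trailing int.
    binary scan $record @8a${n}i b y
    return [list $x [encoding convertfrom utf-8 $b] $y]
}

puts [unpack_text [pack_text 1 $foo 2]]
```

The length prefix is only one way to delimit the text field; a fixed-width "aN" field or a terminator byte would serve as well, as long as the count refers to encoded bytes and not to characters.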
Attachments:
- test.tcl added by nobody on 2003-07-10 09:55:47.