Ticket UUID: 523988
Title: Decoding broken for iso-2022-jp
Type: Bug
Version: obsolete: 8.3.4
Submitter: atfrost
Created on: 2002-02-28 18:39:42
Subsystem: 44. UTF-8 Strings
Assigned To: hobbs
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2002-03-02 10:16:03
Resolution: Fixed
Closed By: hobbs
Closed on: 2002-03-02 03:16:03
Description:

Tcl 8.3.4 does not decode iso-2022-jp data properly under certain conditions. Attached is a file "iso2022.sample". This file is also located at http://www.nevernever.org/iso2022.sample

When this file is read line-by-line using "gets", only the first 62 bytes of each line are decoded properly. The remainder of the line is decoded as rubbish, e.g. a string literal such as:

    %e%"%&%H$N:]$N

A sample of the garbled output (re-encoded in iso-2022-jp) can be found at http://www.nevernever.org/iso2022.out

Here is a code segment which will produce output with mangled lines in $msg:

    set f [open $filename r]
    fconfigure $f -encoding iso2022-jp
    set msg ""
    while { ![eof $f] } {
        set line [gets $f]
        append msg "$line\n"
    }
    close $f

Interestingly, this problem is not exhibited when the file is read using "read", as in:

    set f [open $filename r]
    fconfigure $f -encoding iso2022-jp
    set msg [read $f]
    close $f

A Usenet thread discussing this issue (subject was "trouble decoding iso2022-jp") can be found at http://groups.google.com/groups?hl=en&frame=right&th=b7c438757333dab2

Jeff Hobbs posted:

"OK, on this one Thomas keyed me in by pointing out that only his strict example fails (using gets - read is OK). So I honed in on that, and was further keyed in that he noted only X chars ever get translated. Poking around, I found the problem in tclIO.c:FilterInputBytes. It has something to do with the value of the ENCODING_LINESIZE #define (currently 30). If I bump that up to 60, I can read Thomas' sample just fine, and if I drop it to 20, it stops the correct encoding translation even earlier per line. That's obviously not a correct fix, but it does indicate that FilterInputBytes isn't encoding right. I'll have to look into this more when it's not past midnight ..."
User Comments:

hobbs added on 2002-03-02 10:16:03:
File Added - 18689: 523988.patch

I figured out the problem ... the ChannelState's inputEncodingFlags was not getting the TCL_ENCODING_START flag ever turned off. 'read' would do this, but not 'gets'. This meant that 'gets' would see the initial escape and jump to jis0208 mode, but after reading in a small buffer's worth of data (somewhat related to ENCODING_LINESIZE), it reset to the default encoding in the table again (iso8859-1). Increasing the ENCODING_LINESIZE parameter just extended the amount of data that fit in that initial buffer.

In the patch I actually lower the ENCODING_LINESIZE value slightly, as it speeds up 'gets' somewhat. Committed to 8.3.4+ and 8.4a4cvs. Added tests based on the sample data to encoding.test.

atfrost added on 2002-03-01 02:02:22:
I would like to mention that lines of text longer than 62 bytes in other encodings (such as Shift-JIS) do not pose any problems when read using "gets". A 235 byte line of Shift-JIS text can be found at http://www.nevernever.org/shiftjis.sample

atfrost added on 2002-03-01 01:42:08:
File Added - 18581: iso2022.out

atfrost added on 2002-03-01 01:39:42:
File Added - 18580: iso2022.sample
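
Editorial illustration of the mechanism hobbs describes in the 2002-03-02 comment above: a minimal toy model in C, not the Tcl source and not the contents of 523988.patch. It shows why a "start" flag that is never cleared makes a stateful escape-based decoder forget its mode at every buffer boundary. All names here (TOY_ENCODING_START, ToyState, toy_decode, toy_decode_chunked) are hypothetical, the decoder only tags each decoded unit instead of converting it, and it assumes that chunk boundaries never split an escape sequence or a byte pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag, modelled loosely on the role of TCL_ENCODING_START. */
#define TOY_ENCODING_START 0x01

typedef enum { MODE_ASCII, MODE_JIS0208 } ToyMode;

typedef struct {
    ToyMode mode;   /* shift state carried from one chunk to the next */
} ToyState;

/*
 * Decode one chunk. Writes one tag per decoded unit into out:
 * 'A' for a single byte in ASCII mode, 'J' for a byte pair in
 * JIS X 0208 mode. Returns the number of tags written.
 */
static size_t toy_decode(ToyState *st, int flags,
                         const unsigned char *src, size_t len, char *out)
{
    size_t i = 0, n = 0;

    assert(st != NULL && src != NULL && out != NULL);

    /* A start flag means "begin a fresh stream": forget the shift state. */
    if (flags & TOY_ENCODING_START) {
        st->mode = MODE_ASCII;
    }
    while (i < len) {
        if (src[i] == 0x1b && i + 2 < len) {
            if (memcmp(src + i + 1, "$B", 2) == 0) {   /* ESC $ B */
                st->mode = MODE_JIS0208;
                i += 3;
                continue;
            }
            if (memcmp(src + i + 1, "(B", 2) == 0) {   /* ESC ( B */
                st->mode = MODE_ASCII;
                i += 3;
                continue;
            }
        }
        if (st->mode == MODE_JIS0208 && i + 1 < len) {
            out[n++] = 'J';
            i += 2;
        } else {
            out[n++] = 'A';
            i += 1;
        }
    }
    return n;
}

/*
 * Feed src to toy_decode in pieces of at most chunk bytes.
 * clear_start != 0 clears the start flag after the first call;
 * clear_start == 0 leaves it set on every call, as in the buggy path.
 * out must have room for len + 1 bytes.
 */
static size_t toy_decode_chunked(const unsigned char *src, size_t len,
                                 size_t chunk, int clear_start, char *out)
{
    ToyState st = { MODE_ASCII };
    int flags = TOY_ENCODING_START;
    size_t off = 0, n = 0;

    while (off < len) {
        size_t take = (len - off < chunk) ? (len - off) : chunk;

        n += toy_decode(&st, flags, src + off, take, out + n);
        if (clear_start) {
            flags &= ~TOY_ENCODING_START;
        }
        off += take;
    }
    out[n] = '\0';
    return n;
}
```

Feeding the 11 bytes ESC $ B %e%"%&%H (the first four byte pairs of the garbage quoted in the description) in chunks of 7 bytes tags every pair as 'J' when the flag is cleared after the first call. When the flag stays set, the second chunk falls back to single-byte mode, which matches the reported symptom of correct text followed by raw bytes. A larger chunk size only moves the point of failure further along the line, in the same way that raising ENCODING_LINESIZE did.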