Tcl Source Code

Ticket UUID: 523988
Title: Decoding broken for iso-2022-jp
Type: Bug
Version: obsolete: 8.3.4
Submitter: atfrost
Created on: 2002-02-28 18:39:42
Subsystem: 44. UTF-8 Strings
Assigned To: hobbs
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2002-03-02 10:16:03
Resolution: Fixed
Closed By: hobbs
Closed on: 2002-03-02 03:16:03
Description:
Tcl 8.3.4 does not decode iso-2022-jp data properly 
under certain conditions.

Attached is a file "iso2022.sample".  This file is 
also located at
http://www.nevernever.org/iso2022.sample

When this file is read line-by-line using "gets", only
the first 62 bytes of each line are decoded properly.
The remainder of the line is decoded as rubbish, e.g.
literal text such as:

%e%"%&%H$N:]$N


A sample of the garbled output (re-encoded in
iso-2022-jp) can be found at
http://www.nevernever.org/iso2022.out


Here is a code segment which will produce output with 
mangled lines in $msg:

    set f [ open $filename r ]
    fconfigure $f -encoding iso2022-jp
    set msg ""

    while { ![eof $f] } {
        set line [ gets $f ]
        append msg "$line\n"
    }

    close $f


Interestingly, this problem is not exhibited when the 
file is read using "read", as in:

   set f [ open $filename r ]
   fconfigure $f -encoding iso2022-jp
   set msg [ read $f ]
   close $f
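
Since "read" decodes the sample correctly, one possible workaround on an affected interpreter is to read the whole file with "read" and split it into lines afterwards. The following is only a sketch under that assumption; the proc name readLines and its default encoding argument are hypothetical and not part of the report.

```tcl
# Hypothetical helper (not from the original report): return the lines
# of a file, decoded with "read" rather than "gets".
proc readLines {filename {encoding iso2022-jp}} {
    set f [open $filename r]
    fconfigure $f -encoding $encoding
    # -nonewline drops the final newline, so the last line does not
    # produce an extra empty list element after the split.
    set data [read -nonewline $f]
    close $f
    return [split $data "\n"]
}
```

Calling "readLines $filename" returns a Tcl list with one element per line. The whole file is held in memory, so this suits small inputs better than large ones.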




A Usenet thread discussing this issue (subject
was "trouble decoding iso2022-jp") can be found at
http://groups.google.com/groups?hl=en&frame=right&th=b7c438757333dab2

Jeff Hobbs posted: "OK, on this one Thomas keyed me in 
by pointing out that only his strict example fails 
(using gets - read is OK).  So I honed in on that, and 
was further keyed in that he noted only X chars ever 
get translated.  Poking around, I found the problem 
in  tclIO.c:FilterInputBytes.  It has something to do 
with the value of the ENCODING_LINESIZE #define 
(currently 30).  If I bump that up to 60, I can read 
Thomas' sample just fine, and if I drop it to 20, it 
stops the correct encoding translation even earlier 
per line.  That's obviously not a correct fix, but it 
does indicate that FilterInputBytes isn't encoding 
right.  I'll have to look into this more when it's not 
past midnight ..."
User Comments: hobbs added on 2002-03-02 10:16:03:

File Added - 18689: 523988.patch

I figured out the problem ... the ChannelState's
inputEncodingFlags never had the TCL_ENCODING_START flag
turned off.  'read' would do this, but not 'gets'.  This
meant that 'gets' would see the initial escape and jump to
jis0208 mode, but after reading in a small buffer's worth
of data (somewhat related to ENCODING_LINESIZE), it would
reset to the default encoding in the table again
(iso8859-1).  Increasing the ENCODING_LINESIZE parameter
just extended the amount of data that fit in that initial
buffer.
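
The effect of restarting the decoder mid-line can be imitated at script level. This is an illustration only, not the code path in tclIO.c: it relies on the fact that each separate "encoding convertfrom" call begins from the encoding's initial state, and it uses a made-up sample line rather than the reporter's file.

```tcl
# Made-up sample line: 60 Japanese characters (not the reporter's file).
set text [string repeat "\u65e5\u672c\u8a9e" 20]

# Encode to iso2022-jp: an escape into jis0208, then two bytes per
# character.
set bytes [encoding convertto iso2022-jp $text]

# Cut after the 3-byte escape plus 30 two-byte characters (63 bytes),
# so no character is split in half.
set head [string range $bytes 0 62]
set tail [string range $bytes 63 end]

# Decoding the whole line in one call keeps the shift state.
set whole [encoding convertfrom iso2022-jp $bytes]

# Each separate call starts again from the initial state, so the tail,
# which carries no leading escape, is decoded as single-byte text.
set headText [encoding convertfrom iso2022-jp $head]
set tailText [encoding convertfrom iso2022-jp $tail]

puts [string equal $whole $text]
puts [string equal $tailText [string range $text 30 end]]
puts [string range $tailText 0 11]
```

The last line prints the start of the mis-decoded tail, which shows the same kind of printable rubbish as in the original report.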

In the patch I actually lower the ENCODING_LINESIZE value 
slightly, as it speeds up 'gets' somewhat.

Committed to 8.3.4+ and 8.4a4cvs.  Added tests based on the
sample data to encoding.test.

atfrost added on 2002-03-01 02:02:22:

I would like to mention that lines of text longer than
62 bytes in other encodings (such as Shift-JIS) do not
pose any problems when read using "gets".

A 235 byte line of Shift-JIS text can be found at
http://www.nevernever.org/shiftjis.sample

atfrost added on 2002-03-01 01:42:08:

File Added - 18581: iso2022.out

atfrost added on 2002-03-01 01:39:42:

File Added - 18580: iso2022.sample
