Ticket UUID: 526524
Title: iso2022-jp conversion problems
Type: Bug | Version: obsolete: 8.4a4
Submitter: furukawa | Created on: 2002-03-06 18:24:46
Subsystem: 44. UTF-8 Strings | Assigned To: hobbs
Priority: 5 Medium | Severity:
Status: Closed | Last Modified: 2002-04-18 08:52:18
Resolution: Fixed | Closed By: hobbs
Closed on: 2002-04-18 01:52:18
Description:
tcl8.4a4 addressed several problems with the iso2022-jp encoding. For example, the bugs that I submitted in the past were mostly fixed:

[ BugID: 218099 ] iso2022-jp encoding does not work.
[ BugID: 219283 ] iso2022-jp encoding is broken

However, it still has problems when I convert relatively long (longer than several kilobytes) Japanese texts (e.g. Unix Japanese manual pages) into iso2022-jp. I'll attach a script to reproduce that. Some details follow.

(1) euc-jp to iso2022-jp gets-puts conversion

When I convert a text with "tclsh8.4 eucjis.tcl -eucjis -gets infile outfile", sometimes "esc ( B" is missing and sometimes an extra "esc ( B" appears. While an extra "esc ( B" does not matter, a missing "esc ( B" causes missing characters on reading. The error is reproducible if I use the same file, but I don't know how and when it happens. "od -x -a" output for an example error is below. If I extract the erroneous line, the error does not occur. Thus the error is not code dependent but context dependent.

[ output from eucjis.tcl -eucjis -gets euc.txt jis-n3.txt ]
  % H $ 7 $ ^ esc ( B nl sp sp sp sp sp sp
  0007760 241b 2442 2139 1b23 4228 0a0a 2020 752d
  esc $ B $ 9 ! # esc ( B nl nl sp sp - u
! 0010000 2020 241b 2542 213d 253c 2148 2d4a 2074
! sp sp esc $ B % = ! < % H ! J - t sp
! 0010020 241b 2442 3b48 4d48 2451 2439 246b 2448
! esc $ B $ H ; H M Q $ 9 $ k $ H $

[ correct output produced from a software called nkf ]
  % H $ 7 $ ^ esc ( B nl sp sp sp sp sp sp
  0007760 241b 2442 2139 1b23 4228 0a0a 2020 752d
  esc $ B $ 9 ! # esc ( B nl nl sp sp - u
! 0010000 2020 241b 2542 213d 253c 2148 1b4a 4228
! sp sp esc $ B % = ! < % H ! J esc ( B
! 0010020 742d 1b20 4224 4824 483b 514d 3924 6b24
! - t sp esc $ B $ H ; H M Q $ 9 $ k
! 0010040 4824 2d24 4b21 5e24 3f24 4f24 3d49 283c
! $ H $ - ! K $ ^ $ ? $ O I = < (

(2) euc-jp to iso2022-jp read-puts conversion

When I convert a text with "tclsh8.4 eucjis.tcl -eucjis -read infile outfile", sometimes an extra "esc $ B" appears in the middle of the output. It seems to always appear at around character number 4096, 8192, etc. (This is the character count, not the byte count.) Thus, if the Tcl internal buffer for Unicode storage is 8192 bytes long (4096 characters), the boundary handling at the beginning of each internal buffer presumably has a bug.

(3) font selection mechanism

Under tk8.4a4 some characters are not displayed correctly with a font like "*-jisx0208.1983-1". It is a minor problem, since we normally use "*-jisx0208.1983-0".
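To check a converted file for the symptoms described in (1) and (2), a small scanner over the raw bytes may help. The following is a minimal Python sketch and is not part of the attached eucjis.tcl; the function name scan_iso2022jp and its messages are made up for illustration. It assumes the file uses only the two escape sequences named in this report ("esc $ B" and "esc ( B"), so output containing other ISO-2022-JP designations would need extra cases. The sample bytes are adapted from the dump at offset 0010000, with the tail after the second "esc $ B" shortened.

```python
ESC_TO_JIS = b"\x1b$B"    # "esc $ B": switch to JIS X 0208 (two-byte mode)
ESC_TO_ASCII = b"\x1b(B"  # "esc ( B": switch back to ASCII


def scan_iso2022jp(data):
    """Return (offset, message) pairs for suspicious escape usage.

    Hypothetical helper: it tracks only the ASCII / JIS X 0208 shift state.
    """
    problems = []
    in_jis = False
    i = 0
    while i < len(data):
        if data.startswith(ESC_TO_JIS, i):
            if in_jis:
                problems.append((i, "extra ESC $ B while already in JIS mode"))
            in_jis = True
            i += 3
        elif data.startswith(ESC_TO_ASCII, i):
            if not in_jis:
                problems.append((i, "extra ESC ( B while already in ASCII mode"))
            in_jis = False
            i += 3
        elif in_jis:
            # In JIS mode every character is two bytes in the range 0x21..0x7E.
            pair = data[i:i + 2]
            if len(pair) == 2 and all(0x21 <= b <= 0x7E for b in pair):
                i += 2
            else:
                problems.append((i, "non-JIS byte in JIS mode (missing ESC ( B?)"))
                in_jis = False  # resynchronise as ASCII and skip this byte
                i += 1
        else:
            i += 1
    if in_jis:
        problems.append((len(data), "data ends in JIS mode"))
    return problems


# Bytes adapted from the dump in (1); the tail is shortened for illustration.
buggy = b"  \x1b$B%=!<%H!J-t \x1b$B$H\x1b(B\n"
fixed = b"  \x1b$B%=!<%H!J\x1b(B-t \x1b$B$H\x1b(B\n"

print(scan_iso2022jp(buggy))  # one problem, at offset 15 (the space after "-t")
print(scan_iso2022jp(fixed))  # []
```

In the buggy sample the scanner reports the space rather than "-t" itself: without the "esc ( B", the bytes "-t" form a valid two-byte pair and are consumed as one JIS character, which matches the "missing characters on reading" described in (1).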
User Comments:
hobbs added on 2002-04-18 08:52:18:
File Added - 21415: yamako-endenc.patch
Logged In: YES user_id=72656
Applied patch to 8.4 head on 2002-04-17. Attached patch for posterity.

yamako added on 2002-03-12 21:37:12:
Logged In: YES user_id=475117
Hi, I sent Mr. Furukawa an additional patch to fix this problem, and he replied that problems (1) and (2) were solved. My additional patch is available from: http://www3.ocn.ne.jp/~yamako/tcl/iso2022-jp.tcl84a4.2002mar12.patch

furukawa added on 2002-03-08 19:31:00:
Logged In: YES user_id=49637
Problems (1) and (2) were found to be fixed by a patch by Koichi Yamamoto (private communication). He may submit the patch after he refines it.

furukawa added on 2002-03-07 01:24:46:
File Added - 18910: eucjis.tcl