Wed May 14 14:35:15 2003 209.17.183.249 [MSG] tclguy: argh ... I will spend 30 min on tcl bug 557030 before calling 8.4.3 done
Wed May 14 14:35:25 2003 209.17.183.249 [MSG] tclguy: anyone know chinese?
Wed May 14 14:36:01 2003 193.98.144.2 [MSG] suchenwi: Here!
Wed May 14 14:36:02 2003 192.35.44.3 [MSG] kennykb: suchenwi is a Sinologist by training....
Wed May 14 14:36:41 2003 209.17.183.249 [MSG] tclguy: can you please just browse that bug?
Wed May 14 14:36:58 2003 193.98.144.2 [MSG] suchenwi: Yup - but a bit out of practice, doing all those other languages..
Wed May 14 14:40:07 2003 209.17.183.249 [MSG] tclguy: who knows the difference between gb2312, gb12345 and euc-cn?
Wed May 14 14:40:35 2003 193.98.144.2 [MSG] suchenwi: suchenwi@jaguar% wish ~/src/roi/config/ru_mr
% info pa
8.4.1
% exit
suchenwi@jaguar% LC_CTYPE=zh_CN.GB2312 wish ~/src/roi/config/ru_mr
zsh: 21620 bus error LC_CTYPE=zh_CN.GB2312 wish
Wed May 14 14:40:42 2003 193.98.144.2 [MSG] suchenwi: (Solaris, that was)
Wed May 14 14:41:19 2003 193.98.144.2 [MSG] suchenwi: gb2312 is the first Chinese encoding, using a 2x7 bit plan. EUC-CN is the same, with added high bit, so it can be mixed with ASCII characters.
Wed May 14 14:41:49 2003 192.35.44.3 [MSG] kennykb: good discussion at http://nscp.upenn.edu/aix4.3html/aixprggd/genprogc/codeset_over.htm
Wed May 14 14:42:07 2003 193.98.144.2 [MSG] suchenwi: BTW, it's all described in the paper I sent you :) GB12345 is an extension for rarer characters, I think; I have it printed at home somewhere, but no fonts to actually do something with it.
Wed May 14 14:43:08 2003 209.17.183.249 [MSG] tclguy: so what would the repercussions be to at least copy euc-cn over gb2312 for 8.4.3?
Wed May 14 14:43:21 2003 193.98.144.2 [MSG] suchenwi: suchenwi@smart2[~/src/roi/config/ru_mr]671: LC_CTYPE=zh_CN.GB2312 wish
#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?public#?lib/tcl8#?#?#?#?lib/tcl8#?#?#?#?public#?librar#?#?#?librar#?#?#?tcl8#?4.1/librar#?#?#?public#?packages#?languages/tcl8#?4.1/lib/tcl8#?#?#?@O#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?#?suchenwi@smart2[~/src/roi/config/ru_mr]672:
suchenwi@smart2[~/src/roi/config/ru_mr]
Wed May 14 14:43:51 2003 193.98.144.2 [MSG] suchenwi: (Linux, that was), 8.4.1 too. I think it can't get worse - people often refer to gb2312 when they mean euc-cn.
Wed May 14 14:45:30 2003 192.35.44.3 [MSG] kennykb: or euc-cn when they mean gbk.
Wed May 14 14:45:38 2003 193.98.144.2 [MSG] suchenwi: On Windows it does not crash, but seems to have no effect on [encoding system].
Wed May 14 14:45:44 2003 209.17.183.249 [MSG] tclguy: you could try option add *Entry*disabledBackground white
Wed May 14 14:49:36 2003 193.98.144.2 [MSG] suchenwi: Jeff - use euc-cn (which provably works well) as replacement for gb2312
Wed May 14 14:49:50 2003 192.35.44.3 [MSG] kennykb: Hmm, is it true that Guo Biao Kuo is a superset of euc-CN and that in turn a superset of GB2312/ISO2022 ?
Wed May 14 14:52:13 2003 209.17.183.249 [MSG] tclguy: euc-cn appears to be a superset, but with slightly different code points (or something).
Wed May 14 14:52:13 2003 193.98.144.2 [MSG] suchenwi: In my understanding GB2312 | 0x8080 = euc-cn
Wed May 14 14:52:54 2003 193.98.144.2 [MSG] suchenwi: ..and I've worked with euc-cn before knowing that name, with the reference book titled GB 2312-80, for many years.
Wed May 14 14:53:22 2003 193.98.144.2 [MSG] suchenwi: GuoBiao = GB = guojia biaozhun, National Standard. Kuo: extended, extension.
Wed May 14 14:54:13 2003 193.98.144.2 [MSG] suchenwi: Jeff: LC_CTYPE=zh_CN.EUC-CN wish comes back with the same #?#?.. error.
Wed May 14 14:54:15 2003 209.17.183.249 [MSG] tclguy: I would like to know where the flip the whole tcl/tools/encoding data came from originally.
Wed May 14 14:54:45 2003 209.17.183.249 [MSG] tclguy: copying euc-cn over gb2312 works for me, with LANG=zh_CN.gb2312
Wed May 14 14:55:42 2003 193.98.144.2 [MSG] suchenwi: Here's how Perlites see it: http://www.perldoc.com/perl5.8.0/lib/Encode/CN.html
Wed May 14 14:56:33 2003 192.35.44.3 [MSG] kennykb: and the pythoneers: http://www.basistech.com/papers/chinese/python-zh-transcoding-iuc20-te2.pdf
Wed May 14 14:56:33 2003 209.17.183.249 [MSG] tclguy: "When you see charset=gb2312 on mails and web pages, they really mean euc-cn encodings. To fix that, gb2312 is aliased to euc-cn. Use gb2312-raw when you really mean it."
Wed May 14 14:57:54 2003 129.6.88.137 [MSG] dgp: the origins described in the header of tcl/tools/encoding/gb2312.txt are false?
Wed May 14 14:58:42 2003 209.17.183.249 [MSG] tclguy: there are no actual URL origins there.
Wed May 14 14:58:48 2003 209.17.183.249 [MSG] tclguy: and I want to find updated versions.
Wed May 14 14:59:23 2003 193.98.144.2 [MSG] suchenwi: www.unicode.org should have "official" conversion tables.
Wed May 14 15:02:04 2003 192.35.44.3 [MSG] kennykb: My head hurts.
Big5 8ea8 = \u67a1
HKSCS 8ea8 = Big5 C9A0 = \u7dab = GB12345 47-63
But GB2312 47-63 = \u7ebf
But euc-CN 47-63 = \u7dda
Richard?
Wed May 14 15:02:26 2003 209.17.183.249 [MSG] tclguy: hmmm, I think I will copy euc-cn over gb2312 for now.
Wed May 14 15:02:59 2003 209.17.183.249 [MSG] tclguy: when I do that, setting LC_CTYPE or LANG to zh_CN(.gb2312|.euc-cn) works when I do 'make shell'
Wed May 14 15:03:07 2003 209.17.183.249 [MSG] tclguy: and both return gb2312 ...
Wed May 14 15:03:13 2003 209.17.183.249 [MSG] tclguy: for encoding system that is
Wed May 14 15:03:23 2003 193.98.144.2 [MSG] suchenwi: http://www.unicode.org/Public/UNIDATA/Unihan.txt is the fat boy (25 MB, still loading)
Wed May 14 15:03:32 2003 209.17.183.249 [MSG] tclguy: I've looked at unicode.org, but their format is different ...
Wed May 14 15:03:42 2003 193.98.144.2 [MSG] suchenwi: Kevin, wait, let me check
Wed May 14 15:05:27 2003 193.98.144.2 [MSG] suchenwi: \u7ebf is the abbreviated form of \u7dda - so they seem to call euc-cn a non-abbreviated encoding. That's wrong, as far as I know.
Wed May 14 15:06:23 2003 192.35.44.3 [MSG] kennykb: And the other two Hanzi?
Wed May 14 15:06:43 2003 209.17.183.249 [MSG] tclguy: I'm more interested in whether the change of encodings works for you, because right now gb2312 is completely bunk.
Wed May 14 15:06:50 2003 193.98.144.2 [MSG] suchenwi: \u7dab actually is the more frequent non-abbreviated form of \u7ebf. Here one simplified char has two non-simplified predecessors, which makes the relation complicated indeed.
Wed May 14 15:06:58 2003 209.17.183.249 [MSG] tclguy: something that "sort of" works is much better than DOA.
Wed May 14 15:07:58 2003 193.98.144.2 [MSG] suchenwi: \u67a1 has nothing to do with the other three - I think it was picked as example of different encoding, same code point.
Wed May 14 15:08:26 2003 192.35.44.3 [MSG] kennykb: ok.
Wed May 14 15:08:59 2003 209.17.183.249 [MSG] tclguy: I could also do what perl did and rename gb2312.enc (the current one) to gb2312-raw.enc, so you have to be more explicit ...
if it works at all, but I'd rather see more tests for that.
Wed May 14 15:10:34 2003 209.17.183.249 [MSG] tclguy: any opinions?
Wed May 14 15:10:39 2003 209.17.183.249 [MSG] tclguy: going once ...
Wed May 14 15:10:54 2003 12.231.162.99 [MSG] CoderX2: um, i'm confused what the problem is.
Wed May 14 15:11:16 2003 12.231.162.99 [MSG] CoderX2: does that encoding file work at all?
Wed May 14 15:11:50 2003 193.98.144.2 [MSG] suchenwi: Funny - encoding convertfrom big5 doesn't work with the codes above (if written as \x8e\xa8); while it works with euc-cn, -jp, -kr correctly.
Wed May 14 15:11:52 2003 209.17.183.249 [MSG] tclguy: it appears not to.
Wed May 14 15:12:26 2003 209.17.183.249 [MSG] tclguy: but we have no chinese experts, and it could be the source data for the encoding that is off.
Wed May 14 15:12:35 2003 192.35.44.3 [MSG] kennykb: CoderX2: The problem is that many people write 'gb2312' where they mean 'euc-cn'. The encodings are related but not the same.
Wed May 14 15:13:10 2003 193.98.144.2 [MSG] suchenwi: I suppose I forgot some offset - but I never had docs for big5, or experience (though I read a Chinese eBook with it, so it can't be all bad)
Wed May 14 15:13:32 2003 193.98.144.2 [MSG] suchenwi: How about asking Chengye Mao?
Wed May 14 15:14:26 2003 209.17.183.249 [MSG] tclguy: in fact, if Linux in this case really means euc-cn when they say gb2312, that could be our problem.
Wed May 14 15:14:26 2003 12.231.162.99 [MSG] CoderX2: which brings up the more interesting point, how do we know if the other encodings are correct?
Wed May 14 15:14:39 2003 209.17.183.249 [MSG] tclguy: Veronica is in Sweden.
Wed May 14 15:14:47 2003 193.98.144.2 [MSG] suchenwi: Kevin: I do not agree with the difference claimed above between GB2312 and euc-cn. Both should be simplified chars. That GB12345 has the corresponding traditional character fits better into the picture.
Wed May 14 15:15:07 2003 209.17.183.249 [MSG] tclguy: we've had people bring up errors (very few) in other encodings over time. My guess is this one really is an interpretation problem.
Wed May 14 15:15:27 2003 209.17.183.249 [MSG] tclguy: The perl guys must have good reason for aliasing gb2312 to euc-cn.
Wed May 14 15:15:52 2003 193.98.144.2 [MSG] suchenwi: Proof of the pudding - an encoding is correct if a witness can read the rendering of a known text. That's how I verify euc-cn, and even (a bit of) big5.
Wed May 14 15:15:58 2003 209.17.183.249 [MSG] tclguy: so, back to copy over ... going twice ...
Wed May 14 15:16:46 2003 193.98.144.2 [MSG] suchenwi: ..and iso8859-5 and cp1251 were also sampled correct last week.
Wed May 14 15:16:58 2003 209.17.183.249 [MSG] tclguy: OK, give me some gb2312 text and a graphic correct rendering.
Wed May 14 15:17:15 2003 193.98.144.2 [MSG] suchenwi: ..as was euc-kr, some time ago.
Wed May 14 15:18:57 2003 193.98.144.2 [MSG] suchenwi: Well, will try - but you won't believe it if I produce it with Tcl, right? So I have to fall back to well-known gb2312 codes I still remember.
Wed May 14 15:19:42 2003 192.35.44.3 [MSG] kennykb: http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.html
Wed May 14 15:20:04 2003 209.17.183.249 [MSG] tclguy: does this help?
http://www.i18nguy.com/unicode-example.html
Wed May 14 15:20:36 2003 209.17.183.249 [MSG] tclguy: thanks kkb - that shows on IE
Wed May 14 15:22:12 2003 209.17.183.249 [MSG] tclguy: set str {µÚÒ»¼Ò·þÎñÓÚÎ¢ÐÍÆóÒµ¾­ÓªÕßµÄË½ÓªÉÌÒµÒøÐÐÍÅ½áÒøÐÐ¼ÍÊÂ}
Wed May 14 15:22:23 2003 209.17.183.249 [MSG] tclguy: encoding convertfrom gb2312 $str
Wed May 14 15:22:33 2003 193.98.144.2 [MSG] suchenwi: The i18nguy page appears correctly rendered on IE, for PRC, Japan, ROC
Wed May 14 15:22:36 2003 209.17.183.249 [MSG] tclguy: that gives me the same junk back (current gb2312 encoding)
Wed May 14 15:22:52 2003 209.17.183.249 [MSG] tclguy: encoding convertfrom euc-cn $str
Wed May 14 15:23:20 2003 209.17.183.249 [MSG] tclguy: that gives me the exact same bold string in the gb2312 encoding that starts the paragraph of http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.gb2312.html
Wed May 14 15:23:36 2003 209.17.183.249 [MSG] tclguy: that confirms for me at least that gb2312 != gb2312. gb2312 == euc-cn.
Wed May 14 15:24:12 2003 209.17.183.249 [MSG] tclguy: going three times ...
Wed May 14 15:24:31 2003 209.17.183.249 [MSG] tclguy: BTW, this is on Windows, using tkcon.
Wed May 14 15:24:53 2003 193.98.144.2 [MSG] suchenwi: Yes - convertfrom euc-cn gives: "the first service in microindustry dealer's profits trade bank united bank memo"
Wed May 14 15:24:58 2003 209.17.183.249 [LOGIN] tclguy
Wed May 14 15:25:18 2003 193.98.144.2 [MSG] suchenwi: Very simple: pure gb2312 is 7 bits, so it should look like pure-ASCII gibberish.
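A minimal sketch of the 7-bit versus 8-bit relationship discussed so far, written in Python rather than Tcl because its codec names do not depend on which gb2312.enc is installed. The helper names `euc_to_raw7` and `raw7_to_euc` are made up for illustration. It assumes the input holds only double-byte Chinese characters (no interleaved ASCII), and it relies on Python's `gb2312` codec decoding the EUC-CN byte form.

```python
def euc_to_raw7(data: bytes) -> bytes:
    """Clear the high bit of each byte (EUC-CN double bytes -> 7-bit GB2312)."""
    return bytes(b & 0x7F for b in data)


def raw7_to_euc(data: bytes) -> bytes:
    """Set the high bit of each byte (7-bit GB2312 -> EUC-CN double bytes)."""
    return bytes(b | 0x80 for b in data)


# First four bytes of the sample string from the log: 0xB5 0xDA 0xD2 0xBB.
euc = b"\xb5\xda\xd2\xbb"

raw7 = euc_to_raw7(euc)
print(raw7)                         # b'5ZR;' (plain ASCII, reads as gibberish)
print(raw7_to_euc(raw7) == euc)     # True: the mapping is just the high bit
print(ascii(euc.decode("gb2312")))  # '\u7b2c\u4e00'
```

Mixed ASCII/Chinese text cannot be masked this way, since the high bit is what tells the two apart in EUC-CN.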
Wed May 14 15:25:53 2003 192.35.44.3 [MSG] kennykb: http://www.geocities.com/tao4dummies/reading_list.html
Wed May 14 15:26:52 2003 209.17.183.249 [MSG] tclguy: and it works with SuSE 7.3 with unicode fonts installed just the same (gb2312 bogus, euc-cn works)
Wed May 14 15:26:59 2003 192.35.44.3 [MSG] kennykb: particularly http://www.geocities.com/tao4dummies/Laozi_gb.html
Wed May 14 15:27:07 2003 209.17.183.249 [MSG] tclguy: so I am convinced that the copy over and no -raw is correct
Wed May 14 15:27:33 2003 209.17.183.249 [MSG] tclguy: color me green
Wed May 14 15:27:53 2003 193.98.144.2 [MSG] suchenwi: Hey wait - the gb2312 encoding itself isn't all wrong. I tried:
Wed May 14 15:27:54 2003 192.35.44.3 [MSG] kennykb: Tclguy: The perlites needed -raw for some reason.
Wed May 14 15:28:12 2003 193.98.144.2 [MSG] suchenwi: proc dec2bytes7 dec {
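    # Split e.g. 2136 into row 21 and column 36, then add 32 to each for the 7-bit byte pair.
    set hi [expr {$dec / 100}]
    set lo [expr {$dec % 100}]
    return [format %c%c [expr {$hi + 32}] [expr {$lo + 32}]]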
}
encoding convertfrom gb2312 [dec2bytes7 2136]
Wed May 14 15:28:29 2003 209.17.183.249 [MSG] tclguy: yeah, well maybe theirs works ... or maybe nobody uses it.
Wed May 14 15:28:39 2003 193.98.144.2 [MSG] suchenwi: which indeed returns the well-known character row 21 col 36 of GB2312
Wed May 14 15:29:27 2003 193.98.144.2 [MSG] suchenwi: So our gb2312.enc appears to work well according to the Chinese standard.
Wed May 14 15:29:39 2003 209.17.183.249 [MSG] tclguy: what is the URL for chat logs?
Wed May 14 15:30:14 2003 193.98.144.2 [MSG] suchenwi: Yes, verified with more GB codes that I happened to know by heart. The bug is not in gb2312.enc.
Wed May 14 15:30:20 2003 129.6.88.137 [MSG] dgp: http://mini.net/tchat/logs/
Wed May 14 15:31:48 2003 193.98.144.2 [MSG] suchenwi: Jeff: here's what true-blue GB2312 looks like:
Wed May 14 15:31:50 2003 193.98.144.2 [MSG] suchenwi: 16137 % encoding convertto gb2312 [encoding convertfrom euc-cn $str]
5ZR;<R7~NqSZN"PMFsR5>-S*U_5DK=S*ILR5RxPPME=aRxPP<MJB
Wed May 14 15:33:05 2003 193.98.144.2 [MSG] suchenwi: If you do a 'convertfrom gb2312' on it, you see the Chinese - the second character being just one horizontal stroke, 'yi', number 1.
Wed May 14 15:34:42 2003 193.98.144.2 [MSG] suchenwi: Oh right, you have it from http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.gb2312.html
Wed May 14 15:37:24 2003 209.17.183.249 [MSG] tclguy: OK, fired off a message to the debian guys. I think this is the right thing to do.
Wed May 14 15:37:33 2003 209.17.183.249 [MSG] tclguy: At a minimum, it is a better state than the current one.
Wed May 14 15:37:35 2003 192.35.44.3 [MSG] kennykb: http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2002-02/msg00095.html gives the Perlites' reasoning
Wed May 14 15:39:08 2003 209.17.183.249 [MSG] tclguy: ah crap .. IE won't read that right. It jumps into chinese and stays there half-way through
Wed May 14 15:39:20 2003 193.98.144.2 [MSG] suchenwi: OT: try http://www.ibiblio.org/pub/packages/ccic/software/info/cjk-codes/GB.html for the cutest 404 I've seen in a long time
Wed May 14 15:39:36 2003 209.17.183.249 [MSG] tclguy: anyway ... committing now. hey, it's all in CVS after all.
Wed May 14 15:40:40 2003 193.98.144.2 [MSG] suchenwi: Jeff: IE takes the Chinese escape literally...
Wed May 14 15:41:05 2003 193.98.144.2 [MSG] suchenwi: ..try "View Source", there it's readable English.
Wed May 14 15:42:02 2003 193.98.144.2 [MSG] suchenwi: IE honours that: The "raw" doublebyte representation
is escaped with ~{...~} sequences.
Wed May 14 15:45:28 2003 192.35.44.3 [MSG] kennykb: GB2312 is a character set, not a transport encoding. It is used in two transport encodings, EUC-CN and ISO2022CN, neither of which uses it in "raw" form?
Wed May 14 15:48:23 2003 193.98.144.2 [MSG] suchenwi: EUC-CN sets the most significant bit on each byte. ISO2022 surrounds 7-bit GB2312 with escape sequences.
Wed May 14 15:49:37 2003 192.35.44.3 [MSG] kennykb: but the point is, no transport encoding uses it in raw form, yes?
Wed May 14 15:50:10 2003 193.98.144.2 [MSG] suchenwi: http://www.ma.utexas.edu/cgi-bin/info2www?(ISO2022)
Wed May 14 15:50:46 2003 193.98.144.2 [MSG] suchenwi: Well, inside the escapes of iso2022cn, it is raw 2x7 bits, looks like ASCII gibberish.
Wed May 14 15:51:25 2003 193.98.144.2 [MSG] suchenwi: " ESC $ ( A or ESC $ A : designate GB2312 to G0"
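A hedged sketch of the "raw inside escapes" idea, in Python. It builds the 7-bit byte pair for a GB2312 row/column code and wraps it in the ~{ ... ~} framing mentioned above, which Python exposes as the `hz` codec. The helper names `rowcol_to_raw7` and `hz_frame` are hypothetical, chosen only for this illustration. ISO-2022-CN's ESC sequences are not shown because Python's standard library has no codec for them.

```python
def rowcol_to_raw7(row: int, col: int) -> bytes:
    """GB2312 row/column (1..94 each) -> two 7-bit bytes in 0x21..0x7E."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return bytes([row + 0x20, col + 0x20])


def hz_frame(raw7: bytes) -> bytes:
    """Wrap raw 7-bit GB2312 bytes in ~{ ... ~} shift sequences."""
    return b"~{" + raw7 + b"~}"


raw7 = rowcol_to_raw7(21, 36)   # same code as [dec2bytes7 2136] in the log
framed = hz_frame(raw7)
text = framed.decode("hz")      # the codec sets the high bits internally

print(raw7)          # b'5D'
print(framed)        # b'~{5D~}'
print(ascii(text))   # '\u7684'

# Row 47, column 63 is the code kennykb asked about.
print(ascii(hz_frame(rowcol_to_raw7(47, 63)).decode("hz")))  # '\u7ebf'
```

The second result matches the value kennykb quoted for GB2312 47-63 (\u7ebf), which supports suchenwi's view that GB2312 and EUC-CN both carry the simplified character.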