Tcl Source Code

Artifact 932a96bb9ffb81c206fcbe15f8da695090b5585f:

Attachment "2003-5-14.txt" to ticket [557030ffff] added by hobbs 2003-05-15 04:46:32.
Wed May 14 14:35:15 2003 209.17.183.249 [MSG] tclguy: argh ... I will spend 30 min on tcl bug 557030 before calling 8.4.3 done
Wed May 14 14:35:25 2003 209.17.183.249 [MSG] tclguy: anyone know chinese?
Wed May 14 14:36:01 2003 193.98.144.2 [MSG] suchenwi: Here!
Wed May 14 14:36:02 2003 192.35.44.3 [MSG] kennykb: suchenwi is a Sinologist by training....
Wed May 14 14:36:41 2003 209.17.183.249 [MSG] tclguy: can you please just browse that bug?
Wed May 14 14:36:58 2003 193.98.144.2 [MSG] suchenwi: Yup - but a bit out of practice, doing all those other languages..
Wed May 14 14:40:07 2003 209.17.183.249 [MSG] tclguy: who knows the difference between gb2312, gb12345 and euc-cn?
Wed May 14 14:40:35 2003 193.98.144.2 [MSG] suchenwi: suchenwi@jaguar% wish                                          ~/src/roi/config/ru_mr
% info pa
8.4.1
% exit
suchenwi@jaguar% LC_CTYPE=zh_CN.GB2312 wish                    ~/src/roi/config/ru_mr
zsh: 21620 bus error  LC_CTYPE=zh_CN.GB2312 wish
Wed May 14 14:40:42 2003 193.98.144.2 [MSG] suchenwi: (Solaris, that was)
Wed May 14 14:41:19 2003 193.98.144.2 [MSG] suchenwi: gb2312 is the first Chinese encoding, using a 2x7 bit plan. EUC-CN is the same, with added high bit, so it can be mixed with ASCII characters.
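The relation described here (EUC-CN is the 2x7-bit GB2312 pair with the high bit set on each byte) can be sketched in Tcl roughly as follows. The helper name gb7ToEuc is hypothetical and exists only for illustration; the sketch decodes through euc-cn, the encoding the participants agree works, so it does not depend on what a particular gb2312.enc contains.

```tcl
# Sketch only: gb7ToEuc is a hypothetical helper, not part of Tcl.
proc gb7ToEuc {raw} {
    # Read each byte of the 7-bit form as an integer.
    binary scan $raw c* codes
    set out {}
    foreach c $codes {
        # Setting the high bit moves 0x21..0x7E into 0xA1..0xFE.
        lappend out [expr {($c & 0x7f) | 0x80}]
    }
    return [binary format c* $out]
}

# "R;" is the 7-bit pair 0x52 0x3B; with high bits set it becomes 0xD2 0xBB.
set euc [gb7ToEuc "R;"]
puts [encoding convertfrom euc-cn $euc]
```

The pair "R;" is taken from the raw GB2312 string quoted later in this log, where it is identified as 'yi', the number 1.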
Wed May 14 14:41:49 2003 192.35.44.3 [MSG] kennykb: good discussion at http://nscp.upenn.edu/aix4.3html/aixprggd/genprogc/codeset_over.htm
Wed May 14 14:42:07 2003 193.98.144.2 [MSG] suchenwi: BTW, it's all described in the paper I sent you :) GB12345 is an extension for rarer characters, I think; I have it printed at home somewhere, but no fonts to actually do something with it.
Wed May 14 14:43:08 2003 209.17.183.249 [MSG] tclguy: so what would the repercussions be to at least copy euc-cn over gb2312 for 8.4.3?
Wed May 14 14:43:21 2003 193.98.144.2 [MSG] suchenwi: suchenwi@smart2[~/src/roi/config/ru_mr]671: LC_CTYPE=zh_CN.GB2312 wish
#?#?#?#?#?#?#?#?...#?public#?lib/tcl8#?#?#?#?lib/tcl8#?#?#?#?public#?librar#?#?#?librar#?#?#?tcl8#?4.1/librar#?#?#?public#?packages#?languages/tcl8#?4.1/lib/tcl8#?#?#?@O#?#?#?#?#?#?#?#?...#?suchenwi@smart2[~/src/roi/config/ru_mr]672: 
suchenwi@smart2[~/src/roi/config/ru_mr]
Wed May 14 14:43:51 2003 193.98.144.2 [MSG] suchenwi: (Linux, that was), 8.4.1 too. I think it can't get worse - people often refer to gb2312 when they mean euc-cn.
Wed May 14 14:45:30 2003 192.35.44.3 [MSG] kennykb: or euc-cn when they mean gbk.
Wed May 14 14:45:38 2003 193.98.144.2 [MSG] suchenwi: On Windows it does not crash, but seems to have no effect on [encoding system].
Wed May 14 14:45:44 2003 209.17.183.249 [MSG] tclguy: you could try option add *Entry*disabledBackground white
Wed May 14 14:49:36 2003 193.98.144.2 [MSG] suchenwi: Jeff - use euc-cn (which provenly works well) as replacement for gb2312
Wed May 14 14:49:50 2003 192.35.44.3 [MSG] kennykb: Hmm, is it true that Guo Biao Kuo is a superset of euc-CN and that in turn a superset of GB2312/ISO2022 ?
Wed May 14 14:52:13 2003 209.17.183.249 [MSG] tclguy: euc-cn appears to be a superset, but with slightly different code points (or something).
Wed May 14 14:52:13 2003 193.98.144.2 [MSG] suchenwi: In my understanding, GB2312 | 0x8080 = euc-cn
Wed May 14 14:52:54 2003 193.98.144.2 [MSG] suchenwi: ..and I've worked with euc-cn before knowing that name, with the reference book titled GB 2312-80, for many years.
Wed May 14 14:53:22 2003 193.98.144.2 [MSG] suchenwi: GuoBiao = GB = guojia biaozhun, National Standard. Kuo: extended, extension.
Wed May 14 14:54:13 2003 193.98.144.2 [MSG] suchenwi: Jeff: LC_CTYPE=zh_CN.EUC-CN wish   comes back with the same #?#?.. error.
Wed May 14 14:54:15 2003 209.17.183.249 [MSG] tclguy: I would like to know where the flip the whole tcl/tools/encoding data came from originally.
Wed May 14 14:54:45 2003 209.17.183.249 [MSG] tclguy: copying euc-cn over gb2312 works for me, with LANG=zh_CN.gb2312
Wed May 14 14:55:42 2003 193.98.144.2 [MSG] suchenwi: Here's how Perlites see it: http://www.perldoc.com/perl5.8.0/lib/Encode/CN.html
Wed May 14 14:56:33 2003 192.35.44.3 [MSG] kennykb: and the pythoneers: http://www.basistech.com/papers/chinese/python-zh-transcoding-iuc20-te2.pdf
Wed May 14 14:56:33 2003 209.17.183.249 [MSG] tclguy: "When you see charset=gb2312 on mails and web pages, they really mean euc-cn encodings. To fix that, gb2312 is aliased to euc-cn. Use gb2312-raw when you really mean it. "
Wed May 14 14:57:54 2003 129.6.88.137 [MSG] dgp: the origins described in the header of tcl/tools/encoding/gb2312.txt are false?
Wed May 14 14:58:42 2003 209.17.183.249 [MSG] tclguy: there are no actual URL origins there.
Wed May 14 14:58:48 2003 209.17.183.249 [MSG] tclguy: and I want to find updated versions.
Wed May 14 14:59:23 2003 193.98.144.2 [MSG] suchenwi: www.unicode.org should have "official" conversion tables.
Wed May 14 15:02:04 2003 192.35.44.3 [MSG] kennykb: My head hurts.
Big5 8ea8 = \u67a1
HKSCS 8ea8 = Big5 C9A0 = \u7dab = GB12345 47-63
But GB2312 47-63 = \u7ebf
But euc-CN 47-63 = \u7dda
Richard?
Wed May 14 15:02:26 2003 209.17.183.249 [MSG] tclguy: hmmm, I think I will copy euc-cn over gb2312 for now.
Wed May 14 15:02:59 2003 209.17.183.249 [MSG] tclguy: when I do that, setting LC_CTYPE or LANG to zh_CN(.gb2312|.euc-cn) works when I do 'make shell'
Wed May 14 15:03:07 2003 209.17.183.249 [MSG] tclguy: and both return gb2312 ...
Wed May 14 15:03:13 2003 209.17.183.249 [MSG] tclguy: for encoding system that is
Wed May 14 15:03:23 2003 193.98.144.2 [MSG] suchenwi: http://www.unicode.org/Public/UNIDATA/Unihan.txt is the fat boy (25 MB, still loading)
Wed May 14 15:03:32 2003 209.17.183.249 [MSG] tclguy: I've looked at unicode.org, but their format is different ...
Wed May 14 15:03:42 2003 193.98.144.2 [MSG] suchenwi: Kevin, wait, let me check
Wed May 14 15:05:27 2003 193.98.144.2 [MSG] suchenwi: \u7ebf is the abbreviated form of \u7dda - so they seem to call euc-cn a non-abbreviated encoding. That's wrong, to the best of my knowledge.
Wed May 14 15:06:23 2003 192.35.44.3 [MSG] kennykb: And the other two Hanzi?
Wed May 14 15:06:43 2003 209.17.183.249 [MSG] tclguy: I'm more interested in if the change of encodings works for you, because right now gb2312 is completely bunk.
Wed May 14 15:06:50 2003 193.98.144.2 [MSG] suchenwi: \u7dab actually is the more frequent non-abbreviated form of \u7ebf. Here one simplified char has two non-simplified predecessors, which makes the relation complicated indeed.
Wed May 14 15:06:58 2003 209.17.183.249 [MSG] tclguy: something that "sort of" works is much better than DOA.
Wed May 14 15:07:58 2003 193.98.144.2 [MSG] suchenwi: \u67a1 has nothing to do with the other three - I think it was picked as example of different encoding, same code point.
Wed May 14 15:08:26 2003 192.35.44.3 [MSG] kennykb: ok.
Wed May 14 15:08:59 2003 209.17.183.249 [MSG] tclguy: I could also do what perl did and rename gb2312.enc (the current one) to gb2312-raw.enc, so you have to be more explicit ... if it works at all, but I'd rather see more tests for that.
Wed May 14 15:10:34 2003 209.17.183.249 [MSG] tclguy: any opinions?
Wed May 14 15:10:39 2003 209.17.183.249 [MSG] tclguy: going once ...
Wed May 14 15:10:54 2003 12.231.162.99 [MSG] CoderX2: um, i'm confused what the problem is.
Wed May 14 15:11:16 2003 12.231.162.99 [MSG] CoderX2: does that encoding file work at all?
Wed May 14 15:11:50 2003 193.98.144.2 [MSG] suchenwi: Funny - encoding convertfrom big5 doesn't work with the codes above (if written as \x8e\xa8); while it works with euc-cn, -jp, -kr correctly.
Wed May 14 15:11:52 2003 209.17.183.249 [MSG] tclguy: it appears not to.
Wed May 14 15:12:26 2003 209.17.183.249 [MSG] tclguy: but we have no chinese experts, and it could be the source data for the encoding that is off.
Wed May 14 15:12:35 2003 192.35.44.3 [MSG] kennykb: CoderX2: The problem is that many people write 'gb2312' where they mean 'euc-cn'. The encodings are related but not the same.
Wed May 14 15:13:10 2003 193.98.144.2 [MSG] suchenwi: I suppose I forgot some offset - but I never had docs for big5, or experience (though I read a Chinese eBook with it, so it can't be all bad)
Wed May 14 15:13:32 2003 193.98.144.2 [MSG] suchenwi: How about asking Chengye Mao?
Wed May 14 15:14:26 2003 209.17.183.249 [MSG] tclguy: in fact, if Linux in this case really means euc-cn when they say gb2312, that could be our problem.
Wed May 14 15:14:26 2003 12.231.162.99 [MSG] CoderX2: which brings up the more interesting point, how do we know if the other encodings are correct?
Wed May 14 15:14:39 2003 209.17.183.249 [MSG] tclguy: Veronica is in Sweden.
Wed May 14 15:14:47 2003 193.98.144.2 [MSG] suchenwi: Kevin: I do not share the difference claimed above between GB2312 and euc-cn. Both should be simplified chars. That GB12345 has the corresponding traditional character, fits better into the picture.
Wed May 14 15:15:07 2003 209.17.183.249 [MSG] tclguy: we've had people bring up errors (very few) in other encodings over time.  My guess is this one really is an interpretation problem.
Wed May 14 15:15:27 2003 209.17.183.249 [MSG] tclguy: The perl guys must have good reason for aliasing gb2312 to euc-cn.
Wed May 14 15:15:52 2003 193.98.144.2 [MSG] suchenwi: Proof of the pudding - an encoding is correct if a witness can read the rendering of a known text. That's how I verify euc-cn, and even (a bit of) big5.
Wed May 14 15:15:58 2003 209.17.183.249 [MSG] tclguy: so, back to copy over ... going twice ...
Wed May 14 15:16:46 2003 193.98.144.2 [MSG] suchenwi: ..and iso8859-5 and cp1251 were also sampled correct last week.
Wed May 14 15:16:58 2003 209.17.183.249 [MSG] tclguy: OK, give me some gb2312 text and a graphic correct rendering.
Wed May 14 15:17:15 2003 193.98.144.2 [MSG] suchenwi: ..as was euc-kr, some time ago.
Wed May 14 15:18:57 2003 193.98.144.2 [MSG] suchenwi: Well, will try - but you won't believe it if I produce it with Tcl, right? So I have to fall back to well-known gb2312 codes I still remember.
Wed May 14 15:19:42 2003 192.35.44.3 [MSG] kennykb: http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.html
Wed May 14 15:20:04 2003 209.17.183.249 [MSG] tclguy: does this help? http://www.i18nguy.com/unicode-example.html
Wed May 14 15:20:36 2003 209.17.183.249 [MSG] tclguy: thanks kkb - that shows on IE
Wed May 14 15:22:12 2003 209.17.183.249 [MSG] tclguy: set str {µÚÒ»¼Ò·þÎñÓÚ΢ÐÍÆóÒµ¾­ÓªÕßµÄ˽ӪÉÌÒµÒøÐÐÍŽáÒøÐмÍÊÂ}
Wed May 14 15:22:23 2003 209.17.183.249 [MSG] tclguy: encoding convertfrom gb2312 $str
Wed May 14 15:22:33 2003 193.98.144.2 [MSG] suchenwi: The i18nguy page appears correctly rendered on IE, for PRC, Japan, ROC
Wed May 14 15:22:36 2003 209.17.183.249 [MSG] tclguy: that gives me the same junk back (current gb2312 encoding)
Wed May 14 15:22:52 2003 209.17.183.249 [MSG] tclguy: encoding convertfrom euc-cn $str
Wed May 14 15:23:20 2003 209.17.183.249 [MSG] tclguy: that gives me the exact same bold string in the gb2312 encoding that starts the paragraph of http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.gb2312.html
Wed May 14 15:23:36 2003 209.17.183.249 [MSG] tclguy: that confirms for me at least that gb2312 != gb2312.  gb2312 == euc-cn.
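The pasted test string only works if every byte survives the chat transport unchanged. A variant of the same check that spells the first two characters as hex digits might look like the sketch below; the hex values b5da and d2bb are the first four bytes of the string above, and everything else is illustrative rather than part of the original test.

```tcl
# Build the first two EUC-CN characters of the test string from hex,
# so the example does not depend on how the pasted bytes were displayed.
set bytes [binary format H* b5dad2bb]

# Decode with euc-cn, the encoding that worked in the test above.
set text [encoding convertfrom euc-cn $bytes]
puts "decoded [string length $text] characters"

# Round trip: encoding back should reproduce the original bytes.
binary scan [encoding convertto euc-cn $text] H* hex
puts $hex
```

If the round trip prints the same hex digits that went in, the table is at least self-consistent for those code points, which is a weaker check than having a reader confirm the rendering.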
Wed May 14 15:24:12 2003 209.17.183.249 [MSG] tclguy: going three times ...
Wed May 14 15:24:31 2003 209.17.183.249 [MSG] tclguy: BTW, this is on Windows, using tkcon.
Wed May 14 15:24:53 2003 193.98.144.2 [MSG] suchenwi: Yes - convertfrom euc-cn gives: "the first service in microindustry dealer's profits trade bank united bank memo"
Wed May 14 15:24:58 2003 209.17.183.249 [LOGIN] tclguy
Wed May 14 15:25:18 2003 193.98.144.2 [MSG] suchenwi: Very simple: pure gb2312 is 7 bits, so it should look like pure-ASCII gibberish.
Wed May 14 15:25:53 2003 192.35.44.3 [MSG] kennykb: http://www.geocities.com/tao4dummies/reading_list.html
Wed May 14 15:26:52 2003 209.17.183.249 [MSG] tclguy: and it works with SuSE 7.3 with unicode fonts installed just the same (gb2312 bogus, euc-cn works)
Wed May 14 15:26:59 2003 192.35.44.3 [MSG] kennykb: particularly http://www.geocities.com/tao4dummies/Laozi_gb.html
Wed May 14 15:27:07 2003 209.17.183.249 [MSG] tclguy: so I am convinced that the copy over and no -raw is correct
Wed May 14 15:27:33 2003 209.17.183.249 [MSG] tclguy: color me green
Wed May 14 15:27:53 2003 193.98.144.2 [MSG] suchenwi: Hey wait - the gb2312 encoding itself isn't all wrong. I tried:
Wed May 14 15:27:54 2003 192.35.44.3 [MSG] kennykb: Tclguy: The perlites needed -raw for some reason.
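Wed May 14 15:28:12 2003 193.98.144.2 [MSG] suchenwi: proc dec2bytes7 dec {
    # Split a decimal row/column code such as 2136 into row 21, column 36
    set hi [expr {$dec / 100}]
    set lo [expr {$dec % 100}]
    # Add 32 (0x20) to each to get the 7-bit GB2312 byte pair
    return [format %c%c [expr {$hi + 32}] [expr {$lo + 32}]]
}
encoding convertfrom gb2312 [dec2bytes7 2136]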
Wed May 14 15:28:29 2003 209.17.183.249 [MSG] tclguy: yeah, well maybe theirs works ... or maybe nobody uses it.
Wed May 14 15:28:39 2003 193.98.144.2 [MSG] suchenwi: which indeed returns the well-known character row 21 col 36 of GB2312
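For comparison, the same row/column lookup can be routed through euc-cn by adding 0xA0 instead of 32 to each half. This is a sketch under that assumption; rowcol2euc is a hypothetical name, and the code does not rely on how any given Tcl release ships gb2312.enc.

```tcl
# Sketch only: rowcol2euc is a hypothetical helper, not part of Tcl.
proc rowcol2euc {dec} {
    # 2136 means row 21, column 36 of the GB 2312-80 table.
    set row [expr {$dec / 100}]
    set col [expr {$dec % 100}]
    # EUC-CN stores row and column each offset by 0xA0 (0x20 plus the high bit).
    return [binary format cc [expr {$row + 0xA0}] [expr {$col + 0xA0}]]
}

set ch [encoding convertfrom euc-cn [rowcol2euc 2136]]
# Print the Unicode code point of the decoded character.
puts [format "U+%04X" [scan $ch %c]]
```

Row 21, column 36 becomes the byte pair 0xB5 0xC4, which also occurs in the test string earlier in the log.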
Wed May 14 15:29:27 2003 193.98.144.2 [MSG] suchenwi: So our gb2312.enc appears to work well according to the Chinese standard.
Wed May 14 15:29:39 2003 209.17.183.249 [MSG] tclguy: what is the URL for the chat logs?
Wed May 14 15:30:14 2003 193.98.144.2 [MSG] suchenwi: Yes, verified with more GB codes that I happened to know by heart. The bug is not in gb2312.enc.
Wed May 14 15:30:20 2003 129.6.88.137 [MSG] dgp: http://mini.net/tchat/logs/
Wed May 14 15:31:48 2003 193.98.144.2 [MSG] suchenwi: Jeff: here's what true-blue GB2312 looks like:
Wed May 14 15:31:50 2003 193.98.144.2 [MSG] suchenwi: 16137 % encoding convertto gb2312 [encoding convertfrom euc-cn $str]
5ZR;<R7~NqSZN"PMFsR5>-S*U_5DK=S*ILR5RxPPME=aRxPP<MJB
Wed May 14 15:33:05 2003 193.98.144.2 [MSG] suchenwi: If you do a 'convertfrom gb2312' on it, you see the Chinese - the second being just one horizontal, 'yi', number 1.
Wed May 14 15:34:42 2003 193.98.144.2 [MSG] suchenwi: Oh right, you have it from http://crl.nmsu.edu/Events/FWOI/SecondWorkshop/article.chinese.gb2312.html
Wed May 14 15:37:24 2003 209.17.183.249 [MSG] tclguy: OK, fired off a message to the debian guys.  I think this is the right thing to do.
Wed May 14 15:37:33 2003 209.17.183.249 [MSG] tclguy: At a minimum, it is a better state than the current one.
Wed May 14 15:37:35 2003 192.35.44.3 [MSG] kennykb: http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2002-02/msg00095.html gives the Perlites' reasoning
Wed May 14 15:39:08 2003 209.17.183.249 [MSG] tclguy: ah crap .. IE won't read that right.  It jumps into chinese and stays there half-way through
Wed May 14 15:39:20 2003 193.98.144.2 [MSG] suchenwi: OT: try http://www.ibiblio.org/pub/packages/ccic/software/info/cjk-codes/GB.html for the cutest 404 I've long seen
Wed May 14 15:39:36 2003 209.17.183.249 [MSG] tclguy: anyway ... committing now.  hey, it's all in CVS after all.
Wed May 14 15:40:40 2003 193.98.144.2 [MSG] suchenwi: Jeff: IE takes the Chinese escape literally...
Wed May 14 15:41:05 2003 193.98.144.2 [MSG] suchenwi: ..try "View Source"; there it's readable English.
Wed May 14 15:42:02 2003 193.98.144.2 [MSG] suchenwi: IE honours that: The "raw" doublebyte representation
  is escaped with ~{...~} sequences.
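A reader for text of that shape could be sketched as follows: find each ~{...~} span, set the high bit on the bytes inside, and decode the result as euc-cn. The names setHighBits and hzDecode are hypothetical, and the sketch deliberately ignores other escapes such a format may define (for example a doubled tilde for a literal tilde), so it is an illustration rather than a complete decoder.

```tcl
# Sketch only: both helpers are hypothetical names, not part of Tcl.
proc setHighBits {raw} {
    binary scan $raw c* codes
    set out {}
    foreach c $codes {
        lappend out [expr {($c & 0x7f) | 0x80}]
    }
    return [binary format c* $out]
}

proc hzDecode {text} {
    set out ""
    # Handle one ~{...~} span per iteration, left to right.
    while {[set open [string first "~\{" $text]] >= 0} {
        set close [string first "~\}" $text $open]
        if {$close < 0} break
        # Plain text before the span is copied through unchanged.
        append out [string range $text 0 [expr {$open - 1}]]
        # The span body is raw 7-bit GB2312; lift it to EUC-CN and decode.
        set raw [string range $text [expr {$open + 2}] [expr {$close - 1}]]
        append out [encoding convertfrom euc-cn [setHighBits $raw]]
        set text [string range $text [expr {$close + 2}] end]
    }
    return $out$text
}

puts [hzDecode "ab~{5ZR;~}cd"]
```

The sample input reuses the first two raw pairs ("5Z" and "R;") from the 7-bit string shown above, framed by ordinary ASCII on both sides.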
Wed May 14 15:45:28 2003 192.35.44.3 [MSG] kennykb: GB2312 is a character set, not a transport encoding. It is used in two transport encodings, EUC-CN and ISO2022CN, neither of which uses it in "raw" form?
Wed May 14 15:48:23 2003 193.98.144.2 [MSG] suchenwi: EUC-CN sets the most significant bit on each byte. ISO2022 surrounds 7-bit GB2312 with escape sequences.
Wed May 14 15:49:37 2003 192.35.44.3 [MSG] kennykb: but the point is, no transport encoding uses it in raw form, yes?
Wed May 14 15:50:10 2003 193.98.144.2 [MSG] suchenwi: http://www.ma.utexas.edu/cgi-bin/info2www?(ISO2022)
Wed May 14 15:50:46 2003 193.98.144.2 [MSG] suchenwi: Well, inside the escapes of iso2022cn, it is raw 2x7 bits, looks like ASCII gibberish.
Wed May 14 15:51:25 2003 193.98.144.2 [MSG] suchenwi: "ESC $ ( A or ESC $ A : designate GB2312 to G0"
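As a rough illustration of that framing, the sketch below wraps a raw 7-bit GB2312 string in the ESC $ A designation quoted above. Two things are assumptions of this sketch and not stated in the log: the helper name iso2022WrapGb, and the use of ESC ( B afterwards to switch G0 back to ASCII. It only builds bytes; it does not claim to match everything a full ISO-2022-CN implementation does.

```tcl
# Sketch only: iso2022WrapGb is a hypothetical helper, not part of Tcl.
proc iso2022WrapGb {raw7} {
    # \033 is ESC. "ESC $ A" designates GB2312 to G0 (as quoted in the log);
    # "ESC ( B" returning G0 to ASCII is an assumption of this sketch.
    return "\033\$A${raw7}\033(B"
}

set framed [iso2022WrapGb "5ZR;"]
# Show the framed bytes as hex so the escape sequences are visible.
binary scan $framed H* hex
puts $hex
```

Between the two escapes the payload is unchanged, which matches the remark that it looks like ASCII gibberish.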