Tcl Source Code

Ticket UUID: 3085863
Title: tclUniData 9 years old
Type: Bug
Version: obsolete: 8.6b1.1
Submitter: lars_h
Created on: 2010-10-12 11:52:18
Subsystem: 44. UTF-8 Strings
Assigned To: nijtmans
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2010-12-06 20:00:27
Resolution: Fixed
Closed By: nijtmans
Closed on: 2010-12-06 13:00:27
Description:
It seems the tables in tclUniData.c (i.e., what class the various characters belong to) are based on a UnicodeData.txt file that is at least 9 years old, since they have been unchanged for at least that long. This means, for example, that \u0220 (LATIN CAPITAL LETTER N WITH LONG RIGHT LEG) is not considered alphabetic by Tcl:

% string is alpha \u0220
0

despite it being listed as having class Lu in http://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt.

A possible factor could be that tools/uniParse.tcl states its input should be the file

  ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData-Latest.txt

which doesn't exist anymore; there is only a UnicodeData.txt which presumably serves the same purpose.

Another factor could be a worry that updating these tables would reopen the "Unicode beyond \uFFFF" can of worms, but the CVS comment for tclUniData.c v1.4 says it was generated from the UnicodeData.txt for Unicode 3.1.0, and (if I recall correctly) that is precisely the first version that added non-BMP characters, so we've already dodged that particular bullet.
User Comments: nijtmans added on 2010-12-06 20:00:27:

It seems everything is OK, so closing

dkf added on 2010-10-24 14:25:16:
Probably not worth your time. 8.4 is probably EOLed now unless someone finds something seriously wrong with it (security hole or crash-bug).

nijtmans added on 2010-10-23 14:34:45:
Added more tests, and backported to 8.5.

Any interest for 8.4 too?

nijtmans added on 2010-10-15 22:28:08:
Checked in to HEAD. Left open because of:
- More tests should be added
- Backport to 8.5/8.4??
But before doing that, I would like to receive
more feedback that this is really OK and does not
introduce any problems. I cannot find
anything wrong, but 9 years of changes
is a long time...

lars_h added on 2010-10-14 19:12:05:
Changing uni::shift to 6 (it is currently 5, but for some reason a 6 is hardcoded in what uniParse.tcl writes to stdout) lets you get by with 208 groups. Each +1 increase in the shift doubles the length of the groupMap, though, so perhaps it's cheaper to let the entries in the pageMap swell to 16 bits.
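
The size trade-off described here can be put into numbers with a rough sketch. It assumes a two-level table covering only the 16-bit range, with one pageMap entry per block of 2^shift code points and one-byte groupMap entries. The function names and the `distinct_pages` parameter are hypothetical, introduced only for illustration; they are not taken from uniParse.tcl.

```c
#include <stddef.h>

#define BMP_BITS 16  /* assumption: only code points up to 0xFFFF are indexed */

/* Number of pageMap entries: one per block of 2^shift code points. */
static size_t page_map_entries(unsigned shift)
{
    return (size_t)1 << (BMP_BITS - shift);
}

/* Number of groupMap entries: each distinct page stores 2^shift entries. */
static size_t group_map_entries(unsigned shift, size_t distinct_pages)
{
    return distinct_pages << shift;
}

/* Total bytes for both tables, given the width of one pageMap entry.
 * groupMap entries are assumed to be one byte each. */
static size_t table_bytes(unsigned shift, size_t distinct_pages,
                          size_t page_entry_bytes)
{
    return page_map_entries(shift) * page_entry_bytes
         + group_map_entries(shift, distinct_pages);
}
```

With made-up inputs, `table_bytes(5, 300, 2)` (shift 5, 16-bit pageMap entries) gives 13696 bytes, while `table_bytes(6, 200, 1)` (shift 6, 8-bit entries) gives 13824 bytes. Which option is cheaper therefore depends on how many distinct pages the real data produces at each shift.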

nijtmans added on 2010-10-13 22:17:26:
After this patch:
  % string is alpha \u0220
  1

So it looks like it works!

nijtmans added on 2010-10-13 22:07:08:
New attempt. Didn't try your code (which is obviously the right way),
just stripped the out-of-range characters from UnicodeData.txt
before running the tools. Here is the result, a new patch.
The table has some 2048 elements now, so that looks more
reasonable, but still more than 256.

Is this better?

nijtmans added on 2010-10-13 22:02:59:

File Added - 389834: 3085863.patch

nijtmans added on 2010-10-13 22:02:20:

File Deleted - 389731:

lars_h added on 2010-10-13 19:20:32:
There is something slightly odd here. The current pageMap vector is indexed by an 11-bit integer (11 = 16 - OFFSET_BITS), meaning 2048 distinct elements can be accessed, but the vector has 5886 elements! Presumably it contains data also for non-BMP characters :-), even though Tcl can't access it :-(.
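
The 2048 figure follows from the index arithmetic, which can be sketched as below. This is a minimal illustration of a two-stage lookup under the assumptions stated in the comment (16-bit characters, OFFSET_BITS of 5), not a copy of the macro in Tcl's sources. The names `PAGE_SLOTS`, `PAGE_SIZE`, `reachable_page_slots` and `lookup_group` are hypothetical.

```c
#include <stddef.h>

#define OFFSET_BITS 5                              /* low bits select an entry within a page */
#define CHAR_BITS   16                             /* assumption: characters limited to the BMP */
#define PAGE_SLOTS  (1u << (CHAR_BITS - OFFSET_BITS)) /* 2^11 slots reachable in pageMap */
#define PAGE_SIZE   (1u << OFFSET_BITS)            /* 32 code points per page */

/* Number of pageMap entries that a 16-bit character can ever index. */
static unsigned reachable_page_slots(void)
{
    return PAGE_SLOTS;
}

/* Two-stage lookup: the high bits of ch pick a slot in pageMap, which names
 * a page; the low bits pick an entry inside that page of groupMap. */
static unsigned char lookup_group(const unsigned short *pageMap,
                                  const unsigned char *groupMap,
                                  unsigned ch)
{
    unsigned page = pageMap[(ch & 0xFFFFu) >> OFFSET_BITS];
    return groupMap[page * PAGE_SIZE + (ch & (PAGE_SIZE - 1u))];
}
```

Because the character is masked to 16 bits before shifting, any pageMap entry past index 2047 can never be read through such a lookup, whatever it contains.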

Looking at uniParse.tcl, it indeed does not have any provisions for ignoring data if the codepoint is out of range. And what is perhaps worse: it only uses the first four hex digits as the codepoint (the index variable). That's a bug! (So the data for inaccessible characters was probably wrong anyway.)

Suggested fix: Change the lines

scan [lindex $items 0] %4x index
set index [format 0x%0.4x $index]

to

scan [lindex $items 0] %x index
if {$index > 0xFFFF} then {
    # Ignore non-BMP characters, as long as Tcl doesn't support them
    continue
}
set index [format 0x%0.4x $index]

I wouldn't be surprised if that solves the more-than-256 groups problem as well.
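
The effect of the field width can be reproduced outside Tcl. The sketch below uses C's `sscanf`, whose `%4x` conversion likewise reads at most four hex digits; it is an illustration of the truncation and of the suggested range check, not a port of uniParse.tcl. The helper names `parse_truncated` and `parse_bmp_only` are hypothetical.

```c
#include <stdio.h>

/* Parse the code point field with a 4-digit width limit, as the original
 * line did. Returns 1 on success and stores the value in *cp. */
static int parse_truncated(const char *field, unsigned *cp)
{
    return sscanf(field, "%4x", cp) == 1;
}

/* Parse all hex digits, then accept the code point only if it lies in the
 * BMP. Returns 1 if the entry should be kept, 0 if it should be skipped. */
static int parse_bmp_only(const char *field, unsigned *cp)
{
    if (sscanf(field, "%x", cp) != 1) {
        return 0;               /* not a hex number at all */
    }
    return *cp <= 0xFFFFu;      /* skip anything beyond \uFFFF */
}
```

For a five-digit field such as `1D400`, `parse_truncated` stores 0x1D40, a different character that lies inside the BMP, so the data would be filed under the wrong index. `parse_bmp_only` reads 0x1D400 and reports that the entry should be skipped.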

nijtmans added on 2010-10-12 22:37:25:
Jeff, I did what Lars suggested, re-generating the necessary files with the latest UnicodeData.txt. The only 'real' thing I had to change is the type of the static variable
pageMap, from unsigned char to unsigned const, because there are more
than 256 maps now. Here is the patch (after some more manual changes
to regc_locale.c).

All tests seem to run fine.
Please evaluate: is there anything I am missing?

nijtmans added on 2010-10-12 22:28:53:

File Added - 389731: 3085863.patch
