Tcl Source Code

Ticket UUID: 1236896
Title: make -nocase behave correctly with non-ASCII characters
Type: Patch
Version: None
Submitter: rmax
Created on: 2005-07-12 18:40:41
Subsystem: 44. UTF-8 Strings
Assigned To: dkf
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2013-11-12 14:41:00
Resolution: Fixed
Closed By: dkf
Closed on: 2013-11-12 14:41:00
Description:
This patch introduces a Tcl_UtfNcasecmp() function and
uses it as a replacement for strcasecmp in [lsearch],
[lsort], and [switch], when the -nocase option is given.

"make genstubs" is needed after applying the patch.

A TIP for this still has to be written.
User Comments: jan.nijtmans added on 2013-07-03 11:12:52:

Dup of [3613609]

Still want to write a TIP to put this function in the (public) stubs?


jenglish added on 2008-02-29 04:48:36:
Logged In: YES 
user_id=68433
Originator: NO

See also #1902648.  This is a portability problem as well as a correctness problem.

rmax added on 2005-07-19 00:45:20:

File Added - 142464: TclUtfCasecmp-10.patch

rmax added on 2005-07-19 00:45:19:
Logged In: YES 
user_id=124643

This is the fastest version I've been able to come up with.
The patch now adds the same ASCII optimisations to
Tcl_UtfNcmp and Tcl_UtfNcasecmp.

dgp added on 2005-07-18 21:46:06:

File Added - 142444: 1236896.patch

dgp added on 2005-07-18 21:46:04:
Logged In: YES 
user_id=80530


Here's an alternative patch.
Please check its performance
against the earlier patch.

rmax added on 2005-07-15 05:21:46:

File Added - 142100: TclUtfCasecmp-4.patch

rmax added on 2005-07-15 05:21:39:
Logged In: YES 
user_id=124643

OK, so this time it should really be correct, as I have
verified that TclUtfCasecmp produces exactly the same
results as strcasecmp for any combination of characters
between 0x01 and 0x7f, inclusive. I've also added dgp's demo
to the test suite.

I haven't yet timed this version, as I wrote it at home and
the code I used for the performance comparison is at work.
I'll do that tomorrow morning and report the results here.
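
The exhaustive check described in this comment (every pair of characters between 0x01 and 0x7f compared against strcasecmp) could be sketched roughly as below. This is not the code from the patch: ascii_casecmp and count_mismatches are hypothetical names, the comparison is restricted to one-character strings, and it assumes a POSIX strcasecmp running in the default "C" locale. Only the sign of each result is compared, since the exact return value of strcasecmp is not specified.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Fold only 'A'..'Z'; every other ASCII character is left alone. */
int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical ASCII-only case-insensitive compare (stand-in for the patch). */
int ascii_casecmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = ascii_lower((unsigned char)*a);
        int cb = ascii_lower((unsigned char)*b);
        if (ca != cb) return ca - cb;
        if (ca == 0) return 0;
    }
}

static int sign(int v) { return (v > 0) - (v < 0); }

/* Count one-character pairs in 0x01..0x7f where the two functions disagree. */
int count_mismatches(void)
{
    int bad = 0;
    char a[2] = {0, 0}, b[2] = {0, 0};
    for (int x = 0x01; x <= 0x7f; x++) {
        for (int y = 0x01; y <= 0x7f; y++) {
            a[0] = (char)x;
            b[0] = (char)y;
            if (sign(ascii_casecmp(a, b)) != sign(strcasecmp(a, b))) bad++;
        }
    }
    return bad;
}

/* Aborts if any pair disagrees. */
void check_against_strcasecmp(void)
{
    assert(count_mismatches() == 0);
}
```

Calling check_against_strcasecmp() from a test driver is enough to run the whole 127 x 127 comparison.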

dgp added on 2005-07-15 03:15:31:
Logged In: YES 
user_id=80530


demo session:

% lsort -nocase {@ `}
@ `
% lsort -nocase {` @}
` @

dgp added on 2005-07-15 03:08:05:
Logged In: YES 
user_id=80530


It looks like the latest patch
doesn't limit its case-equivalencies
to alphabetic characters.

For example, the characters
'[' (\x5b) and '{' (\x7b) get treated
as -nocase equivalents because
they are the same with the 0x20
bit forced set.
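
The problem described here can be reproduced in a few lines of C. The sketch below is only an illustration, not code from any of the attached patches; naive_fold and ascii_fold are hypothetical names. It contrasts forcing the 0x20 bit on every byte with a fold that is limited to 'A'..'Z'.

```c
#include <assert.h>

/* Hypothetical naive fold: force the 0x20 bit on every character. */
int naive_fold(int c)
{
    return c | 0x20;
}

/* Range-checked fold: only the letters 'A'..'Z' are mapped to lower case. */
int ascii_fold(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Illustrates the false equivalences and how the range check avoids them. */
void fold_self_check(void)
{
    assert(naive_fold('[') == naive_fold('{'));   /* 0x5b and 0x7b collapse */
    assert(naive_fold('@') == naive_fold('`'));   /* 0x40 and 0x60 collapse */
    assert(ascii_fold('[') != ascii_fold('{'));   /* kept distinct */
    assert(ascii_fold('@') != ascii_fold('`'));
    assert(ascii_fold('A') == ascii_fold('a'));   /* letters still match */
}
```

The same 0x20 collapse explains the '@' and '`' pair in the lsort demo session.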

rmax added on 2005-07-15 02:16:37:

File Added - 142083: TclUtfcasecmp-3.patch

rmax added on 2005-07-15 02:16:35:
Logged In: YES 
user_id=124643

The new version of the patch does special-casing for ASCII
characters. It makes [lsearch] only slightly slower on lists
of pure ASCII strings, and also improves performance for
strings that contain a mix of ASCII and other characters.

It also follows dkf's advice: the function is renamed and made
private so that the patch can be applied without a TIP.
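
As a rough illustration of what ASCII special-casing can look like (a sketch under assumptions, not the code in TclUtfcasecmp-3.patch), the hypothetical helper ascii_fast_path below compares two strings byte by byte for as long as both current bytes are ASCII, and reports the offset at which a slower UTF-8 aware comparison would have to take over.

```c
#include <assert.h>
#include <stddef.h>

/* Fold only 'A'..'Z'; other ASCII characters are unchanged. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/*
 * Hypothetical helper: walk both strings while the current bytes are ASCII.
 * Returns a nonzero difference if the folded bytes differ, otherwise 0.
 * *pos receives the offset where the walk stopped: at the difference, at
 * the terminating NUL, or at the first non-ASCII byte in either string.
 */
int ascii_fast_path(const char *a, const char *b, size_t *pos)
{
    size_t i = 0;
    int diff = 0;

    for (;;) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];

        if (ca >= 0x80 || cb >= 0x80) break;   /* slow path needed from here */
        diff = ascii_lower(ca) - ascii_lower(cb);
        if (diff != 0 || ca == 0) break;       /* decided, or end of both */
        i++;
    }
    *pos = i;
    return diff;
}
```

A caller would fall back to decoding full UTF-8 characters only when the walk stops on a byte with the high bit set, so pure ASCII input never pays for the decoding.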

rmax added on 2005-07-14 15:21:27:
Logged In: YES 
user_id=124643

Look at the implementation of Tcl_UtfCasecmp. It has to do
much more work than strcasecmp does to give correct results
for non-ASCII characters.

dkf added on 2005-07-14 15:16:07:
Logged In: YES 
user_id=79902

'cos usually strcasecmp() assumes one-byte-per-char and
probably ISO8859-* so case conversion can be done very
quickly. The updated version has to de-UTF8 each character
before case conversion.
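
To make the extra work concrete, here is a minimal sketch of a decode-then-fold comparison. It is not the Tcl implementation: utf8_decode, fold_latin1 and utf8_casecmp are hypothetical names, the decoder handles only 1- to 3-byte sequences with minimal validation, and the fold table covers only ASCII and the Latin-1 supplement, whereas a real implementation would use full Unicode case tables.

```c
#include <assert.h>

/* Decode one character from a NUL-terminated string; returns bytes consumed. */
int utf8_decode(const char *s, unsigned *cp)
{
    const unsigned char *p = (const unsigned char *)s;

    if (p[0] < 0x80) {
        *cp = p[0];
        return 1;
    }
    if ((p[0] & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
        *cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
        return 2;
    }
    if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80
            && (p[2] & 0xC0) == 0x80) {
        *cp = ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
        return 3;
    }
    *cp = p[0];          /* malformed or unsupported lead byte: pass through */
    return 1;
}

/* Illustrative fold: ASCII letters plus U+00C0..U+00DE (except U+00D7). */
unsigned fold_latin1(unsigned cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    return cp;
}

/* Case-insensitive compare that decodes each character before folding. */
int utf8_casecmp(const char *a, const char *b)
{
    for (;;) {
        unsigned ca, cb;

        a += utf8_decode(a, &ca);
        b += utf8_decode(b, &cb);
        ca = fold_latin1(ca);
        cb = fold_latin1(cb);
        if (ca != cb) return (ca < cb) ? -1 : 1;
        if (ca == 0) return 0;
    }
}
```

A byte-oriented strcasecmp can skip the decode step entirely, which is consistent with the slowdown reported for [lsort -nocase].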

mistachkin added on 2005-07-14 14:42:00:
Logged In: YES 
user_id=113501


Why does [lsort -nocase] take twice as long with this patch?

dgp added on 2005-07-14 05:07:16:
Logged In: YES 
user_id=80530


otherwise this looks ok.

I agree with dkf that a 
two-stage implementation
is the way to go.  Get the fix
in immediately.  Make it public
as approved.

dgp added on 2005-07-14 05:02:52:
Logged In: YES 
user_id=80530


the new test cmdIL-5.6 has values
I assume are in the utf-8 encoding.

Since the *.test files get [source]d
in [encoding system], that's not
really portable.  Would be better to
use \uHHHH backslash substitution
to construct the test strings and results.

rmax added on 2005-07-13 20:57:43:

File Added - 141929: Tcl_Utfcasecmp.patch

Logged In: YES 
user_id=124643

I am uploading a revised version of the patch that changes
the name of the new API to Tcl_UtfCasecmp and removes a
misleading comment that was a copy-n-paste oversight.

Meanwhile I timed [lsort -nocase] and found it to take about
twice as long after adding this patch.

rmax added on 2005-07-13 15:35:05:
Logged In: YES 
user_id=124643

Whoops, there is a typo in my initial comment:
Tcl_UtfNcasecmp() already exists, and my patch adds a
Tcl_Utfcasecmp() function (without "N").

dkf added on 2005-07-13 15:20:12:
Logged In: YES 
user_id=79902

We could try doing this in two stages.

Stage #1: Define a TclUtfNCaseCmp function (or something
similar) that is *not* in any stubs table, but which is used
to fix the guts of the various Tcl commands listed. This
part can be done now.

Stage #2: Change the name of the function to something
exported and put it in the stubs table. This is the stage
that needs documentation work and a TIP.

rmax added on 2005-07-13 01:40:41:

File Added - 141834: Tcl_Utfcasecmp.patch
