Ticket UUID: 1236896
Title: make -nocase behave correct with non-ASCII characters
Type: Patch | Version: None
Submitter: rmax | Created on: 2005-07-12 18:40:41
Subsystem: 44. UTF-8 Strings | Assigned To: dkf
Priority: 5 Medium | Severity: Minor
Status: Closed | Last Modified: 2013-11-12 14:41:00
Resolution: Fixed | Closed By: dkf
Closed on: 2013-11-12 14:41:00
Description:
This patch introduces a Tcl_UtfNcasecmp() function and uses it as a replacement for strcasecmp in [lsearch], [lsort], and [switch], when the -nocase option is given. "make genstubs" is needed after applying the patch. A TIP for this still has to be written.
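For illustration only: as dkf notes in the comments, strcasecmp() usually assumes one byte per character, while UTF-8 encodes non-ASCII characters as multi-byte sequences, so a correct -nocase comparison has to decode each character before folding its case. The sketch below shows that idea and is not the code from any of the attached patches. All names (demo_utf8_decode, demo_fold, demo_utf_casecmp) are hypothetical, the decoder handles only 1- to 3-byte sequences, and the fold table covers only ASCII and the Latin-1 capital letters, whereas the real implementation would rely on Tcl's own Unicode case tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence (1 to 3 bytes) at s.
 * Stores the code point in *cp and returns the number of bytes consumed.
 * Anything malformed or unsupported is passed through as a single byte. */
static size_t demo_utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
            && (s[2] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6)
            | (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    *cp = s[0];
    return 1;
}

/* Toy case fold (an assumption for this sketch): ASCII A-Z and the
 * Latin-1 capitals U+00C0..U+00DE, excluding U+00D7 (multiplication sign). */
static uint32_t demo_fold(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') {
        return cp + 0x20;
    }
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) {
        return cp + 0x20;
    }
    return cp;
}

/* Compare two NUL-terminated UTF-8 strings character by character,
 * ignoring case. Returns <0, 0, or >0 in the manner of strcmp(). */
int demo_utf_casecmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p && *q) {
        uint32_t ca, cb;

        p += demo_utf8_decode(p, &ca);
        q += demo_utf8_decode(q, &cb);
        ca = demo_fold(ca);
        cb = demo_fold(cb);
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    /* The shorter string sorts first when one is a prefix of the other. */
    return (*p != 0) - (*q != 0);
}
```

With this sketch, the UTF-8 encodings of "É" (bytes C3 89) and "é" (bytes C3 A9) compare as equal, which a comparison that folds only single ASCII bytes cannot achieve.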
User Comments:
jan.nijtmans added on 2013-07-03 11:12:52:
Dup of [3613609]. Still want to write a TIP to put this function in the (public) stubs?

jenglish added on 2008-02-29 04:48:36:
See also #1902648. This is a portability problem as well as a correctness problem.

rmax added on 2005-07-19 00:45:20:
File Added - 142464: TclUtfCasecmp-10.patch

rmax added on 2005-07-19 00:45:19:
This is the fastest version I've been able to come up with. The patch now adds the same ASCII optimisations to Tcl_UtfNcmp and Tcl_UtfNcasecmp.

dgp added on 2005-07-18 21:46:06:
File Added - 142444: 1236896.patch

dgp added on 2005-07-18 21:46:04:
Here's an alternative patch. Please check its performance against the earlier patch.

rmax added on 2005-07-15 05:21:46:
File Added - 142100: TclUtfCasecmp-4.patch

rmax added on 2005-07-15 05:21:39:
OK, so this time it should really be correct, as I have verified that TclUtfCasecmp produces exactly the same results as strcasecmp for any combination of characters between 0x01 and 0x7f, inclusive. I've also added dgp's demo to the test suite. I haven't yet timed this version, as I wrote it at home and the code I used for the performance comparison is at work. I'll do that tomorrow morning and report the results here.

dgp added on 2005-07-15 03:15:31:
demo session:
    % lsort -nocase {@ `}
    @ `
    % lsort -nocase {` @}
    ` @

dgp added on 2005-07-15 03:08:05:
It looks like the latest patch doesn't limit its case-equivalencies to alphabetic characters. For example, the characters '[' (\x5b) and '{' (\x7b) get treated as -nocase equivalents because they are the same with the 0x20 bit forced set.

rmax added on 2005-07-15 02:16:37:
File Added - 142083: TclUtfcasecmp-3.patch

rmax added on 2005-07-15 02:16:35:
The new version of the patch does special-casing for ASCII characters. It makes [lsearch] only slightly slower on lists of pure ASCII strings, and also improves performance for strings that contain a mix of ASCII and other characters. It also follows dkf's advice, renaming the function and making it private so that it can be applied without a TIP.

rmax added on 2005-07-14 15:21:27:
Look at the implementation of Tcl_UtfCasecmp. It has to do much more work than strcasecmp does to give correct results for non-ASCII characters.

dkf added on 2005-07-14 15:16:07:
'cos usually strcasecmp() assumes one-byte-per-char and probably ISO8859-* so case conversion can be done very quickly. The updated version has to de-UTF8 each character before case conversion.

mistachkin added on 2005-07-14 14:42:00:
Why does [lsort -nocase] take twice as long with this patch?

dgp added on 2005-07-14 05:07:16:
otherwise this looks ok. I agree with dkf that a two-stage implementation is the way to go. Get the fix in immediately. Make it public as approved.

dgp added on 2005-07-14 05:02:52:
the new test cmdIL-5.6 has values I assume are in the utf-8 encoding. Since the *.test files get [source]d in [encoding system], that's not really portable. Would be better to use \uHHHH backslash substitution to construct the test strings and results.

rmax added on 2005-07-13 20:57:43:
File Added - 141929: Tcl_Utfcasecmp.patch
I am uploading a revised version of the patch that changes the name of the new API to Tcl_UtfCasecmp and removes a misleading comment that was a copy-n-paste oversight. Meanwhile I timed [lsort -nocase] and found it to take about twice as long after adding this patch.

rmax added on 2005-07-13 15:35:05:
Whoops, there is a typo in my initial comment: Tcl_UtfNcasecmp() already exists, and my patch adds a Tcl_Utfcasecmp() function (without "N").

dkf added on 2005-07-13 15:20:12:
We could try doing this in two stages.
Stage #1: Define a TclUtfNCaseCmp function (or something similar) that is *not* in any stubs table, but which is used to fix the guts of the various Tcl commands listed. This part can be done now.
Stage #2: Change the name of the function to something exported and put it in the stubs table. This is the stage that needs documentation work and a TIP.

rmax added on 2005-07-13 01:40:41:
File Added - 141834: Tcl_Utfcasecmp.patch
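
For illustration only: the problem dgp reports in the thread, where '[' (\x5b) and '{' (\x7b) become -nocase equivalents, follows directly from folding case by forcing the 0x20 bit on every byte. The sketch below contrasts that shortcut with a fold restricted to the ASCII letters A-Z. The function names are hypothetical and none of this is taken from the attached patches.

```c
#include <stdbool.h>

/* Naive ASCII fold: force the 0x20 bit on for every byte.
 * Correct for letters, but it also merges pairs of non-letters
 * that differ only in that bit. */
static unsigned char demo_fold_naive(unsigned char c)
{
    return (unsigned char)(c | 0x20);
}

/* Alphabetic-only fold: map 'A'..'Z' to 'a'..'z' and leave
 * every other byte untouched. */
static unsigned char demo_fold_alpha(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + 0x20) : c;
}

/* Case-insensitive equality using the naive fold. */
bool demo_equal_naive(unsigned char a, unsigned char b)
{
    return demo_fold_naive(a) == demo_fold_naive(b);
}

/* Case-insensitive equality using the alphabetic-only fold. */
bool demo_equal_alpha(unsigned char a, unsigned char b)
{
    return demo_fold_alpha(a) == demo_fold_alpha(b);
}
```

Under the naive fold, '[' and '{' compare equal, and so do '@' (\x40) and '`' (\x60), the pair used in dgp's demo session, where [lsort -nocase] leaves the two elements in their input order. Under the alphabetic-only fold both pairs stay distinct, while 'A' and 'a' still compare equal.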
Attachments:
- TclUtfCasecmp-10.patch added by rmax on 2005-07-19 00:45:20.
- 1236896.patch added by dgp on 2005-07-18 21:46:05.
- TclUtfCasecmp-4.patch added by rmax on 2005-07-15 05:21:40.
- TclUtfcasecmp-3.patch added by rmax on 2005-07-15 02:16:37.
- Tcl_Utfcasecmp.patch added by rmax on 2005-07-13 20:57:43.