Tcl Source Code

Ticket UUID: 403709
Title: Improve speed of [split $string ""] (bug #131523)
Type: Patch
Version: None
Submitter: dkf
Created on: 2001-02-09 11:30:17
Subsystem: 18. Commands M-Z
Assigned To: dkf
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2016-08-22 08:11:46
Resolution: Accepted
Closed By: dkf
Closed on: 2016-08-22 08:11:46
Description:
This patch makes splitting a string into single characters use only as many objects as there are distinct characters in the input string.  It boosts performance in the short-string case, which is nice, but more importantly it turns the long-string case from intractable into quite feasible.  :^)

All tests in the current CVS HEAD are passed with this patch applied.
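
The object-sharing idea described above can be sketched in plain C. This is only an illustration under stated assumptions, not the patch itself: `CharObj`, `SplitChars`, and `FreeChars` are hypothetical names, the input is treated as plain bytes rather than Unicode characters, and a 256-entry array stands in for the hash table that the discussion below says the patch uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reference-counted Tcl value. */
typedef struct CharObj {
    char text[2];   /* the single character plus a terminating NUL */
    int refCount;   /* number of list slots sharing this object */
} CharObj;

/*
 * Split s into one list element per byte, allocating only one CharObj
 * per distinct byte value. Returns a malloc'ed array of strlen(s)
 * pointers and stores the number of objects allocated in *distinct.
 * Returns NULL (with *distinct set to 0) if an allocation fails.
 */
CharObj **SplitChars(const char *s, size_t *distinct)
{
    CharObj *cache[256] = {0};  /* assumption: array cache instead of a hash table */
    CharObj **list;
    size_t n, i, k;

    assert(s != NULL && distinct != NULL);
    n = strlen(s);
    *distinct = 0;
    list = malloc((n ? n : 1) * sizeof *list);
    if (list == NULL) {
        return NULL;
    }
    for (i = 0; i < n; i++) {
        unsigned char ch = (unsigned char) s[i];

        if (cache[ch] == NULL) {
            /* First occurrence of this character: create its object. */
            cache[ch] = calloc(1, sizeof(CharObj));
            if (cache[ch] == NULL) {
                for (k = 0; k < 256; k++) {
                    free(cache[k]);
                }
                free(list);
                *distinct = 0;
                return NULL;
            }
            cache[ch]->text[0] = (char) ch;
            (*distinct)++;
        }
        /* Every occurrence, first or repeated, shares the cached object. */
        cache[ch]->refCount++;
        list[i] = cache[ch];
    }
    return list;
}

/* Release a list made by SplitChars; n is the number of list slots. */
void FreeChars(CharObj **list, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (--list[i]->refCount == 0) {
            free(list[i]);
        }
    }
    free(list);
}
```

In this sketch the number of allocations is bounded by the number of distinct characters rather than by the length of the input, which is the property the description attributes to the patch.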
User Comments:

dgp added on 2001-02-16 00:12:54:

Patch Code - Modified - New Version

I modified the date in the ChangeLog entry, and made
a change to one of the comments ("integral type" rather
than "numeric type").

Looks like it's ready to go now.  Make it so.  Nice work.

dkf added on 2001-02-15 23:21:16:

Patch Code - Modified - New Version

Hopefully those strange 64-bit beasties will like this version better.

dgp added on 2001-02-13 05:39:03:
Note some more followups posted directly to the
tcl-bugs mailing list:

http://www.geocrawler.com/lists/3/SourceForge/7374/0/5149633/
http://www.geocrawler.com/lists/3/SourceForge/7374/0/5151315/

dgp added on 2001-02-13 00:15:46:
But why the intermediate type of int?  Why not a direct
cast:

Tcl_UniChar -> (char *)

dkf added on 2001-02-12 16:52:55:

Patch Code - Modified - New Version

The (char *) cast is correct because that is how hash tables configured with TCL_ONE_WORD_KEYS are used (as documented on the manual page).  The assumption that UNICODE characters are sensibly castable to int is pretty good; it certainly works for UCS-16 and will actually be good in practice (even on 64-bit architectures as deployed) up to UCS-32 which is, incidentally, not supported by the rest of the core anyway.  :^)

Variable names in the patch have been modified to match the core naming scheme.  My fault for naming them according to the scheme which I use in my own programs instead...

andreas_kupries added on 2001-02-12 13:40:33:
Tcl_UniChar -> int -> (char *) 

The first is safe, AFAIK, because IIRC Tcl_UniChar represents UTF16 coding of characters which is 2 bytes at most. And int is 2 bytes at least. The (char*) should be read as (void*). Tcl_CreateHashTable has a very generic interface to keys, because of the various possible key-types. As the hash is created with one-word-keys an (int*) is ok, even in the guise of a (char*).

I agree that we should change the variable names to match the core style.
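
The cast chain under discussion can be illustrated with a small self-contained sketch. It assumes a 16-bit-wide `UniChar` typedef as a hypothetical stand-in for Tcl_UniChar, and it uses `uintptr_t` as the intermediate integral type, which is a choice made here for illustration and not necessarily the type the patch settled on. `CharToKey` and `KeyToChar` are hypothetical helper names.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

typedef unsigned short UniChar;  /* hypothetical stand-in for Tcl_UniChar */

/*
 * Pack a character code into a one-word key. The value is widened to a
 * pointer-sized integer first, so the final cast to a pointer does not
 * change size, even where pointers are wider than int.
 */
char *CharToKey(UniChar ch)
{
    return (char *) (uintptr_t) ch;
}

/*
 * Recover the character code from a key made by CharToKey. The key is
 * only ever treated as an opaque word and is never dereferenced.
 */
UniChar KeyToChar(const char *key)
{
    uintptr_t word = (uintptr_t) key;

    assert(word <= USHRT_MAX);  /* must have come from a UniChar */
    return (UniChar) word;
}
```

Integer-to-pointer conversions are implementation-defined in C, so this relies on the round trip preserving the value, as it does on common platforms. Equal characters yield equal keys, which is all a table keyed on one-word values needs.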

dgp added on 2001-02-10 04:08:53:
It applies, builds, and passes 'make test' for me on
a 64-bit Alpha.

I'm a bit uneasy about all the typecasting:

Tcl_UniChar -> int -> (char *)

What are they for and are you confident they will be OK
always and everywhere?

Other than that, the variable names hent and intch don't
seem to fit with the core traditions.  Use of hPtr or
hashEntry is more common.

dkf added on 2001-02-09 18:31:13:
Please review; I think this one is OK (it certainly seems to work for me) but it is nice to have someone else give it the once over before applying it...

Attachments: