Tcl Source Code

Ticket UUID: 8e1e31eac0fd6b6c4452bc108a98ab08c6b64588
Title: lsort treats NUL chars strangely
Type: Bug
Version: 8.6.6
Submitter: m7j4k9
Created on: 2017-07-20 10:21:46
Subsystem: 17. Commands I-L
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2020-08-11 14:57:09
Resolution: Fixed
Closed By: oehhar
Closed on: 2020-08-11 14:57:09
Description:
The NUL character is not collated correctly.

The Tcl command below demonstrates that the NUL character 
is treated by lsort as if it had the bit value 0111 1111.1
-- that's right, as if it were between 127 and 128.

% join [lsort [list \0.1 \7.2 a.3 z.4 \x7F.5 \x80.6 \xFF.7]] :

.2:a.3:z.4:.5: .1:€.6:ÿ.7

Only NUL seems to be treated this oddly as confirmed by:

% for { set list {} ; set i 0 } { $i < 256 } { incr i } { lappend list [format %c.%02x $i $i] }
% lsort $list
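
The check above can also be scripted so that it reports the position of the NUL element instead of requiring the 256-element output to be read by eye. This is only a sketch: the variable names `sorted` and `pos` are illustrative, and the printed index depends on the Tcl build. On an affected 8.6 interpreter one would expect index 127 (directly after the \x7F element); on a build with the fix one would expect index 0.

```tcl
# Build the 256 test strings from the report: character i followed by ".xx".
set list {}
for {set i 0} {$i < 256} {incr i} {
    lappend list [format %c.%02x $i $i]
}

# Sort with the default comparison (the one under discussion).
set sorted [lsort $list]

# Locate the element that starts with NUL in the sorted result.
set pos [lsearch -exact $sorted [format %c.%02x 0 0]]
puts "Tcl [info patchlevel]: NUL element sorted at index $pos"
```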
User Comments:
oehhar added on 2020-08-11 14:57:09:

May I propose documenting this in the Tcl 8.6 documentation?

Since the NUL -> \xC0\x80 mapping is not normally exposed at the script level, at least documenting it would be great.

Thank you, Harald


jan.nijtmans added on 2017-11-30 13:15:57:
Now fixed in core-8-branch. Will be in Tcl 8.7.0.

Not worth backporting to 8.6.

jan.nijtmans added on 2017-11-29 09:53:54:

It turns out that for TCL_UTF_MAX == 4 (TIP #389) we need to handle the surrogate pair situation specially anyway, so we cannot avoid introducing a new internal TclStrcmp() function. Doing that, we automatically fix this bug at the same time.

So, I'm on it ....

It's fixed now in the "tip-389" branch. Probably not worth fixing this for 8.6.


sebres added on 2017-07-20 14:35:20:

In short: use the -dictionary option of lsort.

In Tcl, the string "\0" is stored internally as the UTF-8 byte sequence C0 80 (hex). See for example expr {"\0" eq [encoding convertfrom utf-8 \xc0\x80]}. This comes from special handling (a special UTF-8 table) within Tcl that differentiates between a zero byte and a zero character inside a NUL-terminated string.

But lsort (without -dictionary) compares these internal bytes not only for the zero char but for all other non-ASCII characters as well (e.g. umlauts).
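
To make the effect of the C0 80 representation visible at the script level, the byte-wise ordering can be simulated. This is a sketch under stated assumptions, not how lsort is implemented: `modUtf8Key` and `cmpInternalBytes` are hypothetical helpers that rebuild an approximation of the internal bytes (standard UTF-8, with the zero byte replaced by C0 80) and compare those. With this simulated key, NUL lands between \x7F (byte 7F) and \x80 (bytes C2 80), which matches the order in the report.

```tcl
# Hypothetical helper: approximate the internal byte sequence of a string,
# returned as a string holding one character per byte.
proc modUtf8Key {s} {
    # Standard UTF-8 bytes first; NUL is expected to appear as byte 00 here.
    set bytes [encoding convertto utf-8 $s]
    # The internal form stores NUL as the two bytes C0 80 instead.
    return [string map [list \x00 \xC0\x80] $bytes]
}

# Hypothetical comparator: order elements by their simulated internal bytes.
proc cmpInternalBytes {a b} {
    string compare [modUtf8Key $a] [modUtf8Key $b]
}

set simulated [lsort -command cmpInternalBytes [list "\0.1" "\x7F.5" "\x80.6"]]
# The NUL element ends up in the middle: \x7F.5, \0.1, \x80.6
```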

Possibly the following example can help you do the sorting using byte comparison...


% lsort [list "\0 1" "\x7F 2" "\x80 3"]
{⌂ 2} {  1} {? 3}
% lsort -dictionary [list "\0 1" "\x7F 2" "\x80 3"]
{  1} {⌂ 2} {? 3}
% proc sortbybyte {a b} {expr {[scan $a %c] - [scan $b %c]}}
% lsort -command sortbybyte -index 0 [list "\0 1" "\x7F 2" "\x80 3"]
{  1} {⌂ 2} {? 3}
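
Note that sortbybyte only looks at the first character of each element (scan with %c reads a single character). If whole strings need to be ordered by code point, a comparator along the following lines could be used. This is a sketch; `cmpCodePoints` is a hypothetical name and not part of Tcl, and it compares character by character in script code, so it will be much slower than the built-in comparison.

```tcl
# Hypothetical comparator: compare two strings code point by code point.
proc cmpCodePoints {a b} {
    set la [string length $a]
    set lb [string length $b]
    set n [expr {min($la, $lb)}]
    for {set i 0} {$i < $n} {incr i} {
        # scan %c yields the numeric code of one character (NUL gives 0).
        scan [string index $a $i] %c ca
        scan [string index $b $i] %c cb
        if {$ca != $cb} {
            return [expr {$ca < $cb ? -1 : 1}]
        }
    }
    # Common prefix is equal: the shorter string sorts first.
    return [expr {($la > $lb) - ($la < $lb)}]
}

set byCodePoint [lsort -command cmpCodePoints [list "\x80 3" "\0 1" "\x7F 2"]]
# NUL element first, then \x7F, then \x80
```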

@TCT, @Jan: should we change the default sorting algorithm to take this byte sequence into account, so that it is sorted as the first char in UTF-8?
I think not...