Ticket UUID: | 8e1e31eac0fd6b6c4452bc108a98ab08c6b64588 | |||
Title: | lsort treats NUL chars strangely | |||
Type: | Bug | Version: | 8.6.6 | |
Submitter: | m7j4k9 | Created on: | 2017-07-20 10:21:46 | |
Subsystem: | 17. Commands I-L | Assigned To: | jan.nijtmans | |
Priority: | 5 Medium | Severity: | Minor | |
Status: | Closed | Last Modified: | 2020-08-11 14:57:09 | |
Resolution: | Fixed | Closed By: | oehhar | |
Closed on: | 2020-08-11 14:57:09 | |||
Description: |
The NUL character is not collated correctly. The Tcl command below demonstrates that the NUL character is treated by lsort as if it had the bit value 0111 1111.1 -- that's right, as if it were between 127 and 128.

% join [lsort [list \0.1 \7.2 a.3 z.4 \x7F.5 \x80.6 \xFF.7]] :
.2:a.3:z.4:.5: .1:.6:ÿ.7

Only NUL seems to be treated this oddly, as confirmed by:

% for { set list {} ; set i 0 } { $i < 256 } { incr i } { lappend list [format %c.%02x $i $i] }
% lsort $list | |||
User Comments: |
oehhar added on 2020-08-11 14:57:09:
(text/x-fossil-wiki)
May I propose documenting this in the Tcl 8.6 documentation? As the NUL->\xc080 mapping is not normally exposed at the script level, at least a note in the documentation would be great.

Thank you, Harald

jan.nijtmans added on 2017-11-30 13:15:57:
Now fixed in core-8-branch. Will be in Tcl 8.7.0. Not worth backporting to 8.6.

jan.nijtmans added on 2017-11-29 09:53:54:
(text/x-fossil-wiki)
It turns out that for TCL_UTF_MAX == 4 (TIP #389) we need to handle the surrogate pair situation specially anyway, so we cannot avoid introducing a new internal TclStrcmp() function. Doing that, we automatically handle this bug at the same time. So, I'm on it ....

It's fixed now in the "tip-389" branch. Probably not worth fixing this for 8.6.

sebres added on 2017-07-20 14:35:20:
(text/x-fossil-wiki)
In short: use the <code>-dictionary</code> option of <code>lsort</code>.

In Tcl the string <code>"\0"</code> is the UTF-8 sequence c080 hex... See for example <code>expr {"\0" eq [encoding convertfrom utf-8 \xc0\x80]}</code>

This is due to special handling (resp. a special UTF-8 table) within Tcl to differentiate between the zero byte and the zero-NTS-character. But <code>lsort</code> (without <code>-dictionary</code>) will do that not only for the zero char, but for all other non-ASCII characters as well (e.g. umlauts, etc.).

Possibly the following example can help you sort using byte comparison...
<pre><code>
% lsort [list "\0 1" "\x7F 2" "\x80 3"]
{⌂ 2} { 1} {? 3}
% lsort -dictionary [list "\0 1" "\x7F 2" "\x80 3"]
{ 1} {⌂ 2} {? 3}
% proc sortbybyte {a b} {expr {[scan $a %c] - [scan $b %c]}}
% lsort -command sortbybyte -index 0 [list "\0 1" "\x7F 2" "\x80 3"]
{ 1} {⌂ 2} {? 3}
</code></pre>

@TCT, @Jan: should we change the default sorting algorithm to take this byte sequence into account, so that it is sorted as the first char in UTF-8?<br/> I think not... |
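
Editorial note: the reported ordering follows from comparing the internal bytes, as the comments above describe. The sketch below models that comparison at script level. It assumes that NUL is stored as the two bytes C0 80 and that every other character uses ordinary UTF-8, and it only covers code points up to U+07FF. The procs <code>internalBytes</code> and <code>compareInternal</code> are hypothetical names made up for this illustration; they are not part of the Tcl core and do not reproduce the actual C implementation.

```tcl
# Illustration only: model the byte order of Tcl's internal string form
# for code points U+0000..U+07FF. Both procs are hypothetical helpers.
proc internalBytes {str} {
    set bytes {}
    foreach ch [split $str ""] {
        scan $ch %c code
        if {$code == 0} {
            # Assumption from the ticket: NUL is stored as C0 80, not 00
            lappend bytes 0xC0 0x80
        } elseif {$code < 0x80} {
            # ASCII is stored as a single byte
            lappend bytes $code
        } else {
            # Ordinary two-byte UTF-8 sequence (valid up to U+07FF)
            lappend bytes [expr {0xC0 | ($code >> 6)}] \
                          [expr {0x80 | ($code & 0x3F)}]
        }
    }
    return $bytes
}

# Compare two strings byte by byte, the way a memcmp-style sort would.
proc compareInternal {a b} {
    foreach x [internalBytes $a] y [internalBytes $b] {
        if {$x eq ""} { return -1 }  ;# a is a proper prefix of b
        if {$y eq ""} { return 1 }   ;# b is a proper prefix of a
        if {$x != $y} { return [expr {$x < $y ? -1 : 1}] }
    }
    return 0
}

# Same input as in the ticket description.
set items [list \0.1 \7.2 a.3 z.4 \x7F.5 \x80.6 \xFF.7]
set tags {}
foreach s [lsort -command compareInternal $items] {
    lappend tags [string index $s end]
}
puts $tags
```

Under these assumptions the NUL entry (tag 1) lands between the \x7F entry (tag 5) and the \x80 entry (tag 6), because C0 is greater than 7F while C0 80 is less than C2 80, the UTF-8 form of \x80. That matches the order reported in the description.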