Tcl Source Code

Ticket UUID: d43f96c1a89be7bdcf3dc791898df3760e083afe
Title: string trimright is broken in 8.6.11
Type: Bug
Version: 8.6.11
Submitter: chw
Created on: 2021-02-14 05:19:56
Subsystem: 44. UTF-8 Strings
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2021-04-19 13:45:40
Resolution: Fixed
Closed By: jan.nijtmans
Closed on: 2021-04-19 13:45:40
Description:
set s "\ud83d\ude02A\ud83d\ude02"
-> 😂A😂

binary encode hex [encoding convertto utf-8 [string trimright $s]]
-> f09f988241f09f9882

binary encode hex [encoding convertto utf-8 [string trimright $s \ud93d\ude02]]
-> f09f988241eda0bd
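
To make the two hex dumps above easier to read, here is a minimal sketch that decodes them into code points. It is written in Python rather than Tcl, because the Tcl result depends on the version under discussion. The helper name code_points is purely illustrative and is not part of the ticket or of Tcl.

```python
# Hex strings copied verbatim from the ticket description.
untrimmed = bytes.fromhex("f09f988241f09f9882")
trimmed = bytes.fromhex("f09f988241eda0bd")

def code_points(data):
    """Decode bytes and list the code points (hypothetical helper).

    "surrogatepass" lets Python accept the 3-byte encoding of a lone
    surrogate, which strict UTF-8 decoding would reject.
    """
    text = data.decode("utf-8", errors="surrogatepass")
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points(untrimmed))  # ['U+1F602', 'U+0041', 'U+1F602']
print(code_points(trimmed))    # ['U+1F602', 'U+0041', 'U+D83D']
```

The second result ends in U+D83D, a lone high surrogate: only the low surrogate U+DE02 was removed, so the trailing emoji was split in half and the result is no longer valid UTF-8.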
User Comments: jan.nijtmans added on 2021-04-19 13:45:40:

Sorry, @pooryorick, I don't see anything in your information that points to this ticket as the cause of your problem. Can you file a new ticket?

I don't see any clue why [836cfc20b37d29abf] (the TIP 590 implementation) can cause this.

closing (again)


pooryorick added on 2021-04-18 14:40:11:

A bisect session adding the new tests to each checkout and testing each build with a full "make valgrind" run reveals that the segmentation fault began with [836cfc20b37d29abf].


jan.nijtmans added on 2021-04-16 09:42:53:

The only thing [e10ffc2513de4b76] does is add test cases. Do these test cases expose a bug that has already been present for longer?


pooryorick added on 2021-04-15 21:10:37:

Ever since [e10ffc2513de4b76], under a build with --enable-symbols and -DPURIFY, "make valgrind" results in a huge number of "still reachable" reports, and valgrind itself ends in a segmentation fault.


jan.nijtmans added on 2021-02-15 17:02:45:

Fixed now in core-8-6-branch and higher


jan.nijtmans added on 2021-02-15 12:13:56:

It turns out that this bug is already fixed in the tip-575 branch, but that TIP has not been voted on yet. Tcl_UtfPrev() is known not to work well with emoji; that is what TIP #575 is meant to fix, and it is also the cause of this bug. As a side effect of TIP #575, this bug is fixed without any additional code changes.


jan.nijtmans added on 2021-02-15 08:13:28:

It turns out this isn't fully fixed in 8.7 yet. So, some work to do!

Christian, thanks for the testcases and the patch!


sl1200mk2 added on 2021-02-15 05:55:54:
Hi,
here's just a quick remark... please forgive me if it's off topic, but as I'm having issues with emojis I thought you should know...
I had a discussion with Csaba (the author of the TableList package) about emojis and here's what he told me:

"I would like to add that I have also built a Tk 8.7a4 version on Linux,
using Tcl 8.7a4.  This combination doesn't eliminate the emoji-related
problems.  The only build that works for me as expected with respect to
the emojis is Tk 8.7a4 built from trunk, using Tcl 9.0a2 built from
trunk as well."

So it would be good to move to Tcl 9, or for Tcl 8.7 to behave like Tcl 9 regarding emojis.

++
nicolas

chw added on 2021-02-14 20:06:56:
Jan, I am not in a position to agree to anything. I have to
take things as they are or fork. For me, 8.6.11 is definitely
burned, and a no-go area as explained. I can wait for the
hopefully soon-to-be-released 8.7, which fixes all those subtle
problems. But that again is my humble, not relevant opinion.
However, there may be distros (Linux, whatever) out in the wild
that adopt the latest and greatest version (8.6.11) and pick up
and spread the unfixed things, so beware!

jan.nijtmans added on 2021-02-14 17:31:41:

This one is already fixed in 8.7, and it's difficult to fix in 8.6 without more internal changes. It's not a regression: all 8.6.x releases have the same behavior.

Would you agree to close this as "Fixed in 8.7"? I don't think it's urgent enough to be backported to 8.6.


chw added on 2021-02-14 15:03:08:
After trying my two latest POC patches with 'CFLAGS=-DTCL_UTF_MAX=4 configure ...'
and seeing new, different errors pop up, I will now give up on trying to fix things
until somebody on the TCT decides how to proceed towards an ascertained future.

However, let me nevertheless express, for the record, my not relevant opinion on how
to support full Unicode (beyond the BMP): TIP#389 was the wrong direction,
since it tries to hide, if not muddle, the problem of representation. It is
crucial to distinguish between internal and external representation. If the external
representation shall be Unicode with all bells, whistles and smileys, then the
step from internal to external must fulfill this requirement regardless of what
needs to be done internally. If surrogate pairs are used internally, so what. But
hopefully with all consequences and consistently.

Regarding indexing of "characters": still no problem that a beyond BMP character
accounts for 2 instead of 1 provided that everything(!) is consistent and clearly
documented.
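
As a small illustration of the indexing point above, the sketch below counts the same string once in code points and once in UTF-16 code units, and shows how a code point beyond the BMP splits into a surrogate pair. It is written in Python only to show the arithmetic; the function names utf16_units and to_surrogates are hypothetical and say nothing about how Tcl implements this internally.

```python
def utf16_units(text):
    """Count UTF-16 code units: each unit occupies 2 bytes in UTF-16-LE."""
    return len(text.encode("utf-16-le")) // 2

def to_surrogates(cp):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

s = "\U0001F602A\U0001F602"   # the string from the ticket description

print(len(s))                                    # 3 code points
print(utf16_units(s))                            # 5 code units
print([hex(u) for u in to_surrogates(0x1F602)])  # ['0xd83d', '0xde02']
```

Under a surrogate-pair representation, each emoji counts as 2 when indexing, so the string has length 5 instead of 3. Whichever count is chosen, every string operation has to use the same one.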

Another minor point: due to the aforementioned problems there never ever will be
an AndroWish/undroidwish release based on 8.6.11. It will be either 8.6.12 (if
obvious problems of 8.6.11 got fixed) or a version of 8.7 (if it happens to not
take over obvious problems of 8.6.11).

chw added on 2021-02-14 10:02:55:
Why do I insist on 8.6.11? Because it pretends to support characters
beyond the BMP thanks to a working "puts \ud83d\ude02", despite severe
deficiencies under the hood. In my not relevant opinion this step had
better been done in 8.7. Anyway, a POC patch against the core-8-6-11
branch is in the attached stringtrim.diff.

Attachments: