Tk Source Code

View Ticket
Login
Bounty program for improvements to Tcl and certain Tcl packages.
Ticket UUID: 6c0d7aec6713ab6a7c3e12dff7f26bff4679bc9d
Title: unicode text input
Type: Bug Version: 8.6
Submitter: chw Created on: 2016-02-17 07:16:57
Subsystem: 44. Generic Fonts Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Minor
Status: Open Last Modified: 2017-05-30 20:45:23
Resolution: None Closed By: nobody
    Closed on: 2016-09-29 12:43:53
Description:
I've added some code in .../win/tkWinX.c in AndroWish's source tree to deal with wParam VK_PACKET in WM_KEYDOWN and WM_SYSKEYDOWN events. This was observed on a Windows 8.1 tablet when the on-screen virtual keyboard is used to enter Emojis for example. The changes allow to enter Emojis when Tcl/Tk is built with TCL_UTF_MAX>3. For more information refer to <a href="http://www.androwish.org/index.html/fdiff?v1=24f8be773411aae9&v2=88aba802462ae980&sbs=1">this diff</a>.
User Comments: budden added on 2017-05-30 20:45:23:
Posted patch at the ticket mentioned by fvogel. Didn't try tests.

fvogel added on 2017-05-30 17:50:22: (text/x-fossil-wiki)
See also [62f1343ad2]

budden added on 2017-05-30 17:31:45:
Also I noted that mention of PenPower in comments to tkWinX.c is not in sync with this description of PenPower issues: https://bugs.openjdk.java.net/browse/JDK-4710727, and code is not look like the right one. JDK says that PenPower sometimes passes two WM_CHARs, while code in GetTranslatedKey looks for wParam>0xff.

budden added on 2017-05-30 16:30:07:
Hi! I'm not sure I'm facing the same issue, but let me try to comment here first. 

My issue is like this one:

https://stackoverflow.com/questions/16004920/when-i-type-non-ascii-characters-using-a-windows-keyboard-i-get

and this:

https://stackoverflow.com/questions/16641361/where-in-the-tkinter-application-stack-is-a-question-mark-substituted-in-p

In my case, it happened when I created my own keyboard layout with Unicode chars not present in my local character set.

After some debugging, I believe information is lost in a GetTranslatedKey function, where PeekMessageA is called instead of PeekMessageW and msg.wParam is  truncated by Windows. When I replaced PeekMessageA with PeekMessageW, I get the correct Unicode char in wParam, but later it goes into conversion and gets broken back. I "fixed" that but obviously broken many things in a neighbourhood, so I suggest no patch. The idea was to alter trans_keys to be an array of unsigned int instead of char and removing some translation. It seemed to work for me, but my knowledge is insufficient to assess this solution in general.

jan.nijtmans added on 2017-05-02 14:45:32: (text/x-fossil-wiki)
It turns out that surrogate handling isn't 100% right, still. Since Tcl uses [https://en.wikipedia.org/wiki/UTF-8#WTF-8|WTF-8] encoding (actually, for TCL_UTF_MAX == 3) , there are a lot of places in Tk where surrogate pairs need to be handled especially.

Thanks, Christian, for pointing this out.

Therefore, re-opening this ticket, and changing the title: The problem is not only on Windows 8 but more generic.

jan.nijtmans added on 2016-09-29 12:43:53:
Merged to core-8-6-branch and trunk.

jan.nijtmans added on 2016-09-28 08:57:00: (text/x-fossil-wiki)
Located many more places in Tk, where surrogate handling can be improved further. See: [http://core.tcl.tk/tk/vdiff?from=5ff8c90caff4d615&to=7d11da38b1366d28&sbs=1]

The idea is to split Unicode characters > 0xffff in two surrogates when converting to UTF-8, and joining the surrogates back again when converting UTF-8 to Unicode. This way emoji's can be handled reasonably well without further tricks, whether TCL_UTF_MAX is 3, 4 or 6.

Feedback welcome! I intend to test this a little bit more, and then merge it to core-8-6-branch and trunk.

jan.nijtmans added on 2016-09-23 14:55:32:
Almost forgot that the fix can be improved still

chw added on 2016-09-14 18:41:17:
I confirm that [fc60f038] works on Windows 8.1 (32 bit) with TCL_UTF_MAX set to 3, 4, and 6. However, I'd prefer not to use XMaxTransChars in TkpGetString() since it is in no way correlated with the various Tcl_Utf/Unicode* APIs.

Further, I propose to change the line from

  if ((keyEv->keycode <= 0xffff) || (len == XMaxTransChars))

to

  if ((keyEv->keycode <= 0xffff) || (len > 3))

which catches all cases where Tcl_UniCharToUtf produces more bytes for codepoints beyond the BMP and is open enough to address even the case when it is decided to drop unicode and extend it to multicode in order to be able to display intergalactic symbols like Klingon-ish, Vogon-ish and so on ;-).

When using TCL_UTF_MAX=3, it is possible to enter (and display correctly) Emojis, although redisplay on editing in text and entry widgets shows strange intermediate results. Thus, it is debatable if we better should always map codepoints beyond the BMP and/or surrogate pairs to 0xFFFD.

jan.nijtmans added on 2016-09-13 09:06:36:
Waiting for Christian's confirmation that everything works correctly, then it can be closed

fvogel added on 2016-09-10 09:37:52: (text/x-fossil-wiki)
Couldn't this ticket be closed now that [fc60f038] merged the fix to core-8-6-branch and [aba25651] to trunk?

jan.nijtmans added on 2016-08-31 10:08:50:
Better code committed now, which is symmetric wrt WM_CHAR/WM_UNICHAR, but can be switched in two different behaviors:

USE_EXTRA_EVENTS=0:  Events have no trans_chars when beyond BMP, this will be re-constricted from the keycode (as two surrogates) when the event is received.
USE_EXTRA_EVENTS=1: Your suggestion, splitting the events in higher/lower surrogate pairs.

I think USE_EXTRA_EVENTS=0 is simpler and prevents the need for additional events. But feel free to prove me wrong ;-)

chw added on 2016-08-30 20:00:13:
Regarding WM_UNICHAR: if TCL_UTF_MAX=3 we have two options (points 1 and 2):

0. always test Unicode upper bound (<= 0x10FFFF)
1. translate everything beyond BMP to 0xFFFD (which gives a 3 byte UTF-8 sequence)
2. translate everything beyond BMP to a surrogate pair and call Tk_QueueWindowEvent() twice with these two XKeyEvents.

For TCL_UTF_MAX>3 the only required test is point 0 above, since data will fit into at most 4 byte UTF-8 sequences.

Point 2 for TCL_UTF_MAX=3 has the benefit of being symmetric to my last comment. I.e. regardless of WM_CHAR or WM_UNICHAR messages, code points beyond BMP would appear as a (not specially handled) surrogate pair (and my observation was that Windows is capable of properly rendering it although the Tcl "string length" command counts wrong).

chw added on 2016-08-30 19:42:28:
Please consider: when TCL_UTF_MAX=3 we don't treat surrogate pairs specially, but pass each of both 16 bit items through independently, and will generate at most a 3 byte UTF-8 sequence for each of them in separate XKeyEvents (and separate invocations of GenerateXEvent(). Only the TCL_UTF_MAX>3 cases need to deal with surrogate pairs and can produce a maximum 4 byte long UTF-8 sequence. And this should fit always into the XKeyEvent.trans_chars array. For the first item of a surrogate pair it is remembered in thread local storage without producing an XKeyEvent. The second item completes the surrogate pair using that remembered information and produces a 4 byte UTF-8 sequence in the XKeyEvent.

jan.nijtmans added on 2016-08-30 10:26:38: (text/x-fossil-wiki)
Maybe I found a smarter solution here: [d004b5245b777412] If trans_chars[] is not big enough, don't store anything there but let the receiver calculate the surrogate pair characters based on the keycode. Feedback appriciated!

jan.nijtmans added on 2016-08-30 07:46:31:
Thanks, Christian, patch applied. This should indeed almost work with TCL_UTF_MAX=3, except for the function GenerateXEvent(): In the WM_CHAR case, the windows event should be translated to a series of X-events, but because the trans_chars[] field of XKeyEvent only has 4 places, we cannot store two UTF8 characters there (upper and lower surrogate). So we have to transmit the upper and lower half of the surrogates as separate X events. Or is there something I'm missing?

chw added on 2016-08-29 21:27:04:
The following patch against [rfe-6c0d7aec67] fixes TCL_UTF_MAX=6 and seems to work with TCL_UTF_MAX=3, too (even displaying Emojis properly).

--- old/win/tkWinX.c    2016-08-29 10:02:57.844491784 +0200
+++ new/win/tkWinX.c    2016-08-29 21:58:21.505193164 +0200
@@ -81,6 +81,9 @@
     TkDisplay *winDisplay;     /* TkDisplay structure that represents Windows
                                 * screen. */
     int updatingClipboard;     /* If 1, we are updating the clipboard. */
+#if TCL_UTF_MAX >= 4
+    int surrogateBuffer;       /* Buffer for first of surrogate pair. */
+#endif
 } ThreadSpecificData;
 static Tcl_ThreadDataKey dataKey;
 
@@ -1211,20 +1214,18 @@
                char buffer[TCL_UTF_MAX+1];
 
 #if TCL_UTF_MAX >= 4
-               if ((((int)wParam & 0xfc00) == 0xd800)
-                       && (PeekMessage(&msg, NULL, 0, 0, PM_NOREMOVE) != 0)
-                       && (msg.message == WM_CHAR)) {
-                   MSG msg;
-                   int ch2;
-
-                   GetMessage(&msg, NULL, 0, 0);
-                   ch2 = wParam & 0xffff;
-                   ch1 = ((ch1 & 0x3ff) << 10) | (ch2 & 0x3ff);
-                   ch1 += 0x10000;
-                   event.xkey.nbytes = Tcl_UniCharToUtf(ch1, buffer);
-               } else
+               if ((ch1 & 0xfc00) == 0xd800) {
+                   tsdPtr->surrogateBuffer = ch1;
+                   return;
+               }
+               if ((ch1 & 0xfc00) == 0xdc00) {
+                   ch1 = ((tsdPtr->surrogateBuffer & 0x3ff) << 10) |
+                         (ch1 & 0x3ff);
+                   ch1 += 0x10000;
+                   tsdPtr->surrogateBuffer = 0;
+               }
 #endif
-                   event.xkey.nbytes = Tcl_UniCharToUtf(ch1, buffer);
+               event.xkey.nbytes = Tcl_UniCharToUtf(ch1, buffer);
                for (i=0; i<event.xkey.nbytes && i<XMaxTransChars; ++i) {
                    event.xkey.trans_chars[i] = buffer[i];
                }

chw added on 2016-08-29 16:13:56:
I definitely can confirm now: all AndroWish check-ins since about March (when I initially posted the patch) suffer from the same problem: TCL_UTF_MAX=6 does not fully work regarding the Windows on-screen keyboard since strings are build up as surrogate pairs instead of expanded non-BMP UTF-8 sequences. My apologies for the noise.

chw added on 2016-08-29 16:06:38:
Sorry guys, testing Jan's latest change I'm totally confused now. It seems on my tablet that all Emojis are entered as surrogate pairs when typed into the Tk console window. The text rendering shows nice Emojis, however the string length is wrong when TCL_UTF_MAX=6 is in place. And when a selection is activated on Emojis the surrogate pairs become visible by two replacement characters. When I paste an Emoji sequence from e.g. an undroidwish (SDL2 based) into a plain wish, the string length is correct. I'm afraid this means "Go back to start, do not pass go", bummer.

jan.nijtmans added on 2016-08-29 12:46:40:
So, it's a bug then.... That means, that it should (preferably) be applied to core-8-6-branch, and it should (preferably) work with TCL_UTF_MAX of 3, 4 and 6. Manual inspection shows me, that the original code only works correctly for TCL_UTF_MAX=6, I now adapted it to work form TCL_UTF_MAX=4 as well. I will try to make the corresponding changes for TCL_UTF_MAX=3 as well. Hopefully I didn't break anything (Christian, can you confirm it still works for TCL_UTF_MAX=6 ?)....  stay tuned.

So, I rebased it to core-8-6-branch now. Any more remarks?

oehhar added on 2016-08-28 20:21:41: (text/x-fossil-wiki)
<h1>To bug or not to Bug</h1>
I answered the question:
<h2>TIP or Bug</h2>
with the second choice.

For me, the match should just be merged. It is great - point.

Maybee, it is only required for 32 bit windows, 64 bit will directly use the unicode instead of two surrogates, but this is not a problem I suppose.

Thank you for the great patch and work !

Harald

chw added on 2016-08-27 13:39:01:
To bug or not to bug is not the question ;-) VK_PACKET was observed on Win 8.1 32 bit on a DELL Venue 8 Pro tablet when entering Emojis on the stock on-screen keyboard. Only when the WM_(SYS)KEYDOWN message containing VK_PACKET is re-posted the unicode of the Emoji comes out of the event stream as two WM_CHAR events forming a surrogate pair of the Emoji.

oehhar added on 2016-08-26 17:41:27: (text/x-fossil-wiki)
<h1>Bug?</H1>
For me, it is a bug
<h1>Test</h1>
For a test, one may se the C-Code in [http://stackoverflow.com/questions/22291282/using-sendinput-to-send-unicode-characters-beyond-uffff] (well, it is C gravejard, but the same stuff exists in C)

It does the oposite of the patch, send two surrogates one after the other. This may also be done using two send/postMessage commands.

I don't know how other Tk tests work and if the sendkey command is used there.
Those might be extended or used.

Apparently this is only an issue on 32 bit issue. On 64 bit, one may send the unicode by one message.

Just my 2 pennies,
Harald

jan.nijtmans added on 2016-08-26 08:40:07:
I'll have a look. Thanks! According to TIP #0, any project which has two sponsors from the TCT and no objections just can go on. This one has two sponsors (francois' and mine) and no objections. A TIP is just a way to execute the votes, but in my view a ticket can be used just as well for that for minor issues.

Stay tuned ....

chw added on 2016-08-24 05:10:45:
Thansk for taking the patch into consideration. Regarding VK_PACKET my interpretation of MSDN is that it can be used by input methods such as on-screen keyboards or handwriting recognizers but usually not by normal keyboards. Thus, test cases might require some additional test instrumentation code which allows to pretend an input method.

fvogel added on 2016-08-23 20:24:52: (text/x-fossil-wiki)
Since I was assigned this bug:

- I have applied the patch on a branch named [rfe-6c0d7aec67] that I have growed off trunk.

- I had a cursive look at the patch. It was immediately apparent to me that the OP knows better than me what he is doing here. Givin who this OP is (chw) I think we can perfectly trust him.

- I have nevertheless run the test suite of Tk both with and without the patch and it runs completely clean for me on Windows Vista in both cases. Note however that The current Tcl in trunk has TCL_UTF_MAX = 3, and I didn't try other cases.
I have no plaform running Windows 8 to test this further.

Questions I have now:

1. Does this need a TIP or can we consider that Tk not managing  wParam == VK_PACKET  for  WM_SYSKEYDOWN  and  WM_KEYDOWN  events is a bug of Tk? I'm inclined to go for the latter.

2. Is it possible to write test cases for this new code?

Attachments: