Ticket UUID: | a58737f57bf740979e0ed091118508188b8c28f5 | |||
Title: | The macOS implementation of Tcl_WaitForEvent is incorrect. | |||
Type: | Bug | Version: | 8.6.10 | |
Submitter: | marc_culler | Created on: | 2020-07-10 22:08:56 | |
Subsystem: | 02. Event Loops | Assigned To: | marc_culler | |
Priority: | 5 Medium | Severity: | Important | |
Status: | Closed | Last Modified: | 2020-07-18 16:58:34 | |
Resolution: | Fixed | Closed By: | marc_culler | |
Closed on: | 2020-07-18 16:58:34 | |||
Description: |
The platform specific function Tcl_WaitForEvent (in tclMacOSXNotify.c for macosx) is called from Tcl_DoOneEvent, the core of the Tcl event loop. It gets called in Tcl_DoOneEvent after first checking that there are no asynchronous events and no window events queued. Before calling Tcl_WaitForEvent the setup proc for each available event source is called. Then the check proc is called for each event source. The main job of the setup proc is to set the timeout value which will be passed to Tcl_WaitForEvent. On macOS if there are any events in the Apple window event queue then the timeout is set to 0, meaning that Tcl_WaitForEvent should check for file events and return immediately with a positive return value if any file events have been recorded by an auxiliary notification thread which runs select in the background and records any active files descriptors in a thread specific data structure. If the Apple window event queue is empty then the timeout will be non-zero and Tcl_WaitForEvent will wait until the timeout expires. In actual fact, however, the original Cocoa implementation of Tcl_WaitForEvent did not return immediately when the timeout was set to 0. Instead, before polling for file events, the following loop was run: while (Tcl_ServiceAll() && tsdPtr->waitTime == 0) {} This code predates the release of Cocoa and was probably following the instructions in the Tcl_ServiceAll man page regarding how to embed Tcl in an application which has its own private event loop. There are two issues here. First, the man page does not recommend calling Tcl_ServiceAll in a loop. Doing that would appear to run the risk of a hang, in cases where servicing one event causes a new event to be queued. Indeed, there were many reports of macOS hanging and Kevin Walzer found that calling Tcl_ServiceAll once, not in a loop, reduced the incidence of hangs. So, in commit [c699f98468] in 2015, Kevin removed the loop and replaced it by a single call with noticeable improvement in the behavior. The second issue is more subtle. While it was probably true in the Carbon implementation of Tcl that Tcl actually was being embedded in an application with a private event loop, I think that ceased to be the case with the release of Cocoa. It is true that the NSApplication object has a private event loop which can be started by calling the function run(). But Tcl never calls that function. The run() method handles dispatching of Apple NSEvents to AppKit objects such as NSView and NSWindow. But that was not helpful for Tk since what was needed was to translate NSEvents into XEvents and add them to the Tcl queue so Tk could process them. Instead of running Apple's private event loop, the Cocoa implementation of Tk emulates the Apple run() loop within the check proc for the Aqua event source. The check proc removes NSEvents from the Aqua queue manually and translates them into XEvents before calling NSApplication sendEvent to allow the NSEvents to be processed by the window manager. The adaptation of TclWaitForEvent to account for this change was never done. While Kevin's change in 2015 reduced the number of hangs observed in Tk considerably, they still persisted. Moreover, the design of Tcl_WaitForEvent caused a lot of randomness in when the [NSView drawRect] method was being called. After Christopher Chavez noticed that the drawRect method only gets run during calls to CFRunLoopinMode or NSNextEventMatchingMask (used in the setup proc and the check proc) I was eventually led to study what was actually happening in Tcl_WaitForEvent. (With the release of OSX 10.14 (Mojave) it became critical to control when drawRect gets called because Apple made changes for that release which caused it to be the case that graphics calls which were not made from within drawRect were ignored; those calls would have no effect on the appearance of any application windows, causing Tk to do nothing visible other than display empty black windows.) My conclusion was that TclServiceAll should never be called in TclWaitForEvent. This is consistent with the fact that Tcl is really not being embedded in an application with its own private event loop. I have made this change to tclMacOSXNotify.c in the branch named macOS_hangs. I have tested extensively both with regression tests and real applications. There are no noticeable bad effects in Tcl. All regression tests pass. The branch also makes some unrelated changes having to do with the fact that the NSSpinLock was deprecated by Apple with the release of OSX 10.15 (Catalina) and replaced by the os_unfair_lock. The additional changes allow the use of os_unfair_lock in place of NSSpinLock on systems which support this more efficient new lock type. This change to Tcl, together with changes to how [NSView drawRect] is handled in Tk, led to significant improvements in Tk. Some regression test failures disappeared including those reported in [06d8246baf] and [37a27ddb2] and [aedc9f6fa4] as did many graphical artifacts which were reported for example in tickets [2a61eca3a]. These changes are in the Tk branch idle_curiosity. I have extensively tested the combination of idle_curiosity and macOS_hangs for several months as has Nicolas Bats and, I believe, others. I would like to see both of these merged into the core_8_6_branches before the release of 8.6.11. I would be very grateful for reviews, comments and testing. | |||
User Comments: |
marc_culler (claiming to be Marc Culler) added on 2020-07-18 16:58:34:
The macOS_hangs branch has been merged into core-8-6-branch and core-8-branch. I am closing the ticket. marc_culler (claiming to be Marc Culler) added on 2020-07-17 22:26:38: The drawing artifacts turned out to be related to withdrawing windows before configuring them and then deiconifying them later. The problem appears to have been fixed in Tk by making some small changes to WmWithdrawCmd and TkpWmSetState. chrstphrchvz added on 2020-07-17 03:12:42: Cc'ing myself for updates. Also fixed version field (10.6.10 → 8.6.10). kevin_walzer added on 2020-07-17 02:19:21: Marc, I built the new tip of idle_curiosity and don't see the flicker with the widget demo anymore. However, my Ruby app is still affected. I certainly don't expect you to debug a Ruby app, but the attached screen shot shows what's going on. marc_culler (claiming to be Marc Culler) added on 2020-07-16 14:39:57: Thank you, Kevin. This glitch was caused by the latest commit to the Tk idle_curiosity branch. I believe it is now working again. Please retest with the current tip of idle_curiosity. I do not see any problems with the animated labels when I use that tip. Sorry! kevin_walzer added on 2020-07-16 13:29:31: With the combination of these two branches, I see random flashes and redraw issues in the "Animated Label Demonstration" in the widget demo. Just launch and let it run, the animated labels will briefly flash black before redrawing. I'm not sure there's a pattern, nor can it be reproduced except by letting it run and watch for a bit. I also have a Ruby-Tk application that consistently shows redraw issues upon startup that were not there before installing these branches - a large portion of the window is black and will not draw until you mouse over that section of the window. Similarly, I don't have a sample script to reproduce, but can make a download available for testing. kevin_walzer added on 2020-07-14 02:12:32: After pulling in the commits from today in idle_curiosity, Wish no longer locks with this script. I'll keep poking around just to see if anything crops up. marc_culler (claiming to be Marc Culler) added on 2020-07-13 13:30:31: Thank you, Kevin. I am not able to reproduce this hang. I start wish. I source the file that you pasted in. I press Run. The left counter counts up. I press Pause. The right counter says 2. I am able to enter commands at the terminal and they are executed. I do not see a beachball. I am using the current tips of idle_curiosity and macOS_hangs on Catalina 10.15.5. ???? kevin_walzer added on 2020-07-13 12:24:12: Can't figure out how to attach a file - any suggestions? Anyway, here is the demo code. variable main_loop_stack {} variable stack_height 0 proc push_main_task {args} { variable main_loop_stack lappend main_loop_stack $args } proc completion_main_loop {} { variable main_loop_stack variable stack_height [llength $main_loop_stack] if {[llength $main_loop_stack]} then { set cmd [lindex $main_loop_stack end] set main_loop_stack [lreplace $main_loop_stack end end] } else { set cmd increment_count } eval $cmd after 0 [list after idle completion_main_loop] } variable main_count 0 proc increment_count {} { variable main_count incr main_count push_main_task count_down $main_count } proc count_down {n} { after 5 if {$n} then {push_main_task count_down [expr {$n-1}]} } interp alias {} pause_main_loop {} push_main_task return proc halt_main_loop {} { variable main_loop_stack set main_loop_stack [linsert $main_loop_stack 0 return] } eval { toplevel .control grid [ label .control.lcount -textvariable main_count ] x [ label .control.lheight -textvariable stack_height ] grid [ button .control.halt -text "Halt" -command halt_main_loop ] [ button .control.pause -text "Pause" -command pause_main_loop ] [ button .control.run -text "Run" -command completion_main_loop ] wm title .control "Control panel" } kevin_walzer added on 2020-07-13 12:17:21: Marc, to test this, I went back to the original bug report: https://sourceforge.net/p/tktoolkit/bugs/2823/ and tested the "beachball-bug.tcl" script to document the bug. I'll upload this as an attachment. The bug report outlines the issue and illustrates it nicely - Wish locks up. With the requested branches being tested, Wish locks up again. Wish did NOT lock up after my various commits around this issue. The linked bug report outlines the issue, Daniel Steffen's design decisions, and some proposed fixes, which I later attempted to implement. As I understand it, part of the change in the Tcl branch is removing or substantially modifying my changes. You may have a better insight than me into what's going on here and the behavior of the script - please take a look at it and see if you can reproduce my results. fvogel added on 2020-07-13 09:22:40: > Could someone add "marc_culler" to the "Assigned to" list? Done! jan.nijtmans added on 2020-07-13 08:41:59: B.T.W: Could someone add "marc_culler" to the "Assigned to" list? jan.nijtmans added on 2020-07-13 08:39:10: Reading the above story and testing the macOS_hangs branch, I would like to accept this for 8.6.11. That doesn't mean it's 100% OK, yet, but I think it's better than what's now on core-8-6-branch. So, merging it now would be an improvement already. There still are a few testcases failing on Travis sometimes: chan-io-50.2 and chan-io-50.4, io-50.1, io-50.5, io-50.6, event-1.1 were seen failing sometimes. Are those test-cases sensitive to timing? I don't know, it could very well be that the test-case is the problem here. Worth to have a look at before finally merging macOS_fails to core-8-6-branch. I think there's some work to do for "idle_curiosity" too, but that's stuff for another ticket. Good work! marc_culler (claiming to be Marc Culler) added on 2020-07-11 14:03:16: Thanks Kevin! To test, build both Tcl and Tk using the corresponding branches. With my directory setup: $ rm -rf build $ cd Tcl $ fossil pull $ fossil update macOS_hangs $ make -C macosx $ sudo make -C macosx install $ cd ../Tk $ fossil pull $ fossil update idle_curiosity $ make -C macosx $ sudo make -C macosx install We are looking for the usual problems. We don't want to see hangs or drawing anomalies; at least not new ones that aren't already present in core-8-6-branch. This includes temporary hangs, where a window is not completely rendered until an event such as a mouse click or motion occurs. Tests could include running the regression tests, running the demo script, building and testing your apps with this Tcl and Tk, building and testing any other apps that you have a build setup for. kevin_walzer added on 2020-07-11 13:30:19: Marc, happy to test this; how should I combine these two branches and what should I look for in testing? fvogel added on 2020-07-11 10:13:11: Marc, I'm not really in a position to test. Nevertheless you have my full vote of confidence. |
Attachments:
- idle_curiosity_stringscan.png [download] added by kevin_walzer on 2020-07-17 02:17:22. [details]