Tk Source Code

View Ticket
Login
Ticket UUID: 3028676
Title: Cocoa event loop won't let external events through
Type: Bug Version: current: 8.6.0
Submitter: lars_h Created on: 2010-07-12 22:12:18
Subsystem: 70. Event Loop Assigned To: kevin_walzer
Priority: 7 High Severity: Minor
Status: Closed Last Modified: 2015-03-10 14:50:12
Resolution: Fixed Closed By: kevin_walzer
    Closed on: 2015-03-10 14:50:12
Description:
I've got a program where I've chopped up a long running computation into smaller pieces, using the

  after 0 [after idle {next_iteration}]

idiom for scheduling the next chunk of work while allowing the GUI to stay responsive; I believe this is the recommended way to do this.

Unfortunately my program, which has worked well on my old PPC macs, does not seem to work on my new Intel mac: although the GUI is being updated in response to changes in e.g. label -textvariable values, I'm no longer getting any response to pressing buttons when the computation is running. Since this includes buttons I'd press to control the program, it makes it pretty useless!

Instructions to see the problem:

1. Start wish from a terminal.
2. [source] the attached beachball-bug.tcl.
-> This will create a window "Control panel" with three buttons "Halt", "Pause", and "Run", and two labels (each displaying the number 0).
3. Press "Run".
-> The left label will start to count upwards, slowing down as it goes. The right label will mostly say 1 but sometimes 0.
4. Press "Pause"
-> What should happen is that the counting stops and the right label changes to 2; an [after info] command (that may be typed at the prompt in the terminal) should confirm nothing is happening.
->< However, what rather happens is that the "Pause" button becomes highlighted and *stays highlighted*, while the counting continues unaffected. The mouse pointer will soon turn into the spinning beach ball above wish windows, confirming that wish is not processing any events. At this stage, you may type one (1) command at the terminal and have wish evaluate it; if you make this command "after cancel [after info]" then you can still stop the counting, but otherwise there's no way of stopping it short of killing the wish process.

I see this problem for:
* Tk 8.6b1.2 that I've compiled myself from sources checked out from CVS.
* Tk 8.5.7 from /System/Library/Frameworks/Tk.framework/..., that I suppose Apple compiled.

I don't see this problem for:
* Tk 8.4.19 from /System/Library/Frameworks/Tk.framework/..., that I again suppose Apple compiled.
* Tk 8.5.6 that I compiled on my old mac; this is a PPC executable, but it runs (I suppose) using Rosetta.

The problematic wishes all seem to be 64-bit executables (tcl_platform(wordSize) is 8) whereas the working ones are 32-bit (tcl_platform(wordSize) is 4). I suppose this could mean the faulty/working distinction is rather one of Cocoa/Carbon, but I don't know if there is some other way to check if the 8.4.19 is Carbon or Cocoa.
User Comments: kevin_walzer added on 2015-03-10 14:50:12:
Committed the fix outlined on 3/8; no objections raised on mailing list. Closing bug as fixed.

kevin_walzer added on 2015-03-08 04:48:45:
I've been doing some additional experimentation with this, and this diff of tclMacOSXNotify.c in trunk seems to improve things significantly in terms of avoiding hangs: 

--- tclMacOSXNotify-current.c	2015-03-07 23:31:56.000000000 -0500
+++ tclMacOSXNotify.c	2015-03-07 23:41:21.000000000 -0500
@@ -1412,7 +1412,8 @@
 		(tsdPtr->runLoopNestingLevel > 1
 			|| !tsdPtr->runLoopRunning)) {
 	    tsdPtr->runLoopServicingEvents = 1;
-	    while (Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}
+            /* This call seems to simply force event processing through and prevents hangups that have long been observed with Tk-Cocoa.  */
+	    Tcl_ServiceAll();
 	    tsdPtr->runLoopServicingEvents = 0;
 	}
 	break;

In brief, swapping out the while loop and replacing it with a direct call to Tcl_ServiceAll() just seems to force things through. The demo script attached to this bug (beachball-bug.tcl) does not lock up, although resizing the window and selecting menus does run a bit sluggishly until the events are processed. The shorter script referenced elsewhere here does not freeze at all with this change. 

I see this as a promising fix for this bug, and a way to further reduce the issues with the event loop, but I would like others to test as well and see if their experiences match mine.

lars_h added on 2012-10-02 21:59:12:
Regarding Kevin's comment earlier today, that Tcl-Mac posting is http://code.activestate.com/lists/tcl-mac/1553/

On the matter of work-arounds: What I've been using for over a year now is to change "after 0 ..." to "after 1 ...". I seem to recall that the inner event loop treats the case that there is a timer due right now separately, which might explain why this would make a difference.

In principle, I suppose the Mac version of Tk could be changed to silently do "after 1 ..." whenever it's asked to do "after 0 ...", which would make that work-around invisible on the script level. But before deploying something like that, one should probably check whether it also is a work-around when there is more than one task rescheduling itself in this way.

Would it make sense to update the "group" setting to 8.6b3?

wordtech added on 2012-10-02 19:45:13:

allow_comments - 0

Re-opening for comments based on Adrian Roberts' comments on Tcl-Mac, but previous remarks still stand.

wordtech added on 2012-10-02 08:10:56:

allow_comments - 1

Adding an "update" command to the core procedure in the sample script sidesteps the problem entirely. 

I realize that this script was put together to assemble the core issue, that is, incoming Tcl events overloading the Cocoa/Tk event loop and causing it to lock up, but the fact is that careful management of the event loop with occasional workarounds like "update" or "after" can avoid many of the issues that are commonly reported.

I've spent a lot of time reading both das's code and the relevant docs (Tcl and CoreFoundation), and given the complexity of the task, I've concluded that the current implementation is good enough, albeit far from optimal. das had previously committed a significant update to the notifier, see https://sourceforge.net/mailarchive/message.php?msg_id=23324050, that addressed truly showstopper issues with its performance. Now it is slow and pone to occasional lockups, but in most cases these can be worked around. I seldom see any issues with the event loop in my own applications because I work around the most common issues; it's been a long time since I've had a user complaint about a lockup because of event loop issues.

It seems clear that the Tk/Cocoa event loop is slower, more fragile, and more complex than the Carbon implementation; however, the ultimate responsibility for this lies with Apple's decision to deprecate Carbon and force Cocoa integration. Cocoa has many advantages, but it is not a natural fit for Tk the way that Carbon, which operated at the same low level of abstraction as Tk's C API, was. 

Therefore, I've decided to close this bug, and will spend a bit of time deciding how to document the limitations of Tk-Cocoa's event loop and best practices for working around them.

wordtech added on 2011-08-05 07:50:29:
Further review suggests that event loop/notifier integration between Tk and Cocoa may require some re-design of this by das: I have not been able to isolate a specific place where a bug can be addressed.

wordtech added on 2011-07-07 09:06:41:
I've looked at DAS's comment below: 

"Do note that running the tcl testsuite is insufficient to guard against
regressions in this area as it does not cover the notifier in embedded
mode.
the tk testsuite for TkCoco as well as TkX11 along with manual testing of
the
tk widget demo is required (in particular, test that the animated demos
correctly continue animating during menu tracking/window resizing,
invocation
of a menu item via menu or key binding, and while a modal dialog & a
modal
sheet & a file open dialog/sheet are displayed (all of these involve
slightly
different uses of nested runloops that have caused problems in the past)."

Even though this change fixes issues with bgerror, this proposed change does cause issues with animation during window resize as demonstrated in the Tk widget demo. As a result, I am reluctant to proceed with it as it may have unexpected side effects/breakage. 

Lowering the priority to 7 and changing the resolution to "later" because this bug seems beyond my skill set to fix at this time. Others are welcome to delve into it.

lars_h added on 2011-07-07 01:11:29:
If it's as easy as wordtech says, then splendid! Unfortunately, I'm pretty sure that deleting that while was one of the first things I tried all those months ago -- the loop anyway didn't seem to be more than a premature overoptimisation (avoid dropping out into an "outer" event loop when you know you'll be back here soon anyway) -- but it didn't work for me: as mentioned in my 2010-11-02 comment, it caused some parts of the tcl test suite to be very slow, and at least one test got stuck completely. So be very careful! I suspect things like network speed may affect the order in which things happen, and may affect whether tests like those hang or not.

If you're rather dropping the Tcl_ServiceAll call as well, then I don't see how you can be servicing idle events.

I fear the probability of me coming up with a fix for this bug is at present rather low, however. (Other stuff coming in between.)

wordtech added on 2011-07-06 08:31:06:
This bug appears to manifest itself whenever a bgerror is thrown when accessing a Tk-Cocoa menu, see bug 3285355. I would consider this to be a release-blocking bug, although the bgerror issue itself is trivial to work around by re-defining the bgerror procedure. 

Simply commenting out this line:

while ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

in tclMacOSXNotify.c solves the bgerror issue, it runs fine. It also appears to solve the "beachball" issue in the test script submitted by Lars. 

The only side effect I can see from this change is that the Tcl test suite in "make test" seems to run more slowly. Is there any other testing I should do before committing this change as a short-term solution, pending a more complete investigation/implementation by Lars?

das added on 2010-11-03 05:09:20:
it seems repeated Tcl_ServiceAll() isn't providing quite the semantics needed here, the idle queue should really be processed only once per runloop observer invocation (but all other pending tcl events should be drained)
ISTR trying to achieve this via Tcl_DoOneEvent() invocations initially and being unsuccessful (git history may provide details), we may need extra support in tclNotify.c for this.

lars_h added on 2010-11-03 04:36:02:
Replying to some specific points of das's:
> The TkCocoa notifier is a hybrid of an embedding and an
> embedded notifier, 

Very subtle terminology here... I need to tread carefully.

> see tkMacOSXNotifier.c for details. The
> outermost event loop is driven by tcl via Tcl_WaitForEvent()
> and has a tk event source that processes NSEvents (i.e.
> embedding). That processing may however involve nested
> CFRunLoops (possibly running in different runloop modes)
> e.g. during menu tracking/window resizing/when a modal
> dialog is presented etc. At such time, the relationship is
> reversed (i.e. embedded), the runloop is driven externally
> and our runloop observer processes tcl events via
> Tcl_ServiceAll().

> Note that as far as tk goes, the while loop at issue here is
> only ever used in the second case (i.e. a nested runloop
> invocation).

Yes.

> For typical tcl use as in the testsuite, that
> while loop should never be used at all (the notifier must
> have been switched to embedded-enabled mode via
> Tcl_ServiceModeHook() for it to be used). OTOH this is the
> normal mode of operation when tcl is embedded in a Cocoa app
> that drives its own event loop that is unaware of tcl.

> As far as getting stuck in this while loop, Tcl_ServiceAll()
> is supposed to return true as long as it has processed an
> event, so this might be due to new tcl events continuously
> arriving.

It is. As stated at the top of the original bug report, I'm using the standard [after 0 {after idle next_iteration}] idiom to keep the processes working while allowing the GUI to be responsive. Tcl_ServiceAll will first do the [after 0] script, which adds an event to the idle queue. Then it processes the idle queue, so next_iteration is evaluated, but since that does an [after 0 {after idle next_iteration}], there will now be a new event in the non-idle queue, and tsdPtr->waitTime is 0. But Tk is supposed to also let GUI events be serviced before going back to this timer event! That won't happen if UpdateWaitingListAndServiceEvents never returns.

> From the backtrace you posted, it looks like an after event
> called from TclServiceIdle() is involved that ends up
> calling a nested blocking [after] ?

The blocking [after] is probably not relevant. As I've stated before, it is just a substitute for the useful computation that the loop would be doing, to stop it from running unpleasantly fast.

> Do note that running the tcl testsuite is insufficient to
> guard against regressions in this area as it does not cover
> the notifier in embedded mode. 

Yes, that is what I expected. But changing that single line in tclMacOSXNotify.c *changed* the behaviour of the *Tcl* testsuite, so I didn't dare to proceed to Tk. It seems this code is used even without Tk.

das added on 2010-11-03 00:01:16:
Lars, thanks for looking into this!

The TkCocoa notifier is a hybrid of an embedding and an embedded notifier, see
tkMacOSXNotifier.c for details. The outermost event loop is driven by tcl via
Tcl_WaitForEvent() and has a tk event source that processes NSEvents (i.e.
embedding). That processing may however involve nested CFRunLoops (possibly
running in different runloop modes) e.g. during menu tracking/window
resizing/when a modal dialog is presented etc. At such time, the relationship
is reversed (i.e. embedded), the runloop is driven externally and our runloop
observer processes tcl events via Tcl_ServiceAll().

Note that as far as tk goes, the while loop at issue here is only ever used
in the second case (i.e. a nested runloop invocation). For typical tcl use as
in the testsuite, that while loop should never be used at all (the notifier
must have been switched to embedded-enabled mode via Tcl_ServiceModeHook() for
it to be used).
OTOH this is the normal mode of operation when tcl is embedded in a
Cocoa app that drives its own event loop that is unaware of tcl.

As far as getting stuck in this while loop, Tcl_ServiceAll() is supposed to
return true as long as it has processed an event, so this might be due to
new tcl events continuously arriving.
From the backtrace you posted, it looks like an after event called from
TclServiceIdle() is involved that ends up calling a nested blocking [after] ?

Do note that running the tcl testsuite is insufficient to guard against
regressions in this area as it does not cover the notifier in embedded mode.
the tk testsuite for TkCoco as well as TkX11 along with manual testing of the
tk widget demo is required (in particular, test that the animated demos
correctly continue animating during menu tracking/window resizing, invocation
of a menu item via menu or key binding, and while a modal dialog & a modal
sheet & a file open dialog/sheet are displayed (all of these involve slightly
different uses of nested runloops that have caused problems in the past).

lars_h added on 2010-11-02 18:06:36:
Contued working on this. I request comments from someone who understands how a Tcl notifier is normally supposed to work.

If I understand the Tcl Notifier manpage correctly, a custom modifier embedded into an external event loop (which tclMaxOSXNotify.c appears to be) is supposed to call Tcl_ServiceAll to make sure Tcl events are serviced, *but only once* on each invokation. This makes sense, since Tcl_ServiceAll already contains a loop over the event queue.

However, when I try this (reduce the while loop to just a Tcl_ServiceAll call), parts of the Tcl test suite seem to become very slow, and http11.test gets stuck entirely.

Reading up on CFRunLoop*, I think I understand why. UpdateWaitingListAndServiceEvents is set up as an observer callback, and the kCFRunLoopBeforeWaiting case in which it calls Tcl_ServiceAll is "I'm about to go to sleep now", so this is the CF equivalent of an [after idle] callback. The effect of servicing Tcl events here is that they only get serviced after some other event has occurred, and the CF runloop is preparing to go idle again. Hence a process must receive some external event between each invokation of UpdateWaitingListAndServiceEvents, and if it never does (as might be the case for some thread of http11.test) then it is stuck.

That doesn't mean the while loop is the right thing to do, however; I rather suspect it could be covering up a second bug -- namely that Tcl events are only serviced at idle time -- but I haven't studied the code in enough detail to say that for sure. There could be other situations than this kCFRunLoopBeforeWaiting callback in which Tcl event are serviced, and in that case it might be OK.

Also, upon randomly browsing documentation, it has occurred to me that a minimal fix to the problem at hand could be to set up a version 0 CFRunLoopSource, and replace the while loop

   while ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

with something like

   if ( Tcl_ServiceAll() && tsdPtr->waitTime == 0) {
      CFRunLoopSourceSignal( tsdPtr->wantCallbackRLSource );
   }

to make sure the runloop thinks "oh, something happened", and go idle again a bit later. But I *really* want a second pair of eyes to take a look at the problem as a whole before attempting that.

lars_h added on 2010-10-29 03:00:07:
Have finally gotten around to running this in a debugger.

When Wish is in its unresponsive state, it appears to be sitting in the while loop at line 1417 of tclMacOSXNotify.c:

    while (Tcl_ServiceAll() && tsdPtr->waitTime == 0) {}

The full call stack is typically something like

#00x7fff823fb2fa in mach_msg_trap
#10x7fff823fb96d in mach_msg
#20x7fff802493c2 in __CFRunLoopRun
#30x7fff8024884f in CFRunLoopRunSpecific
#40x10038c799 in Tcl_Sleep at tclMacOSXNotify.c:1545
#50x100350d73 in AfterDelay at tclTimer.c:1042
#60x100350446 in Tcl_AfterObjCmd at tclTimer.c:843
#70x100249c3c in NRRunObjProc at tclBasic.c:4371
#80x1002499d1 in TclNRRunCallbacks at tclBasic.c:4318
#90x10024c24b in TclEvalObjEx at tclBasic.c:5893
#100x10024c1ef in Tcl_EvalObjEx at tclBasic.c:5874
#110x100351019 in AfterProc at tclTimer.c:1173
#120x1003501a6 in TclServiceIdle at tclTimer.c:745
#130x100326bad in Tcl_ServiceAll at tclNotify.c:1087
#140x10038c4bb in UpdateWaitingListAndServiceEvents at tclMacOSXNotify.c:1417
#150x7fff8026d077 in __CFRunLoopDoObservers
#160x7fff802490cf in __CFRunLoopRun
#170x7fff8024884f in CFRunLoopRunSpecific
#180x7fff8356691a in RunCurrentEventLoopInMode
#190x7fff8356671f in ReceiveNextEventCommon
#200x7fff835665d8 in BlockUntilNextEventMatchingListInMode
#210x7fff844ec29e in _DPSNextEvent
#220x7fff844ebbed in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]
#230x1001309f8 in -[TKApplication(TKNotify) nextEventMatchingMask:untilDate:inMode:dequeue:] at tkMacOSXNotify.c:63
#240x7fff846cc7c1 in -[NSCell trackMouse:inRect:ofView:untilMouseUp:]
#250x7fff846fd536 in -[NSButtonCell trackMouse:inRect:ofView:untilMouseUp:]
#260x7fff846cb4b5 in -[NSControl mouseDown:]
#270x7fff845e5763 in -[NSWindow sendEvent:]
#280x7fff8451aee2 in -[NSApplication sendEvent:]
#290x100130b4e in -[TKApplication(TKNotify) sendEvent:] at tkMacOSXNotify.c:85
#300x10013112c in TkMacOSXEventsCheckProc at tkMacOSXNotify.c:307
#310x100326a28 in Tcl_DoOneEvent at tclNotify.c:964
#320x100026a30 in Tk_MainLoop at tkEvent.c:2134
#330x10003add1 in Tk_MainEx at tkMain.c:363
#340x10000545a in main at tkAppInit.c:70

The AfterDelay and stuff is because beachball-bug.tcl contains an [after 5] (instead of some actual processing) to slow down the counter. The while loop mentioned above is part of #14 (UpdateWaitingListAndServiceEvents), whereas the nearest function from tkMacOSXNotify is #30 (TkMacOSXEventsCheckProc). I suspect this confirms Daniel's suspicions below.

lars_h added on 2010-07-16 14:41:14:
A few additional observations:

1. The following script is sufficient to show how wish goes spinning beachball when one tries to press "Stop":

  pack [label .l -textvariable ::count] -side top
  pack [button .b -text "Start" -command {step}] -side left
  pack [button .b2 -text "Stop" -command {after cancel [after info]}] -side right
  set ::count 0
  proc step {} {
     incr ::count
     after 0 {after idle step}
  }

The symptoms are a bit different however: in this case, the counting stops (at least visibly).

2. If the [after 0] is changed to [after 1] then everything works as expected.

(I still haven't tried debugging it with gdb, but it'd probably be a good exercise.)

lars_h added on 2010-07-13 13:28:52:
Confirming all the working wishes are Carbon (and the buggy Cocoa). Is there a way of forcing build with Carbon? OTOH, I suppose that would mean the executable becomes 32-bit, and wouldn't be able to [load] extensions built 64-bit. Sigh.

A cvs update of the tcl and tk sources changes nothing that seems related to events, save a few test files. Haven't tried building yet, but I'm probably seeing this for a version later than the changes you mentioned.

das added on 2010-07-13 06:11:52:
The tk/macosx/README tells you how to differentiate between Carbon and Cocoa.
However this is most likely an issue with the tcl notifier, and so not a tk bug strictly speaking (although it will only manifest in tk cocoa, which uses the embedded mode of the tcl notifier, unlike tk carbon).

Can you confirm that this happens with the latest tcl from CVS HEAD? there were a bunch of fixes dealing with recursive event loop invocations not that long ago.
In any case, to debug this tclMacOSXNotify.c is what would need to be looked at/stepped through. Running sample/taking a stackshot or attaching with gdb when things are spinning would also be useful

I don't anticipate being able to look into this myself anytime soon however.

lars_h added on 2010-07-13 05:12:19:

File Added - 379875: beachball-bug.tcl

Attachments: