Tcl Source Code

View Ticket
Login
Ticket UUID: 2869384
Title: Tcl_AsyncMark does not work with --enable-threads
Type: Bug Version: obsolete: 8.6b1.1
Submitter: eugene_cdn Created on: 2009-09-29 01:33:23
Subsystem: 04. Async Events Assigned To: mistachkin
Priority: 6 Severity:
Status: Open Last Modified: 2010-03-02 10:25:58
Resolution: None Closed By:
    Closed on:
Description:
OS: sparc Solaris 10
Tcl_AsynMark cannot be called in a signal handler because it locks the same mutex (notifierMutex) as Tcl_WaitForEvent. If a signal is handled in main thread when it serves Tcl events, then deadlock is very probable. Attached example will produce a stack like this
     -----------------  lwp# 1 / thread# 1  --------------------
     feb40408 lwp_park (0, 0, 0)
     ff0385fc Tcl_MutexLock (ff08247c, 0, 0, 1000, 0, 0) + b4
     ff039624 Tcl_AlertNotifier (2bda8, 0, 0, 1000, 0, 0) + 34
     fefd83a8 Tcl_ThreadAlert (1, 0, 0, 0, 0, 0) + a8
     feedf7e0 Tcl_AsyncMark (1a80d8, 0, 0, 0, 0, 0) + 60
     fee306a8 GotAlarm (e, 0, ffbfeb10, 1, 21c30, 21c2c) + 30
     feb40494 __sighndlr (e, 0, ffbfeb10, fee30678, 0, 1) + c
     feb3558c call_user_handler (e, 0, 4, 0, ff0c2000, ffbfeb10) + 3b8
     feb4187c _write   (4, ff05ce88, 1, 0, 0, 2c0b5) + c
     ff03a2f0 Tcl_WaitForEvent (224e8, fffffffd, ff1966e4, ff3ee0f8, ff3f06e0, 0) + 3c0
     fefd7f74 Tcl_DoOneEvent (0, 6e7f8, 0, 0, 1, 0) + 1ec
     ff17c738 Tk_MainLoop (2c1a8, 31188, 0, 1, 1, 5400) + 48
     ff1966e4 Tk_MainEx (2, ffbff2ac, 10e90, 2c1a8, 4, 4) + b1c
     00010e60 main     (2, ffbff2ac, ffbff2b8, 21000, ff0c0100, ff0c0140) + 50
     000109e0 _start   (0, 0, 0, 0, 0, 0) + 108
     -----------------  lwp# 2 / thread# 2  --------------------
     feb40408 lwp_park (0, 0, 0)
     ff0385fc Tcl_MutexLock (ff08247c, fea7bf1c, fea7be9c, fea7be1c, 0, fea7bea5) + b4
     ff03a8e4 NotifierThreadProc (0, fea7c000, 0, 0, ff03a4a8, 1) + 43c
     feb40368 _lwp_start (0, 0, 0, 0, 0, 0)
in a few seconds. To reproduce the problem, use regular wish built with --enable-threads and use attached script as input:
    wish run.tcl
It will hang shortly. To speed up hanging, one can decrease timeouts. Tk must serve some events to make the application hanging faster, so the script changes text widget in loop. Build script is attached (need to modify paths).
User Comments: eugene_cdn added on 2010-03-02 10:25:58:

File Added - 365081: run3.tgz

eugene_cdn added on 2010-03-02 10:20:51:
Sorry for long absence.
1) As I mentioned before, recursive mutex does not protect "protected data" from corruption in signal handler.
2) I made Tcl testcase much more reliable, it hangs just in seconds. Need to change timeout in C code to 100 us and in Tcl to 1 ms and decrease printing. Attaching run3.tcl to use with tclsh.
3) After making mutexes recursive and using modified testcase, it hangs in the same place. I collected a number of different stacks. Last one:
   Tcl_WaitForEvent(timePtr = 0x22328)
   Tcl_MutexUnlock(mutexPtr = 0xff38294c)
   __mutex_unlock(0x2b468, 0x1000, 0xfef6cbc0, 0xff072000, 0x2b470, 0x6)
   --- Signal SIGALRM raised ---
   __sighndlr(0xe, 0x0, 0xffbfe4f0, 0xfee20678, 0x0, 0x1)
   GotAlarm(signo = 14)
   Tcl_AsyncMark(async = 0x51738)
   Tcl_ThreadAlert(threadId = 0x1)
   Tcl_AlertNotifier(clientData = 0x2bbe8)
   Tcl_MutexLock(mutexPtr = 0xff38294c)
   mutex_lock_queue(0x0, 0x0, 0x2b468, 0x0, 0x0, 0x1)
   __lwp_park(0x0, 0x0, 0x0, 0x0, 0xfee50000, 0x1)
4) You search web for "pthread_mutex_lock async signal safety" and you will
   find that you cannot use mutexes in signal handler at all! Here's an example
   of finding:
ASYNC-SIGNAL SAFETY
                The mutex functions are not async-signal safe. What this means is  that
        they should not be called from a signal handler. In particular, calling
        pthread_mutex_lock or pthread_mutex_unlock from a signal handler may
        deadlock the calling thread.
This all means that direction of the research is wrong. Tcl_AsyncMark must be mutex free or must not be used in signal handler.

flatworm added on 2009-11-04 19:12:35:
Note that in order to compile the CVS HEAD with the proposed patch applied under Linux 2.6.26 I had to tweak the Makefile.

Namely, to allow compiling these recursive mutexes code, _XOPEN_SOURCE must be defined to be 500 or higher, and as this POSIX compatibility layer makes strncasecmp() inaccessible, _BSD_SOURCE must also be defined to return it back again.

So, a temporary solution is to make the usual "./configure" step, then open the generated Makefile, locate the "AC_FLAGS" variable and add this line below:
AC_FLAGS += -D_XOPEN_SOURCE=500 -D_BSD_SOURCE

mistachkin added on 2009-11-02 17:21:21:
It seems that my patch fixes something; however, the original test script still hangs on Linux 2.6.  Thanks to flatworm for helping me test this.  I would really appreciate it if the original submitter of the bug would run both tests in their environment and report the results here.

mistachkin added on 2009-11-02 13:50:26:

File Added - 349147: test2869384.zip

mistachkin added on 2009-11-02 13:49:48:
Slightly modified build script and test case to run in my env.

mistachkin added on 2009-11-02 13:35:11:

File Added - 349144: fix2869384.zip

mistachkin added on 2009-11-02 13:34:33:
Please try the attached patch in your environment.

mistachkin added on 2009-11-02 09:56:56:
The extensive use of locking in the Unix notifier seems to be the cause of this issue?  I believe that with careful analysis of the locks in the Unix notifier they can be minimized.  Of special concern is waiting for (an|any) event while holding any locks because we have no way of knowing precisely when an event will be triggered.

eugene_cdn added on 2009-10-29 03:47:58:
Any update? It is not a minor issue. One of the oldest group of functions does not work in the situation it was primarily designed for. And there is no workaround.

eugene_cdn added on 2009-09-29 19:46:13:
I don't think that recursive mutex will help. Cannot modify global data safely in a signal handler without be sure that you are alone.
Agree that the problem cannot be solved without help of another thread that does not receive signals.
Please also note that there may be several signals come before "end of special handling".
Draft idea:
Need a thread that blocks all signals (except, maybe one, usr123) or does not use notifierMutex. Let's call it right_thread.
AsyncMark:
- if (trylock(&notifierMutex) == 0)
-    // normal code
- lock (&another_mutex)
- queue async_tls_data
- unlock (&another_mutex)
- wake up right_thread (using pthread_kill(right_thread,usr123) or pthread_cond_signal)
right_thread:
- lock (&another_mutex)
- take asyn_tls_data from queue
- save async_tls_data to right interp
- unlock (&another_mutex)

There is a non-blocking technique to modify global data. May also be used in any solution.

flatworm added on 2009-09-29 19:19:41:
It seems this function is actually named pthread_mutex_trylock().

The idea is interesting, but I don't quite get how to implement it:
In the defautl mode of operation pthread_mutex_trylock() is defined to return immediately with EBUSY if the mutex lock is held, and Tcl core doesn't expect this behaviour from the mutex API it uses (and exposes).
If we instead make all mutexes "recursive" (in terms of pthreads), this will fix deadlocks occuring in the same thread because of signal handling but will break expectations in all other cases when a mutex locking function is supposed to wait on a mutex already locked by some other thread.

It need further thinking (and taking into account threading subsystem on Windows) as Tcl mutex API hides platform details from the programmer.

eugene_cdn added on 2009-09-29 18:35:49:
pthread_trylock may help (in addition with global flag or something). I don't have clear idea though.

flatworm added on 2009-09-29 18:27:16:
The problem with this approach is that it implies we control all the threads existing when our setup code is beging run and postulate the policy than there shall never be a thread created which will have singals unmasked, except the signal-processing thread.

I reckon this is impossible to achieve from an extension implementing signal handling, and can only possibly be done if integrated directly in the core, which is questionnable, as the concept of signals only exists on Unix.

ferrieux added on 2009-09-29 15:54:36:
I seem to recall (see that 12-year old paper http://www.linuxjournal.com/article/2121 ) that the recommended way of doing pthread+signals is to block all signals in all threads except one, thus effectively guiding signals to one specific thread that you control. Is this still working in modern pthreads ?

flatworm added on 2009-09-29 15:17:48:
I've hit this while implementing an extension to handle POSIX signals (it's unfinished yet so not released).

The problem appears to be more complex and I doubt it can be fixed at all for all possible cases of a deadlock,
simply because pthread's mutex is allowed to deadlock when locked twice from the same thread, and that's
what it does on Linux 2.6 at least.

The problem is that the signal handler bumps into the thread's execution stack at unpredictable moments,
and hence the signal handling can be considered as a "superthread" which don't play by the usual rules for threads.

Conceptually, to make Tcl_AsyncMark() (and any other function which tries to lock the target thread's LTS) not deadlock,
we have to make locking function only attempt locking if some flag is not raised (meaning that no lock
is held). But using a flag means sharing a mutable state, which inherently implies using another mutex, so we're back at the starting point.
Masking all signals before locking the LTS mutex is also a bad idea: a) too many syscalls b) this doesn't help to prevent mutual deadlocks between two threads.

So, in my package mentioned above I finally implemented a solution involving a special manager thread with which signal handlers interact and which dispatches events to target threads. The solution is complicated but works. As I intend to release the code under a Tcl-like lisence, I will be happy to share it, if you so wish.

eugene_cdn added on 2009-09-29 08:54:42:
Tried on Linux, Red Hat 4 update 4. The same problem.

eugene_cdn added on 2009-09-29 08:33:28:

File Added - 344691: asyn.tgz

Attachments: