Tcl Source Code

View Ticket
Login
Ticket UUID: f4f44174e4c88a9132b19bcc4b68346a1d229fe1
Title: protect notifier against signals
Type: Bug Version: trunk
Submitter: aspect Created on: 2016-12-05 23:04:39
Subsystem: 01. Notifier Assigned To: nobody
Priority: 5 Medium Severity: Minor
Status: Open Last Modified: 2018-06-13 22:38:46
Resolution: None Closed By: nobody
    Closed on:
Description:
The following script (using Tclx) eventually hangs, as the signal is
delivered either to the notifier thread or to Tcl_AlertNotifier.

Patch coming.

~~~
package require Tclx
package require Thread

signal trap SIGHUP handler

proc handler args {
    puts "SIGHUP [incr ::HUPS] in [thread::id] $args"
}

proc gotinput {chan} {
    gets $chan d
    puts "hupity $d"
}

set sub [thread::create thread::wait]

if {![catch {
    package require Tcl 8.6
}]} {
    lassign [chan pipe] r w
    chan configure $w -buffering none
    chan configure $r -blocking 0
    chan event $r readable [list gotinput $r]

    thread::transfer $sub $w
    thread::send $sub [list set wchan $w]
} else {
    thread::send $sub [list set wchan stdout]
}

thread::send -async $sub {
    package require Tclx
    set pid [pid]
    while 1 {
        kill HUP $pid
        puts $wchan [incr ::HUP]
    }
} result

vwait result
~~~
User Comments: chw added on 2018-06-13 22:38:46:
Meanwhile I've consolidated things in this check-in: https://www.androwish.org/index.html/info/40790af1e8e4ec9f

For more details see the discussion in https://github.com/flightaware/Tcl-bounties/issues/32

chw added on 2018-05-26 21:23:24:
TIP #509 brought me back to think over signal handling. I tried to direct
signal delivery by using pthread_sigmask() in the this check-in:

  https://www.androwish.org/index.html/info/5cf69f01d7eca5a9

The idea is to have the notifier deal with signals, and no thread else.
It requires to start the notifier thread on POSIX platforms as early as
possible, in order to take over before any other threads are created.
The main (and all later) threads have all signals blocked, except for
the notifier. So far this seems to work as tested on various x86_64
Linuxen.

However, function Tcl_AsyncMark() and friends still are subject to be
re-thougt due to signal safety-ness of system calls being used (see
the function list in http://man7.org/linux/man-pages/man7/signal-safety.7.html).

And then there's TclX's "signal block|unblock" which should not alter the
calling thread's sigmask, but needs to modify the notifier's one by using
some inter-thread communication, which requires Tcl core changes.

aspect added on 2017-01-05 06:09:26:
Just pushed a small update:  after running for some 40 hours it paniced with EAGAIN from write().  Totally expected, and should not be a panic situation.

This does underline the limits of this approach though:


#1 - eventually (under abnormally high Tcl_AsyncMark load) - and at a time we have little or no control over - the pipe will fill and we'll start dropping events.  That's not bad - it's the sort of constraint signal handlers have to deal with - but it's one pipe serving every thread, and we have no way to discard events that are in the buffer so a high-priority alert to thread A could be starved by low-priority alerts to thread B.  Not the best, even for a pathological scenario.

The way the original scheme degraded was actually very nice:  events would be lost if they arrived while their handler was running, high-priority events would always get to the head of the queue (without pre-empting other handlers) and cross-thread contention was .. broken by misusing a mutex, but theoretically fairly good :).


#2 - Windows.  According to MSDN, CreatePipe()s are synchronous "rendezvous" pipes:  writes block until a read occurs.  Also, I'm not really sure what the best interpretation of "signal handler" is on Windows but [https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx] says POSIX-style sighandlers can't use "low-level or stdio.h i/o" or system calls.

An actual Windows expert's opinion on what limits AsyncMark should respect is a good idea here.


Some readers are probably wondering why a good dose of atomics isn't obviously the right solution here:  I'm still wondering myself.  I can almost see how to do it, but can't get past the loss of either:

  - cheap asyncReady:  as in checking a single word in Interp, with no mutex
  - safety:  Tcl_AsyncMark not causing memory errors if called with a deleted token, a token for a deleted interp, or a token that becomes deleted while AsyncMark is running.

aspect added on 2017-01-01 13:35:12:
> Signals and threads are two-component explosive by committee design, now I wonder what Angus McGyver would do.

Possibly, fashion something out of a pipe(2)!  It took me a couple of tries, but I have something on branch [aspect-async-pipe] which I think mostly does the job.

Prior to back-porting to 8.5 I thought it might entirely do the job, but same-thread notifications still need a mutex (Tcl_ThreadAlert(), damn that waitCV!).  However I think [tip-458] makes waitCV and notifierMutex redundant, so this will be allowed!  Maybe?

Note that tcltest cannot test this without a way to install signal handlers that use Tcl_AsyncMark().

Back to my pre-backport commentary, which is still mostly valid:
~~~

The problem is that Tcl_AsyncMark goes under a mutex to deliver async events to other threads.  That is verboten by POSIX, and it's important that the asyncReady check remain cheap so we need to safely update a value in the target threads's memory .. what to do?

In a delicious inversion of what I previously had in mind for signals, I have solved this with another thread.  AsyncMark *is* permitted to write(2), so cross-thread notifications now go down a new pipe to AsyncThreadProc().  All threads can use the same pipe, since (small enough) write(2)s are atomic.

The new thread alternates between blocking read from the pipe, and doing the mutex dance to notify whoever needs an async event.

There's a change in data structures too:  the per-thread list of AsyncHandlers is now a member of sharedData->handlersList (actually a list of lists).  This is so the token sent to AsyncThreadProc can be validated before we go dereferencing it to set async flags.  As a result, calling Tcl_AsyncMark() with a token that was deleted in another thread - or whose thread has exited - is now safe.

Same-thread notifications simply set asyncReady without a mutex.  I think this is safe, as the worst thing that can happen concurrently is AsyncThreadProc setting asyncReady (with a mutex) and the observable outcome should always be asyncReady=1.  Pre-empting itself should always be safe.

The test suite passes, and the script that started this ticket runs for 10e8s of signals without deadlock.  Of course, notifierMutex is still a problem, but we seem to be on the right track.

aspect added on 2016-12-07 02:13:26:
I was just about to report that handling EINTR in File{In,Out}putProc doesn't eliminate the deadlock, it just pushes it out later (with a different stack trace, but this is whack-a-mole).

The interesting part of this trace seems to be:

#10 0x00007f4bba9a95cc in TclChannelEventScriptInvoker (clientData=0x2030f28, mask=2) at generic/tclIO.c:8866
#11 0x00007f4bba9a8e03 in Tcl_NotifyChannel (channel=0x2046e58, mask=2) at generic/tclIO.c:8360
#12 0x00007f4bba9a9003 in ChannelTimerProc (clientData=0x2046e58) at generic/tclIO.c:8526
#13 0x00007f4bba9f67aa in TimerHandlerEventProc (evPtr=0x1f93d68, flags=-3) at generic/tclTimer.c:593



> POSIX does not list pthread_mutex_(unlock|lock)() as async signal safe functions.

Can you give me a source?  I'd like to understand this better .. the Linux manual states:

> These functions shall not return an error code of [EINTR].

Feel free to ping me on chat for an email address to take this discussion off-ticket where we can be more verbose and tangential if you like :)

chw added on 2016-12-07 01:42:06:
Oooooops. That's definitely a deadlock since the signal handling tries to lock the same mutex path as Tcl_AsyncInvoke() since the signal is received by the same thread. Looks like a fundamental design problem. In the olden unthreaden threatless days where Tcl_MutexLock(), Tcl_MutexUnlock(), and the alert function were dummies, this was no problem at all since the flag for async invoke was simply set by the signal handler. Nowadays the signal handler likes to acquire and release a mutex. And POSIX does not list pthread_mutex_(unlock|lock)() as async signal safe functions. BUMMER! Signals and threads are two-component explosive by committee design, now I wonder what Angus McGyver would do.

aspect added on 2016-12-06 22:56:05:
.. although this is not sufficient.  After leaving the test script running for some 3.5m signals, I get another lockup.

Finding all the interruptible syscalls that need to be protected is surely going to be tedious and error-prone .. but a whole-hog solution of turning signals into event-loop events, while simpler to code, is going to require changes in every existing extension that (mis-)uses signals.


This particular mole stuck its head up with the following stack.  Now testing with EINTR handling to FileInputProc/FileOutputProc, though I'm not really convinced this is sensible:


#0  0x00007f7d482b3afd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f7d482ada0d in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f7d48aa8a6b in Tcl_MutexLock (mutexPtr=0x1bd6060) at /home/aspect/Tcl/Env/src/tcl/unix/tclUnixThrd.c:431
#3  0x00007f7d4891d4ae in Tcl_AsyncMark (async=0x1dfbc78) at /home/aspect/Tcl/Env/src/tcl/generic/tclAsync.c:165
#4  <signal handler called>
#5  0x00007f7d48aa8a89 in Tcl_MutexUnlock (mutexPtr=0x1bd6060) at /home/aspect/Tcl/Env/src/tcl/unix/tclUnixThrd.c:457
#6  0x00007f7d4891d5c6 in Tcl_AsyncInvoke (interp=0x1bce2f8, code=0) at /home/aspect/Tcl/Env/src/tcl/generic/tclAsync.c:237
#7  0x00007f7d489232c1 in NRCommand (data=0x1c87060, interp=0x1bce2f8, result=0) at /home/aspect/Tcl/Env/src/tcl/generic/tclBasic.c:4428
#8  0x00007f7d48923129 in TclNRRunCallbacks (interp=0x1bce2f8, result=0, rootPtr=0x1c86fd8) at /home/aspect/Tcl/Env/src/tcl/generic/tclBasic.c:4398
#9  0x00007f7d48925a9e in TclEvalObjEx (interp=0x1bce2f8, objPtr=0x1bd6060, flags=131072, invoker=0x0, word=0) at /home/aspect/Tcl/Env/src/tcl/generic/tclBasic.c:5963
#10 0x00007f7d48925a34 in Tcl_EvalObjEx (interp=0x1bce2f8, objPtr=0x0, flags=131072) at /home/aspect/Tcl/Env/src/tcl/generic/tclBasic.c:5944
#11 0x00007f7d48a355cc in TclChannelEventScriptInvoker (clientData=0x1d3bfd8, mask=2) at /home/aspect/Tcl/Env/src/tcl/generic/tclIO.c:8866
#12 0x00007f7d48a34e03 in Tcl_NotifyChannel (channel=0x1d26db8, mask=2) at /home/aspect/Tcl/Env/src/tcl/generic/tclIO.c:8360
#13 0x00007f7d48a35003 in ChannelTimerProc (clientData=0x1d26db8) at /home/aspect/Tcl/Env/src/tcl/generic/tclIO.c:8526
#14 0x00007f7d48a827aa in TimerHandlerEventProc (evPtr=0x1ddaf18, flags=-3) at /home/aspect/Tcl/Env/src/tcl/generic/tclTimer.c:593
#15 0x00007f7d48a58ecf in Tcl_ServiceEvent (flags=-3) at /home/aspect/Tcl/Env/src/tcl/generic/tclNotify.c:670
#16 0x00007f7d48a591d7 in Tcl_DoOneEvent (flags=-3) at /home/aspect/Tcl/Env/src/tcl/generic/tclNotify.c:903
#17 0x00007f7d489ee656 in Tcl_VwaitObjCmd (clientData=0x0, interp=0x1bce2f8, objc=2, objv=0x1bd1970) at /home/aspect/Tcl/Env/src/tcl/generic/tclEvent.c:1416

aspect added on 2016-12-06 21:59:19:
You are quite right.  The only reason this prevented the hang is I used SIG_BLOCK after releasing the mutex rather than SIG_SETMASK.

the pthread_mutex_lock() call in Tcl_AlertNotifier() was where gdb pointed when I observed the hang, but of course this is not the place to be blocking signals.

The correct place seems to be Tcl_WaitForEvent() - the write() and select() calls in here being EINTR candidates.  It looks like the write() could be wrapped in the moral equivalent of while (errno != EINTR) - and select() being interrupted probably won't hurt anyway - but blocking signals as long as we hold the mutex seems a bit tidier to me.

Fixed patches added to the branch.

chw added on 2016-12-06 01:23:32:
Is it really required to block signals around the beef of Tcl_AlertNotifier()?

According to the manpages of pthread_mutex_lock() and pthread_cond_broadcast() both aren't subject to EINTR and esp. the first thus implicitly is restarted should a signal be delivered while waiting on the mutex.

aspect added on 2016-12-05 23:46:59:
with the feature test macro removed it builds + tests clean once more.

aspect added on 2016-12-05 23:25:47:
Initial patch committed on branch [bug-f4f44174e].

This is going to require some cleaning up wrt feature test macros, and autoconf macros.  It now fails to build on trunk due to AI_PASSIVE - on whatever I previously had checked out it built and tested clean :(.

Worth considering what other critical sections should be protected in the same way.

Longer term, I have a project in the works to give the trunk "proper" thread-safe signal handling via a signalfd-like gadget using a thread and sigwaitinfo(), but that will probably require a TIP and require some coordination with extensions that use signals (tclx, naviserver?)