Tcl Source Code

Ticket UUID: b47b176adf7c33681946162ba6a171281ff8381e
Title: iortrans.tf-11.0 segfault
Type: Bug
Version: trunk
Submitter: dgp
Created on: 2014-06-18 13:06:53
Subsystem: 26. Channel Transforms
Assigned To: dgp
Priority: 5 Medium
Severity: Severe
Status: Closed
Last Modified: 2014-06-19 16:54:32
Resolution: Fixed
Closed By: aku
Closed on: 2014-06-19 16:54:32
Description:
Not every time, but if the test iortrans.tf-11.0
is forced to run again and again, eventually it
will fail with a segfault:

---- iortrans.tf-11.0 start

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaae910940 (LWP 24638)]
0x0000000000542697 in DeleteThreadReflectedTransformMap (clientData=0x0)
    at /home/dgp/fossil/tcl/generic/tclIORTrans.c:2353
2353            paramPtr = evPtr->param;
(gdb) print evPtr
$1 = (ForwardingEvent *) 0x0
User Comments:

aku added on 2014-06-19 16:54:32:
Thank you for the analysis and fix.

I will think of a suitable comment to add to the code pointing out this variability (i.e. that DeleteReflectedTransformMap() may be followed by DeleteThreadReflectedTransformMap() instead of the originator thread handling it).

Will see if I can take the time to fix the same issue in tclIORChan.c (8.5 first, then trunk).

dgp added on 2014-06-19 16:39:10:
fixed on trunk.

dgp added on 2014-06-19 16:31:27:
It appears that the code is written on the
assumption that when ConditionNotify is called
and the mutex is unlocked, the next thing to
lock the mutex will be the ConditionWait.
That doesn't have to be true.
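
On Unix, Tcl's condition and mutex calls are built on POSIX condition variables, and POSIX makes no promise about which thread acquires the mutex next after a notify-and-unlock. The sketch below uses raw pthreads rather than the Tcl API, and the ReplySlot type and its functions are hypothetical. It shows the usual defensive pattern: the notifying side leaves the shared state fully consistent before unlocking, because it may itself be the next to lock, and the waiting side re-checks a predicate under the lock instead of trusting a single wakeup.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical one-shot reply slot shared by a requesting thread and a
 * servicing thread (a stand-in for a ForwardingResult, not Tcl's type). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  doneCv;
    bool            done;     /* predicate: a reply has been posted */
    int             result;
} ReplySlot;

static void ReplyInit(ReplySlot *s) {
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->doneCv, NULL);
    s->done = false;
    s->result = 0;
}

/* Servicing side: publish, notify, unlock. After the unlock, any thread,
 * including this one, may be the next to acquire the lock, so the shared
 * state must already be consistent at this point. */
static void ReplyPost(ReplySlot *s, int result) {
    pthread_mutex_lock(&s->lock);
    assert(!s->done);                  /* one-shot slot */
    s->result = result;
    s->done = true;
    pthread_cond_signal(&s->doneCv);
    pthread_mutex_unlock(&s->lock);
}

/* Requesting side: a wakeup proves nothing by itself (spurious wakeups
 * are allowed), so re-check the predicate under the lock every time. */
static int ReplyWait(ReplySlot *s) {
    pthread_mutex_lock(&s->lock);
    while (!s->done) {
        pthread_cond_wait(&s->doneCv, &s->lock);
    }
    int r = s->result;
    pthread_mutex_unlock(&s->lock);
    return r;
}

/* Thread body for the servicing side. */
static void *ServiceThread(void *arg) {
    ReplyPost((ReplySlot *)arg, 42);
    return NULL;
}
```

A caller would start ServiceThread with pthread_create() and then call ReplyWait(); whichever thread reaches the mutex first, ReplyWait() returns 42.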

In the failing case, ForwardOpToOwnerThread() in
thread 1 placed a ForwardingResult on the forwardList
and then waited in a ConditionWait.

Then in thread 2 a [thread::exit] happens, which 
first leads to a call to DeleteReflectedTransformMap()
when an interp in thread 2 is being torn down.  This
calls ForwardSetStaticError() and then ConditionNotify()
and unlocks the mutex.

If at this point, as often happens, thread 1 were to wake
up and take over, it would pull the now-dead ForwardingResult
off the forwardList, and all would be well....

....however, it can and sometimes does happen that the
next lock on the mutex is acquired by thread 2 as it
continues tearing down the thread, now inside the call to
DeleteThreadReflectedTransformMap().  The ForwardingResult
is still on the forwardList, but now has resultPtr->evPtr == NULL,
which leads to the crash.
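
The two orderings described above can be replayed deterministically in a single thread, with each step standing for one critical section under the forwarding mutex. This is a minimal sketch: the structures are simplified, hypothetical stand-ins for the ticket's ForwardingEvent and ForwardingResult, not the tclIORTrans.c definitions. ReplayOrdering() reports how many entries the thread-exit sweep would find with evPtr == NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the ticket's structures; the real
 * tclIORTrans.c definitions carry more fields. */
typedef struct ForwardingEvent {
    int param;
} ForwardingEvent;

typedef struct ForwardingResult {
    ForwardingEvent *evPtr;            /* set to NULL once detached */
    int result;                        /* -1 stands in for TCL_ERROR */
    struct ForwardingResult *nextPtr;
} ForwardingResult;

static ForwardingResult *forwardList = NULL;

static void ListPush(ForwardingResult *r) {
    r->nextPtr = forwardList;
    forwardList = r;
}

static void ListRemove(ForwardingResult *r) {
    ForwardingResult **pp = &forwardList;
    while (*pp != NULL && *pp != r) {
        pp = &(*pp)->nextPtr;
    }
    if (*pp == r) {
        *pp = r->nextPtr;
    }
}

/* Entries the thread-exit sweep would find with evPtr == NULL; an
 * unchecked evPtr->param on any of these dereferences NULL. */
static int CountDetached(void) {
    int n = 0;
    for (ForwardingResult *r = forwardList; r != NULL; r = r->nextPtr) {
        if (r->evPtr == NULL) {
            n++;
        }
    }
    return n;
}

static int ReplayOrdering(int waiterWinsLock) {
    ForwardingEvent ev = { 0 };
    ForwardingResult res = { &ev, 0, NULL };
    forwardList = NULL;

    ListPush(&res);                /* thread 1: post request, then wait */

    res.evPtr = NULL;              /* thread 2: DeleteReflectedTransformMap()-like */
    res.result = -1;               /* step detaches the event, reports an error */

    if (waiterWinsLock) {
        ListRemove(&res);          /* thread 1 wakes and unlinks first */
    }
    int seen = CountDetached();    /* thread 2: thread-exit sweep */
    if (!waiterWinsLock) {
        ListRemove(&res);          /* thread 1 gets the lock only later */
    }
    assert(forwardList == NULL);   /* nothing left dangling on the list */
    return seen;
}
```

When the waiter wins the lock, the sweep sees no detached entries; when thread 2 re-locks first, it sees exactly one, which is the case that crashed.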

What must change is that no code processing the items on the
forwardList can assume that, just because a resultPtr is on the
list, it has a non-NULL resultPtr->evPtr.  Fortunately
that's an easy fix: check for NULL and continue when it is
seen.
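
The fix amounts to a NULL check at the top of the sweep loop. A minimal sketch of that shape follows, again with simplified, hypothetical structures rather than the actual tclIORTrans.c code; any filtering the real sweep performs (for example by destination thread) is omitted, and evPtr->errorSet stands in for the error that ForwardSetStaticError() records.

```c
#include <assert.h>
#include <stddef.h>

struct ForwardingResult;

/* Simplified, hypothetical stand-ins for the ticket's structures. */
typedef struct ForwardingEvent {
    int errorSet;                          /* stands in for the param's error */
    struct ForwardingResult *resultPtr;
} ForwardingEvent;

typedef struct ForwardingResult {
    ForwardingEvent *evPtr;                /* may already be NULL */
    int result;                            /* 0 = pending, -1 = TCL_ERROR */
    struct ForwardingResult *nextPtr;
} ForwardingResult;

/* Sketch of a thread-exit sweep over forwardList, assuming the caller
 * holds the forwarding mutex. Returns how many entries it detached. */
static int SweepForwardList(ForwardingResult *forwardList) {
    int detached = 0;
    for (ForwardingResult *r = forwardList; r != NULL; r = r->nextPtr) {
        ForwardingEvent *evPtr = r->evPtr;
        if (evPtr == NULL) {
            /* Already detached by an earlier teardown step; the waiting
             * thread just has not unlinked it yet. In this model such an
             * entry always carries a final result. */
            assert(r->result != 0);
            continue;
        }
        evPtr->errorSet = 1;     /* stands in for ForwardSetStaticError() */
        evPtr->resultPtr = NULL;
        r->evPtr = NULL;
        r->result = -1;
        /* A real implementation would also notify r's condition here. */
        detached++;
    }
    return detached;
}
```

Skipping is safe in this model because the detached entry already holds its error result; the waiting thread unlinks it when it next acquires the mutex.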

dgp added on 2014-06-19 15:44:31:
In the segfaulting run, the Tcl_ConditionWait() call
doesn't return.

aku added on 2014-06-18 19:55:33:
A quick note: the forwarding code in IORTrans should be identical to the forwarding code in tclIORChan. Bugs in one are likely in the other as well.

dgp added on 2014-06-18 14:11:39:
(gdb) bt
#0  0x0000000000542697 in DeleteThreadReflectedTransformMap (clientData=0x0)
    at /home/dgp/fossil/tcl/generic/tclIORTrans.c:2353
#1  0x00000000004ffbcf in Tcl_FinalizeThread ()
    at /home/dgp/fossil/tcl/generic/tclEvent.c:1294
#2  0x000000000057fb58 in Tcl_ExitThread (status=0)
    at /home/dgp/fossil/tcl/generic/tclThread.c:470
#3  0x00000000004fff9e in NewThreadProc (clientData=0xa44df0)
    at /home/dgp/fossil/tcl/generic/tclEvent.c:1559
#4  0x0000003ddb00683d in start_thread () from /lib64/libpthread.so.0
#5  0x0000003dda4d526d in clone () from /lib64/libc.so.6

Passing to aku, in hopes he understands the management
of the "forwardList" so I don't have to digest it.
