Tcl package Thread source code

View Ticket
Login
Ticket UUID: 44499074cb431338a001c9d381913ffca29961d3
Title: SIGSEGV under heavy load
Type: Bug Version: 2.8.1
Submitter: tombert Created on: 2017-05-29 12:21:09
Subsystem: 80. Thread Package Assigned To: nobody
Priority: 5 Medium Severity: Critical
Status: Closed Last Modified: 2017-06-16 20:05:33
Resolution: Rejected Closed By: tombert
    Closed on: 2017-06-16 20:05:33
Description:
Ubuntu 14, Tcl/Tk 8.6.6; Running in wish
16 threads concurrently accessing TSVs. Not consistently crashing, but after some period of time.
Here it seems related to the TSV:


Sv_DuplicateObj (objPtr=0xefb5ef0030) at ./generic/threadSvCmd.c:989
989	    if (objPtr->typePtr != NULL) {

#0  Sv_DuplicateObj (objPtr=0xefb5ef0030) at ./generic/threadSvCmd.c:989
#1  0x00007ffff410bf08 in DupListObjShared (srcPtr=0x1048ae0, copyPtr=0x7fffe9175110) at ./generic/threadSvListCmd.c:903
#2  0x00007ffff4101318 in Sv_DuplicateObj (objPtr=0x1048ae0) at ./generic/threadSvCmd.c:1015
#3  0x00007ffff4102ad9 in SvGetObjCmd (arg=0x0, interp=0x7fffb54feb40, objc=3, objv=0x7fffb550d640)
    at ./generic/threadSvCmd.c:1582
#4  0x00007ffff76306a6 in Dispatch (data=0x7fffc163b398, interp=0x7fffb54feb40, result=0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4358
#5  0x00007ffff7630734 in TclNRRunCallbacks (interp=0x7fffb54feb40, result=0, rootPtr=0x7fffed1226d0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4391
#6  0x00007ffff762fec6 in Tcl_EvalObjv (interp=0x7fffb54feb40, objc=4, objv=0x7fffb550d2a0, flags=2097168)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4121
#7  0x00007ffff763271e in TclEvalEx (interp=0x7fffb54feb40, 
    script=0x7fffcd76e3b0 "\n        ## create a sorted list of thread priorities\n        set sortedTids {}\n        foreach tid [tsv::get objlockTidQueue global] {\n", ' ' <repeats 12 times>, "lappend sortedTids [tsv::get objlockTidPrio $tid] $t"..., 
    numBytes=1195, flags=262144, line=9, clNextOuter=0x0, 
    outerScript=0x7fffcd76e3b0 "\n        ## create a sorted list of thread priorities\n        set sortedTids {}\n        foreach tid [tsv::get objlockTidQueue global] {\n", ' ' <repeats 12 times>, "lappend sortedTids [tsv::get objlockTidPrio $tid] $t"...) at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:5260
#8  0x00007ffff76319d2 in Tcl_EvalEx (interp=0x7fffb54feb40, 
    script=0x7fffcd76e3b0 "\n        ## create a sorted list of thread priorities\n        set sortedTids {}\n        foreach tid [tsv::get objlockTidQueue global] {\n", ' ' <repeats 12 times>, "lappend sortedTids [tsv::get objlockTidPrio $tid] $t"..., 
    numBytes=1195, flags=262144) at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4925
#9  0x00007ffff7633832 in TclNREvalObjEx (interp=0x7fffb54feb40, objPtr=0x7fffcd779730, flags=262144, invoker=0x0, word=0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:6121
User Comments: tombert added on 2017-06-16 20:04:20:
Still no crash. I thing with peace of conscience this bug report can be closed and assumed that all comes from this bug: 

http://core.tcl.tk/tk/tktview/c64420e94ac3e6560ead2065533d254beee538a6

thx

tombert added on 2017-06-15 20:50:15:
I was not able to strip down my program to show that error. Also, and that was confusing, the bug really seems to be Tk related.

I thought I had removed all Tk stuff from the program, but then I saw another [package re Tk] in the main thread. As soon as I remove this [package re Tk] the program runs without crashing.

I am having the program running for several hours now. If it continues running over night I think I can be sure about that.

sebres added on 2017-06-15 06:49:38:
As already said, you catch here an aftereffects (of something evil previously).

The last time I asked you for an example, that produces it. It's too complex? Needs something external?

Another question was "Could you completelly exclude tk-impact here (or some another modules, that are not thread-safe possibly)?"

tombert added on 2017-06-14 22:44:55:
By running helgrind I get a lot of data races such as:

==2614== ----------------------------------------------------------------
==2614== 
==2614== Possible data race during write of size 4 at 0x5602C170 by thread #78
==2614== Locks held: none
==2614==    at 0x4FAAFE3: DupListInternalRep (tclListObj.c:1798)
==2614==    by 0x4FB7DF5: Tcl_DuplicateObj (tclObj.c:1588)
==2614==    by 0x4F615D4: TEBCresume (tclExecute.c:3671)
==2614==    by 0x4E86733: TclNRRunCallbacks (tclBasic.c:4391)
==2614==    by 0x4E8936D: TclEvalObjEx (tclBasic.c:5957)
==2614==    by 0x4E89306: Tcl_EvalObjEx (tclBasic.c:5938)
==2614==    by 0x4FE24DA: AfterProc (tclTimer.c:1191)
==2614==    by 0x4FE144E: TimerHandlerEventProc (tclTimer.c:593)
==2614==    by 0x4FB6CEC: Tcl_ServiceEvent (tclNotify.c:670)
==2614==    by 0x4FB7128: Tcl_DoOneEvent (tclNotify.c:967)
==2614==    by 0x72A3331: ThreadWait (threadCmd.c:2882)
==2614==    by 0x72A05CC: ThreadWaitObjCmd (threadCmd.c:1163)
==2614== 
==2614== This conflicts with a previous write of size 4 by thread #77
==2614== Locks held: none
==2614==    at 0x4FAAF1F: FreeListInternalRep (tclListObj.c:1761)
==2614==    by 0x4FB7BAC: TclFreeObj (tclObj.c:1461)
==2614==    by 0x4F5CE7C: TEBCresume (tclExecute.c:2310)
==2614==    by 0x4E86733: TclNRRunCallbacks (tclBasic.c:4391)
==2614==    by 0x4E8936D: TclEvalObjEx (tclBasic.c:5957)
==2614==    by 0x4E89306: Tcl_EvalObjEx (tclBasic.c:5938)
==2614==    by 0x4FE24DA: AfterProc (tclTimer.c:1191)
==2614==    by 0x4FE144E: TimerHandlerEventProc (tclTimer.c:593)
==2614==  Address 0x5602c170 is 10,384 bytes inside a block of size 16,384 alloc'd
==2614==    at 0x4C2BFA0: malloc (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==2614==    by 0x4FE08A9: GetBlocks (tclThreadAlloc.c:1044)
==2614==    by 0x4FDF68C: TclpAlloc (tclThreadAlloc.c:358)
==2614==    by 0x4FDFB94: TclpRealloc (tclThreadAlloc.c:514)
==2614==    by 0x4FD4447: GrowStringBuffer (tclStringObj.c:154)
==2614==    by 0x4FD5B3B: AppendUtfToUtfRep (tclStringObj.c:1562)
==2614==    by 0x4FD540A: Tcl_AppendLimitedToObj (tclStringObj.c:1089)
==2614==    by 0x4FD54B1: Tcl_AppendToObj (tclStringObj.c:1130)
==2614==    by 0x4F8EC09: ReadChars (tclIO.c:6044)
==2614==    by 0x4F8E801: DoReadChars (tclIO.c:5836)
==2614==    by 0x4F8E594: Tcl_ReadChars (tclIO.c:5721)
==2614==    by 0x4FA48E7: TclNREvalFile (tclIOUtil.c:1937)
==2614==  Block was alloc'd by thread #78
==2614== 

Since the SIGSEGV is happening in Sv_DuplicateObj method and helgrind warns about Tcl_DuplicateObj should'nt this be worth analyzing further?

sebres added on 2017-06-01 12:16:25:

> I stumbled over the TK bug.

You mean [1e3409bf1c] or yet another one?

> can't I set a watch to identify the procedure corrupting the memory? somehow by constantly checking boundaries?

Sorry but I guess not, resp. not so easy without compiling using something like "Rational Purify", but this way (because of the overhead) you can run in the situation where it is not reproducible anymore.


tombert added on 2017-06-01 11:34:54:
I feared that you gone ask for a "short" example - by trying so I stumbled over the TK bug.
It must be a tsv variable that is corrupted - can't I set a watch to identify the procedure corrupting the memory? somehow by constantly checking boundaries?

I know it is best to provide an example, but I have constantly failed as the bug might occur in the first minutes or after several hours. How do I know if the bug is still there? Not at all.

But ok - if you say there is no other way I will do so ...

sebres added on 2017-06-01 10:20:25:
> Why is llen so high? 

I assume that this list is already broken (tcl-object released? or something similar)..

> How can I find the variable name (can I get it from Tcl_Obj)?

No, you cannot at all. Also it must not necessarily be a variable, just any sub-list or any list used internally.
I assume that many of the call-stack dumps you posted are just consequential errors.

Better would be if you can probide a short test example running in tcl-shell, that seg-faulted by you. If you've such example please post it here.

tombert added on 2017-05-31 21:43:54:
Now in SvSetObjCmd:

print objv[0]
$32 = (Tcl_Obj * const) 0x7fffb8ed3e10
(gdb) print objv[0]->typePtr
$33 = (const Tcl_ObjType *) 0x7ffff7dd8bc0 <tclCmdNameType>
(gdb) print objv[1]->typePtr
$34 = (const Tcl_ObjType *) 0x0
(gdb) print objv[2]->typePtr
$35 = (const Tcl_ObjType *) 0x0
(gdb) print objv[3]->typePtr
$36 = (const Tcl_ObjType *) 0x7ffff7dce9c0 <tclListType>
(gdb) print objv[4]->typePtr
$37 = (const Tcl_ObjType *) 0x0
(gdb) print objv[5]->typePtr
$38 = (const Tcl_ObjType *) 0x202020200a5d312e
(gdb) print objc
$39 = 4
(gdb) print off
$40 = 3
(gdb) print arg
$41 = (ClientData) 0x0
(gdb) print svObj
$42 = (Container *) 0x9bf428


what puzzles me is that svObj seems to point to an invalid address ...

tombert added on 2017-05-31 21:07:43:
I run this test again in the tclsh but this crash still happens.
Unfort. I am not a C expert (thats why I love TCL) but would it be possible to guide me through debugging?

I moved the stack up (now in DupListObjShared) and found following interesting:

(gdb) print llen
$5 = 895694950
(gdb) print srcPtr
$6 = (Tcl_Obj *) 0x7fffb12093a0
(gdb) whatis srcPtr
type = Tcl_Obj *
(gdb) print srcPtr->typePtr
$7 = (const Tcl_ObjType *) 0x7ffff7dce9c0 <tclListType>
(gdb) print srcPtr->refCount
$8 = 247
(gdb) print srcPtr->bytes
$9 = 0x7ffff7dd9448 <tclEmptyString> ""
(gdb) print srcPtr->length
$10 = 0
(gdb) print srcPtr->typePtr
$11 = (const Tcl_ObjType *) 0x7ffff7dce9c0 <tclListType>


Why is llen so high? How can I find the variable name (can I get it from Tcl_Obj)?
How can I peek into the list elements to get an idea what makes llen so high?
etc. etc.

If llen, for whatever reason, is correct, then Tcl_ListObjIndex(NULL, srcPtr, i, &elObj) should be able to copy the content? But it seems to fail.

Do you think this is worth trying? Or am I on the wrong path?

Regards

sebres added on 2017-05-31 19:47:37:

If this ticket belongs the same issue as [1e3409bf1c] (not thread-safe tkcon) then please close it also.


tombert added on 2017-05-29 15:45:26:
Your example below is clear to me, but I am not passing handles to threads. 
No special extension/objects used. I am only passing strings.
In the code below [obj] is a procedure name being executed when it is queued.

tsv::lock objlockLockedObjs {
        tsv::set objlockTidPrio [thread::id] $priority
        tsv::lappend objlockTidQueue global [thread::id]
        tsv::lappend objlockPendingObjs [thread::id] {*}$args
        foreach obj $args {$obj -thread queue gthread objlock lock}
    }

tsv::lock objlockLockedObjs {
        ## corner case - unlock if locking is still pending
        if {[thread::id] in [tsv::get objlockTidQueue global]} {
            tsv::set objlockPendingObjs [thread::id] {}
            tsv::lrem objlockTidQueue global [thread::id]
            ## release from sleep
            thread::send -async [thread::id] [list gthread objlock wakeup]
        }
        ## release owned locks
        foreach {obj tid} [tsv::array get objlockLockedObjs] {if {$tid == [thread::id]} {tsv::set objlockLockedObjs $obj {}}}
    }

sebres added on 2017-05-29 13:28:05:

I don't see something evil in Sv_DuplicateObj resp. DupListObjShared...

Can you possibly provide more info which types of objects are set to the TSV there?

By the tsv, only known types would be duplicated with its internal representation (using typePtr->dupIntRepProc like list/dict resp. scalar like int, etc). All other types would be "duplicated" using "safe" way as stringified tcl-objects, so using same string representation (as new object without type). If you use some extensions with new object types, that do not support safe multi-threaded duplication (or restoration from string representation), you should avoid to set it to the tsv resp. avoid of its usage in other threads.

E. g.:

# thread 1:
set obj [my-tcl-ext::obj ...]; # returns my-own-obj#1
tsv::set some value $obj

thread::send $th {process}

# just 1st example - dual usage of not thread-safe object: my-tcl-ext::do-something $obj # just 2nd example - even delete it in thread 1: my-tcl-ext::delete $obj

# thread 2: proc process {} { set obj [tsv::get some value] my-tcl-ext::do-something $obj; # BOOM in 2 cases, see below }

makes BOOM if:

  • either type of object has no thread-safe duplication functionality;
  • or type of object has no duplication functionality (typePtr->dupIntRepProc is NULL), but restoration of the object from string ("my-own-obj#1") tries to find the object in some list, that is not multi-threaded safe (resp. convert it to the unusable object of another thread).
  • or the extension is compiled for single-threaded environment (no mutex/etc).


tombert added on 2017-05-29 12:44:33:
Another one:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff8c7c8700 (LWP 32470)]
Sv_DuplicateObj (objPtr=0x8cef0030) at ./generic/threadSvCmd.c:989
989	    if (objPtr->typePtr != NULL) {
(gdb) 
(gdb) bt
#0  Sv_DuplicateObj (objPtr=0x8cef0030) at ./generic/threadSvCmd.c:989
#1  0x00007ffff410bf08 in DupListObjShared (srcPtr=0x7fffdcd06470, copyPtr=0x2885140) at ./generic/threadSvListCmd.c:903
#2  0x00007ffff4101318 in Sv_DuplicateObj (objPtr=0x7fffdcd06470) at ./generic/threadSvCmd.c:1015
#3  0x00007ffff4102e29 in SvSetObjCmd (arg=0x0, interp=0x7fffdccbe500, objc=4, objv=0x7fffdcccca38)
    at ./generic/threadSvCmd.c:1701
#4  0x00007ffff76306a6 in Dispatch (data=0x7fffd54c06b8, interp=0x7fffdccbe500, result=0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4358
#5  0x00007ffff7630734 in TclNRRunCallbacks (interp=0x7fffdccbe500, result=0, rootPtr=0x7fffdcd052d0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4391
#6  0x00007ffff763336e in TclEvalObjEx (interp=0x7fffdccbe500, objPtr=0x7fffbd5d5710, flags=131072, invoker=0x0, word=0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:5957
#7  0x00007ffff7633307 in Tcl_EvalObjEx (interp=0x7fffdccbe500, objPtr=0x7fffbd5d5710, flags=131072)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:5938
#8  0x00007ffff778c4db in AfterProc (clientData=0x7fffdd060690)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclTimer.c:1191
#9  0x00007ffff778b44f in TimerHandlerEventProc (evPtr=0x4267f80, flags=-3)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclTimer.c:593
#10 0x00007ffff7760ced in Tcl_ServiceEvent (flags=-3) at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclNotify.c:670
#11 0x00007ffff7761129 in Tcl_DoOneEvent (flags=-3) at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclNotify.c:967
#12 0x00007ffff40fe332 in ThreadWait (interp=0x7fffdccbe500) at ./generic/threadCmd.c:2882
#13 0x00007ffff40fb5cd in ThreadWaitObjCmd (dummy=0x0, interp=0x7fffdccbe500, objc=1, objv=0x7fffdcccbe00)
    at ./generic/threadCmd.c:1163
#14 0x00007ffff76306a6 in Dispatch (data=0x7fffdcd05d28, interp=0x7fffdccbe500, result=0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4358
#15 0x00007ffff7630734 in TclNRRunCallbacks (interp=0x7fffdccbe500, result=0, rootPtr=0x0)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4391
#16 0x00007ffff762fec6 in Tcl_EvalObjv (interp=0x7fffdccbe500, objc=1, objv=0x7fffdcccbe00, flags=2097168)
    at /home/tombert/cm/tcltk86/dcompile/tcl/generic/tclBasic.c:4121