Tcl Source Code

View Ticket
Login
Ticket UUID: 2964425
Title: SEGV during make test in ioTrans.test
Type: Bug Version: obsolete: 8.6b1.1
Submitter: eee Created on: 2010-03-05 19:05:06
Subsystem: 24. Channel Commands Assigned To: dkf
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2010-03-06 06:01:53
Resolution: Fixed Closed By: dkf
    Closed on: 2010-03-05 22:51:08
Description:
$ uname -a
SunOS 5.9 Generic_122300-37 sun4u sparc SUNW,UltraAX-i2
$ gcc --version
gcc (GCC) 3.4.2

I ran ./configure --enable-threads --enable-64bit --enable-64bit-vis --enable-symbols --enable-dtrace, then I did 'make' and 'make test'.

During make test, I saw "ioTrans.test
Test file error: child killed: segmentation violation"
and make test reported that there were zero failures but also said "Test files exiting with errors: ioTrans.test".

I re-ran the test using the command:
$ TESTFLAGS='-verbose betfp -file ioTrans.test' make test

The resulting output is attached.
User Comments: dkf added on 2010-03-06 05:51:08:

allow_comments - 1

OK, identified what caused it. The code to reflect methods had some very strange local variable usage which I cleaned up, but I missed removing an extra Tcl_DecrRefCount which had become unnecessary.

andreas_kupries added on 2010-03-06 05:21:53:
Oh. ... The change not described in the ChangeLog was, I believe, all about changing macro's to do/while notation. I can easily image something going wrong there. And he did to commits I believe, the other something in OO ... Does -D allow for finer granularity than days ?

ferrieux added on 2010-03-06 05:19:54:
In the meantime, I got the bisection done :)
Indeed dkf's commit of today 2010-03-05 is at fault (cvs update -D 2010-03-04 doesn't have the bug). Note it isn't described in the ChangeLog file.

andreas_kupries added on 2010-03-06 05:17:04:
The $c tempchan has the contents of 'test data', and the 'ata' is the last part ... So, this is read data, and properly in the expected result. At this moment it is not clear to me what in my code would try to treat it as integer.

ferrieux added on 2010-03-06 05:07:18:
Uh, I realize this list is the result.
Interesting to note is the fact that the 5 last elements of this 10-element list are:

  expected integer but got {ata\n}

and it turns out ata\n is the thing expected in front of the XXX position.

Does this figure ?

andreas_kupries added on 2010-03-06 05:03:07:
Ah, that is either the 'expected' value for the list of operations, see the code of the testcase, or the list of actual operations accumulated and shuffled around by the helper commands (notes, variable 'res').

ferrieux added on 2010-03-06 04:57:29:
Yes. Also (and maybe in parallel), pinpointing the refcount-zero value in the nested list could help, because the surrounding list seems related to the reflected channel:

{{limit? rt0} {read rt0 {test d}} {limit? rt0} {read rt0 XXX} ...}

The zero-refcount object is at position XXX above. The ellipsis marks 6 other elements (totalling 10 in the toplevel list) .

andreas_kupries added on 2010-03-06 04:48:10:
Huh. That could be a possibility as well. Some memory bug triggering all sorts of strange things. Maybe even a new refcounting error ? Another recent change was to some OO teardown (oo:class oo::object, which are, in dkf's words, in 'incestuous' relationship).

It seems the best we can do now would be to bisect to find when the crash came in, and then work from whatever changes are fingered by that.

ferrieux added on 2010-03-06 04:44:57:
Thanks for the explanation.

Now, could it be that the interplay between the two cooperating threads is done is a slightly wrong order, with something freed before the other thread needs it ?

As a matter of fact, the objPtr in the crash also has refCount==0, and it is not a freshly allocated object, it is an element in a nested list passed as argument to [string match]. Maybe an old refcounting error, recently exposed by a twist of fate ?

andreas_kupries added on 2010-03-06 04:43:12:
The core of .tf-6.1 is

    set c [chan push [tempchan] foo]
    notes [inthread $c {
note [read $c 10]
**close $c
set notes
    } c]

The channel is closed at **. The interesting thing is that you say that it is getting closed in thread::exit instead, i.e. much later ?! Or is thread::exit operating on a different channel ?

andreas_kupries added on 2010-03-06 04:40:03:
The thread closing the channel does own it.

The ".tf" tests are all about threading. They _move_ the channel to a second thread (inthread helper procedure), which operates on it, including closing. The "ForwardOpToOwnerThread" function should possibly be renamed. The "OwnerThread" is the thread which contains the script-level handler of the transform, because that is the part we cannot move with the channel to the other thread. So the reflected-transform phones home with each operation done in that thread so that the script-level handler in the thread where it was created can do what the user wanted, and then phones the result back.

This is described in http://tip.tcl.tk/230 "Interaction with Threads and Other Interpreters".

ferrieux added on 2010-03-06 04:29:40:
It should also be noted that this happens only in threaded build. Knowing this, I looked into the two other threads here. One if the Notifier (of course). But the other one is more tasty: it is doing thread::exit, hence closing channels; in that process it is calling  ForwardOpToOwnerThread(), and blocks on a condvar waiting for the other (crashing) thread to do its part.

 Andreas, why are we closing a non-owned channel despite the exit reform ?

gdb trace of the non-crashing thread:
[Switching to thread 3 (Thread 0xb6c7eb70 (LWP 2193))]#0  0x00aa1424 in __kernel_vsyscall ()
(gdb) where
#0  0x00aa1424 in __kernel_vsyscall ()
#1  0x00507c45 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x00d7c36c in Tcl_ConditionWait (condPtr=0xb61416e4, mutexPtr=0xdb5704, timePtr=0x0) at /home/alex/src/cvs/tcl/unix/tclUnixThrd.c:555
#3  0x00d25863 in ForwardOpToOwnerThread (rtPtr=0x8fa7178, op=ForwardedClose, param=0xb6c7dc78) at /home/alex/src/cvs/tcl/generic/tclIORTrans.c:2421
#4  0x00d243a2 in ReflectClose (clientData=0x8fa7178, interp=0x0) at /home/alex/src/cvs/tcl/generic/tclIORTrans.c:919
#5  0x00d11837 in ChanClose (chanPtr=0x8fb96a0, interp=0x0) at /home/alex/src/cvs/tcl/generic/tclIO.c:258
#6  0x00d11b92 in TclFinalizeIOSubsystem () at /home/alex/src/cvs/tcl/generic/tclIO.c:470
#7  0x00ceeb59 in Tcl_FinalizeThread () at /home/alex/src/cvs/tcl/generic/tclEvent.c:1293
#8  0x00d59775 in Tcl_ExitThread (status=0) at /home/alex/src/cvs/tcl/generic/tclThread.c:467
#9  0x0805c613 in ThreadObjCmd (dummy=0x0, interp=0xb6104a60, objc=2, objv=0xb610d600) at /home/alex/src/cvs/tcl/generic/tclThreadTest.c:331
#10 0x00c83771 in NRRunObjProc (data=0xb610ba9c, interp=0xb6104a60, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4347
#11 0x00c83535 in TclNRRunCallbacks (interp=0xb6104a60, result=0, rootPtr=0xb610bea0, tebcCall=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4294
#12 0x00c82eea in Tcl_EvalObjv (interp=0xb6104a60, objc=2, objv=0xb610d600, flags=2097152) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4066
#13 0x00c85041 in TclEvalEx (interp=0xb6104a60, script=0x8f938a0 "testthread exit", numBytes=15, flags=0, line=1, clNextOuter=0x0, outerScript=0x8f938a0 "testthread exit") at /home/alex/src/cvs/tcl/generic/tclBasic.c:5236
#14 0x00c84658 in Tcl_EvalEx (interp=0xb6104a60, script=0x8f938a0 "testthread exit", numBytes=-1, flags=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4892
#15 0x00c8593b in Tcl_Eval (interp=0xb6104a60, script=0x8f938a0 "testthread exit") at /home/alex/src/cvs/tcl/generic/tclBasic.c:5809
#16 0x00c86f66 in Tcl_GlobalEval (interp=0xb6104a60, command=0x8f938a0 "testthread exit") at /home/alex/src/cvs/tcl/generic/tclBasic.c:6958
#17 0x0805da9f in ThreadEventProc (evPtr=0x8f93a20, mask=-3) at /home/alex/src/cvs/tcl/generic/tclThreadTest.c:1032
#18 0x00d38fe8 in Tcl_ServiceEvent (flags=-3) at /home/alex/src/cvs/tcl/generic/tclNotify.c:671
#19 0x00d39406 in Tcl_DoOneEvent (flags=-3) at /home/alex/src/cvs/tcl/generic/tclNotify.c:972
#20 0x0805cc5e in ThreadObjCmd (dummy=0x0, interp=0xb6104a60, objc=2, objv=0xb610d340) at /home/alex/src/cvs/tcl/generic/tclThreadTest.c:466
#21 0x00c83771 in NRRunObjProc (data=0xb610be8c, interp=0xb6104a60, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4347
#22 0x00c83535 in TclNRRunCallbacks (interp=0xb6104a60, result=0, rootPtr=0x0, tebcCall=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4294
#23 0x00c82eea in Tcl_EvalObjv (interp=0xb6104a60, objc=2, objv=0xb610d340, flags=2097152) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4066
#24 0x00c85041 in TclEvalEx (interp=0xb6104a60, script=0xb6128a00 "testthread wait", numBytes=15, flags=0, line=1, clNextOuter=0x0, outerScript=0xb6128a00 "testthread wait") at /home/alex/src/cvs/tcl/generic/tclBasic.c:5236
#25 0x00c84658 in Tcl_EvalEx (interp=0xb6104a60, script=0xb6128a00 "testthread wait", numBytes=-1, flags=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4892
#26 0x00c8593b in Tcl_Eval (interp=0xb6104a60, script=0xb6128a00 "testthread wait") at /home/alex/src/cvs/tcl/generic/tclBasic.c:5809
#27 0x0805cf7c in NewTestThread (clientData=0xbf907864) at /home/alex/src/cvs/tcl/generic/tclThreadTest.c:619
#28 0x00ceef45 in NewThreadProc (clientData=0x8f92a80) at /home/alex/src/cvs/tcl/generic/tclEvent.c:1554
#29 0x00503ab5 in start_thread () from /lib/libpthread.so.0
#30 0x0045a83e in clone () from /lib/libc.so.6
(gdb)

ferrieux added on 2010-03-06 04:17:39:
Correction, it's iortrans.tf-6.1

Below is the gdb trace, showing the crash in Tcl_GetString.
Interesting fact: we have an objPtr with simultaneously NULL ->bytes and ->typePtr.
Note the following comment just before the point of crash:

/*
 * Note we do not check for objPtr->typePtr == NULL.  An invariant of
 * a properly maintained Tcl_Obj is that at least  one of objPtr->bytes
 * and objPtr->typePtr must not be NULL.  If broken extensions fail to
 * maintain that invariant, we can crash here.
 */

Program received signal SIGSEGV, Segmentation fault.
0x00d3a24f in Tcl_GetString (objPtr=0x8fc51a8) at /home/alex/src/cvs/tcl/generic/tclObj.c:1627
(gdb) where
#0  0x00d3a24f in Tcl_GetString (objPtr=0x8fc51a8) at /home/alex/src/cvs/tcl/generic/tclObj.c:1627
#1  0x00d3a2e0 in Tcl_GetStringFromObj (objPtr=0x8fc51a8, lengthPtr=0xbf9076f4) at /home/alex/src/cvs/tcl/generic/tclObj.c:1677
#2  0x00d2f962 in UpdateStringOfList (listPtr=0x8f981d8) at /home/alex/src/cvs/tcl/generic/tclListObj.c:1889
#3  0x00d3a278 in Tcl_GetString (objPtr=0x8f981d8) at /home/alex/src/cvs/tcl/generic/tclObj.c:1637
#4  0x00d3a2e0 in Tcl_GetStringFromObj (objPtr=0x8f981d8, lengthPtr=0xbf9077e4) at /home/alex/src/cvs/tcl/generic/tclObj.c:1677
#5  0x00d2f962 in UpdateStringOfList (listPtr=0x8f98148) at /home/alex/src/cvs/tcl/generic/tclListObj.c:1889
#6  0x00d3a278 in Tcl_GetString (objPtr=0x8f98148) at /home/alex/src/cvs/tcl/generic/tclObj.c:1637
#7  0x00d63d39 in TclStringMatchObj (strObj=0x8f98148, ptnObj=0x8fc3f48, flags=0) at /home/alex/src/cvs/tcl/generic/tclUtil.c:1777
#8  0x00ca2840 in StringMatchCmd (dummy=0x0, interp=0x8ee9770, objc=3, objv=0x8fb98b0) at /home/alex/src/cvs/tcl/generic/tclCmdMZ.c:2069
#9  0x00c83771 in NRRunObjProc (data=0x8fc517c, interp=0x8ee9770, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4347
#10 0x00c83535 in TclNRRunCallbacks (interp=0x8ee9770, result=0, rootPtr=0x8fc43c8, tebcCall=1) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4294
#11 0x00cf2c99 in TclExecuteByteCode (interp=0x8ee9770, codePtr=0x8fcec88) at /home/alex/src/cvs/tcl/generic/tclExecute.c:2806
#12 0x00c837f4 in NRCallTEBC (data=0x8fc3fac, interp=0x8ee9770, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4370
#13 0x00c83535 in TclNRRunCallbacks (interp=0x8ee9770, result=0, rootPtr=0x0, tebcCall=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4294
#14 0x00c82eea in Tcl_EvalObjv (interp=0x8ee9770, objc=11, objv=0x8ef09a0, flags=2097152) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4066
#15 0x00c85041 in TclEvalEx (interp=0x8ee9770, script=0x8f33118 "# -*- tcl -*-\n# Functionality covered: operation of the reflected transformation\n#\n# This file contains a collection of tests for one or more of the Tcl\n# built-in commands.  Sourcing this file into T"..., numBytes=46734, flags=0, line=1243, clNextOuter=0x0, outerScript=0x8f33118 "# -*- tcl -*-\n# Functionality covered: operation of the reflected transformation\n#\n# This file contains a collection of tests for one or more of the Tcl\n# built-in commands.  Sourcing this file into T"...) at /home/alex/src/cvs/tcl/generic/tclBasic.c:5236
#16 0x00c84658 in Tcl_EvalEx (interp=0x8ee9770, script=0x8f33118 "# -*- tcl -*-\n# Functionality covered: operation of the reflected transformation\n#\n# This file contains a collection of tests for one or more of the Tcl\n# built-in commands.  Sourcing this file into T"..., numBytes=46734, flags=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4892
#17 0x00d29e09 in Tcl_FSEvalFileEx (interp=0x8ee9770, pathPtr=0x8eeef90, encodingName=0x0) at /home/alex/src/cvs/tcl/generic/tclIOUtil.c:1750
#18 0x00d326c4 in Tcl_Main (argc=-1, argv=0xbf908544, appInitProc=0x8048b72 <Tcl_AppInit>) at /home/alex/src/cvs/tcl/generic/tclMain.c:358
#19 0x08048b65 in main (argc=36, argv=0xbf9084b4) at /home/alex/src/cvs/tcl/unix/tclAppInit.c:85
(gdb) print *objPtr
$1 = {refCount = 0, bytes = 0x0, length = -1, typePtr = 0x0, internalRep = {longValue = 150754568, doubleValue = 1.9791706336425661e-265, otherValuePtr = 0x8fc5508, wideValue = 646696352645666056, twoPtrValue = {ptr1 = 0x8fc5508, ptr2 = 0x8f986e8}, ptrAndLongRep = {ptr = 0x8fc5508, value = 150570728}}}
(gdb)

andreas_kupries added on 2010-03-06 02:54:02:
On the other hand, dkf did lots of changes throughout in that changeset, maybe one of them affected something it should not.

andreas_kupries added on 2010-03-06 02:49:03:
There was a recent change to tclIORTrans.c (rev 1.13) ... Just reviewed, and it looks harmless.

ferrieux added on 2010-03-06 02:17:27:
You're lucky, it's reproducible on 32-bit Linux.
You can also precise that the crashing test is iortrans-6.1.
Diving.

eee added on 2010-03-06 02:06:13:
Setting prio to 9 per kbk.

eee added on 2010-03-06 02:05:06:

File Added - 365562: test_fail.txt

Attachments: