Tcl Source Code

View Ticket
Login
Ticket UUID: 3034840
Title: Crash in UpdateStringOfList()
Type: Bug Version: obsolete: 8.6b1.1
Submitter: egavilan Created on: 2010-07-26 18:28:14
Subsystem: 27. Channel Types Assigned To: andreas_kupries
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2010-08-04 23:49:24
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2010-08-04 16:49:24
Description:
When trying to run the code on http://wiki.tcl.tk/26735 with Tcl HEAD
(updated today), using the wish8.6 interpreter, I can crash the program
by closing the wish window with the mouse.

$ gdb wish8.6
GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-slackware-linux"...
(gdb) run memchan_2.tcl
Starting program: /usr/local/bin/wish8.6 memchan_2.tcl
[Thread debugging using libthread_db enabled]
[New Thread 0xb7a25ab0 (LWP 555)]
[New Thread 0xb79eeb90 (LWP 558)]
Writing data into rc0
Reading data from rc1
**Hello World**
Writing data into rc1
Reading data from rc0
**Hello There !!**            <=== Close the "." wish window here

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7a25ab0 (LWP 555)]
0xb7eacbed in UpdateStringOfList (listPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclListObj.c:1865
1865{
(gdb) bt
#0  0xb7eacbed in UpdateStringOfList (listPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclListObj.c:1865

#1  0xb7eb784f in Tcl_GetString (objPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclObj.c:1636
#2  0xb7eb78d1 in Tcl_GetStringFromObj (objPtr=0x8072ab8, lengthPtr=0xbf2930d4)
    at /home/emiliano/src/tcl/generic/tclObj.c:1676
#3  0xb7eacc9e in UpdateStringOfList (listPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclListObj.c:1893

#4  0xb7eb784f in Tcl_GetString (objPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclObj.c:1636
#5  0xb7eb78d1 in Tcl_GetStringFromObj (objPtr=0x8072ab8, lengthPtr=0xbf2931c4)
    at /home/emiliano/src/tcl/generic/tclObj.c:1676
#6  0xb7eacc9e in UpdateStringOfList (listPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclListObj.c:1893
  
#7  0xb7eb784f in Tcl_GetString (objPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclObj.c:1636
#8  0xb7eb78d1 in Tcl_GetStringFromObj (objPtr=0x8072ab8, lengthPtr=0xbf2932b4)
    at /home/emiliano/src/tcl/generic/tclObj.c:1676
#9  0xb7eacc9e in UpdateStringOfList (listPtr=0x8072ab8)
    at /home/emiliano/src/tcl/generic/tclListObj.c:1893

Note that group #1, #2 and #3 repeats in #4, #5 and #6, and then the
backtrace output repeats endlessly with these three lines.

The problem doesn't appear when using the tclsh intepreter (I can interrupt
the program with Ctrl-C, but it finalize without a crash.

The system is Linux 2.6.27.7
User Comments: andreas_kupries added on 2010-08-04 23:49:24:

allow_comments - 1

And done.

andreas_kupries added on 2010-08-04 23:39:04:
Ok, will do. Thanks.

dgp added on 2010-08-04 22:20:20:
Please commit the fixes.

andreas_kupries added on 2010-08-04 05:28:29:
And the IORChan patch for Tcl 8.5.

andreas_kupries added on 2010-08-04 05:28:06:

File Added - 382189: tclIORChan.patch.85

andreas_kupries added on 2010-08-04 04:59:31:
Added patch for tclIORTrans.c. While this patch has a new test there is no crash, either before or after. It certainly shared the same structure, however most methods had no detail arguments, and those which have (read, write) are not called during deletion. So this change is simply a proactive cleanup.

andreas_kupries added on 2010-08-04 04:54:47:

File Added - 382184: tclIORTrans.patch

andreas_kupries added on 2010-08-04 04:22:34:
Full testsuite passed. Patch attached.

andreas_kupries added on 2010-08-04 04:22:08:

File Added - 382182: tclIORChan.patch

andreas_kupries added on 2010-08-04 04:15:32:
My patch in the works is currently running through the testsuite. Before patch the core crashed the new test, after path the new test passes, and all other tests in ioCmd.test pass as well. Now the full testsuite is brought to bear.

I will attach the patch for review, and then also look at tclIORTrans.c, which very very likely has the same issue. Because it is structurally identical and derived from tclIORChan.c. copypastamunged. I will see if a modified copy of 32.2 can crash IORTrans as well.

At last backport to 8.5.

ferrieux added on 2010-08-04 03:21:09:
Though I realize you weren't asking me ;-)
I agree with the general spirit of restoring a zero-RC-effect contract for ITM.
Those 2193-2198 were puzzling anyway.

On a side note, we are again witnessing an effect of the tradition of zero-RC objects: pouring confusion.... I admit it may have been a smart (peephole) optimization, but we definitely lost in code clarity.

dgp added on 2010-08-04 03:07:44:
Committed test iocmd-32.2 which tests for this bug.

andreas_kupries added on 2010-08-04 02:07:52:
Based on the attached RefCountFlow notes of mine it seems that calling InvokeTclMethod (ITM) with an RC=0 Tcl_Obj should consistently delete it before returning. It is only for RC=1 that argOne, argTwo can be both deleted and kept, depending on the code path taken.

3. argOneObj    (a) CCOC,   modeObj,   RC=1.    RC-- @2194,    deleted.
                        RC++ @2225 RC=2,
                        RC-- @2296, RC=1, kept.
        (b) RWatch, maskObj,   RC=1    s.o.
        (c) RClose, ---------------
        (d) RGOpt,  ---------------
        (e) RIn,    toReadObj, RC=0    RC-- @2194, deleted.
                        RC++ @2225 RC=1,
                        RC-- @2296, RC=0, deleted.

ChanCreateObjCmd()
    modeObj = DecodeEventMask() @1975 RC=1.

    InvokeTclMethod ( modeObj, NULL ) => resObj
    modeObj @593 RC-- /assumes ITM rc neutral to detail1. WRONG.
***    modeObj - may and may not be deleted in ITM.


CCOC = ChanCreateObjCmd
RXXX = ReflectXXX

We cannot simply remove the code block going from lines 2193 to 2198. While that would fix things for the case RC=1, consistently keeping the arg* Tcl_Obj's, for the case RC=0 it would then start failing, keeping / deleting the Tcl_Obj based on code path.

Best thing we can do IMHO is to consistently call ITM with arg* Tcl_Obj which are RC=1, plus removing that code block.

Then the objects are consistently kept, and the caller has to delete, i.e. cleanup, by doing its own DecrRefCount.

Does this agree with your (dgp) analysis as well ?

andreas_kupries added on 2010-08-04 02:04:49:

File Added - 382168: RefCountFlow

andreas_kupries added on 2010-07-31 00:29:47:
> It is not safe to call InputTclMethod() with
> zero ref count arguments, but that's what
> ReflectWatch() does.

Hm. Going through this I cannot see how that (maskObj with RC == 0) happens. The maskObj is created by DecodeEventMask() and this function runs Tcl_IncrRefCount immediately on it, in line 1975, causing RC == 1.

I am looking at revision 1.49 of tclIORChan.c

That being said, I do see that ReflectInput/Output and SeekWide invoke ITM with newly minted objects which have RC == 0. (Still have to look at Block and (Set/Get)Option).

andreas_kupries added on 2010-07-30 21:53:12:
dgp - This is just a note that I have seen this thread.
Thank you for your analysis. I'll do a review as quickly as possible.
Note: tclIORTrans.c is structurally similar to tclIORChan.c which this bug is apparently about.
As such it may have the same problem.

dgp added on 2010-07-30 06:55:16:
Bottom line is that InvokeTclMethod and its callers
need a complete refCount management analysis and fixup.

dgp added on 2010-07-30 06:49:18:
Come up on the chat that just adding
a Tcl_IncrRefCount() isn't good enough
to fix things when the rcPtr->interp == NULL
case is in effect.  Then we still would get
two Decrs after one Incr and the same brokenness

dgp added on 2010-07-30 06:39:40:
Anyhow, just to confirm that is the root of the problem,
I added the Tcl_IncrRefCount() call and the bug goes away.
Leaving with aku to choose what fix is preferred.

dgp added on 2010-07-30 06:35:38:
It is not safe to call InputTclMethod() with
zero ref count arguments, but that's what
ReflectWatch() does.

The "design" of this code isn't fully clear
to me, so I don't know whether the right
fix is to add a Tcl_IncrRefCount() before
that call (and check all the similar calls
in the same source code file), or to revise
InvokeTclMethod() itself.

ferrieux added on 2010-07-30 06:35:08:
in InvokeTclMethod, dead channel case, why are argOne and argTwo decr-ref'ed ?

sinner candidate ?

dgp added on 2010-07-30 06:13:28:
Much worse than that.

Somehow the same Tcl_Obj is being
returned to the tclFreeObjList twice in
a row, which leaves it pointing to itself
as the next Tcl_Obj memory chunk to
allocate.  Hilarity ensues.

The code which is making the bad
push back onto the free list is the

Tcl_DecrRefCount(maskObj);

in ReflectWatch(), but it's not yet
clear whether it's the sinner or the
victim of even earlier corruption.

dgp added on 2010-07-30 04:36:08:
The bug is almost certainly in the
list surgery done in FixLevelCode().

dgp added on 2010-07-30 02:05:06:
The invalid value traces back to the
Tcl_SetReturnOptions() call in 
UnmarshallErrorResult().

Since I don't see anything obviously
wrong there, I suspect the error goes
back further and a corrupt value is
getting marshalled in the first place.
Still tracking...

dgp added on 2010-07-30 01:22:56:
That doesn't seem to be it.

If I disable the optimized dict->list
conversion code, the bug remains.

dgp added on 2010-07-30 01:07:41:
Quick scan of the changes file fingers
this as the most likely suspect:

2008-07-20 (enhancement)[2008248] dict->list preserve item intreps (pasadyn)

dkf?  Any chance a bug in that change could cause internal rep
corruption where a list tries to make an element of itself?  And then
any hints what could trigger that bug?

dgp added on 2010-07-30 01:01:48:
Issue appears between the 8.5.3 and 8.5.4 releases of Tcl/Tk.

dgp added on 2010-07-29 20:53:47:
This is an infinite loop created because a tclListType
value somehow has itself as its own second element!

A list value is not supposed to be an element of itself,
and it's clearly an invalid state for a multi-element list
to claim to me an element of itself.  The failure here
is not in the tclListType machinery, but it whatever code
corrupted this value into an invalid state.

Just based on the players involved, I'd point fingers
in the general direction of how (reflected?) channels
interact with exit cleanup.

Another approach might be to `git bisect` to discover
when this failure first showed up.

egavilan added on 2010-07-27 22:05:09:
The problem doesn't happen when Tk is loaded dinamically into the tclsh interpreter via [package require Tk], only using the wish interpreter.
Tried thread-enabled and thread-disabled builds, both crash.

nijtmans added on 2010-07-27 19:39:49:
I can reproduce this problem on Linux (Ubuntu), both with the 8.5 head and 8.6 head!

dkf added on 2010-07-27 19:36:49:
Works for me with tip of 8.5 branch too.

dkf added on 2010-07-27 19:29:16:
Works for me with HEAD. Crashes in 8.5.2. (All on OSX/Intel.)

Attachments: