Tcl Source Code

View Ticket
Login
Ticket UUID: 3006741
Title: SIGSEGV in Tcl_LogCommandInfo
Type: Bug Version: None
Submitter: coldstore Created on: 2010-05-25 03:28:58
Subsystem: 45. Parsing and Eval Assigned To: ferrieux
Priority: 8 Severity:
Status: Closed Last Modified: 2010-05-28 15:20:21
Resolution: Duplicate Closed By: coldstore
    Closed on: 2010-05-28 03:14:00
Description:
Tcl_LogCommandInfo crashes

Program received signal SIGSEGV, Segmentation fault.
Tcl_LogCommandInfo (interp=0x804da18, script=0x88ecae8 "::apply {r {\n    variable pending 0\n    variable starting 1\n    set r [::yield]\n    \n    while {$starting || $pending > 0} {\n\tset starting 0\t;# one shot at starting\n\t# this coroutine loops around acce"..., command=0x88ecae8 "::apply {r {\n    variable pending 0\n    variable starting 1\n    set r [::yield]\n    \n    while {$starting || $pending > 0} {\n\tset starting 0\t;# one shot at starting\n\t# this coroutine loops around acce"..., length=3142) at /home/colin/Desktop/packages/tcl/generic/tclNamesp.c:4964

CVS checkin dates tested:
2010.04.05.14.00.00 - OK
2010.04.06.14.00.00 - breaks

Stack backtrace and relevant inspection attached.
User Comments: coldstore added on 2010-05-28 15:20:21:
I would recommend conditional narrative debugging emitted on Callframe POP and PUSH and on CONTEXT_SAVE and _RESTORE.  The narrative would consist of a dump of the stack's contents in hex.

I've got the code, and could throw some #ifdefs around it, and roll it into the core, if that's broadly considered acceptable.

ferrieux added on 2010-05-28 15:15:05:

allow_comments - 0

For reference, let's spell out the better circumscribed bugreport  that you've just created on the issue: 3008307 .

As a an aside, how would you "perform this kind of forensic inspection" ?

static int IsCorrupted(CallFrame * frame) 
{
   /* magical code here */
} 
?

coldstore added on 2010-05-28 10:14:00:

allow_comments - 1

The cause of this SEGV was found to be stack corruption caused by a bug in coroutines.

It's good that the error-time collection of frame information flushed this out, but I think it may be worth adding some conditional debuggin narrative to explicitly perform this kind of forensic inspection.

ferrieux added on 2010-05-25 21:17:40:
Thank you very much for all this work Colin.

Now my knowledge of this is currently restricted to that of a mere user of the xxxFramePtrs's. The patches in question (TIP 348) indeed crawl the linked list of frames in order to report the objv[] at each level (CALL elements of errorstack). A secondary task (which wasn't initially intended) is to detect [uplevel] shifts and report them (UP elements).

At some point during the development I had noticed that shunning the root frame helped avoiding artifacts at the end of the list; but dgp chimed in and offered a different logic, which is in HEAD now. Donald, maybe you have an insight on the present issue ?

Clearly, all this critically depends on the invariants of the xxxFramePtr graph, which I'd love to get my hands on :}
I asked Miguel for a bit of help.

coldstore added on 2010-05-25 20:15:53:
I spent approximately 4 hours today looking for ways to trap this bug under gdb and valgrind, and tracing the fault to these patches.

The problem is that in some circumstances the interp->framePtr->callerVarPtr chain is broken.

Is there a case in which a frame can be created containing an uninitialized callerVarPtr?  I certainly could not observe callerVarPtr being explicitly set to the value which I consistently observed to be the bogus pointer.

Given the context within which the segv occurred, I suggest that these cases may be associated with ::apply or with ::coroutine, but defer to Alexandre's specific knowledge of the patches and what they seek to achieve.

I avoided the crash by testing for the condition 'iPtr->varFramePtr != iPtr->rootFramePtr' in addition to the existing test 'iPtr->varFramePtr != iPtr->framePtr'.  I note that this condition normally terminates the function in which the crash occurs, except in the case that varFramePtr != rootFramePtr, which is (I believe incorrectly) interpreted by the code as meaning that there's an uplevel in play.  I do not know under which conditions both of these predicates might be true, nor the accurate and complete interpretation of varFramePtr != framePtr, but something smells wrong about the interpretation being placed upon it, and I suspect the assumption that it strongly implies [uplevel] is false in the case of either ::coroutine or ::apply, or both.

coldstore added on 2010-05-25 19:49:06:
I can reproduce it perfectly reliably.

The problem arises in tclNamesp.c:4958 where you have 'if (iPtr->varFramePtr != iPtr->framePtr', and then traverse a chain of links following frame->callerVarPtr when one of those links (the second, in this case) wasn't in fact a frame, but a string, and interestingly always the same string.

I note that when the crash was occurring iPtr->varFramePtr == iPtr->rootFramePtr

I also note that the crash is occurring in the top-most level of a coroutine.

Is it possible that coroutines' top level frame (that is, #1 within the coro) doesn't have its frame initialized to correctly point to rootFramePtr as its caller var frame?


It's also worth noting that the error is occurring in an ::apply which is running as the command of a coro, and both the ::apply namespace and the coro's namespace are not what you would sort of normally expect.  It's a pretty complex case.

ferrieux added on 2010-05-25 19:03:46:
Don't worry Jan, Colin did this because of an SF shortage. He sent me additional info by e-mail. We'll be bringing those data here now that SF seems to be back up.

nijtmans added on 2010-05-25 17:52:19:
Without an example script which causes this SIGSEGV, it's almost impossible to reproduce this. Sure, it's related to the TIP #348 Implementation.

coldstore added on 2010-05-25 10:29:20:

File Added - 375048: gdb

Attachments: