Tcl Source Code

View Ticket
Login
Ticket UUID: 3008307
Title: callerPtr chain does not traverse coro floor
Type: Bug Version: None
Submitter: coldstore Created on: 2010-05-28 03:11:56
Subsystem: 60. NRE and coroutines Assigned To: msofer
Priority: 8 Severity: Minor
Status: Closed Last Modified: 2014-10-10 19:40:34
Resolution: Fixed Closed By: ferrieux
    Closed on: 2014-10-10 19:40:34
Description:
The following code introduces stack corruption:

proc bottomp {y} {
    after 0 top 1
    yield
}

proc topp {x} {
    coroutine bottom bottomp 2
    after 0 bottom 3
    yield
}

coroutine top topp 1
set done 0
vwait done

As follows:

PUSH : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
TclGetNamespaceForQualName: PUSH : 0x8821058 0x881ae48
TclGetNamespaceForQualName: DELETING 0x8821058 POP : 0x881ae48
NamespaceEvalCmd: PUSH : 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x8821318 POP : 0x881ae48
NamespaceEvalCmd: PUSH : 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x8821318 POP : 0x881ae48
NamespaceEvalCmd: PUSH : 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x8821318 POP : 0x881ae48
TclProcCompileProc: PUSH : 0x8821318 0x881ae48
TclProcCompileProc: DELETING 0x8821318 POP : 0x881ae48
PushProcCallFrame: PUSH : 0x8821318 0x881ae48
NamespaceEvalCmd: PUSH : 0x88214d8 0x8821318 0x881ae48
NamespaceEvalCmd: PUSH : 0x8821580 0x88214d8 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x8821580 POP : 0x88214d8 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x88214d8 POP : 0x8821318 0x881ae48
NamespaceEvalCmd: PUSH : 0x8821540 0x8821318 0x881ae48
NsEval_Callback: DELETING 0x8821540 POP : 0x8821318 0x881ae48
InterpProcNR2: POP : 0x881ae48
InterpProcNR2: DELETING 0x8821318
TclNRCoroutineObjCmd save caller: 0x883def8
SAVE : 0x881ae48
TclProcCompileProc: PUSH : 0x8840470 0x881ae48
TclProcCompileProc: DELETING 0x8840470 POP : 0x881ae48
PushProcCallFrame: PUSH : 0x8840470 0x881ae48
TclNRCoroutineObjCmd save caller: 0x8848e90
SAVE : 0x8840470 0x881ae48
TclProcCompileProc: PUSH : 0x883fc70 0x8840470 0x881ae48
TclProcCompileProc: DELETING 0x883fc70 POP : 0x8840470 0x881ae48
PushProcCallFrame: PUSH : 0x883fc70 0x8840470 0x881ae48
NRCoroutineCallerCallback save running: 0x8848e90
SAVE : 0x883fc70 0x8840470 0x881ae48
NRCoroutineCallerCallback restore caller: 0x8848e90
RESTORE : 0x8840470 0x881ae48
NRCoroutineCallerCallback save running: 0x883def8
SAVE : 0x8840470 0x881ae48
NRCoroutineCallerCallback restore caller: 0x883def8
RESTORE : 0x881ae48
NRInterpCoroutine save caller: 0x883def8
SAVE : 0x881ae48
NRInterpCoroutine restore running: 0x883def8
RESTORE : 0x8840470 0x881ae48
InterpProcNR2: POP : 0x881ae48
InterpProcNR2: DELETING 0x8840470
NRCoroutineExitCallback restore caller 0x883def8
RESTORE : 0x881ae48
NRInterpCoroutine save caller: 0x8848e90
SAVE : 0x881ae48
NRInterpCoroutine restore running: 0x8848e90
RESTORE : 0x883fc70 0x8840470 0xfffffff9Segmentation fault
User Comments: ferrieux added on 2014-10-10 19:40:34:
Committed Colin's snippet as test coroutine-7.12, with comments.

msofer added on 2014-10-09 14:48:35:
AFAIU this is only missing tests, the bug is solved. Assigning to alex for a decision.

dgp added on 2014-10-09 14:04:39:
status?

msofer added on 2010-06-05 23:25:49:
Patch committed; rassigning to add tests

ferrieux added on 2010-06-03 14:57:28:
Note that I've now updated the TIP 348 code (commit of 20100603) so that it no longer depends on varFramePtr/callerVarPtr traversal (it now uses ->level offsets, which is safer and faster). 

I'm saying this because this makes the original crash of this ticket vanish.
Of course, Colin's latest snippet, which depends on TIP 280 instead, still crashes unless Miguel's patch is applied.

So it can be considered orthogonal (TIP 348 formerly being a mere observer), and all lights are green for Colin to commit.

msofer added on 2010-06-01 19:23:52:
The problem (to me) was:
  - there was no real evidence of a bug - ie, no misbehaving script in this ticket. Neither short nor long, neither provided nor mentioned
  - the only offered "evidence" is some C tracing code doing things that were NOT meant to be done (follow callerPtr accross coro boundaries); it took me a while to understand that this is what was happening
  - everything was working "as designed"; the junk in the stack was not corruption, just a field that was not updated because it was never meant to be used
  - when I finally realized what was happening, I decided to provide the means to follow callerPtr accross coro boundaries - to simplify debugging code, but also to fix another bug that I found while researching this. The problem was with coros that do not push a frame, eg 
   coroutine C eval $foo

dkf added on 2010-06-01 14:52:57:
Re "10 lines", the issue isn't whether or not it is a bug, but whether it can be reproduced and analyzed. Shorter is better there, especially if it leads to a test (where having it pinned down in 10 lines or less is *good*). The problem with long scripts is that it is highly likely that there will be something else buggering things up in an odd way; getting it short makes it clear that the bug really is understood and squelched.

ferrieux added on 2010-06-01 14:30:35:
Re: "I was unaware that Alexandre was spending time
reproducing the test harness, imperfectly."

1. I spent some time getting familiarity with the NRE and coro code, and 10 minutes producing the invariant checker. No feeling of wasted time, thanks for asking.

2. "Imperfection" is in the eye of the beholder, since the invariant also caught issues in the test suite, and can report that they are solved by Miguel's patch.

3. There's no need to be lecturing people about a culture of "if it doesn't fail within 10 lines...". Nobody feels this way around this table. It was obvious from the start that you had a crash. Turning hex printf's into a check is not an attempt to prove that' there is no bug, it is an attempt to catch it earlier.

4. I 'm also glad the bug's fixed, and I also acknowledge the accuracy of your analysis from the start. If only you could leave your Colt on your side.

coldstore added on 2010-06-01 08:10:22:

File Added - 375778: csegv3.tcl

coldstore added on 2010-06-01 07:48:44:
1. I haven't tested it, as I'm sure it'll work now we know what was going on.
2. the last 11 messages from here ended up in my Junk folder, mis-tagged as spam by the automation.
3. if I had known that the problem was that it couldn't be reproduced in 10 lines was the problem I'd have done it sooner.
4. in the wild, the bug symptoms occurred over 10000 stack ops away from its actual cause
5. I'd built a test harness *many* days ago to diagnose and provoke the bug,
6. I'm pretty sure I explained how the test harness provoked the bug (by assigning clearly spurious values to the next pointer in the deleted stack contents.)
7. nobody but me actually used the test harness I built.
8. because of 2, 6&7, I was unaware that Alexandre was spending time reproducing the test harness, imperfectly.
9. Just ... bleah.
10. I'm glad the bug's fixed.

My take-home is this:
1) A culture of 'if it doesn't fail within 10 lines, it's not a bug' is not conducive to good software.
2) the issue tracker emails don't always arrive.

ferrieux added on 2010-06-01 06:03:16:
OK, varFramePtr chaining was not a good idea in TIP 348. That's fixed now.

ferrieux added on 2010-06-01 03:52:03:

File Added - 375763: stackcheck3.patch

ferrieux added on 2010-06-01 03:51:01:
Thanks Miguel !
After updating the invariant to account for the legal reparenting (attached), I no longer any violation nor crash, neither in the test suite nor in Colin's script. Congrats !
Now observing the traversal of the coro boundary:

  - [info level -N] still stops at the depth of the coro part of the stack. That is expected since it uses the ->level field of CFs, which have not been modified by the reparenting.

  - TIP 348 works better, in that it shows the full depth (both caller and coro parts). However I'm seeing a tiny glitch in the computation of the uplevel offsets:

proc a {} b
proc b {} c
proc c {} {coroutine D d}
proc d {} e
proc e {} {
    #error E1
    yield
    error E2
}

proc a1 {} {D}

catch {
    a
    a1
}

puts [info errorstack]

==> with error E1 (before first yield, caller depth 3):
CALL e CALL d UP 4 CALL c CALL b CALL a

==> with error E2 (after first yield, resumer depth 1):
CALL e CALL d UP 4 CALL a1

Maybe something a bit strange with the varFramePtrs ? (the UP is computed by crawling callerPtr's until the varPtr is found)

msofer added on 2010-06-01 00:44:55:

File Added - 375724: coro.diff

msofer added on 2010-06-01 00:43:53:
Attaching a patch that apparently fixes the issue. I'd appreciate if you guys test and vet it before committing, maybe also add some tests (like colin's small bombing script). Alex: can you commit for me when you guys are satisfied? 

The patch insures that the iPtr->callerPtr can be followed accross coro boundaries.

Thx aplenty - for the hard debugging work, for (even harder!) convincing me that something is amiss, for completing the work from here.

msofer added on 2010-06-01 00:40:58:
Colin managed to get a script that bombs when run from the command line: http://paste.tclers.tk/2104

ferrieux added on 2010-05-31 06:14:22:
To elaborate, the just posted backtrace is obtained with a panic string saying that the callerPtr has been popped (and not even reallocated yet). This occurs in the RESTORE_CONTEXT of NRInterpCoroutine.

ferrieux added on 2010-05-31 05:37:56:

File Added - 375666: stackcheck2.patch

ferrieux added on 2010-05-31 05:36:38:
Good news: after adding a reset of the gencount (to -1) in Tcl_PopCallFrame, I now get a violation with Colin's latest short snippet (and also with coroutine-1.9).

This gives the wanted disambiguation, that the snippet actually show a bug, despite its failure to crash. Attaching the updated gencount patch. Here is the gdb bt from Colin's code:

#0  0x00271424 in __kernel_vsyscall ()
#1  0x003aaa81 in raise () from /lib/libc.so.6
#2  0x003ac34a in abort () from /lib/libc.so.6
#3  0x007ec29f in Tcl_PanicVA () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#4  0x007ec2c4 in Tcl_Panic () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#5  0x0074b2f8 in NRInterpCoroutine () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#6  0x0074a9f5 in NRRunObjProc () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#7  0x0074d42d in TclNRRunCallbacks () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#8  0x007b2998 in TclExecuteByteCode () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#9  0x0074d5df in NRCallTEBC () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#10 0x0074d42d in TclNRRunCallbacks () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#11 0x00752d73 in Tcl_EvalObjv () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#12 0x0075361e in TclEvalEx () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#13 0x0075396a in Tcl_EvalEx () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#14 0x007dc11f in Tcl_FSEvalFileEx () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#15 0x007e2b25 in Tcl_Main () from /home/alex/src/cvs/tcl/unix/libtcl8.6.so
#16 0x080486a1 in main ()
(

ferrieux added on 2010-05-31 05:26:25:
Sorry, typo, it's coroutine-1.9 as mentioned earlier.
Violation is, as also explained earlier and in the attached patch, a case where

   frame->callerGenCount != frame->callerPtr->genCount

That is, a case where the parent has been reallocated behind the child's back.

And no, so far I haven't seen a reproducible crash with small code snippets.

msofer added on 2010-05-31 04:57:27:
Please clarify "violation" and "coroutine-1.19" (no such test in coroutine.test). 

I would really love to have a clearly failing script where the failure does not require special code to see: a segfault, a wrong result. Do we have any?

ferrieux added on 2010-05-31 02:29:17:
Please note that I'm not getting any crash nor gencount-invariant violation (checked on RESTORE_CONTEXT only so far) with Colin's short scripts.

Also note that I consistently get a violation in the RewindCoroutine step of test coroutine-1.19, *and* that this doesn't happen when the same code is run outside tcltest. Admittedly this may not be the core of the bug, but do you have an idea where this difference comes from ?

msofer added on 2010-05-31 00:46:35:
Without yet looking at the tracing code, what SHOULD happen in that
simplest script in terms of CF and Tcl stack management is [1]:

1. interp is created with its main Tcl stack (execEnv0) and CF ptr rootFramePtr (frame0); both things are allocated on the heap. The script is evaluated in this "context"

2. When [boom] is invoked at the end of the script:
.    - a frame1 (allocated in execEnv0) is pushed: [boom]'s CF
.    - [boom]'s body is evaluated
.    - [cc] is invoked in frame1

3. When [cc] is invoked from [boom]'s body:
.    - a frame2 (allocated in execEnv0) is pushed: [cc]'s CF
.    - [cc]'s body is evaluated
.    - [coroutine] is invoked in frame2

4. When [coroutine] is invoked from [cc]'s body:
.    - the coroutineData (CD) struct is allocated on the heap, the command
[C] is created
.    - frame2 is saved in CD as the caller context; the running context is
set to NULL
.    - a new Tcl stack (execEnv1) is created for the coroutine; the
interp's execEnv is set to execEnv1 so that all new stack allocs will
happen here. The old execEnv0 is saved in CD->callerEEPtr
.    - the coroutine command "c" is evaluated AS A SCRIPT [2] in frame2: it
is first compiled, then the bytecodes are sent to TEBC
.    - TEBC sees that this is a coroutine's first run, and sets the
interp's CF context to the root frame0 (so that the coro runs in uplevel
#0)
.    - the command [c] is invoked in frame0

5. When [c] is invoked from [coroutine]'s C implementation:
.    - a frame3 (allocated in execEnv1) is pushed: [c]'s CF
.    - [c]'s body is evaluated in frame3
.    - [yield] is invoked in frame 3

6. When [yield] is invoked from [c]'s body:
.    - TEBC suspends execution of [c]'s bytecodes, saves its internal state
in execEnv1 and then resets the interp's execEnv to execEnv0
.    - NRCoroutineCallerCallback (the last callback in execEnv0's nre
stack) saves the coroutine's running CF "context" (frame3) in the CD
struct, then restores the interp's CF context to the caller's (frame2)
.    - the caller script ([cc]'s body) resumes evaluation; nothing left to
do, TEBC returns
.    - InterpProcNR2 pops frame2 from execEnv0's Tcl stack; interp is back
to (execEnv0, frame1)
.    - control is back to [boom]'s script: [C] is invoked in frame1

7. When [C] is invoked from [boom]'s body:
.    - the current context frame1 is saved in CD->caller
.    - CD->running (frame3) is set as the interp's current CF context
.    - execEnv1 is set as the current tcl stack, execEnv0 is saved in CD->callerEEPtr
.    - TEBC resumes execution of [c]' bytecodes: nothing left to do, it
returns
.    - InterpProcNR2 pops frame3 from execEnv1's Tcl stack; interp is now
at (execEnv1, rootFramePtr)

8. The return causes execEnv1 to wind down completely: [c] returns, the
first nre callback in execEnv1 (NRCoroutineExitCallback) is run:
.    - the [C] command is deleted
.    - execEnv1 is freed, interp's execEnv is set execEnv0
.    - the context is set to CD->caller (ie, frame1)

9. Control returns to execEnv0's nre stack: NRCoroutineCallerCallback
notices that the coro is dead, it frees the CD struct

10. TEBC continues execution of [boom]'s body
.    - nothing left to do, it returns
.    - InterpProcNR2 pops frame1 from execEnv0's Tcl stack and frees it
.    - frame0 is back to being the current CF

Step 7. is the crucial one: the CF stack is left in a "very difficult to
navigate" state, the expectation is that nobody should navigate it. My
guess is that the crash probably happens on error, when the Tip 348
mechanism does try to make it through the chasm. The coro mechanism does a
lot of the actual stack management through nre callbacks, saving data in the CD
struct; this is not reflected directly in the CF stack.

[1] the NRE callback stack plays an important role too, it is partially
omitted from this description to clarify the happenings in the CF and
execEnv subworlds

[2] technical detail: we need to evaluate it as a compiled script in order
to insure that TEBC is running; this is needed for proper management of
yet-another stack of structs that is maintained by TEBC. You don't (yet?)
want to know.

coldstore added on 2010-05-30 11:00:01:
Simpler case, demonstrating stack corruption with a single coro:

proc c {} {
    yield
}

proc cc {} {
    coroutine C c
}

proc boom {} {
    cc
    C
}

boom

coldstore added on 2010-05-30 10:55:16:
Tentative diagnosis:
TclNRCoroutineObjCmd calls TclNREvalObjEx with the command to compile, then evaluate.

The interp passed into TclNREvalObjEx contains the stack frame of CoroutineObjCmd's caller, which is acceptable insofar as the first invocation of that command should probably be able to pass errors back to the caller.

However, at some point (the first ::yield in the coro) this context, complete with the stack of the original caller, becomes the context of the running coro.

This isn't correct.  Once the coro hits its first ::yield, and therefore returns to its creator, it should be running in a global context - that is, the saved running context *must*not* contain the stack frame of the caller, even though it was passed in as the contruction stack.

If this isn't done, and if the caller and creator of a coroutine is itself a coroutine (and even, probably, if it's not) the stack appears in two contexts and can be destroyed independently in either.

I suspect this can be induced without two coroutines in place.

coldstore added on 2010-05-30 08:13:36:
0x86a7610 is top's coro.  0x86a7a90 is bottom's coro

0x869f470 is saved in bottom's caller when *running top - as we are creating,compiling, then running the bottom coro from the top coro.

0x869f470 is then saved in bottom's running after *yield in bottom, when we have constructed, compiled and run bottom to its first ::yield.

0x869f470 is then restored from bottom's caller in order to resume evaluation of top, which then takes us to *yield in top.

0x869f470 is then saved in top's running in order to process top's ::yield.

SO at this point we have 0x869f470 in both top and bottom's running context.

NRInterpCoroutine restore's top's running, to evaluate its [return].  top returns, destroying its stack (and most importantly 0x869f470)

top has been cleaned up now, but there's more to do in bottom, so ...

NRInterpCoroutine restores bottom's running, which still contains 0x869f470, which was deleted by top

So bottom is running with a stack containing an element deleted by the (now defunct) top.  And it will eventually crash.  As it does.

This is emphatically not a case of deleted content being reused.  It is a case of a stack element being duplicated by a coroutine being created within a running coroutine.

One of the assumptions tacitly made about coroutine creation must not hold true when the caller of ::coroutine is itself a coroutine.

If we could work out which assumption, we could presumably avoid the dependence on this false assumption.

coldstore added on 2010-05-30 07:54:18:
To make it clearer where the badness is occurring, I've modified the code.

proc bottomp {y} {
    puts stderr "*running bottom"
    after 0 top 1
    puts stderr "*yield in bottom"
    yield
    puts stderr "*return from bottom"
}

proc topp {x} {
    puts stderr "*running top"
    coroutine bottom bottomp 2
    after 0 bottom 3
    puts stderr "*yield in top"
    yield
    puts stderr "*return from top"
}

puts stderr "*starting"
coroutine top topp 1

set done 0
vwait done

And here is the output (edited for clarity):

*starting
TclNRCoroutineObjCmd save caller: 0x86a7610
SAVE : 0x8679e48
TclProcCompileProc: PUSH : 0x869f470 0x8679e48
TclProcCompileProc: DELETING 0x869f470 POP : 0x8679e48
PushProcCallFrame: PUSH : 0x869f470 0x8679e48
*running top
TclNRCoroutineObjCmd save caller: 0x86a7a90
SAVE : 0x869f470 0x8679e48
TclProcCompileProc: PUSH : 0x869ec70 0x869f470 0x8679e48
TclProcCompileProc: DELETING 0x869ec70 POP : 0x869f470 0x8679e48
PushProcCallFrame: PUSH : 0x869ec70 0x869f470 0x8679e48
*running bottom
*yield in bottom
NRCoroutineCallerCallback save running: 0x86a7a90
SAVE : 0x869ec70 0x869f470 0x8679e48
NRCoroutineCallerCallback restore caller: 0x86a7a90
RESTORE : 0x869f470 0x8679e48
*yield in top
NRCoroutineCallerCallback save running: 0x86a7610
SAVE : 0x869f470 0x8679e48
NRCoroutineCallerCallback restore caller: 0x86a7610
RESTORE : 0x8679e48
NRInterpCoroutine save caller: 0x86a7610
SAVE : 0x8679e48
NRInterpCoroutine restore running: 0x86a7610
RESTORE : 0x869f470 0x8679e48
*return from top
InterpProcNR2: POP : 0x8679e48
InterpProcNR2: DELETING 0x869f470
NRCoroutineExitCallback restore caller 0x86a7610
RESTORE : 0x8679e48
NRInterpCoroutine save caller: 0x86a7a90
SAVE : 0x8679e48
NRInterpCoroutine restore running: 0x86a7a90
RESTORE : 0x869ec70 0x869f470 0xfffffff9Segmentation fault

The badness here is around stack element 0x869f470

I

coldstore added on 2010-05-30 07:27:23:
Alexandre, you can't reproduce it with the code above?

I'd like to point out precisely where the above narrative shows badness:  0x8840470

On creation of the coroutine:

TclNRCoroutineObjCmd save caller: 0x8848e90
SAVE : 0x8840470 0x881ae48

Then a few operations later ...

NRCoroutineCallerCallback save running: 0x8848e90
SAVE : 0x883fc70 0x8840470 0x881ae48
NRCoroutineCallerCallback restore caller: 0x8848e90
RESTORE : 0x8840470 0x881ae48

We can see that 0x8840470 is saved in both the caller context and the running coro's context (with no intervening deletion.)

Eventually 0x8840470 will be popped, and hence deleted, by one of the execution threads, and the stored version will be invalid.

ferrieux added on 2010-05-30 06:31:31:
While I'm still failing to reproduce outside tcltest, I also attach the gdb bt of the broken invariant:

(gdb) where
#0  0x00e8e424 in __kernel_vsyscall ()
#1  0x003aaa81 in raise () from /lib/libc.so.6
#2  0x003ac34a in abort () from /lib/libc.so.6
#3  0x0063940d in Tcl_PanicVA (format=0x693c58 "Insane Stack : frame=%p  gc=%d cgc=%d c->gc=%d !!!", argList=0xbfbfaae4 "\377\377\377\377\304\317\067") at /home/alex/src/cvs/tcl/generic/tclPanic.c:93
#4  0x00639437 in Tcl_Panic (format=0x693c58 "Insane Stack : frame=%p  gc=%d cgc=%d c->gc=%d !!!") at /home/alex/src/cvs/tcl/generic/tclPanic.c:122
#5  0x00576775 in CheckStackSanity (iniframe=0x89cb6c8) at /home/alex/src/cvs/tcl/generic/tclBasic.c:412
#6  0x00581d52 in NRInterpCoroutine (clientData=0x89ff780, interp=0x8949770, objc=0, objv=0x0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:8811
#7  0x005815f3 in RewindCoroutine (corPtr=0x89ff780, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:8643
#8  0x00581649 in DeleteCoroutine (clientData=0x89ff780) at /home/alex/src/cvs/tcl/generic/tclBasic.c:8655
#9  0x0057998c in Tcl_DeleteCommandFromToken (interp=0x8949770, cmd=0x89ff800) at /home/alex/src/cvs/tcl/generic/tclBasic.c:3016
#10 0x00578c49 in Tcl_CreateObjCommand (interp=0x8949770, cmdName=0xbfbfacf8 "foo", proc=0, clientData=0x89ff680, deleteProc=0x5815f9 <DeleteCoroutine>) at /home/alex/src/cvs/tcl/generic/tclBasic.c:2211
#11 0x005806d6 in Tcl_NRCreateCommand (interp=0x8949770, cmdName=0xbfbfacf8 "foo", proc=0, nreProc=0x581b55 <NRInterpCoroutine>, clientData=0x89ff680, deleteProc=0x5815f9 <DeleteCoroutine>) at /home/alex/src/cvs/tcl/generic/tclBasic.c:8225
#12 0x00582230 in TclNRCoroutineObjCmd (dummy=0x0, interp=0x8949770, objc=4, objv=0x8950df4) at /home/alex/src/cvs/tcl/generic/tclBasic.c:8895
#13 0x0057b5c0 in NRRunObjProc (data=0x8a170bc, interp=0x8949770, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4384
#14 0x0057b384 in TclNRRunCallbacks (interp=0x8949770, result=0, rootPtr=0x894f650, tebcCall=1) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4331
#15 0x005eeaef in TclExecuteByteCode (interp=0x8949770, codePtr=0x89bc1b0) at /home/alex/src/cvs/tcl/generic/tclExecute.c:2822
#16 0x0057b643 in NRCallTEBC (data=0x8a19084, interp=0x8949770, result=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4407
#17 0x0057b384 in TclNRRunCallbacks (interp=0x8949770, result=0, rootPtr=0x0, tebcCall=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4331
#18 0x0057ad39 in Tcl_EvalObjv (interp=0x8949770, objc=11, objv=0x89509a0, flags=2097152) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4103
#19 0x0057cea0 in TclEvalEx (interp=0x8949770, script=0x8993118 "# Commands covered:  coroutine, yield, [info coroutine]\n#\n# This file contains a collection of tests for experimental commands that are\n# found in ::tcl::unsupported. The tests will migrate to normal "..., numBytes=13132, flags=0, line=191, clNextOuter=0x0, outerScript=0x8993118 "# Commands covered:  coroutine, yield, [info coroutine]\n#\n# This file contains a collection of tests for experimental commands that are\n# found in ::tcl::unsupported. The tests will migrate to normal "...) at /home/alex/src/cvs/tcl/generic/tclBasic.c:5273
#20 0x0057c4b3 in Tcl_EvalEx (interp=0x8949770, script=0x8993118 "# Commands covered:  coroutine, yield, [info coroutine]\n#\n# This file contains a collection of tests for experimental commands that are\n# found in ::tcl::unsupported. The tests will migrate to normal "..., numBytes=13132, flags=0) at /home/alex/src/cvs/tcl/generic/tclBasic.c:4929
#21 0x00625e61 in Tcl_FSEvalFileEx (interp=0x8949770, pathPtr=0x894f020, encodingName=0x0) at /home/alex/src/cvs/tcl/generic/tclIOUtil.c:1753
#22 0x0062e7fc in Tcl_Main (argc=-1, argv=0xbfbfb864, appInitProc=0x8048df2 <Tcl_AppInit>) at /home/alex/src/cvs/tcl/generic/tclMain.c:355
#23 0x08048de5 in main (argc=36, argv=0xbfbfb7d4) at /home/alex/src/cvs/tcl/unix/tclAppInit.c:85

ferrieux added on 2010-05-30 05:56:12:

File Added - 375591: stackcheck.patch

ferrieux added on 2010-05-30 05:54:58:
Doing what I've just described, I get the invariant to break in the test suite in coroutine-1.9. Have yet to analyze (and reproduce outside tcltest), but attaching the patch anyway so that someone can home in faster on the beast.

ferrieux added on 2010-05-30 05:02:21:
To be clearer:
 - record the current gencount on each stack element allocation
 - also record its caller's gencount alongside the pointer
 - we should have the invariant:
      framePtr->callerGenCount==framePtr->callerPtr->genCount;

If anything plays dirty with the stack, this invariant will break.
I'd try to automate the check; hardest part being to find a proper spot to do it.

coldstore added on 2010-05-29 19:07:27:
I see.  If the legally freed stack allocation has been placed on the stack, then it has been placed on the stack with the pointer to the next element in the stack uninitialized from the value it was assigned immediately before it was freed.

So even if this were the case, it simply means there's a different bug - specifically that something is being inserted onto the stack (without passing through PUSH) without its next pointer initialized.

Generation counting would, in either case enumerated, avail us nothing.

ferrieux added on 2010-05-29 17:10:37:
No, I did not say refcounting, but generation counting. That is, a kind of timestamp that is incremented on each CF allocation, and stored in one new field of them. So that, when in your debugging printfs you display a frame pointer (and associated gen count), you can tell whether you are looking at a stale frame or at a new one that just happens to reuse the same address (legally freed) as suggested by Miguel.

coldstore added on 2010-05-29 07:26:36:
Given that we believe stacks should never share elements, but we have evidence that they are, all refcounting would do at this stage is convert a segv into a leak with the added disutility of exposing frames which should not be visible to uplevel.

ferrieux added on 2010-05-29 04:01:58:
Then maybe adding a generation count field could tell them apart;

coldstore added on 2010-05-28 19:49:25:

File Added - 375457: framedebug.patch

msofer added on 2010-05-28 18:08:44:
Note that recreating 'bottom' at each go may well create some confusing ouput here: the callFrame for bottomp's body (and bottom's base CF) will be popped when the old version is rewinded, then the new CFs will be created at the same address - ie, at the precise same spot in Tcl's execution stack.

Depending on the details of the instrumentation, this could appear to be 'reusing a popped CF'.

msofer added on 2010-05-28 17:58:27:
Note that 'topp' recreates the 'bottom' coroutine at each go. What will be happening here is:

1. 'top' starts, creates 'bottom', sets 'bottom' to be called from the evloop and yields

2. 'bottom starts, sets 'top' to be called from the evloop and yields

3. 'top' starts, creates a NEW 'bottom (thus killing the suspended 'bottom'), sets 'bottom' to be called from the evloop and yields

4. goto 2

msofer added on 2010-05-28 17:51:58:
Could you please post the instrumenting code that produces that output?

Attachments: