Tcl Source Code

View Ticket
Login
Ticket UUID: 3142026
Title: Stack fault due to off-by-one in GrowEvaluationStack
Type: Bug Version: obsolete: 8.6b1
Submitter: bharder Created on: 2010-12-22 21:28:27
Subsystem: 17. Commands I-L Assigned To: msofer
Priority: 7 High Severity:
Status: Closed Last Modified: 2011-04-12 09:31:22
Resolution: Fixed Closed By: msofer
    Closed on: 2011-04-12 02:31:22
Description:
lsort -integer violates memory allocation when Tcl is compiled w/ TCL_MEM_DEBUG. This violation was introduced between 20Aug2009 and 21Aug2009 in CVS repo.
User Comments: msofer added on 2011-04-12 09:31:22:

allow_comments - 1

msofer added on 2011-01-13 18:32:17:
Backported to 8.5

bharder added on 2011-01-06 13:09:31:
Data point: This appears solved on my machine (64bit NetBSD). My test program will now successfully run 99999 iterations to completion. Thx Miguel.

dkf added on 2011-01-01 22:17:13:
BTW, it's categorically impossible to trigger the bug this way with 8.5; the 8.5 [lsort] code uses ckalloc and not TclStackAlloc for its temporary memory.

dkf added on 2011-01-01 21:46:37:
Worked around that by using a slave interp. Test now committed (it fails in a 64-bit memdebug build). It's unfortunate that it still depends on some "luck", but I don't know of a better way to trigger it.

Leaving this open for the backport question (up to Miguel).

dkf added on 2010-12-31 23:55:12:
Difficult to test other than with exactly the script as provided. Ugh.

dkf added on 2010-12-31 17:50:36:
Verified that it is fixed. Will consider testing.

msofer added on 2010-12-31 06:11:41:
Fixed in HEAD, commit msg is:

* generic/tclExecute.c (GrowEvaluationStack): off-by-one error in
sizing the new allocation - was ok in comment but wrong in the 
code. Triggered by [Bug 3142026] which happened to require
exactly one more than what was in existence.

Leaving open as 2 things missing:
 1. verify if backport is needed
 2. devise a test that does not depend on sheer (un)luck

msofer added on 2010-12-31 01:38:54:
Note that this is a hi-guard failure when deleting an execution stack: the crash has possibly nothing to do with either lsort or stackAlloc, it looks like someone somewhere is writing zeros one byte past the end of the stack. The crash happens (much?) later when the stack is freed and the guard is checked.

hobbs added on 2010-12-30 21:56:10:
I actually tried to trigger it under valgrind, but couldn't.  Sorry that I didn't stress that both Brad and I used 64-bit before to test.  I'm on the road now, so I don't have the other platforms easily at hand to test more variations.  I thought it might be related to Jan's recent 64-bit alignment stuff, but apparently Brad did isolate it to this area.

dkf added on 2010-12-30 19:17:29:
Not in any config I have; my toolchain is somewhat obsolete now.

msofer added on 2010-12-30 18:01:30:
No 64bit platform handy to repro. Any chance to run under valgrind a -DPURIFY build? It should have symbols but no mem debug ...

dkf added on 2010-12-30 08:35:14:
Actual crash message:

hi guard byte 0 is 0x0   
hi guard byte 1 is 0x0   
hi guard byte 2 is 0x0   
hi guard byte 3 is 0x0   
hi guard byte 4 is 0x0   
hi guard byte 5 is 0x0   
hi guard byte 6 is 0x0   
hi guard byte 7 is 0x0   
total mallocs                  68005
total frees                    55127
current packets allocated      12878
current bytes allocated       645210
maximum packets allocated      12885
maximum bytes allocated       655885
high guard failed at 100247038, /Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c 910
32040 bytes allocated at (/Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c 1110)
Memory validation failure

dkf added on 2010-12-30 08:30:14:
(and students of stack traces will see that I have added [time] to it; it appears to be a by-stander in this)

dkf added on 2010-12-30 08:29:00:
On the basis of that, I suspect it's not a problem with [lsort] (which now has very conventional memory handling) but rather elsewhere; as such, the suggested patch is just papering over the real problem, not fixing it. Assigning to Miguel for comment.

dkf added on 2010-12-30 08:24:45:
I'm guessing the issue is the need to increase the size of Tcl stack space (which appears to have hit a 32k boundary).

dkf added on 2010-12-30 08:21:26:
fails on iteration 1334; stack trace follows:


#0  0x00007fff83b9ff16 in __kill ()
#1  0x00007fff83c10f6d in abort ()
#2  0x00000001000ffc9c in Tcl_PanicVA (format=0x100167bd8 "Memory validation failure", argList=0x7fff5fbfeab0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclPanic.c:118
#3  0x00000001000ffd7b in Tcl_Panic (format=0x100167bd8 "Memory validation failure") at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclPanic.c:147
#4  0x0000000100029d2f in ValidateMemory (memHeaderP=0x100247000, file=0x10016c260 "/Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c", line=910, nukeGuards=1) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclCkalloc.c:270
#5  0x000000010002a642 in Tcl_DbCkfree (ptr=0x100247038 "8?\"", file=0x10016c260 "/Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c", line=910) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclCkalloc.c:612
#6  0x000000010009fc79 in DeleteExecStack (esPtr=0x100247038) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c:910
#7  0x000000010009ff91 in GrowEvaluationStack (eePtr=0x1003067b8, growth=4002, move=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c:1092
#8  0x00000001000a01a6 in StackAllocWords (interp=0x100809e38, numWords=4002) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c:1179
#9  0x00000001000a03d2 in TclStackAlloc (interp=0x100809e38, numBytes=32016) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclExecute.c:1263
#10 0x0000000100039605 in Tcl_LsortObjCmd (clientData=0x0, interp=0x100809e38, objc=3, objv=0x10022d488) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclCmdIL.c:3907
#11 0x000000010001d5a2 in NRRunObjProc (data=0x100458390, interp=0x100809e38, result=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:4385
#12 0x000000010001d341 in TclNRRunCallbacks (interp=0x100809e38, result=0, rootPtr=0x1003b58a8) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:4332
#13 0x000000010001fab9 in TclEvalObjEx (interp=0x100809e38, objPtr=0x6, flags=0, invoker=0x0, word=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:5907
#14 0x000000010001fa5d in Tcl_EvalObjEx (interp=0x100809e38, objPtr=0x6, flags=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:5888
#15 0x0000000100041fa1 in Tcl_TimeObjCmd (dummy=0x0, interp=0x100809e38, objc=2, objv=0x10022d1b8) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclCmdMZ.c:4089
#16 0x000000010001d5a2 in NRRunObjProc (data=0x1003c1d80, interp=0x100809e38, result=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:4385
#17 0x000000010001d341 in TclNRRunCallbacks (interp=0x100809e38, result=0, rootPtr=0x0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:4332
#18 0x000000010001fab9 in TclEvalObjEx (interp=0x100809e38, objPtr=0x6, flags=131072, invoker=0x0, word=0) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:5907
#19 0x000000010001fa5d in Tcl_EvalObjEx (interp=0x100809e38, objPtr=0x6, flags=131072) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclBasic.c:5888
#20 0x00000001000c6ec9 in Tcl_RecordAndEvalObj (interp=0x100809e38, cmdPtr=0x1003a42c8, flags=131072) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclHistory.c:192
#21 0x00000001000f35bf in Tcl_MainEx (argc=-1, argv=0x7fff5fbff608, appInitProc=0x1000029f7 <Tcl_AppInit>, interp=0x100809e38) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclMain.c:518
#22 0x00000001000f3a39 in Tcl_Main (argc=1, argv=0x7fff5fbff600, appInitProc=0x1000029f7 <Tcl_AppInit>) at /Users/dkf/Documents/software/tcl8.6-commits/generic/tclMain.c:667
#23 0x00000001000029f0 in main (argc=1, argv=0x7fff5fbff600) at /Users/dkf/Documents/software/tcl8.6-commits/unix/tclAppInit.c:80

dkf added on 2010-12-30 08:18:14:
Finally managed to reproduce; it's a 64-bit only bug.

dkf added on 2010-12-29 19:22:25:

File Deleted - 396835:

bharder added on 2010-12-28 04:54:58:

File Added - 397016: patch-aa.v3

bharder added on 2010-12-28 04:53:55:
Tested, still failed. I've updated the patch that solves this particular ticket to work w/ your latest changes. (patch-aa.v3)

dkf added on 2010-12-27 21:14:49:
Could you retest with the HEAD? I fixed another bug that I found when studying what this could possibly be, so maybe that will resolve this too (as I greatly simplified the memory management logic in the process; the handling of -index was unnecessarily over-clever before).

dkf added on 2010-12-26 02:09:09:
I'd like to emphasize that I don't mean to be difficult. I suspect that there may be some weird problem in the code to handle the temporary memory handling in the lsort command. But I can't reproduce for some reason; I need the stack trace from the crash and can't make it myself. :-(

There's too many things that are trickier than they look in this code for me to hunt without at least a copy of a picture of the smoking gun available to me. :-/

dkf added on 2010-12-25 15:57:05:
I usually build with --enable-symbols=mem (so I can run memleak tests) and wasn't seeing it, and even with --enable-symbols=all in a clean build I still don't see it.

bharder added on 2010-12-24 02:47:45:
Hi Donal -- indeed, my first patch was poor... I've reviewed and submitted 'patch-aa.v2' which is much more specific in the code it targets (strictly [lsort]) and does pass the case that was previously faulting. This is _not_ to say that [lindex] etc won't be subject to this same class of bug, but I won't speculate with my patch.

bharder added on 2010-12-24 02:45:36:

File Added - 396835: patch-aa.v2

hobbs added on 2010-12-24 01:41:05:
Donal - you have to compile with --enable-symbols=all (IOW -DTCL_MEM_DEBUG) to ensure the crash.  Brad used NetBSD amd64 and I used SuSE Linux-64 to confirm.

dkf added on 2010-12-23 19:57:12:
BTW, your patch is definitely not minimal. It touches unrelated code (the bits relating to lines 1312 to 1346 are nothing to do with [lsort]).

dkf added on 2010-12-23 19:38:27:
I don't see any problem when building on OSX. Do you have a stack trace?

bharder added on 2010-12-23 15:22:02:
Can confirm patch applies cleanly against CVS co of 23Dec2010 as well, and that Tcl 8.6 build passes test.

bharder added on 2010-12-23 15:05:16:
see attached patch-aa for a minimal (absolute minimal?) patch against 14 Dec code that appears to solve this specific issue (as hinted by jeff hobbs).

bharder added on 2010-12-23 15:03:36:

File Added - 396784: patch-aa

hobbs added on 2010-12-23 04:29:49:
Possibly related to the Tcl_LsortObjCmd change from ckalloc to stackalloc?

bharder added on 2010-12-23 04:28:28:

File Added - 396766: lsortfail.tcl

Attachments: