Tcl Source Code

Ticket UUID: 3166410
Title: "out of stack space" on AIX
Type: Bug Version: obsolete: 8.5.9
Submitter: starwalker2000 Created on: 2011-01-27 07:39:43
Subsystem: 38. Init - Library - Autoload Assigned To: nijtmans
Priority: 5 Medium Severity:
Status: Closed Last Modified: 2011-03-08 05:31:04
Resolution: Fixed Closed By: nijtmans
    Closed on: 2011-03-07 22:31:04
Description:
I get an "out of stack space" error on AIX when running a 32-bit tclsh. The 64-bit tclsh does not show this error.
I traced the Tcl code and found the following problem:

[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
 1095               tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9/unix/../unix/tclUnixInit.c" ($t1)
 1097           } else {
(dbx) p tsdPtr->outerVarPtr
0x2ff22460
(dbx) p stackSize
2147450878
(dbx) print tsdPtr->stackBound
0xaff2a462

According to the output above, stackSize is very large, almost 2G, and tsdPtr->stackBound has overflowed. This causes the following check to return false:
#define CheckCStack(iPtr, localIntPtr) \
   ((localIntPtr) > (iPtr)->stackBound)
(dbx) p &localInt
0x2ff221a0
(dbx) p iPtr->stackBound
0xaff2a462
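
The overflow can be reproduced with plain 32-bit unsigned arithmetic. The following is only an illustrative sketch: the function names Bound32 and CheckCStack32 are hypothetical, and uint32_t stands in for a 32-bit pointer. The input values are the ones printed by dbx above.

```c
#include <stdint.h>

/* Hypothetical helper: models the stack-bound subtraction for a
 * downward-growing stack in a 32-bit address space. uint32_t stands
 * in for a 32-bit pointer, so the result wraps modulo 2^32. */
uint32_t Bound32(uint32_t outerVarAddr, uint32_t stackSize)
{
    return outerVarAddr - stackSize;
}

/* Hypothetical helper: models the CheckCStack comparison as an
 * unsigned compare of two 32-bit addresses. */
int CheckCStack32(uint32_t localAddr, uint32_t stackBound)
{
    return localAddr > stackBound;
}
```

Bound32(0x2ff22460, 2147450878) yields 0xaff2a462, the same value dbx shows for stackBound, and CheckCStack32(0x2ff221a0, 0xaff2a462) yields 0.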

For 64bits-tclsh, the situation is as follows:
[1] stopped in TclpGetCStackParams at line 1095 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
 1095               tsdPtr->stackBound = (int *) ((char *)tsdPtr->outerVarPtr -
(dbx) n
stopped in TclpGetCStackParams at line 1097 in file "/tellin/hjw/tcl8.5.9-64/unix/../unix/tclUnixInit.c" ($t1)
 1097           } else {
(dbx) print tsdPtr->outerVarPtr
0x0ffffffffffff140
(dbx) print stackSize
4294934528
(dbx) print tsdPtr->stackBound
0x0fffffff00007140

The stackSize is almost 4G, but tsdPtr->stackBound has not overflowed, so CheckCStack returns true.
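
The 64-bit numbers can be checked the same way. This is a sketch under the assumption that uint64_t models a 64-bit pointer; Bound64 and WouldWrap64 are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: stack-bound subtraction in a 64-bit address
 * space, with uint64_t standing in for a 64-bit pointer. */
uint64_t Bound64(uint64_t outerVarAddr, uint64_t stackSize)
{
    return outerVarAddr - stackSize;
}

/* Returns 1 if the subtraction above would wrap below address 0. */
int WouldWrap64(uint64_t outerVarAddr, uint64_t stackSize)
{
    return stackSize > outerVarAddr;
}
```

With outerVarPtr 0x0ffffffffffff140 and stackSize 4294934528, Bound64 gives 0x0fffffff00007140 and WouldWrap64 gives 0, which agrees with the dbx output.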
User Comments: nijtmans added on 2011-03-08 05:31:04:

Fixed on core-8-5-branch. Not applicable to trunk and 8.4

starwalker2000 added on 2011-02-09 11:43:37:
This patch (3166410.patch) works.

nijtmans added on 2011-01-31 14:52:30:

File Added - 400303: 3166410.patch

nijtmans added on 2011-01-31 14:52:08:

File Deleted - 400301:

nijtmans added on 2011-01-31 14:34:40:
Thanks! So how about the attached patch? No matter how big the stack space is, calculating
the border should never overflow! If it does, it means that we have already occupied part of the
stack, so the real stack size is lower. Here is a patch that tries to accomplish that.

Does this help?
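
The idea can be sketched roughly as follows. This is NOT the contents of 3166410.patch, only a guess at the general shape: SafeBound32 is a hypothetical name, uint32_t models a 32-bit address, and clamping to address 0 is an assumption made for the illustration.

```c
#include <stdint.h>

/* Hypothetical sketch, not the attached patch. If subtracting the
 * reported stack size would wrap below address 0, the usable stack
 * is smaller than reported, so clamp the bound instead of wrapping. */
uint32_t SafeBound32(uint32_t outerVarAddr, uint32_t stackSize)
{
    if (stackSize > outerVarAddr) {
        return 0;  /* assumption: clamp to the lowest address */
    }
    return outerVarAddr - stackSize;
}
```

With the reported values (0x2ff22460 and 2147450878) the sketch returns 0 rather than 0xaff2a462, so a check of the form "local address > bound" would succeed again.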

nijtmans added on 2011-01-31 14:32:32:

File Added - 400301: 3166410.patch

starwalker2000 added on 2011-01-31 08:47:34:
I think the stack size is correct. On AIX the stack size can be set in the file /etc/security/limits, where stack is set to -1, which means "unlimited".

The limits are as follows:
# ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        4194304
memory(kbytes)       unlimited
coredump(blocks)     2097151
nofiles(descriptors) 2000

For a 32-bit program, the stack size is nearly 2G. For a 64-bit program, the stack size is nearly 4G.

If I change the stack limit to a smaller number, for example 65536, which makes the stack size 32M, the 32-bit tclsh does not dump core, because there is no stack bound problem.
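
The limit a process actually sees can be read from C with the POSIX getrlimit() call. The sketch below is only an illustration; the function names GetSoftStackLimit and PrintSoftStackLimit are hypothetical. Note that getrlimit() reports bytes, while ulimit -a above reports kbytes.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: reads the soft stack limit (in bytes) via
 * POSIX getrlimit(). Returns 0 on success and -1 on failure. */
int GetSoftStackLimit(rlim_t *limitOut)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        return -1;
    }
    *limitOut = rl.rlim_cur;
    return 0;
}

/* Prints the soft stack limit, or "unlimited" for RLIM_INFINITY. */
void PrintSoftStackLimit(void)
{
    rlim_t limit;

    if (GetSoftStackLimit(&limit) != 0) {
        perror("getrlimit");
        return;
    }
    if (limit == RLIM_INFINITY) {
        printf("stack: unlimited\n");
    } else {
        printf("stack: %llu bytes\n", (unsigned long long) limit);
    }
}
```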

mistachkin added on 2011-01-31 03:26:28:
I'm trying to get access to an AIX box to help with this issue.  I have a few theories of my own I would like to test.

nijtmans added on 2011-01-31 03:08:44:
Yes, something is very strange here: A stack size of
2147450878 (0x7FFF7FFE), that's very big! So
maybe the stack size calculation is simply wrong
for AIX. Then that should be corrected instead of
making the code uglier... I'm hesitating.

starwalker2000 added on 2011-01-30 21:31:17:
I still think the problem is in the calculation of iPtr->stackBound.
Obviously, 0x2ff221a0 minus 2147450878 is a negative value for a 32-bit integer.

dkf added on 2011-01-28 22:55:12:
But don't make changes without testing on several platforms (minimally including a normal x86 Unix with gcc, Windows with MSVC, and AIX because it is known to have an issue).

dkf added on 2011-01-28 22:53:50:
Ugly's OK. It's conceptually ugly anyway.

nijtmans added on 2011-01-28 15:06:06:
My guess is that on AIX there is a bug in pointer comparison, such that all pointers above 2G are considered smaller than pointers below 2G. So whenever two pointers are compared, one below and the other above the 2G border, the result is not correct. I see two possible solutions to this:

- First subtracting the two pointers, which results in a ptrdiff_t,
  which is always signed. Then we can compare this to 0, and
  as long as no one has pushed more than 2G on the stack the
  result will be as expected. Well, 2G is an incredible
  amount; I don't think there is any machine with a total
  stack size as big as half the available memory.
- Another solution would be to cast the pointers to
  (size_t) before the comparison, so:

       ((size_t)(localIntPtr) > (size_t)(iPtr)->stackBound)

  Then we simply correct AIX's comparison 'bug', but it
  looks more ugly ;-)

I would prefer the first possibility, but someone might
try to convince me otherwise. Anyone?
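
For illustration, the two comparison styles can be modelled on 32-bit values. This is a sketch under stated assumptions: uint32_t stands in for a 32-bit pointer, the function names are hypothetical, and how the AIX compiler actually compares pointers is not established here.

```c
#include <stdint.h>

/* Direct comparison of two addresses as unsigned 32-bit values
 * (what a cast to a 32-bit size_t would give). */
int AboveBoundDirect(uint32_t localAddr, uint32_t bound)
{
    return localAddr > bound;
}

/* Difference-based comparison: subtract modulo 2^32 and treat the
 * result as a signed 32-bit number. A difference in the range
 * 1 .. 2^31-1 counts as positive. */
int AboveBoundByDifference(uint32_t localAddr, uint32_t bound)
{
    uint32_t diff = localAddr - bound;

    return diff != 0 && diff < UINT32_C(0x80000000);
}
```

With the values from this ticket (local address 0x2ff221a0, bound 0xaff2a462), AboveBoundDirect gives 0 and AboveBoundByDifference gives 1. For ordinary values that are less than 2G apart, both functions agree.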

starwalker2000 added on 2011-01-27 20:27:12:
These changes work.
After changing tclBasic.c:360 from:
((localIntPtr) > (iPtr)->stackBound)
to
(((localIntPtr) - (iPtr)->stackBound) > 0)

This makes ((localIntPtr) - (iPtr)->stackBound) a positive value, but I have no idea whether it will cause problems on other machines.
However, I still think the value of (iPtr)->stackBound is incorrect.

stopped in TclInterpReady at line 3474 in file "/tellin/hjw/tcl8.5.9/unix/../generic/tclBasic.c" ($t1)
 3474               && CheckCStack(iPtr, &localInt)) {
(dbx) print &localInt
0x2ff22150
(dbx) print iPtr->stackBound
0xaff2a462
(dbx) print &localInt - iPtr->stackBound
0x7fff7cee
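
The difference printed by dbx can be checked with 32-bit modular arithmetic. In this sketch uint32_t models a 32-bit address and the function names are hypothetical. The value dbx prints matches the byte difference of the two addresses.

```c
#include <stdint.h>

/* Hypothetical helper: byte difference of two 32-bit addresses,
 * computed modulo 2^32. */
uint32_t ByteDiff32(uint32_t a, uint32_t b)
{
    return a - b;
}

/* Returns 1 if v, read as a signed 32-bit number, is positive. */
int IsPositiveAsSigned32(uint32_t v)
{
    return v != 0 && v < UINT32_C(0x80000000);
}
```

ByteDiff32(0x2ff22150, 0xaff2a462) is 0x7fff7cee, which is positive when read as a signed 32-bit value, so the modified check succeeds.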

nijtmans added on 2011-01-27 18:06:28:
And - of course - the same changes in tclBasic.c as well

nijtmans added on 2011-01-27 18:03:42:
How about changing the lines 1071-1073:
    if (stackSize || (tsdPtr->stackBound &&
    ((stackGrowsDown && (&result < tsdPtr->stackBound)) ||
    (!stackGrowsDown && (&result > tsdPtr->stackBound))))) {
to:
    if (stackSize || (tsdPtr->stackBound &&
    ((stackGrowsDown && ((&result - tsdPtr->stackBound) < 0)) ||
    (!stackGrowsDown && ((&result - tsdPtr->stackBound) > 0))))) {

That should always work, even when the stackBound is near the 2G
boundary. It would fail when the stack grows to more than half the available
memory, but that seems highly unlikely.

Does that help?
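
As a standalone illustration of the proposed condition, here is a minimal sketch. PastBound is a hypothetical helper, not a function in tclUnixInit.c, and it assumes that converting the wrapped unsigned difference to intptr_t gives the usual two's-complement result (true for common compilers, but implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if cur lies beyond bound in the
 * direction the stack grows. The addresses are subtracted as
 * integers and the sign of the difference is tested, instead of
 * comparing the two pointers directly. */
int PastBound(int stackGrowsDown, const void *cur, const void *bound)
{
    /* Assumption: unsigned-to-signed conversion wraps as in
     * two's complement. */
    intptr_t diff = (intptr_t) ((uintptr_t) cur - (uintptr_t) bound);

    return stackGrowsDown ? (diff < 0) : (diff > 0);
}
```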
