Tcl Source Code

View Ticket
Login
Ticket UUID: 1726873
Title: Intermittent crash in threaded Tcl
Type: Bug Version: None
Submitter: twylite Created on: 2007-05-28 12:34:06
Subsystem: 49. Threading Assigned To: vasiljevic
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2007-10-26 01:48:52
Resolution: Fixed Closed By: vasiljevic
    Closed on: 2007-10-25 18:48:52
Description:
We have a multithreaded Tcl application running under Windows.  One thread is a TCP server that dispatches calls to a pool of processing threads via thread::send -async(and they reply via thread::send)

We have experienced an intermittent crash bug that occurs in ForgetSyncObject in generic/tclThread.c, causing an access violation exception.  We a problem with the same symptoms on various occasions, and managed to get a debugger involved twice (on different systems, one under 2000 Pro and the other 2003 Server) to confirm that is was the same bug.

By "intermittent" we mean that we can process up to 100 million transactions (that's at least 200m thread::send calls) before seeing the problem ;(  On the up side we're not seeing any resource leaks ;)

static void
ForgetSyncObject(
    char *objPtr, /* Pointer to sync object */
    SyncObjRecord *recPtr) /* Record of sync objects */
{
    int i;

    for (i=0 ; i<recPtr->num ; i++) {
if (objPtr == recPtr->list[i]) {
    recPtr->list[i] = NULL;
    return;
}
    }
}

With the debugger we see i go massively out of range.  In on case by about 17,000; in another by around 100,000.

Looking at code that modifies recPtr->num tells me that there should be no condition under which this can possibly happen (unless 32-bit operations are not atomic on x86, and last time I checked they were).

That only guess we have at the moment is that parts of the stack (including i) are getting trashed by activity somewhere in another thread.  We have only observed this bug in a multithreaded scenario.  But if we set the value of i to (recPtr->num - 1) through the debugger and let the program run, everything is still fine (so the return value on the stack isn't damaged).

Every time we've seen this there has been a call to ForgetSyncObj from Tcl_ConditionFinalize, which was called from ThreadSend (in threadCmd.c in the Thread extension).

To make matters worse we don't know how to reproduce the problem, but we have seen it under both debug/unchecked and release builds of 8.5a6 (built under MSVC6).

We need to try and track down this bug, and have resources to commit to doing so, but right now we're at a loss on how to proceed.  Any suggestions on possible cause and/or how we may be able to force a reproduction may help.
User Comments: vasiljevic added on 2007-10-26 01:48:52:
Logged In: YES 
user_id=95086
Originator: NO

I commited the "conservative" fix. Protect Tcl_MutexFinalize and
Tcl_ConditionFinalize to call ForgetSynObject w/o holding the 
master lock. Interestingly, the RememberSyncObject is always
called with master lock held, so this fix is just filling-in some
missing pieces that were obviously forgotten by somebody.

I am closing this ticket.

dgp added on 2007-10-26 00:11:54:
Logged In: YES 
user_id=80530
Originator: NO


I lack expertise on the issues
involved, but if it's important
that a fix for this be a part of
Tcl 8.5b2 (as the prio-9 setting
would suggest), then a commit
needs to happen now.

vasiljevic added on 2007-10-25 23:01:39:
Logged In: YES 
user_id=95086
Originator: NO

I have a conservative patch in my sandbox.
The one that does not assume atomic operations
on pointers. Everytime I assumed something in
that area, it proved to be wrong in the field.
The only way to get stable MT software is to strictly
stick to locking rules: if something is shared, always
lock it, regardless of what that is. Going this way
will save you headaches down the road with the
speed penalty that the next generation of CPU's
in 6 month time will make obsolete.

I can apply my conservative patches to core-8-4-branch
and head,

dgp added on 2007-10-25 22:34:50:
Logged In: YES 
user_id=80530
Originator: NO


status?

twylite added on 2007-07-02 14:34:47:
Logged In: YES 
user_id=91629
Originator: YES

My original problem is an example of heavy/regular use of the teardown functions.  ThreadSend in the Threads extension sets up and tears down a condition with each call, so if you have a lot of inter-thread communication there will be a performance hit.  

I don't have real-world figures in this case, but I do know that we had a server in C/C++ on Solaris that passed a message through a series of threads (each thread performed a different process/action on the message), and each additional thread had a huge impact on the server's latency on account of thread yields.

The patch does rely on changes to pointer variables being atomic (I have yet to encounter a post-8-bit platform in which this is not the case?).  

With regard to reliability, our patch has been thoroughly tested.  We have an intensive multithreaded test that can pick this bug up within 15 minutes (bear in mind that it is a crash bug that has been in the Tcl core for several years without anyone submitting a bug report) but hasn't managed to break the patch yet.

vasiljevic added on 2007-06-30 20:12:55:
Logged In: YES 
user_id=95086
Originator: NO

Ah... I see all the problems... 

Yes the Tcl_FinalizeMutex/Condition are really 
lacking master locks. And the RememberSyncObj()
needs to check empty slots before re-allocating
new table. This all is true.

What I have some problems with is the patch 
implementation as it assumes that ForgetSyncObj()
can be called unprotected. I understand that this
relies on the fact that updates of an char* are 
supposed to be atomic and thread-safe, right?

I think that creation/teardown of sync primitivies
is not (should not?) be a very frequent operation
after all. Simply wrapping the ForgetSyncObj() in
a TclpMasterLock/Unlock pair should fix the
problem and IMHO not introduce any significant
speed penalties overall.  Yes, we may yield the 
thread at that point, correct, but then again, 
teardowns  are rare and normally happen during 
finalization times, so no big deal.

One may ask: why do we need that 
Remember/Forget thingy after all? 
I cannot answer that question now, 
but IMHO we don't. The Tcl finalization 
is screwed anyways when it comes to threads
so by omitting the Remeber/Forget calls
no real harm would be done. Still the code is
there and somebody put it there on some purpose
and I would not like to question that purpose here.
We must go to core-list to straighten this out. 

Finally, I always prefer stability vs. performance. 
By properly locking the ForgetSyncObj() we may
sacrifice some speed yet buy ourselves simpler 
and more reliable code (no assumptions about 
atomicity of integer/pointer operations). 

Unless somebody pulls out some very persuading
reason why we should not do it this way, please 
speak now. A good example would be some real 
world metrics about application slowdown with 
master locks in place vs. unlocked case (as-is now).

Cheers,
Zoran

vasiljevic added on 2007-06-30 20:07:31:
Logged In: YES 
user_id=95086
Originator: NO

Ah... I see all the problems... 

Yes the Tcl_FinalizeMutex/Condition are really 
lacking master locks. And the RememberSyncObj()
needs to check empty slots before re-allocating
new table. This all is true.

What I have some problems with is the patch 
implementation as it assumes that ForgetSyncObj()
can be called unprotected. I understand that this
relies on the fact that updates of an char* are 
supposed to be atomic and thread-safe, right?

I think that creation/teardown of sync primitivies
is not (should not?) be a very frequent operation
after all. Simply wrapping the ForgetSyncObj() in
a TclpMasterLock/Unlock pair should fix the
problem and IMHO not introduce any significant
speed penalties overall.  Yes, we may yield the 
thread at that point, correct, but then again, 
teardowns  are rare and normally happen during 
finalization times, so no big deal.

One may ask: why do we need that 
Remember/Forget thingy after all? 
I cannot answer that question now, 
but IMHO we don't. The Tcl finalization 
is screwed anyways when it comes to threads
so by omitting the Remeber/Forget calls
no real harm would be done. Still the code is
there and somebody put it there on some purpose
and I would not like to question that purpose here.
We must go to core-list to straighten this out. 

Finally, I always prefer stability vs. performance. 
By properly locking the ForgetSyncObj() we may
sacrifice some speed yet buy ourselves simpler 
and more reliable code (no assumptions about 
atomicity of integer/pointer operations). 

Unless somebody pulls out some very persuading
reason why we should not do it this way, please 
speak now. A good example would be some real 
world metrics about application slowdown with 
master locks in place vs. unlocked case (as-is now).

Cheers,
Zoran

vasiljevic added on 2007-06-19 22:40:24:
Logged In: YES 
user_id=95086
Originator: NO

Thanks for looking into this guys. I have a tough time into getting this done
lately since I'm mostly on the road. But there is a light at the end of the tunnel....
I will soon get all this examined and checked in. Thanks again!

[email protected] added on 2007-06-19 14:46:45:
Logged In: NO 


Another look reveals that Tcl_MutexFinalize may suffer from the same serious flaw.

twylite added on 2007-06-19 14:27:39:

File Added - 233646: tcl-thread-syncobj-leak.patch

Logged In: YES 
user_id=91629
Originator: YES

*mutter* Knew I was forgetting something -- like submitting back our patch ;(

I agree with mistachkin's analysis -- we managed to create a small app with a group of threads just sending messages to each other, and were able to reproduce this bug in anywhere from 1 to 15 minutes.  It is a race condition in the Tcl_*Finalize functions resulting from the lack of locking.  For interest, the window for the bug to occur is about two asm instructions long ;/

The solution we developed (see attached patch file) does not use the masterLock, but it is threadsafe.  We have reworked how the SyncObjRecord lists are represented to get away from the need to free dynamically allocated memory (this is the root of the problem -- either you put a lock around all activity involving that memory, or you rework RememberSyncObj so that it never frees the memory).

We specifically wanted to avoid using the masterLock, otherwise you have to lock/unlock a mutex (critical section, etc.) in order to release a Tcl mutex or condition.  That in my experience has nasty implications for performance (in particular you may end up giving up your timeslice prematurely on some platforms).

In the context of this patch the mini-patch provided by afredd doesn't make much sense -- there are no fields in SyncObjRecord that would really benefit from guarding.

File Added: tcl-thread-syncobj-leak.patch

mistachkin added on 2007-06-19 13:24:21:
Logged In: YES 
user_id=113501
Originator: NO


After a very careful examination of the code involved here it appears to me that Tcl_ConditionFinalize is potentially buggy (i.e. it calls ForgetSyncObject without any masterLock mutex protection).

Also, I believe the mini-patch provided by 'afredd' in the previous comment should be applied to the core.

afredd added on 2007-05-29 23:11:19:
Logged In: YES 
user_id=1386588
Originator: NO


Interesting problem. Just skimmed thro' tclthread.c and
noticed that RememberSyncObject() does not check the list for
free slots before alloc'ing a new list ie. it can grow unnecesarily
and unbounded. Although you'd have hit a malloc failure
long before the "recPtr->max += 8" wrapped... :-\
Anyway this patch at the top of that function might help:

+    /* Check for a free NULL slot in the list (if it has been allocated) */
+    if (recPtr->list != NULL) {
+       for (i = 0; i < recPtr->num; ++i) {
+           if (recPtr->list[i] == NULL) {
+               recPtr->list[i] = objPtr;
+               return;
+           }
+       }
+    }
+


Also if you think that there may be a memory corruption,
you could try something like (untested):

 typedef struct {
+    int guard1;                /* memory corruption guard */
     int num;           /* Number of objects remembered */
+    int guard2;                /* memory corruption guard */
     int max;           /* Max size of the array */
+    int guard3;                /* memory corruption guard */
     char **list;       /* List of pointers */
+    int guard4;                /* memory corruption guard */
 } SyncObjRecord;

-static SyncObjRecord keyRecord = {0, 0, NULL};
-static SyncObjRecord mutexRecord = {0, 0, NULL};
-static SyncObjRecord condRecord = {0, 0, NULL};
+#define GX 0xbbbbbbbb /* the value in the guard fields */
+
+static SyncObjRecord keyRecord   = {GX, 0, GX, 0, GX, NULL, GX};
+static SyncObjRecord mutexRecord = {GX, 0, GX, 0, GX, NULL, GX};
+static SyncObjRecord condRecord  = {GX, 0, GX, 0, GX, NULL, GX};

If the guard values have changed when the crash occurs you
know something's scrawled over memory.

Good luck!

Attachments: