Ticket UUID: | 1726873 | |||
Title: | Intermittent crash in threaded Tcl | |||
Type: | Bug | Version: | None | |
Submitter: | twylite | Created on: | 2007-05-28 12:34:06 | |
Subsystem: | 49. Threading | Assigned To: | vasiljevic | |
Priority: | 9 Immediate | Severity: | ||
Status: | Closed | Last Modified: | 2007-10-26 01:48:52 | |
Resolution: | Fixed | Closed By: | vasiljevic | |
Closed on: | 2007-10-25 18:48:52 | |||
Description: |
We have a multithreaded Tcl application running under Windows. One thread is a TCP server that dispatches calls to a pool of processing threads via thread::send -async(and they reply via thread::send) We have experienced an intermittent crash bug that occurs in ForgetSyncObject in generic/tclThread.c, causing an access violation exception. We a problem with the same symptoms on various occasions, and managed to get a debugger involved twice (on different systems, one under 2000 Pro and the other 2003 Server) to confirm that is was the same bug. By "intermittent" we mean that we can process up to 100 million transactions (that's at least 200m thread::send calls) before seeing the problem ;( On the up side we're not seeing any resource leaks ;) static void ForgetSyncObject( char *objPtr, /* Pointer to sync object */ SyncObjRecord *recPtr) /* Record of sync objects */ { int i; for (i=0 ; i<recPtr->num ; i++) { if (objPtr == recPtr->list[i]) { recPtr->list[i] = NULL; return; } } } With the debugger we see i go massively out of range. In on case by about 17,000; in another by around 100,000. Looking at code that modifies recPtr->num tells me that there should be no condition under which this can possibly happen (unless 32-bit operations are not atomic on x86, and last time I checked they were). That only guess we have at the moment is that parts of the stack (including i) are getting trashed by activity somewhere in another thread. We have only observed this bug in a multithreaded scenario. But if we set the value of i to (recPtr->num - 1) through the debugger and let the program run, everything is still fine (so the return value on the stack isn't damaged). Every time we've seen this there has been a call to ForgetSyncObj from Tcl_ConditionFinalize, which was called from ThreadSend (in threadCmd.c in the Thread extension). To make matters worse we don't know how to reproduce the problem, but we have seen it under both debug/unchecked and release builds of 8.5a6 (built under MSVC6). We need to try and track down this bug, and have resources to commit to doing so, but right now we're at a loss on how to proceed. Any suggestions on possible cause and/or how we may be able to force a reproduction may help. | |||
User Comments: |
vasiljevic added on 2007-10-26 01:48:52:
Logged In: YES user_id=95086 Originator: NO I commited the "conservative" fix. Protect Tcl_MutexFinalize and Tcl_ConditionFinalize to call ForgetSynObject w/o holding the master lock. Interestingly, the RememberSyncObject is always called with master lock held, so this fix is just filling-in some missing pieces that were obviously forgotten by somebody. I am closing this ticket. dgp added on 2007-10-26 00:11:54: Logged In: YES user_id=80530 Originator: NO I lack expertise on the issues involved, but if it's important that a fix for this be a part of Tcl 8.5b2 (as the prio-9 setting would suggest), then a commit needs to happen now. vasiljevic added on 2007-10-25 23:01:39: Logged In: YES user_id=95086 Originator: NO I have a conservative patch in my sandbox. The one that does not assume atomic operations on pointers. Everytime I assumed something in that area, it proved to be wrong in the field. The only way to get stable MT software is to strictly stick to locking rules: if something is shared, always lock it, regardless of what that is. Going this way will save you headaches down the road with the speed penalty that the next generation of CPU's in 6 month time will make obsolete. I can apply my conservative patches to core-8-4-branch and head, dgp added on 2007-10-25 22:34:50: Logged In: YES user_id=80530 Originator: NO status? twylite added on 2007-07-02 14:34:47: Logged In: YES user_id=91629 Originator: YES My original problem is an example of heavy/regular use of the teardown functions. ThreadSend in the Threads extension sets up and tears down a condition with each call, so if you have a lot of inter-thread communication there will be a performance hit. I don't have real-world figures in this case, but I do know that we had a server in C/C++ on Solaris that passed a message through a series of threads (each thread performed a different process/action on the message), and each additional thread had a huge impact on the server's latency on account of thread yields. The patch does rely on changes to pointer variables being atomic (I have yet to encounter a post-8-bit platform in which this is not the case?). With regard to reliability, our patch has been thoroughly tested. We have an intensive multithreaded test that can pick this bug up within 15 minutes (bear in mind that it is a crash bug that has been in the Tcl core for several years without anyone submitting a bug report) but hasn't managed to break the patch yet. vasiljevic added on 2007-06-30 20:12:55: Logged In: YES user_id=95086 Originator: NO Ah... I see all the problems... Yes the Tcl_FinalizeMutex/Condition are really lacking master locks. And the RememberSyncObj() needs to check empty slots before re-allocating new table. This all is true. What I have some problems with is the patch implementation as it assumes that ForgetSyncObj() can be called unprotected. I understand that this relies on the fact that updates of an char* are supposed to be atomic and thread-safe, right? I think that creation/teardown of sync primitivies is not (should not?) be a very frequent operation after all. Simply wrapping the ForgetSyncObj() in a TclpMasterLock/Unlock pair should fix the problem and IMHO not introduce any significant speed penalties overall. Yes, we may yield the thread at that point, correct, but then again, teardowns are rare and normally happen during finalization times, so no big deal. One may ask: why do we need that Remember/Forget thingy after all? I cannot answer that question now, but IMHO we don't. The Tcl finalization is screwed anyways when it comes to threads so by omitting the Remeber/Forget calls no real harm would be done. Still the code is there and somebody put it there on some purpose and I would not like to question that purpose here. We must go to core-list to straighten this out. Finally, I always prefer stability vs. performance. By properly locking the ForgetSyncObj() we may sacrifice some speed yet buy ourselves simpler and more reliable code (no assumptions about atomicity of integer/pointer operations). Unless somebody pulls out some very persuading reason why we should not do it this way, please speak now. A good example would be some real world metrics about application slowdown with master locks in place vs. unlocked case (as-is now). Cheers, Zoran vasiljevic added on 2007-06-30 20:07:31: Logged In: YES user_id=95086 Originator: NO Ah... I see all the problems... Yes the Tcl_FinalizeMutex/Condition are really lacking master locks. And the RememberSyncObj() needs to check empty slots before re-allocating new table. This all is true. What I have some problems with is the patch implementation as it assumes that ForgetSyncObj() can be called unprotected. I understand that this relies on the fact that updates of an char* are supposed to be atomic and thread-safe, right? I think that creation/teardown of sync primitivies is not (should not?) be a very frequent operation after all. Simply wrapping the ForgetSyncObj() in a TclpMasterLock/Unlock pair should fix the problem and IMHO not introduce any significant speed penalties overall. Yes, we may yield the thread at that point, correct, but then again, teardowns are rare and normally happen during finalization times, so no big deal. One may ask: why do we need that Remember/Forget thingy after all? I cannot answer that question now, but IMHO we don't. The Tcl finalization is screwed anyways when it comes to threads so by omitting the Remeber/Forget calls no real harm would be done. Still the code is there and somebody put it there on some purpose and I would not like to question that purpose here. We must go to core-list to straighten this out. Finally, I always prefer stability vs. performance. By properly locking the ForgetSyncObj() we may sacrifice some speed yet buy ourselves simpler and more reliable code (no assumptions about atomicity of integer/pointer operations). Unless somebody pulls out some very persuading reason why we should not do it this way, please speak now. A good example would be some real world metrics about application slowdown with master locks in place vs. unlocked case (as-is now). Cheers, Zoran vasiljevic added on 2007-06-19 22:40:24: Logged In: YES user_id=95086 Originator: NO Thanks for looking into this guys. I have a tough time into getting this done lately since I'm mostly on the road. But there is a light at the end of the tunnel.... I will soon get all this examined and checked in. Thanks again! [email protected] added on 2007-06-19 14:46:45: Logged In: NO Another look reveals that Tcl_MutexFinalize may suffer from the same serious flaw. twylite added on 2007-06-19 14:27:39: File Added - 233646: tcl-thread-syncobj-leak.patch Logged In: YES user_id=91629 Originator: YES *mutter* Knew I was forgetting something -- like submitting back our patch ;( I agree with mistachkin's analysis -- we managed to create a small app with a group of threads just sending messages to each other, and were able to reproduce this bug in anywhere from 1 to 15 minutes. It is a race condition in the Tcl_*Finalize functions resulting from the lack of locking. For interest, the window for the bug to occur is about two asm instructions long ;/ The solution we developed (see attached patch file) does not use the masterLock, but it is threadsafe. We have reworked how the SyncObjRecord lists are represented to get away from the need to free dynamically allocated memory (this is the root of the problem -- either you put a lock around all activity involving that memory, or you rework RememberSyncObj so that it never frees the memory). We specifically wanted to avoid using the masterLock, otherwise you have to lock/unlock a mutex (critical section, etc.) in order to release a Tcl mutex or condition. That in my experience has nasty implications for performance (in particular you may end up giving up your timeslice prematurely on some platforms). In the context of this patch the mini-patch provided by afredd doesn't make much sense -- there are no fields in SyncObjRecord that would really benefit from guarding. File Added: tcl-thread-syncobj-leak.patch mistachkin added on 2007-06-19 13:24:21: Logged In: YES user_id=113501 Originator: NO After a very careful examination of the code involved here it appears to me that Tcl_ConditionFinalize is potentially buggy (i.e. it calls ForgetSyncObject without any masterLock mutex protection). Also, I believe the mini-patch provided by 'afredd' in the previous comment should be applied to the core. afredd added on 2007-05-29 23:11:19: Logged In: YES user_id=1386588 Originator: NO Interesting problem. Just skimmed thro' tclthread.c and noticed that RememberSyncObject() does not check the list for free slots before alloc'ing a new list ie. it can grow unnecesarily and unbounded. Although you'd have hit a malloc failure long before the "recPtr->max += 8" wrapped... :-\ Anyway this patch at the top of that function might help: + /* Check for a free NULL slot in the list (if it has been allocated) */ + if (recPtr->list != NULL) { + for (i = 0; i < recPtr->num; ++i) { + if (recPtr->list[i] == NULL) { + recPtr->list[i] = objPtr; + return; + } + } + } + Also if you think that there may be a memory corruption, you could try something like (untested): typedef struct { + int guard1; /* memory corruption guard */ int num; /* Number of objects remembered */ + int guard2; /* memory corruption guard */ int max; /* Max size of the array */ + int guard3; /* memory corruption guard */ char **list; /* List of pointers */ + int guard4; /* memory corruption guard */ } SyncObjRecord; -static SyncObjRecord keyRecord = {0, 0, NULL}; -static SyncObjRecord mutexRecord = {0, 0, NULL}; -static SyncObjRecord condRecord = {0, 0, NULL}; +#define GX 0xbbbbbbbb /* the value in the guard fields */ + +static SyncObjRecord keyRecord = {GX, 0, GX, 0, GX, NULL, GX}; +static SyncObjRecord mutexRecord = {GX, 0, GX, 0, GX, NULL, GX}; +static SyncObjRecord condRecord = {GX, 0, GX, 0, GX, NULL, GX}; If the guard values have changed when the crash occurs you know something's scrawled over memory. Good luck! |
Attachments:
- tcl-thread-syncobj-leak.patch [download] added by twylite on 2007-06-19 14:27:39. [details]