Tcl Source Code

Ticket UUID: 1437595
Title: WinSock cores/hangs on thread or app exit
Type: Bug Version: obsolete: 8.4.12
Submitter: vasiljevic Created on: 2006-02-23 18:19:44
Subsystem: 27. Channel Types Assigned To: vasiljevic
Priority: 5 Medium Severity:
Status: Open Last Modified: 2010-01-12 22:09:17
Resolution: Fixed Closed By:
    Closed on:
Description:
There is a sequencing problem in the way the
win socket/pipe subsystem is torn down. The generic
code which handles finalization invokes the thread
exit handlers first, then finalizes the IO subsystem.
Unfortunately, the finalization of the IO subsystem
may access data which has already been garbage-collected
by the thread exit handler. In such cases, the application
will either hang or core.

The attached patch works around this by moving the
functionality of the thread-exit handler into a
special TclpFinalizeSockets() call. This call is now
part of the generic TclFinalizeIOSubsystem.

This change is a no-op for Unix. It affects the Windows
(win/) and old Mac OS 9 (mac/) code bases.
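
As a rough illustration of the ordering change (a sketch only, not the
patch itself; RunThreadExitHandlers() below is a hypothetical stand-in for
the generic exit-handler loop, and the prototypes are simplified):

    /* Simplified prototypes for the routines named in this ticket. */
    extern void TclFinalizeIOSubsystem(void);
    extern void TclpFinalizeSockets(void);
    extern void RunThreadExitHandlers(void);    /* hypothetical stand-in */

    /*
     * Old order: the per-thread exit handler destroys the socket state,
     * then the IO finalization may still touch that freed state.
     */
    static void
    OldFinalizeOrder(void)
    {
        RunThreadExitHandlers();      /* socket TSD torn down here */
        TclFinalizeIOSubsystem();     /* still closes leftover socket channels */
    }

    /*
     * New order: the socket teardown is performed from within the IO
     * finalization itself, after the remaining channels have been closed.
     */
    static void
    NewFinalizeOrder(void)
    {
        TclFinalizeIOSubsystem();     /* ends by calling TclpFinalizeSockets() */
        RunThreadExitHandlers();      /* no longer tears down the sockets */
    }
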
User Comments: ferrieux added on 2010-01-12 22:09:17:
Ping. Please verify that it is now working in 8.[456] HEADs.

ferrieux added on 2009-06-16 05:17:35:
I'd bet this has been fixed by the introduction of late exit handlers and the associated demotion of WSACleanup. Zoran can you confirm ?

vasiljevic added on 2007-09-26 21:58:34:
Logged In: YES 
user_id=95086
Originator: YES

Thanks for your detailed report. I know about this; we fixed
our internal copy some time ago and have been field-testing it
for a while before I commit the change. The trouble with that
WinSock code is that it is pretty "old" and clumsy. What you've
found is just a small subset of the problems....

[email protected] added on 2007-09-26 21:54:21:
Logged In: NO 

I believe there is a defect on socket shutdown (Windows)

In TclpFinalizeSockets() a SOCKET_TERMINATE message is posted to the socket's window, followed by a wait for readyEvent to be signaled (which the wndProc signals when its event loop exits, after receiving its destroy message which is triggered by the SOCKET_TERMINATE message).  

Since readyEvent is also signaled when data arrives on the socket, the wait in TclpFinalizeSockets can return before the socket window has been destroyed if there is activity on the socket at that time.

This can cause a number of deleterious side effects, such as the call to UnregisterClass() failing (due to a window still existing) - and then a failure to initialize sockets due to RegisterClassA failing as the window class already exists - in a scenario where Tcl is unloaded and re-loaded from within the same process instance.

This issue was resolved by using a separate event object solely for tracking the socket window creation and destruction.

A related issue in InitSockets is that sockets are marked as initialized even if one of the initialization steps fails (e.g. RegisterClassA fails). Subsequent calls to InitSockets do not execute the code to set up the Winsock DLL, but do execute the per-thread initialization instructions.
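
A minimal sketch of the separate-event approach described above; the
structure layout and the SOCKET_TERMINATE value here are illustrative
stand-ins, not the actual tclWinSock.c declarations:

    #include <windows.h>

    typedef struct ThreadSpecificData {
        HWND   hwnd;           /* hidden socket notification window */
        HANDLE readyEvent;     /* signaled on socket activity (and, before the
                                * fix, reused for window creation/destruction) */
        HANDLE windowEvent;    /* dedicated event, signaled only when the
                                * window has been created or destroyed */
        HANDLE socketThread;   /* helper thread that owns the window */
    } ThreadSpecificData;

    #define SOCKET_TERMINATE (WM_USER + 1)    /* illustrative message id */

    static void
    FinalizeSocketsSketch(ThreadSpecificData *tsdPtr)
    {
        if (tsdPtr->hwnd != NULL) {
            PostMessage(tsdPtr->hwnd, SOCKET_TERMINATE, 0, 0);
            /*
             * Wait on the dedicated window event, so a data notification
             * arriving on readyEvent cannot wake us before the window is
             * really destroyed and UnregisterClass() can succeed.
             */
            WaitForSingleObject(tsdPtr->windowEvent, INFINITE);
            tsdPtr->hwnd = NULL;
        }
    }
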

vasiljevic added on 2006-09-29 18:41:32:
Logged In: YES 
user_id=95086

Most interesting... I will look this over again. After the change
we never had any problems in that area of the winsock driver.
There are others (I will still have to fix a couple of issues
sitting in the code) and I will use this time to review what
you report.

Thanks for the notice.

dunkfan added on 2006-09-29 18:28:08:
Logged In: YES 
user_id=1183601

FWIW I had a hanging problem with 8.4.13 XPSP2 after 
calling exit. 

Tracked it down to the WaitForSingleObject(tsdPtr->readyEvent, INFINITE)
call in the TclpFinalizeSockets() code in tclWinSock.c...

Then I included the code which used to be in there (from 
8.4.6) to check whether the thread was STILL_ACTIVE before 
jumping into the Wait call.

DWORD exitCode;

GetExitCodeThread(tsdPtr->socketThread, &exitCode);
if (exitCode == STILL_ACTIVE) {
    if (tsdPtr->hwnd != NULL) {
        PostMessage(tsdPtr->hwnd, SOCKET_TERMINATE, 0, 0);
        /*
         * Wait for the thread to exit. This ensures that we are
         * completely cleaned up before we leave this function.
         */
        WaitForSingleObject(tsdPtr->readyEvent, INFINITE);
        tsdPtr->hwnd = NULL;
    }
}

That seems to cure the problem...

vasiljevic added on 2006-03-20 15:11:43:
Logged In: YES 
user_id=95086

It does, but this is not enough. The problem is that there is no such
hook per thread. If I Tcl_CreateThread(), then none of the Tcl things get
initialized there. I need to Tcl_CreateInterp in that thread in order to
pull up the init sequence, and this is not always needed nor feasible.
In my practical experience, there are quite a few places where I used
Tcl_CreateThread from the Tcl API to create another thread which does not
necessarily have any Tcl interp attached.

I still think that, as it is now, it is completely sufficient.
David's idea about scrapping TclpHasSockets is also worth
considering. I can't imagine any modern platform today w/o
socket support.

dgp added on 2006-03-19 08:04:59:
Logged In: YES 
user_id=80530


Just on the point of a hook
for initialization.  Seems to
me that TclInitSubsystems is
the base for that.

davygrvy added on 2006-03-19 04:14:42:
Logged In: YES 
user_id=7549

kbk brought up the idea to stop the dynamic loading of
wsock32.dll and just import-link to it.  The need for
dynamic linking dates back to the early Win95 days when
tcp/ip (DUN) wasn't always installed.

TclpHasSockets() can possibly disappear if we want to do it,
though we still have to call WSAStartup().

I'm not sure if that helps us, just makes InitSockets() smaller.

vasiljevic added on 2006-03-19 00:40:31:
Logged In: YES 
user_id=95086

Damn. 

The more I dig into that, the more dirt comes out...

My humble conclusion is that the entire init/teardown of
Tcl is a patchwork. It is not possible to change A without
affecting Z :-(

As I see it, the socket subsystem is designed to be loaded lazily.
That means the first thing that pulls in any socket-related functionality,
be it the [socket] command or any exported C-API call, will
eventually load and initialize the entire subsystem.
Per se this is not a bad idea, as it conserves resources.
But it creates problems if you need to have a call like TclpHasSockets.

TclpHasSockets is one of the callers of InitSockets(), so if InitSockets() blocks,
the caller will block as well. If you divorce those two, then you need to
explicitly load the socket subsystem, which is what I wanted to do.
But then you end up in a situation where you need to do this before
doing *anything* else with the C or Tcl API. Unfortunately there is no
such hook in the API.

So I will have to think a little bit more.

The current situation is: TclpHasSockets() will block until InitSockets() has
created the complete environment, *including* the helper socket thread
and the associated window. I personally do not see any problem in
such behaviour. Yet, I will make another attempt, but the future does not
look bright... unless we rewrite the entire per-thread initialization
stuff, which I would not like to do at this time.

davygrvy added on 2006-03-17 15:44:03:
Logged In: YES 
user_id=7549

Sounds great.

dkf added on 2006-03-16 16:30:04:
Logged In: YES 
user_id=79902

Sounds like a good plan to me (speaking from the sidelines).

vasiljevic added on 2006-03-16 16:19:48:
Logged In: YES 
user_id=95086

Here is how I would solve that:

I would introduce a new platform-specific call: TclpInitializeSockets().
This should be symmetric to TclpFinalizeSockets() and should be called
from within TclInitSubsystems() in the generic layer.
This call would do exactly what InitSockets() was doing in
tclWinSock.c. Actually, I'd just rename InitSockets() -> TclpInitializeSockets()
and have it called from the generic part explicitly.

OTOH, TclpHasSockets() will just examine winsock.hModule != NULL,
hence it is non-blocking. It will not attempt to lazy-load the entire socket
subsystem as it does now. So that is one lock source less, which is a nice
side-effect.

This way the entire socket subsystem is initialized *once* per thread creation.
If any call to it breaks at some place AFTER (initially) loading the socket DLL,
then we punt (Tcl_Panic).

What do you think?
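
A minimal sketch of the non-blocking TclpHasSockets() proposed above,
assuming the loaded-module handle lives in a file-static winsock structure
as the comment suggests; this is not the committed code, and the error
message the real routine would leave in the interp is omitted:

    #include <windows.h>
    #include <tcl.h>

    static struct {
        HINSTANCE hModule;     /* non-NULL once wsock32.dll has been loaded */
    } winsock;

    /*
     * Report whether the socket DLL is available, without lazily pulling
     * up the whole subsystem (no helper thread, no window, no locking).
     */
    int
    TclpHasSockets(Tcl_Interp *interp)
    {
        (void) interp;
        return (winsock.hModule != NULL) ? TCL_OK : TCL_ERROR;
    }
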

vasiljevic added on 2006-03-16 03:18:05:
Logged In: YES 
user_id=95086

Correct. Once the DLL is loaded OK and all needed functions were found,
we *should* have a working socket subsystem.
If at that point ANY of the things fail to initialize, then this is a sign to stop.
It is better to stop than to go on halfway, IMO.

TclpHasSockets() should not break and should really just test
winsock.hModule != NULL and return TCL_OK in that case.

I guess the only question now is how to get all this done right.
The tricky part is InitSockets() and TclpHasSockets(). 
These two must somehow divorce. I'm not yet sure how but
I will give it a thought today/tomorrow.

I would Tcl_Panic in all cases when something goes wrong in InitSockets()
if and only if all functions were loaded OK from the DLL.

davygrvy added on 2006-03-16 01:57:35:
Logged In: YES 
user_id=7549

I hadn't thought about a hard abort before, but it wouldn't
be wrong in this case over the punt.  The same rule can apply to
the preceding check for a thread handle and even the
following window handle check, too.

We can't allow TclpHasSockets() to cause a hard abort
though.  So it might be a bit of a dance to get right.

vasiljevic added on 2006-03-15 05:25:25:
Logged In: YES 
user_id=95086

This does not sound very deterministic to me. I would never
trust such a system. Why 20 secs? Why not 10 or 100 or 1000?

I do not believe this is the right way to go.

I guess if the thread needs that long a time (20 secs in this case is
almost forever) to start up, then something is wrong with the
computer and there is no reason to continue.
If thread A wants/needs to start thread B, then it must
wait forever if needed. Because if it really waits forever,
then you have some other problem. I see it this way:
somebody wants to use socket communication. For that Windows
needs a helper thread. If this thread is not started, then we need
to Tcl_Panic! See the example of the notifier thread. If this thread
does not start, would you also wait 20 secs and declare thread
communication useless, or would you panic the system? Fortunately
the wise designer of that part decided to Tcl_Panic, which is the
only correct way to go.
Why relax that behaviour for sockets?

Please understand: the simplest thing for me is to go and put that
timeout back there. But I'm still not convinced that would be the
right solution.

davygrvy added on 2006-03-15 04:50:21:
Logged In: YES 
user_id=7549

The issue is not CreateWindow blocking, but the time it
takes the thread to start.  If your system is in such a
state that it takes more than 20 seconds to get the
notification window up, you'd better punt.  That's my point.
The timeout was very valid.  By taking it out, you're saying
the state of the system doesn't matter and that we should wait
until next Thursday for the thread to start and create the
notification window.

All it is is a "head check" and nothing more.

vasiljevic added on 2006-03-15 04:22:31:
Logged In: YES 
user_id=95086

The SocketThread's first thing to do is to 
   CreateWindow  (line 2326)
   then
   SetEvent(tsdPtr->readyEvent) (line 2333)

The thread creating it sits in 
   WaitForSingleObject(tsdPtr->readyEvent, INFINITE);
and waits for the readyEvent to be signaled.

There is no problem there. Unless the CreateWindow() 
blocks forever. Then the SetEvent is never executed
and the caller blocks indefinitely. Hence my question
once more: can CreateWindow block forever?
Yes or no? If yes: when and why?

I can put back the timeout there, but I do not see any valid
reason to do that. Do you have a test case that brings
the system into such a state?
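
For reference, a sketch of the two alternatives being debated here; neither
variant is the actual tclWinSock.c code, and the 20000 ms value simply
mirrors the old 20-second "head check":

    #include <windows.h>
    #include <tcl.h>

    /*
     * Variant currently in the tree, as described above: the creating
     * thread waits forever for the helper thread to create its window
     * and signal readyEvent.
     */
    static void
    WaitForSocketWindowInfinite(HANDLE readyEvent)
    {
        WaitForSingleObject(readyEvent, INFINITE);
    }

    /*
     * The finite "head check" variant, combined with the hard abort
     * suggested elsewhere in this thread for the case where the helper
     * thread cannot come up at all.
     */
    static void
    WaitForSocketWindowWithTimeout(HANDLE readyEvent)
    {
        if (WaitForSingleObject(readyEvent, 20000) != WAIT_OBJECT_0) {
            Tcl_Panic("socket helper thread failed to create its window");
        }
    }
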

davygrvy added on 2006-03-15 04:03:16:
Logged In: YES 
user_id=7549

What guarantee do you have that the thread can create the
new window and hit the wndproc's WM_CREATE?  An INFINITE
timeout is rather presumptuous.

Please find it in your heart to put back the conservative
"head check" timeout.

vasiljevic added on 2006-03-15 03:29:36:
Logged In: YES 
user_id=95086

David, 

You are speaking of some email where you explained
the problem. Where did you send it? I never got any such email!

All I got was your response in this thread in the bug tracker.
Hence I had to ask what's wrong. You would do the same,
wouldn't you? I believe we have some communication problem here...

I'm quoting: 
It waits for the async notification window to be in a ready state for use
by WSAAsyncSelect. If it really is taking over 20 seconds to
create a window in a thread, something else is wrong. 


WHAT else is wrong? Is this system (Windows) so nondeterministic?
I asked once. I'm asking once again:

       tsdPtr->hwnd = CreateWindow("TclSocket", "TclSocket", 
    WS_TILED, 0, 0, 0, 0, NULL, NULL, windowClass.hInstance, arg);

Is this call (CreateWindow) defined to return sometime?
Can it block forever?
If yes, for what reason, i.e. under which circumstances?
If yes, is the system usable any more after that?
Do you have a situation where you can simply 
and easily reproduce it? 

I have read the function specs in MSDN and there is 
nothing about that call blocking indefinitely.

I decided to let it block there instead of going over it. I would
like to know WHY the app blocks there! It is so simple
to put a timeout there and abandon, but that is not buying
you anything and it will most probably hang somewhere else.

If CreateWindow is known to block indefinitely, then
why did anybody create such a poor implementation
in the first place? If that is so, then it has to be rewritten.
I never solve such problems by brute force (which the
timeout at THAT place definitely is). And I have good reasons
to follow that mantra.

So: once again: can CreateWindow block indefinitely? 
If yes: when and why?

davygrvy added on 2006-03-15 02:19:33:
Logged In: YES 
user_id=7549

I already explained why in email.  The problem is not swept
under the carpet.  By removing the timeout and replacing it
with an infinite one, you have created a situation where a
deadlock can happen.

Let's go over this again... actually, why don't I just paste
my email here instead...

Zoran,

It isn't a timer, it's a wait with a timeout.  It waits for
the async notification window to be in a ready state for use
by WSAAsyncSelect. If it really is taking over 20 seconds to
create a window in a thread, something else is wrong.  The
state left after the timeout isn't undefined.  All publics
do the SocketsEnabled() check to see that the system is
usable, else punt.  I don't see how using an infinite
timeout makes it wait less for the window to become ready.

If you have patches for improvements, go for it, especially
regarding how Tcl does a tear-down.  Yes, I agree that the
I/O system tear-down is a bit backassward.

vasiljevic added on 2006-03-14 15:27:23:
Logged In: YES 
user_id=95086

I would love to hear why and where the deadlock is.

If you recall, there was a thread on the Tcl core list about
that, where I asked if there is or could be a situation where
the socket thread cannot start because it cannot create that
dummy window for event passing. Well, nobody commented
on that and the Windows manuals were also not helpful.
I decided to let it block there for a reason. If it blocks, then there
is some other problem and it is not OK to sweep it under the carpet.
You talk about deadlocks. What deadlocks? Can you explain when
and how the deadlock can happen?

I have no problem putting that silly timeout there, but from
this perspective and with this knowledge I think it is just bad
design to leave it there. I'm open to learning, though.

davygrvy added on 2006-03-14 08:19:19:
Logged In: YES 
user_id=7549

r1.52 has introduced a deadlock problem for when the
notification window cannot start.  The 20 second timeout that
used to be on line 502 needs to be put back.

What happens if tsdPtr->readyEvent is never signaled?  It
blocks forever.  Don't let it block forever; do something
reasonable to avoid a deadlock.

vasiljevic added on 2006-03-11 00:41:09:
Logged In: YES 
user_id=95086

Ok, it is now in both the 8.4 and head branches.
I will still have to add tests exposing
this bug so it won't hit us in the future, hence
the bug ticket is still left open.

vasiljevic added on 2006-03-10 20:40:23:
Logged In: YES 
user_id=95086

Ah, Andreas, regarding your hint about handling the socket
subsystem init/teardown lazily, the answer is
unfortunately NO, rather than yes.

Why?

Because this thing would just needlessly init/tear down
the subsystem exactly 1000 times, which is 999 times
too many:

  time {close [socket $host $port]} 1000
 
if there are no other sockets opened in
the process. 

I believe we should keep explicit teardown.

I also believe we should introduce an explicit
init as well, rather than checking the state
of the subsystem all the time in InitSockets().
This new call can be part of the Tcl lib bootstrap
during initialization of the IO subsystem. This
way it will be symmetric and easy to understand.

What do you think?

andreas_kupries added on 2006-03-01 05:43:53:
Logged In: YES 
user_id=75003

Yes, that (looping and repetition) is how I would do it as
well for a case which fails non-deterministically in
general, but with a relatively high probability. 20% is high
IMHO. So trying a hundred times is good enough for me.

vasiljevic added on 2006-03-01 05:30:30:
Logged In: YES 
user_id=95086

I wish I had a simple test case...
Normally, one would create a thread (using the testthread command
from the test suite), open a socket channel in that thread and then
exit the thread. But... this will NOT always break. Sometimes
nothing will happen, sometimes the thread will never exit and
everything will stall, and sometimes the app will core. In our app
it was about 2 out of 10, i.e. 2 hangs out of 10 attempts.
Well, I could try to stress the behaviour by creating/destroying
threads and channels in a row some tens or hundreds of times...

andreas_kupries added on 2006-03-01 05:16:02:
Logged In: YES 
user_id=75003


> OK. Easy.  But you will have to follow me...

I will try.
 
> If you open a socket, don't close it, and exit the thread
> (or the application), this will eventually come to
> 
>   Tcl_FinalizeThread()
> 
> The Tcl_FinalizeThread will pull thread-exit handlers
> one by one. One of them will be the handler to tear down
> the socket subsystem.

> This thread-exit handler will tear down the socket thread and
> all the sync primitives for synchronization. After that
> no calls into this subsystem should be performed.

Ah, now I see what problem I had before. I was looking at
TclpFinalizeSockets for the Mac :( D'oh.

> But....
> After that, the TclFinalizeIOSubsystem() is called.
> The TclFinalizeIOSubsystem walks the list of opened channels,
> finds one still open and tries to close it (line 254 in generic/tclIO.c).
> The Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
> The Tcl_CutChannel will, as the last step of the operation, call
> ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
> Understand? Now, the ThreadActionProc will result in
> TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
> using tsdPtr->socketListLock, will attempt to SendMessage to the
> window owned by the SocketThread, which has already exited,
> and so on and so on...

Ok, yes, I see the problem with that. So the new ThreadAction
stuff I did caused this, due to the finalize ordering not being
fully understood. My apologies. Although ... no, we had no
thread actions before, which caused crashes when the socket moved
around between threads. My integration fixed that, and opened up
the problem with the finalize ordering :( Thanks for
finding and fixing this.


> Is this now a little bit clearer?

Yep.
 
> Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
> If this is in place, then this will be better and more logical. The first
> channel/socket inits, the last closing tears down the subsystem.
> This is how I'd do the thing if I had designed it.

Note: ThreadActions are a very new thing, they were not
available when the driver was written initially.

> But I was not, so I had to at least rectify the obviously broken
> current implementation.

I guess we can try to rewrite the driver to make use of the
thread actions in this way, now that the problem is fixed.

Note: Do you have test cases which demonstrate the crash
with the old implementation ? That would be good to have to
prevent any future implementation from running into the same
trap.

vasiljevic added on 2006-03-01 04:19:39:
Logged In: YES 
user_id=95086

OK. Easy.  But you will have to follow me...

If you open a socket, don't close it, and exit the thread
(or the application), this will eventually come to

  Tcl_FinalizeThread()

The Tcl_FinalizeThread will pull thread-exit handlers
one by one. One of them will be the handler to tear down
the socket subsystem.
This thread-exit handler will tear down the socket thread and
all the sync primitives for synchronization. After that
no calls into this subsystem should be performed.
But....
After that, the TclFinalizeIOSubsystem() is called.
The TclFinalizeIOSubsystem walks the list of opened channels,
finds one still open and tries to close it (line 254 in generic/tclIO.c).
The Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
The Tcl_CutChannel will, as the last step of the operation, call
ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
Understand? Now, the ThreadActionProc will result in
TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
using tsdPtr->socketListLock, will attempt to SendMessage to the
window owned by the SocketThread, which has already exited,
and so on and so on...

Is this now a little bit clearer?

Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
If this is in place, then this will be better and more logical. The first
channel/socket inits, the last closing tears down the subsystem.
This is how I'd do the thing if I had designed it. But I was not, so I
had to at least rectify the obviously broken current implementation.
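
A compressed sketch of the thread-action step at the end of that chain,
showing why it needs the per-thread socket state to still be alive; the
structure fields, message id and function signature here are illustrative,
not the exact tclWinSock.c code:

    #include <windows.h>
    #include <tcl.h>

    #define SOCKET_SELECT (WM_USER + 2)    /* illustrative message id */

    #ifndef TCL_CHANNEL_THREAD_REMOVE
    #define TCL_CHANNEL_THREAD_REMOVE 1    /* placeholder if tcl.h lacks it */
    #endif

    typedef struct SocketInfo SocketInfo;  /* per-socket bookkeeping */

    typedef struct ThreadSpecificData {
        Tcl_Mutex   socketListLock;   /* guards the per-thread socket list */
        SocketInfo *socketList;       /* sockets owned by this thread */
        HWND        hwnd;             /* window owned by the helper thread */
    } ThreadSpecificData;

    /*
     * Invoked via the TIP #218 thread-action hook when a channel is cut
     * from or spliced into a thread. If the thread-exit handler has
     * already torn down tsdPtr (the old ordering), the lock and the
     * window used below are gone, and this hangs or crashes.
     */
    static void
    TcpThreadActionSketch(ThreadSpecificData *tsdPtr, SocketInfo *infoPtr,
                          int action)
    {
        Tcl_MutexLock(&tsdPtr->socketListLock);
        /* ... unlink or insert infoPtr in tsdPtr->socketList ... */
        Tcl_MutexUnlock(&tsdPtr->socketListLock);

        if (action == TCL_CHANNEL_THREAD_REMOVE) {
            /* Ask the helper thread's window to stop watching this socket. */
            SendMessage(tsdPtr->hwnd, SOCKET_SELECT, (WPARAM) 0,
                    (LPARAM) infoPtr);
        }
    }
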

andreas_kupries added on 2006-03-01 03:57:32:
Logged In: YES 
user_id=75003

Hm. I can't see in the patch which of the data released by the
thread-exit-handler is accessed later during finalization.
The event source? If yes, how?

Question: Is that something which in 8.5 can be handled by
the TIP #218 Thread Actions? (Last channel removed from a
thread causes tear-down of the subsystem, like first channel
added may auto-initialize it?)

vasiljevic added on 2006-03-01 03:30:52:
Logged In: YES 
user_id=95086

The changes in tclEvent.c are just cosmetic.

I wanted to check this in, but AK suggested filing the bug report
so others could comment.

Yes, similar changes are also needed for 8.5. I wanted
to do both in one run.

dgp added on 2006-03-01 01:34:22:

File Added - 169193: 1437595.patch

dgp added on 2006-03-01 01:34:21:
Logged In: YES 
user_id=80530


Here's the patch in
unified diff format for easier
review.

Can you confirm that all the
changes to tclEvent.c are
cosmetic only?

Does this need further review,
or will you be committing it?

Are similar changes required
for Tcl 8.5a4 ?

vasiljevic added on 2006-02-24 01:19:46:

File Added - 168602: winsock.patch

Attachments: