Tcl Source Code

Ticket UUID: 1437595
Title: WinSock cores/hangs on thread or app exit
Type: Bug Version: obsolete: 8.4.12
Submitter: vasiljevic Created on: 2006-02-23 18:19:44
Subsystem: 27. Channel Types Assigned To: vasiljevic
Priority: 5 Medium Severity:
Status: Open Last Modified: 2010-01-12 22:09:17
Resolution: Fixed Closed By:
    Closed on:
Description:
There is a sequencing problem in the way the
win socket/pipe subsystem is torn down. The generic
code which handles finalization invokes the thread
exit handlers first, then finalizes the IO subsystem.
Unfortunately, the finalization of the IO subsystem
may access data which has already been garbage-collected
by the thread exit handler. In such cases, the application
will either hang or core.

The attached patch works around this by moving the
functionality of the thread-exit handler into a
special TclpFinalizeSockets() call. This call is now
part of the generic TclFinalizeIOSubsystem.

This change is a no-op for Unix. It affects the Windows
(win/) and old Mac OS 9 (mac/) code bases.
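
As a rough illustration of the ordering change (a sketch only, not the
patch itself; RunThreadExitHandlers() below is a hypothetical stand-in for
the generic exit-handler loop, and the prototypes are simplified):

    /* Simplified prototypes for the routines named in this ticket. */
    extern void TclFinalizeIOSubsystem(void);
    extern void TclpFinalizeSockets(void);
    extern void RunThreadExitHandlers(void);    /* hypothetical stand-in */

    /*
     * Old order: the per-thread exit handler destroys the socket state,
     * then the IO finalization may still touch that freed state.
     */
    static void
    OldFinalizeOrder(void)
    {
        RunThreadExitHandlers();      /* socket TSD torn down here */
        TclFinalizeIOSubsystem();     /* still closes leftover socket channels */
    }

    /*
     * New order: the socket teardown is performed from within the IO
     * finalization itself, after the remaining channels have been closed.
     */
    static void
    NewFinalizeOrder(void)
    {
        TclFinalizeIOSubsystem();     /* ends by calling TclpFinalizeSockets() */
        RunThreadExitHandlers();      /* no longer tears down the sockets */
    }
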
User Comments: ferrieux added on 2010-01-12 22:09:17:
Ping. Please verify that it is now working in 8.[456] HEADs.

ferrieux added on 2009-06-16 05:17:35:
I'd bet this has been fixed by the introduction of late exit handlers and the associated demotion of WSACleanup. Zoran can you confirm ?

vasiljevic added on 2007-09-26 21:58:34:
Logged In: YES 
user_id=95086
Originator: YES

Thanks for your detailed report. I know about this; we fixed
our internal copy some time ago and have been field-testing it
for a while before I commit the change. The trouble with that
WinSock code is that it is pretty "old" and clumsy. What you've
found is just a small subset of the problems....

[email protected] added on 2007-09-26 21:54:21:
Logged In: NO 

I believe there is a defect on socket shutdown (Windows)

In TclpFinalizeSockets() a SOCKET_TERMINATE message is posted to the socket's window, followed by a wait for readyEvent to be signaled (which the wndProc signals when its event loop exits, after receiving its destroy message which is triggered by the SOCKET_TERMINATE message).  

Since readyEvent is also signaled when data arrives on the socket, the wait in TclpFinalizeSockets can return before the socket window has been destroyed if there is activity on the socket at that time.

This can cause a number of deleterious side effects, such as the call to UnregisterClass() failing (due to a window still existing) - and then a failure to initialize sockets due to RegisterClassA failing as the window class already exists - in a scenario where Tcl is unloaded and re-loaded from within the same process instance.

This issue was resolved by using a separate event object solely for tracking the socket window creation and destruction.

A related issue in InitSockets is that sockets are marked as initialized even if one of the initialization steps fails (e.g. RegisterClassA fails). Subsequent calls to InitSockets do not execute the code to set up the Winsock DLL, but do execute the per-thread initialization instructions.
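
A minimal sketch of the separate-event approach described above; the
structure layout and the SOCKET_TERMINATE value here are illustrative
stand-ins, not the actual tclWinSock.c declarations:

    #include <windows.h>

    typedef struct ThreadSpecificData {
        HWND   hwnd;           /* hidden socket notification window */
        HANDLE readyEvent;     /* signaled on socket activity (and, before the
                                * fix, reused for window creation/destruction) */
        HANDLE windowEvent;    /* dedicated event, signaled only when the
                                * window has been created or destroyed */
        HANDLE socketThread;   /* helper thread that owns the window */
    } ThreadSpecificData;

    #define SOCKET_TERMINATE (WM_USER + 1)    /* illustrative message id */

    static void
    FinalizeSocketsSketch(ThreadSpecificData *tsdPtr)
    {
        if (tsdPtr->hwnd != NULL) {
            PostMessage(tsdPtr->hwnd, SOCKET_TERMINATE, 0, 0);
            /*
             * Wait on the dedicated window event, so a data notification
             * arriving on readyEvent cannot wake us before the window is
             * really destroyed and UnregisterClass() can succeed.
             */
            WaitForSingleObject(tsdPtr->windowEvent, INFINITE);
            tsdPtr->hwnd = NULL;
        }
    }
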

vasiljevic added on 2006-09-29 18:41:32:
Logged In: YES 
user_id=95086

Most interesting... I will look this over again. After the change
we never had any problems in that area of the winsock driver.
There are others (I will still have to fix a couple of issues
sitting in the code) and I will use this time to review what
you report.

Thanks for the notice.

dunkfan added on 2006-09-29 18:28:08:
Logged In: YES 
user_id=1183601

FWIW I had a hanging problem with 8.4.13 XPSP2 after 
calling exit. 

Tracked it down to the WaitForSingleObject(tsdPtr->readyEvent, INFINITE)
call in the TclpFinalizeSockets() code in tclWinSock.c...

Then I included the code which used to be in there (from 
8.4.6) to check whether the thread was STILL_ACTIVE before 
jumping into the Wait call.

DWORD exitCode;

GetExitCodeThread(tsdPtr->socketThread, &exitCode);
if (exitCode == STILL_ACTIVE) {
    if (tsdPtr->hwnd != NULL) {
        PostMessage(tsdPtr->hwnd, SOCKET_TERMINATE, 0, 0);
        /*
         * Wait for the thread to exit. This ensures that we are
         * completely cleaned up before we leave this function.
         */
        WaitForSingleObject(tsdPtr->readyEvent, INFINITE);
        tsdPtr->hwnd = NULL;
    }
}

That seems to cure the problem...

vasiljevic added on 2006-03-20 15:11:43:
Logged In: YES 
user_id=95086

It does, but this is not enough. The problem is that there is no such
hook per thread. If I Tcl_CreateThread(), then none of the Tcl things get
initialized there. I need to Tcl_CreateInterp in that thread in order to
pull up the init sequence, and this is not always needed nor feasible.
In my practical experience, there are quite a few places where I used
Tcl_CreateThread from the Tcl API to create another thread which does not
necessarily have any Tcl interp attached.

I still think that, as it is now, it is completely sufficient.
David's idea about scrapping TclpHasSockets is also worth
considering. I can't imagine any modern platform today w/o
socket support.

dgp added on 2006-03-19 08:04:59:
Logged In: YES 
user_id=80530


Just on the point of a hook
for initialization.  Seems to
me that TclInitSubsystems is
the base for that.

davygrvy added on 2006-03-19 04:14:42:
Logged In: YES 
user_id=7549

kbk brought up the idea to stop the dynamic loading of
wsock32.dll and just import-link to it.  The need for
dynamic linking dates back to the early Win95 days when
tcp/ip (DUN) wasn't always installed.

TclpHasSockets() can possibly disappear if we want to do it,
though we still have to call WSAStartup().

I'm not sure if that helps us, just makes InitSockets() smaller.

vasiljevic added on 2006-03-19 00:40:31:
Logged In: YES 
user_id=95086

Damn. 

The more I dig into that, the more dirt comes out...

My humble conclusion is that the entire init/teardown of
Tcl is a patchwork. It is not possible to change A without
affecting Z :-(

As I see it, the socket subsystem is designed to be loaded lazily.
That means the first thing that pulls in any socket-related functionality,
be it the [socket] command or any exported C-API call, will
eventually load and initialize the entire subsystem.
Per se this is not a bad idea, as it conserves resources.
But it creates problems if you need to have a call like TclpHasSockets.

TclpHasSockets is one of the callers of InitSockets(), so if InitSockets() blocks,
the caller will block as well. If you divorce those two, then you need to
explicitly load the socket subsystem, which is what I wanted to do.
But then you end up in a situation where you need to do this before
doing *anything* else with the C or Tcl API. Unfortunately there is no
such hook in the API.

So I will have to think a little bit more.

The current situation is: TclpHasSockets() will block until InitSockets() has
created the complete environment, *including* the helper socket thread
and the associated window. I personally do not see any problem in
such behaviour. Yet, I will make another attempt, but the future does not
look bright... unless we rewrite the entire per-thread initialization
stuff, which I would not like to do at this time.

davygrvy added on 2006-03-17 15:44:03:
Logged In: YES 
user_id=7549

Sounds great.

dkf added on 2006-03-16 16:30:04:
Logged In: YES 
user_id=79902

Sounds like a good plan to me (speaking from the sidelines).

vasiljevic added on 2006-03-16 16:19:48:
Logged In: YES 
user_id=95086

Here is how I would solve that:

I would introduce a new platform-specific call: TclpInitializeSockets().
This should be symmetric to TclpFinalizeSockets() and should be called
from within TclInitSubsystems() in the generic layer.
This call would do exactly what InitSockets() was doing in
tclWinSock.c. Actually, I'd just rename InitSockets() -> TclpInitializeSockets()
and have it called from the generic part explicitly.

OTOH, TclpHasSockets() will just examine winsock.hModule != NULL,
hence it is non-blocking. It will not attempt to lazy-load the entire socket
subsystem as it does now. So that is one lock source less, which is a nice
side-effect.

This way the entire socket subsystem is initialized *once* per thread creation.
If any call to it breaks at some place AFTER (initially) loading the socket DLL,
then we punt (Tcl_Panic).

What do you think?
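
A minimal sketch of the non-blocking TclpHasSockets() proposed above,
assuming the loaded-module handle lives in a file-static winsock structure
as the comment suggests; this is not the committed code, and the error
message the real routine would leave in the interp is omitted:

    #include <windows.h>
    #include <tcl.h>

    static struct {
        HINSTANCE hModule;     /* non-NULL once wsock32.dll has been loaded */
    } winsock;

    /*
     * Report whether the socket DLL is available, without lazily pulling
     * up the whole subsystem (no helper thread, no window, no locking).
     */
    int
    TclpHasSockets(Tcl_Interp *interp)
    {
        (void) interp;
        return (winsock.hModule != NULL) ? TCL_OK : TCL_ERROR;
    }
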

vasiljevic added on 2006-03-16 03:18:05:
Logged In: YES 
user_id=95086

Correct. Once the DLL is loaded OK and all needed functions were found,
we *should* have a working socket subsystem.
If at that point ANY of the things fail to initialize, then this is a sign to stop.
It is better to stop than to go on halfway, IMO.

TclpHasSockets() should not break and should really just test
winsock.hModule != NULL and return TCL_OK in that case.

I guess the only question now is how to get all this done right.
The tricky part is InitSockets() and TclpHasSockets(). 
These two must somehow divorce. I'm not yet sure how but
I will give it a thought today/tomorrow.

I would Tcl_Panic in all cases when something goes wrong in InitSockets()
if and only if all functions were loaded OK from the DLL.

davygrvy added on 2006-03-16 01:57:35:
Logged In: YES 
user_id=7549

I hadn't thought about a hard abort before, but it wouldn't
be wrong in this case over the punt.  The same rule can apply to
the preceding check for a thread handle and even the
following window handle check, too.

We can't allow TclpHasSockets() to cause a hard abort
though.  So it might be a bit of a dance to get right.

vasiljevic added on 2006-03-15 05:25:25:
Logged In: YES 
user_id=95086

This does not sound very deterministic to me. I would never
trust such a system. Why 20 secs? Why not 10 or 100 or 1000?

I do not believe this is the right way to go.

I guess if the thread needs that long a time (20 secs in this case is
almost forever) to start up, then something is wrong with the
computer and there is no reason to continue.
If thread A wants/needs to start thread B, then it must
wait forever if needed. Because if it really waits forever,
then you have some other problem. I see it this way:
somebody wants to use socket communication. For that Windows
needs a helper thread. If this thread is not started, then we need
to Tcl_Panic! See the example of the notifier thread. If this thread
does not start, would you also wait 20 secs and declare thread
communication useless, or would you panic the system? Fortunately
the wise designer of that part decided to Tcl_Panic, which is the
only correct way to go.
Why relax that behaviour for sockets?

Please understand: the simplest thing for me is to go and put that
timeout back there. But I'm still not convinced that would be the
right solution.

davygrvy added on 2006-03-15 04:50:21:
Logged In: YES 
user_id=7549

The issue is not CreateWindow blocking, but the time it
takes the thread to start.  If your system is in such a
state that it takes more than 20 seconds to get the
notification window up, you'd better punt.  That's my point.
The timeout was very valid.  By taking it out, you're saying
the state of the system doesn't matter and that we should wait
until next Thursday for the thread to start and create the
notification window.

All it is is a "head check" and nothing more.

vasiljevic added on 2006-03-15 04:22:31:
Logged In: YES 
user_id=95086

The SocketThread's first thing to do is to 
   CreateWindow  (line 2326)
   then
   SetEvent(tsdPtr->readyEvent) (line 2333)

The thread creating it sits in 
   WaitForSingleObject(tsdPtr->readyEvent, INFINITE);
and waits for the readyEvent to be signaled.

There is no problem there. Unless the CreateWindow() 
blocks forever. Then the SetEvent is never executed
and the caller blocks indefinitely. Hence my question
once more: can CreateWindow block forever?
Yes or no? If yes: when and why?

I can put back the timeout there, but I do not see any valid
reason to do that. Do you have a test case that brings
the system into such a state?
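
For reference, a sketch of the two alternatives being debated here; neither
variant is the actual tclWinSock.c code, and the 20000 ms value simply
mirrors the old 20-second "head check":

    #include <windows.h>
    #include <tcl.h>

    /*
     * Variant currently in the tree, as described above: the creating
     * thread waits forever for the helper thread to create its window
     * and signal readyEvent.
     */
    static void
    WaitForSocketWindowInfinite(HANDLE readyEvent)
    {
        WaitForSingleObject(readyEvent, INFINITE);
    }

    /*
     * The finite "head check" variant, combined with the hard abort
     * suggested elsewhere in this thread for the case where the helper
     * thread cannot come up at all.
     */
    static void
    WaitForSocketWindowWithTimeout(HANDLE readyEvent)
    {
        if (WaitForSingleObject(readyEvent, 20000) != WAIT_OBJECT_0) {
            Tcl_Panic("socket helper thread failed to create its window");
        }
    }
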

davygrvy added on 2006-03-15 04:03:16:
Logged In: YES 
user_id=7549

What guarantee do you have that the thread can create the
new window and hit the wndproc's WM_CREATE?  An INFINITE
timeout is rather presumptuous.

Please find it in your heart to put back the conservative
"head check" timeout.

vasiljevic added on 2006-03-15 03:29:36:
Logged In: YES 
user_id=95086

David, 

You are speaking of some email where you explained
the problem. Where did you send it? I never got any such email!

All I got was your response in this thread in the bug tracker.
Hence I had to ask what's wrong. You would do the same,
wouldn't you? I believe we have some communication problem here...

I'm quoting: 
It waits for the async notification window to be in a ready state for use
by WSAAsyncSelect. If it really is taking over 20 seconds to
create a window in a thread, something else is wrong. 


WHAT else is wrong? Is this system (Windows) so nondeterministic?
I asked once. I'm asking once again:

       tsdPtr->hwnd = CreateWindow("TclSocket", "TclSocket", 
    WS_TILED, 0, 0, 0, 0, NULL, NULL, windowClass.hInstance, arg);

Is this call (CreateWindow) defined to return sometime?
Can it block forever?
If yes, for what reason, i.e. under which circumstances?
If yes, is the system usable any more after that?
Do you have a situation where you can simply 
and easily reproduce it? 

I have read the function specs in MSDN and there is 
nothing about that call blocking indefinitely.

I decided to let it block there instead of going over it. I would
like to know WHY the app blocks there! It is so simple
to put a timeout there and abandon, but that is not buying
you anything and it will most probably hang somewhere else.

If CreateWindow is known to block indefinitely, then
why did anybody create such a poor implementation
in the first place? If that is so, then it has to be rewritten.
I never solve such problems by brute force (which the
timeout at THAT place definitely is). And I have good reasons
to follow that mantra.

So: once again: can CreateWindow block indefinitely? 
If yes: when and why?

davygrvy added on 2006-03-15 02:19:33:
Logged In: YES 
user_id=7549

I already explained why in email.  The problem is not swept
under the carpet.  By removing the timeout and replacing it
with an infinite one, you have created a situation where a
deadlock can happen.

Let's go over this again... actually, why don't I just paste
my email here instead...

Zoran,

It isn't a timer, it's a wait with a timeout.  It waits for
the async notification window to be in a ready state for use
by WSAAsyncSelect. If it really is taking over 20 seconds to
create a window in a thread, something else is wrong.  The
state left after the timeout isn't undefined.  All publics
do the SocketsEnabled() check to see that the system is
usable, else punt.  I don't see how using an infinite
timeout makes it wait less for the window to become ready.

If you have patches for improvements, go for it, especially
regarding how Tcl does a tear-down.  Yes, I agree that the
I/O system tear-down is a bit backassward.

vasiljevic added on 2006-03-14 15:27:23:
Logged In: YES 
user_id=95086

I would love to hear why and where the deadlock is.

If you recall, there was a thread on the Tcl core list about
that, where I asked if there is or could be a situation where
the socket thread cannot start because it cannot create that
dummy window for event passing. Well, nobody commented
on that and the Windows manuals were also not helpful.
I decided to let it block there for a reason. If it blocks, then there
is some other problem and it is not OK to sweep it under the carpet.
You talk about deadlocks. What deadlocks? Can you explain when
and how the deadlock can happen?

I have no problem putting that silly timeout there, but from
this perspective and with this knowledge I think it is just bad
design to leave it there. I'm open to learning, though.

davygrvy added on 2006-03-14 08:19:19:
Logged In: YES 
user_id=7549

r1.52 has introduced a deadlock problem for when the
notification window cannot start.  The 20 second timeout that
used to be on line 502 needs to be put back.

What happens if tsdPtr->readyEvent is never signaled?  It
blocks forever.  Don't let it block forever; do something
reasonable to avoid a deadlock.

vasiljevic added on 2006-03-11 00:41:09:
Logged In: YES 
user_id=95086

Ok, it is now in both the 8.4 and head branches.
I will still have to add tests exposing
this bug so it won't hit us in the future, hence
the bug ticket is still left open.

vasiljevic added on 2006-03-10 20:40:23:
Logged In: YES 
user_id=95086

Ah, Andreas, regarding your hint about handling the socket
subsystem init/teardown lazily, the answer is
unfortunately NO, rather than yes.

Why?

Because this thing would just needlessly init/tear down
the subsystem exactly 1000 times, which is 999 times
too many:

  time {close [socket $host $port]} 1000
 
if there are no other sockets opened in
the process. 

I believe we should keep explicit teardown.

I also believe we should introduce an explicit
init as well, rather than checking the state
of the subsystem all the time in InitSockets().
This new call can be part of the Tcl lib bootstrap
during initialization of the IO subsystem. This
way it will be symmetric and easy to understand.

What do you think?

andreas_kupries added on 2006-03-01 05:43:53:
Logged In: YES 
user_id=75003

Yes, that (looping and repetition) is how I would do it as
well for a case which fails non-deterministically in
general, but with a relatively high probability. 20% is high
IMHO. So trying a hundred times is good enough for me.

vasiljevic added on 2006-03-01 05:30:30:
Logged In: YES 
user_id=95086

I wish I had a simple test case...
Normally, one would create a thread (using the testthread command
from the test suite), open a socket channel in that thread and then
exit the thread. But... this will NOT always break. Sometimes
nothing will happen, sometimes the thread will never exit and
everything will stall, and sometimes the app will core. In our app
it was about 2 out of 10, i.e. 2 hangs out of 10 attempts.
Well, I could try to stress the behaviour by creating/destroying
threads and channels in a row some tens or hundreds of times...

andreas_kupries added on 2006-03-01 05:16:02:
Logged In: YES 
user_id=75003


> OK. Easy.  But you will have to follow me...

I will try.
 
> If you open a socket, don't close it, and exit the thread
> (or the application), this will eventually come to
> 
>   Tcl_FinalizeThread()
> 
> The Tcl_FinalizeThread will pull thread-exit handlers
> one by one. One of them will be the handler to tear down
> the socket subsystem.

> This thread-exit handler will tear down the socket thread and
> all the sync primitives for synchronization. After that
> no calls into this subsystem should be performed.

Ah, now I see what problem I had before. I was looking at
TclpFinalizeSockets for the Mac :( D'oh.

> But....
> After that, the TclFinalizeIOSubsystem() is called.
> The TclFinalizeIOSubsystem walks the list of opened channels,
> finds one still open and tries to close it (line 254 in generic/tclIO.c).
> The Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
> The Tcl_CutChannel will, as the last step of the operation, call
> ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
> Understand? Now, the ThreadActionProc will result in
> TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
> using tsdPtr->socketListLock, will attempt to SendMessage to the
> window owned by the SocketThread, which has already exited,
> and so on and so on...

Ok, yes, I see the problem with that. So the new ThreadAction
stuff I did caused this, due to the finalize ordering not being
fully understood. My apologies. Although ... no, we had no
thread actions before, which caused crashes when the socket moved
around between threads. My integration fixed that, and opened up
the problem with the finalize ordering :( Thanks for
finding and fixing this.


> Is this now a little bit clearer?

Yep.
 
> Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
> If this is in place, then this will be better and more logical. The first
> channel/socket inits, the last closing tears down the subsystem.
> This is how I'd do the thing if I had designed it.

Note: ThreadActions are a very new thing, they were not
available when the driver was written initially.

> But I was not, so I had to at least rectify the obviously broken
> current implementation.

I guess we can try to rewrite the driver to make use of the
thread actions in this way, now that the problem is fixed.

Note: Do you have test cases which demonstrate the crash
with the old implementation ? That would be good to have to
prevent any future implementation from running into the same
trap.

vasiljevic added on 2006-03-01 04:19:39:
Logged In: YES 
user_id=95086

OK. Easy.  But you will have to follow me...

If you open a socket, don't close it, and exit the thread
(or the application), this will eventually come to

  Tcl_FinalizeThread()

The Tcl_FinalizeThread will pull thread-exit handlers
one by one. One of them will be the handler to tear down
the socket subsystem.
This thread-exit handler will tear down the socket thread and
all the sync primitives for synchronization. After that
no calls into this subsystem should be performed.
But....
After that, the TclFinalizeIOSubsystem() is called.
The TclFinalizeIOSubsystem walks the list of opened channels,
finds one still open and tries to close it (line 254 in generic/tclIO.c).
The Tcl_Close will invoke CloseChannel and this will call Tcl_CutChannel.
The Tcl_CutChannel will, as the last step of the operation, call
ThreadActionProc with TCL_CHANNEL_THREAD_REMOVE.
Understand? Now, the ThreadActionProc will result in
TcpThreadActionProc (line 2689 win/tclWinChan.c) and will start
using tsdPtr->socketListLock, will attempt to SendMessage to the
window owned by the SocketThread, which has already exited,
and so on and so on...

Is this now a little bit clearer?

Concerning lazy init/teardown of the socket subsystem, the answer is: yes.
If this is in place, then this will be better and more logical. The first
channel/socket inits, the last closing tears down the subsystem.
This is how I'd do the thing if I had designed it. But I was not, so I
had to at least rectify the obviously broken current implementation.
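
A compressed sketch of the thread-action step at the end of that chain,
showing why it needs the per-thread socket state to still be alive; the
structure fields, message id and function signature here are illustrative,
not the exact tclWinSock.c code:

    #include <windows.h>
    #include <tcl.h>

    #define SOCKET_SELECT (WM_USER + 2)    /* illustrative message id */

    #ifndef TCL_CHANNEL_THREAD_REMOVE
    #define TCL_CHANNEL_THREAD_REMOVE 1    /* placeholder if tcl.h lacks it */
    #endif

    typedef struct SocketInfo SocketInfo;  /* per-socket bookkeeping */

    typedef struct ThreadSpecificData {
        Tcl_Mutex   socketListLock;   /* guards the per-thread socket list */
        SocketInfo *socketList;       /* sockets owned by this thread */
        HWND        hwnd;             /* window owned by the helper thread */
    } ThreadSpecificData;

    /*
     * Invoked via the TIP #218 thread-action hook when a channel is cut
     * from or spliced into a thread. If the thread-exit handler has
     * already torn down tsdPtr (the old ordering), the lock and the
     * window used below are gone, and this hangs or crashes.
     */
    static void
    TcpThreadActionSketch(ThreadSpecificData *tsdPtr, SocketInfo *infoPtr,
                          int action)
    {
        Tcl_MutexLock(&tsdPtr->socketListLock);
        /* ... unlink or insert infoPtr in tsdPtr->socketList ... */
        Tcl_MutexUnlock(&tsdPtr->socketListLock);

        if (action == TCL_CHANNEL_THREAD_REMOVE) {
            /* Ask the helper thread's window to stop watching this socket. */
            SendMessage(tsdPtr->hwnd, SOCKET_SELECT, (WPARAM) 0,
                    (LPARAM) infoPtr);
        }
    }
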

andreas_kupries added on 2006-03-01 03:57:32:
Logged In: YES 
user_id=75003

Hm. I can't see in the patch which of the data released by the
thread-exit-handler is accessed later during finalization.
The event source? If yes, how?

Question: Is that something which in 8.5 can be handled by
the TIP #218 Thread Actions? (Last channel removed from a
thread causes tear-down of the subsystem, like first channel
added may auto-initialize it?)

vasiljevic added on 2006-03-01 03:30:52:
Logged In: YES 
user_id=95086

The changes in tclEvent.c are just cosmetic.

I wanted to check this in, but AK suggested filing the bug report
so others could comment.

Yes, similar changes are also needed for 8.5. I wanted
to do both in one run.

dgp added on 2006-03-01 01:34:22:

File Added - 169193: 1437595.patch

dgp added on 2006-03-01 01:34:21:
Logged In: YES 
user_id=80530


Here's the patch in
unified diff format for easier
review.

Can you confirm that all the
changes to tclEvent.c are
cosmetic only?

Does this need further review,
or will you be committing it?

Are similar changes required
for Tcl 8.5a4 ?

vasiljevic added on 2006-02-24 01:19:46:

File Added - 168602: winsock.patch

Attachments: