Tcl Source Code

View Ticket
Login
Ticket UUID: 8bd13f07bde6fb0631f27927e36461fdefe8ca95
Title: Closing tcl pipes prevents windows threads from starting up
Type: Patch Version: All
Submitter: stanko Created on: 2014-02-18 19:20:21
Subsystem: 49. Threading Assigned To: sebres
Priority: 5 Medium Severity: Critical
Status: Pending Last Modified: 2023-10-16 18:49:02
Resolution: Fixed Closed By: nobody
    Closed on:
Description:

Bug reproduction:
This issue affects all Windows tcl builds and is hard to reproduce; I had the most success in reproducing it on a very busy virtual vmware machine with windows server 2012.

Problem:
The problem is that tcl terminates pipe's helper threads if they don't exit in a "timely" manner (20 ms).
In some cases, 20 ms is not enough and those threads are terminated while they are being unloaded by the windows.

More details:
Terminating threads during their initialization or teardown phase may result in ntdll.dll's LoaderLock to remain locked indefinitely.
This causes ntdll.dll's LdrpInitializeThread() to deadlock trying to acquire LoaderLock.
LdrpInitializeThread() is executed within new threads to perform initialization and to execute DllMain() of loaded dlls.
As a result, all new threads are deadlocked in their initialization phase and never execute, even though CreateThread() reports successful thread creation.
This results in a very weird process-wide behavior, which is extremely hard to debug: threads are simply not being executed…

Fix:
The fix is to make sure that threads are not terminated during their initialization or teardown phase.
See attached patch for a fix.
Note that even though this patch fixes the issue at hand, tclWinPipe.c still terminates threads on busy machines very often (even if they are not blocked on IO), resulting in significant memory leaks.
The proper fix would be for tcl to stop terminating threads once and for all. This may require redesign of windows' pipes implementation, maybe using windows named pipes instead of anonymous ones.

User Comments: jan.nijtmans added on 2023-10-16 18:49:02:

@sebres wrote:

>I'll happily fix the issues, if someone gets trouble with it :)

Already assigned to @sebres.

A [New ticket] came in, claiming that this patch is causing problems. @sebres, would you be able to have a look (I really don't want to revert this change, because trading one bug for another makes no sense at all)


sebres added on 2017-05-05 14:45:42:
Interim status: the bug "R6016" is fixed now (must be still rebased & pushed to fossil), but...

But, as a result of new implementation of the pipe-workers (like a pool with reusable threads), it is also very fast now and I get race condition by multi-threded execution somewhere round about process-end.

Just to compare (10 threads using "tpool", exec cmd /c echo test):

## == new version (very fast, but with sporadic "race-conditional" error):

***tid2F34--ERROR:child process exited abnormally
    while executing
"exec $env(COMSPEC) /c echo test 2>@1"
    invoked from within
"time {exec $env(COMSPEC) /c echo test 2>@1} 100"
******** end
...
***tid151C--42278.33 microseconds per iteration
***tid1E50--42476.65 microseconds per iteration
***tid27DC--42676.75 microseconds per iteration
***tid294C--42818.5 microseconds per iteration

## == old version (slow, but always without errors):

***tid24C0--379016.0423759178 microseconds per iteration
***tid1F2C--380512.505328907 microseconds per iteration
...
***tid11F4--395925.6099012296 microseconds per iteration
***tid1F1C--398007.4137968089 microseconds per iteration
***tid2B2C--398003.92374362086 microseconds per iteration

Not reproduceable with fewer threads on my mashine (should be >= 10).
If someone have any idea - welcome...

sebres added on 2017-05-03 19:11:22:
So bug "R6016" occurs currently on exit of tclsh if some pipe-workers are still alive (in shutdown process), but the main-thread exited before it happens.

In this case 8.6/trunk (with fix-1997007) may be affected also, if someone uses tcl.dll with own or rewritten TclMain, where exit occurs faster).
Just currently not reproduceable, because tcl finalization is a little bit slower there.
So Jan, you were right with the "vague fears" :)

The possible solution (that would be too complex just for this case):
- create pool of pipe-workers (for the case if pipe-worker receives PTI_STATE_STOP signal and shutdown asynchronously);
- in case that the pipe-thread is not down, add worker to this pool;
- create thread exit-handler (on main thread) that will definitively shutdown all pending workers (waiting for down and/or terminate pipe-worker if it still alive).

Another possible solution: calling TerminateProcess implicit before ExitProcess.
It works, but I don't like it also.

Thinking about better handling to realize proper shutdown of workers (in exit case).
WiP

sebres added on 2017-05-03 16:58:55:
New insights about "R6016" bug:
- important to reproduce it faster - shutdown "Windows Defender" service:
net stop /y WinDefend
- test case to reproduce:
```tcl
set tclsh [info nameofexecutable]
time { 
  puts "[string repeat = 50] [incr i]. =="
  set out [exec $tclsh ../../tests/all.tcl -file *exec.* -match "*exec-3.* *exec-4.* *exec-5.*"]
  puts $out
  if {[regexp {====[^\n]+ FAILED|error R6016} $out]} {error FAILED}
} 100
```

Strange is:
- Bug occurs in "core-8-5-branch-fix-1997007" (with "fix-1997007"), independent with Jan's changes (GetModuleHandle) from [8785cdc62c] or without.
- Bug occurs in "core-8-5-branch-fix-1997007" reduced to relevant changes from 8.6 of "fix-1997007" only.
- NOT occurs in "core-8-6-branch" (with "fix-1997007").
- NOT occurs in "core-8-5-branch" (without "fix-1997007"), but sporadic previous BUG with suspended/hanging processes (resp. deadlock in ntdll).

It looks like either the tcl-process finalization differs greatly between tcl8.5 and tcl8.6. Or something using win-pipes was changed there in-between.

sebres added on 2017-05-03 14:00:07:
Been glad prematurely. 
The bug "R6016" is still present (although rarely now).

WiP

sebres added on 2017-05-03 10:51:18:
I'll happily merge it to core-8-5-branch next days and cherry-pick relevant changes (especially above mentioned bug with CreateThread) into 8.6/trunk.
So reviewers are welcome!

sebres added on 2017-05-03 10:40:46:
Back-ported to core-8-5-branch (see [core-8-5-branch-fix-1997007]).
Test cases fixed and normalized similar 8.6-version.

BTW By testing I found another very sporadic bug (present in all versions), and fixed now in this 8.5th branch (I'll partially merge it to 8.6/trunk later):

  "Runtime Error R6016 - Not enough space for thread data"

Bug goes to usage of CreateThread (together with CRT/TLS, that may cause conflict by loading DLL, resp. attach thread to the DLL).

See section "Remarks" in https://msdn.microsoft.com/en-us/library/windows/desktop/ms682453(v=vs.85).aspx

So introduces new threaded declarations TclWinThreadCreate/TclWinThreadExit:
because of usage of CRT and/or TLS (if threaded) - use the _beginthreadex/_endthreadex functions for thread management rather than CreateThread/ExitThread.
Prevents sporadic run-time errors, like R6016 (not enough space for thread data).

It is extremely sporadic and hard to reproduce (expected nested exec of something using tcl.dll and very fast multi-core CPU), e. g. it is reproduceable by repeated calls of the "exec.test".
For example:

% time { puts [exec tclsh85t.exe ../../tests/all.tcl -file *exec.*] } 50

==== exec-6.2 redirecting stderr through a pipeline FAILED
==== Contents of test case:

    exec [interpreter] "$path(sh)" -c "\"$path(echo)\" foo bar 1>&2" |& [interpreter] "$path(cat)"

---- Result was:
foo bar

runtime error R6016
- not enough space for thread data
---- Result should have been (exact matching):
foo bar
==== exec-6.2 FAILED

Fixed in [44defcb5cfc826569797f0bbe474595dbc454c01]

jan.nijtmans added on 2017-05-01 08:53:48:
/april 15/may 15/c

jan.nijtmans added on 2017-05-01 08:52:44:
Since "vague fears" is indeed the only thing I have, I now merged it to core-8-6-branch and trunk (leaving core-8-5-branch backport to be done). Actually, I prefer "more testing" above being too careful. But this means I would like to request the actual 8.6.7 release to be delayed to after - say - april 15. That should be enough to get everything tested well. And it would give time to factor out the recently discovered "file normalization" problem.

Thanks! (I'm very happy being corrected on this one ....)

sebres added on 2017-04-28 19:37:41:
> I'll happily revert the patch if people find trouble with it.

I'll happily fix the issues, if someone gets trouble with it :)

Unfortunately, sometimes such issues cannot clearly point to the broken place (where it exactly happens, resp. this definitely belongs to this "threaded" pipe-handling. 
But, in this case the potential problem can be little bit narrow down:
- if it occurs on windows but not on linux;
- if pipes involved (console, anonymous pipes, exec);
- e. g. happens implicit after opening/closing pipe (start/end of process);
etc.

From my side it was tested long enough (except the winSerial, but the logic there and even the code is the same).
So it looks good (and I guess all current issues seem to be fixed).

dgp added on 2017-04-28 18:03:11:
Can anyone articulate an actual reason other than
vague fears about why this should not go out into
8.6.7 RCs.

If you just want to "wait for more testing", I 
claim the RCs are your best bet to get that, and
I'll happily revert the patch if people find trouble
with it.

Expect me to merge the branch soon if no one else
beats me to it, or gives an actual reason why the
code isn't right.

sebres added on 2017-04-28 15:08:16:
Ok, in-between I'll make a back-porting for core-8-5-branch (as fix-1997007-8-5)

jan.nijtmans added on 2017-04-28 14:36:57:
Since Tcl 8.6.7 is about to be released, I would recommend to wait a few more days: Since the change is quite massive, I would prefer it to live in the "core-8-6-branch" (and core-8-5-branch and trunk) somewhat longer, so we can be more sure that nothing is broken.

I think everything is OK, I would like to see it merged soon! But waiting a few more days - since this bug is already in the core for sooooooo long time - shouldn't make the real difference.

I you don't merge it, I will merge it to core-8-6-branch just after the Tcl 8.6.7 release ..... Promised .... !  But if you prefer to do it yourself, you have my blessing!

Thanks for all the work! And thanks to Alex for his review and remarks!
      Jan Nijtmans

ferrieux added on 2017-04-28 14:28:16:
Obviously :)

sebres added on 2017-04-27 16:38:26:
[ping] merge?

ferrieux added on 2017-04-11 22:00:50:
OK, but note that exit within a fileevent or within a timer are exactly the same re stack unwinding (because fileevents also end up as the equivalent of [after 0]), and that's usually not deemed too aggressive since it (exit from a callback) has worked like a charm for decades (the event loop is a very old concept in Tcl history, unlike threads).

Anyway, as you noted, this no longer matters since you got to the core of the issue, smashed the bug, tidied up a dozen other things on the way, and left the place all clean. Your work is awesome. Thanks a lot !!!

sebres added on 2017-04-11 18:47:53:
@ferrieux Thanks a lot... but in-between I've found already the bug, that was in tclWinPipe and tclWinConsole.
I was just wondering that without this line, it works nevertheless perfect on my windows (and linux platforms).
Now it does no matter (ATM).

But just by the way:

> If the former: leaving long-running useless children is not really clean, right ?

IMHO the right way would be to use something like "after ... timeout" (instead of exit in readable):

+ after 5000 {set x timeout}
+ # or even
+ after 5000 exit
  vwait x

> If the latter: ...

Well, exit in readable is IMHO too aggressive, because may be even too error-prone, just because of tcl finalization of all subsystem's during exit (including notify, event, alloc etc), but we doing exit in event there!
The goal of the cases from "socket.test" is to cover the socket module, not the whole tcl-handling (like pipes, events, notifying etc.).

Just as comment BTW.

ferrieux added on 2017-04-11 18:28:27:
Unclear whether the question is "why we should" (why is it a good thing to...) or "why this should" (how does it reach that goal).

If the former: leaving long-running useless children is not really clean, right ?

If the latter: as soon as the interpreter doing all these tests exits, it closes all its pipes to those children, leading to EOF on their side, hence the [fileevnt readble exit].

sebres added on 2017-04-11 18:27:43:
Fixed now in branch fix-1997007, short description:

- common pipe-helper threads handling over all win-modules (winPipe, winConsole, winSerial), thus fewer code-duplication
- faster and more robust pipe-workers handling:
   * shared structures of pipe-workers rewritten using atomic "state" of the thread;
   * asynchronous start/stop of pipe-workers (if possible), try the soft way to end workers using cancelSynchronousIo before it would be terminated;
   * the workers are almost never terminated (so no hang-ups and deadlocks in ntdll);
- several typos/copy&paste bug fixed, see [3ce2af8196] and [8fb632a498] (I'll additionally cherry-pick this separately to 8.6-branch, together with fixes for sporadic test-cases bugs, see two last commits);


I cannot reproduce the mentioned bugs anymore ([1997007], [1939509], [9d75181]).

sebres added on 2017-04-11 15:33:27:
As regards hang-up in "socket.test", it looks like the malefactor is the `fileevent stdin readable exit`. At least it is not reproducible, if this line is removed (on windows as well as on linux).

@all: can someone say why this should "avoid having to wait for some child processes to time out after all tests have been run", as introduced in [1c63c22e4d1939f1] (I think another lines are copy/paste of this)?
Any ideas?

sebres added on 2017-04-11 10:28:10:
Got it! The hang-up goes to "wrong" exit from PipeReaderThread (where in opposition to PipeWriterThread, the event "readable" will not be set before the pipe-worker leaves the process cycle, so "deadlocking" waiter for readable in flush by close channel, if child-process ends and closes std-handles).

So I'll fix it together with the new version of this patch.

sebres added on 2017-04-11 10:01:18:
I was almost ready with the newer version of this fix, but have realized by testing that the test-case "socket-14.8.0" blocks sporadically (deadlocked resp. endless waiting).
Then I've tested it without any patch in original 8.6 with same result - it hangs sporadically. Adding of debugging "printf's" shows, that this occurs in the test clean-up by "close $fd", but in C-code it comes not to the PipeClose2Proc.

So seems to be an issue of Tcl_Close (together with windows).

Very annoying, so searching further (WiP).

jan.nijtmans added on 2017-04-06 15:21:38:
Since this issue is apparently not solved yet, changing status accordingly. Thanks "sebres" for all your work, looking forward for your solutions.

Closing my "bug-8bd13f07bd" branch, since it won't go forward anyway.

sebres added on 2017-04-06 15:07:28:
I found another way to handle this problem, namely using CancelSynchronousIo to cancel pending synchronous I/O operations for both reader/writer.

I'll propose soon a new version of fix with this.

sebres added on 2017-04-06 14:29:22:
So that's it with overlapped async-IO in winPipe:

According to MSDN "Asynchronous (overlapped) read and write operations are not supported by anonymous pipes. This means that you cannot use the ReadFileEx and WriteFileEx functions with anonymous pipes." Thx Microsoft!

https://msdn.microsoft.com/de-de/library/windows/desktop/aa365141(v=vs.85).aspx

Therefore it cannot be used as replacement for both reader/writer threads (for std-handles).


IMHO the only one way would be to a pending thread pool, as was suggested from me.

sebres added on 2017-04-05 19:38:25:
Unfortunately neither the original proposed patch nor my rewritten branch do not completely solve the issue (despite it is indeed better with both fixes);
I could reproduce again a suspend state of [9d75181] with original path and 
a dead-lock inside kernel with my branch (I've a newer version in-between without suspend/resume using SignalObjectAndWait instead, that is still more better but nevertheless error-prone).

I suspect this going to the undefined behavior resp. rather unexpected here, because under windows a thread waiting on a synchronization object can be momentarily removed from the wait state by a kernel-mode APC, and then returned to the wait state after the APC is complete (per design).

I think the only way would be to rewrite current handling completely using either async-IO (overlapped ReadFileEx/WriteFileEx) or using simple thread pool for the pending reader/writer, that firstly will become reusable functionality (like get pending or create), secondly can be sane removed after reach of some timeout resp. by system shutdown (e. g. within handler created with Tcl_CreateExitHandler).

I have experience with windows async-IO (and iocp), so can rewrite it with it.
I will take a look on winPipe, etc. and think about a little bit...

sebres added on 2017-04-05 10:46:09:
rebased to 8.6 as "fix-1997007" (if accepted I've a rebase for 8.5 also).

sebres added on 2017-04-05 09:50:39:
I'll push now a modification of this patch:
- wait to start thread using already available event handles (new events do not needed);
- prevents infinite waits (timeout 5000ms instead of infinite), etc;
- additionally fixed 1 typo-bug (set priority by the wrong thread handle);

Currently works 2 days non-stop without any problem.

jan.nijtmans added on 2017-04-05 09:36:47:

Committed proposed patch to "bug-8bd13f07bd" branch. Looks good to me, well-written! Compiled on win32/win64, and appears to run fine.

I'll wait a few days for more feedback, but I'm confident that this patch is at least an improvement. My +1 for merging to trunk.


danckaert added on 2017-04-05 07:34:49:

The patch proposed here also solves the issue reported in [1997007].

There are some other tickets that look similar: [1939509] , [9d75181].


Attachments: