Tcl Source Code

View Ticket
Login
Ticket UUID: 1997007
Title: [open|] blocks when redirecting stderr to pipe (Vista)
Type: Bug Version: obsolete: 8.6a1
Submitter: danckaert Created on: 2008-06-18 16:11:19
Subsystem: 24. Channel Commands Assigned To: ferrieux
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2017-04-05 09:46:47
Resolution: Duplicate Closed By: jan.nijtmans
    Closed on: 2017-04-05 09:46:47
Description:
This report refers to the standalone [pipe] command which is currently part of TclX, but proposed for inclusion in the core in TIP304.

When using the [open|] command in combination with [pipe] to redirect a process' stderr, the [open|] command sometimes blocks on Vista:
  pipe pr pw
  open |[list {*}$command 2>@ $pr]

This happens both with Tcl 8.5.2 and 8.6.

The application goes into "not responding" state, and doesn't wake up again. It's difficult to reproduce (only happens between 1/100 ... 1/1000 times on average), but it never happens when using "2@>1" instead. It doesn't happen either when redirecting to a fake pipe created by another open command.

Further investigation learnt that it actually blocks during the "CreateProcess" system call, which should not block according to Microsoft's documentation (it should either fail or succeed). The launched process is visible in the Task manager, but it is in "suspended" state and has a very low memory footprint (about 56K).
User Comments: jan.nijtmans added on 2017-04-05 09:46:47:

Let's close this one, as dup of [8bd13f07bd]: "Closing tcl pipes prevents windows threads from starting up". Since Koen reports that the proposed patch there seems to fix this issue too.

If - after the acceptence of the proposed patch there - this issue appears not solved, feel free to re-open!


sebres added on 2017-04-04 17:26:16:
Just as possible canditate for similar issue [9d75181ee70af318830e99ede6ebb5df72a9b079]

danckaert added on 2017-04-04 15:17:07:

Many years later, it appears that this issue is similar to the one reported in http://core.tcl.tk/tcl/tktview?name=8bd13f07bde6fb06.

The patch proposed there (http://core.tcl.tk/tcl/artifact/7db16089766c834d) fixes this issue too.


danckaert added on 2008-08-26 19:57:29:

File Added - 290769: procexp.log

Logged In: YES 
user_id=1388916
Originator: YES

I've not looked at this for a while, because I have the feeling this "remote debugging" is not really leading anywhere. Also, I somewhat hoped the problem would be solved by Vista SP1, or by compiling with MSVC 9.0. I already tried the latter, and that didn'nt help... As for SP1, it seems my turn to download it hasn't arrived yet.

In any case, trying to answer your questions:
- Children comsuming CPU and no pipes: never block.
- Children consuming no CPU, but using pipes: yes, sometimes they also block when overall CPU consumption is high.
- I've attached a log file of ProcExp.
- [chan names] reports what is to be expected. If I launch 5 parallel processes; it reports 10 channels + the standard channels.

File Added: procexp.log

ferrieux added on 2008-08-20 02:29:42:
Logged In: YES 
user_id=496139
Originator: NO

Koen, I'm still interested in your answers :-)

ferrieux added on 2008-07-03 17:15:23:
Logged In: YES 
user_id=496139
Originator: NO

OK. New bunch of ideas:
- try to separate overall CPU consumption from children CPU consumption. For this spawn two copies of your program in parallel: one with children consuming CPU and *no* TclX pipes, one with waiting children (cat or equivalent) and TclX pipes. See if any of them blocks in each category.
- what about overall handle count (ProcExp or Windows Task Manager) ?
- what about [chan names] ?

danckaert added on 2008-07-03 17:02:01:
Logged In: YES 
user_id=1388916
Originator: YES

That's right, I don't see the deadlock if the children just wait on their input, or if they just do a sleep().
When a child blocks, it blocks immediately, even before it has been properly initalized by the OS. See the last part of the plm.log file I attached earlier.
I never write anything to the stdin of the children. They write to stdout and occasionaly to stderr, but I also observe the deadlock with a child that doesn't write anything (like the one in my previous message).

ferrieux added on 2008-07-01 04:15:35:
Logged In: YES 
user_id=496139
Originator: NO

This CPU consumption effect is interesting. So basically if the children only block on their input (like cat), you never observe the deadlock, right ?
Can you also investigate about what portion of its life the blocked child has lived before blocking (by writing to a file) ?
Also, again what about the IO done on the pipe ?

danckaert added on 2008-06-30 22:54:42:
Logged In: YES 
user_id=1388916
Originator: YES

Here are some answers:
- The problem happens irrespective of the child, as long as it takes at least some time. I can reproduce it with the following child: int main() {for (int i = 0; i < 10000000; ++i); return 0;}
- It also happens with exec&, and also with >@ instead of 2>@, and also when redirecting one of them to NUL:
- I can't reproduce it with the faked pipe. Even tried with a single faked pipe created before all child processes.
- Hmm, also can't reproduce it when using a single TclX pipe... (instead of one per child).

ferrieux added on 2008-06-30 18:51:23:
Logged In: YES 
user_id=496139
Originator: NO

Excellent ! Thanks for this decisive test Koen.
Now we still have problem, Houston...
But at least we'll be barking up the proper old tree ;-)

A few ideas now:
- is 2>@ different from >@ in that respect (use vanilla [exec &] and try separate >@ and 2>@ to a pipe, then both) ?
- for the 99% cases that work, what kind of IO do you do on the pipes ? Do the children write much to stderr ? Do you write much to the children's stdin ?
- when you say that the faked pipe removes the issue, I suspect a Heisenbug effect due to the kernel being distracted by the machinery necessary for the faked pipe. Can you create the faked pipe longer before spawning ? (possibly create all of them beforehand; otherwise use [after] to schedule final [exec&] several seconds after [open |cat r+])
- try to lower handle consumption: redo the failing test, but without the stdouts at all (> NUL:)
- same, but converging all 2>@ stderr to the same, single TclX pipe.

That's it for now ;-)

danckaert added on 2008-06-30 17:41:48:
Logged In: YES 
user_id=1388916
Originator: YES

Ok, I can also reproduce the problem with AS Tclsh 8.5.2 and TclX.

ferrieux added on 2008-06-27 22:54:09:
Logged In: YES 
user_id=496139
Originator: NO

> Using TclX itself is not an option,
> since it doesn't compile on Windows

??? There are binary versions all around the planet, including in the AS teapot.
Could you please take AS tclsh (not a recompiled one), TclX from teapot, and retry ?

-Alex

danckaert added on 2008-06-27 22:36:51:

File Added - 282940: strace.log

Logged In: YES 
user_id=1388916
Originator: YES

Sorry, didn't have time earlier to look at this again.

Using TclX itself is not an option, since it doesn't compile on Windows (Makefile.vc is severely broken).
However, I can reproduce the problem with plain tclsh8.6 (patched with chanpipe4, of course).

The application is single-interp/single-thread. It can spawn many parallel children, but the main app iself is purely event-based.

The child process which blocks is random, it's not always the same.

I've attached the tail of an strace log for the main application (which was tclsh86.exe in this case).

File Added: strace.log

ferrieux added on 2008-06-27 03:45:41:
Logged In: YES 
user_id=496139
Originator: NO

Could you please quickly answer my questions ? Holding one's breath for a braindead OS is one thing, but doing that blindly (with no guarantee that it hits stock tclsh) is beyond me...

ferrieux added on 2008-06-20 22:12:37:
Logged In: YES 
user_id=496139
Originator: NO

Thanks for this first bunch of answers.
Now a few additions:
 - Can you use StraceNTx (all syscalls) instead of ProcMon (IO and registry only) ?
 - I can see your "Tcl" process has a nonstandard name "chess86mk.exe". Can you give details ? It is single-threaded/single-interpreter ?
 - In any case can you reproduce with vanilla tclsh+TclX ? <== VERY IMPORTANT 
 - Your example shows one success with mist.exe, one failure with showcolor.exe. Is it just a coincidence, or do all mist suceed and all showcolor fail ?

The fact that single-processor lowers the probability strongly hints at a race condition in Windows.
I know that inheritable handles are shared across all threads of the parent, thus you need a critical section around CreateProcess in that case, but in any case I sincerely hope you're *not* spawning children from different threads in the parent...

danckaert added on 2008-06-20 21:34:11:

File Added - 281993: plm.log

Logged In: YES 
user_id=1388916
Originator: YES

Sorry for the typos. I actually used "2>@ $pw" and "2>@1".

Yes, my machine is dual-core. When I disable the second core in the BIOS, it becomes much harder to reproduce the problem. Normally I can reproduce it in just one minute, with a single core it took half an hour (and I could reproduce only once up to now).

Some information about the nature of the processes:
- the children are simple console processes
- the main application exists in a console and GUI version; both have the same problem

The main application has an option to set the number of children to be launched in parallel. The higher this number, the earlier the problem shows up. When set to 1, it's hard to reproduce (but it still happens sometimes). The children are short-lived (<0.1s); the main app is set up such that a new child is launched when a previous one dies.

I've not been able to reproduce the problem on XP. Though I should add that that's a single-core machine.

Currently I can't use a debugger; this Vista machine is not my development machine. However, I've used a process monitor to observe the system calls made by the processes. I've attached the log file.

File Added: plm.log

ferrieux added on 2008-06-20 04:34:29:
Logged In: YES 
user_id=496139
Originator: NO

Also:
  - try on XP
  - is your machine a multi-processor by any chance ? (see 1939509)

ferrieux added on 2008-06-19 02:34:05:
Logged In: YES 
user_id=496139
Originator: NO

(Hope this one gets through, SF is read-only)

Two errors in your comment, but maybe these are just typos :-)

  - should be 2>@ $pw, not $pr
  - should be 2>@1 ,not 2@>1

Second, can you really try with TclX instead of 304 (to be sure) ?

Other things to try too:

  - exec {*}$cmd 2>@ $pw &
  - exec {*}$cmd >@ $pw &
  - exec {*}$cmd <@ $pr &
  - combinations

Also, once locked, can you attach to both parent and child with a debugger ?

Attachments: