Tcl Source Code

View Ticket
Login
Ticket UUID: 3522560
Title: iocmd.tf-24.15,16 are brittle (refchan+transfer)
Type: Bug Version: obsolete: 8.6b2
Submitter: ferrieux Created on: 2012-04-30 14:06:29
Subsystem: 26. Channel Transforms Assigned To: andreas_kupries
Priority: 8 Severity:
Status: Closed Last Modified: 2012-05-11 04:07:05
Resolution: Fixed Closed By: ferrieux
    Closed on: 2012-05-09 22:34:39
Description:
Tests iocmd.tf-24.15,16 deadlock since TIP #398.
This is apparently due to their explicit dependency on the background flush-on-exit.
I lack the deep understanding of these tests to go deeper, so assigning to you Andreas.

Note that #398 still allows background flushes on [close], it is just [exit] which aborts it. Si maybe it is only a matter of waiting a bit since those are finite, very small flushes. 

But I'm equally ready to do more work in the #398 implementation if this is actually highlighting a more subtle impact on transforms/refchan.
User Comments: ferrieux added on 2012-05-11 04:07:05:
I'm also having second thoughts about the branch created to enforce "no EAGAIN on blocking channels".
Indeed, though the documented semantics seems to imply it, I could not produce a serious misbehavior with its violation, and the aforementioned comment about shared fds threatens of potential incompatibilities...
Any thoughts ?

ferrieux added on 2012-05-10 05:34:39:
That last commit also re-enabled iocmd.tf-24.16, which now works reliably (along with .15).
Closing.

andreas_kupries added on 2012-05-10 05:29:02:
22:28:42 [360eb75e01] *CURRENT* Undone part of change [32d93a8414], keeping
         [chan postevent] synchronous for owner == handler. (user: andreask
         tags: trunk)

andreas_kupries added on 2012-05-10 02:18:44:
Committed. Uh. I worked on trunk. This is something we might wish to backport. 8.5 is likely affected by the same issue.

=== 2012-05-09 ===
19:09:51 [32d93a8414] *CURRENT* * generic/tclIORChan.c [Bug 3522560]: Fixed
         the crash, enabled the test case. Modified [chan postevent] to
         properly inject the event(s) into the owner thread's event queue for
         execution in the correct context. Renamed the ForwardOpTo...Thread()
         function to match with our terminology. (user: andreask tags: trunk)

19:03:42 [92b2807753] * tests/ioCmd.test [Bug 3522560]: Added a test which
         crashes the core if it were not disabled as knownBug. For a reflected
         channel transfered to a different thread the [chan postevent] run in
         the handler thread tries to execute the owner threads's fileevent
         scripts by itself, wrongly reaching across thread boundaries. (user:
         andreask tags: trunk)

ferrieux added on 2012-05-09 15:16:49:
Here is a summary of where we stand, for those who are watching.

As Andreas says, there were 3 issues intersecting here:

 (1) EAGAIN for a blocking channel ; not really a problem here, though branch bug-3522560 explores cleaning that up.

 (2) Incomplete finalization of blocked nb channels on thread-exit; fixed in trunk 2 days ago.

 (3) Improper cross-thread call into owner context in [chan postevent] for transferred refchans ; this one is the killer.

Note that (2) was, by luck, the trigger to make (3) manifest ...

Now (3) is being fixed by Andreas's work. Since we have a demonstrating setup, a specific bug and commit will follow shortly.

ferrieux added on 2012-05-07 23:40:46:
Andreas, I found why you weren't seeing your CloseProc called. It was indeed an exaggerated "shortcut" in FinalizeIO when #398 is active. It is now fixed in trunk.

Please continue your tests from trunk; the branch bug-3522560, which is *not * fixed, holds an orthogonal feature (enforcing nonblocking for EAGAIN), which may or may not be validated, and will remain mergeable this way.

ferrieux added on 2012-05-06 16:28:28:
Just to let people know: a BIG fix by Andreas is cooking: let [chan postevent] work properly in transferred refchans.

ferrieux added on 2012-05-03 04:50:28:

File Added - 442620: refchanbug.patch

ferrieux added on 2012-05-03 04:49:57:

File Deleted - 442487:

ferrieux added on 2012-05-03 04:11:05:
Updated branch now does [fconfigure -blocking 0] where it should have in the first place.

ferrieux added on 2012-05-02 04:08:38:
Changing the title, lowering prio, according to previous observations.

ferrieux added on 2012-05-01 23:49:56:
Further digging shows a deeper problem in the "transfer-refchan" combination. The hidden [update] needed for interthread exchanges raises complex reentrancy issues. This preexisted, and has nothing to do with #398.

ferrieux added on 2012-05-01 06:15:07:
(the previous patch is against trunk, not the experimental constraint-enforcing branch)

ferrieux added on 2012-05-01 06:14:34:

File Added - 442487: refchanbug.patch

ferrieux added on 2012-05-01 06:13:28:
While seeking the way to fix the tests, we approached more closely the teardown of a cross-thread reflected channel. Attached patch to the tests generates on iocmd.tf-24.16 a panic:

FlushChannel: damaged channel list

After the following verbose dump:

[alex@dellalex unix]$ ./tcltest ../tests/ioCmd.test -file ioCmd.test -match iocmd.tf-24.16 -testdir ../tests -verbose start
FOO:initialize rc0 {read write}
---- iocmd.tf-24.16 start
FOO:write rc0 ABC
RET:EAGAIN
FOO:watch rc0 write
POSTEVENT
FOO2:write rc0 ABC
FOO2:watch rc0 {}

ferrieux added on 2012-05-01 06:10:01:
Trunk has a more generic fix: adding timeouts to vwaits in these tests. They still fail, but no longer deadlock.

ferrieux added on 2012-05-01 04:27:11:
Note that the known test failures induced by this patch are all iocmd*24.1[56] variants (tf or not, ioCmd2 or not).

ferrieux added on 2012-05-01 04:16:33:
Created branch bug-3522560 with the following comment:

Explore enforcing the (implied) law that EAGAIN is forbidden for a blocking channel. Violation discovered is iocmd.tf-24.1[56]56 after TIP#398. Turns deadlock of these tests into frank failures.

Note that the patch also includes adding timeouts to various vwaits in the tests. Indeed, at first glance the vwaits in the -cleanup part sound fishy, but they are needed to avoid inter-test event leaks, as has been seen before (see fossil history of these tests). So just time out after 1 sec, for they should be instantaneous.

The C part of the patch raises an error when a blocking channel returns EAGAIN or EWOULDBLOCK.
This is an experimental feature, due to an ominous comment in the code:

   * This used to check for CHANNEL_NONBLOCKING, and panic if
   * the channel was blocking. However, it appears that setting
   * stdin to -blocking 0 has some effect on the stdout when
   * it's a tty channel (dup'ed underneath)

note we no longer panic, but raise a proper error, since it could be a simple script error in a refchan.
Still, the threat of a possible stdout becoming nonblocking at the ioctl level, while still remaining blocking as seen by Tcl (due to fd sharing and [fconfigure stdin -blocking 0]), must be investigated.

Please test !!!

Attachments: