Ticket UUID: | 3522560 | |||
Title: | iocmd.tf-24.15,16 are brittle (refchan+transfer) | |||
Type: | Bug | Version: | obsolete: 8.6b2 | |
Submitter: | ferrieux | Created on: | 2012-04-30 14:06:29 | |
Subsystem: | 26. Channel Transforms | Assigned To: | andreas_kupries | |
Priority: | 8 | Severity: | ||
Status: | Closed | Last Modified: | 2012-05-11 04:07:05 | |
Resolution: | Fixed | Closed By: | ferrieux | |
Closed on: | 2012-05-09 22:34:39 | |||
Description: |
Tests iocmd.tf-24.15,16 deadlock since TIP #398. This is apparently due to their explicit dependency on the background flush-on-exit. I lack the deep understanding of these tests to go deeper, so assigning to you Andreas. Note that #398 still allows background flushes on [close], it is just [exit] which aborts it. Si maybe it is only a matter of waiting a bit since those are finite, very small flushes. But I'm equally ready to do more work in the #398 implementation if this is actually highlighting a more subtle impact on transforms/refchan. | |||
User Comments: |
ferrieux added on 2012-05-11 04:07:05:
I'm also having second thoughts about the branch created to enforce "no EAGAIN on blocking channels". Indeed, though the documented semantics seems to imply it, I could not produce a serious misbehavior with its violation, and the aforementioned comment about shared fds threatens of potential incompatibilities... Any thoughts ? ferrieux added on 2012-05-10 05:34:39: That last commit also re-enabled iocmd.tf-24.16, which now works reliably (along with .15). Closing. andreas_kupries added on 2012-05-10 05:29:02: 22:28:42 [360eb75e01] *CURRENT* Undone part of change [32d93a8414], keeping [chan postevent] synchronous for owner == handler. (user: andreask tags: trunk) andreas_kupries added on 2012-05-10 02:18:44: Committed. Uh. I worked on trunk. This is something we might wish to backport. 8.5 is likely affected by the same issue. === 2012-05-09 === 19:09:51 [32d93a8414] *CURRENT* * generic/tclIORChan.c [Bug 3522560]: Fixed the crash, enabled the test case. Modified [chan postevent] to properly inject the event(s) into the owner thread's event queue for execution in the correct context. Renamed the ForwardOpTo...Thread() function to match with our terminology. (user: andreask tags: trunk) 19:03:42 [92b2807753] * tests/ioCmd.test [Bug 3522560]: Added a test which crashes the core if it were not disabled as knownBug. For a reflected channel transfered to a different thread the [chan postevent] run in the handler thread tries to execute the owner threads's fileevent scripts by itself, wrongly reaching across thread boundaries. (user: andreask tags: trunk) ferrieux added on 2012-05-09 15:16:49: Here is a summary of where we stand, for those who are watching. As Andreas says, there were 3 issues intersecting here: (1) EAGAIN for a blocking channel ; not really a problem here, though branch bug-3522560 explores cleaning that up. (2) Incomplete finalization of blocked nb channels on thread-exit; fixed in trunk 2 days ago. (3) Improper cross-thread call into owner context in [chan postevent] for transferred refchans ; this one is the killer. Note that (2) was, by luck, the trigger to make (3) manifest ... Now (3) is being fixed by Andreas's work. Since we have a demonstrating setup, a specific bug and commit will follow shortly. ferrieux added on 2012-05-07 23:40:46: Andreas, I found why you weren't seeing your CloseProc called. It was indeed an exaggerated "shortcut" in FinalizeIO when #398 is active. It is now fixed in trunk. Please continue your tests from trunk; the branch bug-3522560, which is *not * fixed, holds an orthogonal feature (enforcing nonblocking for EAGAIN), which may or may not be validated, and will remain mergeable this way. ferrieux added on 2012-05-06 16:28:28: Just to let people know: a BIG fix by Andreas is cooking: let [chan postevent] work properly in transferred refchans. ferrieux added on 2012-05-03 04:50:28: File Added - 442620: refchanbug.patch ferrieux added on 2012-05-03 04:49:57: File Deleted - 442487: ferrieux added on 2012-05-03 04:11:05: Updated branch now does [fconfigure -blocking 0] where it should have in the first place. ferrieux added on 2012-05-02 04:08:38: Changing the title, lowering prio, according to previous observations. ferrieux added on 2012-05-01 23:49:56: Further digging shows a deeper problem in the "transfer-refchan" combination. The hidden [update] needed for interthread exchanges raises complex reentrancy issues. This preexisted, and has nothing to do with #398. ferrieux added on 2012-05-01 06:15:07: (the previous patch is against trunk, not the experimental constraint-enforcing branch) ferrieux added on 2012-05-01 06:14:34: File Added - 442487: refchanbug.patch ferrieux added on 2012-05-01 06:13:28: While seeking the way to fix the tests, we approached more closely the teardown of a cross-thread reflected channel. Attached patch to the tests generates on iocmd.tf-24.16 a panic: FlushChannel: damaged channel list After the following verbose dump: [alex@dellalex unix]$ ./tcltest ../tests/ioCmd.test -file ioCmd.test -match iocmd.tf-24.16 -testdir ../tests -verbose start FOO:initialize rc0 {read write} ---- iocmd.tf-24.16 start FOO:write rc0 ABC RET:EAGAIN FOO:watch rc0 write POSTEVENT FOO2:write rc0 ABC FOO2:watch rc0 {} ferrieux added on 2012-05-01 06:10:01: Trunk has a more generic fix: adding timeouts to vwaits in these tests. They still fail, but no longer deadlock. ferrieux added on 2012-05-01 04:27:11: Note that the known test failures induced by this patch are all iocmd*24.1[56] variants (tf or not, ioCmd2 or not). ferrieux added on 2012-05-01 04:16:33: Created branch bug-3522560 with the following comment: Explore enforcing the (implied) law that EAGAIN is forbidden for a blocking channel. Violation discovered is iocmd.tf-24.1[56]56 after TIP#398. Turns deadlock of these tests into frank failures. Note that the patch also includes adding timeouts to various vwaits in the tests. Indeed, at first glance the vwaits in the -cleanup part sound fishy, but they are needed to avoid inter-test event leaks, as has been seen before (see fossil history of these tests). So just time out after 1 sec, for they should be instantaneous. The C part of the patch raises an error when a blocking channel returns EAGAIN or EWOULDBLOCK. This is an experimental feature, due to an ominous comment in the code: * This used to check for CHANNEL_NONBLOCKING, and panic if * the channel was blocking. However, it appears that setting * stdin to -blocking 0 has some effect on the stdout when * it's a tty channel (dup'ed underneath) note we no longer panic, but raise a proper error, since it could be a simple script error in a refchan. Still, the threat of a possible stdout becoming nonblocking at the ioctl level, while still remaining blocking as seen by Tcl (due to fd sharing and [fconfigure stdin -blocking 0]), must be investigated. Please test !!! |
Attachments:
- refchanbug.patch [download] added by ferrieux on 2012-05-03 04:50:28. [details]