Tcl Source Code: View Ticket

Ticket UUID:	2946474
Title:	[exit] doesn't wait for background flushes of close channels
Type:	Bug	Version:	None
Submitter:	ferrieux	Created on:	2010-02-05 11:06:45
Subsystem:	25. Channel System	Assigned To:	andreas_kupries
Priority:	9 Immediate	Severity:
Status:	Closed	Last Modified:	2011-08-18 03:36:34
Resolution:	Fixed	Closed By:	ferrieux
		Closed on:	2011-08-17 20:36:34
Description:	While [exit] normally waits for output flushing on all channels, including non-blocking ones, [close] also schedules a background flush, but a subsequent [exit] aborts that flush: (in another terminal: mkfifo /tmp/fifo;sleep 9999 < /tmp/fifo) set ff [open /tmp/fifo w] fconfigure $ff -blocking 0 puts -nonewline $ff [string repeat A 999999] flush $ff close $ff exit => exits immediately Comment out the 'close' above, and the [exit] blocks again.... This bug is present in 8.[456] HEAD, but not in 8.3.5.
User Comments:	ferrieux added on 2011-08-18 03:36:10: Many thanks. Integrated and committed. andreas_kupries added on 2011-08-18 03:02:57: Attached a file containing the iocmd.tf-24.16 test which captures and shows the write caused by bg-flush-on-close. Tested with and without the modification of tclIO.c (See the attached asyncflush2.patch). Fails without the change, succeeds with the patch. andreas_kupries added on 2011-08-18 02:59:02: File Added - 421393: asyncflush-new-test.txt ferrieux added on 2011-08-12 06:17:22: Updated ;-) ferrieux added on 2011-08-12 06:17:07: File Deleted - 361738: ferrieux added on 2011-08-12 06:16:45: File Added - 420921: asyncflush2.patch andreas_kupries added on 2011-08-11 22:36:22: "update"! Meh. Did not think of this simplest thing. Forcibly flush the event queue, of course. ferrieux added on 2011-08-11 06:55:47: Hum, it seems the threadReap does the necessary sync on the thread side... however some event callbacks are already posted, and never get a chance to the event loop (main thread) before 24.15 end. As proof, add 'update' to the end of the -cleanup of 24.15 and everything works :) Q: are there other places where this could be useful ? ferrieux added on 2011-08-11 06:31:34: When you say "Then the thread exits, the event is still in the queue.", what about the following slightly different interpretation: - the thread owns a channel - on that nonblocking channel, a background flush'n'close is scheduled - then the threads tries to exit :) - at that time, Tcl_FinalizeThread is called, and calls TclFinalizeIOSubsystem - without the patch, TFIOS just gives up instantly (the reason for this ticket) and the thread is dead. No further callback is to be feared on its behalf. - with the patch, TFIOS blocks on flushing the channel. So the callbacks (refchan) are done IN THE THREAD - for lack of synchronization, the execution of this thread overlaps the next test and wreaks havoc due to the shared variable names as you identified. If you agree with this sequence, then what about just joining the thread before ending 14.15 ? andreas_kupries added on 2011-08-11 06:17:12: It also works (the inter-test leakage) because we are using the same name (foo) for the Tcl handler of all these channels. The ghost flush of the channel 24.1 is put into the event queue of the main thread, because that is where the handler lives. Then the thread exits, the event is still in the queue. Later in 25.1 it has made it to the front of the queue, and is run, using the changed handler for 25.1 instead of the original of 24.1. So, we could stop the leak by using different proc names for the handler as well. In that case the event may cause a bgerror, trying to be run with a non-existent handler procedure. I do wonder if we can track enough information in the ref-channel that when it is gone in the thread any outstanding events can be canceled, i.e. by either removing them from the event queue, or just finding and marking them as 'deleted', causing the C level receiver to later discard it quietly when it comes to the head of the queue. Look at ForwardProc (tclIORchan.c), and 'SrcExitProc'. Data structures involved are ForwardingEvent and ForwardingResult Whatever is done to the RChan to handle the situation, the RTrans might need equivalent changes. ferrieux added on 2011-08-11 06:05:05: Analysis by aku: iocmd.tf-25.1 here suffers from a "ghost flush" from the previous test, 24.1, which uses threads. (the semantic leak is made possible by the fact that the script level is at work -- refchan) Indeed this reminds us that FinalizeIOSubsystem is also closed at the end of a thread's life (not just at process exit). So back to the workbench; this perspective was not in the initial design :} ferrieux added on 2011-08-11 05:26:48: Uh, now the patch makes iocmd.tf-25.1 fail: ==== iocmd.tf-25.1 chan configure, cgetall, standard options FAILED ==== Contents of test case: set res {} proc foo {args} {oninit; onfinal; track; note MUST_NOT_HAPPEN; return} set c [chan create {r w} foo] notes [inthread $c { note [fconfigure $c] close $c notes } c] rename foo {} set res ---- Result was: {write rc149 ABC} MUST_NOT_HAPPEN {-blocking 1 -buffering full -buffersize 4096 -encoding utf-8 -eofchar {{} {}} -translation {auto lf}} ---- Result should have been (glob matching): {-blocking 1 -buffering full -buffersize 4096 -encoding * -eofchar {{} {}} -translation {auto }} ==== iocmd.tf-25.1 FAILED andreas_kupries added on 2010-06-03 04:09:10: [13:38]ferrieuxHi Andreas [13:42]aku. [13:47]akuyes? [13:55]ferrieuxdunno what to do about 2946474 [13:56]akua moment [13:58]akupatch has test cases ? [14:01]ferrieuxno. it's a two-line patch. moreover, a test case must cope with a hanging [close], so must kill subprocess... [14:01]ferrieuxit's doable, just not done so far [14:01]ferrieux(eg with a secondary process draining the other side of a pipe slowly) [14:02]ferrieuxbut I'm more concerned by dgp's comment about changing things too fast ! [14:02]akuok, green light [14:03]ferrieuxoops sorry lost your last line [14:03]ferrieuxclosed the window [14:03]akuok, gren light - was last line [14:03]ferrieuxwhat about his concerns ? [14:05]akuhe actually said 'changing too much', not 'too fast' ... And he has a point regarding the understanding - I know that I do not have a clear understanding of the whole exit sequence and channels [14:05]akuI have no idea about making it better either, unfortunately. [14:06]ferrieuxwhat about going ahead, letting 8.6b2 users stress-test , and backporting at 8.6.0 time if no big issue ? [14:06]akuI'm fine with that [14:06]ferrieuxok thanks [14:06]akunp [14:07]ferrieuxwill build a test case, update the patch, and propose him the deal [14:07]akushall I put this thread into the bug ? As record of me saying 'go' [14:07]aku? [14:07]ferrieuxplease do [14:07]akuok dgp added on 2010-06-02 22:45:32: I still have nothing to add to this. Quoting: "So long as whatever you propose to do doesn't break the 8.4 branch again, I don't think I will mind it. I certainly have no knowledge on which to suggest anything better. All I have is the ongoing unease that we're tweaking all of these things -- channel closes and exit sequences -- too much for my taste, without any clear understanding on my part that we're actually making things better instead of replacing one set of problems with a different set of problems." Passing by the channel maintainer for lack of any better idea... ferrieux added on 2010-06-02 22:31:59: Raising the prio, waiting for a green (or red) light. Note that if we do nothing, we blatantly violate the docs. ferrieux added on 2010-02-12 05:32:35: I copied these elements to the true home of this red herring: 2762041. Now requesting permission to commit the current patch. On which HEAD ? ferrieux added on 2010-02-12 05:23:08: Further progress: switching back the channel to blocking mode and explicitly flushing it (in httpd11.tcl) yields exactly the same pattern of a single send() of 435 bytes. So the bug has nothing to do with the exit sequence, which is the subject of the present report. ferrieux added on 2010-02-12 05:12:42: A few more data points. First, note that a less intrusive variant of the bug-trigger is strace -f -o /dev/null make ... Second, the problem seems to be linked with the gzip stacked channel. Indeed the failure is due to only 8192 bytes getting transmitted out of the 130k of the file, and these 8192 bytes are compressed to 435 bytes, sent in one syscall: 16090 send(4, "\37\213\10\0\0\0\0\0\0\3\355\312\301\t\203@\24\4\320\273Ul\5\261\201\317V\221\0064\32\24"..., 435, 0) = 435 16090 close(4) = 0 16090 write(1, "close sock4\n", 12) = 12 So it's a failure to flush on the stacked channel side, and again the patch seems orthogonal to that. ferrieux added on 2010-02-12 04:01:27: FWIW, under linux I get the three tests http11-2.[678] to fail pretty (90%) reliably by spawning them under strace: strace -f -tt -T -o tra make test-tcl TESTFLAGS="-f http11.test" I have yet to comb through these huge trace files :/ However, the high probability of failure of the three tests is not affected by the application of this patch... dkf added on 2010-02-11 16:14:39: I suspect that this is related to an intermittent failure seen in http11-2.7: ==== http11-2.7 -channel,encoding deflate,non-chunked FAILED ==== Contents of test case: set tok [http::geturl http://localhost:$httpd_port/testdoc.html?close=1 -timeout 5000 -channel $chan -headers {accept-encoding deflate}] http::wait $tok seek $chan 0 set data [read $chan] list [http::status $tok] [http::code $tok] [check_crc $tok $data] [meta $tok connection] [meta $tok content-encoding] [meta $tok transfer-encoding] [expr {[file size testdoc.html]-[file size testfile.tmp]}] ---- Result was: ok {HTTP/1.1 200 OK} {crc32 mismatch: c554bc27 ne edc98c73} close deflate {} 121845 ---- Result should have been (exact matching): ok {HTTP/1.1 200 OK} ok close deflate {} 0 ==== http11-2.7 FAILED Note that this is an intermittent failure. I cannot generate it on demand, nor do I know the exact system conditions under which the problem shows itself. (Looks random and somewhat rare to me, but a failure to send the last chunk could do it.) dgp added on 2010-02-11 11:55:13: I no longer have any memory of what I did to come up with that patch. Indeed, I can't even tell you if I invented it, or merely committed something from someone else. Reading the comments on prior bugs, it looks like the only reason I got involved was because some backport broke the stable branch, and I came up with something to undo the breakage. So long as whatever you propose to do doesn't break the 8.4 branch again, I don't think I will mind it. I certainly have no knowledge on which to suggest anything better. All I have is the ongoing unease that we're tweaking all of these things -- channel closes and exit sequences -- too much for my taste, without any clear understanding on my part that we're actually making things better instead of replacing one set of problems with a different set of problems. ferrieux added on 2010-02-10 14:38:50: OK. Now, considering that the patch attached here does not* revert the fix you did, and AFAICT fixes all known problems of exit-time flushes, may I commit it ? (Note that it is key to making TIP 361 interesting) Backporting seems also interesting since the legacy branches' HEADs are currently violating the manpages; but this can happen after some time. dgp added on 2010-02-10 12:55:32: Everything I know is recorded at 2371600. ferrieux added on 2010-02-08 05:51:29: Bug can be traced to 2008-12-01 commit by dgp on core-8-4-branch: 2008-12-01 Don Porter <[email protected]> 288 289 * generic/tclIO.c (TclFinalizeIOSubsystem): Revised latest commit to 290 something that doesn't crash the test suite. 291 where 'latest commit' refers to: 292 2008-11-25 Andreas Kupries <[email protected]> 293 294 * generic/tclIO.c (TclFinalizeIOSubsystem): Applied backport of 295 Alexandre Ferrieux's patch for [Bug 2270477] to prevent infinite 296 looping during finalization of channels not bound to 297 interpreters. ferrieux added on 2010-02-05 18:09:39: File Added - 361738: asyncflush.patch ferrieux added on 2010-02-05 18:08:03: Attaching a patch to 8.6 fixing the issue, by special-casing in TclFinalizeIOSubsystem of the CHANNEL_CLOSED\|BG_FLUSH_SCHEDULED flagset and turning it back to a synchronous flush.

Attachments:

asyncflush-new-test.txt [download] added by andreas_kupries on 2011-08-18 02:59:02. [details]
asyncflush2.patch [download] added by ferrieux on 2011-08-12 06:16:45. [details]