Ticket UUID: | 780533 | |||
Title: | [fcopy] callback sometimes fails to trigger | |||
Type: | Bug | Version: | obsolete: 8.5.2 | |
Submitter: | msofer | Created on: | 2003-07-30 21:45:03 | |
Subsystem: | 24. Channel Commands | Assigned To: | andreas_kupries | |
Priority: | 9 Immediate | Severity: | ||
Status: | Closed | Last Modified: | 2008-04-05 03:15:24 | |
Resolution: | Fixed | Closed By: | andreas_kupries | |
Closed on: | 2008-04-04 20:15:24 | |||
Description: |
Assigned to dgp for review, as it might have something to do with [Bug 218203]). What I see is: let $s be a socket, $f a normal file, $nBytes a "large" number (8kB or above seems to trigger the error with good probability), both channels with "-translation binary". Then the trouble is that fcopy $s $f -size $nBytes -command myCallback often fails to notice that $nBytes *were* already written to $f, and hence never calls myCallback. It might be an off-by-one heisenbug: as I said, it happens often but not always - and it seems never to happen if one more byte than promised comes through the socket. I attached here a script that produces the error with good probability; uncommenting line 72 does the "one more byte" trick. The script attached below produces the following results on RH7.2 (relevant??): [mig@mini test2]$ /CVS/tcl_SF_clean/unix/tclsh test.tcl /CVS/tcl_SF_clean/unix/tclsh - 8.5a0 client: done sending server: copying 16384 bytes to tempFile client: won't wait anymore ... server: [fcopy] wanted 16384, and got 16384 server: [fcopy] doesn't tell us we are finished, exit anyway whereas the correct output (uncommenting line 72) is [mig@mini test2]$ /CVS/tcl_SF_clean/unix/tclsh test.tcl /CVS/tcl_SF_clean/unix/tclsh - 8.5a0 server: copying 16384 bytes to tempFile client: done sending server: copied 16384 bytes to tempFile. server: all done | |||
User Comments: |
andreas_kupries added on 2008-04-05 03:15:24:
Logged In: YES user_id=75003 Originator: NO Created the test case and committed to 8.4 branch, 8.5 branch, and head. Closing this bug. Finally. ferrieux added on 2008-04-04 22:37:38: Logged In: YES user_id=496139 Originator: NO Oops, sorry, I forgot. No hurry :-) andreas_kupries added on 2008-04-04 22:24:15: Logged In: YES user_id=75003 Originator: NO In principle yes. I left it open as reminder to convert your test script to actual test case. I can close it if you want. ferrieux added on 2008-04-04 14:58:11: Logged In: YES user_id=496139 Originator: NO Andreas, shouldn't this one be closed ? I'm not pressing for this (maybe you have other tests in mind), it's just for consistency with 1932639 which is much younger and already closed. andreas_kupries added on 2008-04-03 03:33:06: File Added - 272934: FC.tcl Logged In: YES user_id=75003 Originator: NO Attaching Alexandre's final test script. Committed the fix to Tcl 8.4 branch, Tcl 8.5 branch, and head. Still have to convert the test script into a test case. File Added: FC.tcl ferrieux added on 2008-04-01 00:14:32: Logged In: YES user_id=496139 Originator: NO Found and fixed. This was a day-1 bug... Bug summary: doing an async [fcopy -size $size], if the input stalls (not EOF) after exactly $size bytes, the callback is not called. The problem is in the CopyData function, near the end of the while(toRead) loop, in the if (cmdPtr) block. Apply the following patch: - if (cmdPtr) { + if (cmdPtr &&(csPtr->toRead != 0)) { /* * The first time we enter this code, there won't be a channel * handler established yet, so do it here. */ Explanation: this test is here to "break out" of the entire routine (not just the while loop) with a return TCL_OK, in async mode, after making sure we have a fileevent to wake us up later. The problem is that at this spot the remaining number of bytes to read due to [-size] has already been updated, and when it reaches zero, we should only break of the while loop in order to terminate properly (and call the callback). Hence the fix: just add "&&(csPtr->toRead != 0)", so that it will fall thru to the beginning of the loop, which does test toRead. We can also understand why it was "rare" and "random": this is because people often use "-size" only as a progress indicator, so the pump is often "re-primed" by additional data on the channel coming after the given -size bytes. But what about the last chunk ? I believe people often use a channel with a proper EOF (not a never-ending pipe like in Miguel's ad hoc example), which completely hides the issue since it fires the readable fileevent ! Below is a small script showing the bug with probability one: the script blocks forever. Patched version completes normally after reading 3 lines from the child process. Test suite OK. -Alex set pi [open "|[info nameofexecutable] 2>@ stderr" r+] fconfigure $pi -translation binary -buffering line fconfigure stdout -translation binary -buffering line puts $pi { fconfigure stdout -translation binary -buffering line puts stderr Waiting... after 1000 foreach x {a b c} { puts stderr Looping... puts $x after 500 } puts stderr Now-sleeping-forever while {[gets stdin line]>=0} {} exit } proc done args {puts stderr DONE;exit} fcopy $pi stdout -size 6 -command done vwait forever hobbs added on 2008-03-30 06:44:18: Logged In: YES user_id=72656 Originator: NO Via Cleverly (non-default value key to trigger error): Consistently reproducible for me for values of 21, 23, 24, 25 & 26 on OS X 10.5.2 (intel) w/Tcl 8.5.0. 22 always seems to work. I haven't tried >= 27. With 8.5.2rc1 on Linux 2.6.18 (sparc64) it seems to be consistently reproducible for values of 14, 16, 17, 18, 19 & 20. 15 always seems to work. I haven't tried >= 21 on this platform. With 8.5.2rc1 on OpenBSD 4.2 (sparc64) it seems to be consistently reproducible for values of 11 & 12. 13-18 always seem to work. 19 & 20 sometimes worked, sometimes didn't. (For example, out of twenty consecutive runs passing 20 as an argument 14 failed, 6 succeeded.) I haven't tried >= 21 on this platform. hobbs added on 2008-03-30 05:04:32: Logged In: YES user_id=72656 Originator: NO Works for me with 8.5.1 and 8.4.18. msofer added on 2004-04-01 21:22:50: Logged In: YES user_id=148712 Is this a real bug, or mainly EBKAC? If it is a real bug, it needs attention - if it is EBKAC, I need a spanking ... The bug I (think that I) see is: [fcopy] is failing to call its callback after copying the required number of bytes. The test script lacks comments, portability ('puts stderr'), and nicety (writes a tempFile in wd, leaves it there), sorry. What it does is: 1. opens a server socket; on connection 4 bytes are read (number N of following bytes), then N bytes are fcopied to tempFile, calling the doneCopy callback when done. doneCopy writes a log to stderr, end then causes the app to exit. The app exits also after receiving the second message through the pipe (used to signal a failure). 2. opens (via pipe) a second tclsh that connects to the server and sends a binary (length+message). After message end, the client waits for 10 seconds and prints an exit message - if it still alive, which should not happen as the pipe should have been closed before that, when the main process exits. Observed behaviour: fcopy's callback is not called after the 10 seconds, the client prints its exit message. Should have observed: fcopy's callback is called, no client exit message seen. Weirdness: if the client sends an extra byte down the socket, all's well. msofer added on 2004-03-27 05:24:22: Logged In: YES user_id=148712 Bug still there, newer linux here :( dgp added on 2003-11-13 09:21:26: Logged In: YES user_id=80530 The test script works for me for both the 8.4 branch and HEAD, on both Solaris 9, and Linux/Alpha. msofer added on 2003-07-31 21:14:56: Logged In: YES user_id=148712 More: * can't repro when the slave is a pipe. * bug also present for client and server on different machines msofer added on 2003-07-31 20:07:49: Logged In: YES user_id=148712 Addenda: * failure reproduced in win2k * the [fcopy] callback does occur after the client closes the socket * different -buffering settigs do not seem to change behaviour msofer added on 2003-07-31 04:45:03: File Added - 57359: test.tcl |