Tcl Source Code

View Ticket
Login
Ticket UUID: 780533
Title: [fcopy] callback sometimes fails to trigger
Type: Bug Version: obsolete: 8.5.2
Submitter: msofer Created on: 2003-07-30 21:45:03
Subsystem: 24. Channel Commands Assigned To: andreas_kupries
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2008-04-05 03:15:24
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2008-04-04 20:15:24
Description:
Assigned to dgp for review, as it might have something
to do with [Bug 218203]).

What I see is: let $s be a socket, $f a normal file,
$nBytes a "large" number (8kB or above seems to trigger
the error with good probability), both channels with
"-translation binary". Then the trouble is that
  fcopy $s $f -size $nBytes -command myCallback
often fails to notice that $nBytes *were* already
written to $f, and hence never calls myCallback.

It might be an off-by-one heisenbug: as I said, it
happens often but not always - and it seems never to
happen if one more byte than promised comes through the
socket.

I attached here a script that produces the error with
good probability; uncommenting line 72 does the "one
more byte" trick. 

The script attached below produces the following
results on RH7.2 (relevant??):
 [mig@mini test2]$ /CVS/tcl_SF_clean/unix/tclsh test.tcl 
 /CVS/tcl_SF_clean/unix/tclsh - 8.5a0
 client: done sending
 server: copying 16384 bytes to tempFile
 client: won't wait anymore ...
 server: [fcopy] wanted 16384, and got 16384
 server: [fcopy] doesn't tell us we are finished, exit
anyway

whereas the correct output (uncommenting line 72) is
 [mig@mini test2]$ /CVS/tcl_SF_clean/unix/tclsh test.tcl 
 /CVS/tcl_SF_clean/unix/tclsh - 8.5a0
 server: copying 16384 bytes to tempFile
 client: done sending
 server: copied  16384 bytes to tempFile.
 server: all done
User Comments: andreas_kupries added on 2008-04-05 03:15:24:
Logged In: YES 
user_id=75003
Originator: NO

Created the test case and committed to 8.4 branch, 8.5 branch, and head.
Closing this bug. Finally.

ferrieux added on 2008-04-04 22:37:38:
Logged In: YES 
user_id=496139
Originator: NO

Oops, sorry, I forgot. No hurry :-)

andreas_kupries added on 2008-04-04 22:24:15:
Logged In: YES 
user_id=75003
Originator: NO

In principle yes. I left it open as reminder to convert your test script to actual test case. I can close it if you want.

ferrieux added on 2008-04-04 14:58:11:
Logged In: YES 
user_id=496139
Originator: NO

Andreas, shouldn't this one be closed ? I'm not pressing for this (maybe you have other tests in mind), it's just for consistency with 1932639 which is much younger and already closed.

andreas_kupries added on 2008-04-03 03:33:06:

File Added - 272934: FC.tcl

Logged In: YES 
user_id=75003
Originator: NO

Attaching Alexandre's final test script. Committed the fix to Tcl 8.4 branch, Tcl 8.5 branch, and head. Still have to convert the test script into a test case.
File Added: FC.tcl

ferrieux added on 2008-04-01 00:14:32:
Logged In: YES 
user_id=496139
Originator: NO

Found and fixed. This was a day-1 bug...

Bug summary: doing an async [fcopy -size $size], if the input stalls (not EOF) after exactly $size bytes, the callback is not called.

The problem is in the CopyData function, near the end of the
while(toRead) loop, in the if (cmdPtr) block. Apply the following
patch:

-       if (cmdPtr) {
+       if (cmdPtr &&(csPtr->toRead != 0)) {
           /*
            * The first time we enter this code, there won't be a channel
            * handler established yet, so do it here.
            */

Explanation: this test is here to "break out" of the entire routine
(not just the while loop) with a return TCL_OK, in async mode, after
making sure we have a fileevent to wake us up later. The problem is
that at this spot the remaining number of bytes to read due to [-size]
has already been updated, and when it reaches zero, we should only
break of the while loop in order to terminate properly (and call the
callback).

Hence the fix: just add "&&(csPtr->toRead != 0)", so that it will fall
thru to the beginning of the loop, which does test toRead.

We can also understand why it was "rare" and "random": this is because
people often use "-size" only as a progress indicator, so the pump is
often "re-primed" by additional data on the channel coming after the
given -size bytes. But what about the last chunk ? I believe people
often use a channel with a proper EOF (not a never-ending pipe like in
Miguel's ad hoc example), which completely hides the issue since it
fires the readable fileevent !

Below is a small script showing the bug with probability one: the script blocks forever.
Patched version completes normally after reading 3 lines from the child process.
Test suite OK.

-Alex

set pi [open "|[info nameofexecutable] 2>@ stderr" r+]
fconfigure $pi -translation binary -buffering line
fconfigure stdout -translation binary -buffering line

puts $pi {
    fconfigure stdout -translation binary -buffering line
    puts stderr Waiting...
    after 1000
    foreach x {a b c} {
puts stderr Looping...
puts $x
after 500
    }
    puts stderr Now-sleeping-forever
    while {[gets stdin line]>=0} {}
    exit
}

proc done args {puts stderr DONE;exit}

fcopy $pi stdout -size 6 -command done

vwait forever

hobbs added on 2008-03-30 06:44:18:
Logged In: YES 
user_id=72656
Originator: NO

Via Cleverly (non-default value key to trigger error):

Consistently reproducible for me for values of 21, 23, 24, 25 & 26 on
OS X 10.5.2 (intel) w/Tcl 8.5.0.  22 always seems to work.  I haven't
tried >= 27.

With 8.5.2rc1 on Linux 2.6.18 (sparc64) it seems to be consistently
reproducible for values of 14, 16, 17, 18, 19 & 20.  15 always seems
to work.  I haven't tried >= 21 on this platform.

With 8.5.2rc1 on OpenBSD 4.2 (sparc64) it seems to be consistently
reproducible for values of 11 & 12. 13-18 always seem to work.  19 &
20 sometimes worked, sometimes didn't.  (For example, out of twenty
consecutive runs passing 20 as an argument 14 failed, 6 succeeded.)  I
haven't tried >= 21 on this platform.

hobbs added on 2008-03-30 05:04:32:
Logged In: YES 
user_id=72656
Originator: NO

Works for me with 8.5.1 and 8.4.18.

msofer added on 2004-04-01 21:22:50:
Logged In: YES 
user_id=148712

Is this a real bug, or mainly EBKAC? If it is a real bug, it
needs attention - if it is EBKAC, I need a spanking ...

The bug I (think that I) see is: [fcopy] is failing to call
its callback after copying the required number of bytes.

The test script lacks comments, portability ('puts stderr'),
and nicety (writes a tempFile in wd, leaves it there),
sorry. What it does is:

1. opens a server socket; on connection 4 bytes are read
(number N of following bytes), then N bytes are fcopied to
tempFile, calling the doneCopy callback when done. doneCopy
writes a log to stderr, end then causes the app to exit. The
app exits also after receiving the second message through
the pipe (used to signal a failure).

2. opens (via pipe) a second tclsh that connects to the
server and sends a binary (length+message). After message
end, the client waits for 10 seconds and prints an exit
message - if it still alive, which should not happen as the
pipe should have been closed before that, when the main
process exits.

Observed behaviour: fcopy's callback is not called after the
10 seconds, the client prints its exit message.

Should have observed: fcopy's callback is called, no client
exit message seen.

Weirdness: if the client sends an extra byte down the
socket, all's well.

msofer added on 2004-03-27 05:24:22:
Logged In: YES 
user_id=148712

Bug still there, newer linux here :(

dgp added on 2003-11-13 09:21:26:
Logged In: YES 
user_id=80530


The test script works for me
for both the 8.4 branch and HEAD,
on both Solaris 9, and Linux/Alpha.

msofer added on 2003-07-31 21:14:56:
Logged In: YES 
user_id=148712

More:
* can't repro when the slave is a pipe. 
* bug also present for client and server on different machines

msofer added on 2003-07-31 20:07:49:
Logged In: YES 
user_id=148712

Addenda:
  * failure reproduced in win2k
  * the [fcopy] callback does occur after the client closes
the socket
  * different -buffering settigs do not seem to change behaviour

msofer added on 2003-07-31 04:45:03:

File Added - 57359: test.tcl

Attachments: