Tcl Source Code

View Ticket
Login
Ticket UUID: 2827000
Title: tcl incorrectly reports eof to application
Type: Bug Version: obsolete: 8.6b1.1
Submitter: coldstore Created on: 2009-07-25 13:19:52
Subsystem: 25. Channel System Assigned To: andreas_kupries
Priority: 7 High Severity:
Status: Closed Last Modified: 2009-08-07 05:29:39
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2009-08-06 22:29:39
Description:
attached is an annotated narrative log of an observed sequence of interactions by the refchan subsystem and this code http://code.google.com/p/wub/source/browse/trunk/Wub/Chan.tcl?r=1798

A connected socket is passed as the 'chan' argument of Chan construction, which is [chan create]d as a refchan.

In the narrative, some data is [read] pursuant to a [readable] event on the underlying socket.  Less data is returned by [Chan read] than is requested.  A buffer-full of data is requested (4096 bytes) but only 1440 bytes are available.

The data is processed by the application, which then attempts to [read] (in non-blocking mode?) more data.  The application uses [gets] to parse a prefix of the 1440 bytes which are available, but it certainly does not consume the entirety.  It attempts to [read] some 20Kb more.

Tcl [Chan watch]es the refchan, and soon after calls [Chan read] again  to request another buffer-full of data.  There has been no readable event triggering, nor has [chan postevent] been called by Chan.  Despite this, Tcl attempts to read.

When there is no data available, as indicated by the return value of "" from [Chan read] upon this second attempt, Tcl indicates [eof] on the refchan.

I would have expected Tcl not to call [Chan read] until after a [chan postevent] had been signalled to it by Chan.

Observations:  The behaviour is intermittent, insofar as sometimes an identical socket interaction succeeds (though rarely.)  I have attempted to reproduce the behaviour in a localhost-connected socket, and failed.  I believe this is because the second and subsequent [Chan read]s succeed.

I think it's likely that Tcl refchan is calling the refchan's read method inappropriately, and then interpreting the 0-length result as end of file even though it has received no indication from the refchan by means of postevent that it can expect any data to be available to read.
User Comments: andreas_kupries added on 2009-08-07 05:29:39:

allow_comments - 1

Committed to HEAD and 8.5 branches.

andreas_kupries added on 2009-08-07 05:26:38:
Updated patches to include doc changes and test cases. Plus fix for threaded build, a bogus Tcl_DecrRefCount which caused double-free of resObj. Found with the new test cases (iocmd.tf-23.10).

andreas_kupries added on 2009-08-07 05:25:24:

File Added - 338233: refchan-read-eagain-85.patch

andreas_kupries added on 2009-08-07 05:25:02:

File Added - 338232: refchan-read-eagain-86.patch

andreas_kupries added on 2009-08-07 04:28:49:

File Deleted - 337086:

dkf added on 2009-07-30 22:36:18:
I can't figure out how to do distill the scripts into a test case; I'm not sure what's wrong in the first place, what's noise and what's relevant. :-(

coldstore added on 2009-07-29 15:34:34:
After 8 hours of running on the wiki I have no reports of any problems.  I conclude that the fix is in!

Thanks Andreas.

coldstore added on 2009-07-29 08:50:26:
Seems to have fixed it!  I constructed a test which sends the header to Wub then pauses before attempting to send the rest of the entity.

Ran it on the unpatched tclsh, and the same eof error was evident.  Ran it on the patched tclsh, and the entity was correctly received.  So I would say the bug is resolved - at least, I'm going to install this on the running wiki and observe.

As a side-effect, Chan got a CaptureChan mixin which records received data to a disk file.

Thanks!

andreas_kupries added on 2009-07-29 07:52:27:
While you have updated to HEAD, have you applied the patches attached here as well, both to core and Chan.tcl ?
I have not committed my patches to the head yet.

coldstore added on 2009-07-29 07:40:02:
I have updated the wiki's version of tcl to 8.6 HEAD.  When I run the wiki, it behaves the same way (essentially, it resets the connection when it's been given a lot of data to read - more than will fit into a buffer.)  I haven't been able to debug it now, though, I would suppose this is because the generation of debug narrative slow the server down to the extent that network interactions are able to fulfill tcl's expectations in a relatively timely manner, where without the debug narrative they never do.

So I'm a bit stuck.  I can try to winnow down the narrative to try to reproduce the problem, but there's no guarantee that will work.

Is there something else which refchan is doing which is different from what sockets do?  That's about the only thing I can immediately think of which might lead to a diagnosis.

andreas_kupries added on 2009-07-29 06:21:15:
Added patch for 8.5 branch, and the scripts I used (rc2, server) during the investigation.

andreas_kupries added on 2009-07-29 06:20:34:

File Added - 337088: rc2.tcl

andreas_kupries added on 2009-07-29 06:20:02:

File Added - 337087: server.tcl

andreas_kupries added on 2009-07-29 06:19:32:

File Added - 337086: refchan-read-eagain-85.patch

andreas_kupries added on 2009-07-29 06:02:11:
Updated patch. Passes threaded build as well.
Missing from patch: Doc changes and test cases.

andreas_kupries added on 2009-07-29 06:01:02:

File Added - 337085: refchan-read-eagain-86.patch

andreas_kupries added on 2009-07-29 06:00:37:

File Deleted - 337078:

andreas_kupries added on 2009-07-29 05:26:10:

File Added - 337078: refchan-read-eagain-86.patch

andreas_kupries added on 2009-07-29 05:25:33:

File Deleted - 337076:

andreas_kupries added on 2009-07-29 05:25:12:
Second half of the proposed fix is to modify Chan.tcl's 'read' method to detect the EAGAIN condition on the underlying channel and pass it on.

The condition for EAGAIN is "zero-length read in non-blocking mode with the channel not eof". See the patch for the equivalent Tcl code.

With these patches (core and Chan.tcl) applied my reconstruction begins to work.

NOTE: I tested only the non-threaded code patch so far.

andreas_kupries added on 2009-07-29 05:19:01:

File Added - 337077: Chan.tcl.patch

andreas_kupries added on 2009-07-29 05:18:19:
First half of a possible fix is the "refchan-read-eagain-86.patch".

The ReflectInput() is modified to recognize the result of

        return -code error NUMBER ;# NUMBER < 0
and     return -code error EAGAIN

The NUMBER form is allow any errno (EAGAIN for example is -11), although we currently need only EAGAIN.

andreas_kupries added on 2009-07-29 05:15:42:

File Added - 337076: refchan-read-eagain-86.patch

andreas_kupries added on 2009-07-29 05:14:45:
More investigation ... Instrumented the core a bit (what are the arguments and results when calling a channel inputProc, tcl_eof flags), and compared traces for plain socket and IChan over socket.

Sockets are not returning an 0-length read, but errno -11 EAGAIN to signal 'os buffers empty, but not at eof'. This is transformed into CHANNEL_BLOCKED instead of EOF. At the Tcl level this is seen as an empty string result of gets/read, no eof, and no fblocked actually.

IChan then returns this empty string, which indicates becomes a successful (errno 0) 0-length read, which indicates EOF, boom.

The bug is not in IChan, but in the core reflected channel. It never returns error conditions like EWOULDBLOCK or EAGAIN from ReflectInput(), only EINVAL for bad handler and such. And a limitation, the 'read method' has no way to signal this to ReflectInput() either.

andreas_kupries added on 2009-07-29 02:21:16:
Line 72 of chan.tcl, method 'initialize'. -blocking 0 at this point is wrong. A channel starts out in mode 'blocking'. To then set the underlying channel into non-blocking will likely confuse things when it comes to reading, especially regarding short reads (Not possible in blocking mode except when reaching eof. With the underlying channel set to non-blocking on init this can give short reads before eof).

Note that this should not have an effect on the problem we are looking into, because there both channels are set non-blocking. Well, should be, and the save.anno log seems to concur.

andreas_kupries added on 2009-07-29 01:28:43:
A review of the IChan class shows no anomalies. Using watch to (de)activate events on the underlying channel is a nice trick for avoiding flags. It should work I would think. And with that the immediate forwarding of events in the methods readable/writable should work as well.

I tried to compare to my refchan examples, but they are all about base channels without a channel underneath, so not really comparable. The nearest is the fifo2, sort of, as it consists of two ref channels talking to each, i.e. internally joined at the hip. But even there the whole buffer mgmt is by the fifo code, there is no regular channel involved.

After I got me the Debug package I could run the Chan.tcl by itself, demo mode. That worked.

Do you have a demo script using Chan.tcl and showing the problem ?
I hope here that I can avoid getting the whole wub sources.

Looking over the save.anno, maybe some more clues there

> * as a result of an app [gets], tcl watches reads on the rc:
>   chan @@{rc18 watch sock15 read}

Here might be a misunderstanding of the IO core. 'watch' is primarily called in response to 'chan event'. Calls from the guts of gets/read happen for temp. deactivate events, and re-activate them when the Tcl buffers are empty. A call this early is more likely from a 'chan event' than 'gets'.

> (App parses the header from some of the received bytes: Httpd @@{::Httpd::rc18_1504
> reader got request: (...)})

This gets events on rc18, as per step 9.

> * tcl reads a buffer's worth of bytes from rc18, but there are no bytes buffered in sock15, so an > empty string is returned:
>  chan @@{rc18 read sock15 4096 -> 0 bytes '...'}

Yes. This is like in testRefchan.tcl, step 13. Event through timer, not enough data in the buffers, ask the OS if it has more. Empty read.

>  * tcl watches read (presumably because the app has a fileevent readable):
> chan @@{rc18 watch sock15 read}

Like step 15. Interest in the base events is re-activated. This also re-activates sock15 events, should be.

> * application calles [eof rc18] which returns true, and terminates its reader:
>  Httpd @@{::Httpd::rc18_1504 terminate: (entity)}

Hm. Here I would have expected fblocked ...

Hm ... It should be possible to model that using testRefchan ... Basically, instead of going into state 2 where it always returns the empty string it should return nothing once, and at the next read something ... But the next read will not happen, won't it. Because of the 'eof' getting signaled ... Ok. That means I should concentrate on the happenings in and around steps 16 and 17 in rc.tcl.

andreas_kupries added on 2009-07-28 22:13:49:
Ok. Will have a look through this as well.

coldstore added on 2009-07-28 11:04:20:

File Added - 336902: Chan.tcl

coldstore added on 2009-07-28 11:03:45:
Clearly the test isn't a good test of what I'm observing.  Please see enclosed Chan.tcl for the code which produces the events I observed.

Unlike the test, it includes a [watch] method, and it only forwards events to Tcl when tcl has registered interest in that event, and it nonetheless behaves exactly as the original test script behaves.  Moreover, when the refchan is substituted for a socket, it does not behave like a socket.

Your advice on what is missing to make the enclosed Chan object perform precisely as a socket performs would be appreciated.

andreas_kupries added on 2009-07-28 10:01:09:
Behaviour of testRefChan.tcl ...

NOTE: The implementation of 'watch' was missing. This is crucical. You
      have to look at the watch calls to see if _you_ (The rc channel
      implementation) is expected to generate events, or not. See
      steps 3, 7, and 15.

NOTE 2: The handling of eof in getter/reader fell through to regular
     read, I added exit commands there.

NOTE 3: The modified test script is attached.

(.1) Create channel RC

(.2) Set channel RC to blocking --> 

(.3) Set up a fileevent receiver for RC -> getter. This invokes
     rc::watch, asking for read events, this posts a readable event.

(.4) The readable event is received and 'getter' called.

(.5) The 'gets' command in 'getter' finds empty buffers and calls the
     channel's inputProc -> ReflectInput -> procedure rc::read.

(.6) rc::read returns 221 bytes, 10 lines of '111...', an empty line,
    10 lines of '222...'.

(.7) As part of the inputProc processing interest in read-events from
     the channel is disabled. We have data in the Tcl buffers, these
     will get flushed first. No need to ask for events from the OS
     (=== the reflected channel).

(.8) The IO core returns the first '111...' line as result of 'gets'.

(.9) The IO core generates its own fileevents, based on timers, to
     flush the data left over in the Tcl buffers (remaining '111...'
     etc. lines.

(10) Per (.9) 'getter' is called, the cycle repeats.

(11) At the 11th call 'gets' returns an empty line, and the event
     receiver switches to 'reader'. This doesn't change that data in
     the buffers is flushed via timer-based fileevents, there are
     still the '222...' lines etc.

(12) 'reader' is called, for the first of the events after the switch-over.

(13) It tries to [read 10000]. As only less bytes than that are left
     in the Tcl buffers the channel's inputProc is called to sdee that
     we fill it up with more data => ReflectInput => rc::read.

     This is the so-called bogus call, and it is not bogus, but
     completely correct.

(14) This call returns 0 bytes, and the EOF flag in the channel is set.

(15) The IO core now __re-enables__ interest in the 'readable' events
     because all data in the Tcl buffers will be delivered to the
     script, and any more data will have to come from the channel
     itself. <=> Any more events have to come from the channel as
     well, the timer-based events are stopped.

(16) The read call returns the remaining bytes from the Tcl buffers.

At this point the script stops for me with

can't wait for variable "forever": would wait forever

because the RC channel implementation did not set things up to produce
its own events, despite being asked to in step 15, via rc::watch.

The moment we do this the posted event causes another call of 'reader'
sa step (17), which then handles the eof condition by closing the
channel, which disables events, finalizes, and exits. The last is
something I added to the script.



Now, what does that mean for the identity refchan ?

It has to keep track of when the Tcl IO core wants events from it or
not.

If it receives a fileevent from the socket it is connected to at a
time the core is not interested in identity events, then this event
has to be remembered in flags (one per read and write), but not be
acted on, i.e no posting forward to the core.

Only when events are enabled again for the identity chan the flags
have to be checked, and if we got events from the socket we post _one_
event forward.

andreas_kupries added on 2009-07-28 10:00:00:

File Added - 336901: rc.tcl

coldstore added on 2009-07-28 04:28:10:
The testRefchan.tcl script is not wild caught, it is crafted to illustrate and test for the problem I observed.  My assertion that I am not posting an event pertains to the observed problem, which I have annotated in the other attachment, not to the test script, which was written after I'd reported this bug.

It seems to me that either there should never be a read invoked in non-blocking mode without there having been a directly corresponding post event, or at least that under such circumstances a read return value of 0 must not be interpreted as an EOF.

andreas_kupries added on 2009-07-27 23:56:07:
One thing I noted in my quick glance over the testRefchan.tcl test script is that you _are_ posting a 'readable' event from the 'blocking' method, contradicting the "having been no [chan post] indicating" information in your comment here.

I.e. when the blocking status of the channel changes you tell the higher levels unconditionally that you have data to read. Without checking that you truly have such.

That already could be your problem. At least for the test script.

coldstore added on 2009-07-25 21:01:45:
Incidentally:  I would have expected the test script to show tcl sending another readable event to the application when it detected/decided the refchan was [eof].  This would be consistent with standard channel behaviour, but doesn't occur.

coldstore added on 2009-07-25 20:56:20:
I have written and attached a small script testRefchan.tcl, which I believe replicates and illustraes the problem.

It demonstrates that Tcl calls [read] on a non-blocking refchan despite there having been no [chan post] indicating that it should, and then delivering a positive [eof] when that read returned empty string result.

coldstore added on 2009-07-25 20:52:34:

File Added - 336521: testRefchan.tcl

coldstore added on 2009-07-25 20:19:53:

File Added - 336515: save.anno

Attachments: