Tcl Source Code

View Ticket
Login
Ticket UUID: 462317
Title: Notification when there's no data
Type: Bug Version: None
Submitter: kogorman Created on: 2001-09-17 18:54:02
Subsystem: 25. Channel System Assigned To: andreas_kupries
Priority: 7 High Severity:
Status: Closed Last Modified: 2001-09-27 09:27:03
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2001-09-27 02:27:03
Description:
Using expect5.32 with Tcl8.4a3, my expect script
wedges.
I have used strace(1) and printfs to find that
ChannelTimerProc
is calling TclNotifyChannel, which causes expect to
perform
a blocking read.  This is usually okay, but sometimes
there's
no data, and the read blocks forever, because the
channel
corresponds to a child which is waiting for a command. 
Deadlock.

I can tell from the system call trace that the program
does NOT
have enough information to know whether the read will
block,
but it could guess that it would because the previous
read
returned a short reply.  However, the file descriptor
has not
been tested by select(), so the issuance of the notify
is probably
bogus.

My guess is that there should be something cancelling
the
timer, since it was started before the last successful
read,
and has just been running since then.

I have hundreds of megabytes of trace, but don't
understand
the internals well enough to make much more sense out
of them
than the above, though I'm working on it.  I can boil
the traces
down quite a bit if anyone can work with me on this.

Sorry I can't give a concise code sample, because I
don't know
how to provoke the error without my 1GB Oracle
database.
I have, however, included a boiled-down tail of one of
the
traces.
User Comments: andreas_kupries added on 2001-09-27 09:27:03:
Logged In: YES 
user_id=75003

Committed to head and 8.3.3/cvs (core-8-3-1-branch).

andreas_kupries added on 2001-09-27 09:08:27:

File Added - 11261: 462317.diff.0

Logged In: YES 
user_id=75003

In direct interaction with Kevin
(email) it was determined that the
following interacting factors generated
the problem reported here. A patch to
the Tcl I/O core was created and tested
extensively. The patch survives the
testsuite and also removes the reported
deadlock.

The patch here is the second one to fix
the bug. The first patch was done
against Expect itself and whipped up to
provide a more immediate workaround
while trying to find the trouble spot.

- Tcl had a number of bytes in its
  input buffers and used a timer to
  generate artificial fileevents to
  drain them.

- Expect received these events and read
  data from the channel. It also asked
  for more data than was buffered.

- Epect had set the pty channel to
  non-blocking and therefore expected
  to get short reads if asking for data
  than available.

- The associated pty channel driver did
  not provide a blockmodeproc through
  which the Tcl I/O core tells a driver
  whether a channel is blocking or not.

  IOW, while the Tcl I/O core had to
  provide non-blocking semantics the
  underlying OS call was actually
  blocking.

- When Expect asked for more data than
  available the Tcl I/O core started to
  read from OS, assuming that the
  underlying OS call would not block as
  the channel was set non-blocking. The
  call was blocking instead and closed
  the chain of dependencies, creating
  the deadlock.

Solution:

The Tcl I/O is changed to

- assume that the underlying OS calls
  are blocking if a driver does not
  provide a blockmodeproc.

- never try to read from the OS if it
  is not certain that data is available,
  non-blocking behaviour is expected by
  the caller and the underlying OS call
  assumed to be blocking.

The last item means in practice that
read requests generated while handling
a timer generated fileevent are
restricted to the data in the input
buffers of the channel in question.

Trying to read more will signal
EWOULDBLOCK and cause the caller to see
a short read.

andreas_kupries added on 2001-09-19 23:38:42:
Logged In: YES 
user_id=75003

Ok, a freeze is no good either. Ok, the simpler clients 
should be something we can do.

select cannot and will not protect against the script or 
other extension reading data from a blocking channel 
without having made sure that there is data. Ok, in this 
case the internals on how tcl handles its buffers and when 
it goes to the OS for more also have a hand here. That is 
the trouble, it seems to be a problem in the interaction of 
tcl internals and expect.

I am beginning to feel out of my depth because my knowledge 
of pty's and the expect internals is limited. I will now 
summarize what we found so far and start talking to Don 
Libes to see if he can give me additional pointers.

kogorman added on 2001-09-19 08:35:43:
Logged In: YES 
user_id=326095

"Dies horribly" was perhaps an overstatement.
It freezes.  It does not exactly crash.

You could perhaps duplicate this with a simpler
script and simpler client.  If you want to just make
simple clients, they could be this simple:

call it 'sqlplus', put it in your path
1) produce the SQL> prompt (I think there's
   a space aftger the >
2) take an interactive command
3) echo the command (if that's not automatic)
4) wait a while, up to 15 minutes or so.
5) produce a header line, then a line with
    all hyphens (-).
6) produce a lot of output, with NO blank lines.
7) produce a blank line
8) produce some more stuff, more free-format
    perhaps 100 lines
9) go back to 1

To provoke the error, you may need 2 or 3 clients
producing a LOT of output at once.  This has
always seemed to be going on when the error
happened.  That is, make them take step 5
simultaneously, then fill the pipeline for a while,
perhaps 20,000 bytes.

Finally, I was under the impression the select
lists were built inside tcl.  That's what's confusing
to me about this.  If tcl is doing the select() calls,
I supposed it would be using them to protect from
blocking reads.  That's the behavior Expect is
relying on.

andreas_kupries added on 2001-09-19 07:11:14:
Logged In: YES 
user_id=75003

When you said 'expect dies horribly' I expected that expect 
generated a coredump. You might have to use the 'ulimit' 
command to allow this. See the current settings of your 
shell via 'ulimit -a'.

andreas_kupries added on 2001-09-19 07:09:15:
Logged In: YES 
user_id=75003

I talked about this line

17:03:34 select(23, [4 6 8 10 12 16 18 20 22], [], [4 6 8 
10 12 14 16 18 20 22], {0, 0}) = 0 (Timeout)

... Ok, the manpage of select says that the read list is 
the second argument ... 14 is missing. It shouldn't be, 
there is a Tcl_CreateFileHandler call before that without 
intervening Tcl_DeleteFileHandler.

Understand that we are talking about two levels of 
bufering/data here. One is data buffered in the OS for the 
process. The other is data buffered by Tcl itself. Expect 
is given a notification that there is some data available. 
This data happens to be in the tcl buffers. In our 
situation expect asks tcl for more than is in the buffers, 
so tcl asks the OS and given that the 'read' is blocking 
but as the tcl core believes that it will perform a 
nonblocking call (line 512) tcl is perfectly right to do 
this call. It expect that the call will give it a short 
read. And hangs because unexpectedly the call is actually 
blocking.

If I understand your situation correctly the external 
process waits for a information send to it by expect and 
expect expects (sic!) the same channel to send something to 
it (after it already got partial information ?). This is 
something I should be able to reproduce with a simpler 
script. Right ?

kogorman added on 2001-09-19 07:06:16:
Logged In: YES 
user_id=326095

It's intel/linux.

It seems to be non-threaded, at least
when I've fooled with the code the
threads stuff seems not to compile.

I can rebuild with symbols easily enough.
There are no coredumps now, do you
want me to make one?  Could you
remind me how that's done (some
switch on the kill command?).

kogorman added on 2001-09-19 06:53:50:
Logged In: YES 
user_id=326095

Print what lists?  The printing in the trace
is all done by strace(1).  Even the printf()
output is here because strace catches the
underlying write(2) call.  That actual output
goes to a log file separately from the trace.

It seems expect has been given a notification,
so it's doing a read.  It hasn't tested how much
data there is, so is relying on semantics much
like a unix read(2) from a socket, which will
give a short reply if there's not enough data.
It only blocks if there's no data at all.

I think the right question to be asking is if there's
a way to get tcl to act like that, and if so, what
is missing?

If that behavior is not available at all, then Don
Libes has a harder job, but we're going to need
to know the details of what he can rely on and
what he can't.

andreas_kupries added on 2001-09-19 06:52:52:
Logged In: YES 
user_id=75003

Ok, this is not an option then ... Hm, what platform are 
you on ? If it is sparc/solaris or intel/linux I could have 
a look at the coredump ... Did you configure and compile 
using --enable-symbols ? ... Oh, yes I also assumed a non-
threaded built (Better ask).

kogorman added on 2001-09-19 06:44:44:
Logged In: YES 
user_id=326095

I tried your solution, and Expect dies
horribly.  It doesn't even install because
the install target in the makefile uses
expect, which wedges immediately.

We need something else.

andreas_kupries added on 2001-09-19 06:39:30:
Logged In: YES 
user_id=75003

Just determining the amount buffered and reading that ... 
Yes, we could try that in the highlevel code of expect.

Does this mean the my suggesstion was not sucessful ?

Adding (and removing) a channel to (from) the read/write 
lists of select happens automatically. This is done 
by 'Tcl_CreateFileHandler' ('Tcl_DeleteFileHandler'). I 
haven't looked deeper into expect for this, there is lot 
going on with arming and disarming and I don't know what.

Looking at your trace and the 'select' lines, in what order 
do you print the lists ?

kogorman added on 2001-09-19 06:23:25:
Logged In: YES 
user_id=326095

Would it be good enough for Expect to
determine the amount of data and just
ask for that.  Failing that, how should
Expect get Tcl back into putting the
channel in the select() read-list and
only notifying when a read won't block?

Or am I talking nonsense?

andreas_kupries added on 2001-09-19 05:55:04:
Logged In: YES 
user_id=75003

Ok, it really looks like the situation I thought. expect 
reads more than is in the buffers, goes to the OS and then 
gets stuck in the low-level read.

That the timer exists and fires is correct, there is data 
to read (in the buffers). Just not enough to satisfy expect.

While looking further through the expect code, beginning 
with ExpInputProc I noted an inconsistency in the expect 
channel code. See the following excerpts from exp_chan.c:

Line 53:
Tcl_ChannelType expChannelType = {
    "exp",/* Type name. */
    /* Expect channels are always blocking */
    NULL,/* Set 
blocking/nonblocking mode.*/

Line 115 (ExpInputProc)
    /*
     * Assume there is always enough input available. This 
will block
     * appropriately, and read will unblock as soon as a 
short read is
     * possible, if the channel is in blocking mode. If the 
channel is
     * nonblocking, the read will never block.
     */

Line 512 (expCreateChannel):
    Tcl_SetChannelOption(interp,esPtr->channel,"-
blocking","0");

In other words, several parts of expect seem to disagree 
whether the channel is blocking or nonblocking.

I fear that the last command (line 512) leaves the generic 
I/O layer in the tcl core a bit confused. It tries to 
handle the channel as non-blocking but the low-level file 
descriptor was never told so. This may allow the generic 
I/O core in some situations to try to read from the channel 
when it would not try the same for a blocking channel.

Please try to comment out line 512 in exp_chan.c and see if 
the problem is still there afterward.

andreas_kupries added on 2001-09-19 04:45:52:
Logged In: YES 
user_id=75003

There is a size limit yes, but they SF was/is also balky 
today. I had quite a trouble to login this morning.

kogorman added on 2001-09-19 04:35:35:
Logged In: YES 
user_id=326095

I had trouble uploading this, but the big trace
is there now, gzipped.

Is there a size limit, or was the internet just
being balky today?

++ kevin

kogorman added on 2001-09-19 04:28:55:

File Added - 10944: F022d.tail3.gz

kogorman added on 2001-09-19 04:09:29:
Logged In: YES 
user_id=326095

I'm attaching a trace (big, but much 
smaller than the original), which may
have what you need.

It goes from the last select() call I
found which detected fd 14 readable,
up to just after a read of fd 14 wedges.

++ kevin

andreas_kupries added on 2001-09-19 00:47:36:
Logged In: YES 
user_id=75003

Basically I would like to see how many characters expect 
tries to read and how much is in the tcl buffers at that 
time. That would be in 'expect.c/expIRead'.

Also how much bytes are in the buffer whenever that timer 
fires.

kogorman added on 2001-09-19 00:33:44:
Logged In: YES 
user_id=326095

I've been adding even more printf's to
the code.  Is there anything that would
be particularly helpful to you?

If you want info on tcl buffers, where
would that be helpful, and how best
to show it?

++ kevin

kogorman added on 2001-09-19 00:09:37:
Logged In: YES 
user_id=326095

I've been adding even more printf's to
the code.  Is there anything that would
be particularly helpful to you?

If you want info on tcl buffers, where
would that be helpful, and how best
to show it?

++ kevin

andreas_kupries added on 2001-09-19 00:08:59:
Logged In: YES 
user_id=75003

Huh, complicated script ... I haven't done much expect so I 
hope this question is not stupid: do you believe it is 
possible to use [after] instead of managing the timeouts on 
your own ? (This is an aside question hopefully not 
connected to the problem).

As for looking into the Tcl buffers there 
is 'Tcl_Inputbuffered' returning the number of bytes of 
input currently buffered in the common internal buffer of a 
channel. There is also 'Tcl_InputBlocked' which returns 1 
if input is blocked on this channel, 0 otherwise. Last we 
have 'Tcl_Eof' which returns 1 if the channel is at EOF, 0 
otherwise. Beyond that you have to query the datastructures 
directly.

Regarding "The channel is the pipe to a child of
expect, and as far as I know, there's
no control at the script level about
blocking that.", I assume this to mean that we have no 
direct access to the channel allowing us to perform 
[fconfigure $channel -blocking 0].

andreas_kupries added on 2001-09-18 23:52:51:
Logged In: YES 
user_id=75003

Regarding the many calls of 'time()', this is part of the 
notifier.

Explanation: The relevant functions are 'TimerSetupProc' 
and 'TimerCheckProc' in 'tcl/generic/tclTimer.c'.

Whenever tcl starts to wait for events it will 
use 'gettimeofday' to retrieve the current time down to the 
usec and then uses this to compute the timeout required 
for 'select' based upon the delay times for the known 
timers (after ...).

Similarly when it is done with waiting for events, 
i.e. 'select' returned, it queries the time again to see 
which of the timers expired and have to queue events.

kogorman added on 2001-09-18 23:17:13:

File Added - 10933: F022e.small

kogorman added on 2001-09-18 09:51:58:
Logged In: YES 
user_id=326095

Here's a trace with __FILE__ and __LINE__
in the printfs.

It also tracks the TimerHandler tokens, to
see how they run.  I don't really understand
them, but I highlight token 1ffa because it
seems to be connected with the failure.

kogorman added on 2001-09-18 07:21:29:

File Added - 10905: Team

kogorman added on 2001-09-18 07:21:28:
Logged In: YES 
user_id=326095

I have been talking with Don Libes.
He thinks this is a Tcl problem.

I was originally having the problem
with the stock Expect from RedHat
7.1, and he got me to try the later
release.  I can look up details if you
really need them.

The channel is the pipe to a child of
expect, and as far as I know, there's
no control at the script level about
blocking that.

The trace includes all system calls,
expect or otherwise, from the given
process.   I removed some that were
too prolix (do you really need to call
time() that often?  it's 90% of the
system calls). Some of the printfs are
in expect, some in Tcl.

My own guess is that this is only
going to be a problem with multiple
channels, producing blocks of data
bigger than expect's buffer (4096),
so that there is data left after a read.
Then, as near as I can tell, the timer
is immortal and can fire later at an
inappropriate time.  It won't get a
chance if there's only one channel
active, because of the details of the
code.  But with multiple channels
operating concurrently, .....

I'll work on __FILE__ and __LINE__,
but you know, my printfs are going
to make hash of LINE.  They already
mostly say where they are.

Finally, I'm not sure how to look into
what's in the tcl buffers.

See the attached script.  It isn't going
to work unless you can run 10 SQL*Plus
clients against a TPC database. :o(

andreas_kupries added on 2001-09-18 05:55:47:
Logged In: YES 
user_id=75003

Some more comments. Given the last comments I believe you 
should have a look at how much is in the tcl bufers and how 
much the channel handler in expect tries to read from the 
channel.

Thanks for the trace btw. it helped a lot. A suggestion 
nevertheless: Adding __FILE__ and __LINE__ to the output 
would allow even better to pinpoint where each write call 
is.

andreas_kupries added on 2001-09-18 05:52:20:
Logged In: YES 
user_id=75003

Does the deadlock happen with Tcl 8.3.3. too ?
Please attach the expect script too, if possible.
(Just so that we can take a look at the tcl side of the 
problem too).

It might be interesting [x] to switch the channel into non-
blocking mode and see what happens.

[x] I do not know how much control we have over the created 
channel. 

Note: The timer in question is activated when bytes are 
waiting in the input queue of the channel in question. It 
basically generates artifical fileevents to push these 
bytes into the channel handlers.

Problem is, if there are for example 12 bytes waiting in 
the tcl buffers and then a blocking read is issued from the 
eventhandler for more than 2 bytes the system will try to 
read the missing bytes from the OS. This can cause the 
deadlock too.

Are you able to trace execution inside of expact too ?

kogorman added on 2001-09-18 01:54:06:

File Added - 10883: F022d.tail

Attachments: