Tcl Source Code

View Ticket
Login
Ticket UUID: 943274
Title: Tcl 8.4.5 with Expect 5.39 hangs waiting for data matching p
Type: Bug Version: obsolete: 8.4.6
Submitter: mellenbe Created on: 2004-04-27 19:48:55
Subsystem: 25. Channel System Assigned To: andreas_kupries
Priority: 7 High Severity:
Status: Closed Last Modified: 2004-05-27 23:06:51
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2004-05-19 19:44:08
Description:
See the thread by the same name at comp.lang.tcl for 
more information.

Running this on solaris.  We have a large testing 
framework written in tcl/expect.  The framework 
executes on one platform (solaris) and uses spawn to 
start rsh on another platform.  The shell is left running 
for the life of a test run which can last several days.

Commands are sent to the rsh shell and expect is used 
to parse the input as it is returned.  We recently 
upgraded to more current versions of tcl 8.4.5 and 
expect 5.39 (from tcl 7.5 and expect 5.21).

What I have found is when the rsh shell returns large 
amounts of data (approx 150,000 bytes) it is very likely 
to produce a hang.

Back trace copied from comp.lang.tcl,

#0  0xff21da28 in _read () from /usr/lib/libc.so.1
#1  0x000284c8 in ExpInputProc 
(instanceData=0x176078,
    buf=0x1784f0 "rver\r\n137.69.53.234\tpv2-
mv\r\n137.69.51.200\tpacket\r\n137.69.3.186
\tmarvel\r\n137.69.12.162\tcal162\r\n137.69.9.235
\tbutiki\r\n137.69.5.45\tableix\r\n137.69.121.160
\tbelem\r\n137.69.101.155\tubik\r\n137.69.103.118
\trunt\r\n13"...,
toRead=4096, errorCodePtr=0xffbfe260) at 
exp_chan.c:122
#2  0x000b1148 in GetInput (chanPtr=0x173a48) at
./../generic/tclIO.c:5303
#3  0x000afb98 in DoReadChars (chanPtr=0x173a48, 
objPtr=0x164b88,
toRead=2092,
    appendFlag=1) at ./../generic/tclIO.c:4447
#4  0x000af92c in Tcl_ReadChars (chan=0x173a48, 
objPtr=0x164b88,
toRead=4000,
    appendFlag=1) at ./../generic/tclIO.c:4339
#5  0x000363d8 in expIRead (interp=0x15ba80, 
esPtr=0x176078,
timeout=10,
    save_flags=0) at expect.c:1652
#6  0x000364b0 in expRead (interp=0x15ba80, 
esPtrs=0x172f80,
esPtrsMax=1,
    esPtrOut=0xffbfe580, timeout=10, key=12317) at 
expect.c:1723
#7  0x000383d4 in Exp_ExpectObjCmd (clientData=0x0, 
interp=0x15ba80,
objc=13,
    objv=0xffbfe9a8) at expect.c:2526
#8  0x00046d18 in TclEvalObjvInternal 
(interp=0x15ba80, objc=13,
    objv=0xffbfe9a8, command=0x12d3a8 "", length=0, 
flags=0)
    at ./../generic/tclBasic.c:3087
#9  0x00047058 in Tcl_EvalObjv (interp=0x15ba80, 
objc=13,
objv=0xffbfe9a8,
    flags=0) at ./../generic/tclBasic.c:3203
#10 0x000332a4 in exp_eval_with_one_arg 
(clientData=0x0,
interp=0x15ba80,

(gdb) p *esPtr
$7 = {channel = 0x173a48, name = "exp4", '\0' <repeats 
36 times>, fdin
= 4,
  fdout = 4, channel_orig = 0x0, fd_slave = -1, 
validMask = 14, pid =
29846,
  buffer = 0x164b88, msize = 4001, umsize = 2000, 
printed = 4, echoed
= 0,
  rm_nulls = 1, open = 1, user_waited = 0, sys_waited = 
0, registered
= 1,
  wait = 0, parity = 1, key = 12317, force_read = 1, 
notified = 0,
  notifiedMask = 2, fg_armed = 0, leaveopen = 0, 
bg_interp = 0x0,
  bg_ecount = 0, bg_status = unarmed, 
freeWhenBgHandlerUnblocked = 0,
  fdBusy = 0, keepForever = 0, valid = 1, nextPtr = 
0x170b50}
(gdb) p toRead
$8 = 4096
(gdb) p bytesRead
$9 = 0

Our script looks like this, I added eof and full_buffer to 
see if either would get executed.  They don't.

 while {1}  {
    expect {
        -i $ID -re "(\[^\r]*)(\r\r\n|\r\n)" {
        append lines "$expect_out(1,string)\n"
        continue
     }
     "$user_prompt" {
         puts "prompt"
         break
     }
     eof {
        puts "\n >>> hit eof"
     }

     full_buffer {
        puts "\n full buffer hit"
        exit -1
     }
  }

The problem is a mix of buffering and blocking IO.  Since 
the rsh is a long running server, we never see an EOF 
from it.  The user_prompt is basically a logical eof we 
use as "end of data".  

What I have seen is all data, including the logical eof 
has been buffered.  There is no more to expect from the 
rsh.  During our last trip through expect/tcl a portion of 
the data was returned but some data would not fit in 
our buffer (expect_out, I think) so the remainder is left 
in the tcl buffers.  

The next time we go through expect/tcl the remainder 
of the data is returned (including our logical eof) but 
then the logic in DoReadChars does another read.  The 
read is blocking and we sit there forever.

There is a tricky exchange that needs to happen here.  
I don't know if it is a problem with expect or tcl or both.

In our case, when expect is reading, blocking reads 
shouldn't be issued unless the pattern matching logic 
has had a chance to look at the data and decided to 
drive another read.  In this case,  the logic in 
DoReadChars copies data to the expect buffer and then, 
without letting expect see it, does another read. 

I circumvented the hang with the following update to 
DoReadChars.  It works well with a long running server 
but causes some of the blocking/nonblocking tests to 
fail.

Original code from DoReadChars around line 4416,

      for (copied = 0; (unsigned) toRead > 0; ) {
          copiedNow = -1;

          if (statePtr->inQueueHead != NULL) {
              if (encoding == NULL) {
                  copiedNow = ReadBytes(statePtr, objPtr, 
toRead,
&offset)
              } else {
                  copiedNow = ReadChars(statePtr, objPtr, 
toRead,
&offset,
                          &factor);
              }
          }

Fixed code,

      for (copied = 0; (unsigned) toRead > 0; ) {
          copiedNow = -1;
          if (statePtr->inQueueHead != NULL) {
              if (encoding == NULL) {
                  copiedNow = ReadBytes(statePtr, objPtr, 
toRead,
&offset);
              } else {
                  copiedNow = ReadChars(statePtr, objPtr, 
toRead,
&offset,
                          &factor);
>>> New           if (copiedNow > 0) {
>>>                   statePtr->flags |= CHANNEL_BLOCKED;
>>>               }
              }
          }

This basically prevents another read and allows a return 
to user code.
 The user will know if another read is necessary.

Contact me if you need more information.

Mike
User Comments: andreas_kupries added on 2004-05-27 23:06:51:
Logged In: YES 
user_id=75003

The application to apply a patch is called 'patch'. If you
are on a normal unix system chances are that you have it
already.

Start with

patch --dry-run < the_patch_file

and if it asks for filenames use the -p option (-p0, -p1,
...) until it stops asking. --dry-run means that it will
_not_ modify the output files, i.e. a test mode. The option
-b ensures that it leaves .orig files behind, for
comparison. .rej files mean that parts of the patch could
not be applied (did not find where to apply,  possibly due
to interfering changes).

'man patch' will give its manpage.

If your system doesn't have it google for 'gnu patch'.

mellenbe added on 2004-05-27 22:55:00:
Logged In: YES 
user_id=1030524

Thanks Andreas, I hadn't noticed that you had attached a 
patch.

What do I need to do to apply this to my source?

Thanks,
Mike

andreas_kupries added on 2004-05-21 04:11:26:

File Added - 87854: tcl_e.patch.gz

andreas_kupries added on 2004-05-21 04:09:53:
Logged In: YES 
user_id=75003

Yes, you would get anything between 8.4.5 and 8.4.6+. Bugs:
According to the ChangeLog actually a lot were fixed. Yes, a
new Tk is needed.

If you wait a moment I will do a diff, and attach the
resulting patch.

mellenbe added on 2004-05-21 03:59:08:
Logged In: YES 
user_id=1030524

So I would checkout core-8.4-branch?  

Then I pick up this fix and anything else that went in 
between 8.4.5 and 8.4.6+?  

Should I expect to find new bugs in this version?  

Against my better judgement we did builds on 8 platforms.  
Now if we move to 8.4.6 do I need to get a new version of 
Tk?

Mike

andreas_kupries added on 2004-05-21 03:36:53:
Logged In: YES 
user_id=75003

There is no release containing this. The only way to get the
change now is to check the sources directly out of the
sourceforge cvs. Either CVS head, which is Tcl 8.5aXXX in
development, or from the 'core-8-4-branch', which is Tcl
8.4.6 +. The latter is most likely what you want.

mellenbe added on 2004-05-21 03:30:57:
Logged In: YES 
user_id=1030524

Good news Andreas.  

How do I get the updated source?  I'd like to build again and 
see if it fixes the problem.

Mike

andreas_kupries added on 2004-05-20 02:44:08:
Logged In: YES 
user_id=75003

Fixed through changes in the Tcl core. For channels based on
drivers where we don't have complete control over blocking
we now keep track iof events coming from the OS, and thus,
if there is actually data waiting. If not the driver is now
bypassed, avoiding the block.

The change has been committed to both 8.4 branch and CVS
head (8.5 devel branch).

nobody added on 2004-05-12 03:07:20:
Logged In: NO 

I will try creating the file as you outlined and running the test 
driver.  I may not be able to do this till next week (5/17 - 5-
21).  I'll let you know.  

Using telnet instead of rsh probably affects the way this 
works.  Using rsh is what is used in our environment.

There is some magic with "ypcat hosts" in our environment 
too but I have produced the hang using cat and the output 
from ypcat.  

There may be several problems in this area; especially given 
the wide variety of OS's this runs on.  

Again, we see the hang when Tcl does a read (when it 
shouldn't) that blocks.  All our data has already been passed 
to the expect_out buffer.

Mike

andreas_kupries added on 2004-05-11 03:15:20:
Logged In: YES 
user_id=75003

Michael, I use the following script to generate a large file:

for {set i 0} {$i < 4096} {incr i} {
    puts "[format %04d $i] | 0123456789 0123456789
0123456789 0123456789 0123456789 0123456789 0123456789
0123456789 0123456789 012"
}


4096 lines, each 110 characters long. The first 4 characters
are a line counter.

The interesting thing was that at least for linux the hang
happened in the middle of line 4084. Near the end, but
nowhere near the prompt. It should not block at that point,
more data _is_ coming in as I see it. Can you check where
you are hanging  with this file ?

andreas_kupries added on 2004-05-11 01:27:38:
Logged In: YES 
user_id=75003

Back to this again ... I was able to repro the hang, running
on Linux. Using the ActiveState port of Expect, for Windows
I do not get a hang, but a 'full buffer hit'. I don't know
if that is also a bug or not. For that test I had to change
the script to use 'telnet' instead of 'rsh' (The windows rsh
is crippled and does not give you a shell). Turning back
around and trying the script with telnet on linux I get a
hang as well, but can take longer for it to happen. Oh, and
I needed a file on the order 400-500 K to make the hang happen.

andreas_kupries added on 2004-05-06 01:46:29:
Logged In: YES 
user_id=75003

Thanks for the script. I will try it and see what happens.
We don't have yellow pages I believe. But cat'ting a large
file is easy.

mellenbe added on 2004-05-06 01:17:21:

File Added - 86186: zt3.exp

Logged In: YES 
user_id=1030524

Hi Andreas,  I've attached a script you can use to produce 
the hang.  

You will have to modify it for your environment, I've added << 
indicators in obvious spots.

I run it over a network, spawns a shell on another machine 
(i.e., what I refer to as the long running server).  I feed it 
commands, 'ypcat hosts' is killing us but 'cat' of a large file 
seems to also do it.  The prompt is just our protocol.  You 
can adapt this to your environment.  I played around with 
different match_max values.  A smaller value may make it fail 
faster.   System and network activity plays a part in this.

I hope this helps.

Mike

andreas_kupries added on 2004-05-05 23:51:16:
Logged In: YES 
user_id=75003

Having reread all the comments at
http://sourceforge.net/tracker/index.php?func=detail&aid=462317&group_id=10894&atid=110894
I tend even more to the conclusion that the current problem
is in essence the same as the one described there, but
different enough in the details to not be caught by the code
we added for #462317.

andreas_kupries added on 2004-05-05 23:37:36:
Logged In: YES 
user_id=75003

Found the entry in the changelog after a cvs annotate ...
Here it is. And this seems to be exactly about the situation
you are encountering. Seems that I have to revisit this code.

2001-09-26  Andreas Kupries 
<[email protected]> 

* doc/fileevent.n: Accepted [Patch #465279] adding an
example to
  the fileevent manpage. Minor modifications to get a better
  formatting. Report and patch by David N. Welton
  <[email protected]>.

* The changes below fix [Bug #462317] where Expect tried to
read
  more than was in the buffers and then blocked in the OS
call as
  its pty channel driver provides no blockmodeproc through
which
  the OS could be notified of blocking-behaviour. Because
of this
  the general I/O core has to take more care than usual to
  preserve the semantics of non-blocking channels.

  The problem was reported by "Kevin O'Gorman"
  <[email protected]>.

* generic/tclIO.c (Tcl_ReadRaw): Do not read from the driver if
  the channel is non-blocking and the fileevent causing the
read
  was generated by a timer. We do not know if there is data
  available from the OS. Instead of going to the OS for
more and
  potentially blocking we simply signal EWOULDBLOCK to the
higher
  levels to cause the system to wait for true fileevents.
  (GetInput): Same as before.
  (ChannelTimerProc): Added set and clear of CHANNEL_TIMER_FEV.

* generic/tclIO.h (CHANNEL_TIMER_FEV): New flag for
channels. Is
  set if a fileevent was generated by a timer, the channel
is not
  blocking and the driver did not provide a blockmodeproc.
In that
  case the I/O core has to be especially careful about going to
  the driver for more data.

andreas_kupries added on 2004-05-05 23:32:00:
Logged In: YES 
user_id=75003

I skipped right over the timer_fev. sometimes I am to near
to the code. The timer_fev is set if the fileevent handler
the read is in was invoked not by a real fileevent, but by a
fake file event generated by a timer, to flush data kept in
the buffers ... Interesting, set if and only if there is
blockmode proc, i.e. if the generic layer cannot know if the
underlying driver will block or not. And 'expect' fits that
to a tee. I have to gather more information about timer_fev
it seems, when it was introduced, changelog comments from
that time etc.

In the meantime, testscript. Yes please. Please try to
reduce your application to something small (and without
large IP of yours) and attach it to this bug report.

If I can invoke a script which is selfcontained and reliably
hangs itself debugging will be much easier for me. 

Regarding telling that there is more data on the channel ...
Tcl cannot, not directly. For drivers with a blockmodeproc
it can tell the driver whether to read blocking or
non-blocking, and in the non-blocking case the driver ensure
that it stops reading the moment the os tells it that it has
no more data right now. Expect is a driver without a
blockmodeproc, and so Tcl cannot really tell when there is
no more data, and has to be extra careful to call on the OS
only if there should be data present. This timer_fev flag I
overlooked earlier has exactly this purpose. And maybe there
is something wrong in the logic of its handling. I don't
know yet, I have to read more changelog.

Thanks.

mellenbe added on 2004-05-05 07:55:01:
Logged In: YES 
user_id=1030524

As far as I know we do not change anything in the way the 
channel is set up.   In addition, these same scripts work fine 
with Tcl7.5 and Expect5.21.

There are two conditions checked at 5298, 
CHANNEL_NONBLOCKING was always on and I never remember 
seeing CHANNEL_TIMER_FEV.  Both conditions have to be 
true to avoid the read.   This is how we got to the else 
condition.

I believe our channel is always in non-blocking mode just as 
expect set it up.  I didn't save any of the traces I did after I 
instrumented several of these routines.

The sequence I saw in DoReadChars is this,
At 4411 we enter the for loop  
At 4417 ReadChars is called and sets copiedNow.  After this 
call, the data that was copied is the last of the data from our 
server and contains our logical eof consisting of 3 % signs 
(i.e., the prompt).  There is no more data to expect from the 
server after the prompt is seen.

Next I believe we fall through to the logic at 4437 and take 
the else condition at 4455 where copied and toRead are 
updated.  Remember at this point there is not more data so 
next time through the for loop we somehow end up making 
the function call at 4447.  Since the server has no more 
data, the read hangs.  

I think when the data was returned by ReadChars expect 
should have been given an opportunity to examine it before 
the next read.

How can Tcl tell when there is no more data at the channel?  
Our server is long running.  There are not eof or other 
indicators sent.  Logical eof is determined by looking for a 
prompt in the data, it's totally application specific.

Sometimes I can hang this right away with my test script 
other times it takes longer.  I hope this helps.  Let me know if 
you need more information or the test script.  I can 
reproduces this from solaris to solaris over a network.

Mike

andreas_kupries added on 2004-05-05 06:28:57:
Logged In: YES 
user_id=75003

I am trying to understand this and have some questions. You
said that GetInput is blocking in the read ... Line 5303 ...
Ok, just before that the generic IO system looks if the
channel is non-blocking or not. It tries to read only if the
channel is in blocking mode. So your expect channel is in
blocking mode, right ? On the other hand, when expect
creates a channel it sets this chanel to -blocking 0, i.e.
non-blocking (exp_chan.c, line 520, function expCreateChannel).

So, does your application (or a package it uses) switch the
expect channel into blocking mode ? If yes, can you disable
that part and check if your application will still work
correctly ?

From the Tcl core's perspective the behaviour is correct:

- On the next to last round Expect sees that it has not much
room in expect_out and ask for less data than Tcl has in its
buffers, and Tcl returns that.
- On the next round Expect asks for much more data than Tcl
has in the buffers, and so Tcl goes looking for more, and as
the channel is blocking it also knows that it should wait
for the OS to deliver the data. If the channel had been
non-blocking Tcl would not only see that the OS is currently
out of data but would also abort the operation and simply
return what data it has, instead of trying to forcibly
fulfill the request completely.

Attachments: