Tcl Source Code: View Ticket

Ticket UUID:	943274
Title:	Tcl 8.4.5 with Expect 5.39 hangs waiting for data matching p
Type:	Bug	Version:	obsolete: 8.4.6
Submitter:	mellenbe	Created on:	2004-04-27 19:48:55
Subsystem:	25. Channel System	Assigned To:	andreas_kupries
Priority:	7 High	Severity:
Status:	Closed	Last Modified:	2004-05-27 23:06:51
Resolution:	Fixed	Closed By:	andreas_kupries
		Closed on:	2004-05-19 19:44:08
Description:	See the thread by the same name at comp.lang.tcl for more information. Running this on solaris. We have a large testing framework written in tcl/expect. The framework executes on one platform (solaris) and uses spawn to start rsh on another platform. The shell is left running for the life of a test run which can last several days. Commands are sent to the rsh shell and expect is used to parse the input as it is returned. We recently upgraded to more current versions of tcl 8.4.5 and expect 5.39 (from tcl 7.5 and expect 5.21). What I have found is when the rsh shell returns large amounts of data (approx 150,000 bytes) it is very likely to produce a hang. Back trace copied from comp.lang.tcl, #0 0xff21da28 in _read () from /usr/lib/libc.so.1 #1 0x000284c8 in ExpInputProc (instanceData=0x176078, buf=0x1784f0 "rver\r\n137.69.53.234\tpv2- mv\r\n137.69.51.200\tpacket\r\n137.69.3.186 \tmarvel\r\n137.69.12.162\tcal162\r\n137.69.9.235 \tbutiki\r\n137.69.5.45\tableix\r\n137.69.121.160 \tbelem\r\n137.69.101.155\tubik\r\n137.69.103.118 \trunt\r\n13"..., toRead=4096, errorCodePtr=0xffbfe260) at exp_chan.c:122 #2 0x000b1148 in GetInput (chanPtr=0x173a48) at ./../generic/tclIO.c:5303 #3 0x000afb98 in DoReadChars (chanPtr=0x173a48, objPtr=0x164b88, toRead=2092, appendFlag=1) at ./../generic/tclIO.c:4447 #4 0x000af92c in Tcl_ReadChars (chan=0x173a48, objPtr=0x164b88, toRead=4000, appendFlag=1) at ./../generic/tclIO.c:4339 #5 0x000363d8 in expIRead (interp=0x15ba80, esPtr=0x176078, timeout=10, save_flags=0) at expect.c:1652 #6 0x000364b0 in expRead (interp=0x15ba80, esPtrs=0x172f80, esPtrsMax=1, esPtrOut=0xffbfe580, timeout=10, key=12317) at expect.c:1723 #7 0x000383d4 in Exp_ExpectObjCmd (clientData=0x0, interp=0x15ba80, objc=13, objv=0xffbfe9a8) at expect.c:2526 #8 0x00046d18 in TclEvalObjvInternal (interp=0x15ba80, objc=13, objv=0xffbfe9a8, command=0x12d3a8 "", length=0, flags=0) at ./../generic/tclBasic.c:3087 #9 0x00047058 in Tcl_EvalObjv (interp=0x15ba80, objc=13, objv=0xffbfe9a8, flags=0) at ./../generic/tclBasic.c:3203 #10 0x000332a4 in exp_eval_with_one_arg (clientData=0x0, interp=0x15ba80, (gdb) p esPtr $7 = {channel = 0x173a48, name = "exp4", '\0' <repeats 36 times>, fdin = 4, fdout = 4, channel_orig = 0x0, fd_slave = -1, validMask = 14, pid = 29846, buffer = 0x164b88, msize = 4001, umsize = 2000, printed = 4, echoed = 0, rm_nulls = 1, open = 1, user_waited = 0, sys_waited = 0, registered = 1, wait = 0, parity = 1, key = 12317, force_read = 1, notified = 0, notifiedMask = 2, fg_armed = 0, leaveopen = 0, bg_interp = 0x0, bg_ecount = 0, bg_status = unarmed, freeWhenBgHandlerUnblocked = 0, fdBusy = 0, keepForever = 0, valid = 1, nextPtr = 0x170b50} (gdb) p toRead $8 = 4096 (gdb) p bytesRead $9 = 0 Our script looks like this, I added eof and full_buffer to see if either would get executed. They don't. while {1} { expect { -i $ID -re "(\[^\r])(\r\r\n\|\r\n)" { append lines "$expect_out(1,string)\n" continue } "$user_prompt" { puts "prompt" break } eof { puts "\n >>> hit eof" } full_buffer { puts "\n full buffer hit" exit -1 } } The problem is a mix of buffering and blocking IO. Since the rsh is a long running server, we never see an EOF from it. The user_prompt is basically a logical eof we use as "end of data". What I have seen is all data, including the logical eof has been buffered. There is no more to expect from the rsh. During our last trip through expect/tcl a portion of the data was returned but some data would not fit in our buffer (expect_out, I think) so the remainder is left in the tcl buffers. The next time we go through expect/tcl the remainder of the data is returned (including our logical eof) but then the logic in DoReadChars does another read. The read is blocking and we sit there forever. There is a tricky exchange that needs to happen here. I don't know if it is a problem with expect or tcl or both. In our case, when expect is reading, blocking reads shouldn't be issued unless the pattern matching logic has had a chance to look at the data and decided to drive another read. In this case, the logic in DoReadChars copies data to the expect buffer and then, without letting expect see it, does another read. I circumvented the hang with the following update to DoReadChars. It works well with a long running server but causes some of the blocking/nonblocking tests to fail. Original code from DoReadChars around line 4416, for (copied = 0; (unsigned) toRead > 0; ) { copiedNow = -1; if (statePtr->inQueueHead != NULL) { if (encoding == NULL) { copiedNow = ReadBytes(statePtr, objPtr, toRead, &offset) } else { copiedNow = ReadChars(statePtr, objPtr, toRead, &offset, &factor); } } Fixed code, for (copied = 0; (unsigned) toRead > 0; ) { copiedNow = -1; if (statePtr->inQueueHead != NULL) { if (encoding == NULL) { copiedNow = ReadBytes(statePtr, objPtr, toRead, &offset); } else { copiedNow = ReadChars(statePtr, objPtr, toRead, &offset, &factor); >>> New if (copiedNow > 0) { >>> statePtr->flags \|= CHANNEL_BLOCKED; >>> } } } This basically prevents another read and allows a return to user code. The user will know if another read is necessary. Contact me if you need more information. Mike
User Comments:	andreas_kupries added on 2004-05-27 23:06:51: Logged In: YES user_id=75003 The application to apply a patch is called 'patch'. If you are on a normal unix system chances are that you have it already. Start with patch --dry-run < the_patch_file and if it asks for filenames use the -p option (-p0, -p1, ...) until it stops asking. --dry-run means that it will _not_ modify the output files, i.e. a test mode. The option -b ensures that it leaves .orig files behind, for comparison. .rej files mean that parts of the patch could not be applied (did not find where to apply, possibly due to interfering changes). 'man patch' will give its manpage. If your system doesn't have it google for 'gnu patch'. mellenbe added on 2004-05-27 22:55:00: Logged In: YES user_id=1030524 Thanks Andreas, I hadn't noticed that you had attached a patch. What do I need to do to apply this to my source? Thanks, Mike andreas_kupries added on 2004-05-21 04:11:26: File Added - 87854: tcl_e.patch.gz andreas_kupries added on 2004-05-21 04:09:53: Logged In: YES user_id=75003 Yes, you would get anything between 8.4.5 and 8.4.6+. Bugs: According to the ChangeLog actually a lot were fixed. Yes, a new Tk is needed. If you wait a moment I will do a diff, and attach the resulting patch. mellenbe added on 2004-05-21 03:59:08: Logged In: YES user_id=1030524 So I would checkout core-8.4-branch? Then I pick up this fix and anything else that went in between 8.4.5 and 8.4.6+? Should I expect to find new bugs in this version? Against my better judgement we did builds on 8 platforms. Now if we move to 8.4.6 do I need to get a new version of Tk? Mike andreas_kupries added on 2004-05-21 03:36:53: Logged In: YES user_id=75003 There is no release containing this. The only way to get the change now is to check the sources directly out of the sourceforge cvs. Either CVS head, which is Tcl 8.5aXXX in development, or from the 'core-8-4-branch', which is Tcl 8.4.6 +. The latter is most likely what you want. mellenbe added on 2004-05-21 03:30:57: Logged In: YES user_id=1030524 Good news Andreas. How do I get the updated source? I'd like to build again and see if it fixes the problem. Mike andreas_kupries added on 2004-05-20 02:44:08: Logged In: YES user_id=75003 Fixed through changes in the Tcl core. For channels based on drivers where we don't have complete control over blocking we now keep track iof events coming from the OS, and thus, if there is actually data waiting. If not the driver is now bypassed, avoiding the block. The change has been committed to both 8.4 branch and CVS head (8.5 devel branch). nobody added on 2004-05-12 03:07:20: Logged In: NO I will try creating the file as you outlined and running the test driver. I may not be able to do this till next week (5/17 - 5- 21). I'll let you know. Using telnet instead of rsh probably affects the way this works. Using rsh is what is used in our environment. There is some magic with "ypcat hosts" in our environment too but I have produced the hang using cat and the output from ypcat. There may be several problems in this area; especially given the wide variety of OS's this runs on. Again, we see the hang when Tcl does a read (when it shouldn't) that blocks. All our data has already been passed to the expect_out buffer. Mike andreas_kupries added on 2004-05-11 03:15:20: Logged In: YES user_id=75003 Michael, I use the following script to generate a large file: for {set i 0} {$i < 4096} {incr i} { puts "[format %04d $i] \| 0123456789 0123456789 0123456789 0123456789 0123456789 0123456789 0123456789 0123456789 0123456789 012" } 4096 lines, each 110 characters long. The first 4 characters are a line counter. The interesting thing was that at least for linux the hang happened in the middle of line 4084. Near the end, but nowhere near the prompt. It should not block at that point, more data _is_ coming in as I see it. Can you check where you are hanging with this file ? andreas_kupries added on 2004-05-11 01:27:38: Logged In: YES user_id=75003 Back to this again ... I was able to repro the hang, running on Linux. Using the ActiveState port of Expect, for Windows I do not get a hang, but a 'full buffer hit'. I don't know if that is also a bug or not. For that test I had to change the script to use 'telnet' instead of 'rsh' (The windows rsh is crippled and does not give you a shell). Turning back around and trying the script with telnet on linux I get a hang as well, but can take longer for it to happen. Oh, and I needed a file on the order 400-500 K to make the hang happen. andreas_kupries added on 2004-05-06 01:46:29: Logged In: YES user_id=75003 Thanks for the script. I will try it and see what happens. We don't have yellow pages I believe. But cat'ting a large file is easy. mellenbe added on 2004-05-06 01:17:21: File Added - 86186: zt3.exp Logged In: YES user_id=1030524 Hi Andreas, I've attached a script you can use to produce the hang. You will have to modify it for your environment, I've added << indicators in obvious spots. I run it over a network, spawns a shell on another machine (i.e., what I refer to as the long running server). I feed it commands, 'ypcat hosts' is killing us but 'cat' of a large file seems to also do it. The prompt is just our protocol. You can adapt this to your environment. I played around with different match_max values. A smaller value may make it fail faster. System and network activity plays a part in this. I hope this helps. Mike andreas_kupries added on 2004-05-05 23:51:16: Logged In: YES user_id=75003 Having reread all the comments at http://sourceforge.net/tracker/index.php?func=detail&aid=462317&group_id=10894&atid=110894 I tend even more to the conclusion that the current problem is in essence the same as the one described there, but different enough in the details to not be caught by the code we added for #462317. andreas_kupries added on 2004-05-05 23:37:36: Logged In: YES user_id=75003 Found the entry in the changelog after a cvs annotate ... Here it is. And this seems to be exactly about the situation you are encountering. Seems that I have to revisit this code. 2001-09-26 Andreas Kupries <[email protected]> * doc/fileevent.n: Accepted [Patch #465279] adding an example to the fileevent manpage. Minor modifications to get a better formatting. Report and patch by David N. Welton <[email protected]>. * The changes below fix [Bug #462317] where Expect tried to read more than was in the buffers and then blocked in the OS call as its pty channel driver provides no blockmodeproc through which the OS could be notified of blocking-behaviour. Because of this the general I/O core has to take more care than usual to preserve the semantics of non-blocking channels. The problem was reported by "Kevin O'Gorman" <[email protected]>. * generic/tclIO.c (Tcl_ReadRaw): Do not read from the driver if the channel is non-blocking and the fileevent causing the read was generated by a timer. We do not know if there is data available from the OS. Instead of going to the OS for more and potentially blocking we simply signal EWOULDBLOCK to the higher levels to cause the system to wait for true fileevents. (GetInput): Same as before. (ChannelTimerProc): Added set and clear of CHANNEL_TIMER_FEV. * generic/tclIO.h (CHANNEL_TIMER_FEV): New flag for channels. Is set if a fileevent was generated by a timer, the channel is not blocking and the driver did not provide a blockmodeproc. In that case the I/O core has to be especially careful about going to the driver for more data. andreas_kupries added on 2004-05-05 23:32:00: Logged In: YES user_id=75003 I skipped right over the timer_fev. sometimes I am to near to the code. The timer_fev is set if the fileevent handler the read is in was invoked not by a real fileevent, but by a fake file event generated by a timer, to flush data kept in the buffers ... Interesting, set if and only if there is blockmode proc, i.e. if the generic layer cannot know if the underlying driver will block or not. And 'expect' fits that to a tee. I have to gather more information about timer_fev it seems, when it was introduced, changelog comments from that time etc. In the meantime, testscript. Yes please. Please try to reduce your application to something small (and without large IP of yours) and attach it to this bug report. If I can invoke a script which is selfcontained and reliably hangs itself debugging will be much easier for me. Regarding telling that there is more data on the channel ... Tcl cannot, not directly. For drivers with a blockmodeproc it can tell the driver whether to read blocking or non-blocking, and in the non-blocking case the driver ensure that it stops reading the moment the os tells it that it has no more data right now. Expect is a driver without a blockmodeproc, and so Tcl cannot really tell when there is no more data, and has to be extra careful to call on the OS only if there should be data present. This timer_fev flag I overlooked earlier has exactly this purpose. And maybe there is something wrong in the logic of its handling. I don't know yet, I have to read more changelog. Thanks. mellenbe added on 2004-05-05 07:55:01: Logged In: YES user_id=1030524 As far as I know we do not change anything in the way the channel is set up. In addition, these same scripts work fine with Tcl7.5 and Expect5.21. There are two conditions checked at 5298, CHANNEL_NONBLOCKING was always on and I never remember seeing CHANNEL_TIMER_FEV. Both conditions have to be true to avoid the read. This is how we got to the else condition. I believe our channel is always in non-blocking mode just as expect set it up. I didn't save any of the traces I did after I instrumented several of these routines. The sequence I saw in DoReadChars is this, At 4411 we enter the for loop At 4417 ReadChars is called and sets copiedNow. After this call, the data that was copied is the last of the data from our server and contains our logical eof consisting of 3 % signs (i.e., the prompt). There is no more data to expect from the server after the prompt is seen. Next I believe we fall through to the logic at 4437 and take the else condition at 4455 where copied and toRead are updated. Remember at this point there is not more data so next time through the for loop we somehow end up making the function call at 4447. Since the server has no more data, the read hangs. I think when the data was returned by ReadChars expect should have been given an opportunity to examine it before the next read. How can Tcl tell when there is no more data at the channel? Our server is long running. There are not eof or other indicators sent. Logical eof is determined by looking for a prompt in the data, it's totally application specific. Sometimes I can hang this right away with my test script other times it takes longer. I hope this helps. Let me know if you need more information or the test script. I can reproduces this from solaris to solaris over a network. Mike andreas_kupries added on 2004-05-05 06:28:57: Logged In: YES user_id=75003 I am trying to understand this and have some questions. You said that GetInput is blocking in the read ... Line 5303 ... Ok, just before that the generic IO system looks if the channel is non-blocking or not. It tries to read only if the channel is in blocking mode. So your expect channel is in blocking mode, right ? On the other hand, when expect creates a channel it sets this chanel to -blocking 0, i.e. non-blocking (exp_chan.c, line 520, function expCreateChannel). So, does your application (or a package it uses) switch the expect channel into blocking mode ? If yes, can you disable that part and check if your application will still work correctly ? From the Tcl core's perspective the behaviour is correct: - On the next to last round Expect sees that it has not much room in expect_out and ask for less data than Tcl has in its buffers, and Tcl returns that. - On the next round Expect asks for much more data than Tcl has in the buffers, and so Tcl goes looking for more, and as the channel is blocking it also knows that it should wait for the OS to deliver the data. If the channel had been non-blocking Tcl would not only see that the OS is currently out of data but would also abort the operation and simply return what data it has, instead of trying to forcibly fulfill the request completely.

Attachments:

tcl_e.patch.gz [download] added by andreas_kupries on 2004-05-21 04:11:26. [details]
zt3.exp [download] added by mellenbe on 2004-05-06 01:17:21. [details]