Ticket UUID: | 462317 | |||
Title: | Notification when there's no data | |||
Type: | Bug | Version: | None | |
Submitter: | kogorman | Created on: | 2001-09-17 18:54:02 | |
Subsystem: | 25. Channel System | Assigned To: | andreas_kupries | |
Priority: | 7 High | Severity: | ||
Status: | Closed | Last Modified: | 2001-09-27 09:27:03 | |
Resolution: | Fixed | Closed By: | andreas_kupries | |
Closed on: | 2001-09-27 02:27:03 | |||
Description: |
Using expect5.32 with Tcl8.4a3, my expect script wedges. I have used strace(1) and printfs to find that ChannelTimerProc is calling TclNotifyChannel, which causes expect to perform a blocking read. This is usually okay, but sometimes there's no data, and the read blocks forever, because the channel corresponds to a child which is waiting for a command. Deadlock. I can tell from the system call trace that the program does NOT have enough information to know whether the read will block, but it could guess that it would because the previous read returned a short reply. However, the file descriptor has not been tested by select(), so the issuance of the notify is probably bogus. My guess is that there should be something cancelling the timer, since it was started before the last successful read, and has just been running since then. I have hundreds of megabytes of trace, but don't understand the internals well enough to make much more sense out of them than the above, though I'm working on it. I can boil the traces down quite a bit if anyone can work with me on this. Sorry I can't give a concise code sample, because I don't know how to provoke the error without my 1GB Oracle database. I have, however, included a boiled-down tail of one of the traces. | |||
User Comments: |
andreas_kupries added on 2001-09-27 09:27:03:
Logged In: YES user_id=75003 Committed to head and 8.3.3/cvs (core-8-3-1-branch). andreas_kupries added on 2001-09-27 09:08:27: File Added - 11261: 462317.diff.0 Logged In: YES user_id=75003 In direct interaction with Kevin (email) it was determined that the following interacting factors generated the problem reported here. A patch to the Tcl I/O core was created and tested extensively. The patch survives the testsuite and also removes the reported deadlock. The patch here is the second one to fix the bug. The first patch was done against Expect itself and whipped up to provide a more immediate workaround while trying to find the trouble spot. - Tcl had a number of bytes in its input buffers and used a timer to generate artificial fileevents to drain them. - Expect received these events and read data from the channel. It also asked for more data than was buffered. - Epect had set the pty channel to non-blocking and therefore expected to get short reads if asking for data than available. - The associated pty channel driver did not provide a blockmodeproc through which the Tcl I/O core tells a driver whether a channel is blocking or not. IOW, while the Tcl I/O core had to provide non-blocking semantics the underlying OS call was actually blocking. - When Expect asked for more data than available the Tcl I/O core started to read from OS, assuming that the underlying OS call would not block as the channel was set non-blocking. The call was blocking instead and closed the chain of dependencies, creating the deadlock. Solution: The Tcl I/O is changed to - assume that the underlying OS calls are blocking if a driver does not provide a blockmodeproc. - never try to read from the OS if it is not certain that data is available, non-blocking behaviour is expected by the caller and the underlying OS call assumed to be blocking. The last item means in practice that read requests generated while handling a timer generated fileevent are restricted to the data in the input buffers of the channel in question. Trying to read more will signal EWOULDBLOCK and cause the caller to see a short read. andreas_kupries added on 2001-09-19 23:38:42: Logged In: YES user_id=75003 Ok, a freeze is no good either. Ok, the simpler clients should be something we can do. select cannot and will not protect against the script or other extension reading data from a blocking channel without having made sure that there is data. Ok, in this case the internals on how tcl handles its buffers and when it goes to the OS for more also have a hand here. That is the trouble, it seems to be a problem in the interaction of tcl internals and expect. I am beginning to feel out of my depth because my knowledge of pty's and the expect internals is limited. I will now summarize what we found so far and start talking to Don Libes to see if he can give me additional pointers. kogorman added on 2001-09-19 08:35:43: Logged In: YES user_id=326095 "Dies horribly" was perhaps an overstatement. It freezes. It does not exactly crash. You could perhaps duplicate this with a simpler script and simpler client. If you want to just make simple clients, they could be this simple: call it 'sqlplus', put it in your path 1) produce the SQL> prompt (I think there's a space aftger the > 2) take an interactive command 3) echo the command (if that's not automatic) 4) wait a while, up to 15 minutes or so. 5) produce a header line, then a line with all hyphens (-). 6) produce a lot of output, with NO blank lines. 7) produce a blank line 8) produce some more stuff, more free-format perhaps 100 lines 9) go back to 1 To provoke the error, you may need 2 or 3 clients producing a LOT of output at once. This has always seemed to be going on when the error happened. That is, make them take step 5 simultaneously, then fill the pipeline for a while, perhaps 20,000 bytes. Finally, I was under the impression the select lists were built inside tcl. That's what's confusing to me about this. If tcl is doing the select() calls, I supposed it would be using them to protect from blocking reads. That's the behavior Expect is relying on. andreas_kupries added on 2001-09-19 07:11:14: Logged In: YES user_id=75003 When you said 'expect dies horribly' I expected that expect generated a coredump. You might have to use the 'ulimit' command to allow this. See the current settings of your shell via 'ulimit -a'. andreas_kupries added on 2001-09-19 07:09:15: Logged In: YES user_id=75003 I talked about this line 17:03:34 select(23, [4 6 8 10 12 16 18 20 22], [], [4 6 8 10 12 14 16 18 20 22], {0, 0}) = 0 (Timeout) ... Ok, the manpage of select says that the read list is the second argument ... 14 is missing. It shouldn't be, there is a Tcl_CreateFileHandler call before that without intervening Tcl_DeleteFileHandler. Understand that we are talking about two levels of bufering/data here. One is data buffered in the OS for the process. The other is data buffered by Tcl itself. Expect is given a notification that there is some data available. This data happens to be in the tcl buffers. In our situation expect asks tcl for more than is in the buffers, so tcl asks the OS and given that the 'read' is blocking but as the tcl core believes that it will perform a nonblocking call (line 512) tcl is perfectly right to do this call. It expect that the call will give it a short read. And hangs because unexpectedly the call is actually blocking. If I understand your situation correctly the external process waits for a information send to it by expect and expect expects (sic!) the same channel to send something to it (after it already got partial information ?). This is something I should be able to reproduce with a simpler script. Right ? kogorman added on 2001-09-19 07:06:16: Logged In: YES user_id=326095 It's intel/linux. It seems to be non-threaded, at least when I've fooled with the code the threads stuff seems not to compile. I can rebuild with symbols easily enough. There are no coredumps now, do you want me to make one? Could you remind me how that's done (some switch on the kill command?). kogorman added on 2001-09-19 06:53:50: Logged In: YES user_id=326095 Print what lists? The printing in the trace is all done by strace(1). Even the printf() output is here because strace catches the underlying write(2) call. That actual output goes to a log file separately from the trace. It seems expect has been given a notification, so it's doing a read. It hasn't tested how much data there is, so is relying on semantics much like a unix read(2) from a socket, which will give a short reply if there's not enough data. It only blocks if there's no data at all. I think the right question to be asking is if there's a way to get tcl to act like that, and if so, what is missing? If that behavior is not available at all, then Don Libes has a harder job, but we're going to need to know the details of what he can rely on and what he can't. andreas_kupries added on 2001-09-19 06:52:52: Logged In: YES user_id=75003 Ok, this is not an option then ... Hm, what platform are you on ? If it is sparc/solaris or intel/linux I could have a look at the coredump ... Did you configure and compile using --enable-symbols ? ... Oh, yes I also assumed a non- threaded built (Better ask). kogorman added on 2001-09-19 06:44:44: Logged In: YES user_id=326095 I tried your solution, and Expect dies horribly. It doesn't even install because the install target in the makefile uses expect, which wedges immediately. We need something else. andreas_kupries added on 2001-09-19 06:39:30: Logged In: YES user_id=75003 Just determining the amount buffered and reading that ... Yes, we could try that in the highlevel code of expect. Does this mean the my suggesstion was not sucessful ? Adding (and removing) a channel to (from) the read/write lists of select happens automatically. This is done by 'Tcl_CreateFileHandler' ('Tcl_DeleteFileHandler'). I haven't looked deeper into expect for this, there is lot going on with arming and disarming and I don't know what. Looking at your trace and the 'select' lines, in what order do you print the lists ? kogorman added on 2001-09-19 06:23:25: Logged In: YES user_id=326095 Would it be good enough for Expect to determine the amount of data and just ask for that. Failing that, how should Expect get Tcl back into putting the channel in the select() read-list and only notifying when a read won't block? Or am I talking nonsense? andreas_kupries added on 2001-09-19 05:55:04: Logged In: YES user_id=75003 Ok, it really looks like the situation I thought. expect reads more than is in the buffers, goes to the OS and then gets stuck in the low-level read. That the timer exists and fires is correct, there is data to read (in the buffers). Just not enough to satisfy expect. While looking further through the expect code, beginning with ExpInputProc I noted an inconsistency in the expect channel code. See the following excerpts from exp_chan.c: Line 53: Tcl_ChannelType expChannelType = { "exp",/* Type name. */ /* Expect channels are always blocking */ NULL,/* Set blocking/nonblocking mode.*/ Line 115 (ExpInputProc) /* * Assume there is always enough input available. This will block * appropriately, and read will unblock as soon as a short read is * possible, if the channel is in blocking mode. If the channel is * nonblocking, the read will never block. */ Line 512 (expCreateChannel): Tcl_SetChannelOption(interp,esPtr->channel,"- blocking","0"); In other words, several parts of expect seem to disagree whether the channel is blocking or nonblocking. I fear that the last command (line 512) leaves the generic I/O layer in the tcl core a bit confused. It tries to handle the channel as non-blocking but the low-level file descriptor was never told so. This may allow the generic I/O core in some situations to try to read from the channel when it would not try the same for a blocking channel. Please try to comment out line 512 in exp_chan.c and see if the problem is still there afterward. andreas_kupries added on 2001-09-19 04:45:52: Logged In: YES user_id=75003 There is a size limit yes, but they SF was/is also balky today. I had quite a trouble to login this morning. kogorman added on 2001-09-19 04:35:35: Logged In: YES user_id=326095 I had trouble uploading this, but the big trace is there now, gzipped. Is there a size limit, or was the internet just being balky today? ++ kevin kogorman added on 2001-09-19 04:28:55: File Added - 10944: F022d.tail3.gz kogorman added on 2001-09-19 04:09:29: Logged In: YES user_id=326095 I'm attaching a trace (big, but much smaller than the original), which may have what you need. It goes from the last select() call I found which detected fd 14 readable, up to just after a read of fd 14 wedges. ++ kevin andreas_kupries added on 2001-09-19 00:47:36: Logged In: YES user_id=75003 Basically I would like to see how many characters expect tries to read and how much is in the tcl buffers at that time. That would be in 'expect.c/expIRead'. Also how much bytes are in the buffer whenever that timer fires. kogorman added on 2001-09-19 00:33:44: Logged In: YES user_id=326095 I've been adding even more printf's to the code. Is there anything that would be particularly helpful to you? If you want info on tcl buffers, where would that be helpful, and how best to show it? ++ kevin kogorman added on 2001-09-19 00:09:37: Logged In: YES user_id=326095 I've been adding even more printf's to the code. Is there anything that would be particularly helpful to you? If you want info on tcl buffers, where would that be helpful, and how best to show it? ++ kevin andreas_kupries added on 2001-09-19 00:08:59: Logged In: YES user_id=75003 Huh, complicated script ... I haven't done much expect so I hope this question is not stupid: do you believe it is possible to use [after] instead of managing the timeouts on your own ? (This is an aside question hopefully not connected to the problem). As for looking into the Tcl buffers there is 'Tcl_Inputbuffered' returning the number of bytes of input currently buffered in the common internal buffer of a channel. There is also 'Tcl_InputBlocked' which returns 1 if input is blocked on this channel, 0 otherwise. Last we have 'Tcl_Eof' which returns 1 if the channel is at EOF, 0 otherwise. Beyond that you have to query the datastructures directly. Regarding "The channel is the pipe to a child of expect, and as far as I know, there's no control at the script level about blocking that.", I assume this to mean that we have no direct access to the channel allowing us to perform [fconfigure $channel -blocking 0]. andreas_kupries added on 2001-09-18 23:52:51: Logged In: YES user_id=75003 Regarding the many calls of 'time()', this is part of the notifier. Explanation: The relevant functions are 'TimerSetupProc' and 'TimerCheckProc' in 'tcl/generic/tclTimer.c'. Whenever tcl starts to wait for events it will use 'gettimeofday' to retrieve the current time down to the usec and then uses this to compute the timeout required for 'select' based upon the delay times for the known timers (after ...). Similarly when it is done with waiting for events, i.e. 'select' returned, it queries the time again to see which of the timers expired and have to queue events. kogorman added on 2001-09-18 23:17:13: File Added - 10933: F022e.small kogorman added on 2001-09-18 09:51:58: Logged In: YES user_id=326095 Here's a trace with __FILE__ and __LINE__ in the printfs. It also tracks the TimerHandler tokens, to see how they run. I don't really understand them, but I highlight token 1ffa because it seems to be connected with the failure. kogorman added on 2001-09-18 07:21:29: File Added - 10905: Team kogorman added on 2001-09-18 07:21:28: Logged In: YES user_id=326095 I have been talking with Don Libes. He thinks this is a Tcl problem. I was originally having the problem with the stock Expect from RedHat 7.1, and he got me to try the later release. I can look up details if you really need them. The channel is the pipe to a child of expect, and as far as I know, there's no control at the script level about blocking that. The trace includes all system calls, expect or otherwise, from the given process. I removed some that were too prolix (do you really need to call time() that often? it's 90% of the system calls). Some of the printfs are in expect, some in Tcl. My own guess is that this is only going to be a problem with multiple channels, producing blocks of data bigger than expect's buffer (4096), so that there is data left after a read. Then, as near as I can tell, the timer is immortal and can fire later at an inappropriate time. It won't get a chance if there's only one channel active, because of the details of the code. But with multiple channels operating concurrently, ..... I'll work on __FILE__ and __LINE__, but you know, my printfs are going to make hash of LINE. They already mostly say where they are. Finally, I'm not sure how to look into what's in the tcl buffers. See the attached script. It isn't going to work unless you can run 10 SQL*Plus clients against a TPC database. :o( andreas_kupries added on 2001-09-18 05:55:47: Logged In: YES user_id=75003 Some more comments. Given the last comments I believe you should have a look at how much is in the tcl bufers and how much the channel handler in expect tries to read from the channel. Thanks for the trace btw. it helped a lot. A suggestion nevertheless: Adding __FILE__ and __LINE__ to the output would allow even better to pinpoint where each write call is. andreas_kupries added on 2001-09-18 05:52:20: Logged In: YES user_id=75003 Does the deadlock happen with Tcl 8.3.3. too ? Please attach the expect script too, if possible. (Just so that we can take a look at the tcl side of the problem too). It might be interesting [x] to switch the channel into non- blocking mode and see what happens. [x] I do not know how much control we have over the created channel. Note: The timer in question is activated when bytes are waiting in the input queue of the channel in question. It basically generates artifical fileevents to push these bytes into the channel handlers. Problem is, if there are for example 12 bytes waiting in the tcl buffers and then a blocking read is issued from the eventhandler for more than 2 bytes the system will try to read the missing bytes from the OS. This can cause the deadlock too. Are you able to trace execution inside of expact too ? kogorman added on 2001-09-18 01:54:06: File Added - 10883: F022d.tail |
Attachments:
- 462317.diff.0 [download] added by andreas_kupries on 2001-09-27 09:08:27. [details]
- F022d.tail3.gz [download] added by kogorman on 2001-09-19 04:28:53. [details]
- F022e.small [download] added by kogorman on 2001-09-18 23:17:13. [details]
- Team [download] added by kogorman on 2001-09-18 07:21:28. [details]
- F022d.tail [download] added by kogorman on 2001-09-18 01:54:06. [details]