Tcl Source Code: View Ticket

Ticket UUID:	1323992
Title:	socket.test failures
Type:	Bug	Version:	obsolete: 8.4.11
Submitter:	dgp	Created on:	2005-10-11 15:58:41
Subsystem:	27. Channel Types	Assigned To:	vasiljevic
Priority:	9 Immediate	Severity:
Status:	Closed	Last Modified:	2005-10-15 00:38:51
Resolution:	Rejected	Closed By:	vasiljevic
		Closed on:	2005-10-14 17:38:51
Description:	I'm seeing tests socket-9.2 and socket-11.13 fail consistently, but only on one platform. I suspect the real problem is on the system and not in Tcl, but I don't know how to verify that. These tests are failing because the [vwait] is timing out. The tests are not well constructed to communicate that scenario to the user. I had to hack the tests to verify that's what is happening.
User Comments:	vasiljevic added on 2005-10-15 00:38:51: Logged In: YES user_id=95086 Backoff parts responsible for failures of 10.13 Tcl test in favour of external events handling as suggested by AK. Commited to both 8.4 and 8.5 branches. vasiljevic added on 2005-10-14 23:43:54: Logged In: YES user_id=95086 Ah! This is what u mean... I see... Well, to be honest, this would do the trick allright, but it would need to be heavily documented in the Tcl_CCH API, i.e: Tcl_ClearChannelHandlers removes all channelhandlers and event scripts associated with the specified channel, thus shutting down all event processing for this channel. It will however NOT disable events in the event queue of this thread waiting to fire. If you are calling this function before Tcl_CutChannel in order cut the channel from the current thread, you will need to make sure those events are either processed or discarded. Otherwise you are risking late references to an already cut'ed channel which will result in memory corruption. A kind of above thing should be present in the Tcl_CCH manpage. Hm... not that nice for an API. But I understand this is as far as we can get w/o turning everything upside down. OK. I can live with that. andreas_kupries added on 2005-10-14 23:26:23: Logged In: YES user_id=75003 There is no difference with regard to the functionality the two pieces of code do. However "EventProcessingOff" (EPO) is something we can implement and use from _outside_ of the core, as we use only public APIs. I.e. in packages. My proposal is to completely remove the watchProc call from Tcl_CCH and instead implement EPO in the Thread package, and call it in the appropriate place of the transfer sequence. That way we can leave the Tcl IO core unmodified and still get channel transfers which do not dump core. vasiljevic added on 2005-10-14 22:05:30: Logged In: YES user_id=95086 Andreas, What is the difference between: static void EventProcessingOff (chan) { chan = Tcl_GetTopChannel (chan); watch = Tcl_ChannelWatchProc (chan) inst = Tcl_GetChannelInstanceData (chan); watch (inst, 0); } and what I've done in tclIO.c: (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0); ???? I do not see any difference. Perhaps you wanted to say that this should be done elsewhere and/or before or after some other thing (dunno) but to me the both above are practically equivalent. dgp added on 2005-10-14 04:19:58: Logged In: YES user_id=80530 Just for the record, the commit to HEAD corrects the test failures in the default build, but they are still present in an --enable-threads build. vasiljevic added on 2005-10-14 04:02:57: Logged In: YES user_id=95086 Yes. It is 11.13 test which is failing. I have added this into the both 8.4 and 8.5 branches. As I see Andreas has some ideas and I will have to first understand them. I will leave this for tomorrow (its pretty late here now). hobbs added on 2005-10-14 03:10:53: Logged In: YES user_id=72656 The patch said 11.3 in the C comments - it's 11.13 (oops). hobbs added on 2005-10-14 03:08:02: File Added - 152432: tclio.patch hobbs added on 2005-10-14 03:08:00: Logged In: YES user_id=72656 Attached is my simple but descriptive interim patch for this. I'll leave it to Zoran to commit. andreas_kupries added on 2005-10-14 02:06:58: Logged In: YES user_id=75003 Modulo variable declarations and type casts ... static void EventProcessingOff (chan) { chan = Tcl_GetTopChannel (chan); watch = Tcl_ChannelWatchProc (chan) inst = Tcl_GetChannelInstanceData (chan); watch (inst, 0); } andreas_kupries added on 2005-10-14 02:03:07: Logged In: YES user_id=75003 I agree with the ifdef THREADS as a short-term solution. From the digging Don has done it seems to me that during the _regular_ closing of channel Tcl_ClearChannelHandlers (Tcl_CCH) is called at a very early point in time, when it still may generate events for flushing in the background. And it is expected that these events are processed, and full shutdown of the channel happens only after that has happened. So our belief that we could co-opt Tcl_CCH as the point where event-processing is switched off in preparation for the transfering of a channel was wrong. We need it, but because of the above it must only remove the parts generated events, and not the parts processing those still in the pipeline. This part of the transfer has to go somewhere else. We might try to co-opt a different function for this, or we may need a new public API. Zoran, we will have to go back to the mails we exchanged, dig out the current code sequences used by the channel transfer and see if one of the other functions used can be co-opted instead. If not, then a new API will be needed to perform the action. And if we come to that I will also entertain the notion of moving the whole transfer process and structures into the core, as a Tcl_TransferChannel (chan, destinationthread) or some such. Because we would need a TIP anyway in that case. The threads package would then contain only the Tcl command around that API. ... I belay that. We do not need new APIs. We can use the existing APIs to get the instance data of a channel, its watchproc, and then do the call "watchproc (instance, 0)" using the piecves we got. No need for additional APIs. Tcl_ChannelWatchProc, Tcl_GetChannelInstanceData, and Tcl_GetTopChannel. The latter to make sure that our watchproc call goes through the whole stack, if there is one. vasiljevic added on 2005-10-14 01:41:55: Logged In: YES user_id=95086 Yes. As the matter of fact, why not! At least until I figure out how to close this hole properly. I will commit ifdefs into both 8.4 and 8.5 and start a deep thinking session again.... Damn, all this is getting pretty complicated. dgp added on 2005-10-14 01:35:58: Logged In: YES user_id=80530 Would making the 2004-10-04 commit conditional on #ifdef TCL_THREADS make a greater number of configurations work correctly? vasiljevic added on 2005-10-14 01:28:08: Logged In: YES user_id=95086 Understand. Thanks for going that deep! The problem is/are obviously the events in the queue which do not get processed (deliberately) at the time one calls Tcl_CleanChannelHandlers. If we do not disable those pending events in Tcl_CCH, then we risk hitting a detached channel later on which inevitably leads to core. If we however do disable them, as my change is doing, then we lose some events pertaining to that channel. So it is a lose-lose situation. I have not found a way to somehow "drain" the channel at the time it is about to be spliced out of the thread/interp. This would be the thing needed here. Perhaps I should do some more thinking... hobbs added on 2005-10-14 01:05:40: Logged In: YES user_id=72656 I have confirmed that socket-9.2 and 11.3 fail on a SuSE-9.0 system (kernel 2.4.21-297-default). Both fail with: ---- Result was: 65536 ---- Result should have been (exact matching): 65566 Verified no problems on Solaris 2.8, Solaris-x86 10, HP-UX 11, AIX 5.1, Linux 2.6 or OS X 10.3. 10.4 (Tiger) does have this interesting repeatable failure: ---- Result was: a:one b: c: ---- Result should have been (exact matching): a:one b: c:two ==== socket-2.11 FAILED Specifically, we seem to be losing a flush on close on 2.4. If I explicitly add a 'flush $s' in the writedata procs, the errors go away (that's for the 9.2 and 11.3 tests - no idea about 2.11). dgp added on 2005-10-13 21:32:40: Logged In: YES user_id=80530 Debian Linux, "Sarge" release is based on Linux 2.4.27, so that's a system to test. dgp added on 2005-10-13 21:27:15: Logged In: YES user_id=80530 Some more testing suggests this failure only shows up on (outdated versions of?) the Linux 2.4 kernel. Linux 2.6 users report no test failures. Would be useful for someone to test with the latest 2.4 kernel (kernel.org says that is 2.4.31) in case this is demonstrating a kernel bug that has been fixed. vasiljevic added on 2005-10-13 15:51:15: Logged In: YES user_id=95086 Hm... out of curiosity, I checked this again under Solaris and guess what: it passes all tests OK. This leads me to believe that there is another hidden problem either with this test or with Linux per-se. I can't really point my finger at the exact place yet. If you like I can backoff this change in both 8.4 and 8.5 or invest time to understand and eventually solve this generally. In the former case I'd need to keep my own patched copy of the code because w/o the change, it is unstable and keeps coring my app on all platforms. What should I do? vasiljevic added on 2005-10-13 14:57:02: Logged In: YES user_id=95086 Of course it is possible to do that. Then, you will have ocassional cores when transferring channels between threads. The fact that this one is exposed only on Linux makes me very suspicious about this test. What my change does is prevent processing of events in the thread event queue designated for the currently closing channel. In single threaded env and in the cases the channel transfer caps are not used, this is no issue. In other cases you get difficult to trace and subtle memory corruptions leading to cores at various other non-related places. I can backoff this one easily by commenting out just one single line in tclIO.c but this is potentially just hiding one more deeper problem (on Linux?) which is sitting there waiting to explode. dgp added on 2005-10-13 08:07:20: Logged In: YES user_id=80530 Is it reasonable to revert the change (at least on the "stable" core-8-4-branch) until this is sorted out? I'm uneasy about "stable" releases (like an ActiveTcl release which comes from CVS snapshots) possibly having sockets that do not work on Linux. What bug did the 2005-10-04 commit fix? And is it more critical than the failures detected by these tests? vasiljevic added on 2005-10-11 23:56:37: Logged In: YES user_id=95086 Confirmed on Linux. Indeed, I do have yet to understand why. The same (tests) passed well on Mac OSX. The part in question is in tclIO.c: /* * Must set the interest mask now to 0, otherwise infinite loops * will occur if Tcl_DoOneEvent is called before the channel is * finally deleted in FlushChannel. This can happen if the channel * has a background flush active. * Also, delete all registered file handlers for this channel * (and for the current thread). This prevents executing of pending * file-events still sitting in the event queue of the current thread. * We deliberately do not call UpdateInterest() because this could * re-schedule new events if the channel still needs to be flushed. */ statePtr->interestMask = 0; (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0); Effectively this disables handling of events in the event queue for the current thread which are still pending when the channel is getting spliced out of the thread for later transfer to some other interp/thread. If this is not done, the test will pass but your app will core. To check, just comment-out the above call to watchProc and try again. I guess I will have to see what this test is actually trying to test. dgp added on 2005-10-11 23:28:56: Logged In: YES user_id=80530 Revision 1.61.2.12 of tclIO.c introduced these test failures. Same changes in Revision 1.95 introduce the same failures on the HEAD branch. both are commits from 2005-10-04 by vasiljevic dgp added on 2005-10-11 23:10:42: Logged In: YES user_id=80530 um, no, these same tests on the same system pass just fine using the Tcl 8.4.11 release. They fail on core-8-4-branch tip. Something has happened to break these tests, even though I see the breakage on only one platform. I see the failure on a Linux/x86 system.

Attachments:

tclio.patch [download] added by hobbs on 2005-10-14 03:08:01. [details]