Tcl Source Code

View Ticket
Login
Ticket UUID: 1323992
Title: socket.test failures
Type: Bug Version: obsolete: 8.4.11
Submitter: dgp Created on: 2005-10-11 15:58:41
Subsystem: 27. Channel Types Assigned To: vasiljevic
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2005-10-15 00:38:51
Resolution: Rejected Closed By: vasiljevic
    Closed on: 2005-10-14 17:38:51
Description:
I'm seeing tests socket-9.2
and socket-11.13 fail consistently,
but only on one platform.

I suspect the real problem is
on the system and not in Tcl,
but I don't know how to verify
that.

These tests are failing because
the [vwait] is timing out.  The
tests are not well constructed
to communicate that scenario
to the user.  I had to hack the
tests to verify that's what is
happening.
User Comments: vasiljevic added on 2005-10-15 00:38:51:
Logged In: YES 
user_id=95086

Backoff parts responsible for failures of 10.13 
Tcl test in favour of external events handling
as suggested by AK. Commited to both 8.4 
and 8.5 branches.

vasiljevic added on 2005-10-14 23:43:54:
Logged In: YES 
user_id=95086

Ah! This is what u mean... I see...

Well, to be honest, this would do the trick allright, but it would 
need to be heavily documented in the Tcl_CCH API, i.e:

  Tcl_ClearChannelHandlers removes all  channelhandlers  and
  event  scripts associated with the specified channel, thus
  shutting down all event processing for this channel.
  It will however NOT disable events in the event queue of this
  thread waiting to fire. If you are calling this function before 
  Tcl_CutChannel in order cut the channel from the current
  thread, you will need to make sure those events are either 
  processed or discarded. Otherwise you are risking late 
  references to an already cut'ed channel which will result
  in memory corruption.

A kind of above thing should be present in the Tcl_CCH manpage.
Hm... not that nice for an API. But I understand this is
as far as we can get w/o turning everything upside down.
 
OK. I can live with that.

andreas_kupries added on 2005-10-14 23:26:23:
Logged In: YES 
user_id=75003

There is no difference with regard to the functionality the
two pieces of code do. However "EventProcessingOff" (EPO) is
something we can implement and use from _outside_ of the
core, as we use only public APIs. I.e. in packages.

My proposal is to completely remove the watchProc call from
Tcl_CCH and instead implement EPO in the Thread package, and
call it in the appropriate place of the transfer sequence.

That way we can leave the Tcl IO core unmodified and still
get channel transfers which do not dump core.

vasiljevic added on 2005-10-14 22:05:30:
Logged In: YES 
user_id=95086

Andreas,

What is the difference between:

static void EventProcessingOff (chan) {
    chan  = Tcl_GetTopChannel (chan);
    watch = Tcl_ChannelWatchProc (chan)
    inst  = Tcl_GetChannelInstanceData (chan);
    watch (inst, 0);
}

and what I've done in tclIO.c:

    (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0);

????

I do not see any difference. Perhaps you wanted to say that
this should be done elsewhere and/or before or after some other
thing (dunno) but to me the both above are practically equivalent.

dgp added on 2005-10-14 04:19:58:
Logged In: YES 
user_id=80530


Just for the record, the commit
to HEAD corrects the test failures
in the default build, but they are
still present in an --enable-threads
build.

vasiljevic added on 2005-10-14 04:02:57:
Logged In: YES 
user_id=95086

Yes. It is 11.13 test which is failing. I have added this 
into the both 8.4 and 8.5 branches.

As I see Andreas has some ideas and I will have to 
first understand them. I will leave this for tomorrow
(its pretty late here now).

hobbs added on 2005-10-14 03:10:53:
Logged In: YES 
user_id=72656

The patch said 11.3 in the C comments - it's 11.13 (oops).

hobbs added on 2005-10-14 03:08:02:

File Added - 152432: tclio.patch

hobbs added on 2005-10-14 03:08:00:
Logged In: YES 
user_id=72656

Attached is my simple but descriptive interim patch for
this.  I'll leave it to Zoran to commit.

andreas_kupries added on 2005-10-14 02:06:58:
Logged In: YES 
user_id=75003

Modulo variable declarations and type casts ...

static void EventProcessingOff (chan) {
    chan  = Tcl_GetTopChannel (chan);
    watch = Tcl_ChannelWatchProc (chan)
    inst  = Tcl_GetChannelInstanceData (chan);
    watch (inst, 0);
}

andreas_kupries added on 2005-10-14 02:03:07:
Logged In: YES 
user_id=75003

I agree with the ifdef THREADS as a short-term solution.

From the digging Don has done it seems to me that during the
_regular_ closing of channel Tcl_ClearChannelHandlers
(Tcl_CCH) is called at a very early point in time, when it
still may generate events for flushing in the background.
And it is expected that these events are processed, and full
shutdown of the channel happens only after that has happened.

So our belief that we could co-opt Tcl_CCH as the point
where event-processing is switched off in preparation for
the transfering of a channel was wrong. We need it, but
because of the above it must only remove the parts generated
events, and not the parts processing those still in the
pipeline. This part of the transfer has to go somewhere
else. We might try to co-opt a different function for this,
or we may need a new public API.

Zoran, we will have to go back to the mails we exchanged,
dig out the current code sequences used by the channel
transfer and see if one of  the other functions used can be
co-opted instead.

If  not, then a new API will be needed to perform the
action. And if we come to that I will also entertain the
notion of moving the whole transfer process and structures
into the core, as a Tcl_TransferChannel (chan,
destinationthread) or some such. Because we would need a TIP
anyway in that case. The threads package would then contain
only the Tcl command around that API.

... I belay that. We do not need new APIs. We can use the
existing APIs to get the instance data of a channel, its
watchproc, and then  do the call "watchproc (instance, 0)"
using the piecves we got. No need for additional APIs.
Tcl_ChannelWatchProc, Tcl_GetChannelInstanceData, and
Tcl_GetTopChannel. The latter to make sure that our
watchproc call goes through the whole stack, if there is one.

vasiljevic added on 2005-10-14 01:41:55:
Logged In: YES 
user_id=95086

Yes. As the matter of fact, why not! 
At least until I figure out how to close this hole properly.
I will commit ifdefs into both 8.4 and 8.5 and start a deep
thinking session again.... Damn, all this is getting pretty
complicated.

dgp added on 2005-10-14 01:35:58:
Logged In: YES 
user_id=80530


Would making the 2004-10-04 commit
conditional on #ifdef TCL_THREADS
make a greater number of configurations
work correctly?

vasiljevic added on 2005-10-14 01:28:08:
Logged In: YES 
user_id=95086

Understand. Thanks for going that deep! 

The problem is/are obviously the events in the queue which
do not get processed (deliberately) at the time one
calls Tcl_CleanChannelHandlers. 
If we do not disable those pending events in Tcl_CCH, 
then we risk hitting a detached channel later on which
inevitably leads to core. 
If we however do disable them, as my change is doing,
then we lose some events pertaining to that channel. 
So it is a lose-lose situation.

I have not found a way to somehow "drain" the channel
at the time it is about to be spliced out of the thread/interp.
This would be the thing needed here. Perhaps I should 
do some more thinking...

hobbs added on 2005-10-14 01:05:40:
Logged In: YES 
user_id=72656

I have confirmed that socket-9.2 and 11.3 fail on a SuSE-9.0
system (kernel 2.4.21-297-default).  Both fail with:

---- Result was:
65536
---- Result should have been (exact matching):
65566

Verified no problems on Solaris 2.8, Solaris-x86 10, HP-UX
11, AIX 5.1, Linux 2.6 or OS X 10.3.  10.4 (Tiger) does have
this interesting repeatable failure:

---- Result was:
a:one b: c:
---- Result should have been (exact matching):
a:one b: c:two
==== socket-2.11 FAILED

Specifically, we seem to be losing a flush on close on 2.4.
 If I explicitly add a 'flush $s' in the writedata procs,
the errors go away (that's for the 9.2 and 11.3 tests - no
idea about 2.11).

dgp added on 2005-10-13 21:32:40:
Logged In: YES 
user_id=80530


Debian Linux, "Sarge" release
is based on Linux 2.4.27, so
that's a system to test.

dgp added on 2005-10-13 21:27:15:
Logged In: YES 
user_id=80530


Some more testing suggests
this failure only shows up
on (outdated versions of?) the
Linux 2.4 kernel.  Linux 2.6
users report no test failures.

Would be useful for someone
to test with the latest 2.4 kernel
(kernel.org says that is 2.4.31)
in case this is demonstrating
a kernel bug that has been fixed.

vasiljevic added on 2005-10-13 15:51:15:
Logged In: YES 
user_id=95086

Hm... out of curiosity, I checked this again under Solaris and 
guess what: it passes all tests OK. 
This leads me to believe that there is another hidden problem 
either with this test or with Linux per-se. I can't really point 
my finger at the exact place yet. 
If you like I can backoff this change in both 8.4 and 8.5
or 
invest time to understand and eventually solve this generally.

In the former case I'd need to keep my own patched copy
of the code because w/o the change, it is unstable and 
keeps coring my app on all platforms. 

What should I do?

vasiljevic added on 2005-10-13 14:57:02:
Logged In: YES 
user_id=95086

Of course it is possible to do that. Then, you will have ocassional
cores when transferring channels between threads.
The fact that this one is exposed *only* on Linux makes me very
suspicious about this test.
What my change does is prevent processing of events in the thread
event queue designated for the currently closing channel. In single
threaded env and in the cases the channel transfer caps are not used,
this is no issue. In other cases you get difficult to trace and subtle 
memory corruptions leading to cores at various other non-related places.

I can backoff this one easily by commenting out just one single line
in tclIO.c but this is potentially just hiding one more deeper problem 
(on Linux?) which is sitting there waiting to explode.

dgp added on 2005-10-13 08:07:20:
Logged In: YES 
user_id=80530


Is it reasonable to revert the change
(at least on the "stable" core-8-4-branch)
until this is sorted out?

I'm uneasy about "stable" releases
(like an ActiveTcl release which
comes from CVS snapshots) possibly
having sockets that do not work on
Linux.

What bug did the 2005-10-04 commit
fix?  And is it more critical than
the failures detected by these tests?

vasiljevic added on 2005-10-11 23:56:37:
Logged In: YES 
user_id=95086

Confirmed on Linux. Indeed, I do have yet to understand why.
The same (tests) passed well on Mac OSX. 

The part in question is in tclIO.c:

    /*
     * Must set the interest mask now to 0, otherwise infinite loops
     * will occur if Tcl_DoOneEvent is called before the channel is
     * finally deleted in FlushChannel. This can happen if the channel
     * has a background flush active.
     * Also, delete all registered file handlers for this channel
     * (and for the current thread). This prevents executing of pending
     * file-events still sitting in the event queue of the current thread.
     * We deliberately do not call UpdateInterest() because this could
     * re-schedule new events if the channel still needs to be flushed.
     */

    statePtr->interestMask = 0;
    (chanPtr->typePtr->watchProc)(chanPtr->instanceData, 0);

Effectively this disables handling of events in the event queue
for the current thread which are still pending when the channel
is getting spliced out of the thread for later transfer to some other
interp/thread. If this is not done, the test will pass but your app
will core. To check, just comment-out the above call to watchProc
and try again. 

I guess I will have to see what this test is actually trying to test.

dgp added on 2005-10-11 23:28:56:
Logged In: YES 
user_id=80530


Revision 1.61.2.12 of tclIO.c
introduced these test failures.

Same changes in Revision 1.95
introduce the same failures on
the HEAD branch.

both are commits from 2005-10-04
by vasiljevic

dgp added on 2005-10-11 23:10:42:
Logged In: YES 
user_id=80530


um, no, these same tests on
the same system pass just
fine using the Tcl 8.4.11 release.

They fail on core-8-4-branch tip.

Something has happened to
break these tests, even though
I see the breakage on only one
platform.

I see the failure on a Linux/x86
system.

Attachments: