Tcl Source Code

Ticket UUID: 719790
Title: fcopy -command hangs in Win NT4
Type: Bug
Version: obsolete: 8.4.5
Submitter: jcw
Created on: 2003-04-11 16:44:28
Subsystem: 25. Channel System
Assigned To: andreas_kupries
Priority: 8
Severity:
Status: Closed
Last Modified: 2008-04-18 13:33:46
Resolution: Fixed
Closed By: sf-robot
Closed on: 2008-04-18 02:20:37
Description:
Reported by Mads Linden, for tclkit, but I don't think it's tclkit.

To reproduce, run tcllib's ftpd server on Windows, fetch a file > 4 Kb.  The transfer does not complete.

   - server has to be Win32 (running tclkit 8.4.2), Linux works
   - client is Linux in my case (std ftp cmd)
   - transfer needs to be over 4 Kb, it seems
   - may also have to be binary (it is, in my case)
   - ftpd running on server hangs in line 1200, fcopy with -command
   - i.e. input is file, output is socket
   - tclkit 8.4.0 and 8.4.1 reported to work ok

Hm, this looks like trouble in fcopy.  I changed:

  fcopy $f $data(sock2) -command [list ::ftpd::GetDone $sock $data(sock2) $f ""]

to:

  fcopy $f $data(sock2)
  ::ftpd::GetDone $sock $data(sock2) $f "" [tell $f]

($f points to a file)

And now, the ftp transfer works.  The data appears to come across, but the last part does not flush and it ends up deadlocked.
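
For reference, a minimal sketch of the two call styles contrasted above: the background form with -command (the one that hung) and the blocking form (the one that worked). The proc names copyAsync, copyBlocking and copyDone are hypothetical, and plain files stand in for the file-to-socket copy in ftpd, so this shows the API shape only and is not expected to reproduce the hang.

```tcl
# Hypothetical helper names; plain files stand in for the
# file-to-socket copy that ftpd performs.

# Completion callback: fcopy appends the byte count, plus an error
# message if the copy failed.
proc copyDone {in out bytes {err ""}} {
    close $in
    close $out
    if {$err ne ""} {
        set ::copyError $err
    }
    set ::copied $bytes
}

# Background copy: returns to the event loop and reports via -command.
proc copyAsync {src dst} {
    set in  [open $src r]
    set out [open $dst w]
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    fcopy $in $out -command [list copyDone $in $out]
    vwait ::copied
    return $::copied
}

# Blocking copy: fcopy itself returns the number of bytes written.
proc copyBlocking {src dst} {
    set in  [open $src r]
    set out [open $dst w]
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    set bytes [fcopy $in $out]
    close $in
    close $out
    return $bytes
}
```

The blocking form avoids the callback entirely, which is why it sidesteps the hang, at the cost of stalling the server's event loop for the duration of the transfer.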

If this is the same problem as reported by Mads Linden, then it works ok on 2k, but not XP.  I've only run a test on NT4.
User Comments:

matben added on 2008-04-18 13:33:46:
Logged In: YES 
user_id=108900
Originator: NO

From the discussions of the latest progress on this bug and the related bugs, it sounds exactly like the bug I have reported here. I would be very surprised if it's not the same bug. Case closed.

sf-robot added on 2008-04-18 09:20:37:
Logged In: YES 
user_id=1312539
Originator: NO

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

ferrieux added on 2008-04-04 14:56:36:
Logged In: YES 
user_id=496139
Originator: NO

Jeff, I've just noticed in one of the last comments (2007-12-03 matben) that it was reproducible in XP and OSX. Do you think that one should be doable (but maybe the test case is complex and/or has a low probability of repro)?

hobbs added on 2008-04-04 07:21:34:
Logged In: YES 
user_id=72656
Originator: NO

This was likely solved in 8.4.19/8.5.3 with the fcopy fixes for 1932639 and 780533.  We don't have any NT4 around to verify anymore though.

ferrieux added on 2008-04-04 05:44:50:
Logged In: YES 
user_id=496139
Originator: NO

Andreas, don't you think this one might be related to the one we've just fixed ?
Would you have the time to revive the proper legacy environments ?
What I'm getting at, is I would love to see [fcopy] disappear from the Open bug list (or stay for some other good reason we have yet to discover). This would tremendously help spread the word that [fcopy] is usable and should be used.

matben added on 2007-12-03 20:10:04:
Logged In: YES 
user_id=108900
Originator: NO

Update:
The problems with fcopy have now also been seen on MacOSX 10.2 running 8.4.9 and Windows XP running 8.4.13. This happens when I run from "sources" using a normal Tcl/Tk installation. Thus, tclkit is not involved in any way.

I reproduce it each time, but only when downloading files from Pandion's (http://www.pandion.be/) built-in HTTP server, which is used for file transfers. In short, the last callback from "fcopy ... -command ...", which should carry a partly filled buffer, never happens. I have tested both the standard http package and my own httpex package; both use fcopy with the -channel option. They both behave the same. On the other hand, rewriting the httpex package to instead use fileevents, as in:

        # This is a trick to put this event at the back of the queue to
        # avoid using any 'update'.
        after idle [list after 0 [list \
          [namespace current]::SetReadable $s $token]]
    }
}

proc httpex::SetReadable {s token} {

    # We could have been closed since this event comes async.
    if {[lsearch [file channels] $s] >= 0} {
        fileevent $s readable  \
          [list [namespace current]::Read $s $token]
    }
}

works consistently. Also, using Firefox or Safari to obtain the URL works consistently. All this points to some problem with the fcopy command which only reveals itself in some circumstances. How Pandion's HTTP server works I have no idea, but its responses are:

httpex::Event state(state)=waiting
line=HTTP/1.1 200 OK
httpex::Event state(state)=getheader
line=Content-Length: 14256
httpex::Event state(state)=getheader
line=Content-Type: application/octet-stream
httpex::Event state(state)=getheader
line=Connection: close
httpex::Event state(state)=getheader
n=0
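
A minimal sketch of the fileevent-based alternative mentioned in this comment, assuming a readable handler that moves one chunk per event and stops at end of input. The names pumpStart and pump and the 8192-byte chunk size are illustrative assumptions; this is not the httpex code.

```tcl
# Hypothetical names; a sketch of copying with fileevent instead of fcopy.

# Readable handler: move at most one chunk, then check for EOF.
proc pump {in out} {
    set chunk [read $in 8192]
    if {[string length $chunk] > 0} {
        puts -nonewline $out $chunk
    }
    # [eof] is only meaningful after the read above.
    if {[eof $in]} {
        fileevent $in readable {}
        flush $out
        set ::pumpDone 1
    }
}

# Arm the handler; the caller then enters the event loop,
# for example with "vwait ::pumpDone".
proc pumpStart {in out} {
    fconfigure $in  -blocking 0 -translation binary
    fconfigure $out -translation binary
    fileevent $in readable [list pump $in $out]
}
```

Because the handler checks [eof] itself after every read, the final partial chunk is written without depending on any fcopy completion callback.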

hobbs added on 2007-02-09 06:19:52:
Logged In: YES 
user_id=72656
Originator: NO

This seems to be an ftpd issue (tcllib), but we have no way to transfer bugs like this ... passing to Andreas to possibly incorporate (do we have tests on this area??).

davygrvy added on 2006-03-14 06:44:32:

File Added - 170799: patch.txt

davygrvy added on 2006-03-14 06:44:27:
Logged In: YES 
user_id=7549

Here's a quick patch.  I didn't know what to do with an
error returned by [gets]; it should go somewhere, but I didn't
know where to put it.

nobody added on 2006-03-14 03:12:03:
Logged In: NO 

::ftpd::read has an error.  The check for EOF needs to be
after the call to [gets]
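
A minimal sketch of the ordering described here, assuming a line-oriented readable handler: call [gets] first and only then test [eof]. The proc name readLine and the globals ::lines and ::done are hypothetical; this is not the tcllib ftpd code or the attached patch.

```tcl
# Hypothetical handler; shows only the gets-before-eof ordering.
proc readLine {chan} {
    # Read first: [eof] only becomes true after a read hits end of input.
    if {[gets $chan line] < 0} {
        if {[eof $chan]} {
            close $chan
            set ::done 1
        }
        # Otherwise a non-blocking channel has no complete line yet;
        # just wait for the next readable event.
        return
    }
    lappend ::lines $line
}
```

Testing [eof] before the read can miss the end of input, because the flag is not set until a read attempt has actually run into it.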

matben added on 2006-03-13 21:31:56:
Logged In: YES 
user_id=108900

Is this related?
[ 1437595 ] WinSock cores/hangs on thread or app exit

Mats

matben added on 2006-03-13 21:29:37:
Logged In: YES 
user_id=108900

I have done some testing using davygrvy iocpsock package replacing the 
socket command:
rename socket socket_orig
rename socket2 socket
and can verify that the problem persists :-(
It is not always easily reproducible, but using a few tries on different 
networks I was able to hang it. Mostly after the first chunk, but 
sometimes it took several chunks. You can test yourself using
these vfs which are ready to sdx. I was using a tclkit 8.4.9 for these 
tests.

http://www.visit.se/~matben/download/fcopybug.vfs.tgz
http://www.visit.se/~matben/download/fcopybugiocp.vfs.tgz
I repeat, the problem appeared in the 8.4.1 -> 8.4.2 transition.

Mats

davygrvy added on 2006-03-10 15:48:00:
Logged In: YES 
user_id=7549

This bug has been open since April 2003.  My bet is that you
won't find a problem in the code of tclWinSock.c

But you might, if you are lucky enough, catch a missed FD_READ
never arriving into the socket winproc when you think it
should be.

My bet is still on "bad data in"

jcw added on 2006-03-10 15:34:05:
Logged In: YES 
user_id=1983

davygrvy said: WSAAsyncSelect has a well known problem under load where it
drops notifications to socket events.

I'm not sure that's the issue here.  The problem was reproducible with a small 
transfer and nothing else happening.

davygrvy added on 2006-03-10 15:14:26:
Logged In: YES 
user_id=7549

> I have seen fcopy hang after about 100 chunks (1MB) which
> maybe points to some timing issue (difficult!).

I have seen winsock itself hang from just the same thing. 
And I don't mean tclWinsock.c, I mean the OS interface to
sockets.

If you want explosion proof sockets on windows, Tcl doesn't
have what it takes..  There.. I said it..  And it isn't
something that is repairable.  WSAAsyncSelect is the
problem, and it is internal to the OS.  Think, "bad data in"..

WSAAsyncSelect has a well known problem under load where it
drops notifications to socket events.  It is most
problematic, I found, with listening sockets.  After enough
drops, all socket activity for the process halts and the
process needs to be restarted.

The best way to fix Tcl's winsock implementation is to
replace it with a different notification model such as
WSAEventSelect or IOCP.

Or just use this in the meantime:
http://sf.net/projects/iocpsock

matben added on 2006-03-10 14:30:59:
Logged In: YES 
user_id=108900

If someone comes up with a modified tclWinSock.c I'll be glad to make 
builds and do the testing. There is still a test kit at:
http://www.visit.se/~matben/download/fcopybug.vfs.tgz

Mats

dgp added on 2006-03-10 11:10:28:
Logged In: YES 
user_id=80530


Can this one get a look
for possible fixing/closing
for Tcl 8.4.13 ?

dkf added on 2005-08-08 22:41:47:
Logged In: YES 
user_id=79902

I don't know windows socket or event code and I can't build
on that platform; you're at *least* as capable of making and
testing the changes as I am. I just know CVS enough to
generate the changes and general coding enough to tell the
difference between changes that definitely don't matter and
ones that might.

matben added on 2005-08-08 22:28:51:
Logged In: YES 
user_id=108900

If you make a modified tclWinSock.c using an educated guess, I 
can do tclkit builds and test them. 
I have seen fcopy hang after about 100 chunks (1MB) which 
maybe points to some timing issue (difficult!).

Mats

dkf added on 2005-08-08 17:12:07:
Logged In: YES 
user_id=79902

There are quite a lot of cosmetic changes from 8.4.1 to
8.4.2 in tclWinSock.c (I presume the fault is here if it is
in Tcl at all) but here are the possible causes in summary:
 * slight change to the way the library is initialized
 * the way the socket helper threads are managed is a bit
different
 * the way the socket helper thread waits for events is
different

I suspect that it's not the first, but I can't prove it.
(It's really frustrating to be unable to duplicate this in
the core!)

matben added on 2005-08-06 20:00:41:
Logged In: YES 
user_id=108900

Update:

I managed to build a tclkit based on tcl/tk 8.4.1 sources as 
follows:

1) Used genkit A to get all sources which are tcl/tk 8.4.9 based

2) Found msvc6 archive and imported project into VC7++
(http://www.equi4.com/pub/tk/tars/msvc6.tar.gz)

3) Replaced Tk_MainEx with the one in 8.4.9 + patch all.diffs
(http://www.equi4.com/pub/tk/tars/all.diffs)

4) Built Release target and appended runtime.kit

5) Built starpack to test the fcopy bug using 
http://www.visit.se/~matben/download/fcopybug.vfs.tgz 

6) Tested the http package with -channel that triggers the fcopy 
calls

This works in all situations I have tested that hang using tcl/tk 
8.4.9 based builds.

Conclusion: the problem is likely to be found in tcl/win and not 
elsewhere.
Previous investigations (by me and Mads) point to 8.4.1 -> 8.4.2
Please do a diff of relevant files here.

I will add more comments if I get closer to this one.

HTH,   Mats

matben added on 2005-07-30 20:04:31:
Logged In: YES 
user_id=108900

Update: It seems that this bug depends on where the URL is 
located, or some consequence thereof (timing issue etc.). My 
investigations show this:

localhost: always works
LAN: always hangs after first chunk (8192 bytes)
WAN: works most of the time; in one case it stopped after many 
(approx 20) chunks.

Environment: WinXP 1 SP Home Ed, Active State 8.4.9,
tclkit-win32.upx.exe version 8.4.9, vfs at
http://www.visit.se/~matben/download/fcopybug.vfs.tgz

Mats

matben added on 2005-07-20 20:39:41:
Logged In: YES 
user_id=108900

I can just verify that this bug persists in 8.4.9 on WinXP as well.
It can be reproduced using the simple .vfs found at:
http://www.visit.se/~matben/download/fcopybug.vfs.tgz
System:
   WinXP
   8.4.9/tclkit-win32.upx.exe
Symptom:
   Pick a file larger than 8192 bytes to download.
   Watch the console.
   If run from fcopybug.tcl using Tcl/Tk any version (?) it works ok.
   If wrapped using tclkit 8.4.9 it hangs after first chunk (8192 
bytes).
PS: Now squeezed between this bug and 
http://sourceforge.net/mailarchive/forum.php?thread_id=7358857&forum_id=42051
Very annoying indeed...
/Mats

jcw added on 2005-01-11 20:12:39:
Logged In: YES 
user_id=1983

Whoops garbled url - http://sourceforge.net/tracker/index.php?func=detail&aid=1060620&group_id=34191&atid=410295

jcw added on 2005-01-11 20:07:31:
Logged In: YES 
user_id=1983

I still can't figure this out, but a surprisingly similar bug was recently fixed 
in Memchan, see http://sourceforge.net/tracker/?func=detail&aid=weird 
thing is many m&group_id=10894&atid=110894

The weird part is that the fcopy here does not appear to involve any VFS 
or memchan, hence no "rechan" either (which is tclkit's simpler cousin of 
memchan), see http://www.equi4.com/cgi-bin/cvsweb/tclkit/src/rechan.c?rev=1.12

I'm bringing this up here due to the similarities: no final fileevent 
triggering, hence never a chance to test for eof and ending fcopy.

Have pinged Pat Thoyts as well, on the chat.

-jcw

hobbs added on 2004-02-17 03:18:31:
Logged In: YES 
user_id=72656

It would be nice if someone could test selective reverse
ports of the significant files from 8.4.1 to 8.4.5 that
would show if something changed.  The key files that I would
look at would be tcl/win/tclWinPipe.c and
tcl/win/tclWinChan.c.  Everything of course works in regular
Tcl, so these would have to be done as modified base kits.

matben added on 2004-02-16 14:38:50:
Logged In: YES 
user_id=108900

Update:
The testing situation is the following: my Coccinella app is made 
into a StarPack using TclKits of the indicated versions. I make detailed 
output around fcopy to see what happens. Testing is for both 
client and server side fcopy on Win98 and Win2k; I don't have 
WinXP. If code is run from an ordinary Tcl/Tk installation 
everything works ok. The code is typically 'fcopy ... -blocksize 
8192 -command tclProc'. The first fcopy always works; it is the 
second call that blocks. The event loop is still running. Thus, files 
smaller than blocksize don't cause problems.

TclKit  8.4.1  8.4.2  8.4.5
Win98   works  fails  fails
Win2k   works  fails  fails

Using 8.4.1 seems to be the "workaround".

Any fix would be greatly appreciated,   Mats

matben added on 2004-02-15 20:52:32:
Logged In: YES 
user_id=108900

I seem to have come across the same bug running Win98 and 
TclKit 8.4.2. I do have a -size argument for the blocksize, and the 
first call to fcopy seems to work, but the second just blocks.
The same code on standard Tcl/Tk 8.4.5 runs just fine.
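
A minimal sketch of the chunked pattern described here, assuming each fcopy moves one block via -size and the callback starts the next one until end of input. The proc names chunkCopy, chunkStart and chunkDone and the globals ::total and ::finished are hypothetical, and files stand in for the sockets involved in the real case.

```tcl
# Hypothetical names; one fcopy per block, restarted from the callback.
proc chunkStart {in out} {
    fcopy $in $out -size 8192 -command [list chunkDone $in $out]
}

# Callback: fcopy appends the byte count, plus an error message on failure.
proc chunkDone {in out bytes {err ""}} {
    incr ::total $bytes
    if {$err ne ""} {
        set ::finished "error: $err"
    } elseif {[eof $in]} {
        set ::finished ok
    } else {
        # This second (and later) call is the one reported to block.
        chunkStart $in $out
    }
}

# Driver: copy everything and return the number of bytes moved.
proc chunkCopy {in out} {
    set ::total 0
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    chunkStart $in $out
    vwait ::finished
    return $::total
}
```

With this structure a file smaller than one block finishes inside the first callback, which fits the observation that small files do not trigger the problem.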

Mats Bengtsson

nobody added on 2003-11-23 21:11:11:
Logged In: NO 

my notes on the bug:

Giving "-size" to fcopy with the size being sent will make 
the callback work as expected.

This workaround unfortunately does not work in ftpd's
::ftpd::command::STOR, since the server does not know the 
size of the incoming file/data.

What is weird is that sometimes it works fine; yesterday I 
added an "update" just before fcopy and it worked fine, then 
the day after it stopped working.

I have now seen this bug on 2000 and XP.

Mads

jcw added on 2003-11-23 20:36:32:
Logged In: YES 
user_id=1983

Sorry to keep raising this issue, but tclkit 8.4.5 continues to fail 
on this case.  I can't see how anything in tclkit affects the 
channel or event systems, so I keep suspecting a weird bug in 
Tcl which tclkit happens to hit in the above example.  The server 
socket is Windows - that's the one invariant in all this AFAICT.

jcw added on 2003-11-11 07:38:01:
Logged In: YES 
user_id=1983

I'll need to triple-check, but when Andreas worked on it, and 
when I tried just now, it still hangs.

hobbs added on 2003-11-11 04:11:56:
Logged In: YES 
user_id=72656

I believe that Andreas fixed this for 8.4.5.

jcw added on 2003-09-18 01:25:12:
Logged In: YES 
user_id=1983

A workaround has been found: provide explicit "-size N" to fcopy
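
A minimal sketch of this workaround, assuming the source is a regular file whose size is known up front. The proc names sendFile and sendDone are hypothetical; in ftpd the output channel would be the data socket.

```tcl
# Hypothetical helpers showing fcopy with an explicit -size.
proc sendFile {path out doneCmd} {
    set in [open $path r]
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    # Workaround from this ticket: state the byte count explicitly.
    fcopy $in $out -size [file size $path] \
        -command [list sendDone $in $doneCmd]
}

# Callback: close the input, then hand byte count and error to the caller.
proc sendDone {in doneCmd bytes {err ""}} {
    close $in
    uplevel #0 [linsert $doneCmd end $bytes $err]
}
```

This only helps when the sender knows the size in advance, as with a file on disk.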

jcw added on 2003-05-13 01:45:13:
Logged In: YES 
user_id=1983

Everything is ok in ActiveTcl 8.4.2, and in Tclkit 8.4.1, but not in Tclkit 8.4.2.

It looks like I'm going to have to fire up my bug buster.

Will follow up once this has been resolved.

jcw added on 2003-05-12 23:02:16:
Logged In: YES 
user_id=1983

I'm being told that it also fails on Win2k pro (tclkit 8.4.2, threaded build).

Am looking into finding a simpler test case (so far, my tests have been based on 
running sdx starkit, which has ftpd inside).  And whether threads matter...

davygrvy added on 2003-04-18 02:17:33:
Logged In: YES 
user_id=7549

I doubt the current problem came from my changes.  All I did 
was revamp how the message handler window and service 
thread were brought up and down and changed the 
winSockProcs prototypes to be more readable.

I'm working on a similar problem at the job, but it happens in 
the reverse case: files less than 4k sometimes don't flush...

A good place to start looking would be the generic layer for 
how background flushes work and how the watch mask is 
reset and possibly not turned on again when interest is still 
desired. How this relates to NT4 only is unknown and I 
wouldn't be at all surprised if it turned out to be an OS 
problem that somehow exercises a race condition.

andreas_kupries added on 2003-04-15 02:00:05:
Logged In: YES 
user_id=75003

David, this seems more in your area of expertise. As I also 
see in the changelog that you did a big revamp of all win 
channel drivers in general and sockets in particular I have to 
consider the possibility that these changes introduced the 
problem (Changelog 2002-11-26/27).

My own tests were on a Win2000 system, using it as both 
client and server. This worked. Another combination was Linux 
client, Win2000 server, which also worked. As described in 
the report I used the standard 'ftp' command as client, for 
both combinations.

For Tcl I used the HEAD of core-8-4-branch (head as of 
saturday April 12, 2003), and the base-kit we build at AS. The 
ftpd is 'tcllib/examples/ftpd/ftpd.tcl', slightly modified (different 
path to the tcllib/modules/ftpd/ftpd.tcl).

No locks, no hangs, for files in the range of several hundred k.

Getting access to an NT4 machine might be possible here, 
but is still a tad difficult.

Attachments: