Tcl Source Code: View Ticket

Ticket UUID:	719790
Title:	fcopy -command hangs in Win NT4
Type:	Bug	Version:	obsolete: 8.4.5
Submitter:	jcw	Created on:	2003-04-11 16:44:28
Subsystem:	25. Channel System	Assigned To:	andreas_kupries
Priority:	8	Severity:
Status:	Closed	Last Modified:	2008-04-18 13:33:46
Resolution:	Fixed	Closed By:	sf-robot
		Closed on:	2008-04-18 02:20:37
Description:	Reported by Mads Linden, for tclkit, but I don't think it's tclkit.

To reproduce, run tcllib's ftpd server on Windows, fetch a file > 4 KB. The transfer does not complete.

- server has to be Win32 (running tclkit 8.4.2), Linux works
- client is Linux in my case (std ftp cmd)
- transfer needs to be over 4 KB, it seems
- may also have to be binary (it is, in my case)
- ftpd running on server hangs in line 1200, fcopy with -command
- i.e. input is file, output is socket
- tclkit 8.4.0 and 8.4.1 reported to work ok

Hm, this looks like trouble in fcopy. I changed:

    fcopy $f $data(sock2) -command [list ::ftpd::GetDone $sock $data(sock2) $f ""]

to:

    fcopy $f $data(sock2)
    ::ftpd::GetDone $sock $data(sock2) $f "" [tell $f]

($f points to a file)

And now, the ftp transfer works.

The data appears to come across, but the last part does not flush and it ends up deadlocked.

If this is the same problem as reported by Mads Linden, then it works ok on 2k, but not XP. I've only run a test on NT4.
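For readers following the thread, here is a minimal sketch of a background fcopy that passes an explicit "-size", the workaround reported in the comments on this ticket. The proc names sendFile and copyDone and the variable ::copyResult are hypothetical and are not part of tcllib's ftpd; the sketch only applies when the sender knows the file size up front.

```tcl
# Hypothetical helpers for illustration; this is not the tcllib ftpd code.

# Completion callback. fcopy appends the number of bytes copied and,
# only if the copy failed, an error message.
proc copyDone {in out bytes {err ""}} {
    close $in
    close $out
    # Setting a global lets a caller wait for completion with vwait.
    set ::copyResult [list $bytes $err]
}

# Start a background copy of the file at $path to the open channel $out.
proc sendFile {path out} {
    set in [open $path r]
    fconfigure $in  -translation binary
    fconfigure $out -translation binary
    # Explicit -size: the workaround reported in this ticket's comments.
    fcopy $in $out -size [file size $path] \
        -command [list copyDone $in $out]
}
```

A caller would run `sendFile $path $sock` and then enter the event loop, for example with `vwait ::copyResult`. As noted in the comments, this does not help for uploads such as STOR, where the receiver does not know the incoming size.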
User Comments:	matben added on 2008-04-18 13:33:46:
Logged In: YES user_id=108900 Originator: NO

From the discussions of the latest progress on this bug and the related bugs, it sounds exactly like the bug I have reported here. I would be very surprised if it's not the same bug. Case closed.

sf-robot added on 2008-04-18 09:20:37:
Logged In: YES user_id=1312539 Originator: NO

This Tracker item was closed automatically by the system. It was previously set to a Pending status, and the original submitter did not respond within 14 days (the time period specified by the administrator of this Tracker).

ferrieux added on 2008-04-04 14:56:36:
Logged In: YES user_id=496139 Originator: NO

Jeff, I've just noticed in one of the last comments (2007-12-03 matben) that it was reproducible in XP and OSX. Do you think that one should be doable (but maybe the test case is complex and/or has a low probability of repro)?

hobbs added on 2008-04-04 07:21:34:
Logged In: YES user_id=72656 Originator: NO

This was likely solved in 8.4.19/8.5.3 with the fcopy fixes for 1932639 and 780533. We don't have any NT4 around to verify anymore though.

ferrieux added on 2008-04-04 05:44:50:
Logged In: YES user_id=496139 Originator: NO

Andreas, don't you think this one might be related to the one we've just fixed? Would you have the time to revive the proper legacy environments? What I'm getting at is that I would love to see [fcopy] disappear from the Open bug list (or stay for some other good reason we have yet to discover). This would tremendously help spread the word that [fcopy] is usable and should be used.

matben added on 2007-12-03 20:10:04:
Logged In: YES user_id=108900 Originator: NO

Update: The problems with fcopy have now also been seen on MacOSX 10.2 running 8.4.9 and Windows XP running 8.4.13. This happens when I run from "sources" using a normal Tcl/Tk installation. Thus, tclkit is not involved in any way.
I reproduce it each time, but only when downloading files from Pandion's (http://www.pandion.be/) built-in HTTP server, which is used for file transfers. In short, the last callback from "fcopy ... -command ...", which contains a partly filled buffer, never happens. I have tested both the standard http package and my own httpex package; both use fcopy with the -channel option. They both behave the same. On the other hand, rewriting the httpex package to instead use fileevents as in:

            # This is a trick to put this event at the back of the queue to
            # avoid using any 'update'.
            after idle [list after 0 [list \
                [namespace current]::SetReadable $s $token]]
        }
    }

    proc httpex::SetReadable {s token} {
        # We could have been closed since this event comes async.
        if {[lsearch [file channels] $s] >= 0} {
            fileevent $s readable \
                [list [namespace current]::Read $s $token]
        }
    }

works consistently. Also, using Firefox or Safari to obtain the url works consistently. All this points to some problem with the fcopy command which only reveals itself in some circumstances. How Pandion's HTTP server works I have no idea, but its responses are:

    httpex::Event state(state)=waiting line=HTTP/1.1 200 OK
    httpex::Event state(state)=getheader line=Content-Length: 14256
    httpex::Event state(state)=getheader line=Content-Type: application/octet-stream
    httpex::Event state(state)=getheader line=Connection: close
    httpex::Event state(state)=getheader n=0

hobbs added on 2007-02-09 06:19:52:
Logged In: YES user_id=72656 Originator: NO

This seems to be an ftpd issue (tcllib), but we have no way to transfer bugs like this ... passing to Andreas to possibly incorporate (do we have tests on this area??).

davygrvy added on 2006-03-14 06:44:32:
File Added - 170799: patch.txt

davygrvy added on 2006-03-14 06:44:27:
Logged In: YES user_id=7549

Here's a quickie patch. I didn't know what to do with an error returned by [gets]; it should go somewhere, but I didn't know where to put it.
nobody added on 2006-03-14 03:12:03:
Logged In: NO

::ftpd::read has an error. The check for EOF needs to be after the call to [gets].

matben added on 2006-03-13 21:31:56:
Logged In: YES user_id=108900

Is this related?
[ 1437595 ] WinSock cores/hangs on thread or app exit

Mats

matben added on 2006-03-13 21:29:37:
Logged In: YES user_id=108900

I have done some testing using davygrvy's iocpsock package, replacing the socket command:

    rename socket socket_orig
    rename socket2 socket

and can verify that the problem persists :-(

It is not always easily reproducible, but with a few tries on different networks I was able to hang it. Mostly after the first chunk, but sometimes it took several chunks. You can test yourself using these vfs which are ready to sdx. I was using a tclkit 8.4.9 for these tests.

http://www.visit.se/~matben/download/fcopybug.vfs.tgz
http://www.visit.se/~matben/download/fcopybugiocp.vfs.tgz

I repeat, the problem appeared in the 8.4.1 -> 8.4.2 transition.

Mats

davygrvy added on 2006-03-10 15:48:00:
Logged In: YES user_id=7549

This bug has been open since April 2003. My bet is that you won't find a problem in the code of tclWinSock.c. But you might, if you are lucky enough, catch a missed FD_READ never arriving into the socket winproc when you think it should be. My bet is still on "bad data in".

jcw added on 2006-03-10 15:34:05:
Logged In: YES user_id=1983

davygrvy said:
> WSAAsyncSelect has a well known problem under load where it
> drops notifications to socket events.

I'm not sure that's the issue here. The problem was reproducible with a small transfer and nothing else happening.

davygrvy added on 2006-03-10 15:14:26:
Logged In: YES user_id=7549

> I have seen fcopy hang after about 100 chunks (1MB) which
> maybe points to some timing issue (difficult!).

I have seen winsock itself hang from just the same thing. And I don't mean tclWinsock.c, I mean the OS interface to sockets. If you want explosion proof sockets on windows, Tcl doesn't have what it takes..
There.. I said it.. And it isn't something that is repairable. WSAAsyncSelect is the problem, and it is internal to the OS. Think, "bad data in"..

WSAAsyncSelect has a well known problem under load where it drops notifications to socket events. It is most problematic, I found, with listening sockets. After enough drops, all socket activity for the process halts and the process needs to be restarted.

The best way to fix Tcl's winsock implementation is to replace it with a different notification model such as WSAEventSelect or IOCP. Or just use this in the meantime: http://sf.net/projects/iocpsock

matben added on 2006-03-10 14:30:59:
Logged In: YES user_id=108900

If someone comes up with a modified tclWinSock.c, I'll be glad to make builds and do the testing. There is still a test kit at:
http://www.visit.se/~matben/download/fcopybug.vfs.tgz

Mats

dgp added on 2006-03-10 11:10:28:
Logged In: YES user_id=80530

Can this one get a look for possible fixing/closing for Tcl 8.4.13?

dkf added on 2005-08-08 22:41:47:
Logged In: YES user_id=79902

I don't know windows socket or event code and I can't build on that platform; you're at least as capable of making and testing the changes as I am. I just know CVS enough to generate the changes and general coding enough to tell the difference between changes that definitely don't matter and ones that might.

matben added on 2005-08-08 22:28:51:
Logged In: YES user_id=108900

If you make a modified tclWinSock.c using an educated guess, I can do tclkit builds and test them. I have seen fcopy hang after about 100 chunks (1MB), which maybe points to some timing issue (difficult!).
Mats

dkf added on 2005-08-08 17:12:07:
Logged In: YES user_id=79902

There are quite a lot of cosmetic changes from 8.4.1 to 8.4.2 in tclWinSock.c (I presume the fault is here if it is in Tcl at all) but here are the possible causes in summary:

* slight change to the way the library is initialized
* the way the socket helper threads are managed is a bit different
* the way the socket helper thread waits for events is different

I suspect that it's not the first, but I can't prove it. (It's really frustrating to be unable to duplicate this in the core!)

matben added on 2005-08-06 20:00:41:
Logged In: YES user_id=108900

Update: I managed to build a tclkit based on tcl/tk 8.4.1 sources as follows:

1) Used genkit A to get all sources which are tcl/tk 8.4.9 based
2) Found msvc6 archive and imported project into VC7++ (http://www.equi4.com/pub/tk/tars/msvc6.tar.gz)
3) Replaced Tk_MainEx with the one in 8.4.9 + patch all.diffs (http://www.equi4.com/pub/tk/tars/all.diffs)
4) Built Release target and appended runtime.kit
5) Built starpack to test the fcopy bug using http://www.visit.se/~matben/download/fcopybug.vfs.tgz
6) Tested the http package with -channel that triggers the fcopy calls

This works in all situations I have tested that hang using tcl/tk 8.4.9 based builds.

Conclusion: the problem is likely to be found in tcl/win and not elsewhere. Previous investigations (by me and Mads) point to 8.4.1 -> 8.4.2. Please do a diff of relevant files here. I will add more comments if I get closer to this one.

HTH, Mats

matben added on 2005-07-30 20:04:31:
Logged In: YES user_id=108900

Update: It seems that this bug depends on where the url is located, or some consequence thereof (timing issue etc.). My investigations show this:

localhost: always works
LAN: always hangs after first chunk (8192 bytes)
WAN: works most of the time; in one case it stopped after many (approx 20) chunks.
Environment: WinXP 1 SP Home Ed, Active State 8.4.9, tclkit-win32.upx.exe version 8.4.9, vfs at http://www.visit.se/~matben/download/fcopybug.vfs.tgz

Mats

matben added on 2005-07-20 20:39:41:
Logged In: YES user_id=108900

I can just verify that this bug persists in 8.4.9 on WinXP as well. It can be reproduced using the simple .vfs found at:
http://www.visit.se/~matben/download/fcopybug.vfs.tgz

System: WinXP 8.4.9/tclkit-win32.upx.exe

Symptom: Pick a file larger than 8192 bytes to download. Watch the console. If run from fcopybug.tcl using Tcl/Tk any version (?) it works ok. If wrapped using tclkit 8.4.9 it hangs after first chunk (8192 bytes).

PS: Now squeezed between this bug and http://sourceforge.net/mailarchive/forum.php?thread_id=7358857&forum_id=42051
Very annoying indeed...

/Mats

jcw added on 2005-01-11 20:12:39:
Logged In: YES user_id=1983

Whoops garbled url - http://sourceforge.net/tracker/index.php?func=detail&aid=1060620&group_id=34191&atid=410295

jcw added on 2005-01-11 20:07:31:
Logged In: YES user_id=1983

I still can't figure this out, but a surprisingly similar bug was recently fixed in Memchan, see
http://sourceforge.net/tracker/?func=detail&aid=weird thing is many m&group_id=10894&atid=110894

The weird part is that the fcopy here does not appear to involve any VFS or memchan, hence no "rechan" either (which is tclkit's simpler cousin of memchan), see
http://www.equi4.com/cgi-bin/cvsweb/tclkit/src/rechan.c?rev=1.12

I'm bringing this up here due to the similarities: no final fileevent triggering, hence never a chance to test for eof and ending fcopy. Have pinged Pat Thoyts as well, on the chat.

-jcw

hobbs added on 2004-02-17 03:18:31:
Logged In: YES user_id=72656

It would be nice if someone could test selective reverse ports of the significant files from 8.4.1 to 8.4.5 that would show if something changed. The key files that I would look at would be tcl/win/tclWinPipe.c and tcl/win/tclWinChan.c.
Everything of course works in regular Tcl, so these would have to be done as modified base kits.

matben added on 2004-02-16 14:38:50:
Logged In: YES user_id=108900

Update: The testing situation is the following: my Coccinella app is made into a StarPack using TclKits of the indicated versions. I make detailed outputs around fcopy to see what happens. Testing is for both client and server side fcopy on Win98 and Win2k; I don't have WinXP. If code is run from an ordinary Tcl/Tk installation everything works ok. The code is typically 'fcopy ... -blocksize 8192 -command tclProc'. The first fcopy always works; it is the second call that blocks. The event loop is still running. Thus, files smaller than blocksize don't cause problems.

TclKit    8.4.1    8.4.2    8.4.5
Win98     works    fails    fails
Win2k     works    fails    fails

Using 8.4.1 seems to be the "workaround".

Any fix would be greatly appreciated, Mats

matben added on 2004-02-15 20:52:32:
Logged In: YES user_id=108900

I seem to have come across the same bug running Win98 and TclKit 8.4.2. I do have a -size argument for the blocksize, and the first call to fcopy seems to work, but the second just blocks. The same code on standard Tcl/Tk 8.4.5 runs just fine.

Mats Bengtsson

nobody added on 2003-11-23 21:11:11:
Logged In: NO

My notes on the bug:

Giving fcopy a "-size" equal to the size being sent makes the callback work as expected. This workaround unfortunately does not work in ftpd's ::ftpd::command::STOR, since the server does not know the size of the incoming file/data.

What is weird is that sometimes it works fine: yesterday I added an "update" just before fcopy and it worked fine, then the day after it stopped working.

I have now seen this bug on 2000 and XP.

Mads

jcw added on 2003-11-23 20:36:32:
Logged In: YES user_id=1983

Sorry to keep raising this issue, but tclkit 8.4.5 continues to fail on this case.
I can't see how anything in tclkit affects the channel or event systems, so I keep suspecting a weird bug in Tcl which tclkit happens to hit in the above example. The server socket is Windows - that's the one invariant in all this AFAICT.

jcw added on 2003-11-11 07:38:01:
Logged In: YES user_id=1983

I'll need to triple-check, but when Andreas worked on it, and when I tried just now, it still hangs.

hobbs added on 2003-11-11 04:11:56:
Logged In: YES user_id=72656

I believe Andreas fixed this for 8.4.5.

jcw added on 2003-09-18 01:25:12:
Logged In: YES user_id=1983

A workaround has been found: provide explicit "-size N" to fcopy.

jcw added on 2003-05-13 01:45:13:
Logged In: YES user_id=1983

Everything is ok in ActiveTcl 8.4.2, and in Tclkit 8.4.1, but not in Tclkit 8.4.2. It looks like I'm going to have to fire up my bug buster. Will follow up once this has been resolved.

jcw added on 2003-05-12 23:02:16:
Logged In: YES user_id=1983

I'm being told that it also fails on Win2k pro (tclkit 8.4.2, threaded build). Am looking into finding a simpler test case (so far, my tests have been based on running sdx starkit, which has ftpd inside). And whether threads matter...

davygrvy added on 2003-04-18 02:17:33:
Logged In: YES user_id=7549

I doubt the current problem came from my changes. All I did was revamp how the message handler window and service thread were brought up and down, and changed the winSockProcs prototypes to be more readable. I'm working on a similar problem at the job, but it happens to be the reverse: files less than 4k sometimes don't flush...

A good place to start looking would be the generic layer for how background flushes work and how the watch mask is reset and possibly not turned on again when interest is still desired. How this relates to NT4 only is unknown, and I wouldn't be at all surprised if it turned out to be an OS problem that somehow exercises a race condition.
andreas_kupries added on 2003-04-15 02:00:05:
Logged In: YES user_id=75003

David, this seems more in your area of expertise. As I also see in the changelog that you did a big revamp of all win channel drivers in general and sockets in particular, I have to consider the possibility that these changes introduced the problem (Changelog 2002-11-26/27).

My own tests were on a Win2000 system, using it as both client and server. This worked. Another combination was Linux client, Win2000 server, which also worked. As described in the report, I used the standard 'ftp' command as client for both combinations. For Tcl I used the HEAD of core-8-4-branch (head as of Saturday April 12, 2003), and the base-kit we build at AS. The ftpd is 'tcllib/examples/ftpd/ftpd.tcl', slightly modified (different path to the tcllib/modules/ftpd/ftpd.tcl). No locks, no hangs, for files in the range of several hundred k.

Getting access to an NT4 machine might be possible here, but is still a tad difficult.

Attachments:

patch.txt added by davygrvy on 2006-03-14 06:44:27.