Tcl Source Code

View Ticket
Login
Ticket UUID: 3545363
Title: socket/vwait hang in io-53.10 and socket.test
Type: Bug Version: obsolete: 8.6b3
Submitter: twylite Created on: 2012-07-18 11:48:25
Subsystem: 25. Channel System Assigned To: rmax
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2012-09-13 14:32:31
Resolution: Fixed Closed By: twylite
    Closed on: 2012-09-13 07:32:31
Description:
Hi,

chan-io-53.10, io-52.10 and socket.test all hang (trunk build from 2012/07/18 using MSVC10, 32-bit tclsh running on Windows 7 64-bit).  The common problem is that a vwait - which should be released by a fileevent or socket server accept - never returns.

Details:

Build setup (32-bit): 'Setting environment for using Microsoft Visual Studio 2010 x86 tools'
Testing platform: Windows 7 (64-bit), PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 30 Stepping 5, GenuineIntel

C:\User\Tcl_BUILD\cmake-2012\tcl-trunk>fossil pull
Server:    http://core.tcl.tk/tcl

C:\User\Tcl_BUILD\cmake-2012\tcl-trunk>fossil update
updated-to:   f0f8f508748b68a589768e9e7df3beea28279df9 2012-07-17 13:08:18 UTC
tags:         trunk

Build commands:
  set BUILDENV=VC10
  set VER=86b3
  set INSTALLDIR=%CD%\Release_%VER%_%BUILDENV%
  set OPTS=threads
  set TCLSH=%INSTALLDIR%\bin\tclsh86t.exe
  mkdir %INSTALLDIR%
  pushd win
  nmake /f makefile.vc release tcltest OPTS=%OPTS% CFG_ENCODING=\\\"utf-8\\\" INSTALLDIR=%INSTALLDIR%
  nmake /f makefile.vc install OPTS=%OPTS% CFG_ENCODING=\\\"utf-8\\\" INSTALLDIR=%INSTALLDIR%
  nmake /f makefile.vc test

FAILURE #1: chan-io-53.10 hangs
  Cause: 'vwait ::forever' never returns; reason unknown
  Reproduce: nmake /f makefile.vc test TESTFLAGS="-file chanio.test -match chan-io-53.* -verbose pbstel"
  Workaround: add a watchdog to -setup to prevent hang; doesn't address underlying problem
    proc watchdog {} {
      lappend ::forever TIMEOUT
      set ::watchdogId [after 4000 [namespace code "watchdog"]]
    }
    watchdog
    # and in -cleanup
    catch { after cancel $::watchdogId }

FAILURE #2: io-53.10 hangs
  As for chan-io-53.10 above

FAILURE #3: io-29.33b crashes (probably unrelated)
C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/io.test:2746: error: test failed: i
o-29.33b TIP#398, no implicit flush of nonblocking on exit
==== io-29.33b TIP#398, no implicit flush of nonblocking on exit FAILED
==== Contents of test case:

    set f [open $path(script) w]
    puts $f {
                fconfigure stdout -blocking 0
                puts -nonewline stdout [string repeat A 655360]
                flush stdout
        }
    close $f
    set f [open $path(script2) w]
    puts $f {after 2000}
    close $f
        set t1 [clock milliseconds]
        set ff [open "|[list [interpreter] $path(script2)]" w]
        catch {unset ::env(TCL_FLUSH_NONBLOCKING_ON_EXIT)}
        exec [interpreter] $path(script) >@ $ff
        set t2 [clock milliseconds]
        close $ff
        expr {($t2-$t1)/2000 ? $t2-$t1 : 0}

---- Test generated error; Return code was: 1
---- Return code should have been one of: 0 2
---- errorInfo: child killed: segmentation violation
    while executing
"exec [interpreter] $path(script) >@ $ff"
    ("uplevel" body line 15)
    invoked from within
"uplevel 1 $script"
---- errorCode: CHILDKILLED 9996 SIGSEGV {segmentation violation}
==== io-29.33b FAILED

FAILURE #4: socket.test hangs in latency testing code (before any tests start)
  Cause: 'vwait s1' never returns; reason unknown
  Reproduce: nmake /f makefile.vc test TESTFLAGS="-file socket.test -verbose pbstel"
  Workaround: none
User Comments: twylite added on 2012-09-13 14:32:31:

allow_comments - 1

twylite added on 2012-09-13 14:32:30:
Fix merged to trunk as [aef7ffe46d] on 2012/08/09.

nijtmans added on 2012-08-06 16:13:34:
>Patch committed to branch bug-3545363. Awaiting test & review.

Any chance of getting this merged to trunk? To me this
Issue would be the (only) blocker for 8.6b3.

twylite added on 2012-07-30 22:34:02:
Thanks Jan.  Richard has confirmed that socket-14.0 is unrelated to this issue.  I also get intermittent socket-14.2 failures that appear unrelated.

nijtmans added on 2012-07-30 22:26:26:
After merging bug-3545363 to trunk, and running the tests,
got 2 relevant failures. The hangs are gone!
=========================================
socket.test


==== socket-14.0 [socket -async] when server only listens on IPv4 FAILED
==== Contents of test case:

        set client [socket -async localhost $port]
        set after [after 1000 {set x [fconfigure $client -error]}]
        vwait x
        set x

---- Result was:

---- Result should have been (exact matching):
ok
==== socket-14.0 FAILED



==== socket-14.2 [socket -async] fileevent connection refused FAILED
==== Contents of test case:

        if {[catch {socket -async localhost [randport]} client]} {
            regexp {[^:]*: (.*)} $client -> x
        } else {
            fileevent $client writable {set x [fconfigure $client -error]}
            set after [after 1000 {set x timeout}]
            vwait x
            after cancel $after
            if {$x eq "timeout"} {
                append x ": [fconfigure $client -error]"
            }
            close $client
        }
        set x

---- Result was:
timeout: connection refused
---- Result should have been (exact matching):
connection refused
==== socket-14.2 FAILED

.....

twylite added on 2012-07-30 21:08:50:
Patch committed to branch bug-3545363.  Awaiting test & review.

twylite added on 2012-07-25 03:07:38:

File Added - 449718: td-patch-3545363-3.diff

twylite added on 2012-07-20 17:38:22:
I've uploaded td-patch-3545363-1.diff, a patch against branch  bug-3545363 checkout a1fbf5e7.  It passes all (relevant) tests on Windows 7 64-bit (32-bit tclsh) with no crashes or hangs.

Bug 3545363 is caused by multiple code locations treating infoPtr->sockets as a single fd rather than a linked list of fds.  For connected sockets (i.e. not server listening sockets) this is a valid assumption; but other code locations such as TcpCloseProc() and SocketProc() must handle all socket types and treat infoPtr->sockets as a list.

TcpCloseProc() was leaking memory by not freeing the infoPtr->sockets linked list.  Adding the ckfree() exposed a bug where TcpAccept() was dereferencing a released pointer.  If TcpAccept() accepts the socket and calls the infoPtr->acceptProc() then Tcl code (the -server script) is evaluated.  In socket_any-2.10 the Tcl script closes the server socket (frees infoPtr->sockets and infoPtr).  After the eval, TcpAccept() returns up the stack to SocketEventProc() which is still walking fds (now freed).  Before adding ckfree() to TcpCloseProc() fds was leaked so the for loop called TcpAccept() again which would get INVALID_SOCKET and write two values into fds->infoPtr (previously freed), but this overwrite wasn't causing a crash (just invisible corruption).

This has been fixed by having the accept logic terminate the loop after one socket is accepted.  To ensure locking is handled correctly the accept() and infoPtr->readyEvents manipulation has been moved to SocketEventProc(), with TcpAccept() just handling the setup of the client channel after the socket is accepted.  Having all the logics and readyEvents manipulation in one place should also improve code maintainability.

Slight refactoring in CreateSocket() and NewSocketInfo() to extract common logic to add to the infoPtr->sockets linked list, now in AddSocketInfoFd().  This also makes clearer what CreateSocket() is trying to do.

twylite added on 2012-07-20 17:22:32:

File Added - 449371: td-patch-3545363-1.diff

mistachkin added on 2012-07-20 03:09:10:
FYI, just confirmed that this is the same issue that was causing the Eagle socket tests to hang as well in 8.6.

rmax added on 2012-07-19 17:55:44:
Committed a proposed fix to the bug-3545363 branch.

rmax added on 2012-07-19 16:00:11:
As a workaround, you can [set ::tcl::unsupported::socketAF inet] to disable IPv6 in the [socket] code.
Debugging of this is currently being coordinated via the Tcl'ers chat in case you want to join in.

twylite added on 2012-07-19 15:45:24:
I have built with MSVC2008 (VC 9.0) and tested on Windows XP (32-bit, no IPv6 installed) and don't have the hanging problem, but did reproduce the io-29.33b crash (bug 3545365).

With the 32-bit VC10 build on Windows 7 64-bit, changing localhost to 127.0.0.1 in C:\Windows\System32\drivers\etc\hosts fixed the socket.test issue, but not chan-io-53.10.  socket.test passes all socket_any tests, and hangs occasionally on socket_inet-11.7 (seems related to temp files adjacent to the .exe), and fails all socket_inet6-2.* tests and hangs on socket_inet6-2.11.

I've upped the priority of this bug to 8 - if this bug reproduces on other systems then it should block the 8.6b3 release as most socket code will fail on affected systems.  The problem also seems to be behind failures in thread-17.6 and thread-17.7 (workaround is to set localhost to 127.0.0.1 in hosts), and I'm still working through other test failures (its hard to get a run through the test suite that doesn't hang).

ferrieux added on 2012-07-18 22:42:00:
Good investigation ! Now that the smoking gun is IPv6, reassigning to Reinhard.

twylite added on 2012-07-18 21:50:59:
Okay, using the above script with 'set addr {-myaddr localhost}' causes a hang on vwait.  Telnet connects, but nothing comes out on the Tcl side.

This precisely explains the zlib-8.3 test failure.

And I suspect this isolates the problem(s): nmake /f makefile.vc test TESTFLAGS="-file socket.test -match socket_inet6* -verbose pbstel"

All socket_inet6-2.* tests fail for me, and 2.11 hangs.  Details are below.

I'm not familiar with IPv6 networks, and my OS (Windows 7 64-bit) only has IPv4 network access.  Obviously there appears to be a IPv6 stack present (and 'ping localhost' or 'ping ::1' works).  I've tried enabling IPv6 on my network adapter and giving it an address (that works, ipconfig /all shows the address, I can ping it) but this makes no difference to the socket_inet6-2.* tests.

Despite this, I would expect these tests not to hang.  I would expect a typical Windows 7 home user to have a similar 'default' IPv6 stack and consequently I would expect this bug to reproduce on such a system.


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:317: error: test failed
: socket_inet6-2.1 tcp connection
==== socket_inet6-2.1 tcp connection FAILED
==== Contents of test case:

    # $x == "ready" at this point
    set sock [socket $localhost $listen]
    lappend x [gets $f]
    close $sock
    lappend x [gets $f]

---- Result was:
ready timed_out {}
---- Result should have been (exact matching):
ready done {}
==== socket_inet6-2.1 FAILED

---- socket_inet6-2.2 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:348: error: test failed
: socket_inet6-2.2 tcp connection with client port specified
==== socket_inet6-2.2 tcp connection with client port specified FAILED
==== Contents of test case:

    # $x == "ready" at this point
    set sock [socket -myport $port $localhost $listen]
    puts $sock hello
    flush $sock
    lappend x [expr {[gets $f] eq "hello $port"}]
    close $sock
    return $x

---- Result was:
ready 0
---- Result should have been (exact matching):
ready 1
==== socket_inet6-2.2 FAILED

---- socket_inet6-2.3 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:383: error: test failed
: socket_inet6-2.3 tcp connection with client interface specified
==== socket_inet6-2.3 tcp connection with client interface specified FAILED
==== Contents of test case:

    # $x == "ready" at this point
    set sock [socket -myaddr $localhost $localhost $listen]
    puts $sock hello
    flush $sock
    lappend x [gets $f]
    close $sock
    return $x

---- Result was:
ready {}
---- Result should have been (exact matching):
ready {hello ::1}
==== socket_inet6-2.3 FAILED

---- socket_inet6-2.4 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:416: error: test failed
: socket_inet6-2.4 tcp connection with server interface specified
==== socket_inet6-2.4 tcp connection with server interface specified FAILED
==== Contents of test case:

    # $x == "ready" at this point
    set sock [socket $localhost $listen]
    puts $sock hello
    flush $sock
    lappend x [gets $f]
    close $sock
    return $x

---- Result was:
ready {}
---- Result should have been (exact matching):
ready hello
==== socket_inet6-2.4 FAILED

---- socket_inet6-2.5 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:450: error: test failed
: socket_inet6-2.5 tcp connection with redundant server port
==== socket_inet6-2.5 tcp connection with redundant server port FAILED
==== Contents of test case:

    # $x == "ready" at this point
    set sock [socket $localhost $listen]
    puts $sock hello
    flush $sock
    lappend x [gets $f]
    close $sock
    return $x

---- Result was:
ready {}
---- Result should have been (exact matching):
ready hello
==== socket_inet6-2.5 FAILED

---- socket_inet6-2.6 start
++++ socket_inet6-2.6 PASSED
---- socket_inet6-2.7 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:493: error: test failed
: socket_inet6-2.7 echo server, one line
==== socket_inet6-2.7 echo server, one line FAILED
==== Contents of test case:

    set s [socket $localhost $listen]
    fconfigure $s -buffering line -translation lf
    puts $s "hello abcdefghijklmnop"
    set x [gets $s]
    close $s
    list $x [gets $f]

---- Result was:
{} timeout
---- Result should have been (exact matching):
{hello abcdefghijklmnop} done
==== socket_inet6-2.7 FAILED

---- socket_inet6-2.8 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:535: error: test failed
: socket_inet6-2.8 echo server, loop 50 times, single connection
==== socket_inet6-2.8 echo server, loop 50 times, single connection FAILED
==== Contents of test case:

    set s [socket $localhost $listen]
    fconfigure $s -buffering line
    catch {
        for {set x 0} {$x < 50} {incr x} {
            puts $s "hello abcdefghijklmnop"
            gets $s
        }
    }
    close $s
    catch {set x [gets $f]}
    return $x

---- Result was:
done 0
---- Result should have been (exact matching):
done 50
==== socket_inet6-2.8 FAILED

---- socket_inet6-2.9 start
++++ socket_inet6-2.9 PASSED
---- socket_inet6-2.10 start


C:/User/Tcl_BUILD/cmake-2012/tcl-trunk/tests/socket.test:597: error: test failed
: socket_inet6-2.10 close on accept, accepted socket lives
==== socket_inet6-2.10 close on accept, accepted socket lives FAILED
==== Contents of test case:

    set ss [socket -server accept 0]
    proc accept {s a p} {
        global ss
        close $ss
        fileevent $s readable "readit $s"
        fconfigure $s -trans lf
    }
    proc readit {s} {
        global done
        gets $s
        close $s
        set done 1
    }
    set cs [socket $localhost [lindex [fconfigure $ss -sockname] 2]]
    puts $cs hello
    close $cs
    vwait done
    return $done

---- Result was:
timed_out
---- Result should have been (exact matching):
1
==== socket_inet6-2.10 FAILED

twylite added on 2012-07-18 21:14:57:
By all means :)  To avoid giving a stupid answer, I have done the following:

1. Reboot (to make sure nothing is misbehaving)
2. Turned Windows Firewall off in Control Panel (for Domain, Private and Public networks)
3. Stopped the Windows Firewall service.

socket.test and chan-io-53.10 still hang.

4. Start Windows Firewall service.
5. Enable firewall in Control Panel

chanio.test requests firewall access for the tclsh86t.exe process (granted).
socket.test and chan-io-53.10 still hang.

Now the making-progress bit:
(1) The socket.test hang is IPv6 related.
(2) I forgot to mention that various http and http11 tests seem to hang as well (exact tests not investigated, I just started skipping the files).
(2) The test suite has a bunch of socket -server stuff that _doesn't_ hang ... and all of it uses -myaddr ... except chan-io-53.10, socket.test latency calc, and some http stuff.

Test script:
--
set port 0
set addr {} ;# {} or {-myaddr 127.0.0.1} or {-myaddr 10.11.12.79}
set s [socket -server [list apply {{s a p} {
  puts "Accepted client connection '$s' (a='$a', p='$p')"
  set ::clisock $s
  puts $s "hello"; flush $s
}}] {*}$addr $port]
puts "Opened server socket '[fconfigure $s -sockname]'"
vwait ::clisock
--

Results (manually collated):
--
FAIL: addr = {} ; telnet localhost $port = connect, hang
PASS: addr = {} ; telnet localhost ($port + 1) = connecting ... failed
PASS: addr = {} ; telnet 127.0.0.1 $port = ok
PASS: addr = {} ; telnet 10.11.12.79 $port = ok

PASS: addr = {-myaddr 127.0.0.1} ; telnet localhost $port = ok
PASS: addr = {-myaddr 127.0.0.1} ; telnet localhost ($port + 1) = connecting ... failed
PASS: addr = {-myaddr 127.0.0.1} ; telnet 127.0.0.1 $port = ok
PASS: addr = {-myaddr 127.0.0.1} ; telnet 10.11.12.79 $port = connecting ... failed

PASS: addr = {-myaddr 10.11.12.79} ; telnet localhost $port = connecting ... failed
PASS: addr = {-myaddr 10.11.12.79} ; telnet localhost ($port + 1) = connecting ... failed
PASS: addr = {-myaddr 10.11.12.79} ; telnet 127.0.0.1 $port = connecting ... failed
PASS: addr = {-myaddr 10.11.12.79} ; telnet 10.11.12.79 $port = ok

ping localhost
Pinging dbn-trevord-i5... [::1] with 32 bytes of data:
Reply from ::1: time<1ms
--

The connect in socket.test uses localhost rather than 127.0.0.1 -- changing that allows socket.test to run (there are some failure, I will investigate further).

The chan-io-53.10 doesn't use -myaddr but the connect goes to 127.0.0.1 -- according to my test script that works, so we may be looking at two independent issues here.

Please advise on whether the IPv6 localhost address is expected to work or not?

ferrieux added on 2012-07-18 19:53:51:
May I risk a stupid suggestion ? Windows firewall ?

twylite added on 2012-07-18 19:00:27:
I also get a failure in zlib-8.3, which seems to be related.  I'll hold back on filing a separate bug for that until this issue is resolved.

==== zlib-8.3 zlib transformation and fileevent FAILED
==== Contents of test case:

    set sin [socket localhost $port]
    try {
        fconfigure $sin -translation binary
        zlib push gunzip $sin
        after 1000 {set total timeout}
        fcopy $sin $fout -command {apply {{c {e {}}} {
            set ::total [expr {$e eq {} ? $c : $e}]
        }}}
        vwait total
        after cancel {set total timeout}
    } finally {
        close $sin
    }
    append total --> [file size $file]

---- Result was:
timeout-->0
---- Result should have been (exact matching):
81920-->81920
==== zlib-8.3 FAILED

twylite added on 2012-07-18 18:50:49:
May be related to #3295154

Attachments: