Tcl Source Code

View Ticket
Login
2014-06-16
13:38
[1758a0b603] socket_*-2.13 : Workaround the broken select() in some Linux kernels that fails to repo... check-in: dc173b3985 user: dgp tags: trunk
13:34 Closed ticket [1758a0b603]: Flushing channels can caused leaked file-descriptors plus 7 other changes artifact: 83c174b7de user: dgp
12:44 Ticket [1758a0b603]: 3 changes artifact: 05a3ac56fe user: dgp
12:44
[1758a0b603] socket-2.13 : Workaround the broken select() in some Linux kernels that fails to report... check-in: e67cc60d94 user: dgp tags: core-8-5-branch
2014-06-13
21:15
Draft test for [1758a0b603]. check-in: 71669f9456 user: dgp tags: core-8-5-branch
2014-06-11
17:25 Ticket [1758a0b603] Flushing channels can caused leaked file-descriptors status still Open with 3 other changes artifact: a86c0b0bfa user: dgp
2014-05-20
18:32 Ticket [1758a0b603]: 3 changes artifact: 75996dbed2 user: dgp
2014-03-24
10:21 Ticket [1758a0b603]: 3 changes artifact: 1e950c5970 user: a3a3el
10:20 Add attachment flush_channel2.patch to ticket [1758a0b603] artifact: 24b73c35a4 user: a3a3el
10:18 Add attachment server2.tcl to ticket [1758a0b603] artifact: 09562d035b user: a3a3el
10:17 Ticket [1758a0b603] Flushing channels can caused leaked file-descriptors status still Open with 3 other changes artifact: 67001f6c3a user: a3a3el
2014-03-21
15:15 Ticket [1758a0b603]: 4 changes artifact: 0d324ddbe9 user: a3a3el
15:14 Add attachment flush_channel.patch to ticket [1758a0b603] artifact: fa0b5aa45b user: a3a3el
15:11 Add attachment select_test.c to ticket [1758a0b603] artifact: 67d2ea8f06 user: a3a3el
15:11 Add attachment server.tcl to ticket [1758a0b603] artifact: a2719e933b user: a3a3el
15:11 Add attachment client.tcl to ticket [1758a0b603] artifact: e3330cc6cb user: a3a3el
14:43 New ticket [1758a0b603] Flushing channels can caused leaked file-descriptors. artifact: ce40f565e8 user: a3a3el

Ticket UUID: 1758a0b6035fcb3425cae757aeba704d67346902
Title: Flushing channels can caused leaked file-descriptors
Type: Bug Version: 8.5.15, 8.6.1
Submitter: a3a3el Created on: 2014-03-21 14:43:56
Subsystem: 25. Channel System Assigned To: dgp
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2014-06-16 13:34:54
Resolution: Fixed Closed By: dgp
    Closed on: 2014-06-16 13:34:54
Description:
It turns out that under some Linux kernels (including rhel5 and rhel6), select(2) does not return as writable sockets which have been closed from the other end, even though calls to sendmsg (which is what POSIX gives as the implied output function for sockets) would not block.  This can result in Tcl's leaking socket-descriptors.

Source code is attached.

Compare this snippet:

  $ lsb_release -d
  Description:    Red Hat Enterprise Linux Server release 6.5 (Santiago)
  $ uname -r
  2.6.32-358.23.2.el6.x86_64
  $ gcc -Wall -W -pedantic -std=c99 -O3 -D_XOPEN_SOURCE=600 select_test.c -o select_test
  $ ./select_test localhost 12345 ncat
  _system: ncat -l 12345 & sleep 1
  _connect: connected to localhost:12345.
  _select: select = 1, FD_ISSET (3, &wfds) == 1.
  _system: pkill -x ncat; sleep 1
  _select: select = 1, FD_ISSET (3, &wfds) == 1.
  _send: sendmsg 0 returned: 163840
  _send: sendmsg 1 failed: Broken pipe
  _select: select = 0, FD_ISSET (3, &wfds) == 0.

to this one:

  $ lsb_release -d
  Description:    Debian GNU/Linux testing (jessie)
  $ uname -r
  3.10-2-amd64
  $ gcc -Wall -W -pedantic -std=c99 -O3 -D_XOPEN_SOURCE=600 select_test.c -o select_test
  $ ./select_test localhost 12345 nc
  _system: nc -l 12345 & sleep 1
  _connect: connected to localhost:12345.
  _select: select = 1, FD_ISSET (3, &wfds) == 1.
  _system: pkill -x nc; sleep 1
  _select: select = 1, FD_ISSET (3, &wfds) == 1.
  _send: sendmsg 0 returned: 656640
  _send: sendmsg 1 failed: Broken pipe
  _select: select = 1, FD_ISSET (3, &wfds) == 1.

Next I run a server which accepts connexions and then writes long lines to them, and a client which connects, pauses to give the server a chance to fill the send buffers on its socket, starts reading for a while and disconnects.  The client does this a limited number of times and then exits.  Afterwards I run lsof against the server.

  $ /space/azazel/tcl8.5.15/bin/tclsh8.5 server.tcl 2300 &
  [1] 4886
  $ /space/azazel/tcl8.5.15/bin/tclsh8.5 client.tcl localhost 2300
  $ lsof -p 4886 -n -P 2>/dev/null
  COMMAND   PID    USER   FD   TYPE     DEVICE SIZE/OFF       NODE NAME
  tclsh8.5 4886 jsowden  cwd    DIR      253,2     4096   36449390 /space/azazel/tmp/tcl_async_flush
  tclsh8.5 4886 jsowden  rtd    DIR      253,0     4096          2 /
  tclsh8.5 4886 jsowden  txt    REG      253,2     8815   37227160 /space/azazel/tcl8.5.15/bin/tclsh8.5
  tclsh8.5 4886 jsowden  mem    REG      253,0   156928       1885 /lib64/ld-2.12.so
  tclsh8.5 4886 jsowden  mem    REG      253,0  1926800       1886 /lib64/libc-2.12.so
  tclsh8.5 4886 jsowden  mem    REG      253,0    22536       1986 /lib64/libdl-2.12.so
  tclsh8.5 4886 jsowden  mem    REG      253,0   599384       5964 /lib64/libm-2.12.so
  tclsh8.5 4886 jsowden  mem    REG      253,3   347044        169 /var/db/nscd/passwd
  tclsh8.5 4886 jsowden  mem    REG      253,0 99158576      27784 /usr/lib/locale/locale-archive
  tclsh8.5 4886 jsowden  mem    REG      253,2  1274890   36974995 /space/azazel/tcl8.5.15/lib/libtcl8.5.so
  tclsh8.5 4886 jsowden    0u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.5 4886 jsowden    1u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.5 4886 jsowden    2u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.5 4886 jsowden    3u  IPv4 1302178374      0t0        TCP *:2300 (LISTEN)
  tclsh8.5 4886 jsowden    4u  sock        0,6      0t0 1302186289 can't identify protocol
  tclsh8.5 4886 jsowden    5u  sock        0,6      0t0 1302182312 can't identify protocol
  tclsh8.5 4886 jsowden    6u  sock        0,6      0t0 1302197139 can't identify protocol
  tclsh8.5 4886 jsowden    7u  sock        0,6      0t0 1302196702 can't identify protocol
  tclsh8.5 4886 jsowden    8u  sock        0,6      0t0 1302264163 can't identify protocol
  tclsh8.5 4886 jsowden    9u  sock        0,6      0t0 1302212690 can't identify protocol
  tclsh8.5 4886 jsowden   10u  sock        0,6      0t0 1302250999 can't identify protocol
  tclsh8.5 4886 jsowden   11u  sock        0,6      0t0 1302353390 can't identify protocol
  tclsh8.5 4886 jsowden   12u  sock        0,6      0t0 1302324260 can't identify protocol
  tclsh8.5 4886 jsowden   13u  sock        0,6      0t0 1302326816 can't identify protocol
  tclsh8.5 4886 jsowden   14u  sock        0,6      0t0 1302345636 can't identify protocol
  tclsh8.5 4886 jsowden   15u  sock        0,6      0t0 1302432996 can't identify protocol
  tclsh8.5 4886 jsowden   16u  sock        0,6      0t0 1302369570 can't identify protocol
  tclsh8.5 4886 jsowden   17u  sock        0,6      0t0 1302382000 can't identify protocol
  tclsh8.5 4886 jsowden   18u  sock        0,6      0t0 1302416754 can't identify protocol
  tclsh8.5 4886 jsowden   19u  sock        0,6      0t0 1302442574 can't identify protocol
  tclsh8.5 4886 jsowden   20u  sock        0,6      0t0 1302438961 can't identify protocol
  tclsh8.5 4886 jsowden   21u  sock        0,6      0t0 1302505446 can't identify protocol
  tclsh8.5 4886 jsowden   22u  sock        0,6      0t0 1302545287 can't identify protocol
  tclsh8.5 4886 jsowden   23u  sock        0,6      0t0 1302536915 can't identify protocol
  tclsh8.5 4886 jsowden   24u  sock        0,6      0t0 1302585903 can't identify protocol
  tclsh8.5 4886 jsowden   25u  sock        0,6      0t0 1302557438 can't identify protocol
  tclsh8.5 4886 jsowden   26u  sock        0,6      0t0 1302571883 can't identify protocol
  tclsh8.5 4886 jsowden   27u  sock        0,6      0t0 1302600981 can't identify protocol
  tclsh8.5 4886 jsowden   28u  sock        0,6      0t0 1302606036 can't identify protocol
  tclsh8.5 4886 jsowden   30u  sock        0,6      0t0 1302612110 can't identify protocol
  $ kill %%
  [1]+  Terminated              /space/azazel/tcl8.5.15/bin/tclsh8.5 server.tcl 2300
  $ /space/azazel/tcl8.6.1/bin/tclsh8.6 server.tcl 2300 &
  [1] 32450
  $ /space/azazel/tcl8.6.1/bin/tclsh8.6 client.tcl localhost 2300
  $ lsof -p 32450 -n -P 2>/dev/null
  COMMAND    PID    USER   FD   TYPE     DEVICE SIZE/OFF       NODE NAME
  tclsh8.6 32450 jsowden  cwd    DIR      253,2     4096   36449390 /space/azazel/tmp/tcl_async_flush
  tclsh8.6 32450 jsowden  rtd    DIR      253,0     4096          2 /
  tclsh8.6 32450 jsowden  txt    REG      253,2     9126   40788208 /space/azazel/tcl8.6.1/bin/tclsh8.6
  tclsh8.6 32450 jsowden  mem    REG      253,0   156928       1885 /lib64/ld-2.12.so
  tclsh8.6 32450 jsowden  mem    REG      253,0  1926800       1886 /lib64/libc-2.12.so
  tclsh8.6 32450 jsowden  mem    REG      253,0    22536       1986 /lib64/libdl-2.12.so
  tclsh8.6 32450 jsowden  mem    REG      253,0   145896       1887 /lib64/libpthread-2.12.so
  tclsh8.6 32450 jsowden  mem    REG      253,0    91096       5944 /lib64/libz.so.1.2.3
  tclsh8.6 32450 jsowden  mem    REG      253,0   599384       5964 /lib64/libm-2.12.so
  tclsh8.6 32450 jsowden  mem    REG      253,3   347044        169 /var/db/nscd/passwd
  tclsh8.6 32450 jsowden  mem    REG      253,0 99158576      27784 /usr/lib/locale/locale-archive
  tclsh8.6 32450 jsowden  mem    REG      253,2  1797180   40648124 /space/azazel/tcl8.6.1/lib/libtcl8.6.so
  tclsh8.6 32450 jsowden    0u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.6 32450 jsowden    1u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.6 32450 jsowden    2u   CHR    136,602      0t0        605 /dev/pts/602
  tclsh8.6 32450 jsowden    3r  FIFO        0,8      0t0 1302920286 pipe
  tclsh8.6 32450 jsowden    4w  FIFO        0,8      0t0 1302920286 pipe
  tclsh8.6 32450 jsowden    5u  IPv4 1302920300      0t0        TCP *:2300 (LISTEN)
  tclsh8.6 32450 jsowden    6u  IPv6 1302920301      0t0        TCP *:2300 (LISTEN)
  tclsh8.6 32450 jsowden    7u  sock        0,6      0t0 1302960490 can't identify protocol
  tclsh8.6 32450 jsowden    8u  sock        0,6      0t0 1302991225 can't identify protocol
  tclsh8.6 32450 jsowden    9u  sock        0,6      0t0 1302999582 can't identify protocol
  tclsh8.6 32450 jsowden   10u  sock        0,6      0t0 1302996445 can't identify protocol
  tclsh8.6 32450 jsowden   11u  sock        0,6      0t0 1303021832 can't identify protocol
  tclsh8.6 32450 jsowden   12u  sock        0,6      0t0 1303026730 can't identify protocol
  tclsh8.6 32450 jsowden   13u  sock        0,6      0t0 1303037627 can't identify protocol
  tclsh8.6 32450 jsowden   14u  sock        0,6      0t0 1303034803 can't identify protocol
  tclsh8.6 32450 jsowden   15u  sock        0,6      0t0 1303037144 can't identify protocol
  tclsh8.6 32450 jsowden   16u  sock        0,6      0t0 1303055527 can't identify protocol
  tclsh8.6 32450 jsowden   17u  sock        0,6      0t0 1303040280 can't identify protocol
  tclsh8.6 32450 jsowden   18u  sock        0,6      0t0 1303051215 can't identify protocol
  tclsh8.6 32450 jsowden   19u  sock        0,6      0t0 1303061492 can't identify protocol
  tclsh8.6 32450 jsowden   20u  sock        0,6      0t0 1303058139 can't identify protocol
  tclsh8.6 32450 jsowden   21u  sock        0,6      0t0 1303069747 can't identify protocol
  tclsh8.6 32450 jsowden   22u  sock        0,6      0t0 1303097499 can't identify protocol
  tclsh8.6 32450 jsowden   23u  sock        0,6      0t0 1303082333 can't identify protocol
  tclsh8.6 32450 jsowden   24u  sock        0,6      0t0 1303084992 can't identify protocol
  tclsh8.6 32450 jsowden   25u  sock        0,6      0t0 1303087033 can't identify protocol
  tclsh8.6 32450 jsowden   26u  sock        0,6      0t0 1303088580 can't identify protocol
  tclsh8.6 32450 jsowden   27u  sock        0,6      0t0 1303094229 can't identify protocol
  tclsh8.6 32450 jsowden   28u  sock        0,6      0t0 1303099862 can't identify protocol
  tclsh8.6 32450 jsowden   29u  sock        0,6      0t0 1303105973 can't identify protocol
  tclsh8.6 32450 jsowden   30u  sock        0,6      0t0 1303140515 can't identify protocol
  tclsh8.6 32450 jsowden   31u  sock        0,6      0t0 1303113400 can't identify protocol
  tclsh8.6 32450 jsowden   32u  sock        0,6      0t0 1303116737 can't identify protocol
  tclsh8.6 32450 jsowden   33u  sock        0,6      0t0 1303152668 can't identify protocol
  tclsh8.6 32450 jsowden   34u  sock        0,6      0t0 1303227072 can't identify protocol
  tclsh8.6 32450 jsowden   35u  sock        0,6      0t0 1303166233 can't identify protocol
  tclsh8.6 32450 jsowden   36u  sock        0,6      0t0 1303167227 can't identify protocol
  tclsh8.6 32450 jsowden   37u  sock        0,6      0t0 1303190255 can't identify protocol
  tclsh8.6 32450 jsowden   38u  sock        0,6      0t0 1303240333 can't identify protocol
  tclsh8.6 32450 jsowden   39u  sock        0,6      0t0 1303247365 can't identify protocol
  tclsh8.6 32450 jsowden   40u  sock        0,6      0t0 1303247820 can't identify protocol
  tclsh8.6 32450 jsowden   41u  sock        0,6      0t0 1303277650 can't identify protocol
  tclsh8.6 32450 jsowden   42u  sock        0,6      0t0 1303250926 can't identify protocol
  tclsh8.6 32450 jsowden   43u  sock        0,6      0t0 1303254036 can't identify protocol
  tclsh8.6 32450 jsowden   44u  sock        0,6      0t0 1303268472 can't identify protocol
  tclsh8.6 32450 jsowden   45u  sock        0,6      0t0 1303275087 can't identify protocol
  tclsh8.6 32450 jsowden   46u  sock        0,6      0t0 1303294151 can't identify protocol
  tclsh8.6 32450 jsowden   47u  sock        0,6      0t0 1303281716 can't identify protocol
  tclsh8.6 32450 jsowden   48u  sock        0,6      0t0 1303327290 can't identify protocol
  tclsh8.6 32450 jsowden   49u  sock        0,6      0t0 1303307061 can't identify protocol
  tclsh8.6 32450 jsowden   50u  sock        0,6      0t0 1303320536 can't identify protocol
  tclsh8.6 32450 jsowden   51u  sock        0,6      0t0 1303331253 can't identify protocol

What appears to be happening is that the client disconnects while the server is
doing asynchronous flushes, which causes the channel's output proc to return -1
(tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2481ff.):

	/*
	 * Produce the output on the channel.
	 */

	toWrite = BytesLeft(bufPtr);
	if (toWrite == 0) {
	    written = 0;
	} else {
	    written = (chanPtr->typePtr->outputProc)(chanPtr->instanceData,
		RemovePoint(bufPtr), toWrite, &errorCode);
	}

	/*
	 * If the write failed completely attempt to start the asynchronous
	 * flush mechanism and break out of this loop - do not attempt to
	 * write any more output at this time.
	 */

	if (written < 0) {

and all pending output to be discarded (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2584ff.):

	    /*
	     * When we get an error we throw away all the output currently
	     * queued.
	     */

	    DiscardOutputQueued(statePtr);
	    continue;
	} else {
	    wroteSome = 1;
	}

Next time round the loop, the buffer pointer is NULL and the loop is exited (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2469ff.):

	/*
	 * If the output queue is still empty, break out of the while loop.
	 */

	if (bufPtr == NULL) {
	    break;	/* Out of the "while (1)". */
	}

Since data have already been written during this call, the wroteSome is true, and BG_FLUSH_SCHEDULED flag is not reset (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2610ff.):

    /*
     * If we wrote some data while flushing in the background, we are done.
     * We can't finish the background flush until we run out of data and the
     * channel becomes writable again. This ensures that all of the pending
     * data has been flushed at the system level.
     */

    if (statePtr->flags & BG_FLUSH_SCHEDULED) {
	if (wroteSome) {
	    return errorCode;
	} else if (statePtr->outQueueHead == NULL) {
	    ResetFlag(statePtr, BG_FLUSH_SCHEDULED);
	    (chanPtr->typePtr->watchProc)(chanPtr->instanceData,
		    statePtr->interestMask);
	}
    }

Because BG_FLUSH_SCHEDULED is still set, when the channel is closed, Tcl_UnregisterChannel will not call Tcl_Close (tcl8.5.15/generic/tclIO.c, Tcl_UnregisterChannel, ll. 1090ff.):

    /*
     * If the refCount reached zero, close the actual channel.
     */

    if (statePtr->refCount <= 0) {
	/*
	 * Ensure that if there is another buffer, it gets flushed whether or
	 * not we are doing a background flush.
	 */

	if ((statePtr->curOutPtr != NULL) &&
		IsBufferReady(statePtr->curOutPtr)) {
	    SetFlag(statePtr, BUFFER_READY);
	}
	Tcl_Preserve((ClientData)statePtr);
	if (!(statePtr->flags & BG_FLUSH_SCHEDULED)) {
	    /*
	     * We don't want to re-enter Tcl_Close().
	     */

	    if (!(statePtr->flags & CHANNEL_CLOSED)) {
		if (Tcl_Close(interp, chan) != TCL_OK) {
		    SetFlag(statePtr, CHANNEL_CLOSED);
		    Tcl_Release((ClientData)statePtr);
		    return TCL_ERROR;
		}
	    }
	}
	SetFlag(statePtr, CHANNEL_CLOSED);
	Tcl_Release((ClientData)statePtr);
    }

However, because of the non-standard behaviour of select(2) described above, the socket-descriptor will never be returned by select(2) as writable.  Consequently, Tcl_NotifyChannel will never call FlushChannel again, and so FlushChannel doesn't close it either (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2627ff.):

    /*
     * If the channel is flagged as closed, delete it when the refCount drops
     * to zero, the output queue is empty and there is no output in the
     * current output buffer.
     */

    if ((statePtr->flags & CHANNEL_CLOSED) && (statePtr->refCount <= 0) &&
	    (statePtr->outQueueHead == NULL) &&
	    ((statePtr->curOutPtr == NULL) ||
	    IsBufferEmpty(statePtr->curOutPtr))) {
	return CloseChannel(interp, chanPtr, errorCode);
    }

Changing FlushChannel so as always to clear BG_FLUSH_SCHEDULED if the output buffer is empty appears to fix the bug (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2617ff.):

    if (statePtr->flags & BG_FLUSH_SCHEDULED) {
	if (statePtr->outQueueHead == NULL) {
	    ResetFlag(statePtr, BG_FLUSH_SCHEDULED);
	    (chanPtr->typePtr->watchProc)(chanPtr->instanceData,
		    statePtr->interestMask);
	}
	if (wroteSome) {
	    return errorCode;
	}
    }
User Comments: dgp added on 2014-06-16 13:34:54:
Fixed in active branches.

dgp added on 2014-06-16 12:44:34:
New test socket-2.13 covers this.

dgp added on 2014-06-11 17:25:41:
see proposed fix on bug-1758a0b603 branch.

dgp added on 2014-05-20 18:32:12:
What would help me move forward with this
is a test or tests to add to the test suite
that demonstrate the problem and when the
patch is applied demonstrate the the patch
corrects the problem.  Partnership with someone
on a system where the problem manifests would
also be useful.

a3a3el added on 2014-03-24 10:17:47:
On giving this more thought, I realized that it is also possible that the server might close the channel with buffered output (attached as server2.tcl), the client close its socket before the output is completely flushed and we still end up with leaked descriptors even with my suggested patch.

Moving the check to see whether the channel should be closed before the check of the BG_FLUSH_SCHEDULED flag appears to solve that too (attached as flush_channel2.patch).

a3a3el added on 2014-03-21 15:15:04:
Just noticed the "Attach" button, so the source-code and patch are now with the ticket.

Attachments: