2014-06-16
| ||
13:38 | [1758a0b603] socket_*-2.13 : Workaround the broken select() in some Linux kernels that fails to repo... check-in: dc173b3985 user: dgp tags: trunk | |
13:34 | • Closed ticket [1758a0b603]: Flushing channels can caused leaked file-descriptors plus 7 other changes artifact: 83c174b7de user: dgp | |
12:44 | • Ticket [1758a0b603]: 3 changes artifact: 05a3ac56fe user: dgp | |
12:44 | [1758a0b603] socket-2.13 : Workaround the broken select() in some Linux kernels that fails to report... check-in: e67cc60d94 user: dgp tags: core-8-5-branch | |
2014-06-13
| ||
21:15 | Draft test for [1758a0b603]. check-in: 71669f9456 user: dgp tags: core-8-5-branch | |
2014-06-11
| ||
17:25 | • Ticket [1758a0b603] Flushing channels can caused leaked file-descriptors status still Open with 3 other changes artifact: a86c0b0bfa user: dgp | |
2014-05-20
| ||
18:32 | • Ticket [1758a0b603]: 3 changes artifact: 75996dbed2 user: dgp | |
2014-03-24
| ||
10:21 | • Ticket [1758a0b603]: 3 changes artifact: 1e950c5970 user: a3a3el | |
10:20 | • Add attachment flush_channel2.patch to ticket [1758a0b603] artifact: 24b73c35a4 user: a3a3el | |
10:18 | • Add attachment server2.tcl to ticket [1758a0b603] artifact: 09562d035b user: a3a3el | |
10:17 | • Ticket [1758a0b603] Flushing channels can caused leaked file-descriptors status still Open with 3 other changes artifact: 67001f6c3a user: a3a3el | |
2014-03-21
| ||
15:15 | • Ticket [1758a0b603]: 4 changes artifact: 0d324ddbe9 user: a3a3el | |
15:14 | • Add attachment flush_channel.patch to ticket [1758a0b603] artifact: fa0b5aa45b user: a3a3el | |
15:11 | • Add attachment select_test.c to ticket [1758a0b603] artifact: 67d2ea8f06 user: a3a3el | |
15:11 | • Add attachment server.tcl to ticket [1758a0b603] artifact: a2719e933b user: a3a3el | |
15:11 | • Add attachment client.tcl to ticket [1758a0b603] artifact: e3330cc6cb user: a3a3el | |
14:43 | • New ticket [1758a0b603] Flushing channels can caused leaked file-descriptors. artifact: ce40f565e8 user: a3a3el | |
Ticket UUID: | 1758a0b6035fcb3425cae757aeba704d67346902 | |||
Title: | Flushing channels can caused leaked file-descriptors | |||
Type: | Bug | Version: | 8.5.15, 8.6.1 | |
Submitter: | a3a3el | Created on: | 2014-03-21 14:43:56 | |
Subsystem: | 25. Channel System | Assigned To: | dgp | |
Priority: | 5 Medium | Severity: | Important | |
Status: | Closed | Last Modified: | 2014-06-16 13:34:54 | |
Resolution: | Fixed | Closed By: | dgp | |
Closed on: | 2014-06-16 13:34:54 | |||
Description: |
It turns out that under some Linux kernels (including rhel5 and rhel6), select(2) does not return as writable sockets which have been closed from the other end, even though calls to sendmsg (which is what POSIX gives as the implied output function for sockets) would not block. This can result in Tcl's leaking socket-descriptors. Source code is attached. Compare this snippet: $ lsb_release -d Description: Red Hat Enterprise Linux Server release 6.5 (Santiago) $ uname -r 2.6.32-358.23.2.el6.x86_64 $ gcc -Wall -W -pedantic -std=c99 -O3 -D_XOPEN_SOURCE=600 select_test.c -o select_test $ ./select_test localhost 12345 ncat _system: ncat -l 12345 & sleep 1 _connect: connected to localhost:12345. _select: select = 1, FD_ISSET (3, &wfds) == 1. _system: pkill -x ncat; sleep 1 _select: select = 1, FD_ISSET (3, &wfds) == 1. _send: sendmsg 0 returned: 163840 _send: sendmsg 1 failed: Broken pipe _select: select = 0, FD_ISSET (3, &wfds) == 0. to this one: $ lsb_release -d Description: Debian GNU/Linux testing (jessie) $ uname -r 3.10-2-amd64 $ gcc -Wall -W -pedantic -std=c99 -O3 -D_XOPEN_SOURCE=600 select_test.c -o select_test $ ./select_test localhost 12345 nc _system: nc -l 12345 & sleep 1 _connect: connected to localhost:12345. _select: select = 1, FD_ISSET (3, &wfds) == 1. _system: pkill -x nc; sleep 1 _select: select = 1, FD_ISSET (3, &wfds) == 1. _send: sendmsg 0 returned: 656640 _send: sendmsg 1 failed: Broken pipe _select: select = 1, FD_ISSET (3, &wfds) == 1. Next I run a server which accepts connexions and then writes long lines to them, and a client which connects, pauses to give the server a chance to fill the send buffers on its socket, starts reading for a while and disconnects. The client does this a limited number of times and then exits. Afterwards I run lsof against the server. $ /space/azazel/tcl8.5.15/bin/tclsh8.5 server.tcl 2300 & [1] 4886 $ /space/azazel/tcl8.5.15/bin/tclsh8.5 client.tcl localhost 2300 $ lsof -p 4886 -n -P 2>/dev/null COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME tclsh8.5 4886 jsowden cwd DIR 253,2 4096 36449390 /space/azazel/tmp/tcl_async_flush tclsh8.5 4886 jsowden rtd DIR 253,0 4096 2 / tclsh8.5 4886 jsowden txt REG 253,2 8815 37227160 /space/azazel/tcl8.5.15/bin/tclsh8.5 tclsh8.5 4886 jsowden mem REG 253,0 156928 1885 /lib64/ld-2.12.so tclsh8.5 4886 jsowden mem REG 253,0 1926800 1886 /lib64/libc-2.12.so tclsh8.5 4886 jsowden mem REG 253,0 22536 1986 /lib64/libdl-2.12.so tclsh8.5 4886 jsowden mem REG 253,0 599384 5964 /lib64/libm-2.12.so tclsh8.5 4886 jsowden mem REG 253,3 347044 169 /var/db/nscd/passwd tclsh8.5 4886 jsowden mem REG 253,0 99158576 27784 /usr/lib/locale/locale-archive tclsh8.5 4886 jsowden mem REG 253,2 1274890 36974995 /space/azazel/tcl8.5.15/lib/libtcl8.5.so tclsh8.5 4886 jsowden 0u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.5 4886 jsowden 1u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.5 4886 jsowden 2u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.5 4886 jsowden 3u IPv4 1302178374 0t0 TCP *:2300 (LISTEN) tclsh8.5 4886 jsowden 4u sock 0,6 0t0 1302186289 can't identify protocol tclsh8.5 4886 jsowden 5u sock 0,6 0t0 1302182312 can't identify protocol tclsh8.5 4886 jsowden 6u sock 0,6 0t0 1302197139 can't identify protocol tclsh8.5 4886 jsowden 7u sock 0,6 0t0 1302196702 can't identify protocol tclsh8.5 4886 jsowden 8u sock 0,6 0t0 1302264163 can't identify protocol tclsh8.5 4886 jsowden 9u sock 0,6 0t0 1302212690 can't identify protocol tclsh8.5 4886 jsowden 10u sock 0,6 0t0 1302250999 can't identify protocol tclsh8.5 4886 jsowden 11u sock 0,6 0t0 1302353390 can't identify protocol tclsh8.5 4886 jsowden 12u sock 0,6 0t0 1302324260 can't identify protocol tclsh8.5 4886 jsowden 13u sock 0,6 0t0 1302326816 can't identify protocol tclsh8.5 4886 jsowden 14u sock 0,6 0t0 1302345636 can't identify protocol tclsh8.5 4886 jsowden 15u sock 0,6 0t0 1302432996 can't identify protocol tclsh8.5 4886 jsowden 16u sock 0,6 0t0 1302369570 can't identify protocol tclsh8.5 4886 jsowden 17u sock 0,6 0t0 1302382000 can't identify protocol tclsh8.5 4886 jsowden 18u sock 0,6 0t0 1302416754 can't identify protocol tclsh8.5 4886 jsowden 19u sock 0,6 0t0 1302442574 can't identify protocol tclsh8.5 4886 jsowden 20u sock 0,6 0t0 1302438961 can't identify protocol tclsh8.5 4886 jsowden 21u sock 0,6 0t0 1302505446 can't identify protocol tclsh8.5 4886 jsowden 22u sock 0,6 0t0 1302545287 can't identify protocol tclsh8.5 4886 jsowden 23u sock 0,6 0t0 1302536915 can't identify protocol tclsh8.5 4886 jsowden 24u sock 0,6 0t0 1302585903 can't identify protocol tclsh8.5 4886 jsowden 25u sock 0,6 0t0 1302557438 can't identify protocol tclsh8.5 4886 jsowden 26u sock 0,6 0t0 1302571883 can't identify protocol tclsh8.5 4886 jsowden 27u sock 0,6 0t0 1302600981 can't identify protocol tclsh8.5 4886 jsowden 28u sock 0,6 0t0 1302606036 can't identify protocol tclsh8.5 4886 jsowden 30u sock 0,6 0t0 1302612110 can't identify protocol $ kill %% [1]+ Terminated /space/azazel/tcl8.5.15/bin/tclsh8.5 server.tcl 2300 $ /space/azazel/tcl8.6.1/bin/tclsh8.6 server.tcl 2300 & [1] 32450 $ /space/azazel/tcl8.6.1/bin/tclsh8.6 client.tcl localhost 2300 $ lsof -p 32450 -n -P 2>/dev/null COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME tclsh8.6 32450 jsowden cwd DIR 253,2 4096 36449390 /space/azazel/tmp/tcl_async_flush tclsh8.6 32450 jsowden rtd DIR 253,0 4096 2 / tclsh8.6 32450 jsowden txt REG 253,2 9126 40788208 /space/azazel/tcl8.6.1/bin/tclsh8.6 tclsh8.6 32450 jsowden mem REG 253,0 156928 1885 /lib64/ld-2.12.so tclsh8.6 32450 jsowden mem REG 253,0 1926800 1886 /lib64/libc-2.12.so tclsh8.6 32450 jsowden mem REG 253,0 22536 1986 /lib64/libdl-2.12.so tclsh8.6 32450 jsowden mem REG 253,0 145896 1887 /lib64/libpthread-2.12.so tclsh8.6 32450 jsowden mem REG 253,0 91096 5944 /lib64/libz.so.1.2.3 tclsh8.6 32450 jsowden mem REG 253,0 599384 5964 /lib64/libm-2.12.so tclsh8.6 32450 jsowden mem REG 253,3 347044 169 /var/db/nscd/passwd tclsh8.6 32450 jsowden mem REG 253,0 99158576 27784 /usr/lib/locale/locale-archive tclsh8.6 32450 jsowden mem REG 253,2 1797180 40648124 /space/azazel/tcl8.6.1/lib/libtcl8.6.so tclsh8.6 32450 jsowden 0u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.6 32450 jsowden 1u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.6 32450 jsowden 2u CHR 136,602 0t0 605 /dev/pts/602 tclsh8.6 32450 jsowden 3r FIFO 0,8 0t0 1302920286 pipe tclsh8.6 32450 jsowden 4w FIFO 0,8 0t0 1302920286 pipe tclsh8.6 32450 jsowden 5u IPv4 1302920300 0t0 TCP *:2300 (LISTEN) tclsh8.6 32450 jsowden 6u IPv6 1302920301 0t0 TCP *:2300 (LISTEN) tclsh8.6 32450 jsowden 7u sock 0,6 0t0 1302960490 can't identify protocol tclsh8.6 32450 jsowden 8u sock 0,6 0t0 1302991225 can't identify protocol tclsh8.6 32450 jsowden 9u sock 0,6 0t0 1302999582 can't identify protocol tclsh8.6 32450 jsowden 10u sock 0,6 0t0 1302996445 can't identify protocol tclsh8.6 32450 jsowden 11u sock 0,6 0t0 1303021832 can't identify protocol tclsh8.6 32450 jsowden 12u sock 0,6 0t0 1303026730 can't identify protocol tclsh8.6 32450 jsowden 13u sock 0,6 0t0 1303037627 can't identify protocol tclsh8.6 32450 jsowden 14u sock 0,6 0t0 1303034803 can't identify protocol tclsh8.6 32450 jsowden 15u sock 0,6 0t0 1303037144 can't identify protocol tclsh8.6 32450 jsowden 16u sock 0,6 0t0 1303055527 can't identify protocol tclsh8.6 32450 jsowden 17u sock 0,6 0t0 1303040280 can't identify protocol tclsh8.6 32450 jsowden 18u sock 0,6 0t0 1303051215 can't identify protocol tclsh8.6 32450 jsowden 19u sock 0,6 0t0 1303061492 can't identify protocol tclsh8.6 32450 jsowden 20u sock 0,6 0t0 1303058139 can't identify protocol tclsh8.6 32450 jsowden 21u sock 0,6 0t0 1303069747 can't identify protocol tclsh8.6 32450 jsowden 22u sock 0,6 0t0 1303097499 can't identify protocol tclsh8.6 32450 jsowden 23u sock 0,6 0t0 1303082333 can't identify protocol tclsh8.6 32450 jsowden 24u sock 0,6 0t0 1303084992 can't identify protocol tclsh8.6 32450 jsowden 25u sock 0,6 0t0 1303087033 can't identify protocol tclsh8.6 32450 jsowden 26u sock 0,6 0t0 1303088580 can't identify protocol tclsh8.6 32450 jsowden 27u sock 0,6 0t0 1303094229 can't identify protocol tclsh8.6 32450 jsowden 28u sock 0,6 0t0 1303099862 can't identify protocol tclsh8.6 32450 jsowden 29u sock 0,6 0t0 1303105973 can't identify protocol tclsh8.6 32450 jsowden 30u sock 0,6 0t0 1303140515 can't identify protocol tclsh8.6 32450 jsowden 31u sock 0,6 0t0 1303113400 can't identify protocol tclsh8.6 32450 jsowden 32u sock 0,6 0t0 1303116737 can't identify protocol tclsh8.6 32450 jsowden 33u sock 0,6 0t0 1303152668 can't identify protocol tclsh8.6 32450 jsowden 34u sock 0,6 0t0 1303227072 can't identify protocol tclsh8.6 32450 jsowden 35u sock 0,6 0t0 1303166233 can't identify protocol tclsh8.6 32450 jsowden 36u sock 0,6 0t0 1303167227 can't identify protocol tclsh8.6 32450 jsowden 37u sock 0,6 0t0 1303190255 can't identify protocol tclsh8.6 32450 jsowden 38u sock 0,6 0t0 1303240333 can't identify protocol tclsh8.6 32450 jsowden 39u sock 0,6 0t0 1303247365 can't identify protocol tclsh8.6 32450 jsowden 40u sock 0,6 0t0 1303247820 can't identify protocol tclsh8.6 32450 jsowden 41u sock 0,6 0t0 1303277650 can't identify protocol tclsh8.6 32450 jsowden 42u sock 0,6 0t0 1303250926 can't identify protocol tclsh8.6 32450 jsowden 43u sock 0,6 0t0 1303254036 can't identify protocol tclsh8.6 32450 jsowden 44u sock 0,6 0t0 1303268472 can't identify protocol tclsh8.6 32450 jsowden 45u sock 0,6 0t0 1303275087 can't identify protocol tclsh8.6 32450 jsowden 46u sock 0,6 0t0 1303294151 can't identify protocol tclsh8.6 32450 jsowden 47u sock 0,6 0t0 1303281716 can't identify protocol tclsh8.6 32450 jsowden 48u sock 0,6 0t0 1303327290 can't identify protocol tclsh8.6 32450 jsowden 49u sock 0,6 0t0 1303307061 can't identify protocol tclsh8.6 32450 jsowden 50u sock 0,6 0t0 1303320536 can't identify protocol tclsh8.6 32450 jsowden 51u sock 0,6 0t0 1303331253 can't identify protocol What appears to be happening is that the client disconnects while the server is doing asynchronous flushes, which causes the channel's output proc to return -1 (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2481ff.): /* * Produce the output on the channel. */ toWrite = BytesLeft(bufPtr); if (toWrite == 0) { written = 0; } else { written = (chanPtr->typePtr->outputProc)(chanPtr->instanceData, RemovePoint(bufPtr), toWrite, &errorCode); } /* * If the write failed completely attempt to start the asynchronous * flush mechanism and break out of this loop - do not attempt to * write any more output at this time. */ if (written < 0) { and all pending output to be discarded (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2584ff.): /* * When we get an error we throw away all the output currently * queued. */ DiscardOutputQueued(statePtr); continue; } else { wroteSome = 1; } Next time round the loop, the buffer pointer is NULL and the loop is exited (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2469ff.): /* * If the output queue is still empty, break out of the while loop. */ if (bufPtr == NULL) { break; /* Out of the "while (1)". */ } Since data have already been written during this call, the wroteSome is true, and BG_FLUSH_SCHEDULED flag is not reset (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2610ff.): /* * If we wrote some data while flushing in the background, we are done. * We can't finish the background flush until we run out of data and the * channel becomes writable again. This ensures that all of the pending * data has been flushed at the system level. */ if (statePtr->flags & BG_FLUSH_SCHEDULED) { if (wroteSome) { return errorCode; } else if (statePtr->outQueueHead == NULL) { ResetFlag(statePtr, BG_FLUSH_SCHEDULED); (chanPtr->typePtr->watchProc)(chanPtr->instanceData, statePtr->interestMask); } } Because BG_FLUSH_SCHEDULED is still set, when the channel is closed, Tcl_UnregisterChannel will not call Tcl_Close (tcl8.5.15/generic/tclIO.c, Tcl_UnregisterChannel, ll. 1090ff.): /* * If the refCount reached zero, close the actual channel. */ if (statePtr->refCount <= 0) { /* * Ensure that if there is another buffer, it gets flushed whether or * not we are doing a background flush. */ if ((statePtr->curOutPtr != NULL) && IsBufferReady(statePtr->curOutPtr)) { SetFlag(statePtr, BUFFER_READY); } Tcl_Preserve((ClientData)statePtr); if (!(statePtr->flags & BG_FLUSH_SCHEDULED)) { /* * We don't want to re-enter Tcl_Close(). */ if (!(statePtr->flags & CHANNEL_CLOSED)) { if (Tcl_Close(interp, chan) != TCL_OK) { SetFlag(statePtr, CHANNEL_CLOSED); Tcl_Release((ClientData)statePtr); return TCL_ERROR; } } } SetFlag(statePtr, CHANNEL_CLOSED); Tcl_Release((ClientData)statePtr); } However, because of the non-standard behaviour of select(2) described above, the socket-descriptor will never be returned by select(2) as writable. Consequently, Tcl_NotifyChannel will never call FlushChannel again, and so FlushChannel doesn't close it either (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2627ff.): /* * If the channel is flagged as closed, delete it when the refCount drops * to zero, the output queue is empty and there is no output in the * current output buffer. */ if ((statePtr->flags & CHANNEL_CLOSED) && (statePtr->refCount <= 0) && (statePtr->outQueueHead == NULL) && ((statePtr->curOutPtr == NULL) || IsBufferEmpty(statePtr->curOutPtr))) { return CloseChannel(interp, chanPtr, errorCode); } Changing FlushChannel so as always to clear BG_FLUSH_SCHEDULED if the output buffer is empty appears to fix the bug (tcl8.5.15/generic/tclIO.c, FlushChannel, ll. 2617ff.): if (statePtr->flags & BG_FLUSH_SCHEDULED) { if (statePtr->outQueueHead == NULL) { ResetFlag(statePtr, BG_FLUSH_SCHEDULED); (chanPtr->typePtr->watchProc)(chanPtr->instanceData, statePtr->interestMask); } if (wroteSome) { return errorCode; } } | |||
User Comments: |
dgp added on 2014-06-16 13:34:54:
Fixed in active branches. dgp added on 2014-06-16 12:44:34: New test socket-2.13 covers this. dgp added on 2014-06-11 17:25:41: see proposed fix on bug-1758a0b603 branch. dgp added on 2014-05-20 18:32:12: What would help me move forward with this is a test or tests to add to the test suite that demonstrate the problem and when the patch is applied demonstrate the the patch corrects the problem. Partnership with someone on a system where the problem manifests would also be useful. a3a3el added on 2014-03-24 10:17:47: On giving this more thought, I realized that it is also possible that the server might close the channel with buffered output (attached as server2.tcl), the client close its socket before the output is completely flushed and we still end up with leaked descriptors even with my suggested patch. Moving the check to see whether the channel should be closed before the check of the BG_FLUSH_SCHEDULED flag appears to solve that too (attached as flush_channel2.patch). a3a3el added on 2014-03-21 15:15:04: Just noticed the "Attach" button, so the source-code and patch are now with the ticket. |
Attachments:
- flush_channel2.patch [download] added by a3a3el on 2014-03-24 10:20:01. [details]
- server2.tcl [download] added by a3a3el on 2014-03-24 10:18:58. [details]
- flush_channel.patch [download] added by a3a3el on 2014-03-21 15:14:30. [details]
- select_test.c [download] added by a3a3el on 2014-03-21 15:11:46. [details]
- server.tcl [download] added by a3a3el on 2014-03-21 15:11:39. [details]
- client.tcl [download] added by a3a3el on 2014-03-21 15:11:30. [details]