Ticket UUID: 1a6a36d901dedead248ea96df09d5dc159532652
Title: Linux: epoll() cpu usage is high (in conjunction with after command)
Type: Bug
Version: 8.7a1
Submitter: bll
Created on: 2017-09-08 16:01:44
Subsystem: 01. Notifier
Assigned To: jan.nijtmans
Priority: 5 Medium
Severity: Important
Status: Closed
Last Modified: 2017-10-27 07:29:23
Resolution: Fixed
Closed By: jan.nijtmans
Closed on: 2017-10-27 07:29:23
Description:
Apparently epoll() is not working correctly.  The CPU usage is much too high.
On Linux, Tcl/Tk 8.7a1:

bll-tecra:bll$ ps -f -t pts/1 -o pcpu,args
%CPU COMMAND
 0.0 bash
96.5  \_ /home/bll/local87/bin/tclsh8.7 ../ballroomdj.tcl
94.5      \_ /home/bll/local87/bin/tclsh8.7 /home/bll/sources/ballroomdj/bin/son
99.8      \_ /home/bll/local87/bin/tclsh8.7 /home/bll/sources/ballroomdj/bin/pla

Tcl/Tk 8.6.7 (tclkit):

bll-tecra:bll$ ps -f -t pts/1 -o pcpu,args
%CPU COMMAND
 0.0 bash
 3.4  \_ /home/bll/sources/ballroomdj/linux/64/tclsh ../ballroomdj.tcl
 0.5      \_ /home/bll/sources/ballroomdj/linux/64/tclsh /home/bll/sources/ballr
 3.5      \_ /home/bll/sources/ballroomdj/linux/64/tclsh /home/bll/sources/ballr

Tcl/Tk 8.6.7 (built the same as Tcl/Tk 8.7):

bll-tecra:bll$ ps -f -t pts/1 -o pcpu,args
%CPU COMMAND
 0.0 bash
 5.4  \_ /home/bll/local/bin/tclsh8.6 ../ballroomdj.tcl
 0.9      \_ /home/bll/local/bin/tclsh8.6 /home/bll/sources/ballroomdj/bin/songs
 3.4      \_ /home/bll/local/bin/tclsh8.6 /home/bll/sources/ballroomdj/bin/playe
User Comments:

jan.nijtmans added on 2017-10-27 07:29:23:
Proposed fix committed to trunk. Thanks!

If problems remain with the epoll notifier, please re-open this ticket or open a new one (depends on how related it is to this one). Hopefully everything is OK now.

bll added on 2017-10-26 17:59:08:
A fix has been pushed:

https://github.com/lalbornoz/tcl/commit/7249454d27938a9834bb4068d2994fdd98c24605

With the patch:
bll-tecra:bll$ time $HOME/local87/bin/tclsh8.7 t.tcl
^C

real    0m7.413s
user    0m0.029s
sys     0m0.005s

dbus test:
bll-tecra:bll$ ps -f -t pts/1 -o pcpu,args
%CPU COMMAND
 0.0 bash
 0.0  \_ rlwrap /home/bll/local87/bin/tclsh8.7

I don't have a specific test for the disconnected socket, but I tested with an older version of my application and it seems fine.

anonymous (claiming to be Lucio Andrés Illanes Albornoz) added on 2017-10-26 17:26:38:
I've pushed a fix concerning epoll_wait() timeout calculation to my GitHub:
<https://github.com/lalbornoz/tcl/commit/7249454d27938a9834bb4068d2994fdd98c24605>
Can you confirm that this reduces CPU usage down to expected values with both the busyWait test and the disconnected-sockets test?

bll added on 2017-09-14 17:32:07:
Oh, I know -- it was very manual with a control-c.
Not accurate at all.
But it is very obvious.
0.12% vs. 99.98% cpu usage.

ferrieux added on 2017-09-14 17:25:05:
OK, just saw you aligned for the same real time. Sorry for the noise. Will dig later.

ferrieux added on 2017-09-14 17:23:46:
I don't understand: this test program should live forever, so the measurements, if any, should be in steady-state CPU consumption (percent), not in total times at end of execution.

bll added on 2017-09-14 04:01:53:
The following test program (pure Tcl):

proc busyWait { } {
  after 200 ::busyWait
}

::busyWait

vwait ::forever

With Tcl 8.6.7 it uses very little CPU:
real    0m8.134s
user    0m0.006s
sys     0m0.004s

With Tcl 8.7 it uses a lot of CPU:
real    0m8.351s
user    0m6.679s
sys     0m1.670s

(Updated the title of the ticket).

bll added on 2017-09-10 18:36:09:
So Problem C is somehow caused by a clock display.

I have not tested this as a stand-alone procedure!!!

The procedure:

  proc doClock { } {
    variable vars

    # this is not quite localized...
    if { [$::bdjopt get MQSHOWCLOCK] eq {12 Hour} } {
      set vars(clock.disp) [clock format [clock seconds] \
          -format "%A %B %Oe  %r" -locale system]
    } else {
      set vars(clock.disp) [clock format [clock seconds] \
          -format "%A %B %Oe  %R" -locale system]
    }
    after 500 ::bdj::doClock
  }

And it is started with an initial:

  doClock

Commenting out the [clock] commands has no effect.
Commenting out the after command returns the program to normal CPU usage.
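
A stand-alone version, with the application bits stubbed out, would be roughly as follows (untested sketch; it keeps only the clock formatting and the after reschedule):

proc doClock { } {
  # stand-alone sketch: no option lookup, just format the clock
  # and reschedule, the same way the application proc does
  set disp [clock format [clock seconds] -format "%A %B %Oe  %R"]
  after 500 ::doClock
}

doClock
vwait ::forever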

bll added on 2017-09-10 16:00:39:
Confirmed aspect's test.  I do not know why BallroomDJ does not have the same problem under 8.6.7 when it does not unregister the handlers.

I have put in fixes into BallroomDJ to make sure the handlers are unregistered, but there's yet another issue somewhere (call this problem C).  The program gets into another tight loop, and as best I can tell, file descriptor 5 is the X11 named socket.

5656  08:27:53.661551 epoll_wait(4, {}, 512, 0) = 0
5656  08:27:53.661576 recvmsg(5, 0x7ffc0890dba0, 0) = -1 EAGAIN (Resource temporarily unavailable)
5656  08:27:53.661602 recvmsg(5, 0x7ffc0890dba0, 0) = -1 EAGAIN (Resource temporarily unavailable)
...

This is only in a particular situation where another process is started (and various new socket connections are made, but no disconnects).  And dbus is not loaded.  

Trying to find some sort of clue on what's happening...

aspect added on 2017-09-10 15:01:27:
To be clear: testing with the script below on Linux, making a single connection
to port 38000 and closing it (nc localhost 38000; press ^C), I see similar CPU load on 8.6.7 and earlier versus trunk with epoll. Different syscalls surround recvfrom(), but a CPU is pegged in each case.

I wonder if the difference you're seeing with ballroomdj is caused by the dbus issue (Problem B) which might be something in that extension not working well on 8.7.  Schelte might be better able to investigate if that is the case.

aspect added on 2017-09-10 13:56:52:
> regression from how 8.6.7 acts.

Not so.  Readable events have worked this way for a long time.

bll added on 2017-09-10 12:20:30:
Problem A:

Correct.  I'm afraid there's going to be a lot of code out there that doesn't unregister the handler.  This is a regression from how 8.6.7 acts.

Problem B:

load [pwd]/../linux/64/libdbus2.1.so
puts [dbus connect]
vwait forever

See attached.

ferrieux added on 2017-09-10 08:23:26:
Dunno if it's just for illustration/trial, but the commented line prevents your fileevent handler from unregistering itself, so it fires continuously after EOF (problem A).

For problem B, assuming there's no disconnection as you're saying (to be verified with tcpdump or netstat), please attach the full strace from process startup, with options -tt (timestamps) and -f (follow children / threads).

aspect added on 2017-09-10 08:20:26:
The first example isn't clear: the handler doesn't close the socket
when EOF occurs, so the readable event keeps firing, as expected.

Is this what you're seeing, or something else?

bll added on 2017-09-09 18:35:39:
Problem B has two server connections open and has also initialized the 'dbus' connection from the 'dbus' package.

bll added on 2017-09-09 17:29:50:
Problem B:

epoll_wait(4, {}, 512, 0)               = 0
epoll_wait(4, {}, 512, 0)               = 0
epoll_wait(4, {}, 512, 0)               = 0
epoll_wait(4, {}, 512, 0)               = 0
epoll_wait(4, {}, 512, 0)               = 0
epoll_wait(4, {}, 512, 0)               = 0

This one simply has two sockets open, no disconnections... I'll have to write some code to reproduce it.
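
Roughly, something along these lines is what I have in mind (untested sketch; the port numbers are placeholders and assume servers are already listening on them, and the msgHandler proc is just illustrative):

proc msgHandler { s } {
  # read whatever arrives; close the channel if the peer goes away
  if { [gets $s line] < 0 } {
    if { [eof $s] } { close $s }
    return
  }
  puts $line
}

foreach port {38000 38001} {
  # hold two idle client connections open and sit in the event loop
  set s [socket 127.0.0.1 $port]
  chan configure $s -blocking 0 -buffering line
  chan event $s readable [list ::msgHandler $s]
}

vwait ::forever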

bll added on 2017-09-09 17:16:51:
This appears to be a very basic, essential problem.

It's in a very tight loop:

epoll_wait(4, {}, 512, 0)               = 0
recvmsg(5, 0x7ffca1cf1330, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(5, 0x7ffca1cf1330, 0)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4, {}, 512, 0)               = 0
recvmsg(5, 0x7ffca1cf1330, 0)           = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(5, 0x7ffca1cf1330, 0)           = -1 EAGAIN (Resource temporarily unavailable)

Looks like a disconnected socket isn't being handled the same way as in 8.6.7.

#!/usr/local87/bin/tclsh8.7

proc handler { s } {
  try {
    set rc [gets $s line]
    if { $rc >= 0 } {
      puts $line
    } else {
      #chan event $s readable {}
    }
  } on error {err res} {
    puts $res
  }
}

proc accept { s addr port } {
  try {
    chan event $s readable [list ::handler $s]
  } on error {err res} {
    puts $res
  }
}

try {
  set mainsock [socket -server ::accept -myaddr 127.0.0.1 38000]
} on error {err res} {
  puts $res
}
vwait ::forever
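
For comparison, a handler variant that unregisters itself or closes the channel once EOF is reached stops the readable event from refiring (rough sketch of that variant, same structure as above):

proc handler { s } {
  try {
    set rc [gets $s line]
    if { $rc >= 0 } {
      puts $line
    } elseif { [eof $s] } {
      # either unregister the readable handler:
      #   chan event $s readable {}
      # or simply close the dead channel:
      close $s
    }
  } on error {err res} {
    puts $res
  }
}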

ferrieux added on 2017-09-09 16:48:51:
You might want to attach the Tcl code for anybody to make more than guesses.
Alternatively, you can strace the process to see the syscalls in the loop.

bll added on 2017-09-08 16:18:12:
The second process is pure Tcl, no Tk involved.

It's just sitting waiting for a message to show up on one of two sockets.
I am curious...is there a reason why it is not at 0.0%?

Attachments: