Ticket UUID: | 3380073 | |||
Title: | declining [expr] performance | |||
Type: | Bug | Version: | obsolete: 8.5.10 | |
Submitter: | dgp | Created on: | 2011-07-27 16:32:36 | |
Subsystem: | 47. Bytecode Compiler | Assigned To: | dgp | |
Priority: | 5 Medium | Severity: | ||
Status: | Open | Last Modified: | 2011-07-29 22:24:52 | |
Resolution: | None | Closed By: | ||
Closed on: | ||||
Description: |
A fairly typical calculation proc... set ::RADIAN_CONV 57.2957795131 proc distance { lat1 lon1 lat2 lon2 } { set lat1 [expr {$lat1 / $::RADIAN_CONV}] set lat2 [expr {$lat2 / $::RADIAN_CONV}] set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}] return [expr {acos(sin($lat1) * sin($lat2) + \ cos($lat1) * cos($lat2) * cos($dlon)) * \ 6371.0}] } Measured performance trend: % time {distance 41.0 -81.0 42.0 -82.0} 100000 ;# 8.4 1.98998 microseconds per iteration % time {distance 41.0 -81.0 42.0 -82.0} 100000 ;# 8.5 2.17931 microseconds per iteration % time {distance 41.0 -81.0 42.0 -82.0} 100000 ;# 8.6 3.03864 microseconds per iteration | |||
User Comments: |
andreas_kupries added on 2011-07-29 22:24:52:
@cyan, all your benchmark files are now properly attached to this report. andreas_kupries added on 2011-07-29 22:24:22: File Added - 419712: run_benchmark.tcl andreas_kupries added on 2011-07-29 22:24:08: File Added - 419711: benchmark_unrolled.tcl andreas_kupries added on 2011-07-29 22:23:52: File Added - 419710: benchmark_no_f.tcl cyan added on 2011-07-29 22:13:53: Another interesting result: I changed my benchmark script to build the looping machinery and tested command as a string, run it as a lambda, and allow loop unrolling: http://65.111.161.125/benchmark_unrolled.tcl Benchmarked code here: http://65.111.161.125/run_benchmark.tcl Now 8.6 is consistently faster for me (~5-20%), for low values of unroll. When unroll reaches around 5 8.5 starts to pull ahead (about 5%) tcl8.5.9, unroll 1: 3.693 usec per iteration tcl8.6b1.2, unroll 1: 3.505 usec per iteration tcl8.5.9, unroll 2: 3.724 usec per iteration tcl8.6b1.2, unroll 2: 3.470 usec per iteration tcl8.5.9, unroll 5: 3.359 usec per iteration tcl8.6b1.2, unroll 5: 3.444 usec per iteration The code is being run like so: set run " set rtimes {} set start \[clock microseconds\] while {\[incr i\] <= $it_a} { set j 0 lappend rtimes \[clock microseconds\] while {\[incr j\] <= [/ $it_b $unroll]} { [string repeat $args\n\t\t\t\t\t $unroll] } lappend rtimes \[clock microseconds\] } set runtime \[- \[clock microseconds\] \$start\] list \$rtimes \$runtime " lassign [apply [list {} $run [namespace current]]] rtimes runtime msofer added on 2011-07-29 19:33:01: File Added - 419691: mtime.tcl msofer added on 2011-07-29 19:32:22: Interesting: this benchmarker loops calls from within TEBC, invoked as {*}$args IOW, it does sidestep my main beef with [time] (looping over Tcl_EvalObjEx calls) and is a better simulator of "real script behaviour". An even better (?) approach would compile a proc or lambda that contains a bc'ed loop over the script, and run that. The script mtime.tcl (attached) tries to separate that effect (proc mtime), and also to isolate the effect of the increased command call overhead in 8.6. On my machine I get mig@ari:~$ DEVEL/tcl-core/core-8-5/unix/tclsh /tmp/mtime.tcl tcl8.5.9: [ time $cmd $n]: 1.93854 microseconds per iteration [mtime $cmd $n]: 1.81043 microseconds per iteration [mtime $scr $n]: 1.64805 microseconds per iteration mig@ari:~$ DEVEL/tcl-core/trunk/unix/tclsh /tmp/mtime.tcl tcl8.6b1.2: [ time $cmd $n]: 2.21536 microseconds per iteration [mtime $cmd $n]: 1.98749 microseconds per iteration [mtime $scr $n]: 1.69049 microseconds per iteration Note that the first run is a plain [time], the second removes the Tcl_EvalObjEx overhead, the third removes the command call overhead. The script is very much ad-hoc, it is not so trivial to generalise either [mtime] nor [toBody] (it required removing [return] from [distance], for example). In the example then: most of the slowdown seems to be due to measurement problems ([time] is 8.6 unfriendly) and the increased command call overhead in 8.6 (itself mainly due to NRE). cyan added on 2011-07-29 15:01:33: I can't seem to attach a file on the SF tracker, so I've dumped it on an internet-visible machine at http://65.111.161.125/benchmark_no_f.tcl It's nasty throw-away code I put together to confirm a suspicion I had that processor frequency scaling was affecting performance (it was) hobbs added on 2011-07-29 02:53:29: @cyan - can you describe how benchmark::test is written (or even attach code as file here)? It's interesting that it consistently is faster than 'time' in 8.5, but 2/3 the same in 8.6. Which one is "truer"? cyan added on 2011-07-29 01:21:34: Some interesting results: testing on my machine 8.5 and 8.6 were quite close (closer than the original bug report), but also the times reported by the "time" command disagreed quite a lot with a benchmarker I built to investigate performance in some of my code (and showed 8.6 to be faster than 8.5). Trying a few different versions of the distance proc on 8.5 and 8.6 (8.4 doesn't work on the benchmarker) (apologies for the formatting, I used vimdiff to compare the runs): namespace path {::tcl::mathop} set ::RADIAN_CONV 57.2957795131 proc distance { lat1 lon1 lat2 lon2 } { set lat1 [expr {$lat1 / $::RADIAN_CONV}] set lat2 [expr {$lat2 / $::RADIAN_CONV}] set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}] return [expr {acos(sin($lat1) * sin($lat2) + \ cos($lat1) * cos($lat2) * cos($dlon)) * \ 6371.0}] } proc distance2 {lat1 lon1 lat2 lon2} { global RADIAN_CONV set lat1 [/ $lat1 $RADIAN_CONV] set lat2 [/ $lat2 $RADIAN_CONV] set dlon [expr {($lon2 - $lon1) / $RADIAN_CONV}] expr {acos(sin($lat1) * sin($lat2) + \ cos($lat1) * cos($lat2) * cos($dlon)) * \ 6371.0} } proc distance3 {lat1 lon1 lat2 lon2} { set lat1[/ $lat1 57.2957795131] set lat2[/ $lat2 57.2957795131] expr { acos(sin($lat1) * sin($lat2) + cos($lat1) * cos($lat2) * cos(($lon2 - $lon1) / 57.2957795131) ) * 6371.0 } } puts "Tcl version: [info patchlevel]" # Initial calls to ensure proc bodies are bytecode compiled, and verify correctness puts "distance: [distance 41.0 -81.0 42.0 -82.0]" puts "distance2: [distance2 41.0 -81.0 42.0 -82.0]" puts "distance3: [distance3 41.0 -81.0 42.0 -82.0]" puts "\n=========================================================" puts "distance (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]" puts "distance (using benchmark):" benchmark::test 1000 1000 usec distance 41.0 -81.0 42.0 -82.0 puts "\n=========================================================" puts "distance2 (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]" puts "distance2 (using benchmark):" benchmark::test 1000 1000 usec distance2 41.0 -81.0 42.0 -82.0 puts "\n=========================================================" puts "distance3 (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]" puts "distance3 (using benchmark):" benchmark::test 1000 1000 usec distance3 41.0 -81.0 42.0 -82.0 Tcl version: 8.5.9 distance: 138.92161836383994 distance2: 138.92161836383994 distance3: 138.92161836383994 ========================================================= distance (using time): 4.02733 microseconds per iteration distance (using benchmark): == testing proc (distance 41.0 -81.0 42.0 -82.0) ===== munge times: 3.432 ms llength times: 1000 descriptive parameters: min: 3.343 usec median: 3.457 usec harmonic_mean: 3.468 usec arithmetic_mean: 3.469 usec max: 4.102 usec sample_stddev: 0.069 usec 3.457 usec per iteration / 288088.464 iterations per second (1000000 iterations in 3.471 seconds) <= 3.344370 1 <= 3.370397 11 ## <= 3.384095 22 #### <= 3.397793 39 ######## <= 3.411491 86 ################# <= 3.425190 94 ################## <= 3.438888 98 ################### <= 3.452586 116 ####################### <= 3.466284 103 #################### <= 3.479983 94 ################## <= 3.493681 79 ############### 257 ################################################## ========================================================= distance2 (using time): 4.05635 microseconds per iteration distance2 (using benchmark): == testing proc (distance2 41.0 -81.0 42.0 -82.0) ===== munge times: 3.415 ms llength times: 1000 descriptive parameters: min: 2.793 usec median: 2.847 usec harmonic_mean: 2.851 usec arithmetic_mean: 2.851 usec max: 3.263 usec sample_stddev: 0.033 usec 2.847 usec per iteration / 350468.313 iterations per second (1000000 iterations in 2.853 seconds) <= 2.793661 1 <= 2.806227 15 ### <= 2.812840 40 ####### <= 2.819454 81 ############### <= 2.826067 79 ############### <= 2.832681 78 ############## <= 2.839294 95 ################# <= 2.845908 83 ############### <= 2.852521 106 ################### <= 2.859135 85 ################ <= 2.865748 65 ############ 272 ################################################## ========================================================= distance3 (using time): 4.123645 microseconds per iteration distance3 (using benchmark): == testing proc (distance3 41.0 -81.0 42.0 -82.0) ===== munge times: 3.505 ms llength times: 1000 descriptive parameters: min: 2.570 usec median: 2.620 usec harmonic_mean: 2.623 usec arithmetic_mean: 2.623 usec max: 2.894 usec sample_stddev: 0.028 usec 2.620 usec per iteration / 380933.661 iterations per second (1000000 iterations in 2.625 seconds) <= 2.570557 1 <= 2.581149 18 ### <= 2.586724 27 #### <= 2.592299 38 ###### <= 2.597873 63 ######### <= 2.603448 108 ################ <= 2.609023 93 ############## <= 2.614597 75 ########### <= 2.620172 90 ############## <= 2.625747 80 ############ <= 2.631321 74 ########### 333 ################################################## Tcl version: 8.6b1.2 distance: 138.92161836383994 distance2: 138.92161836383994 distance3: 138.92161836383994 ========================================================= distance (using time): 3.869864 microseconds per iteration distance (using benchmark): == testing proc (distance 41.0 -81.0 42.0 -82.0) ===== munge times: 2.652 ms llength times: 1000 descriptive parameters: min: 3.679 usec median: 3.812 usec harmonic_mean: 3.816 usec arithmetic_mean: 3.817 usec max: 4.391 usec sample_stddev: 0.061 usec 3.812 usec per iteration / 261862.502 iterations per second (1000000 iterations in 3.819 seconds) <= 3.680225 1 <= 3.703496 8 # <= 3.715744 3 <= 3.727992 17 ## <= 3.740240 33 ### <= 3.752488 43 #### <= 3.764736 55 ###### <= 3.776984 65 ####### <= 3.789232 92 ########## <= 3.801480 86 ######### <= 3.813728 117 ############ 480 ################################################## ========================================================= distance2 (using time): 3.821812 microseconds per iteration distance2 (using benchmark): == testing proc (distance2 41.0 -81.0 42.0 -82.0) ===== munge times: 2.623 ms llength times: 1000 descriptive parameters: min: 3.008 usec median: 3.114 usec harmonic_mean: 3.116 usec arithmetic_mean: 3.117 usec max: 3.862 usec sample_stddev: 0.048 usec 3.114 usec per iteration / 320614.862 iterations per second (1000000 iterations in 3.119 seconds) <= 3.008962 1 <= 3.027237 4 <= 3.036856 6 # <= 3.046474 8 # <= 3.056093 27 ### <= 3.065712 32 ### <= 3.075330 61 ###### <= 3.084949 57 ###### <= 3.094567 109 ########### <= 3.104186 91 ######### <= 3.113804 88 ######### 516 ################################################## ========================================================= distance3 (using time): 3.894228 microseconds per iteration distance3 (using benchmark): == testing proc (distance3 41.0 -81.0 42.0 -82.0) ===== munge times: 2.598 ms llength times: 1000 descriptive parameters: min: 2.654 usec median: 2.741 usec harmonic_mean: 2.746 usec arithmetic_mean: 2.747 usec max: 3.174 usec sample_stddev: 0.043 usec 2.741 usec per iteration / 363791.405 iterations per second (1000000 iterations in 2.749 seconds) <= 2.654868 1 <= 2.671353 3 <= 2.680029 6 # <= 2.688706 25 ### <= 2.697382 44 ##### <= 2.706058 69 ######## <= 2.714735 59 ####### <= 2.723411 81 ########## <= 2.732087 108 ############# <= 2.740764 95 ########### <= 2.749440 83 ########## 426 ################################################## dkf added on 2011-07-28 16:33:11: On my machine (older macbook pro), I get extremely similar performance for 8.5(.2; from AS) and 8.6 (trunk tip, configured for release build) but both are significantly slower than 8.4 (the system installation, 8.4.7). This is particularly the case once using [global] to bring in the constant (a recommended practice). I guess there may be a problem with function calls? 8.5 switched from specialized bytecodes to using general command dispatch, which is lovely and flexible but means that it makes a hot spot (that was hit in 8.6's NRE work) even hotter. ferrieux added on 2011-07-28 02:22:07: Yet another data point: the overhead of going into and out of the proc seems to play a role: 8.5 flat : 5.24437 microseconds per iteration 8.5 proc: 5.12985 microseconds per iteration 8.6 flat : 6.1804 microseconds per iteration 8.6 proc : 6.81622 microseconds per iteration Done with: proc flatloop { lat1 lon1 lat2 lon2 } { time { set lat1 [expr {$lat1 / $::RADIAN_CONV}] set lat2 [expr {$lat2 / $::RADIAN_CONV}] set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}] expr {acos(sin($lat1) * sin($lat2) + \ cos($lat1) * cos($lat2) * cos($dlon)) * \ 6371.0} } 100000 } puts [flatloop 41.0 -81.0 42.0 -82.0] puts [time {distance 41.0 -81.0 42.0 -82.0} 100000] hobbs added on 2011-07-28 02:21:43: Both TIPs 174 and 232 have an impact in 8.5. I believe the NRE changes are part of the further 8.6 degradation. You will find a thread on tcl-core about the suspected 8.4->8.5->8.6 issues, but you have to compare the latest, as we made some fixes in 8.5 and 8.6 (notably TIP378) that changed some values. You can also see a further analysis of normalized tclbench data between 8.4, 8.5.9 and 8.5.9-new with some of these fixes in place at: https://spreadsheets.google.com/spreadsheet/ccc?key=0AiqUOwK_jbW-dEtlcFo0b09TUTVsRzFIaWU5bWFXSFE&hl=en_US msofer added on 2011-07-28 01:49:26: Agreed with my own observations: it is a case of "barely measurable" effect of [times] overhead. ferrieux added on 2011-07-28 01:38:36: Also, the :: vs [global] slowdown mentioned in 3302370 also hits orthogonally (roughly 20% in both cases). ferrieux added on 2011-07-28 01:31:50: OK, a bit of hard facts proving that this indeed orthogonal to the [times] overhead: Using proc time2 {s n} { proc _tmp {} [format { set n %d while {[incr n -1]>=0} {%s} } $n $s] return [time _tmp] } puts [time {distance 41.0 -81.0 42.0 -82.0} 100000] puts [time2 {distance 41.0 -81.0 42.0 -82.0} 100000] I get the following pairs: ===== 8.5: 4.97726 microseconds per iteration 491821 microseconds per iteration ===== 8.6: 6.82854 microseconds per iteration 663014 microseconds per iteration So, though the T_EOE effect is indeed a bit stronger on 8.6, the bulk of the slowdown is intrinsic. Second data point: disassembling shows exactly the same bytecodes. msofer added on 2011-07-28 00:46:45: Nothing so odd about it: 1. [time] iterates over Tcl_EvalObjEx 2. Tcl_EvalObjEx IS much slower on 8.6 3. normal Tcl scripts do not iterate over Tcl_EvalObjEx So many benchmarks overemphasize 8.6's slowdowns, some by huge margins. If this is one of those or not, I do not know. My guess is that it will be between 'measurably' and 'quite a bit' ferrieux added on 2011-07-28 00:38:30: Also, I'd check whether the literals shimmer at each turn of the [times] loop. I remember a chat session with miguel, where we discovered that odd things happened to multi-iteration-[times], and didn't with a regular [for] loop in a proc. andreas_kupries added on 2011-07-27 23:42:04: Chat excerpts... aku: Might be interesting to see how the bytecode of 8.4/5/6 differ for the expressions. dgp: Some bisecting to isolate things might be interesting, if the slowdown happened in discrete steps. |
Attachments:
- run_benchmark.tcl [download] added by andreas_kupries on 2011-07-29 22:24:22. [details]
- benchmark_unrolled.tcl [download] added by andreas_kupries on 2011-07-29 22:24:08. [details]
- benchmark_no_f.tcl [download] added by andreas_kupries on 2011-07-29 22:23:52. [details]
- mtime.tcl [download] added by msofer on 2011-07-29 19:33:01. [details]