Tcl Source Code

View Ticket
Login
Ticket UUID: 3380073
Title: declining [expr] performance
Type: Bug Version: obsolete: 8.5.10
Submitter: dgp Created on: 2011-07-27 16:32:36
Subsystem: 47. Bytecode Compiler Assigned To: dgp
Priority: 5 Medium Severity:
Status: Open Last Modified: 2011-07-29 22:24:52
Resolution: None Closed By:
    Closed on:
Description:
A fairly typical calculation proc...

set ::RADIAN_CONV  57.2957795131

proc distance { lat1 lon1 lat2 lon2 } {
    set lat1 [expr {$lat1 / $::RADIAN_CONV}]
    set lat2 [expr {$lat2 / $::RADIAN_CONV}]
    set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}]

    return [expr {acos(sin($lat1) * sin($lat2) + \
                           cos($lat1) * cos($lat2) * cos($dlon)) * \
                      6371.0}]
}

Measured performance trend:

% time {distance 41.0 -81.0 42.0 -82.0} 100000    ;# 8.4
1.98998 microseconds per iteration

% time {distance 41.0 -81.0 42.0 -82.0} 100000    ;# 8.5
2.17931 microseconds per iteration

% time {distance 41.0 -81.0 42.0 -82.0} 100000     ;# 8.6
3.03864 microseconds per iteration
User Comments: andreas_kupries added on 2011-07-29 22:24:52:
@cyan, all your benchmark files are now properly attached to this report.

andreas_kupries added on 2011-07-29 22:24:22:

File Added - 419712: run_benchmark.tcl

andreas_kupries added on 2011-07-29 22:24:08:

File Added - 419711: benchmark_unrolled.tcl

andreas_kupries added on 2011-07-29 22:23:52:

File Added - 419710: benchmark_no_f.tcl

cyan added on 2011-07-29 22:13:53:
Another interesting result:  I changed my benchmark script to build the looping machinery and tested command as a string, run it as a lambda, and allow loop unrolling: http://65.111.161.125/benchmark_unrolled.tcl

Benchmarked code here: http://65.111.161.125/run_benchmark.tcl

Now 8.6 is consistently faster for me (~5-20%), for low values of unroll.
When unroll reaches around 5 8.5 starts to pull ahead (about 5%)

tcl8.5.9, unroll 1: 3.693 usec per iteration
tcl8.6b1.2, unroll 1: 3.505 usec per iteration 

tcl8.5.9, unroll 2: 3.724 usec per iteration
tcl8.6b1.2, unroll 2: 3.470 usec per iteration

tcl8.5.9, unroll 5: 3.359 usec per iteration
tcl8.6b1.2, unroll 5: 3.444 usec per iteration

The code is being run like so:
set run "
    set rtimes  {}
    set start   \[clock microseconds\]
    while {\[incr i\] <= $it_a} {
        set j   0
        lappend rtimes  \[clock microseconds\]
        while {\[incr j\] <= [/ $it_b $unroll]} {
            [string repeat $args\n\t\t\t\t\t $unroll]
        }
        lappend rtimes  \[clock microseconds\]
    }
    set runtime     \[- \[clock microseconds\] \$start\]
    list \$rtimes \$runtime
"
lassign [apply [list {} $run [namespace current]]] rtimes runtime

msofer added on 2011-07-29 19:33:01:

File Added - 419691: mtime.tcl

msofer added on 2011-07-29 19:32:22:
Interesting: this benchmarker loops calls from within TEBC, invoked as
   {*}$args
IOW, it does sidestep my main beef with [time] (looping over Tcl_EvalObjEx calls) and is a better simulator of "real script behaviour". An even better (?) approach would compile a proc or lambda that contains a bc'ed loop over the script, and run that.

The script mtime.tcl (attached) tries to separate that effect (proc mtime), and also to isolate the effect of the increased command call overhead in 8.6. On my machine I get

mig@ari:~$ DEVEL/tcl-core/core-8-5/unix/tclsh /tmp/mtime.tcl
tcl8.5.9: 
[ time $cmd $n]: 1.93854 microseconds per iteration
[mtime $cmd $n]: 1.81043 microseconds per iteration
[mtime $scr $n]: 1.64805 microseconds per iteration
mig@ari:~$ DEVEL/tcl-core/trunk/unix/tclsh /tmp/mtime.tcl
tcl8.6b1.2: 
[ time $cmd $n]: 2.21536 microseconds per iteration
[mtime $cmd $n]: 1.98749 microseconds per iteration
[mtime $scr $n]: 1.69049 microseconds per iteration

Note that the first run is a plain [time], the second removes the Tcl_EvalObjEx overhead, the third removes the command call overhead. The script is very much ad-hoc, it is not so trivial to generalise either [mtime] nor [toBody] (it required removing [return] from [distance], for example).

In the example then: most of the slowdown seems to be due to measurement problems ([time] is 8.6 unfriendly) and the increased command call overhead in 8.6 (itself mainly due to NRE).

cyan added on 2011-07-29 15:01:33:
I can't seem to attach a file on the SF tracker, so I've dumped it on an internet-visible machine at http://65.111.161.125/benchmark_no_f.tcl

It's nasty throw-away code I put together to confirm a suspicion I had that processor frequency scaling was affecting performance (it was)

hobbs added on 2011-07-29 02:53:29:
@cyan - can you describe how benchmark::test is written (or even attach code as file here)?  It's interesting that it consistently is faster than 'time' in 8.5, but 2/3 the same in 8.6.  Which one is "truer"?

cyan added on 2011-07-29 01:21:34:
Some interesting results: testing on my machine 8.5 and 8.6 were quite close (closer than the original bug report), but also the times reported by the "time" command disagreed quite a lot with a benchmarker I built to investigate performance in some of my code (and showed 8.6 to be faster than 8.5).

Trying a few different versions of the distance proc on 8.5 and 8.6 (8.4 doesn't work on the benchmarker) (apologies for the formatting, I used vimdiff to compare the runs):

namespace path {::tcl::mathop}

set ::RADIAN_CONV  57.2957795131

proc distance { lat1 lon1 lat2 lon2 } {
set lat1 [expr {$lat1 / $::RADIAN_CONV}]
set lat2 [expr {$lat2 / $::RADIAN_CONV}]
set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}]

return [expr {acos(sin($lat1) * sin($lat2) + \
cos($lat1) * cos($lat2) * cos($dlon)) * \
6371.0}]
}

proc distance2 {lat1 lon1 lat2 lon2} {
global RADIAN_CONV
set lat1 [/ $lat1 $RADIAN_CONV]
set lat2 [/ $lat2 $RADIAN_CONV]
set dlon [expr {($lon2 - $lon1) / $RADIAN_CONV}]

expr {acos(sin($lat1) * sin($lat2) + \
cos($lat1) * cos($lat2) * cos($dlon)) * \
6371.0}
}

proc distance3 {lat1 lon1 lat2 lon2} {
set lat1[/ $lat1 57.2957795131]
set lat2[/ $lat2 57.2957795131]
expr {
acos(sin($lat1) * sin($lat2) + 
     cos($lat1) * cos($lat2) * cos(($lon2 - $lon1) / 57.2957795131)
) * 6371.0
}
}


puts "Tcl version: [info patchlevel]"
# Initial calls to ensure proc bodies are bytecode compiled, and verify correctness
puts "distance:  [distance 41.0 -81.0 42.0 -82.0]"
puts "distance2: [distance2 41.0 -81.0 42.0 -82.0]"
puts "distance3: [distance3 41.0 -81.0 42.0 -82.0]"

puts "\n========================================================="
puts "distance  (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]"
puts "distance  (using benchmark):"
benchmark::test 1000 1000 usec distance 41.0 -81.0 42.0 -82.0

puts "\n========================================================="
puts "distance2 (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]"
puts "distance2 (using benchmark):"
benchmark::test 1000 1000 usec distance2 41.0 -81.0 42.0 -82.0

puts "\n========================================================="
puts "distance3 (using time): [apply {{} {time {distance 41.0 -81.0 42.0 -82.0} 1000000}}]"
puts "distance3 (using benchmark):"
benchmark::test 1000 1000 usec distance3 41.0 -81.0 42.0 -82.0



Tcl version: 8.5.9
distance:  138.92161836383994
distance2: 138.92161836383994
distance3: 138.92161836383994

=========================================================
distance  (using time): 4.02733 microseconds per iteration
distance  (using benchmark):

== testing proc (distance 41.0 -81.0 42.0 -82.0) =====
munge times: 3.432 ms
llength times: 1000
descriptive parameters:
            min:   3.343 usec
         median:   3.457 usec
  harmonic_mean:   3.468 usec
arithmetic_mean:   3.469 usec
            max:   4.102 usec
  sample_stddev:   0.069 usec
3.457 usec per iteration / 288088.464 iterations per second (1000000 iterations in 3.471 seconds)
<=   3.344370      1 
<=   3.370397     11 ##
<=   3.384095     22 ####
<=   3.397793     39 ########
<=   3.411491     86 #################
<=   3.425190     94 ##################
<=   3.438888     98 ###################
<=   3.452586    116 #######################
<=   3.466284    103 ####################
<=   3.479983     94 ##################
<=   3.493681     79 ###############
                 257 ##################################################

=========================================================
distance2 (using time): 4.05635 microseconds per iteration
distance2 (using benchmark):

== testing proc (distance2 41.0 -81.0 42.0 -82.0) =====
munge times: 3.415 ms
llength times: 1000
descriptive parameters:
            min:   2.793 usec
         median:   2.847 usec
  harmonic_mean:   2.851 usec
arithmetic_mean:   2.851 usec
            max:   3.263 usec
  sample_stddev:   0.033 usec
2.847 usec per iteration / 350468.313 iterations per second (1000000 iterations in 2.853 seconds)
<=   2.793661      1 
<=   2.806227     15 ###
<=   2.812840     40 #######
<=   2.819454     81 ###############
<=   2.826067     79 ###############
<=   2.832681     78 ##############
<=   2.839294     95 #################
<=   2.845908     83 ###############
<=   2.852521    106 ###################
<=   2.859135     85 ################
<=   2.865748     65 ############
                 272 ##################################################

=========================================================
distance3 (using time): 4.123645 microseconds per iteration
distance3 (using benchmark):

== testing proc (distance3 41.0 -81.0 42.0 -82.0) =====
munge times: 3.505 ms
llength times: 1000
descriptive parameters:
            min:   2.570 usec
         median:   2.620 usec
  harmonic_mean:   2.623 usec
arithmetic_mean:   2.623 usec
            max:   2.894 usec
  sample_stddev:   0.028 usec
2.620 usec per iteration / 380933.661 iterations per second (1000000 iterations in 2.625 seconds)
<=   2.570557      1 
<=   2.581149     18 ###
<=   2.586724     27 ####
<=   2.592299     38 ######
<=   2.597873     63 #########
<=   2.603448    108 ################
<=   2.609023     93 ##############
<=   2.614597     75 ###########
<=   2.620172     90 ##############
<=   2.625747     80 ############
<=   2.631321     74 ###########
                 333 ##################################################






Tcl version: 8.6b1.2
distance:  138.92161836383994
distance2: 138.92161836383994
distance3: 138.92161836383994

=========================================================
distance  (using time): 3.869864 microseconds per iteration
distance  (using benchmark):

== testing proc (distance 41.0 -81.0 42.0 -82.0) =====
munge times: 2.652 ms
llength times: 1000
descriptive parameters:
            min:   3.679 usec
         median:   3.812 usec
  harmonic_mean:   3.816 usec
arithmetic_mean:   3.817 usec
            max:   4.391 usec
  sample_stddev:   0.061 usec
3.812 usec per iteration / 261862.502 iterations per second (1000000 iterations in 3.819 seconds)
<=   3.680225      1 
<=   3.703496      8 #
<=   3.715744      3 
<=   3.727992     17 ##
<=   3.740240     33 ###
<=   3.752488     43 ####
<=   3.764736     55 ######
<=   3.776984     65 #######
<=   3.789232     92 ##########
<=   3.801480     86 #########
<=   3.813728    117 ############
                 480 ##################################################

=========================================================
distance2 (using time): 3.821812 microseconds per iteration
distance2 (using benchmark):

== testing proc (distance2 41.0 -81.0 42.0 -82.0) =====
munge times: 2.623 ms
llength times: 1000
descriptive parameters:
            min:   3.008 usec
         median:   3.114 usec
  harmonic_mean:   3.116 usec
arithmetic_mean:   3.117 usec
            max:   3.862 usec
  sample_stddev:   0.048 usec
3.114 usec per iteration / 320614.862 iterations per second (1000000 iterations in 3.119 seconds)
<=   3.008962      1 
<=   3.027237      4 
<=   3.036856      6 #
<=   3.046474      8 #
<=   3.056093     27 ###
<=   3.065712     32 ###
<=   3.075330     61 ######
<=   3.084949     57 ######
<=   3.094567    109 ###########
<=   3.104186     91 #########
<=   3.113804     88 #########
                 516 ##################################################

=========================================================
distance3 (using time): 3.894228 microseconds per iteration
distance3 (using benchmark):

== testing proc (distance3 41.0 -81.0 42.0 -82.0) =====
munge times: 2.598 ms
llength times: 1000
descriptive parameters:
            min:   2.654 usec
         median:   2.741 usec
  harmonic_mean:   2.746 usec
arithmetic_mean:   2.747 usec
            max:   3.174 usec
  sample_stddev:   0.043 usec
2.741 usec per iteration / 363791.405 iterations per second (1000000 iterations in 2.749 seconds)
<=   2.654868      1 
<=   2.671353      3 
<=   2.680029      6 #
<=   2.688706     25 ###
<=   2.697382     44 #####
<=   2.706058     69 ########
<=   2.714735     59 #######
<=   2.723411     81 ##########
<=   2.732087    108 #############
<=   2.740764     95 ###########
<=   2.749440     83 ##########
                 426 ##################################################

dkf added on 2011-07-28 16:33:11:
On my machine (older macbook pro), I get extremely similar performance for 8.5(.2; from AS) and 8.6 (trunk tip, configured for release build) but both are significantly slower than 8.4 (the system installation, 8.4.7). This is particularly the case once using [global] to bring in the constant (a recommended practice).

I guess there may be a problem with function calls? 8.5 switched from specialized bytecodes to using general command dispatch, which is lovely and flexible but means that it makes a hot spot (that was hit in 8.6's NRE work) even hotter.

ferrieux added on 2011-07-28 02:22:07:
Yet another data point: the overhead of going into and out of the proc seems to play a role:

8.5 flat : 5.24437 microseconds per iteration
8.5 proc: 5.12985 microseconds per iteration

8.6 flat : 6.1804 microseconds per iteration
8.6 proc : 6.81622 microseconds per iteration

Done with:


proc flatloop { lat1 lon1 lat2 lon2 } {
    time {
   set lat1 [expr {$lat1 / $::RADIAN_CONV}]
   set lat2 [expr {$lat2 / $::RADIAN_CONV}]
   set dlon [expr {($lon2 - $lon1) / $::RADIAN_CONV}]

   expr {acos(sin($lat1) * sin($lat2) + \
                          cos($lat1) * cos($lat2) * cos($dlon)) * \
                     6371.0}
    } 100000
}
puts [flatloop 41.0 -81.0 42.0 -82.0]
puts [time {distance 41.0 -81.0 42.0 -82.0} 100000]

hobbs added on 2011-07-28 02:21:43:
Both TIPs 174 and 232 have an impact in 8.5.  I believe the NRE changes are part of the further 8.6 degradation.  You will find a thread on tcl-core about the suspected 8.4->8.5->8.6 issues, but you have to compare the latest, as we made some fixes in 8.5 and 8.6 (notably TIP378) that changed some values.

You can also see a further analysis of normalized tclbench data between 8.4, 8.5.9 and 8.5.9-new with some of these fixes in place at:
https://spreadsheets.google.com/spreadsheet/ccc?key=0AiqUOwK_jbW-dEtlcFo0b09TUTVsRzFIaWU5bWFXSFE&hl=en_US

msofer added on 2011-07-28 01:49:26:
Agreed with my own observations: it is a case of "barely measurable" effect of [times] overhead.

ferrieux added on 2011-07-28 01:38:36:
Also, the :: vs [global] slowdown mentioned in 3302370 also hits orthogonally (roughly 20% in both cases).

ferrieux added on 2011-07-28 01:31:50:
OK, a bit of hard facts proving that this indeed orthogonal to the [times] overhead:
Using

proc time2 {s n} {
    proc _tmp {} [format {
set n %d
while {[incr n -1]>=0} {%s}
    } $n $s]
    return [time _tmp]
}
puts [time {distance 41.0 -81.0 42.0 -82.0} 100000]
puts [time2 {distance 41.0 -81.0 42.0 -82.0} 100000]

I get the following pairs:
===== 8.5:
4.97726 microseconds per iteration
491821 microseconds per iteration
===== 8.6:
6.82854 microseconds per iteration
663014 microseconds per iteration

So, though the T_EOE effect is indeed a bit stronger on 8.6, the bulk of the slowdown is intrinsic.

Second data point: disassembling shows exactly the same bytecodes.

msofer added on 2011-07-28 00:46:45:
Nothing so odd about it:
  1. [time] iterates over Tcl_EvalObjEx
  2. Tcl_EvalObjEx IS much slower on 8.6
  3. normal Tcl scripts do not iterate over Tcl_EvalObjEx
So many benchmarks overemphasize 8.6's slowdowns, some by huge margins. If this is one of those or not, I do not know. My guess is that it will be between 'measurably' and 'quite a bit'

ferrieux added on 2011-07-28 00:38:30:
Also, I'd check whether the literals shimmer at each turn of the [times] loop. I remember a chat session with miguel, where we discovered that odd things happened to multi-iteration-[times], and didn't with a regular [for] loop in a proc.

andreas_kupries added on 2011-07-27 23:42:04:
Chat excerpts...

aku: Might be interesting to see how the bytecode of 8.4/5/6 differ for the expressions.
dgp: Some bisecting to isolate things might be interesting, if the slowdown happened in discrete steps.

Attachments: