Tcl Source Code

View Ticket
Login
Ticket UUID: 1692378
Title: TIP 280: performance impact
Type: Bug Version: obsolete: 8.6a1
Submitter: nemethi Created on: 2007-04-01 15:47:16
Subsystem: 47. Bytecode Compiler Assigned To: andreas_kupries
Priority: 8 Severity:
Status: Closed Last Modified: 2011-09-24 05:13:15
Resolution: Fixed Closed By: ferrieux
    Closed on: 2011-09-23 22:13:15
Description:
Consider the following simple script:

proc p {arg} {
    switch $arg {
        a {return A}
        b {return B}
        c {return C}
        d {return D}
        e {return E}
        f {return F}
        g {return G}
        h {return H}
        i {return I}
        j {return J}
        k {return K}
        l {return L}
        m {return M}
        n {return N}
        o {return O}
        p {return P}
        q {return Q}
        r {return R}
        s {return S}
        t {return T}
    }
}

puts [time {p "a"} 100000]      ;# matching case
puts [time {p "x"} 100000]      ;# non-matching case

Here is the output when running Tcl 8.4.13 on my rather slow Linux box (SuSE Linux 10.0):

4.89964 microseconds per iteration
5.37998 microseconds per iteration

And here is the output produced by Tcl 8.4.14:

15.80254 microseconds per iteration
5.47919 microseconds per iteration

The last Tcl 8.5.a6 from the CVS performs even more poorly:

18.48226 microseconds per iteration
5.43836 microseconds per iteration

The results above expose a drastical performance regression in the matching case!
User Comments: ferrieux added on 2011-09-24 05:13:15:
Fixed by TIP#378.

msofer added on 2008-07-23 21:33:17:
Logged In: YES 
user_id=148712
Originator: NO

Two things that we can do to reduce the speed impact of Tip #280. Ticket 
now marked as HEAD, but these may be backportable too.

----
(1) Compute things lazily!

Looking specifically at the canon-list path in Tcl_EvalObjv, we are now
getting and searching the string rep of every element in the list. This
work is generally wasted, it is only needed when and if a stack trace is
generated. It may even be very costly, if one of the elements turns out to
be a big list.

So, at least for TCL_LOCATION_EVAL_LIST, we could just store a ref to
objPtr (without a refCount, it is guaranteed to be alive until the command
returns, ie until after a stack trace entry was either generated or
not). If and when a stack trace has to be generated, the search of the
elements' string rep can be performed.

Note that
(a) this may just show how I did not get TIP #280; if true, sorry
(b) similar stuff can possibly be done for the other paths, but the payoff
may be smaller: all other eval paths (compile, interp) have a string rep
available or generate one before running anyway

----
(2) Only attempt to manage/index/whatever Tcl_Objs that can come from
source, avoid dealing with dynamically generated Tcl_Objs if
possible. Currently a Tcl_Obj that either
   a) have no string rep
   b) have a List intrep and the canonicalFlag is set
are guaranteed to be dynamically generated; TIP 280 code should be as
gentle as possible with them

dgp added on 2007-04-11 12:41:04:
Logged In: YES 
user_id=80530
Originator: NO


The dgp-refactor branch is
now up to date with HEAD changes.

There's still much to do to try to
blend the two approaches; the current
merge just makes the existing TIP 280
implementation work on that branch.
It will be at least several days before
I can look at this again, but I wanted
you to know at least the branch merge
was complete.

andreas_kupries added on 2007-04-03 04:00:58:
Logged In: YES 
user_id=75003
Originator: NO


> Can you comment on what compelling feature TIP 280 provides that causes
> it to be included in the ActiveTcl fork of Tcl 8.4.* ?

A customer (Cisco) needing this feature, basing their libraries, etc. on 8.4, and unable to wait for 8.5.

> Regarding solutions, my preference, is still to build on things in the
> dgp-refactor-branch, and store line, file, etc. data in a Tcl_Obj for
> a string parsed into Tcl_Tokens. This would be an example of
> option (2), IIUC.

Interesting. I am aware of your dgp-refactor-branch, but do not truly know what has been going on in there. As such I was not aware that we could have based #280 on work done in that branch. Yes, from the quick description this seems to be a (2). I.e. my understanding is that the modified eval parses a string into an extended token rep and executes the tokens. This token-rep is kept around for the next eval, allowing us to skip the expensive parsing, and with respect to #280 it definitely seems to be a natural location where to cache the line, etc. information. I speculate that the token rep could also be used as the input for modified byte code compiler functions. Full speedup where we parse & byte compile & cache byte codes, partial speed up were we parse and cache tokens, no speedup where we have to reparse every time.

I will try to find time to look into this branch. How much has it diverged from head by now ? I seem to remember that I haven't seen a merge from head to this branch in a long time now, several months I think.

Also, are these changes for which we need a TIP ? And even if we do not need it, should we have a TIP, to document all the changes (new structures I guess, functions, etc.) ?

>I've never made time to pursue that, and quite rightly an implementation
>of TIP 280 got done without waiting around for my lack of progress.

Rightly maybe, but still/also a pity.

>I've got much less to offer regarding ways
>to tweak the existing TIP 280 implementation
>to address this issue, since my thought
>patterns tend to fall back into my own
>solutions, and since I've never made time
>to fully digest the workings of the existing
>implementation.

Very understandable.

I wonder how difficult a merge has become by now, given that both #280 and refactor both touch the eval implementation, and maybe also the compile functions.

nemethi added on 2007-04-03 03:38:03:
Logged In: YES 
user_id=317958
Originator: YES

Thanks to Andreas for his deep analysis.

Just to make things clear:  I have *not* mixed up anything regarding Tcl 8.4.14.  I am using ActiveTcl8.4.14.0.272572-linux-ix86 (does this make a difference?).

I have repeated the test, this time with the script sr.tcl.  Here are the results on my old Linux box:

8.4.13
p/match   4.91813 microseconds per iteration
p/!match  5.41663 microseconds per iteration
q/match   5.2057 microseconds per iteration
q/!match  5.71473 microseconds per iteration

8.4.14
p/match   15.7745 microseconds per iteration
p/!match  5.35078 microseconds per iteration
q/match   15.97124 microseconds per iteration
q/!match  5.52261 microseconds per iteration

8.5a6
p/match   19.0956 microseconds per iteration
p/!match  5.49823 microseconds per iteration
q/match   2.94506 microseconds per iteration
q/!match  2.99353 microseconds per iteration

It is good to know that the -exact option improves the performance when using Tcl 8.5.  It is, however, not very promising to see that w/o this option the switch command performs so much slower.  Just think of the many existing applications and libraries written w/o using the -exact option.  After all, that is the default!

dgp added on 2007-04-03 03:32:48:
Logged In: YES 
user_id=80530
Originator: NO


Thanks for the correction that this
is [switch]-specific.

Can you comment on what compelling
feature TIP 280 provides that causes
it to be included in the ActiveTcl
fork of Tcl 8.4.* ?

Regarding solutions, my preference,
is still to build on things in the
dgp-refactor-branch, and store line,
file, etc. data in a Tcl_Obj for
a string parsed into Tcl_Tokens. 
This would be an example of
option (2), IIUC.

I've never made time to pursue that,
and quite rightly an implementation
of TIP 280 got done without waiting
around for my lack of progress.  I've
got much less to offer regarding ways
to tweak the existing TIP 280 implementation
to address this issue, since my thought
patterns tend to fall back into my own
solutions, and since I've never made time
to fully digest the workings of the existing
implementation.

andreas_kupries added on 2007-04-03 02:30:19:
Logged In: YES 
user_id=75003
Originator: NO

ActiveTcl ... Yes, that does make a difference. Because ActiveState compiles its tclsh's using a backport of TIPs #280. We put an indicator about this into 'tcl_platform'. So you are likely running into the same #280 issue, except that for 8.4 use of '-exact --' is not a viable workaround, because switch has only the simple byte code compilation, and not the deep compilation. That came with 8.5. Which explains that q/match is like p/match for that variant.

nemethi added on 2007-04-03 02:20:00:
Logged In: YES 
user_id=317958
Originator: YES

Thanks to Andreas for his deep analysis.

Just to make things clear:  I have *not* mixed up anything regarding Tcl 8.4.14.  I am using ActiveTcl8.4.14.0.272572-linux-ix86 (does this make a difference?).

I have repeated the test, this time with the script sr.tcl.  Here are the results on my old Linux box:

8.4.13
p/match   4.91813 microseconds per iteration
p/!match  5.41663 microseconds per iteration
q/match   5.2057 microseconds per iteration
q/!match  5.71473 microseconds per iteration

8.4.14
p/match   15.7745 microseconds per iteration
p/!match  5.35078 microseconds per iteration
q/match   15.97124 microseconds per iteration
q/!match  5.52261 microseconds per iteration

8.5a6
p/match   19.0956 microseconds per iteration
p/!match  5.49823 microseconds per iteration
q/match   2.94506 microseconds per iteration
q/!match  2.99353 microseconds per iteration

It is good to know that the -exact option improves the performance when using Tcl 8.5.  It is, however, not very promising to see that w/o this option the switch command performs so much slower.  Just think of the many existing applications and libraries written w/o using the -exact option.  After all, that is the default!

andreas_kupries added on 2007-04-03 01:57:19:
Logged In: YES 
user_id=75003
Originator: NO

In this specific instance the issue is not general overhead in eval, but switch-specific overhead, due to its ability to take the patterns and branches as a single list. Which also has become the standard form.

dgp added on 2007-04-03 01:30:18:
Logged In: YES 
user_id=80530
Originator: NO


Dropping priority since a workaround
has been identified and confirmed.
Remaining issue appears to be general
additional cost of an eval when TIP 280
is active, and a good answer to that
will require some reflection.

andreas_kupries added on 2007-04-03 01:24:01:
Logged In: YES 
user_id=75003
Originator: NO

Attached a test script containing the original p proc, and a q proc using '-exact --'. My results are:

8.5a6
p/match   7.96166 microseconds per iteration
p/!match  2.14355 microseconds per iteration
q/match   1.34635 microseconds per iteration
q/!match  1.35886 microseconds per iteration

8.4.15
p/match   1.94189 microseconds per iteration
p/!match  1.98629 microseconds per iteration
q/match   2.01993 microseconds per iteration
q/!match  2.07895 microseconds per iteration

A slow down in the matching case for p, but not for q (-exact --).
Current theory regarding this slowdown:

In 'q', due to the use of '-exact --' the switch command is deeply byte-compiled, including the branches. This keeps it fast. However in 'p' the generated byte code falls back to the regular command implementation, which 'eval's the found branch. Due #280 this eval is now slower. The non-matching case is still as fast because no branch is found, i.e. there is no slow eval, it is skipped. Does that make sense ?

To test the theory I disabled the code in lines 2933ff and 3002ff in tclCmdMZ.c ... The time for p/match went down to 8.4.15 levels. Refining this further I am coming to the conclusion that the main cause is the call to 'TclListLines' in line 2962. In other words, it is not, as theorized, the eval which is the problem, but the fact that we have to reparse the switch body argument to determine the line the found branch is starting on/at. And we do this every time the switch executed.

Our options to handle this are:

(1) Repeal all of #280.
(2) Find a way to cache the result of the 'TclListLines' so that the parsing is done only once, and further executions of the same switch use the cached information.
(3) Find a way to defer the 'TclListLines', i.e. instead of doing it immediately do it only when needed, i.e. when the information is asked for.

No #1 is not very appealing, as we would loose the whole functionality.
No #3 is conceptually the most elegant IMVHO, but implementation wise I already see problems. Because the EvalObjEx of the switch branch will count line numbers and for that it will need the line the branch is at. I.e. the deferal is not very far. It would work only if the EvalEx counts relative and creates another structure X = <base,relative-line>, where <base> is the defered computation, and X can be defered too. That can become very complex to manage.
No #2 is conceptually not very elegant, i.e. it feels more hackish. compared to #3 its implementation should however be much easier. We 'only' need a way to associate the 'TclListLines' result with (a) the switch, or (b) the switch body list. Maybe a hashtable keyed by the address of the switch body object. Which is a literal, not moving around ... This is the same idea as used for the byte codes, see the 'lineBCPtr' hash table.

Re-assigning to dgp for review of my comments and ideas. Don, feel free to assign to Donal as well for an opinion, given that switch is his. Well, at least the BC compiler part.

andreas_kupries added on 2007-04-03 00:45:38:

File Added - 223419: sr.tcl

Logged In: YES 
user_id=75003
Originator: NO

File Added: sr.tcl

dgp added on 2007-04-02 23:08:06:
Logged In: YES 
user_id=80530
Originator: NO


Thanks for reporting this, and for
the demo script.

I can't confirm your claims about
Tcl 8.4.14; I think you've mixed
something up on that claim.

The slowdown on HEAD Tcl is real, and
came in with the 2006-11-28 commit of
the TIP 280 implementation.

martinlemburg added on 2007-04-02 21:06:17:
Logged In: YES 
user_id=802134
Originator: NO

Sorry for a missing information - my machine running the timing test on:

MS Windows XP Prof (Windows NT 5.1)
Intel P4 2.80GHz, 1GB RAM

martinlemburg added on 2007-04-02 21:04:16:
Logged In: YES 
user_id=802134
Originator: NO

Using the tclkit 8.5a4 (version taken from [info patchlevel]) I have no performance regression at all:

    % puts [time {p "a"} 100000] ;# matching case
    1.62821 microseconds per iteration
    % puts [time {p "x"} 100000] ;# non-matching case
    1.72318 microseconds per iteration

I mostly use switch with its -exact option - mostly, because I seldom need switch with other matching mechanisms.
Here the timing results using switch with -exact:

    % puts [time {p "a"} 100000] ;# matching case
    1.4099 microseconds per iteration
    % puts [time {p "x"} 100000] ;# non-matching case
    2.25362 microseconds per iteration

Where as I don't understand the increased runtime consumption for a non matching case.

dkf added on 2007-04-02 16:45:48:
Logged In: YES 
user_id=79902
Originator: NO

Absolutely nothing in tclCmdMZ.c changed from 8.4.13 to 8.4.14; those releases used the same version of file. Whatever is going on is unrelated.

In 8.5, you'll want to add '-exact --' options to enable maximally efficient processing.

Attachments: