Tcl Source Code: View Ticket

Ticket UUID:	1692378
Title:	TIP 280: performance impact
Type:	Bug	Version:	obsolete: 8.6a1
Submitter:	nemethi	Created on:	2007-04-01 15:47:16
Subsystem:	47. Bytecode Compiler	Assigned To:	andreas_kupries
Priority:	8	Severity:
Status:	Closed	Last Modified:	2011-09-24 05:13:15
Resolution:	Fixed	Closed By:	ferrieux
		Closed on:	2011-09-23 22:13:15
Description:	Consider the following simple script: proc p {arg} { switch $arg { a {return A} b {return B} c {return C} d {return D} e {return E} f {return F} g {return G} h {return H} i {return I} j {return J} k {return K} l {return L} m {return M} n {return N} o {return O} p {return P} q {return Q} r {return R} s {return S} t {return T} } } puts [time {p "a"} 100000] ;# matching case puts [time {p "x"} 100000] ;# non-matching case Here is the output when running Tcl 8.4.13 on my rather slow Linux box (SuSE Linux 10.0): 4.89964 microseconds per iteration 5.37998 microseconds per iteration And here is the output produced by Tcl 8.4.14: 15.80254 microseconds per iteration 5.47919 microseconds per iteration The last Tcl 8.5.a6 from the CVS performs even more poorly: 18.48226 microseconds per iteration 5.43836 microseconds per iteration The results above expose a drastical performance regression in the matching case!
User Comments:	ferrieux added on 2011-09-24 05:13:15: Fixed by TIP#378. msofer added on 2008-07-23 21:33:17: Logged In: YES user_id=148712 Originator: NO Two things that we can do to reduce the speed impact of Tip #280. Ticket now marked as HEAD, but these may be backportable too. ---- (1) Compute things lazily! Looking specifically at the canon-list path in Tcl_EvalObjv, we are now getting and searching the string rep of every element in the list. This work is generally wasted, it is only needed when and if a stack trace is generated. It may even be very costly, if one of the elements turns out to be a big list. So, at least for TCL_LOCATION_EVAL_LIST, we could just store a ref to objPtr (without a refCount, it is guaranteed to be alive until the command returns, ie until after a stack trace entry was either generated or not). If and when a stack trace has to be generated, the search of the elements' string rep can be performed. Note that (a) this may just show how I did not get TIP #280; if true, sorry (b) similar stuff can possibly be done for the other paths, but the payoff may be smaller: all other eval paths (compile, interp) have a string rep available or generate one before running anyway ---- (2) Only attempt to manage/index/whatever Tcl_Objs that can come from source, avoid dealing with dynamically generated Tcl_Objs if possible. Currently a Tcl_Obj that either a) have no string rep b) have a List intrep and the canonicalFlag is set are guaranteed to be dynamically generated; TIP 280 code should be as gentle as possible with them dgp added on 2007-04-11 12:41:04: Logged In: YES user_id=80530 Originator: NO The dgp-refactor branch is now up to date with HEAD changes. There's still much to do to try to blend the two approaches; the current merge just makes the existing TIP 280 implementation work on that branch. It will be at least several days before I can look at this again, but I wanted you to know at least the branch merge was complete. andreas_kupries added on 2007-04-03 04:00:58: Logged In: YES user_id=75003 Originator: NO > Can you comment on what compelling feature TIP 280 provides that causes > it to be included in the ActiveTcl fork of Tcl 8.4.* ? A customer (Cisco) needing this feature, basing their libraries, etc. on 8.4, and unable to wait for 8.5. > Regarding solutions, my preference, is still to build on things in the > dgp-refactor-branch, and store line, file, etc. data in a Tcl_Obj for > a string parsed into Tcl_Tokens. This would be an example of > option (2), IIUC. Interesting. I am aware of your dgp-refactor-branch, but do not truly know what has been going on in there. As such I was not aware that we could have based #280 on work done in that branch. Yes, from the quick description this seems to be a (2). I.e. my understanding is that the modified eval parses a string into an extended token rep and executes the tokens. This token-rep is kept around for the next eval, allowing us to skip the expensive parsing, and with respect to #280 it definitely seems to be a natural location where to cache the line, etc. information. I speculate that the token rep could also be used as the input for modified byte code compiler functions. Full speedup where we parse & byte compile & cache byte codes, partial speed up were we parse and cache tokens, no speedup where we have to reparse every time. I will try to find time to look into this branch. How much has it diverged from head by now ? I seem to remember that I haven't seen a merge from head to this branch in a long time now, several months I think. Also, are these changes for which we need a TIP ? And even if we do not need it, should we have a TIP, to document all the changes (new structures I guess, functions, etc.) ? >I've never made time to pursue that, and quite rightly an implementation >of TIP 280 got done without waiting around for my lack of progress. Rightly maybe, but still/also a pity. >I've got much less to offer regarding ways >to tweak the existing TIP 280 implementation >to address this issue, since my thought >patterns tend to fall back into my own >solutions, and since I've never made time >to fully digest the workings of the existing >implementation. Very understandable. I wonder how difficult a merge has become by now, given that both #280 and refactor both touch the eval implementation, and maybe also the compile functions. nemethi added on 2007-04-03 03:38:03: Logged In: YES user_id=317958 Originator: YES Thanks to Andreas for his deep analysis. Just to make things clear: I have not mixed up anything regarding Tcl 8.4.14. I am using ActiveTcl8.4.14.0.272572-linux-ix86 (does this make a difference?). I have repeated the test, this time with the script sr.tcl. Here are the results on my old Linux box: 8.4.13 p/match 4.91813 microseconds per iteration p/!match 5.41663 microseconds per iteration q/match 5.2057 microseconds per iteration q/!match 5.71473 microseconds per iteration 8.4.14 p/match 15.7745 microseconds per iteration p/!match 5.35078 microseconds per iteration q/match 15.97124 microseconds per iteration q/!match 5.52261 microseconds per iteration 8.5a6 p/match 19.0956 microseconds per iteration p/!match 5.49823 microseconds per iteration q/match 2.94506 microseconds per iteration q/!match 2.99353 microseconds per iteration It is good to know that the -exact option improves the performance when using Tcl 8.5. It is, however, not very promising to see that w/o this option the switch command performs so much slower. Just think of the many existing applications and libraries written w/o using the -exact option. After all, that is the default! dgp added on 2007-04-03 03:32:48: Logged In: YES user_id=80530 Originator: NO Thanks for the correction that this is [switch]-specific. Can you comment on what compelling feature TIP 280 provides that causes it to be included in the ActiveTcl fork of Tcl 8.4.* ? Regarding solutions, my preference, is still to build on things in the dgp-refactor-branch, and store line, file, etc. data in a Tcl_Obj for a string parsed into Tcl_Tokens. This would be an example of option (2), IIUC. I've never made time to pursue that, and quite rightly an implementation of TIP 280 got done without waiting around for my lack of progress. I've got much less to offer regarding ways to tweak the existing TIP 280 implementation to address this issue, since my thought patterns tend to fall back into my own solutions, and since I've never made time to fully digest the workings of the existing implementation. andreas_kupries added on 2007-04-03 02:30:19: Logged In: YES user_id=75003 Originator: NO ActiveTcl ... Yes, that does make a difference. Because ActiveState compiles its tclsh's using a backport of TIPs #280. We put an indicator about this into 'tcl_platform'. So you are likely running into the same #280 issue, except that for 8.4 use of '-exact --' is not a viable workaround, because switch has only the simple byte code compilation, and not the deep compilation. That came with 8.5. Which explains that q/match is like p/match for that variant. nemethi added on 2007-04-03 02:20:00: Logged In: YES user_id=317958 Originator: YES Thanks to Andreas for his deep analysis. Just to make things clear: I have not mixed up anything regarding Tcl 8.4.14. I am using ActiveTcl8.4.14.0.272572-linux-ix86 (does this make a difference?). I have repeated the test, this time with the script sr.tcl. Here are the results on my old Linux box: 8.4.13 p/match 4.91813 microseconds per iteration p/!match 5.41663 microseconds per iteration q/match 5.2057 microseconds per iteration q/!match 5.71473 microseconds per iteration 8.4.14 p/match 15.7745 microseconds per iteration p/!match 5.35078 microseconds per iteration q/match 15.97124 microseconds per iteration q/!match 5.52261 microseconds per iteration 8.5a6 p/match 19.0956 microseconds per iteration p/!match 5.49823 microseconds per iteration q/match 2.94506 microseconds per iteration q/!match 2.99353 microseconds per iteration It is good to know that the -exact option improves the performance when using Tcl 8.5. It is, however, not very promising to see that w/o this option the switch command performs so much slower. Just think of the many existing applications and libraries written w/o using the -exact option. After all, that is the default! andreas_kupries added on 2007-04-03 01:57:19: Logged In: YES user_id=75003 Originator: NO In this specific instance the issue is not general overhead in eval, but switch-specific overhead, due to its ability to take the patterns and branches as a single list. Which also has become the standard form. dgp added on 2007-04-03 01:30:18: Logged In: YES user_id=80530 Originator: NO Dropping priority since a workaround has been identified and confirmed. Remaining issue appears to be general additional cost of an eval when TIP 280 is active, and a good answer to that will require some reflection. andreas_kupries added on 2007-04-03 01:24:01: Logged In: YES user_id=75003 Originator: NO Attached a test script containing the original p proc, and a q proc using '-exact --'. My results are: 8.5a6 p/match 7.96166 microseconds per iteration p/!match 2.14355 microseconds per iteration q/match 1.34635 microseconds per iteration q/!match 1.35886 microseconds per iteration 8.4.15 p/match 1.94189 microseconds per iteration p/!match 1.98629 microseconds per iteration q/match 2.01993 microseconds per iteration q/!match 2.07895 microseconds per iteration A slow down in the matching case for p, but not for q (-exact --). Current theory regarding this slowdown: In 'q', due to the use of '-exact --' the switch command is deeply byte-compiled, including the branches. This keeps it fast. However in 'p' the generated byte code falls back to the regular command implementation, which 'eval's the found branch. Due #280 this eval is now slower. The non-matching case is still as fast because no branch is found, i.e. there is no slow eval, it is skipped. Does that make sense ? To test the theory I disabled the code in lines 2933ff and 3002ff in tclCmdMZ.c ... The time for p/match went down to 8.4.15 levels. Refining this further I am coming to the conclusion that the main cause is the call to 'TclListLines' in line 2962. In other words, it is not, as theorized, the eval which is the problem, but the fact that we have to reparse the switch body argument to determine the line the found branch is starting on/at. And we do this every time the switch executed. Our options to handle this are: (1) Repeal all of #280. (2) Find a way to cache the result of the 'TclListLines' so that the parsing is done only once, and further executions of the same switch use the cached information. (3) Find a way to defer the 'TclListLines', i.e. instead of doing it immediately do it only when needed, i.e. when the information is asked for. No #1 is not very appealing, as we would loose the whole functionality. No #3 is conceptually the most elegant IMVHO, but implementation wise I already see problems. Because the EvalObjEx of the switch branch will count line numbers and for that it will need the line the branch is at. I.e. the deferal is not very far. It would work only if the EvalEx counts relative and creates another structure X = <base,relative-line>, where <base> is the defered computation, and X can be defered too. That can become very complex to manage. No #2 is conceptually not very elegant, i.e. it feels more hackish. compared to #3 its implementation should however be much easier. We 'only' need a way to associate the 'TclListLines' result with (a) the switch, or (b) the switch body list. Maybe a hashtable keyed by the address of the switch body object. Which is a literal, not moving around ... This is the same idea as used for the byte codes, see the 'lineBCPtr' hash table. Re-assigning to dgp for review of my comments and ideas. Don, feel free to assign to Donal as well for an opinion, given that switch is his. Well, at least the BC compiler part. andreas_kupries added on 2007-04-03 00:45:38: File Added - 223419: sr.tcl Logged In: YES user_id=75003 Originator: NO File Added: sr.tcl dgp added on 2007-04-02 23:08:06: Logged In: YES user_id=80530 Originator: NO Thanks for reporting this, and for the demo script. I can't confirm your claims about Tcl 8.4.14; I think you've mixed something up on that claim. The slowdown on HEAD Tcl is real, and came in with the 2006-11-28 commit of the TIP 280 implementation. martinlemburg added on 2007-04-02 21:06:17: Logged In: YES user_id=802134 Originator: NO Sorry for a missing information - my machine running the timing test on: MS Windows XP Prof (Windows NT 5.1) Intel P4 2.80GHz, 1GB RAM martinlemburg added on 2007-04-02 21:04:16: Logged In: YES user_id=802134 Originator: NO Using the tclkit 8.5a4 (version taken from [info patchlevel]) I have no performance regression at all: % puts [time {p "a"} 100000] ;# matching case 1.62821 microseconds per iteration % puts [time {p "x"} 100000] ;# non-matching case 1.72318 microseconds per iteration I mostly use switch with its -exact option - mostly, because I seldom need switch with other matching mechanisms. Here the timing results using switch with -exact: % puts [time {p "a"} 100000] ;# matching case 1.4099 microseconds per iteration % puts [time {p "x"} 100000] ;# non-matching case 2.25362 microseconds per iteration Where as I don't understand the increased runtime consumption for a non matching case. dkf added on 2007-04-02 16:45:48: Logged In: YES user_id=79902 Originator: NO Absolutely nothing in tclCmdMZ.c changed from 8.4.13 to 8.4.14; those releases used the same version of file. Whatever is going on is unrelated. In 8.5, you'll want to add '-exact --' options to enable maximally efficient processing.

Attachments:

sr.tcl [download] added by andreas_kupries on 2007-04-03 00:45:38. [details]