Tcl Source Code

View Ticket
Login
Ticket UUID: 3485833
Title: trace - inexplicable large overhead by ensemble commands
Type: Bug Version: obsolete: 8.5.11
Submitter: sebres Created on: 2012-02-08 18:43:04
Subsystem: 46. Traces Assigned To: dgp
Priority: 3 Low Severity: Minor
Status: Open Last Modified: 2017-01-04 14:25:01
Resolution: Fixed Closed By: nobody
    Closed on:
Description:
Hello, 

I've found that by "execution" trace some ensemble commands produce a lot of overhead... 
without trace it is comparable with not ensemble commands
i have own profiler (to check which code works slowly), that calls somthing like this :

Tcl_CreateObjTrace( interp, level, TCL_ALLOW_INLINE_COMPILATION | ((TCL_TRACE_ANY_EXEC) >> 2),
         &tclexProfilerTraceProc, inDbgObj, NULL );
The C procedure "tclexProfilerTraceProc" i have rewritten to test this, that it performs "return TCL_OK;" only.

Strange is the execution times of "string length" and "::tcl::string::length" are not different, 
but "string range" or "string first" are 100 times slower as its not ensemble commands "::tcl::string::range" or "::tcl::string::first".

That overhead make me not possible, to profile anything code, using ensembles.

Tcl code to be "profiled":
%%%%
set s [string repeat A 50000]; puts ok
puts "--- string range ----"
puts [time {::tcl::string::range $s 2000 2001} 100000]
puts [time {::string range $s 2000 2001} 10000]
puts "--- string first ----"
puts [time {::tcl::string::first A $s 2000} 100000]
puts [time {::string first A $s 2000} 10000]
puts "--- string length ---"
puts [time {::tcl::string::length $s} 100000]
puts [time {::string length $s} 10000]
puts "--- test ------------"
proc ::test1_string_range {s from to} {set a 5}
proc ::test2_string_range {ens s from to} {set a 5}
puts [time {::test1_string_range $s 2000 2001} 10000]
puts [time {::test2_string_range range $s 2000 2001} 10000]
%%%%

Result is :
ok
--- string range ----
1.4 microseconds per iteration
168.8 microseconds per iteration !!!
--- string first ----
1.09 microseconds per iteration
168.8 microseconds per iteration !!!
--- string length ---
0.31 microseconds per iteration
0.0 microseconds per iteration
--- test ------------
1.6 microseconds per iteration
1.5 microseconds per iteration
User Comments: sebres added on 2017-01-04 14:25:01:
Interim status of subject:

Even without any tracing, current trunk have still very large overhead, that concerns to ensemble handling (by the way, greater than in 8.5). 
It could be better explained using my new command "timerate" (fairly more precise by measurement as the "time"), currently available for trunk-based branches in https://github.com/sebres/tcl/pull/3

Following example will provide you insight about ensemble-overhead (compiled as well as non-compiled execution takes by ensemble 70% more time):

```
% set s [string repeat A 50000]; puts ok
ok
% # calibrate to minimize influence of measurement overhead (and byte code execution overhead):

% timerate -calibrate {}
0.04537374883248807 µs/#-overhead, 0.045506 µs/#, 42895062 #, 21974929 #/sec

% # test compiled execution:
% timerate {string first A $s 2000}
0.321560 µs/#, 2725287 #, 3109837 #/sec
% timerate {::tcl::string::first A $s 2000}
0.191541 µs/#, 4220926 #, 5220810 #/sec

% # calibrate again (slower, non-compiled direct execution via TclEvalObjEx):
% timerate -direct -calibrate {}
0.07950384713110215 µs/#-overhead, 0.080109 µs/#, 24366669 #, 12482924 #/sec

% timerate -direct {string first A $s 2000}
0.313794 µs/#, 2542604 #, 3186803 #/sec
% timerate -direct {::tcl::string::first A $s 2000}
0.190985 µs/#, 3697004 #, 5236000 #/sec
```

sebres added on 2012-03-07 22:43:32:
Experimental trying to use TclGetSrcInfoForCmd without special sentinel ENSEMBLE_PSEUDO_COMMAND (as suggested from Don) 
commited in bug-3485833.

sebres added on 2012-03-07 16:00:51:
I have tested it with (-1) insted of ENSEMBLE_PSEUDO_COMMAND - the output is the same. 
This happens because TraceExecutionProc gets the command from objv[0] and that is "::tcl::string::range". 

I am not sure, perhaps we can use the "simple" variant that you have suggested - without ENSEMBLE_PSEUDO_COMMAND. 
I will test it with my profiler - I have therein a strange piece of code, that build calling command name from current namespace and "char * command", when objv[0] is NULL or objc = 0. 
It may be obsolete (since 8.4 or 8.5).

sebres added on 2012-03-07 06:41:36:
I have it unfortunately in C as part of my tcl profiler (very large + custom tcl),
but in pure TCL it is something like this (I do not know whether all interp trace flags are the same):

% proc test {} {string range {} 0 0}
% trace add execution test enterstep {put_stack}
% proc put_stack {args} {puts [join $args " - "]}
% test
string range {} 0 0 - enterstep
::tcl::string::range {} 0 0 - enterstep

dgp added on 2012-03-07 01:31:29:
I don't follow.  Do you have a test case we
can add to the test suite that would demo the
difference?

sebres added on 2012-03-07 01:19:19:
By following code :
proc ::A::test {} {
  string range Test 0 4
}

we should have a following "call stack" by tracing:

::tcl::string::RANGE
string
::A::test

without shift from "string range" to "range" we have following "call stack" :

::tcl::string::STRING
string
::A::test

dgp added on 2012-03-06 22:31:57:
I don't understand the purpose in the patched
TclGetSrcInfoFroCmd() of moving the "command"
pointer forward off of "string" and onto "range" in
the comment example "string range".

If we've already located a pointer to "string range"
seems to me that's the traced command and we can
just use it without modification.  If there's no need for
modification, then there's no need for a special sentinel
value to trigger the modification, and things can get much
simpler.

If there's really a need for this, I need it explained to me, and
we should probably add a test demonstrating it.  We'll also
have to fix the code so it works better.  It's not correct to just
assume that each word in a Tcl command is separated from
the next word by a single space.

sebres added on 2012-03-06 16:38:39:
3) commited in "dgp-scan-element"
http://core.tcl.tk/tcl/info/a16a2b5417

dgp added on 2012-03-06 02:12:25:
Branch bug-3485833 opened to track item 2)
independently.

dgp added on 2012-03-06 01:44:30:
See ticket 3497099 for continued work on item 3).

dgp added on 2012-03-06 01:15:21:
Sadly, the dgp-scan-element branch now holds
at least three different, separable changes, that
really ought to receive independent review. IMHO.

1) Speedup of TclScanElement() that strangely doesn't
    appear to speed things up very much except in this
    particular demo.

2) The alternative kludge to solve/avoid the problem
     originally reported.

3) Some apparent portability issues in the "registry"
    and "dde" packages.

dgp added on 2012-02-14 20:30:22:
Thanks, I see it.  Hope to have time on Thursday to
take a good look.

sebres added on 2012-02-14 20:09:40:
4 Don:

i have checked into your leaf "dgp-scan-element" :
http://core.tcl.tk/tcl/info/d6c4824b30

dgp added on 2012-02-14 00:58:44:
Some bug in this Tracker is preventing my download
of the attachments.  If you could please commit the
patches to some branch of the fossil repos, that would
be a good way to get it to reviewers.  Thanks!

sebres added on 2012-02-13 23:18:08:

File Added - 435716: traceEnsembleOverhead3.patch

sebres added on 2012-02-13 23:17:41:
The last mentioned exception in itcl3.4 is recently occured within execution trace with 8.5.11 (original) also. Therewith it is another bug (possible anywhere in itcl) - not yet found. It's very hardly to reproduce, cause occured sporadical only by initializing (within source in new thread).

For this reason, the solution is fine for me.
Patch with tiny changes attached.

sebres added on 2012-02-10 23:36:54:
not yet really safe - within trace using itcl3.4 tests I have access violation (at TclGetSrcInfoForPc+0x14C). without trace ok.
I will search and fix it later.

sebres added on 2012-02-10 23:02:44:

File Added - 435425: traceEnsembleOverhead2.patch

sebres added on 2012-02-10 22:45:52:
dgp:
> When the the string version is needed, a list is built and 
> the string rep of that list is generated, costing time 
> proportional to the total length of all the argument values

Thanks for your reference Don, you got right.

I have found another (bether) solution as my workaround before.
I use TclGetSrcInfoForCmd, but with "-2", 
to shift string pointer from "string range ..." to "range ..."

I've fixed it for 8.5.11 (org) and my fork. It work correct, but some tests still running.

Patch is attached now.
Your reviews are welcome.

sebres added on 2012-02-10 22:38:41:

File Added - 435424: traceEnsembleOverhead.patch

sebres added on 2012-02-10 18:36:20:
In my tcl fork I have midified a some routines such Tcl_CreateEnsemble, Tcl_SetEnsembleSubcommandList, Tcl_SetEnsembleMappingDict etc.

So after ensemble namespace and subcommands are created, the main ensemle command will be a "dummy" command without compileProc (TclCompileEnsemble), that simple forwards a first arg to corresponding command in ensemble namespace.

I try later to create a patch for core-8-5-11...

My performance now w/o tracing:
--- string range ----
0.62 microseconds per iteration
0.78 microseconds per iteration
--- string first ----
0.63 microseconds per iteration
0.78 microseconds per iteration
--- string length ---
0.31 microseconds per iteration
0.16 microseconds per iteration

Performance with tracing:
--- string range ----
2.16 microseconds per iteration
2.8 microseconds per iteration
--- string first ----
1.78 microseconds per iteration
1.8 microseconds per iteration
--- string length ---
0.32 microseconds per iteration
0.32 microseconds per iteration

By tclbench it seems to be equal.
But strange is, that some my own test scripts running up to one-fifth faster as earlier.

dkf added on 2012-02-10 05:12:48:
I'm not sure that this is a cost that tclbench is set up to spot; it's largely focussed on bytecode/command costs as they bite most people much harder. Still, this seems to make quite a bit of difference to the speed of ensemble command dispatch with 8.5 (around 40% faster with an --enable-symbols=all build) which is very good, if naively surprising. In a fully optimized build, it makes 17.5% difference when traces are turned on, and seems to be irrelevant without traces.

Make it so.

sebres added on 2012-02-10 04:40:06:
the patch is greate...
> they are not clearly demonstrated by tclbench.
There are probably large effect on some things such initial compile and loading (source etc).

dgp added on 2012-02-10 04:29:33:
If the dgp-scan-element branch has significant
performance pros or cons, they are not clearly 
demonstrated by tclbench.

dgp added on 2012-02-10 01:43:22:
The cost of TclScanElement() in dkf's profile
caught my eye, notably as compared to
TclConvertElement().  We can't do anything
about the inherent O(N) nature of the scan,
but the constant seemed higher than needed.

On the dgp-scan-element branch I have some
revisions aimed at brining down the constant.
On the test at issue in this ticket, it knocks about
half off the tracing cost.  No new test failures.
Still running tclbench to see if there's some
countervailing cost somewhere.

dgp added on 2012-02-09 23:33:16:
The usual opinion on this has been that
when this level of tracing is active, yes
performance will be harmed.  But if you
have this level of tracing turned on, we
assume performance is not the concern,
but rather some sort of debugging or profiling.  Optimizing
under these trace conditions hasn't been a priority.

dgp added on 2012-02-09 23:25:03:
Ok, that's it.  This oldest style of execution tracing
wants to have available the command it is tracing
in string form.  Most of the time this isn't a big demand,
because the command started in string form, and we just
have to get to it.

But for ensemble subcommand dispatch, the redirected
subcommand execution is done by building an objv
array and directly calling to Tcl_EvalObjv().  When the
the string version is needed, a list is built and the string
rep of that list is generated, costing time proportional to
the total length of all the argument values.

dkf added on 2012-02-09 23:02:54:
I've attached a profiler trace captured while doing a run of the slowest version from my previous comment. ($s is 50k chars)

dkf added on 2012-02-09 23:01:23:

File Added - 435337: sample.txt

dkf added on 2012-02-09 22:59:41:
Most of cost is due to TEOV_RunEnterTraces, which spends a lot of time converting something to a string.

Timings:

% testcmdtrace resulttest {time {string range $s 2000 2001} 100000}
590.6396843900001 microseconds per iteration
% time {string range $s 2000 2001} 100000
24.18864962 microseconds per iteration
% time {::tcl::string::range $s 2000 2001} 100000
12.64128215 microseconds per iteration

dgp added on 2012-02-09 22:55:35:
Further probing demonstrates the 
{string range ....} script is getting compiled
only one time.  It compiles to direct invocation
of the string command procedure, which is
NsEnsembleImplementationCmd, because "string"
is implemented as an ensemble.  This points to
performance problems there.

A possible alternative would be to have TclCompileEnsemble()
compile to direct invocation of ::tcl::string::range instead.  To
go that way would required an examination of the epoch implications,
since changes in any ensemble dispatch map would then have the
potential to invalidate bytecode.  (That may already be true).

If we go that way, there may be refactoring opportunities in breaking
out from TclCompileScript() the central engine that compiles a single
command so that it could be called by TCE as well.

sebres added on 2012-02-09 17:25:25:
I noticed something else: why recompile or parse is dependent on length of string "s"?
It should still parse the token "$s" and not the value of "s". Or I'm way off here?

sebres added on 2012-02-09 17:12:43:
Yep, 
compared to "length", by the "range" it returns in TclCompileEnsemble with TCL_ERROR after checking (... cmdPtr->compileProc == NULL), and go further to recompile in TclCompileScript.

So, either EACH compiled ensemble should implements ALL compiled subcommands also, or in TclCompileEnsemble (or hereafter in TclCompileScript) should be supported a direct call "cmdPtr->objProc" of the subcommand. 
At least by DONT_COMPILE_CMDS_INLINE or TCL_ALLOW_INLINE_COMPILATION.

I will try to build it later, but each quick solution or workaround are welcome... 
PS. cause I should as soon as possible complete my profiling :)

sebres added on 2012-02-09 15:55:39:
No hard evidence, but in debugger by brreakpoint in NsEnsembleImplementationCmd
it breaks 5 times by "testcmdtrace resulttest {time {string range $s 2000 2001} 5}" 
and zero times by "testcmdtrace resulttest {time {string length $s} 5}"

Don, you got right: ensemble "string range" will be recompiled each time in "time" body.
But why "range" only?

sebres added on 2012-02-09 14:56:20:
by the way, I could not make runtest with windows/MVS 6.0.

..\generic\tclTest.c(5347) : error C2065: 'intptr_t' : undeclared identifier
..\generic\tclTest.c(5347) : error C2146: syntax error : missing ')' before identifier 'Tcl_GetCurrentThread'

my quick workaround :

#if (defined( _MSC_VER ))
typedef void * intptr_t;
#endif

sebres added on 2012-02-09 05:21:52:
> The subcommands that don't compile [string range,first] show
> the problem.
But how you explain, that pure tcl proc is faster as "ensemble" by non compiled subcommands? Anything is wrong in ensemble call.
%
if {[info command ::org_string] == {}} {
  rename ::string ::org_string
  proc ::string {args} {{*}[lreplace $args 0 0 "::tcl::string::[lindex $args 0]"]}
}
.... the same test code ....
%

dgp added on 2012-02-09 04:44:02:
Reproduction using just what's in the 
source distro:

$ make runtest
...
% set s [string repeat A 50000]; puts ok
ok
% testcmdtrace resulttest {time {string range $s 2000 2001} 1000}
171.787 microseconds per iteration
% testcmdtrace resulttest {time {tcl::string::range $s 2000 2001} 1000}
2.357 microseconds per iteration

dgp added on 2012-02-09 04:25:03:
The bytecompiled subcommand [string length] is good.

The subcommands that don't compile [string range,first] show
the problem.

I suspect that every iteration of [time] is re-attempting the
compile.  No hard evidence yet.

dgp added on 2012-02-09 04:00:53:
Another releated Q, is it all ensembles
that have this issue, or only those that
are byte compiled?

dgp added on 2012-02-09 03:59:44:
ok, reverse the question then.  Is the problem
already corrected on the tip of the core-8-5-branch?

sebres added on 2012-02-09 02:28:01:
> Don: original release 8.5.11

What is funny - but even tcl proc is "in trace" faster than "ensemble":
%
if {[info command ::org_string] == {}} {
  rename ::string ::org_string
  proc ::string {args} {{*}[lreplace $args 0 0 "::tcl::string::[lindex $args 0]"]}
}
.... the same test code ....
%

The result is :
ok
--- string range ----
2.04 microseconds per iteration
7.8 microseconds per iteration
--- string first ----
1.87 microseconds per iteration
7.8 microseconds per iteration
--- string length ---
0.32 microseconds per iteration
6.2 microseconds per iteration
--- test ------------
3.1 microseconds per iteration
1.6 microseconds per iteration

dgp added on 2012-02-09 02:10:03:
Does the 8.5.11 release have this problem,
or only the tip of the core-8-5-branch ?

sebres added on 2012-02-09 01:51:16:
the difference between execution times of "ensemble" and "not ensemble" is then greater if string "s" is longer, and vice versa.

Attachments: