Tcl Source Code

View Ticket
Login
Ticket UUID: 2992970
Title: Slow performance of [append]
Type: Bug Version: obsolete: 8.6b1.1
Submitter: jgodfrey Created on: 2010-04-27 13:53:18
Subsystem: 12. ByteArray Object Assigned To: dgp
Priority: 7 High Severity: Minor
Status: Closed Last Modified: 2014-01-16 22:39:46
Resolution: Fixed Closed By: dkf
    Closed on: 2014-01-16 22:39:46
Description:
The attached binary reader script shows a huge performance difference between 8.5 and 8.6b1.1 (both ActiveState builds, under Win7 x64).  Under 8.5 the script completes in about 0.5 seconds.  Under 8.6b1.1, it takes nearly 14.5 seconds.  Removing the [append buffer] from the code, the timings are more similar, though 8.6b1.1 is still about 20-25% slower than 8.5.
User Comments: dkf added on 2014-01-16 22:39:46:

That patch seems OK to me (except that code in a block comment should have a * somewhere in front of it IMO; that's part of the comment indication — this is the most trivial of quibbles, as it is in the meta-information related to a comment!).


dgp added on 2014-01-15 19:06:56:
See branch bug-2992970.

Turns out that there was no need to close off
support for the Tcl_AppendObjToObj(x,x) degeneracy
after all, so I've restored it.

dgp added on 2014-01-15 18:11:14:
Postscript on this bug fix.

The creation of the new routine TclAppendBytesToByteArray()
that includes a smart(er) bytearray growth algorithm was
effective in restoring efficient appends to pure bytearrays; Yay!

As a minor point though, there is a comment in the part of
Tcl_AppendObjToObj() that handles this case to the effect
that you can reliably depend on

    Tcl_AppendObjToObj(x, x);

to work.  The new implementation does not in fact preserve
that guarantee.  Specifically on the current trunk, the
value of bytesSrc passed to TABTBA() is at risk of becoming
invalid by the time it is used, due to reallocs that TABTBA
might perform.

Now in a Copy On Write system like Tcl has, I can't imagine
when anyone would want to call

    Tcl_AppendObjToObj(x, x);

I cannot find any way to induce any of the existing T_AOTO()
calls in the Tcl sources to attempt it, and that likely explains
why the test suite doesn't cover the possibility.

My preference is to remove the (now false) comment, and possibly
even add a panic detecting any attempt to append a value to itself
declaring it to be a CoW violation.  Any reason not to do that?

dkf added on 2010-04-30 16:26:42:
Yes. I copied the algorithm from tclStringObj.c :-)

dgp added on 2010-04-30 06:18:22:
Since the routine for appending to bytearrays is new,
I'll ask, does it handle failure on length overflow gracefully?
What happens when (used + len) > INT_MAX?

Updating the other string growth routines to handle such matters
was one of the tasks in 8.6 development.

dgp added on 2010-04-30 05:40:48:
Just curious whether anyone tried the

#define COMPAT 1

suggestion at the top of tclStringObj.c to see
whether that had any effect on this test script?

I'm guessing it would not help, since the problem
turned out to be that the "optimizations for bytearrays"
added to the code for 8.6 were simply not complete.

ferrieux added on 2010-04-30 02:34:29:
At a guess, I'd say that a realloc is costly only when another allocated block sitting after the current one prevents its growth. So things get very dependent on other allocations in the loop *and* on the system malloc()'s strategy for picking a new address (eg small blocks at one end of the heap, big blocks at the other end).

Donal, do your observations support this guesswork ?

ferrieux added on 2010-04-29 23:13:03:

allow_comments - 0

Yup, told you from the beginning ;-)

I'm still not understanding why the quadratic-realloc was not hitting me on my plenty-of-RMA machine.

dkf added on 2010-04-29 22:20:58:
NB: [tcl::unsupported::representation] was really useful for this one; let me confirm that it was *not* a shimmering problem (narrowing the problem search enormously).

dkf added on 2010-04-29 22:11:58:

allow_comments - 1

dkf added on 2010-04-29 22:11:32:
OK, problem was that the code in Tcl_AppendObjToObj did not apply any kind of growth management strategy to bytearrays, despite the fact that the internal rep contained support for such a thing. Updating the code to have such a thing makes *my* Tcl build able to tackle building a 250MB bytearray a few bytes at a time in only a few seconds (about 7.5s!)

This is now quite a bit faster than 8.5 (on smaller tests; could not be bothered with larger ones. :-)

dkf added on 2010-04-29 22:00:04:
That was purely down to the amount of memory he had. :-)

jgodfrey added on 2010-04-29 21:37:34:
Donal,

AFAIK, Alex was never able to repro this using Linux.  Are you seeing it under Windows, or something else?  I wonder why it apparently works in 8.6 under Linux?

dkf added on 2010-04-29 21:22:18:
As far as I can see, it's working 8.5 because that converts everything to unicode. Oh well.

dkf added on 2010-04-29 21:18:50:
And the problem is now exactly that Tcl_AppendObjToObj doesn't use a sensible growth strategy, so the amount of work done is quadratic. (If only I could see/remember why it works in 8.5...)

dkf added on 2010-04-29 21:09:37:
Looks like [append] no longer handles pure bytearrays efficiently.

dkf added on 2010-04-29 21:00:09:
Scaled down as suggested, I see the problem comparing 8.5.2 and the HEAD (as of now). Swapping is not an issue; this machine had about 450MB free throughout the whole run.

dkf added on 2010-04-29 20:52:20:

IP - Comment Removed: 130.88.1.31

dkf added on 2010-04-29 20:52:17:

IP - Comment Removed: 130.88.1.31

jgodfrey added on 2010-04-29 20:33:26:
Just tested the attached append.tcl script on a WinXP SP3 box, using 8.5.7 and 8.6b1.1, though I had to drop a zero from the main loop counter (10000).

8.5.7 - 1.8 seconds
8.6b1.1 - 3.9 minutes

So, the problem doesn't appear to be Win7 or specific hardware related...

jgodfrey added on 2010-04-29 20:29:13:
Sorry, my previous comment was attached to the wrong ticket.  Can it be removed?

jgodfrey added on 2010-04-29 20:24:08:
Just tested using 8.6b1.1 under WinXP SP3.  It also exhibits the originally described issue.  However, testing with additional fonts seems to indicate that the problem is somehow related to mono spaced fonts.  

At least, some cursory testing shows the problem exists with the Consolas and Courier fonts (mono spaced), but not with the Tahoma and Verdana fonts (proportional).

ferrieux added on 2010-04-29 14:22:11:
D'oh of course, 64 bits machine typically have big memory. Should have known that, sorry.

To proceed, then, we should select a problem size that both fits (with wide margin) into RAM *and* allows for a side-by-side comparison (absolute timings are bad).

So, can you please reduce the outer loop length to a value that allows for 8.5.8 to finish (and then halve it again, to have a margin) ? Then we'll look at the speedup (or down) between the two versions.

jgodfrey added on 2010-04-29 10:14:56:
OK, I finally have some timing info from the latest (attached) script under 8.6b1.1.  Again, I can't run it under 8.5.8 due to a fatal memory allocation error...

The script took 68.2 minutes to complete. (Win 7 x64, 6GB, Quad Core).  

At its worst (at least that I saw), the process was consuming about 450 MB of memory, though when it finished, it settled in at just under 7 MB.

jgodfrey added on 2010-04-29 06:33:04:
Alex,

Just to be clear, as I mentioned previously, I cannot even run the latest test script (using the 100000 value) under 8.5.8.  I get a fairly immediate fatal memory allocation error.

I'm currently running it under 8.6b1.1, but it hasn't yet completed. I'm guessing it's been running for 30-40 minutes.  Checking memory consumption, it's currently running at about 250 MB, but spiking over 300 MB (just via Win Task Manager)....

jgodfrey added on 2010-04-29 05:56:31:
> The only way I can explain your observation, is
> "varying external memory conditions" (other processes).

> Can you recheck with this new perspective ?

Alex,

This doesn't seem likely to explain my case for a few reasons...

- I'm running on a 6GB machine, with not much else going on at the time of testing.
- I'm running both 8.5.8 and 8.6b1.1 back to back, one after the other (in both orders), and my results seem fairly consistent in all tests.  That is, 8.5.8 outruns 8.6b1.1 by a long shot, regardless of whether it's run first or second.

Am I missing the point of your comments?  Apparently, no one else has been able to recreate this I guess?  Has anyone been able to test on similar OS / hardware combo?

Anyway, I'll do some more testing with memory in mind and see if I notice anything worth mentioning.

Thanks.

ferrieux added on 2010-04-29 05:51:11:
More precise memory figures, using [exec pmap [pid]] before exit, and making 'buffer' global so that the pmap sees it.

8.6 HEAD:                        252984K
8.6 20090923 ~AT86:  242404K
8.5 HEAD:                        1392544K

ferrieux added on 2010-04-29 05:38:04:
Homing in.

Thanks to kbk's tests on a small-memory machine, we noticed that the script was readily swapping. So in all this, instead of monitoring times, which may wildly vary depending on the memory consumption of other processes, we should be monitoring memory usage (or running on 2G+ machines).

Note that the comparison also shows that 8.6's memory usage on this script is _much_ smaller than 8.5's. So, even when swapping/paging occurs, 8.6 should keep an advantage. The only way I can explain your observation, is "varying external memory conditions" (other processes). 

Can you recheck with this new perspective ?

ferrieux added on 2010-04-29 04:07:59:

File Added - 372154: append.tcl

ferrieux added on 2010-04-29 04:07:23:

File Deleted - 372040:

ferrieux added on 2010-04-29 03:59:49:
Andreas, thanks for the precision. Did anything happen to the numeric parameters of the allocator (like minimum realloc) between the two versions ?

Jeff, thanks for the new figures. Uploading the updated script, since it seems to be reproducing for you, needs no big file, and is portable.

However, I don't see the same timings at all. Instead, I see a 1.4x speedup from 8.5 HEAD to 8.6 HEAD:

Version:8.5.8
Total size: 242500000
15038942 microseconds per iteration

Version:8.6b1.1
Total size: 242500000
10289783 microseconds per iteration

This is on Linux 32-bit. An interesting counter-test would be AT on Linux 32, and same on Linux 64.

andreas_kupries added on 2010-04-28 22:45:26:
"Could an Activist chime in and tell us the precise allocator choices behind those two precise versions of AT, Win7 x64 ? "

Activator, thats the label ;)

Both 8.5 and 8.6 are compiled with --enable-threads. We use whatever allocator is chosen by the win/configure for that setting (-DUSE_THREAD_ALLOC=1 -DTCL_THREADS=1 are defines coming from that). No tweaks.

jgodfrey added on 2010-04-28 20:10:57:
> Use [binary format a130 \0] to get that.

Donal - thanks.  As you may have guessed, I don't do much work with binary data... ;^)

> Let's see what you get in terms of timings with [binary format] instead of
[string repeat]

Alex,

Here's my modified proc using [binary format]...

proc ap {} {
    global sample
    set z [binary format a130 \0]
    for {set i 0} {$i<10000} {incr i} {
        foreach l $sample {
            append buffer [string range $z 0 $l]
        }
    }
    puts "Total size: [string length $buffer]"
}

Note that I changed the original 100000 to 10000.  Using the original value, 8.5 (fairly quickly) generates a fatal "unable to realloc 485000018 bytes" message. While 8.6b1.1 didn't crash, it also never completed in the 20 mins I let it run.

Reducing the value to 10000, both complete and I see speed differences in the ballpark of what I originally reported (actually, worse).

Total size: 24250000
1071639 microseconds per iteration
(Temp) 1 % set tk_patchLevel
8.5.8

Total size: 24250000
42250394 microseconds per iteration
(Temp) 1 % set tk_patchLevel
8.6b1.1

So, nearly a 40x difference.

ferrieux added on 2010-04-28 17:13:07:
Jeff, as Donal hints at ;-), my intention with /dev/zero was to obtain byte arrays, because the concatenation code paths are wildly different from the string case, *and* it is the byte array case that applies to you (fconfigure -transl binary).

So yes, by "emulate /dev/zero" I meant something like [binary format ...].

Let's see what you get in terms of timings with [binary format] instead of [string repeat]. I'll also put my hands on the sample you sent me by e-mail (thanks !).

As a side note, I'm rather suspecting a quadratic-realloc issue, considering the small size of the chunks. Possibly the two versions that you are testing don't have exactly the same configured #defines regarding the allocator underneath (zippy or not, etc). Could an Activist chime in and tell us the precise allocator choices behind those two precise versions of AT, Win7 x64 ?

dkf added on 2010-04-28 16:58:39:
That's not even close to the same thing. [string repeat 0 130] produces a *string* of zero digits (\u0030) and not a *bytearray* of zero bytes. Use [binary format a130 \0] to get that.

jgodfrey added on 2010-04-28 04:05:14:
Alex,

I'm unfamiliar with /dev/zero, but some Googling indicates that it just returns zeros when read.  With that in mind, I changed 2 lines of your "ap" proc and commented out a third - so it looks like this:

proc ap {} {
    global sample
    #set z [open /dev/zero r]
    set z [string repeat 0 130]
    #fconfigure $z -translation binary
    for {set i 0} {$i<100000} {incr i} {
foreach l $sample {
    #append buffer [read $z $l]
        append buffer [string range $z 0 $l]
}
    }
    puts "Total size: [string length $buffer]"
}

The 3 commented lines are your originals.  As I'm not sure what you're trying to test here, I can't say whether the above is still an equivalent test or not.  If you're just testing a bunch of chunked [append] calls, I guess it's OK?  Though I'm thinking if the above were valid, you probably wouldn't have used /dev/zero in the first place...  Let me know.

Anyway, I don't see any huge differences between 8.5.8 and 8.6b1.1 in my Win7 64 environment, though 8.6 is slower as shown below.

Total size: 242500000
7259428 microseconds per iteration
(Desktop) 1 % set tcl_patchLevel
8.6b1.1

Total size: 242500000
6162242 microseconds per iteration
(Desktop) 1 % set tcl_patchLevel
8.5.8

ferrieux added on 2010-04-28 03:33:53:

File Added - 372040: append.tcl

ferrieux added on 2010-04-28 03:33:23:
FWIW, I've been trying to reproduce with 8.[56] CVS HEADs on Linux (Fed 12, x86), to no avail. The attached script is even twice faster with 8.6 than 8.5, which is itself twice faster than the 8.5.7 installed on Fedora 12.

Could you do the same test (with the append.tcl script) in environments closer to yours (like ActiveTcl, Windows, 64bit...) ? You'll have to emulate /dev/zero, possibly with a [string range] on a prepared byte array.

andreas_kupries added on 2010-04-28 01:36:53:
I just did a disassembly for 8.5/HEAD and 8.6/HEAD.

They are the same, modulo whitespace and location in memory.

Both revisions compile the "append" command into a single bytecode instruction, i.e.

    (221) appendScalar1 %v4     # var "buffer"

I used tcl::unsupported::disassemble to get the bytecode.

hobbs added on 2010-04-28 01:09:39:
Someone should do a disassemble to ensure the append in 8.6 for some reason isn't getting byte-compiled.  That would be account for about that order of slowdown.

andreas_kupries added on 2010-04-28 00:28:30:
Yeah, their look is an SF GUI problem. In each iteration of the interface they seem to introduce at least one broken feature, or break something which worked fine.

jgodfrey added on 2010-04-27 23:50:21:
Andreas,

Thanks - I see it now.  It's not clear (at least in FireFox) that the section headers below (such as "Attached File") actually *do* something.  They don't look like links nor does the cursor change when I hover over them.

Bottom line, I needed to click on the "Attached File" text to open the section.  There, I can add additional files, though I don't think it's necessary in this case.

andreas_kupries added on 2010-04-27 23:40:03:
"I can only add 1 attachment to a ticket? "

No, it should be possible to attach as many as you like.

There is however a size limitation on the individual files IIRC (and maybe on the total set of files, but that is speculative, i.e. while I have never seen such, I can imagine its existence).
I do not remember right now where the limit is.
Compression may help.

In the 'Attached File' section of the report is a button/link 'Add a file'.

jgodfrey added on 2010-04-27 23:20:06:
Alex,

The chunks are quite small, due to the file format.  They never exceed 128 bytes each, though there are *lots* of chunks.  In fact, the file in question contains over 110,000 chunks.  I recorded them in a file, but I don't see how to attach another file via the SF web interface - I must be missing it...

Below, I've included an excerpt from the file.  If you need anything else, just ask.

8
128
128
128
128
128
32
8
128
128
24
8
24
8
64
8
16
8
8
128
128
104
8
60
8
12
8
128
112
8
90
8
128
128
128
68
8
4
8

ferrieux added on 2010-04-27 23:03:47:
Since everything happens in binary (hence no UTF8 conversion is involved), I suspect we can reproduce with /dev/zero. The only varying factor I guess is the pattern of sizes of the chunks you are appending. So maybe you can just give realistic figures of the sizes, or even the top of the histogram (or attach a text file of the chunk sizes if you manage to ;-).

jgodfrey added on 2010-04-27 20:56:37:
Hmmm... I expected to attach both the Tcl script and the sample binary file that it reads, however it seems I can only add 1 attachment to a ticket?  In that case, I can make the binary file available as needed - just ask.

jgodfrey added on 2010-04-27 20:53:19:

File Added - 371999: extractPayload.tcl

Attachments: