Tcl Source Code

View Ticket
Login
Ticket UUID: 32c5740a4d3e86b1973915d92f07d5aee6104a14
Title: Undesirable [lappend] performance as LIST_MAX is reached
Type: Bug Version: trunk
Submitter: dgp Created on: 2015-11-17 18:40:08
Subsystem: 14. List Object Assigned To: nobody
Priority: 5 Medium Severity: Important
Status: Open Last Modified: 2015-11-30 18:25:02
Resolution: None Closed By: nobody
    Closed on:
Description:
Phil Brooks presented at the 2015 Tcl Conference, and later
posted to comp.lang.tcl a demonstration of [lappend] growth
of large lists, and some undesirable performance.  Here's his
demo script as it appeared in clt:

set i 0
set my_list {}
set count 0
set t0 [ clock seconds ]
# Create a list with a billion elements
while {$i < 1000000000} {
       lappend my_list "List element $i"
       incr count
       if { [expr {fmod($count, 100000000)} ] == 0.0} {
          # Show progress every 100 million elements
          set tnow [ clock seconds ]
          puts "elapsed time: [ expr { $tnow - $t0 } ] seconds"
          puts [llength $my_list]
       } elseif { $count > 403000000 } {
          # Show progress more incrementally
          if { [expr {fmod($count, 1000)} ] == 0.0} {
              set tnow [ clock seconds ]
              puts "elapsed time: [ expr { $tnow - $t0 } ] seconds"
              puts [llength $my_list]
          }
       }
       incr i
}
User Comments: dgp added on 2015-11-30 18:25:02:
Committed to trunk item 1) from the roadmap.  This takes care
of the demo script submitted.  Leaving this open for the
other items, but they're mostly matters of cosmetics.

dgp added on 2015-11-23 20:56:49:
So, the path forwards appears to be:

1) Add to Tcl_ListObjReplace() the code needed so that
attempts to realloc List structs happen where they can
be of potential benefit.

2) Then, there may or may not be reason to reconsider
the actual growth algorithm.  Experience with Tcl_ListObjAppendElement()
suggests it won't really matter.

3) Finally, it sure would be nice not to have all this
list growth code semi-duplicated multiple times. It
seems there ought to be a refactoring that puts these
matters into one place.

dgp added on 2015-11-23 20:52:27:
Looking some more, we find that Tcl_ListObjReplace() 
is getting called because the instruction 
INST_LAPPEND_LIST_STK is doing the work. This is
a recent invention.

Tcl_ListObjReplace() is not the only routine that grows
lists. Tcl_ListObjAppendElement() also does so, and is
a more direct implementation to use here.  The 
instructions we need to get that routine working for us are
INST_LAPPEND_SCALAR*  The easy path to that is to wrap
the whole test script in an [apply {{} {...}}]

Indeed this performs much much better, growing up all
the way to the LIST_MAX limit in the time it takes the
original demo to fill less than the first 200M elements.

It is not hard to make the compiler use INST_LAPPEND_SCALAR
instead of INST_LAPPEND_LIST_STK in the demonstrated case.
The problem is that the two instructions exhibit different
behaviors regarding read traces, and the test suite is not
happy with the change.  This is a long, sick, sad tale. Explore
if you like; it's not pretty. We've apparently decided the least
bad option we have is to leave things inconsistent and broken.

The next thing to examine, though, is just why Tcl_ListObjAppendElement
does so much better than Tcl_ListObjReplace. The answer appears to be
that the former takes an extra step in attempting to use the realloc()
call of the memory allocation subsystem before moving on to a full
copy and extend.  The copies are what kills us in the time, and
apparently in precisely this demo case, realloc() is able to do what
it exists to do -- grow a (virutal) memory buffer while avoiding copies
as much as it can.

dgp added on 2015-11-19 20:17:52:
Some probing finds that the slow growth takes
place in the routine Tcl_ListObjReplace(), and
examination of the code in that routine starts
uncovering other bugs such as:

http://core.tcl.tk/tcl/tktview/40f628e8e343f5e3cbef0cd41dff719ef2a11e34

dgp added on 2015-11-19 18:13:03:
Next thing I notice is the anti-idiom

if {[expr {...}]}...

The condition argument of an [if] command
is already an expression.  There's no need
to use [expr] to make it one.

It is also strange to make use of fmod() to
do integer modulus in light of the % operator.

Better code looks like:

if {$count % 100000000 == 0} {...

Somewhat surprisingly the use of [expr] makes
little difference. Bytecode compilation gets
to essentially the same instructions. Avoiding
the floating point conversion tax helps some.

A few more tweaks leads to:

set my_list {}
set count 0
set t0 [ clock seconds ]
# Create a list with a billion elements
while {$count < 1000000000} {
       lappend my_list A
       set count [llength $my_list]
       if { $count % 100000000 == 0.0 } {
          # Show progress every 100 million elements
          set tnow [ clock seconds ]
          puts "elapsed time: [ expr { $tnow - $t0 } ] seconds"
          puts $count
       } elseif { $count > 402600000 } {
          # Show progress more incrementally
          if { $count % 1000 == 0.0 } {
              set tnow [ clock seconds ]
              puts "elapsed time: [ expr { $tnow - $t0 } ] seconds"
              puts $count
          }
       }
}

dgp added on 2015-11-19 17:38:44:
With that change, the next thing to note is that
the "slow growth" bug cannot be demonstrated using
Tcl 8.5, because it still suffers from the predecessor
bug 3293874 where at the point where Tcl 8.6 starts
slow growth, Tcl 8.5 just aborts and claims the list
is already too big.

I don't recall why I chose not to backport the fix,
but I don't see a strong reason to revise that choice.
Folks needing Tcl to scale to it's max list lengths
just need to use Tcl 8.6.

dgp added on 2015-11-19 16:59:26:
The first problem with this demo is the demands
it makes.  In order to see the demonstration, a
list of ~400M elements is needed.  As written each
element has a unique value which requires 8 bytes
for the pointer, 32 bytes for the Tcl_Obj struct,
and about 20 bytes for the string value. This means
a minimum of ~24G of active RAM is needed just to 
run the demo.  Such systems are not uncommon, but
they're not yet the norm, I'd say, and in fact my
everyday development system lacks enough memory.
Since it's a linux system, all I get is death
by OOMKiller.

I believe the key demonstration doesn't need unique
element values, just a long list, so I edit the
lappend command to be:

lappend my_list A

This makes a list of 400M elements, but they are all
the same element, so we now only require ~4G to demo
the problem.  Many more folks can look at that.