Tcl Source Code

View Ticket
Login
Ticket UUID: 496456
Title: regexp engine takes too much RAM
Type: Bug Version: obsolete: 8.3
Submitter: kofeinik Created on: 2001-12-24 11:18:10
Subsystem: 43. Regexp Assigned To: dkf
Priority: 3 Low Severity:
Status: Closed Last Modified: 2002-01-08 23:43:36
Resolution: Wont Fix Closed By: dkf
    Closed on: 2002-01-08 16:43:36
Description:
TCL version: 8.3
Platform: windows
regexp engine takes too much RAM

after executing the code from an example (see attached 
file) the total size of allocated RAM stays on the 
level of 24 megabytes that too much (IMHO) for this 
small array.
User Comments: dkf added on 2002-01-08 23:43:36:
Logged In: YES 
user_id=79902

Further examination indicates that the memory taken is
approximately proportional to the maximum length of the
regexps, and it is not a memory leak since deleting the
BlacklistRules array recovers the memory for general use
(returning it to the OS depends on the system memory
allocator; UNIX ones mostly don't.)

Closing; it's almost certainly beyond the ability of all
current Tcl maintainers to fix this without breaking
something and there are simple workarounds (keeping the RE
size down and not using capturing parens seem to have the
best effect, though the former increases the time to search
all the REs.)

dkf added on 2002-01-02 19:00:57:
Logged In: YES 
user_id=79902

It seems that the RE compiler allocates some fairly large objects during its processing (during the coloring of the graph) 
and in a complex RE these may occupy quite a lot of memory; maybe we need to write a custom allocator for these 
things?  The alternative is to figure out a way to use smaller color arrays, but I don't understand the code anything like well 
enough to be able to do that at this point.  :^(

The details can be seen in regcustom.h and regguts.h if you're interested.  (My guess is that the problems really stem 
from the way that support for characters wider than a byte was added, but that's just a guess at this point.)

kofeinik added on 2001-12-27 23:01:41:
Logged In: YES 
user_id=286083

With concatenating of RE's i want to reduce time, needed 
for matching, because TCL compile RE only first time, and 
next matching works much faster.
And i agree, this situation is typical tradeoff, but i 
can't agree with proportion 23KB->10-17MB. I'm not right?

dkf added on 2001-12-27 22:36:04:
Logged In: YES 
user_id=79902

Experiment indicates that replacing ( with (?: shaves around 5MB off the size of the image.
Don't know how much memory the RE engine needs to allocate for all that stuff, though I
suspect that you don't want to use a single RE for this; the initial compile-time gets a bit
humungous.

Matching each RE line separately keeps the space required *much* smaller, but makes
the length of time to check an unlisted site longer after the initial compilation (which is also
much shorter.)  Looks like a time-space tradeoff...

dkf added on 2001-12-27 19:47:47:
Logged In: YES 
user_id=79902

The problem is probably related to the quantity of capturing parentheses involved.  There
may be some linkage to the use of arrays (which are *truly* not appropriate for this use;
lists are much more like what is needed) and the retention of all the regexps used in
building the main one too.  Inded, I could rewrite the code to build the regexp quite
quickly (though I've no idea how large the single resulting RE would actually be, or
whether there would be any performance gain through doing this after the initial
loading phase.)

There is possibly a need for a flag to say that no capturing at all should be done by a
regexp, (making "RE(RE)RE" equivalent to "RE(?:RE)RE") which might be useful in
situations such as this.  If that's really the case, then this is a FRQ, not a bug per se.

I'll have a look at this in detail after the holidays.

msofer added on 2001-12-24 21:41:13:
Logged In: YES 
user_id=148712

Bug replicated on linux with tcl8.3.3: VSZ is 17MB in the
original version, only 3MB when taking the [string length]
version.

kofeinik added on 2001-12-24 18:18:16:

File Added - 15088: test.zip

Attachments: