Tcl Source Code

View Ticket
Login
Ticket UUID: 446565
Title: Capturing parens are very slow
Type: Bug Version: obsolete: 8.4.4
Submitter: nobody Created on: 2001-07-31 21:29:19
Subsystem: 43. Regexp Assigned To: pvgoran
Priority: 3 Low Severity:
Status: Open Last Modified: 2003-11-13 08:15:19
Resolution: None Closed By:
    Closed on:
Description:
See attached archive (sample script with text data).
If you can't see attached file, look at 
http://ifast.narod.ru/files/regbug.zip
User Comments: msofer added on 2001-08-13 20:04:43:
Logged In: YES 
user_id=148712

The originally submitted script does perform as intended,
only very slowly. The use of non-capturing parens solves the
issue.

It now seems clear that the bug is really a performance and
not a functionality issue. 

I am therefore changing the summary description and reducing
the priority.

msofer added on 2001-08-07 01:12:51:
Logged In: YES 
user_id=148712

Indeed, I am both wrong and sorry.

The "capturing parens" are very costly in this case;
replacing them with the "non-capturing" variant produces the
desired result. The regex becomes then

{<script(?:.(?!</script>))*?(?:doubleclick|flycast|burstnet|spylog)+?.*?</script>}

kofeinik added on 2001-08-04 04:28:19:
Logged In: YES 
user_id=286083

>The regular expression 
>{<script(.(?!</script>))*?
>(doubleclick|flycast|burstnet|spylog)+?.*?</script>}
>is invalid: it contains a constraint (the lookahead)
>followed by a quantifier, which is expressly forbidden in
>the man page for re_syntax.

You aren't right. If you follow lookahead constraint 
directly (NB!) with quantifier, interpreter reports an 
error. And my construction works fine (interpreter doesn't 
report any error), because lookahead atom doesn't follow 
any quantifier. But works quickly only on small buffer.
Also, if you wait a long-long time (you can decrease text 
size between <script> and </script> in example), 
interpreter un-hungs, and return correct result.


>I think the author actually wanted something like
>{<script(?=.*</script>).*?
>(doubleclick|flycast|burstnet|spylog).*?&lt;/script>}
>which runs OK (albeit slowly; there is probably a 
>better way to express this).

You aren't right again. :)
Try:
# my expression
set my {<script(.(?!</script>))*?(123)+?.*?</script>}
# your expression
set mo {<script(?=.*</script>).*?(123).*?</script>}
# sample text
set text {<script> 123 </script> <script> 345 </script> 
<script> 123 </script>}

# Correct result
regsub -all $my $text --- res
2
set res
--- <script> 345 </script> ---

# Not correct result: 
regsub -all $mo $text --- res
2
set res
--- ---

Btw, if you know, how to get first result without using of 
constraint lookahead with non-greedy quantifier, let me 
know. :)

>OTOH, the regexp engine should complain about the 
>bad syntax
>and return an error, instead of getting into an infinite
>loop. 

IMHO, itsn't infinite loop, but only very slow work with 
lookahead constraints.

msofer added on 2001-08-01 05:42:46:
Logged In: YES 
user_id=148712

The regular expression 
 
{<script(.(?!</script>))*?(doubleclick|flycast|burstnet|spylog)+?.*?</script>}
is invalid: it contains a constraint (the lookahead)
followed by a quantifier, which is expressly forbidden in
the man page for re_syntax.

I think the author actually wanted something like
 {<script
(?=.*</script>).*?(doubleclick|flycast|burstnet|spylog).*?</script>}
which runs OK (albeit slowly; there is probably a better way
to express this).

OTOH, the regexp engine should complain about the bad syntax
and return an error, instead of getting into an infinite
loop.

msofer added on 2001-08-01 04:39:35:

File Added - 9093: regbug.zip

msofer added on 2001-08-01 04:39:34:
Logged In: YES 
user_id=148712

Bug confirmed: tclsh seems to get in an infinite loop.
Attaching the file now

Attachments: