Ticket UUID: | 446565 | |||
Title: | Capturing parens are very slow | |||
Type: | Bug | Version: | obsolete: 8.4.4 | |
Submitter: | nobody | Created on: | 2001-07-31 21:29:19 | |
Subsystem: | 43. Regexp | Assigned To: | pvgoran | |
Priority: | 3 Low | Severity: | ||
Status: | Open | Last Modified: | 2003-11-13 08:15:19 | |
Resolution: | None | Closed By: | ||
Closed on: | ||||
Description: |
See attached archive (sample script with text data). If you can't see attached file, look at http://ifast.narod.ru/files/regbug.zip | |||
User Comments: |
msofer added on 2001-08-13 20:04:43:
Logged In: YES user_id=148712 The originally submitted script does perform as intended, only very slowly. The use of non-capturing parens solves the issue. It now seems clear that the bug is really a performance and not a functionality issue. I am therefore changing the summary description and reducing the priority. msofer added on 2001-08-07 01:12:51: Logged In: YES user_id=148712 Indeed, I am both wrong and sorry. The "capturing parens" are very costly in this case; replacing them with the "non-capturing" variant produces the desired result. The regex becomes then {<script(?:.(?!</script>))*?(?:doubleclick|flycast|burstnet|spylog)+?.*?</script>} kofeinik added on 2001-08-04 04:28:19: Logged In: YES user_id=286083 >The regular expression >{<script(.(?!</script>))*? >(doubleclick|flycast|burstnet|spylog)+?.*?</script>} >is invalid: it contains a constraint (the lookahead) >followed by a quantifier, which is expressly forbidden in >the man page for re_syntax. You aren't right. If you follow lookahead constraint directly (NB!) with quantifier, interpreter reports an error. And my construction works fine (interpreter doesn't report any error), because lookahead atom doesn't follow any quantifier. But works quickly only on small buffer. Also, if you wait a long-long time (you can decrease text size between <script> and </script> in example), interpreter un-hungs, and return correct result. >I think the author actually wanted something like >{<script(?=.*</script>).*? >(doubleclick|flycast|burstnet|spylog).*?</script>} >which runs OK (albeit slowly; there is probably a >better way to express this). You aren't right again. :) Try: # my expression set my {<script(.(?!</script>))*?(123)+?.*?</script>} # your expression set mo {<script(?=.*</script>).*?(123).*?</script>} # sample text set text {<script> 123 </script> <script> 345 </script> <script> 123 </script>} # Correct result regsub -all $my $text --- res 2 set res --- <script> 345 </script> --- # Not correct result: regsub -all $mo $text --- res 2 set res --- --- Btw, if you know, how to get first result without using of constraint lookahead with non-greedy quantifier, let me know. :) >OTOH, the regexp engine should complain about the >bad syntax >and return an error, instead of getting into an infinite >loop. IMHO, itsn't infinite loop, but only very slow work with lookahead constraints. msofer added on 2001-08-01 05:42:46: Logged In: YES user_id=148712 The regular expression {<script(.(?!</script>))*?(doubleclick|flycast|burstnet|spylog)+?.*?</script>} is invalid: it contains a constraint (the lookahead) followed by a quantifier, which is expressly forbidden in the man page for re_syntax. I think the author actually wanted something like {<script (?=.*</script>).*?(doubleclick|flycast|burstnet|spylog).*?</script>} which runs OK (albeit slowly; there is probably a better way to express this). OTOH, the regexp engine should complain about the bad syntax and return an error, instead of getting into an infinite loop. msofer added on 2001-08-01 04:39:35: File Added - 9093: regbug.zip msofer added on 2001-08-01 04:39:34: Logged In: YES user_id=148712 Bug confirmed: tclsh seems to get in an infinite loop. Attaching the file now |
Attachments:
- regbug.zip [download] added by msofer on 2001-08-01 04:39:35. [details]