Ticket UUID: | 505048 | |||
Title: | impossible occurs! | |||
Type: | Bug | Version: | obsolete: 8.4.5 | |
Submitter: | dgp | Created on: | 2002-01-17 19:56:27 | |
Subsystem: | 43. Regexp | Assigned To: | dgp | |
Priority: | 7 High | Severity: | ||
Status: | Closed | Last Modified: | 2003-11-18 00:53:24 | |
Resolution: | Fixed | Closed By: | dgp | |
Closed on: | 2003-11-17 17:53:24 | |||
Description: |
% regexp {\A\s*[^<]*\s*<([^>]+)>} "Newsletter Subscriber <[email protected]>" all mail error while matching regular expression: "can't happen" -- you found a bug | |||
User Comments: |
dgp added on 2003-11-18 00:53:23:
Logged In: YES user_id=80530 Back ported fixand tests for Tcl 8.4.5. dgp added on 2003-11-18 00:44:55: Logged In: YES user_id=80530 Added tests reg-33.* to test all the bugs fixed by this patch. dgp added on 2003-11-16 01:05:37: Logged In: YES user_id=80530 Thanks! Assigning back to myself for the addition of tests to the test suite, and to consider a backport for Tcl 8.4.5. pvgoran added on 2003-11-15 18:41:28: File Added - 67582: regcomp.c.diff pvgoran added on 2003-11-15 18:37:23: Logged In: YES user_id=383758 This bug results from an error in code that splits states into "progress" and "no-progress" ones. This error causes an interesting situation with the pre-collected single-linked list of states to be splitted: many items were added to the list, but only several of them are accessible from the list beginning, since the "tmp" member of struct state (which is used here to hold a pointer to the next list item) gets overwritten, which results in a "looped" chain. As a result, not all of states are splitted, and one state is splitted two times, causing incorrect "no-progress" flag values. The bug fix is checked into CVS (regcomp.c, revision 1.8). Also, a patch with this fix is attached here. mistachkin added on 2003-05-10 20:10:15: Logged In: YES user_id=113501 After further research, I have uncovered the following information: In file regexec.c on line #281, there is the following code: close = shortest(v, s, v->start, v->start, v->stop, &cold, (int *)NULL); This call is not producing the correct result for this "buggy" regexp. I came to this conclusion after I was able to obtain the correct result by manually setting the loop start variable begin to the value of v->start. As we can see here, "begin" gets it's initial value from the variable "open", which gets it's initial value from the variable "cold", which is set by function "shortest" prior to the loop. The following attempts to illustrate the flow of this very important value: cold = NULL; close = shortest(v, s, v->start, v->start, v->stop, &cold, (int *)NULL); <more code here> open = cold; cold = NULL; <more code here> for (begin = open; begin <= close; begin++) { <more code here> <this loop is extremely sensitive to the starting value of "begin"> } Further analysis reveals the reason why the variable "cold" is getting an invalid value (off by one "character", which is actually 2 bytes). This is due to function "shortest" calling function "lastcold", which iterates through the "d- >ssets" array looking for a NOPROGRESS flag. I believe this is the root cause of the problem, the NOPROGRESS flag is being set on d->sset[2] incorrectly, specifically in the "miss" function called (quite a few times) by: find --> shortest --> miss The interesting thing about this is that it appears to be a very fancy off-by-one error. It may be worthwhile to further analyze the differences between "{^blah}" and "{\Ablah}" to understand why the NOPROGRESS flag is being set incorrectly. mistachkin added on 2003-05-10 17:24:57: Logged In: YES user_id=113501 I don't know if this helps, but I traced the problem to this line (with assert enabled, that is): regexec.c, near line #318 assert(end != NULL); /* search RE succeeded so loop should */ By the way, this still applies in CVS HEAD (8.5a0). hartweg added on 2002-04-18 17:00:35: Logged In: YES user_id=338658 Just an observation that may (or may not) help track this down. If the \s* elements are removed (which still gives desired output sinces they are redundent as whitespace is also "eaten" by the [^>]* or [^b]* elements) then the error does not occur. dkf added on 2002-01-18 16:24:24: Logged In: YES user_id=79902 But substituting ^ for \A makes the problem go away; maybe the bug lies there? dkf added on 2002-01-18 16:20:31: Logged In: YES user_id=79902 Eep! Nasty, nasty, nasty... % regexp -inline {\A\s*[^b]*b} ab `h % regexp -indices -inline {\A\s*[^b]*b} ab {3 -139006} The first made me go "Eh?" and the second makes me want to hide, because it indicates that something is seriously wrong. Upping the priority. dgp added on 2002-01-18 04:11:16: Logged In: YES user_id=80530 A comment from Helmut Giese in c.l.t leads to: % regexp {\A\s*[^b]*b} ab all 1 % set all % Although no "can't happen" error, one would expect $all to be ab, but it's not. dgp added on 2002-01-18 03:16:41: Logged In: YES user_id=80530 yet simpler variations: % regexp {\A\s*([^b]*)b} ab all mail error while matching regular expression: "can't happen" -- you found a bug % regexp {\A\s*[^b]*(b)} ab all mail error while matching regular expression: "can't happen" -- you found a bug % regexp {\A(\s*)[^b]*(b)} ab all mail error while matching regular expression: "can't happen" -- you found a bug Location of the capturing parens doesn't seem to matter as long as they are there. dgp added on 2002-01-18 03:03:35: Logged In: YES user_id=80530 simpler still: % regexp {\A\s*[^<]*\s*<([^>]+)>} a<a> all mail error while matching regular expression: "can't happen" -- you found a bug dgp added on 2002-01-18 03:01:40: Logged In: YES user_id=80530 Simpler case: % regexp {\A\s*[^<]*\s*<([^>]+)>} "a <a>" all mail error while matching regular expression: "can't happen" -- you found a bug |
Attachments:
- regcomp.c.diff [download] added by pvgoran on 2003-11-15 18:41:28. [details]