Tcl Source Code

View Ticket
Login
Ticket UUID: 505048
Title: impossible occurs!
Type: Bug Version: obsolete: 8.4.5
Submitter: dgp Created on: 2002-01-17 19:56:27
Subsystem: 43. Regexp Assigned To: dgp
Priority: 7 High Severity:
Status: Closed Last Modified: 2003-11-18 00:53:24
Resolution: Fixed Closed By: dgp
    Closed on: 2003-11-17 17:53:24
Description:
% regexp {\A\s*[^<]*\s*<([^>]+)>} "Newsletter
Subscriber <[email protected]>" all mail
error while matching regular expression: "can't happen"
-- you found a bug
User Comments: dgp added on 2003-11-18 00:53:23:
Logged In: YES 
user_id=80530


Back ported fixand tests  for Tcl 8.4.5.

dgp added on 2003-11-18 00:44:55:
Logged In: YES 
user_id=80530


Added tests reg-33.* to test
all the bugs fixed by this patch.

dgp added on 2003-11-16 01:05:37:
Logged In: YES 
user_id=80530


Thanks!  Assigning back
to myself for the addition
of tests to the test suite,
and to consider a backport
for Tcl 8.4.5.

pvgoran added on 2003-11-15 18:41:28:

File Added - 67582: regcomp.c.diff

pvgoran added on 2003-11-15 18:37:23:
Logged In: YES 
user_id=383758

This bug results from an error in code that splits states into 
"progress" and "no-progress" ones. This error causes an 
interesting situation with the pre-collected single-linked list of 
states to be splitted: many items were added to the list, but 
only several of them are accessible from the list beginning, 
since the "tmp" member of struct state (which is used here to 
hold a pointer to the next list item) gets overwritten, which 
results in a "looped" chain. As a result, not all of states are 
splitted, and one state is splitted two times, causing incorrect 
"no-progress" flag values.

The bug fix is checked into CVS (regcomp.c, revision 1.8). 
Also, a patch with this fix is attached here.

mistachkin added on 2003-05-10 20:10:15:
Logged In: YES 
user_id=113501

After further research, I have uncovered the following 
information:

In file regexec.c on line #281, there is the following 
code:

close = shortest(v, s, v->start, v->start, v->stop, 
&cold, (int *)NULL);

This call is not producing the correct result for 
this "buggy" regexp.  I came to this conclusion after I 
was able to obtain the correct result by manually setting 
the loop start variable begin to the value of v->start.  
As we can see here, "begin" gets it's initial value from 
the variable "open", which gets it's initial value from the 
variable "cold", which is set by function "shortest" prior 
to the loop.  The following attempts to illustrate the 
flow of this very important value:

cold = NULL;
close = shortest(v, s, v->start, v->start, v->stop, 
&cold, (int *)NULL);

<more code here>

open = cold;
cold = NULL;

<more code here>

for (begin = open; begin <= close; begin++) {
  <more code here>
  <this loop is extremely sensitive to the starting value 
of "begin">
}

Further analysis reveals the reason why the 
variable "cold" is getting an invalid value (off by 
one "character", which is actually 2 bytes).  

This is due to function "shortest" calling 
function "lastcold", which iterates through the "d-
>ssets" array looking for a NOPROGRESS flag.  

I believe this is the root cause of the problem, the 
NOPROGRESS flag is being set on d->sset[2] incorrectly, 
specifically in the "miss" function called (quite a few 
times) by:

find --> shortest --> miss

The interesting thing about this is that it appears to be 
a very fancy off-by-one error.

It may be worthwhile to further analyze the differences 
between "{^blah}" and "{\Ablah}" to understand why the 
NOPROGRESS flag is being set incorrectly.

mistachkin added on 2003-05-10 17:24:57:
Logged In: YES 
user_id=113501

I don't know if this helps, but I traced the problem to 
this line (with assert enabled, that is):

regexec.c, near line #318

assert(end != NULL); /* search RE succeeded so loop 
should */

By the way, this still applies in CVS HEAD (8.5a0).

hartweg added on 2002-04-18 17:00:35:
Logged In: YES 
user_id=338658

Just an observation that may (or may not) help track this 
down. If the \s* elements are removed (which still gives 
desired output sinces they are redundent as whitespace is 
also "eaten" by the [^>]* or [^b]* elements) then the error 
does not occur.

dkf added on 2002-01-18 16:24:24:
Logged In: YES 
user_id=79902

But substituting ^ for \A makes the problem go away; maybe
the bug lies there?

dkf added on 2002-01-18 16:20:31:
Logged In: YES 
user_id=79902

Eep!  Nasty, nasty, nasty...

  % regexp -inline {\A\s*[^b]*b} ab
  `h
  % regexp -indices -inline {\A\s*[^b]*b} ab
  {3 -139006}

The first made me go "Eh?" and the second makes me want to
hide, because it indicates that something is seriously
wrong.  Upping the priority.

dgp added on 2002-01-18 04:11:16:
Logged In: YES 
user_id=80530

A comment from Helmut Giese in c.l.t leads to:

% regexp {\A\s*[^b]*b} ab all           
1
% set all
 
%

Although no "can't happen" error, one would
expect $all to be ab, but it's not.

dgp added on 2002-01-18 03:16:41:
Logged In: YES 
user_id=80530

yet simpler variations:


% regexp {\A\s*([^b]*)b} ab all mail
error while matching regular expression: "can't happen" --
you found a bug

% regexp {\A\s*[^b]*(b)} ab all mail
error while matching regular expression: "can't happen" --
you found a bug

% regexp {\A(\s*)[^b]*(b)} ab all mail
error while matching regular expression: "can't happen" --
you found a bug

Location of the capturing parens doesn't seem to matter
as long as they are there.

dgp added on 2002-01-18 03:03:35:
Logged In: YES 
user_id=80530

simpler still:

% regexp {\A\s*[^<]*\s*<([^>]+)>} a<a> all mail
error while matching regular expression: "can't happen" --
you found a bug

dgp added on 2002-01-18 03:01:40:
Logged In: YES 
user_id=80530

Simpler case:

% regexp {\A\s*[^<]*\s*<([^>]+)>} "a <a>" all mail
error while matching regular expression: "can't happen" --
you found a bug

Attachments: