Tcl Source Code

View Ticket
Bounty program for improvements to Tcl and certain Tcl packages.
Ticket UUID: 703709
Title: 'TCL_REG_CANMATCH' flag returns incorrect matches
Type: Bug Version: obsolete: 8.4.2
Submitter: vincentdarley Created on: 2003-03-14 16:23:11
Subsystem: 43. Regexp Assigned To: aku
Priority: 8 Severity: Minor
Status: Open Last Modified: 2015-11-06 21:04:46
Resolution: None Closed By: nobody
    Closed on:
The attached patch to tclTest.c and reg.test shows 
how Tcl_RegExpExecObj returns incorrect matches 
for many, many cases using the 'TCL_CAN_MATCH' 

This flag is supposed to make regexp return (if it 
can't find a complete match) the first possible 
position at which a match could occur.  However, it 
returns positions at which it is quite obvious no 
match could ever occur!

For example:

    set pat {.x}
    set line "asd asd"
    # can match the last char, if followed by x
    set res [testregexp -xflags -- c $pat $line resvar]
    lappend res $resvar

Should return '0 6' to say there is no match (0) but 
a match could possible occur at position 6 (with '.' 
matching the final 'd', if an 'x' was to be appended 
to the string).

Unfortunately it returns '0 0' which is quite wrong.
User Comments: tgl added on 2015-11-06 21:04:46:
I am not sure that there is any workable solution to this.  The problem is that you're relying on the "cold start" position returned by shortest(), which is an implementation detail that should probably never have been exposed outside the regexp engine.  It doesn't have any meaningful value for cases like this, because it reports on the last place where there was no active NFA state representing partial traversal of the regexp.  Given a regexp starting with '.', every character can match that, and so every string position beyond start-of-string will have an active NFA state indicating that we successfully traversed the '.'.  Now, unless the next character is 'x' we will fail to find any transition out of that NFA state so it gives rise to no active state after consuming the next character --- but meanwhile we matched '.' again, so there's a new instance of the same active NFA state instead.  There's no way to get to a cold-start condition as long as incomplete matches overlap, or even just are adjacent.

I don't presently see any way around that except to try to track the match-start position(s) associated with each active NFA state, which seems completely impractical, and certainly would be disastrous for string matching speed even if it could be done in a sane amount of memory.

So AFAICS the cold-start logic is just a heuristic meant to reduce the amount of brute-force searching done in simpleFind() or complicatedFindLoop(), and it was a mistake to imagine that it had semantics useful to outside callers.

vincentdarley added on 2003-10-16 20:08:51:
Logged In: YES 

One more subtlety here.  A full implementation of this flag
ought to provide two pieces of information:

(i) if there was no match, where a match could occur if
there was more text

(ii) even if there was a match, whether the match could be
longer (and possibly start earlier in the next) if there was
more text

The currently implementation handles (i) in a buggy way, and
doesn't handle (ii) at all.

vincentdarley added on 2003-04-11 23:34:55:
Logged In: YES 

Tcl now contains some 'knownBug' tests for this 
problem (reg.test).  Since tip113 has now passed, and 
can easily exhibit this bug without the need to write C 
code, I'm upping the bug priority.

vincentdarley added on 2003-03-14 23:52:29:
Logged In: YES 

No: the whole point of the TCL_REG_CAN_MATCH flag is 
to tell you where a match *might* be possible, if the 
string given was a bit longer.

dkf added on 2003-03-14 23:46:49:
Logged In: YES 

Surely if you know where the end of the string is, you know
that the match area can't go past it!  Yes?

vincentdarley added on 2003-03-14 23:24:51:

File Added - 45078: tclTest.diff

Logged In: YES 

Attaching patch to tclTest.c

vincentdarley added on 2003-03-14 23:23:36:

File Added - 45077: reg.test.diff

Logged In: YES 

Attaching patch to reg.test