Tcl Source Code

View Ticket
Login
Ticket UUID: 3606683
Title: fix for 3604074 does not work
Type: Bug Version: current: 8.6.0
Submitter: tgl Created on: 2013-03-03 17:34:40
Subsystem: 43. Regexp Assigned To: dgp
Priority: 5 Medium Severity:
Status: Closed Last Modified: 2013-03-07 03:51:52
Resolution: Fixed Closed By: dgp
    Closed on: 2013-03-06 20:51:52
Description:
I couldn't find any way to reopen or comment on the already-closed bug
3604074, so here is a new bug report to say that the fix for that one
doesn't work: if you build with the asserts in Spencer's code enabled,
you'll find that the Tcl regression tests draw an assertion failure,
because fixempties() fails to eliminate all the EMPTY arcs.

The proximate cause of that is a small oversight in the patch: the first
sub-loop in fixempties supposes that it doesn't need to do the "nexta =
a->outchain" dance, but that's wrong because unempty is still capable of
deleting the arc immediately, if the arc is a "vacuous loop".

However, even with that repaired I don't find the fix implemented for
3604074 to be very trustworthy.  The patch will cope if expansion of a
later out-arc for the current state tries to re-add an empty out-arc *to
that state* that was already processed.  However, it doesn't detect the
case where we re-add an empty out-arc to a previously processed state.
Nor is it obvious that a removed empty out-arc can't be regenerated in a
later pass (which after all was the failure mode exhibited in 3604074).
I couldn't find a test case exhibiting an infinite loop, but I don't have
a good feeling about that.  I am also suspicious that the new code is
outright wrong, because it seems to leave out some arguably-necessary arcs
when I compare its final NFA state to the rewritten version attached.  I
don't have a test case to prove that either, but I'm suspicious that the
timing of elimination of empty arcs might not be quite right now.

After some thought and experimentation I devised a new algorithm that
clearly terminates, because it doesn't have any "repeat until done"
flavor to it.  As a bonus, it seems to be faster than the old one,
and it's clearer that it creates all the required replacement arcs.

A few miscellaneous notes:

* It was alleged in the comments to 3604074 that Postgres had fixed this
bug, but actually the reason it doesn't show up for us is only that in
http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=173e29aa5deefd9e71c183583ba37805c8102a72
we had gotten rid of the expansion of "a*" to "a+|", and it's the empty
alternative that's tickling this bug.  If you try
select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
it still fails in released Postgres versions.

* The reason I found out that your patch is broken is that as Postgres
has things set up, the assert()s in Spencer's code are enabled in
standard development builds.  This evidently isn't the case in Tcl,
but I suggest you might want to rethink that.

* While fooling with the fixempties algorithm, I found that it was very
easy to make CVE-2007-4772 (the infinite loop in pullback() for "($|^)*")
come back, if the output NFA was even slightly nonoptimal.  So I fear that
that bug is not really fixed either: it seems to be fairly easy to make
pullback() create new states as fast as it can process them.  pullback
looks vaguely similar to what fixempties is doing, so I wonder whether it
could be rewritten in a similar style to this patch.  I've not tried it
though.
User Comments: dgp added on 2013-03-07 03:51:52:

allow_comments - 1

fixed for 8.4.20, 8.5.14, 8.6.1.

dgp added on 2013-03-07 02:21:16:
Attached patch is being applied to Tcl 8.4.  It adds the
hasnonemptyarc() routine, folds the nonempty arc copy
routines into the all arc copy routines with a flag argument,
does some reworking to reduce indent levels and recasts
things back to the coding style used in Tcl 8.4.* releases
(which I think is closer to the original Spencer style).
Uploaded as a patch here in case it's useful to anyone.

dgp added on 2013-03-07 02:18:23:

File Added - 461272: 3606683.patch

tgl added on 2013-03-06 03:00:41:
> it's not clear to me what worst case performance possibilities might be lurking in the nested loops containing the emptyreachable() call

FWIW, this version is faster than the old code in my tests (the special cases with "doomed" states were required to make it so, else I'd not have bothered with those).  I was mostly testing with regexes like "(((((a+|)+|)+|)+|)+|)", and it's possible that I somehow optimized for these cases at the expense of other ones, but there's nothing in the code that seems like it ought to be very specific to any particular regexp.

dgp added on 2013-03-05 23:36:21:
I'm satisfied that if the new routines have
a correctness problem, I'm not able to find it.

It's also clear that the new routines do not suffer
from the infinite loop hazard found in the former
version.

There might well be opportunities for improvement
remaining, and it's not clear to me what worst case
performance possibilities might be lurking in the
nested loops containing the emptyreachable() call,
but I think such concerns can be set aside until
demonstrations of trouble can be found.

So, I think accepting this patch is the right move.
(possible addition of hasanemptyout() pending).

I also found, as tgl reported, that Tcl Bug 1810038
was quite easy to resurrect when making mods to
the details of how these graphs are optimized.  I think
revising the pushfwd() and pullback() routines so that
their infinite loop hazards are also removed is well
worth doing.

dgp added on 2013-03-05 23:25:46:
The recursion in emptyreachable() gives me
brief concern, but it's no worse than the recursion
already present in other traversal routines.

tgl added on 2013-03-05 23:02:43:
Yeah, that's a bit handwavy, and I don't fully understand the effects either.  But one of the things I tried was removing that logic altogether and just always pulling arcs in the same direction, and that definitely produced inferior results (for one thing, it broke pullback()).  So there's some method in Henry's madness there, even if I didn't explain it quite right.

dgp added on 2013-03-05 22:59:06:
It may not matter much, but I do not understand
the selection argument made in the comment in
the replaceempty() routine regarding larger fans
creating greater likelihood of making states
deleteable.

tgl added on 2013-03-05 22:43:55:
The flag field is only nonzero for the special "pre" and "post" states that represent the NFA's starting and ending points.  It's not valid to remove either one (at least not without creating a replacement), so the tests are there to ensure we don't try to do so.

The previous coding had an assert() that seemed to imply that the pre and post states could never have empty arcs attached at all, but I think that's wrong --- at least, if I remember the testing I was doing over the weekend, this code can migrate empties to be attached directly to the pre state, and I don't see a reason why that's not ok.  They'll be gone anyway when fixempties is done.

dgp added on 2013-03-05 22:38:19:
I see that the "flag" field of the state struct plays
a role in fixempties() that it did not before.  Was 
this just another way that fixempties() was broken?
Or is the new preservation of states with nonzero flag
values a misstep?  Are there test cases to demonstrate
the correctness either way?

tgl added on 2013-03-05 22:37:37:
> A hasanemptyout(s) routine that didn't uselessly count them all to report existence would be an improvement.

Yeah, I considered that but didn't feel it was worth the trouble.  But if you want to add one, no objection here.

dgp added on 2013-03-05 22:31:34:
The (nonemptyouts(s2) > 0) test in fixempties seems
mildly wasteful.  A hasanemptyout(s) routine that didn't
uselessly count them all to report existence would be an
improvement.

dgp added on 2013-03-04 23:12:27:
I wrote a long comment, then via a network glitch,
lost it.

Summarized, it was "Yup."

Will review the patch.

tgl added on 2013-03-04 00:34:41:

File Added - 461167: tcl-regex.patch

Attachments: