Tcl Source Code: View Ticket

Ticket UUID:	3606683
Title:	fix for 3604074 does not work
Type:	Bug	Version:	current: 8.6.0
Submitter:	tgl	Created on:	2013-03-03 17:34:40
Subsystem:	43. Regexp	Assigned To:	dgp
Priority:	5 Medium	Severity:
Status:	Closed	Last Modified:	2013-03-07 03:51:52
Resolution:	Fixed	Closed By:	dgp
		Closed on:	2013-03-06 20:51:52
Description:	I couldn't find any way to reopen or comment on the already-closed bug 3604074, so here is a new bug report to say that the fix for that one doesn't work: if you build with the asserts in Spencer's code enabled, you'll find that the Tcl regression tests draw an assertion failure, because fixempties() fails to eliminate all the EMPTY arcs. The proximate cause of that is a small oversight in the patch: the first sub-loop in fixempties supposes that it doesn't need to do the "nexta = a->outchain" dance, but that's wrong because unempty is still capable of deleting the arc immediately, if the arc is a "vacuous loop". However, even with that repaired I don't find the fix implemented for 3604074 to be very trustworthy. The patch will cope if expansion of a later out-arc for the current state tries to re-add an empty out-arc to that state that was already processed. However, it doesn't detect the case where we re-add an empty out-arc to a previously processed state. Nor is it obvious that a removed empty out-arc can't be regenerated in a later pass (which after all was the failure mode exhibited in 3604074). I couldn't find a test case exhibiting an infinite loop, but I don't have a good feeling about that. I am also suspicious that the new code is outright wrong, because it seems to leave out some arguably-necessary arcs when I compare its final NFA state to the rewritten version attached. I don't have a test case to prove that either, but I'm suspicious that the timing of elimination of empty arcs might not be quite right now. After some thought and experimentation I devised a new algorithm that clearly terminates, because it doesn't have any "repeat until done" flavor to it. As a bonus, it seems to be faster than the old one, and it's clearer that it creates all the required replacement arcs. A few miscellaneous notes: * It was alleged in the comments to 3604074 that Postgres had fixed this bug, but actually the reason it doesn't show up for us is only that in http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=173e29aa5deefd9e71c183583ba37805c8102a72 we had gotten rid of the expansion of "a" to "a+\|", and it's the empty alternative that's tickling this bug. If you try select 'a' ~ '((((((a+\|)+\|)+\|)+\|)+\|)+\|)'; it still fails in released Postgres versions. The reason I found out that your patch is broken is that as Postgres has things set up, the assert()s in Spencer's code are enabled in standard development builds. This evidently isn't the case in Tcl, but I suggest you might want to rethink that. * While fooling with the fixempties algorithm, I found that it was very easy to make CVE-2007-4772 (the infinite loop in pullback() for "($\|^)*") come back, if the output NFA was even slightly nonoptimal. So I fear that that bug is not really fixed either: it seems to be fairly easy to make pullback() create new states as fast as it can process them. pullback looks vaguely similar to what fixempties is doing, so I wonder whether it could be rewritten in a similar style to this patch. I've not tried it though.
User Comments:	dgp added on 2013-03-07 03:51:52: allow_comments - 1 fixed for 8.4.20, 8.5.14, 8.6.1. dgp added on 2013-03-07 02:21:16: Attached patch is being applied to Tcl 8.4. It adds the hasnonemptyarc() routine, folds the nonempty arc copy routines into the all arc copy routines with a flag argument, does some reworking to reduce indent levels and recasts things back to the coding style used in Tcl 8.4.* releases (which I think is closer to the original Spencer style). Uploaded as a patch here in case it's useful to anyone. dgp added on 2013-03-07 02:18:23: File Added - 461272: 3606683.patch tgl added on 2013-03-06 03:00:41: > it's not clear to me what worst case performance possibilities might be lurking in the nested loops containing the emptyreachable() call FWIW, this version is faster than the old code in my tests (the special cases with "doomed" states were required to make it so, else I'd not have bothered with those). I was mostly testing with regexes like "(((((a+\|)+\|)+\|)+\|)+\|)", and it's possible that I somehow optimized for these cases at the expense of other ones, but there's nothing in the code that seems like it ought to be very specific to any particular regexp. dgp added on 2013-03-05 23:36:21: I'm satisfied that if the new routines have a correctness problem, I'm not able to find it. It's also clear that the new routines do not suffer from the infinite loop hazard found in the former version. There might well be opportunities for improvement remaining, and it's not clear to me what worst case performance possibilities might be lurking in the nested loops containing the emptyreachable() call, but I think such concerns can be set aside until demonstrations of trouble can be found. So, I think accepting this patch is the right move. (possible addition of hasanemptyout() pending). I also found, as tgl reported, that Tcl Bug 1810038 was quite easy to resurrect when making mods to the details of how these graphs are optimized. I think revising the pushfwd() and pullback() routines so that their infinite loop hazards are also removed is well worth doing. dgp added on 2013-03-05 23:25:46: The recursion in emptyreachable() gives me brief concern, but it's no worse than the recursion already present in other traversal routines. tgl added on 2013-03-05 23:02:43: Yeah, that's a bit handwavy, and I don't fully understand the effects either. But one of the things I tried was removing that logic altogether and just always pulling arcs in the same direction, and that definitely produced inferior results (for one thing, it broke pullback()). So there's some method in Henry's madness there, even if I didn't explain it quite right. dgp added on 2013-03-05 22:59:06: It may not matter much, but I do not understand the selection argument made in the comment in the replaceempty() routine regarding larger fans creating greater likelihood of making states deleteable. tgl added on 2013-03-05 22:43:55: The flag field is only nonzero for the special "pre" and "post" states that represent the NFA's starting and ending points. It's not valid to remove either one (at least not without creating a replacement), so the tests are there to ensure we don't try to do so. The previous coding had an assert() that seemed to imply that the pre and post states could never have empty arcs attached at all, but I think that's wrong --- at least, if I remember the testing I was doing over the weekend, this code can migrate empties to be attached directly to the pre state, and I don't see a reason why that's not ok. They'll be gone anyway when fixempties is done. dgp added on 2013-03-05 22:38:19: I see that the "flag" field of the state struct plays a role in fixempties() that it did not before. Was this just another way that fixempties() was broken? Or is the new preservation of states with nonzero flag values a misstep? Are there test cases to demonstrate the correctness either way? tgl added on 2013-03-05 22:37:37: > A hasanemptyout(s) routine that didn't uselessly count them all to report existence would be an improvement. Yeah, I considered that but didn't feel it was worth the trouble. But if you want to add one, no objection here. dgp added on 2013-03-05 22:31:34: The (nonemptyouts(s2) > 0) test in fixempties seems mildly wasteful. A hasanemptyout(s) routine that didn't uselessly count them all to report existence would be an improvement. dgp added on 2013-03-04 23:12:27: I wrote a long comment, then via a network glitch, lost it. Summarized, it was "Yup." Will review the patch. tgl added on 2013-03-04 00:34:41: File Added - 461167: tcl-regex.patch

Attachments:

3606683.patch [download] added by dgp on 2013-03-07 02:18:23. [details]
tcl-regex.patch [download] added by tgl on 2013-03-04 00:34:41. [details]