Tcl Source Code

Ticket UUID: 219219
Title: greedy vs. non-greedy confusion
Type: Bug
Version: final: 8.2.3
Submitter: nobody
Created on: 2000-10-26 05:03:45
Subsystem: 43. Regexp
Assigned To: aku
Priority: 5 Medium
Severity: Minor
Status: Open
Last Modified: 2017-10-26 01:40:45
Resolution: None
Closed By: nobody
Closed on:
Description:
OriginalBugID: 4001 Bug
Version: 8.2.3
SubmitDate: '2000-01-10'
LastModified: '2000-01-27'
Severity: MED
Status: Assigned
Submitter: techsupp
ChangedBy: hobbs
RelatedBugIDs: 2866
OS: BSD
OSVersion: NetBSD-1.4.1/i386
FixedDate: '2000-10-25'
ClosedDate: '2000-10-25'


Name: hume smith

ReproducibleScript:
> tclsh
% regexp {x.*?([a-z]+)} {1234x56789word101112} a b
1
% set a
x56789w
% set b
w
% set tcl_patchLevel
8.2.3
%

ObservedBehavior:
the + was matched in a nongreedy fashion; shouldn't it be greedy?

DesiredBehavior:
% set a
x56789word
% set b
word
%

Patch:


PatchFiles:




Henry Spencer has noted some problems with mixing greedy and
non-greedy quantifiers in the new regexp code.  He's cc'ed on
this, but in the meantime, the work-around is:
    regexp {x[^a-z]*([a-z]+)} {1234x56789word101112} a b 
-- 01/10/2000 hobbs
From HS:

This is the same old problem:  people accustomed to Perl are not grasping
the idea that the whole RE is greedy or non-greedy, but *not* some mixture
of the two.  In this case, it is non-greedy since the first thing in it
which cares is non-greedy.  The + is being as greedy as it can, within the
constraints set by the behavior of the whole RE. 

In short, the behavior, while surprising, is as documented.  It's not an
outright bug; it may, however, be a misfeature.
 
-- 01/10/2000 hobbs
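
The rule HS describes can be checked directly. The following is a minimal sketch, assuming a tclsh with the same regexp engine as in the report; the variable names are illustrative only. The first call repeats the reported pattern, the second is the greedy-only work-around given above.

```tcl
set s {1234x56789word101112}

# Reported pattern: the first quantifier with a preference is the
# non-greedy .*?, so the RE as a whole prefers the shortest match.
regexp {x.*?([a-z]+)} $s a b
puts "$a $b"    ;# the report shows: x56789w w

# Work-around from the ticket: every quantifier is greedy, and the
# negated class [^a-z]* stops before the first letter by itself.
regexp {x[^a-z]*([a-z]+)} $s a2 b2
puts "$a2 $b2"  ;# x56789word word
```

The work-around behaves the same under either reading of the rules because it contains no non-greedy quantifier at all.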
User Comments:

dram added on 2017-10-26 01:40:45:
Also encountered this problem recently, with a simpler case:

  % regexp -inline {(.+?), (.+)} "foo, bar"
  {foo, b} foo b

But this is expected behaviour as stated by re_syntax(n), and PostgreSQL has a more thorough description[1].


[1] https://www.postgresql.org/docs/10/static/functions-matching.html#functions-posix-regexp  (in 9.7.3.5. Regular Expression Matching Rules)
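
For this simpler case, two rewrites avoid the mixed preferences. This is a sketch rather than anything proposed in the ticket: both rewritten patterns assume the intent was "everything before the first comma-space, then everything after it".

```tcl
set s "foo, bar"

# As reported: (.+?) comes first, so the whole RE prefers the shortest match.
set r1 [regexp -inline {(.+?), (.+)} $s]
puts $r1  ;# the comment above shows: {foo, b} foo b

# Rewrite 1 (assumption): all-greedy, with [^,]+ standing in for .+?
set r2 [regexp -inline {([^,]+), (.+)} $s]
puts $r2  ;# {foo, bar} foo bar

# Rewrite 2 (assumption): keep the non-greedy RE but anchor it, so even
# the shortest possible match has to span the whole string.
set r3 [regexp -inline {^(.+?), (.+)$} $s]
puts $r3  ;# {foo, bar} foo bar
```

Rewrite 1 changes what the first group may contain (no commas), which may or may not be acceptable; rewrite 2 keeps the original groups but only works when the whole string is meant to match.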

segeth added on 2006-10-05 21:29:16:

The problem of mixing greedy and non-greedy quantifiers still
exists in Tcl v8.4. In my opinion, mixing the two should be
possible; as it is, I have to split the RE into two, one
non-greedy and one greedy. The re_syntax manual does not
explain that you can't create a mixture of greedy and
non-greedy quantifiers.

I'd like to know: is it a documentation bug or a bug in the Tcl
interpreter?
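
The "split the RE" approach mentioned in this comment could look roughly like the sketch below, applied to the string from the original report. The two patterns and all variable names are illustrative assumptions, and `lassign` needs Tcl 8.5 or later.

```tcl
set s {1234x56789word101112}

# Step 1 (non-greedy RE): shortest run from x up to the first letter.
regexp -indices {x.*?[a-z]} $s span
lassign $span first firstLetter

# Step 2 (greedy RE): longest run of letters starting at that letter.
# With -start, \A matches at the start offset; ^ would not.
regexp -start $firstLetter {\A[a-z]+} $s word

# Stitch the two results back together.
set whole [string range $s $first [expr {$firstLetter + [string length $word] - 1}]]
puts "$whole $word"  ;# x56789word word
```

Each RE here has a single, unambiguous preference, which is why the combined result matches the DesiredBehavior in the report.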

nobody added on 2001-11-20 05:45:44:

I couldn't find anything to confirm this behavior in 'man
re_syntax(n)'.  Where has it been documented into
semi-legitimacy?

Any correlation between familiarity with Perl and noticing
this behavior is purely coincidental.  I mean, come on...
If greediness is meant to be a 'whole expression'
behavior, why isn't it implemented as a switch, like with
the '-nocase' option?   Calling it a misfeature is being
too kind, especially considering the amount of grief it
causes the people who are affected by it. ;')

dkf added on 2000-11-10 17:48:16:
Perhaps this requires a documentation change?  Well, either that or a behaviour change.  Would it be possible to have a flag to force greediness or non-greediness instead of guessing it from the first quantifier in the regular expression?  (With the default being greedy unless all the "top-level" quantifiers were non-greedy, maybe?)
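
On the question of a flag: re_syntax(n) says that a quantified atom with a bound such as {m,n} prefers the longest match even when m equals n, and that the {m,n}? form prefers the shortest. Under that rule, wrapping a whole RE in (?:...){1,1} or (?:...){1,1}? should pin its overall preference without a new switch. The sketch below applies this to the reported pattern. The wrapped patterns are an assumption drawn from the documented rule, not something proposed in the ticket, so the results are printed rather than stated; check them on the Tcl version at hand.

```tcl
set s {1234x56789word101112}

# Assumption: an outer {1,1} gives the RE as a whole a greedy preference,
# while the inner .*? keeps its own non-greedy preference.
set rc1 [regexp {(?:x.*?([a-z]+)){1,1}} $s a b]
puts "forced greedy:     $a / $b"

# Assumption: an outer {1,1}? does the opposite for an all-greedy RE.
set rc2 [regexp {(?:x.*([a-z]+)){1,1}?} $s c d]
puts "forced non-greedy: $c / $d"
```

If this behaves as the documentation suggests, it covers the "force greediness or non-greediness" case per expression, although it is far less discoverable than a command-line switch would be.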