Tcl Source Code

Ticket UUID: d351de9d7b1a4c57a44a10d92e0ace9bac86786
Title: Tcl lacks lookbehind in regex expressions
Type: RFE
Version: 8.6
Submitter: anonymous
Created on: 2015-06-20 18:08:22
Subsystem: 43. Regexp
Assigned To: dkf
Priority: 6
Severity: Severe
Status: Open
Last Modified: 2015-11-12 14:41:33
Resolution: None
Closed By: nobody
Closed on:
Description:
There is no notion of lookbehind in Tcl regex expressions. Only lookahead.
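
As a rough illustration of the gap (a sketch only, not part of the original report; the sample string and variable names are made up): lookahead constraints work in Tcl 8.6, and the usual way to get the effect of a lookbehind is to match the prefix as well and capture only the part of interest.

```tcl
# Hypothetical sample input, chosen only for illustration.
set text "price: 42 USD"

# Lookahead is supported: match digits only when " USD" follows.
regexp {\d+(?= USD)} $text amount

# A lookbehind such as (?<=price: )\d+ is not available, so match the
# prefix too and pull the digits out with a capturing group instead.
# "->" is just a throwaway variable name for the full match.
regexp {price: (\d+)} $text -> digits

puts "lookahead match: $amount"
puts "captured after prefix: $digits"
```

The capture-group form is not a full substitute: the prefix becomes part of the overall match, which matters for regsub and for overlapping matches.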
User Comments: dgp added on 2015-11-12 14:41:33:
I'm not so much opposed as just wanting it understood that
a new feature will go in under different terms than bug
fixes.

If current visions play out, the next opportunity for new
regexp features would be Tcl 9.0, and there's at least some
interest in a complete overhaul to PCRE, or to a pluggable
system, at that point, so it remains to be seen what will happen
with this work.

Nevertheless, very grateful to have it, and will attempt to
get it into a branch of our dev code sometime soon-ish.

tgl added on 2015-11-11 14:51:52:
Well, as far as that goes, this code won't see much public use before Postgres 9.6 is released, which is probably a year away.  We don't put new features into minor releases either.  So I follow dgp's concern completely: you need to absorb such things according to stated project policy.

But, as a Tcl outsider, I don't want to spend much of my own time satisfying the minutiae of such policy.  My feeling is "here's some code, merge when and as you see fit".

(And on the third hand, if we allow the Postgres and Tcl versions of Spencer's code to diverge too far, it'll become impractical to trade even minor bug fixes.  So I do have some investment in seeing you merge it eventually.)

dkf added on 2015-11-11 10:13:46:
I'm less opposed than dgp, especially given that you're doing some burn-in testing.

dgp added on 2015-11-09 20:40:11:
It would be a new feature, and a pretty substantial one.

Tcl does such things only in .0 releases after an alpha/beta period.
Would not be welcome in 8.6.5.

tgl added on 2015-11-09 20:12:05:
FWIW, it's not an incompatibility any more than lookaheads were, because the syntax is something that will fail with an "invalid quantifier" error in a regex engine that doesn't know the feature.  All of the "(?...)" features have that property, since ordinarily "?" can't follow a left paren.
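
A minimal sketch of how a script could check for that property, assuming nothing beyond the standard regexp and catch commands (the proc name supports_lookbehind is made up for this example). On an engine without the feature the pattern should be rejected when it is compiled; on a build with the patch applied it should be accepted.

```tcl
# Report whether this interpreter's regexp engine accepts lookbehind syntax.
# catch returns 0 if the script ran without error and non-zero otherwise,
# so a compile-time rejection of "(?<=" shows up as a non-zero result.
proc supports_lookbehind {} {
    return [expr {![catch {regexp {(?<=a)b} "ab"}]}]
}

if {[supports_lookbehind]} {
    puts "lookbehind syntax accepted"
} else {
    puts "lookbehind syntax rejected when compiling the pattern"
}
```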

I don't know anything about the TIP process, so will leave that to you.

dgp added on 2015-11-09 19:42:50:
If we can do it with reasonable effort, this might go on
a Tcl development branch for interested people to try out.

Additions to RE syntax, though, would be an incompatibility
in need of some TIP review before adding to Tcl at some
well-defined release point.

tgl added on 2015-11-06 19:44:09:
I found a solution to the performance issues I was worried about before, so attached is a proposed patch implementing lookbehind.  This is equivalent to code already committed to Postgres at
http://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=12c9a04008870c283931d6b3b648ee21bbc2cfda

tgl added on 2015-10-17 04:06:48:
I've put up a WIP patch for lookbehind constraints in Spencer's engine at
http://www.postgresql.org/message-id/[email protected]

It works, but has some performance issues.  If anyone's interested in helping fix those, step right up ...

dkf added on 2015-09-03 12:42:26:

But then you get all the complexity of a required dependency while keeping the costs associated with maintaining the old engine, and you also have to maintain the new code (which doesn't just interact with regexp and regsub, but also switch and possibly other places too).


sebres added on 2015-09-02 14:08:09:
> Some scripts that currently work would cease to function as expected...
Not if PCRE is optional rather than the default, e.g. via an option like "-pc"; see my comment below.

dkf added on 2015-09-02 13:20:17:

> What exactly would the impacts of this change be?
The main impact is that it's a different RE language. Some scripts that currently work would cease to function as expected, even though nothing requires them to break (the main part of that impact being the different set of backslash escapes supported).

A necessary breakage would be something like the syntax for lookbehind becoming used for that purpose. We would not get upset over that.


sebres added on 2015-09-02 11:53:12:
Thanks Don, that's a good tip.

How can I download the diff.gz? It says: (file is 21230 bytes of binary data)

dgp added on 2015-09-02 11:16:31:
See also: http://core.tcl.tk/tcl/tktview/1860727

anonymous (claiming to be FireTcl) added on 2015-08-31 21:57:47:
I researched and I found that PCRE is used in the PHP programming language, the Apache server and R. 
Source: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions

I don't know whether it is a good idea to make the change. Anyway, it would be interesting to know what the increase in size would be, because I understand that Tcl is normally used in embedded systems. For me the size is not so important, but I imagine that for some users it is.

I also suppose that with this change the Tcl development team could concentrate on other important parts of the language and be more focused.

What exactly would the impacts of this change be?

I suppose that one of them is performance, for example in alternation expressions. Source:
http://lh3lh3.users.sourceforge.net/reb.shtml

ferrieux added on 2015-08-31 21:15:15:
Twice the maintenance effort, maybe?

sebres added on 2015-08-31 08:48:42:
What about an additional option to use PCRE instead of the "tcl" regular expressions?

Something like:

regexp -pc <PCRE> ...
regsub -pc <PCRE> ...

dkf added on 2015-08-30 20:29:33:

Known issue. Henry Spencer never added them to his library. (Don't know why.)

Easiest fix would be to switch to using the PCRE library, which has some advantages (e.g., it's properly maintained!) but would have quite an impact on existing users of advanced RE features. And even some not-so-advanced ones. The impact's pretty severe, in some ways.

