Tcl Source Code

Ticket UUID: 1860727
Title: PCRE optional regexp
Type: Patch
Version: TIP Implementation
Submitter: hobbs
Created on: 2007-12-30 01:12:44
Subsystem: 43. Regexp
Assigned To: aku
Priority: 6
Severity: Minor
Status: Open
Last Modified: 2017-11-17 14:52:47
Resolution: None
Closed By: nobody
Closed on:
Description:
Attached is a diff that adds a configure --with-pcre option, as well as -type classic|pcre -binary options to [regexp] (available in either build, only functional with --with-pcre).

--with-pcre=/path/to/pcre (or have it installed in a "default" location).

Initial testing shows that PCRE is significantly faster than the classic Spencer engine in all cases.
User Comments:

sebres added on 2017-11-17 14:52:47:

I've rebased it onto a newer 8.5 (many conflicts resolved), relative to my branch sebres-8-5-timerate (in order to test performance as well).

Currently available on my github (see artificial PR sebres/tcl#5).

Compared to the original variant provided by Jeffrey in patch pcre-20080121, it is complete:
many bugs are fixed, regsub now works as expected, it is more robust and faster,
and it has better backwards compatibility with the classic NFA regexp of Tcl, e.g. UCP (Unicode properties for \d, \w, etc.) and many others.

Additionally:

  • better UTF8 and UCP support;
  • types are compiled now;
  • I've fixed all known bugs (of the classic regexp) for this new engine;
  • added DFA type (still draft, but it works well), thus for example
    `regexp -type dfa -inline {a|ab|abc} -abc-` returns dfa-alternatives `{abc ab a}` ;)
  • new test-cases added in order to explain differences (advantages and disadvantages) of all 3 variants.
  • I also have a Windows build (auto-scripts and a makefile for Windows), so if somebody needs it, just ask :)
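
For readers without the patched build, the alternatives result in the DFA example above can be approximated in stock Tcl. The sketch below is only an illustration and is not taken from the patch: `dfaAlternatives` is a hypothetical helper, it handles literal alternatives only, and it assumes the DFA type reports every alternative that matches at the leftmost match position, longest first, as the `{abc ab a}` output suggests.

```tcl
# Hypothetical helper (not part of the patch): emulate the "all alternatives"
# result of the proposed -type dfa for LITERAL alternatives, using stock Tcl.
proc dfaAlternatives {alternatives subject} {
    set len [string length $subject]
    # Scan for the leftmost position at which any alternative matches.
    for {set pos 0} {$pos < $len} {incr pos} {
        set rest [string range $subject $pos end]
        set hits {}
        foreach alt $alternatives {
            # Compare only the first [string length $alt] characters.
            if {[string equal -length [string length $alt] $alt $rest]} {
                lappend hits $alt
            }
        }
        if {[llength $hits]} {
            # Assumed ordering: longest alternative first.
            return [lsort -command {apply {{a b} {
                expr {[string length $b] - [string length $a]}
            }}} $hits]
        }
    }
    return {}
}

puts [dfaAlternatives {a ab abc} -abc-]
```

The classic and PCRE types would report a single match here, which is why the DFA type is listed separately in the new test cases.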

To do:

  • I should review and rewrite a few pieces of code to make them easier to understand and to avoid code duplication;
  • `regexp -binary` for real binary capability through PCRE;
  • `regexp -dict` for capturing named groups of PCRE;
  • provide a real DFA workspace (at the moment fixed-size on the stack) that is reallocated on demand if it is too small;
  • documentation.

As regards performance, both PCRE and DFA are much faster than the classic NFA of Tcl (up to 10 times, and even faster on large regexps). Here is an excerpt as a foretaste:

% timerate -calibrate {}

% foreach t {c p d} {
    proc test_$t {} \
      [string map [list _REENG_ $t] \
        {puts _REENG_:[timerate {regsub -type _REENG_ -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]} ]
    puts "% [info body test_$t]"; test_$t
  }
% puts c:[timerate {regsub -type c -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
c:38.3610 µs/# 26049 # 26068.1 #/sec 999.266 nett-ms
% puts p:[timerate {regsub -type p -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
p:1.413072 µs/# 693365 # 707677 #/sec 979.775 nett-ms
% puts d:[timerate {regsub -type d -all -line {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$} "10-10-2017\nNULL\n20-10-2017" {**\1**\2**}}]
d:1.205143 µs/# 810168 # 829777 #/sec 976.368 nett-ms

Note that DFA does not really have back-references here (it matches alternatives instead); regsub is used only to minimize the overhead of some Tcl internals during measurement (such as setting variables or creating the lists for -indices or -inline).

Tested with PCRE 8.40.
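
To reproduce at least the classic-engine baseline on an unpatched Tcl, something like the following minimal sketch could be used. It relies on the stock `time` command rather than `timerate`, and it omits `-type` because that switch exists only with the patch. `benchClassic` is a hypothetical name, and the timings will of course differ from the figures quoted above.

```tcl
# Hypothetical helper: time the benchmark's regsub with the stock (classic)
# engine only. No -type switch is used because it exists only in the patch.
proc benchClassic {{iterations 10000}} {
    set re {^((\d{2})-(\d{2})-(\d{2,4})|NULL)$}
    set subject "10-10-2017\nNULL\n20-10-2017"
    # [time] returns "<n> microseconds per iteration"; keep only the number.
    set usec [lindex [time {
        regsub -all -line $re $subject {**\1**\2**}
    } $iterations] 0]
    # Also return the substituted string so the result can be inspected.
    return [list $usec [regsub -all -line $re $subject {**\1**\2**}]]
}

lassign [benchClassic] usec result
puts "classic: $usec microseconds per iteration"
puts $result
```

Returning the substituted string alongside the timing makes it easy to check that a different engine produces the same output before comparing speeds.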

If the TCT is interested, I'll rebase it onto fossil as soon as possible and provide my 8.6 and 8.7 branches for this. I'd rather spare myself this (rebase) work if nobody needs it.

Ah, yes, don't forget: Thanks to Jeffrey for the original work!


hobbs added on 2008-01-22 11:47:36:

File Added - 263244: pcre-20080121.diff.gz

Logged In: YES 
user_id=72656
Originator: YES

Updated to have --enable-pcre=yes|no|default.  If default is used, then PCRE will be the default engine.  --with-pcre still exists to point to a non-standard location.

Fixed a -indices issue, and updated the test suite.  The remaining test issues mostly represent differences in line anchor styles.
File Added: pcre-20080121.diff.gz

hobbs added on 2008-01-03 05:16:04:

File Deleted - 260247:

File Added - 260524: pcre.diff4.gz

Updated version that doesn't leak the study'd pcre info, corrects more tests and is generally better, so just use it.
File Added: pcre.diff4.gz

hobbs added on 2008-01-01 03:16:18:

File Added - 260318: pcre.diff3.gz

New patch that calms some tests and fixes a [lsearch -regexp] crash condition.  Note that any call that uses Tcl_GetRegExpFromObj with a NULL interp (as lsearch -regexp does) can't check the [interp regexp {} pcre] state.

In this version, you can set the environment variable TCL_REGEXP_PCRE to have PCRE enabled by default in Tcl interps.
File Added: pcre.diff3.gz

hobbs added on 2007-12-31 09:15:23:

File Added - 260247: pcre.diff2.gz

Updated version that has cleaner integration.  The conversion of RE compile flags is done at the caching of the object.

This version includes fully correct handling in [regsub] (you'll find that it is mostly transparent to Tcl_RegsubObjCmd), with support for the whole Tcl_GetRegExpFromObj/Tcl_RegExpExecObj/Tcl_RegExpGetInfo path of execution being handled 100% transparently for classic or PCRE REs.

The translation of flags needs to be better reconciled between Spencer's flag meanings and PCRE's (like TCL_REG_NLSTOP TCL_REG_NLMATCH == ??? in PCRE).
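
To illustrate the reconciliation problem, here is a rough script-level sketch, not taken from the patch, of how the [regexp] line-handling switches might map onto PCRE inline options. `pcreOptionPrefix` is a hypothetical helper. It assumes that PCRE's `(?m)` corresponds to -lineanchor and that `(?s)` is needed whenever -linestop is absent, since `.` matches a newline by default in Tcl but not in PCRE. It does not cover bracket expressions such as `[^a]`, or `$` before a trailing newline, where the two engines also differ.

```tcl
# Hypothetical helper (not part of the patch): translate [regexp] switches
# into a PCRE inline-option prefix such as (?ms).
proc pcreOptionPrefix {switches} {
    set nocase 0; set expanded 0; set nlstop 0; set nlanch 0
    foreach sw $switches {
        switch -exact -- $sw {
            -nocase     { set nocase 1 }
            -expanded   { set expanded 1 }
            -linestop   { set nlstop 1 }
            -lineanchor { set nlanch 1 }
            -line       { set nlstop 1; set nlanch 1 }
            default     { error "unsupported switch \"$sw\"" }
        }
    }
    set opts ""
    if {$nocase}   { append opts i }
    # -lineanchor: ^ and $ match at line boundaries (PCRE multiline).
    if {$nlanch}   { append opts m }
    # Without -linestop, Tcl lets "." match a newline (PCRE dotall).
    if {!$nlstop}  { append opts s }
    if {$expanded} { append opts x }
    if {$opts eq ""} { return "" }
    return "(?$opts)"
}

puts [pcreOptionPrefix {-line}]
puts [pcreOptionPrefix {}]
puts [pcreOptionPrefix {-nocase -linestop}]
```

Under these assumptions -line becomes `(?m)`, the default becomes `(?s)`, and -nocase with -linestop becomes `(?i)`. The remaining mismatches are the kind of line-anchor differences a flag translation alone cannot hide.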
File Added: pcre.diff2.gz

hobbs added on 2007-12-30 09:04:26:

File Deleted - 260122:

hobbs added on 2007-12-30 09:04:25:

File Added - 260151: pcre.diff.gz

Updated version that adds:
    interp regexp {} ?classic|pcre?

So set the default engine with [interp regexp {} pcre].  I've also added support in Tcl_RegExpExecObj to recognize compiled PCREs so that the compile case works.  It currently assumes -binary operation by default.

In the lmbench grep.tcl code, you need to add:

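# [interp regexp] exists only in a Tcl built with this patch, so the
# [catch] doubles as a feature test.
if {![catch {interp regexp {}}]} {
    puts stderr "PCRE regexp"
    # Make PCRE the default engine for the current interpreter.
    interp regexp {} pcre
} else {
    puts stderr "TCL regexp"
}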

and then it will work as before, just faster.
File Added: pcre.diff.gz

hobbs added on 2007-12-30 08:12:44:

File Added - 260122: pcre.diff.gz

Attachments: