Tcl Source Code

Ticket UUID: 578363
Title: [:xdigit:] makes RE to behave strange
Type: Bug
Version: None
Submitter: pvgoran
Created on: 2002-07-07 14:31:56
Subsystem: 43. Regexp
Assigned To: dkf
Priority: 6
Severity:
Status: Closed
Last Modified: 2002-07-29 17:56:50
Resolution: Fixed
Closed By: dkf
Closed on: 2002-07-29 10:56:50
Description:
Tcl Version: 8.4a3

Platform: Windows

Code sample:
set str {2:::DebugWin32}
set re {([[:xdigit:]])([[:space:]]*)}
puts "[regexp $re $str match xdigit spaces]"
puts "match=$match"
puts "xdigit=$xdigit"
puts "spaces=$spaces"

This gives:
1
match=2:::DebugWin32
xdigit=2
spaces=:::DebugWin32

Observed behaviour: "spaces=:::DebugWin32"

Desired behaviour: "spaces="

Comment:
It looks like the [[:xdigit:]] bracket expression causes the
[[:space:]] bracket expression to match any character.

If [[:xdigit:]] is replaced by, for example, [[:digit:]], or
[[:space:]] is replaced by [[:alpha:]], everything works correctly.
(I first noticed this problem with \s instead of [[:space:]].)
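
For comparison, the desired result can be checked against a different engine. The sketch below uses the POSIX regcomp/regexec interface from <regex.h>, not Tcl's regexp, so it only illustrates what the pattern is expected to capture; the helper name group_length is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Length of capture group `group` when `pattern` (a POSIX extended RE)
 * is matched against `subject`. Returns -1 if the pattern does not
 * compile, nothing matches, or the group did not take part in the match.
 * This exercises the POSIX engine, not the Tcl one. */
static int group_length(const char *pattern, const char *subject, size_t group)
{
    regex_t re;
    regmatch_t m[3];   /* whole match plus two capture groups */
    int len = -1;

    assert(pattern != NULL && subject != NULL);
    if (group >= 3 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 3, m, 0) == 0 && m[group].rm_so != -1)
        len = (int)(m[group].rm_eo - m[group].rm_so);
    regfree(&re);
    return len;
}
```

With the pattern and string from the code sample, group 1 should have length 1 (the "2") and group 2 should have length 0, which corresponds to the desired "spaces=".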
User Comments:
dkf added on 2002-07-29 17:44:56:

Having reviewed your second pair of patches, I've decided to go
instead with specifying the number of ranges as 3, because
hex digits are understood to be written only with the standard
western digit characters (plus the six letters in both cases,
of course). Unless there's a good reason for matching the
digit characters used in other writing systems, but then
there would also be a need for a locale-specific version of
'A-F', yes?  :^)
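
The three ranges referred to above ('0'-'9', 'a'-'f', 'A'-'F') can be sketched as a small stand-alone table. This is an illustration only, not the code in regc_locale.c; the names chr_range, xdigit_ranges, in_ranges and is_xdigit_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One inclusive character range. */
struct chr_range {
    int start;
    int end;
};

/* Hex digits limited to the standard western digits plus a-f in both
 * cases: exactly three ranges. */
static const struct chr_range xdigit_ranges[] = {
    { '0', '9' },
    { 'a', 'f' },
    { 'A', 'F' },
};
#define NUM_XDIGIT_RANGES (sizeof xdigit_ranges / sizeof xdigit_ranges[0])

/* Return 1 if c falls inside any of the n ranges, else 0. */
static int in_ranges(const struct chr_range *r, size_t n, int c)
{
    size_t i;

    assert(r != NULL);
    for (i = 0; i < n; i++) {
        if (c >= r[i].start && c <= r[i].end)
            return 1;
    }
    return 0;
}

/* Model of the xdigit class built from the three ranges above. */
static int is_xdigit_model(int c)
{
    return in_ranges(xdigit_ranges, NUM_XDIGIT_RANGES, c);
}
```

Under this model '2', 'f' and 'F' are hex digits, while ':' and 'g' are not.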

dkf added on 2002-07-29 17:06:12:

You've found the fault in the RE engine?  I'm impressed;
that code is non-trivial.  Do you want to become a
maintainer of this section?

(For future reference, single patches rooted at the top of
the CVS tree are easiest to work with by far.)

I'll now be able to have a look at fixing this problem (with
my general wherever-it's-needed maintainer hat on).

pvgoran added on 2002-07-28 23:45:01:

File Added - 27909: regc_locale.c.diff-2

pvgoran added on 2002-07-28 23:43:39:

File Added - 27908: regc_locale.c.diff-1

pvgoran added on 2002-07-28 23:42:09:

File Added - 27907: regc_cvec.c.diff

Yes, I definitely had to attach the files, since inserting them into the
comment text gives very strange formatting. Is this a bug in the
SourceForge.net software, or is it caused by my Opera browser? :)

pvgoran added on 2002-07-28 23:31:28:

This bug is caused by an error in generic/regc_cvec.c.

Patch for: File "generic/regc_cvec.c", Branch "MAIN", Revision 1.4

--- regc_cvec.c     Sun Jul 28 22:34:17 2002
+++ regc_cvec.c.new Sun Jul 28 23:15:34 2002
@@ -50,7 +50,7 @@
     cv = (struct cvec *)MALLOC(n);
     if (cv == NULL)
         return NULL;
-    cv->chrspace = nc;
+    cv->chrspace = nchrs;
     cv->chrs = (chr *)&cv->mcces[nmcces]; /* chrs just after MCCE ptrs */
     cv->mccespace = nmcces;
     cv->ranges = cv->chrs + nchrs + nmcces*(MAXMCCE+1);

It's strange that such a serious error has not been noticed until now.
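
Why the wrong chrspace value matters can be sketched with a simplified model. This is not the Tcl source: the names cvec_model, new_model, fits, real_chr_slots and free_model are hypothetical, MCCE slots are left out, and the capacity check in fits is an assumption about how a field like chrspace would be consulted when a vector is reused. The model only keeps the layout visible in the patch context, where the range area starts nchrs slots after the start of the chr area.

```c
#include <assert.h>
#include <stdlib.h>

typedef int chr;  /* stand-in for the engine's character type */

/* Hypothetical, simplified character vector: one buffer holding the
 * single-chr area first and the range area (pairs of chrs) after it. */
struct cvec_model {
    int chrspace;    /* advertised capacity for single chrs */
    int rangespace;  /* capacity for ranges */
    chr *chrs;       /* start of the single-chr area */
    chr *ranges;     /* start of the range area */
};

/* Allocate a model vector. `buggy` selects the pre-patch assignment
 * (chrspace = total slot count) instead of the patched one (nchrs). */
static struct cvec_model *new_model(int nchrs, int nranges, int buggy)
{
    size_t nc;
    struct cvec_model *cv;

    assert(nchrs >= 0 && nranges >= 0);
    nc = (size_t)nchrs + (size_t)nranges * 2;  /* total chr slots */
    cv = malloc(sizeof *cv);
    if (cv == NULL)
        return NULL;
    cv->chrs = malloc((nc > 0 ? nc : 1) * sizeof(chr));
    if (cv->chrs == NULL) {
        free(cv);
        return NULL;
    }
    cv->chrspace = buggy ? (int)nc : nchrs;
    cv->rangespace = nranges;
    cv->ranges = cv->chrs + nchrs;  /* ranges start right after nchrs slots */
    return cv;
}

/* Assumed capacity check: does the vector claim room for this request? */
static int fits(const struct cvec_model *cv, int nchrs, int nranges)
{
    return nchrs <= cv->chrspace && nranges <= cv->rangespace;
}

/* Single-chr slots that really exist before the range area begins. */
static int real_chr_slots(const struct cvec_model *cv)
{
    return (int)(cv->ranges - cv->chrs);
}

/* Release a model vector. */
static void free_model(struct cvec_model *cv)
{
    if (cv != NULL) {
        free(cv->chrs);
        free(cv);
    }
}
```

For a vector sized for zero single chrs and three ranges, the pre-patch variant advertises chrspace 6 although real_chr_slots reports 0, so a later request for 4 single chrs appears to fit and any chrs stored would land in the range area. The patched variant advertises 0 and the same request is rejected.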

I also found an inconsistency in generic/regc_locale.c. The existing
code should work without problems, but it is not correct. It can be
fixed in two ways. The first one:

Patch for: File "generic/regc_locale.c", Branch "MAIN", Revision 1.8

--- regc_locale.c       Sun Jul 28 22:33:28 2002
+++ regc_locale.c.new-1 Sun Jul 28 22:38:02 2002
@@ -842,7 +842,10 @@
 case CC_XDIGIT:
     cv = getcvec(v, 0, NUM_DIGIT_RANGE+2, 0);
     if (cv) {
-        addrange(cv, '0', '9');
+        for (i = 0; i < NUM_DIGIT_RANGE; i++) {
+            addrange(cv, digitRangeTable[i].start,
+                     digitRangeTable[i].end);
+        }
         addrange(cv, 'a', 'f');
         addrange(cv, 'A', 'F');
     }

The second one:

Patch for: File "generic/regc_locale.c", Branch "MAIN", Revision 1.8

--- regc_locale.c       Sun Jul 28 22:33:28 2002
+++ regc_locale.c.new-2 Sun Jul 28 23:15:03 2002
@@ -840,7 +840,7 @@
     }
     break;
 case CC_XDIGIT:
-    cv = getcvec(v, 0, NUM_DIGIT_RANGE+2, 0);
+    cv = getcvec(v, 0, 3, 0);
     if (cv) {
         addrange(cv, '0', '9');
         addrange(cv, 'a', 'f');

The first way is IMO preferable.

P.S. Maybe it would have been better to attach three diff files
instead of inserting them in the text?

dkf added on 2002-07-08 16:49:59:

Strange indeed!  If only the RE engine was less cryptic...

Suggested workaround: replace [[:xdigit:]] with [0-9a-fA-F]
which works.

Attachments: