Tcl Source Code

View Ticket
Ticket UUID: 1376892
Title: [:print:] wrong behaviour
Type: Bug Version: obsolete: 8.4.11
Submitter: petterik Created on: 2005-12-09 05:23:41
Subsystem: 43. Regexp Assigned To: dkf
Priority: 7 High Severity:
Status: Closed Last Modified: 2006-08-24 09:20:21
Resolution: Fixed Closed By: sf-robot
    Closed on: 2006-08-24 02:20:21
Description:
% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi

Expected result is the original string.

This, however, is subject to definition and Tcl's
specification. If one keeps `perlre' as the reference,
this is a bug. Perlre (v5.6.1) says: 'print -- Any
alphanumeric or punctuation (special) character or space.'
User Comments: sf-robot added on 2006-08-24 09:20:20:
Logged In: YES 
user_id=1312539

This Tracker item was closed automatically by the system. It was
previously set to a Pending status, and the original submitter
did not respond within 14 days (the time period specified by
the administrator of this Tracker).

dkf added on 2006-04-12 21:13:00:

File Deleted - 167658: 



File Added - 174347: re_print.diff

dkf added on 2006-04-12 21:12:58:
Logged In: YES 
user_id=79902

Reading around the web, I find that there's not much
agreement on what isprint() means at all outside the ASCII
domain. That really sucks.

So I'm defining it now. The [:print:] category shall now
contain all characters that are in any of the following
UNICODE categories:
  Letter (L*)
  Number (N*)
  Punctuation (P*)
  Symbol (S*)
  Space (Zs) but not other kinds of whitespace

http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

Fixed in the HEAD with the attached patch. Backport candidate?

billposer added on 2006-03-04 05:12:54:
Logged In: YES 
user_id=939324

I forgot to account for U+2028 and U+2029. These are the
abstract line and paragraph separators. I guess it makes
sense for them to be excluded from [:print:] even though the
other non-ASCII [:space:] characters are included since as I
understand it they have no corresponding glyphs but are
purely abstract.

billposer added on 2006-03-04 05:01:24:
Logged In: YES 
user_id=939324

Using the same techniques as in my previous message, I get
a uniform list of characters that are in [:print:] but not
in [:alnum:] or [:punct:].

U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+FE45: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F
U+FE46: alpha F alnum F digit F cntrl F punct F upper F
lower F blank F space F graph T print T xdigit F

Here is the diff against [:space:]:

> U+0009: alpha F alnum F digit F cntrl T punct F upper F lower F blank T space T graph F print F xdigit F
> U+000A: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000B: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000C: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000D: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
13a20,21
> U+2028: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+2029: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
16,17d23
< U+FE45: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F
< U+FE46: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F

It looks like [:print:] consists of [:graph:] plus [:space:]
minus (ASCII [:space:] - SPACE) plus U+FE45 and U+FE46,
which are the sesame points. This seems sensible.

billposer added on 2006-03-04 04:34:15:
Logged In: YES 
user_id=939324

Regarding [:space:], I checked out the classification
provided by the glibc wide character class functions using
the following program in a variety of locales followed by:
 
egrep "space T|Locale" ClassResults > SpaceResults
 
#include <stdlib.h>
#include <stdio.h>
#include <wctype.h>
#include <wchar.h>
#include <locale.h>
 
int main(void) {
  wint_t i;   /* the iswXXX() functions take a wint_t */
  /* Take the locale from the environment (LANG, LC_ALL, ...). */
  setlocale(LC_ALL,"");
  printf("Locale: %s\n",setlocale(LC_ALL,NULL));
  /* One output line per code point, with a T/F flag for each class. */
  for(i=0;i<0xFFFF;i++) {
    printf("U+%04X:\t",(unsigned int)i);
    printf("alpha %s\t",(iswalpha(i)? "T":"F"));
    printf("alnum %s\t",(iswalnum(i)? "T":"F"));
    printf("digit %s\t",(iswdigit(i)? "T":"F"));
    printf("cntrl %s\t",(iswcntrl(i)? "T":"F"));
    printf("punct %s\t",(iswpunct(i)? "T":"F"));
    printf("upper %s\t",(iswupper(i)? "T":"F"));
    printf("lower %s\t",(iswlower(i)? "T":"F"));
    printf("blank %s\t",(iswblank(i)? "T":"F"));
    printf("space %s\t",(iswspace(i)? "T":"F"));
    printf("graph %s\t",(iswgraph(i)? "T":"F"));
    printf("print %s\t",(iswprint(i)? "T":"F"));
    printf("xdigit %s\n",(iswxdigit(i)? "T":"F"));
  }
  exit(0);
}
 
In the C locale I got the expected:
Locale: C
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
 
In all of the other locales that I tried
(ca_ES,de_DE,en_US,hi_IN,ja_JP,kk_KZ,th_TH,zh_TW)
I got the same result:
 
Locale: hi_IN
U+0009: alpha F alnum F digit F cntrl T punct F upper F
lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+2028: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+2029: alpha F alnum F digit F cntrl T punct F upper F
lower F blank F space T graph F print F xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F
lower F blank T space T graph F print T xdigit F
 
So, at least as far as glibc and the locale definitions
distributed with it are concerned there is a standard set of
space characters. The list is not the same as the characters
with Unicode General Property Zs or Z, nor with Bidi
property WS.  Somebody has evidently worked through the
plausible candidates with their usage in mind.
 
Bill

billposer added on 2006-03-02 12:26:40:
Logged In: YES 
user_id=939324

>Interestingly, there are also many characters that are
>isalnum||ispunct but not isprint. That seems very strange to
>me; perhaps we need to find a real spec and use that instead
>of guessing... :-)

My experience suggests that a lot of software has been
rather sloppily extended to handle Unicode with the result
that for many features the behavior is not only non-standard
and not "common sense" but downright bizarre. For instance, try
a range like [a-ALPHA] in your favorite regexp engine (other
than Tcl). The common sense correct result is that this
should match the characters U+0061 through U+03B1. Another
plausible result would be an error because it crosses
Unicode blocks (this is the gawk behaviour). But in addition
to these I have found several other things, including
matches that include not only alpha but the entire Greek range!

Anyhow, the other problem is that I'm pretty sure that there
isn't any standard governing the extension of the POSIX
classes to Unicode. POSIX states some principles but they
are very general, basically just that you have to preserve
the ASCII classes. Unicode has classes of its own in the
form of the General Character Properties, but they aren't
the same and don't map to the POSIX classes in an obvious way.

dkf added on 2006-02-24 16:47:05:
Logged In: YES 
user_id=79902

The following C program indicates that there are large
numbers of characters that satisfy isprint() but neither
isalnum() nor ispunct()

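#include <ctype.h>
#include <stdio.h>
#include <locale.h>
/* Print each run of values in 0..65535 that satisfy isprint() but
   neither isalnum() nor ispunct(), as "first-last" in hex.
   Caveat: the <ctype.h> functions are only defined for EOF and for
   values representable as unsigned char, so calls with i > 255 are
   undefined behaviour. */
int main(void) {
   unsigned int i,j=1000000000;   /* j: last matching value (sentinel start) */
   setlocale(LC_ALL, "en_GB.UTF-8");
   for (i=0 ; i<65536 ; i++) {
      if (isprint(i) && !isalnum(i) && !ispunct(i)) {
         if (i!=j+1) {
            printf("%04x-", i);   /* start of a new run */
         }
         j = i;
      } else if (i == j+1) {
         printf("%04x\n", j);     /* run ended at the previous value */
      }
   }
   return 0;
}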

Interestingly, there are also many characters that are
isalnum||ispunct but not isprint. That seems very strange to
me; perhaps we need to find a real spec and use that instead
of guessing... :-)

dkf added on 2006-02-16 16:22:19:
Logged In: YES 
user_id=79902

What about [:space:] characters outside the classic ASCII
range? That's a total of 20 characters, and I'm not willing
to automatically just go with non-UNICODE-aware tools on
this. I ask this because it seems unreasonable to me to just
assume that old stuff is holy (an approach that has happened
in this area in the past; as a point to help understanding,
[:digit:] isn't the same as [0-9], and this is good.)

My characters of concern are:
  \u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000

billposer added on 2006-02-16 16:07:42:
Logged In: YES 
user_id=939324

GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

billposer added on 2006-02-16 15:15:40:
Logged In: YES 
user_id=939324

Donal,

I think that your patch does what you intended, but what you
intended and what I intended aren't the same. What you've
got treats [:print:] as [:alnum:] U [:punct:] U [:space:],
which is what Perl does. What I intended was [:alnum:] U
[:punct:] U SPACE, where SPACE = 0x20. (The simple test is
to see whether [[:print:]] matches tab.) That is my
understanding of the POSIX standard. I just checked a few
other regexp engines. TRE, which makes a point of strict
POSIX conformance, follows my interpretation, as do GNU awk,
java.util.regex, ruby, vim, zsh, and, interestingly, pcre. So
I would say that Perl has got it wrong. The necessary fix is
just to delete the two bits involving NUM_SPACE_RANGE from
your patch.

dkf added on 2006-02-16 13:54:27:

File Added - 167658: reprintchars.diff

dkf added on 2006-02-16 13:53:36:
Logged In: YES 
user_id=79902

Bug is located at line 817 of the HEAD regc_locale.c, and
consists of a missing arm for the CC_PRINT case. I believe
that the fix required (based on the POSIX definition, thanks
Bill!) is the attached patch, which I'd appreciate people
testing... :-)

billposer added on 2006-02-16 02:08:07:
Logged In: YES 
user_id=939324

Perhaps more important than the Perl definition is the POSIX
definition, according to which [:print:] = [:alnum:] U
[:punct:] U SPACE. Curiously, the current behaviour with
[:print:] = [:alnum:] is documented in Welch, Jones, and
Hobbs with no mention of it being a bug.

dkf added on 2005-12-09 23:42:28:
Logged In: YES 
user_id=79902

Good point. Yes. And [string is print] follows C's isprint()
IIRC, so it is (almost certainly) the RE engine that is wrong.

OK, this should be fixed.

dgp added on 2005-12-09 23:34:08:
Logged In: YES 
user_id=80530

Should we also consider consistency with [string is print]?

dkf added on 2005-12-09 16:19:55:
Logged In: YES 
user_id=79902

It seems that Perl defines [[:print:]] as
[[:space:][:graph:]] and we define it as [[:alnum:]] (and
yes, it is documented that way.) Or perhaps [:blank:]
instead of [:space:], the documentation being a bit hazy in
that respect.

The question is, what *should* we do?

The following procedure helps with checking these sorts of
things out:
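   # Print each ASCII character from 32 to 126 followed by 1 or 0,
   # showing whether it matches the union of the named classes.
   # Example: matches alnum punct
   proc matches args {
      # Build a bracket expression such as [[:alnum:][:punct:]]
      set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
      for {set i 32} {$i<127} {incr i} {
         set c [format %c $i]
         puts -nonewline "$c-[regexp $RE $c]\t"
      }
      puts ""
   }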

Attachments: