Ticket UUID: | 1376892 | |||
Title: | [:print:] wrong behaviour | |||
Type: | Bug | Version: | obsolete: 8.4.11 | |
Submitter: | petterik | Created on: | 2005-12-09 05:23:41 | |
Subsystem: | 43. Regexp | Assigned To: | dkf | |
Priority: | 7 High | Severity: | ||
Status: | Closed | Last Modified: | 2006-08-24 09:20:21 | |
Resolution: | Fixed | Closed By: | sf-robot | |
Closed on: | 2006-08-24 02:20:21 | |||
Description: |
% set str {moi+moi+moi}
% regsub -all {[^[:print:]]} $str {} str2; puts $str2
moimoimoi

Expected result is the original string. This, however, is subject to definition and Tcl's specification. If one keeps `perlre' as the reference, this is a bug. Perlre (v5.6.1) says: 'print -- Any alphanumeric or punctuation (special) character or space.' | |||
User Comments: |
sf-robot added on 2006-08-24 09:20:20:
Logged In: YES user_id=1312539

This Tracker item was closed automatically by the system. It was previously set to a Pending status, and the original submitter did not respond within 14 days (the time period specified by the administrator of this Tracker).

dkf added on 2006-04-12 21:13:00:
File Deleted - 167658:
File Added - 174347: re_print.diff

dkf added on 2006-04-12 21:12:58:
Logged In: YES user_id=79902

Reading around the web, I find that there's not much agreement on what isprint() means at all outside the ASCII domain. That really sucks. So I'm defining it now. The [:print:] category shall now contain all characters that are in any of the following UNICODE categories:

Letter (L*)
Number (N*)
Punctuation (P*)
Symbol (S*)
Space (Zs) but not other kinds of whitespace

http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values

Fixed in the HEAD with the attached patch. Backport candidate?

billposer added on 2006-03-04 05:12:54:
Logged In: YES user_id=939324

I forgot to account for U+2028 and U+2029. These are the abstract line and paragraph separators. I guess it makes sense for them to be excluded from [:print:] even though the other non-ASCII [:space:] characters are included since, as I understand it, they have no corresponding glyphs but are purely abstract.

billposer added on 2006-03-04 05:01:24:
Logged In: YES user_id=939324

Using the same techniques as in my previous message, I get a uniform list of characters that are in [:print:] but not in [:alnum:] or [:punct:].
U+0020: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+FE45: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F
U+FE46: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F

Here is the diff against [:space:]:

> U+0009: alpha F alnum F digit F cntrl T punct F upper F lower F blank T space T graph F print F xdigit F
> U+000A: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000B: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000C: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+000D: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
13a20,21
> U+2028: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
> U+2029: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
16,17d23
< U+FE45: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F
< U+FE46: alpha F alnum F digit F cntrl F punct F upper F lower F blank F space F graph T print T xdigit F

It looks like [:print:] consists of [:graph:] plus [:space:] minus (ASCII [:space:] - SPACE) plus U+FE45 and U+FE46, which are the sesame points. This seems sensible.

billposer added on 2006-03-04 04:34:15:
Logged In: YES user_id=939324

Regarding [:space:], I checked out the classification provided by the glibc wide character class functions using the following program in a variety of locales followed by:

    egrep "space T|Locale" ClassResults > SpaceResults

    #include <stdlib.h>
    #include <stdio.h>
    #include <wctype.h>
    #include <wchar.h>
    #include <locale.h>

    int main(int ac, char *av[]) {
        wchar_t i;

        setlocale(LC_ALL,"");
        printf("Locale: %s\n",setlocale(LC_ALL,NULL));
        for(i=0;i<0xFFFF;i++) {
            printf("U+%04X:\t",i);
            printf("alpha %s\t",(iswalpha(i)? "T":"F"));
            printf("alnum %s\t",(iswalnum(i)? "T":"F"));
            printf("digit %s\t",(iswdigit(i)? "T":"F"));
            printf("cntrl %s\t",(iswcntrl(i)? "T":"F"));
            printf("punct %s\t",(iswpunct(i)? "T":"F"));
            printf("upper %s\t",(iswupper(i)? "T":"F"));
            printf("lower %s\t",(iswlower(i)? "T":"F"));
            printf("blank %s\t",(iswblank(i)? "T":"F"));
            printf("space %s\t",(iswspace(i)? "T":"F"));
            printf("graph %s\t",(iswgraph(i)? "T":"F"));
            printf("print %s\t",(iswprint(i)? "T":"F"));
            printf("xdigit %s\n",(iswxdigit(i)? "T":"F"));
        }
        exit(0);
    }

In the C locale I got the expected:

Locale: C
U+0009: alpha F alnum F digit F cntrl T punct F upper F lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F

In all of the other locales that I tried (ca_ES,de_DE,en_US,hi_IN,ja_JP,kk_KZ,th_TH,zh_TW) I got the same result:

Locale: hi_IN
U+0009: alpha F alnum F digit F cntrl T punct F upper F lower F blank T space T graph F print F xdigit F
U+000A: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000B: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000C: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+000D: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+0020: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+1680: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2000: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2001: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2002: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2003: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2004: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2005: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2006: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2008: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2009: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+200A: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+200B: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+2028: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+2029: alpha F alnum F digit F cntrl T punct F upper F lower F blank F space T graph F print F xdigit F
U+205F: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F
U+3000: alpha F alnum F digit F cntrl F punct F upper F lower F blank T space T graph F print T xdigit F

So, at least as far as glibc and the locale definitions distributed with it are concerned there is a standard set of space characters. The list is not the same as the characters with Unicode General Property Zs or Z, nor with Bidi property WS. Somebody has evidently worked through the plausible candidates with their usage in mind.

Bill

billposer added on 2006-03-02 12:26:40:
Logged In: YES user_id=939324

>Interestingly, there are also many characters that are
>isalnum||ispunct but not isprint. That seems very strange to
>me; perhaps we need to find a real spec and use that instead
>of guessing...
:-)

My experience suggests that a lot of software has been rather sloppily extended to handle Unicode with the result that for many features the behavior is not only non-standard, not "common sense", but downright bizarre. For instance, try a range like [a-ALPHA] in your favorite regexp engine (other than Tcl). The common sense correct result is that this should match the characters U+0061 through U+03B1. Another plausible result would be an error because it crosses Unicode blocks (this is the gawk behaviour). But in addition to these I have found several other things, including matches that include not only alpha but the entire Greek range!

Anyhow, the other problem is that I'm pretty sure that there isn't any standard governing the extension of the POSIX classes to Unicode. POSIX states some principles but they are very general, basically just that you have to preserve the ASCII classes. Unicode has classes of its own in the form of the General Character Properties, but they aren't the same and don't map to the POSIX classes in an obvious way.

dkf added on 2006-02-24 16:47:05:
Logged In: YES user_id=79902

The following C program indicates that there are large numbers of characters that satisfy isprint() but neither isalnum() nor ispunct()

    #include <ctype.h>
    #include <stdio.h>
    #include <locale.h>

    int main() {
        unsigned int i,j=1000000000;
        setlocale(LC_ALL, "en_GB.UTF-8");
        for (i=0 ; i<65536 ; i++) {
            if (isprint(i) && !isalnum(i) && !ispunct(i)) {
                if (i!=j+1) {
                    printf("%04x-", i);
                }
                j = i;
            } else if (i == j+1) {
                printf("%04x\n", j);
            }
        }
        return 0;
    }

Interestingly, there are also many characters that are isalnum||ispunct but not isprint. That seems very strange to me; perhaps we need to find a real spec and use that instead of guessing... :-)

dkf added on 2006-02-16 16:22:19:
Logged In: YES user_id=79902

What about [:space:] characters outside the classic ASCII range?
That's a total of 20 characters, and I'm not willing to automatically just go with non-UNICODE-aware tools on this. I ask this because it seems unreasonable to me to just assume that old stuff is holy (an approach that has happened in this area in the past; as a point to help understanding, [:digit:] isn't the same as [0-9], and this is good.)

My characters of concern are: \u00a0, \u1680, \u2000-\u200b, \u2028, \u2029, \u202f, \u3000

billposer added on 2006-02-16 16:07:42:
Logged In: YES user_id=939324

GNU egrep agrees with gawk, java, ruby, etc. as opposed to Perl.

billposer added on 2006-02-16 15:15:40:
Logged In: YES user_id=939324

Donal, I think that your patch does what you intended, but what you intended and what I intended aren't the same. What you've got treats [:print:] as [:alnum:] U [:punct:] U [:space:], which is what Perl does. What I intended was [:alnum:] U [:punct:] U SPACE, where SPACE = 0x20. (The simple test is to see whether [[:print:]] matches tab.) That is my understanding of the POSIX standard.

I just checked a few other regexp engines. TRE, which makes a point of strict POSIX conformance, follows my interpretation, as do GNU awk, java.util.regex, ruby, vim, zsh, and, interestingly, pcre. So I would say that Perl has got it wrong.

The necessary fix is just to delete the two bits involving NUM_SPACE_RANGE from your patch.

dkf added on 2006-02-16 13:54:27:
File Added - 167658: reprintchars.diff

dkf added on 2006-02-16 13:53:36:
Logged In: YES user_id=79902

Bug is located at line 817 of the HEAD regc_locale.c, and consists of a missing arm for the CC_PRINT case. I believe that the fix required (based on the POSIX definition, thanks Bill!) is the attached patch, which I'd appreciate people testing... :-)

billposer added on 2006-02-16 02:08:07:
Logged In: YES user_id=939324

Perhaps more important than the Perl definition is the POSIX definition, according to which [:print:] = [:alnum:] U [:punct:] U SPACE. Curiously, the current behaviour with [:print:] = [:alnum:] is documented in Welch, Jones, and Hobbs with no mention of it being a bug.

dkf added on 2005-12-09 23:42:28:
Logged In: YES user_id=79902

Good point. Yes. And [string is print] follows C's isprint() IIRC, so it is (almost certainly) the RE engine that is wrong. OK, this should be fixed.

dgp added on 2005-12-09 23:34:08:
Logged In: YES user_id=80530

should we also consider consistency with [string is print] ?

dkf added on 2005-12-09 16:19:55:
Logged In: YES user_id=79902

It seems that Perl defines [[:print:]] as [[:space:][:graph:]] and we define it as [[:alnum:]] (and yes, it is documented that way.) Or perhaps [:blank:] instead of [:space:], the documentation being a bit hazy in that respect. The question is, what *should* we do?

The following procedure helps with checking these sorts of things out:

    proc matches args {
        set RE [format {[[:%s:]]} [join $args ":\]\[:"]]
        for {set i 32} {$i<127} {incr i} {
            set c [format %c $i]
            puts -nonewline "$c-[regexp $RE $c]\t"
        }
        puts ""
    }
|
Attachments:
- re_print.diff added by dkf on 2006-04-12 21:12:59.