Tcl Source Code: Check-in [4d6af4f7a4]


Overview

Comment:	IMPLEMENTATION OF TIP #388
SHA1:	4d6af4f7a468b71a36b9c74e7921195225f74679
User & Date:	jan.nijtmans 2011-09-16 13:23:19

Context

2016-01-29
22:32		Backout TIP #388 implementation, this part doesn't belong in 8.5 check-in: 8660791a56 user: jan.nijtmans tags: 8.5-with-8.6-regexp
2011-09-16
13:35		Synchronization marker. check-in: 112a12c3ec user: dkf tags: trunk
13:23		IMPLEMENTATION OF TIP #388 check-in: 4d6af4f7a4 user: jan.nijtmans tags: trunk, potential incompatibility
13:19		Noticed that a test now works. check-in: d281c444c3 user: dkf tags: trunk
13:09		Don't change Tcl_UniChar type when TCL_UTF_MAX == 4 (not supported anyway) check-in: 98f64c277b user: jan.nijtmans tags: core-8-5-branch
08:12		merge trunk to feature branch Closed-Leaf check-in: 8b3fef2633 user: jan.nijtmans tags: tip-388-impl

Changes


Changes to ChangeLog.

Changes to doc/Tcl.n.

Changes to doc/re_syntax.n.

Changes to generic/regc_lex.c.

Changes to generic/regcomp.c.

Changes to generic/regcustom.h.

Changes to generic/tcl.h.

Changes to generic/tclParse.c.

Changes to tests/reg.test.

Changes to tests/utf.test.

--- doc/Tcl.n
+++ doc/Tcl.n
@@ -1,16 +1,16 @@
 '\"
 '\" Copyright (c) 1993 The Regents of the University of California.
 '\" Copyright (c) 1994-1996 Sun Microsystems, Inc.
 '\"
 '\" See the file "license.terms" for information on usage and redistribution
 '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
 '\"
 .so man.macros
-.TH Tcl n "8.5" Tcl "Tcl Built-In Commands"
+.TH Tcl n "8.6" Tcl "Tcl Built-In Commands"
 .BS
 .SH NAME
 Tcl \- Tool Command Language
 .SH SYNOPSIS
 Summary of Tcl language syntax.
 .BE
 .SH DESCRIPTION
︙			︙
@@ -189,31 +189,41 @@
 .TP 7
 \e\e
 Backslash
 .PQ \e "" .
 .TP 7
 \e\fIooo\fR
 .
-The digits \fIooo\fR (one, two, or three of them) give an eight-bit octal
-value for the Unicode character that will be inserted. The upper bits of the
-Unicode character will be 0.
+The digits \fIooo\fR (one, two, or three of them) give a eight-bit octal
+value for the Unicode character that will be inserted, in the range
+\fI000\fR - \fI377\fR. The parser will stop just before this range overflows,
+or when the maximum of three digits is reached. The upper bits of the Unicode
+character will be 0.
 .TP 7
 \e\fBx\fIhh\fR
 .
-The hexadecimal digits \fIhh\fR give an eight-bit hexadecimal value for the
-Unicode character that will be inserted. Any number of hexadecimal digits
-may be present; however, all but the last two are ignored (the result is
-always a one-byte quantity). The upper bits of the Unicode character will
-be 0.
+The hexadecimal digits \fIhh\fR (one or two of them) give an eight-bit
+hexadecimal value for the Unicode character that will be inserted. The upper
+bits of the Unicode character will be 0.
 .TP 7
 \e\fBu\fIhhhh\fR
 .
 The hexadecimal digits \fIhhhh\fR (one, two, three, or four of them) give a
 sixteen-bit hexadecimal value for the Unicode character that will be
-inserted.
+inserted. The upper bits of the Unicode character will be 0.
+.TP 7
+\e\fBU\fIhhhhhhhh\fR
+.
+The hexadecimal digits \fIhhhhhhhh\fR (one up to eight of them) give a
+twentiy-one-bit hexadecimal value for the Unicode character that will be
+inserted, in the range U+0000..U+10FFFF. The parser will stop just
+before this range overflows, or when the maximum of eight digits
+is reached. The upper bits of the Unicode character will be 0.
+.PP
+The range U+010000..U+10FFFD is reserved for the future.
 .PP
 Backslash substitution is not performed on words enclosed in braces,
 except for backslash-newline as described above.
 .RE
 .IP "[10] \fBComments.\fR"
 If a hash character
 .PQ #
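The "stop just before this range overflows" rule documented for \U can be sketched in C as below. This is only an illustration under stated assumptions: parse_hex_escape is a hypothetical helper written for this note, not Tcl's own TclParseHex, although it borrows the same guard (result > 0x10fff) that this check-in adds to the parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Tcl): read at most maxDigits hexadecimal
 * digits from src, stopping early if one more digit would push the value
 * past U+10FFFF. Stores the value in *resultPtr and returns the number of
 * digits consumed (0 if src does not start with a hex digit).
 */
static int
parse_hex_escape(const char *src, int maxDigits, int *resultPtr)
{
    int result = 0;
    const char *p = src;

    assert(src != NULL && resultPtr != NULL);
    while (maxDigits-- > 0) {
        unsigned char digit = (unsigned char) *p;

        /* 0x10fff is the largest value that can still take one more digit. */
        if (!isxdigit(digit) || result > 0x10fff) {
            break;
        }
        p++;
        result <<= 4;
        if (digit >= 'a') {
            result |= 10 + digit - 'a';
        } else if (digit >= 'A') {
            result |= 10 + digit - 'A';
        } else {
            result |= digit - '0';
        }
    }
    *resultPtr = result;
    return (int) (p - src);
}
```

Under these assumptions, the input "1000000x" consumes six digits (value 0x100000) and leaves "0x" unread, which is consistent with what regression tests 13.33 and 13.34 in tests/reg.test expect from the regular-expression lexer.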
︙			︙

︙			︙
--- generic/regc_lex.c
+++ generic/regc_lex.c
@@ -738,14 +738,15 @@
  ^ static int lexescape(struct vars *);
  */
 static int /* not actually used, but convenient for RETV */
 lexescape(
     struct vars *v)
 {
     chr c;
+    int i;
     static const chr alert[] = {
         CHR('a'), CHR('l'), CHR('e'), CHR('r'), CHR('t')
     };
     static const chr esc[] = {
         CHR('E'), CHR('S'), CHR('C')
     };
     const chr *save;
︙			︙
@@ -814,41 +815,46 @@
         NOTE(REG_ULOCALE);
         RETV(CCLASS, 'S');
         break;
     case CHR('t'):
         RETV(PLAIN, CHR('\t'));
         break;
     case CHR('u'):
-        c = lexdigits(v, 16, 4, 4);
+        c = (uchr) lexdigits(v, 16, 1, 4);
         if (ISERR()) {
             FAILW(REG_EESCAPE);
         }
         RETV(PLAIN, c);
         break;
     case CHR('U'):
-        c = lexdigits(v, 16, 8, 8);
+        i = lexdigits(v, 16, 1, 8);
         if (ISERR()) {
             FAILW(REG_EESCAPE);
         }
-        RETV(PLAIN, c);
+        if (i > 0xFFFF) {
+            /* TODO: output a Surrogate pair
+             */
+            i = 0xFFFD;
+        }
+        RETV(PLAIN, (uchr) i);
         break;
     case CHR('v'):
         RETV(PLAIN, CHR('\v'));
         break;
     case CHR('w'):
         NOTE(REG_ULOCALE);
         RETV(CCLASS, 'w');
         break;
     case CHR('W'):
         NOTE(REG_ULOCALE);
         RETV(CCLASS, 'W');
         break;
     case CHR('x'):
         NOTE(REG_UUNPORT);
-        c = lexdigits(v, 16, 1, 255); /* REs >255 long outside spec */
+        c = (uchr) lexdigits(v, 16, 1, 2);
         if (ISERR()) {
             FAILW(REG_EESCAPE);
         }
         RETV(PLAIN, c);
         break;
     case CHR('y'):
         NOTE(REG_ULOCALE);
︙			︙
@@ -862,15 +868,15 @@
         RETV(SEND, 0);
         break;
     case CHR('1'): case CHR('2'): case CHR('3'): case CHR('4'):
     case CHR('5'): case CHR('6'): case CHR('7'): case CHR('8'):
     case CHR('9'):
         save = v->now;
         v->now--;               /* put first digit back */
-        c = lexdigits(v, 10, 1, 255);   /* REs >255 long outside spec */
+        c = (uchr) lexdigits(v, 10, 1, 255);    /* REs >255 long outside spec */
         if (ISERR()) {
             FAILW(REG_EESCAPE);
         }
 
         /*
          * Ugly heuristic (first test is "exactly 1 digit?")
          */
︙			︙
@@ -889,47 +895,56 @@
         /*
          * And fall through into octal number.
          */
 
     case CHR('0'):
         NOTE(REG_UUNPORT);
         v->now--;               /* put first digit back */
-        c = lexdigits(v, 8, 1, 3);
+        c = (uchr) lexdigits(v, 8, 1, 3);
         if (ISERR()) {
             FAILW(REG_EESCAPE);
         }
+        if (c > 0xff) {
+            /* out of range, so we handled one digit too much */
+            v->now--;
+            c >>= 3;
+        }
         RETV(PLAIN, c);
         break;
     default:
         assert(iscalpha(c));
         FAILW(REG_EESCAPE);     /* unknown alphabetic escape */
         break;
     }
     assert(NOTREACHED);
 }
 
 /*
  - lexdigits - slurp up digits and return chr value
- ^ static chr lexdigits(struct vars *, int, int, int);
+ ^ static int lexdigits(struct vars *, int, int, int);
  */
-static chr /* chr value; errors signalled via ERR */
+static int /* chr value; errors signalled via ERR */
 lexdigits(
     struct vars *v,
     int base,
     int minlen,
     int maxlen)
 {
-    uchr n;                     /* unsigned to avoid overflow misbehavior */
+    int n;
     int len;
     chr c;
     int d;
     const uchr ub = (uchr) base;
 
     n = 0;
     for (len = 0; len < maxlen && !ATEOS(); len++) {
+        if (n > 0x10fff) {
+            /* Stop when continuing would otherwise overflow */
+            break;
+        }
         c = *v->now++;
         switch (c) {
         case CHR('0'): case CHR('1'): case CHR('2'): case CHR('3'):
         case CHR('4'): case CHR('5'): case CHR('6'): case CHR('7'):
         case CHR('8'): case CHR('9'):
             d = DIGITVAL(c);
             break;
︙			︙
@@ -954,15 +969,15 @@
         }
         n = n*ub + (uchr)d;
     }
     if (len < minlen) {
         ERR(REG_EESCAPE);
     }
 
-    return (chr)n;
+    return n;
 }
 
 /*
  - brenext - get next BRE token
  * This is much like EREs except for all the stupid backslashes and the
  * context-dependency of some things.
  ^ static int brenext(struct vars *, pchr);
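The regex lexer handles octal overflow differently from a look-ahead check: it reads up to three digits first and, if the value exceeds 0xff, gives the last digit back. For \U it substitutes U+FFFD for anything above U+FFFF, because surrogate-pair output is still a TODO. The sketch below illustrates both ideas in isolation. It assumes a simplified cursor type (struct cursor) in place of the real struct vars, and the names read_octal_escape and clamp_to_bmp are hypothetical, not functions in regc_lex.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the lexer state; not the real struct vars. */
struct cursor {
    const char *now;    /* next unread character */
    const char *stop;   /* one past the end of the input */
};

/*
 * Read one to three octal digits. If three digits overflow eight bits,
 * un-read the last digit and drop its three bits from the value.
 * Returns the character value, or -1 if no octal digit was present.
 */
static int
read_octal_escape(struct cursor *cur)
{
    int n = 0;
    int len = 0;

    assert(cur != NULL && cur->now != NULL && cur->now <= cur->stop);
    while (len < 3 && cur->now < cur->stop
            && *cur->now >= '0' && *cur->now <= '7') {
        n = n * 8 + (*cur->now - '0');
        cur->now++;
        len++;
    }
    if (len == 0) {
        return -1;      /* caller would report an escape error */
    }
    if (n > 0xff) {
        /* Out of range, so one digit too many was consumed: back off. */
        cur->now--;
        n >>= 3;
    }
    return n;
}

/*
 * Map code points outside the Basic Multilingual Plane to U+FFFD, the
 * replacement this check-in uses until surrogate pairs are emitted.
 */
static int
clamp_to_bmp(int codepoint)
{
    return (codepoint > 0xFFFF) ? 0xFFFD : codepoint;
}
```

With the input "701b", this sketch yields 0x38 (the character "8") and leaves "1b" unread, which is the behavior test 15.13 in tests/reg.test checks for.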
︙			︙

︙			︙
--- generic/tclParse.c
+++ generic/tclParse.c
@@ -750,15 +750,15 @@
 {
     int result = 0;
     register const char *p = src;
 
     while (numBytes--) {
         unsigned char digit = UCHAR(*p);
 
-        if (!isxdigit(digit)) {
+        if (!isxdigit(digit) || (result > 0x10fff)) {
             break;
         }
 
         p++;
         result <<= 4;
 
         if (digit >= 'a') {
︙			︙
@@ -862,15 +862,15 @@
         case 't':
             result = 0x9;
             break;
         case 'v':
             result = 0xb;
             break;
         case 'x':
-            count += TclParseHex(p+1, numBytes-2, &result);
+            count += TclParseHex(p+1, (numBytes > 3) ? 2 : numBytes-2, &result);
             if (count == 2) {
                 /*
                  * No hexadigits -> This is just "x".
                  */
 
                 result = 'x';
             } else {
︙			︙
@@ -884,14 +884,23 @@
             count += TclParseHex(p+1, (numBytes > 5) ? 4 : numBytes-2, &result);
             if (count == 2) {
                 /*
                  * No hexadigits -> This is just "u".
                  */
                 result = 'u';
             }
             break;
+        case 'U':
+            count += TclParseHex(p+1, (numBytes > 9) ? 8 : numBytes-2, &result);
+            if (count == 2) {
+                /*
+                 * No hexadigits -> This is just "U".
+                 */
+                result = 'U';
+            }
+            break;
         case '\n':
             count--;
             do {
                 p++;
                 count++;
             } while ((count < numBytes) && ((*p == ' ') || (*p == '\t')));
︙			︙
@@ -913,15 +922,15 @@
                     || (UCHAR(*p) >= '8')) {
                 break;
             }
             count = 3;
             result = (result << 3) + (*p - '0');
             p++;
             if ((numBytes == 3) || !isdigit(UCHAR(*p)) /* INTL: digit */
-                    || (UCHAR(*p) >= '8')) {
+                    || (UCHAR(*p) >= '8') || (result >= 0x20)) {
                 break;
             }
             count = 4;
             result = UCHAR((result << 3) + (*p - '0'));
             break;
         }
 
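In the Tcl parser the octal limit is enforced before the third digit is consumed: once two digits have produced a value of 0x20 (octal 40) or more, a third digit would exceed octal 377, so parsing stops. The following loop is a minimal sketch of that check, assuming a NUL-terminated input. parse_octal_escape is a hypothetical name, and the real TclParseBackslash unrolls this logic rather than looping.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Tcl): read up to three octal digits
 * from src, refusing a third digit when the first two already give a
 * value of 0x20 or more (a third digit would exceed octal 377).
 * Stores the value in *resultPtr and returns the digits consumed.
 */
static int
parse_octal_escape(const char *src, int *resultPtr)
{
    int result = 0;
    int count = 0;

    assert(src != NULL && resultPtr != NULL);
    while (count < 3 && src[count] >= '0' && src[count] <= '7') {
        if (result >= 0x20) {
            break;      /* only reachable after two digits */
        }
        result = (result << 3) + (src[count] - '0');
        count++;
    }
    *resultPtr = result;
    return count;
}
```

Under these assumptions "377" is read in full (value 0xff), while "701" stops after "70" (value 0x38) and leaves the "1" as an ordinary character.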
︙			︙

︙			︙
--- tests/reg.test
+++ tests/reg.test
@@ -622,24 +622,32 @@
 expectMatch 13.10 MP   "a\\cHb"        "a\bb"  "a\bb"
 expectMatch 13.11 LMP  "a\\e"          "a\033" "a\033"
 expectMatch 13.12 P    "a\\fb"         "a\fb"  "a\fb"
 expectMatch 13.13 P    "a\\nb"         "a\nb"  "a\nb"
 expectMatch 13.14 P    "a\\rb"         "a\rb"  "a\rb"
 expectMatch 13.15 P    "a\\tb"         "a\tb"  "a\tb"
 expectMatch 13.16 P    "a\\u0008x"     "a\bx"  "a\bx"
-expectError 13.17 -    {a\u008x}       EESCAPE
+expectMatch 13.17 P    {a\u008x}       "a\bx"  "a\bx"
 expectMatch 13.18 P    "a\\u00088x"    "a\b8x" "a\b8x"
 expectMatch 13.19 P    "a\\U00000008x" "a\bx"  "a\bx"
-expectError 13.20 -    {a\U0000008x}   EESCAPE
+expectMatch 13.20 P    {a\U0000008x}   "a\bx"  "a\bx"
 expectMatch 13.21 P    "a\\vb"         "a\vb"  "a\vb"
 expectMatch 13.22 MP   "a\\x08x"       "a\bx"  "a\bx"
 expectError 13.23 -    {a\xq}          EESCAPE
-expectMatch 13.24 MP   "a\\x0008x"     "a\bx"  "a\bx"
+expectMatch 13.24 MP   "a\\x08x"       "a\bx"  "a\bx"
 expectError 13.25 -    {a\z}           EESCAPE
 expectMatch 13.26 MP   "a\\010b"       "a\bb"  "a\bb"
+expectMatch 13.27 P    "a\\U00001234x" "a\u1234x"  "a\u1234x"
+expectMatch 13.28 P    {a\U00001234x}  "a\u1234x"  "a\u1234x"
+expectMatch 13.29 P    "a\\U0001234x"  "a\u1234x"  "a\u1234x"
+expectMatch 13.30 P    {a\U0001234x}   "a\u1234x"  "a\u1234x"
+expectMatch 13.31 P    "a\\U000012345x" "a\u12345x" "a\u12345x"
+expectMatch 13.32 P    {a\U000012345x} "a\u12345x" "a\u12345x"
+expectMatch 13.33 P    "a\\U1000000x"  "a\ufffd0x" "a\ufffd0x"
+expectMatch 13.34 P    {a\U1000000x}   "a\ufffd0x" "a\ufffd0x"
 
 
 doing 14 "back references"
 # ugh
 expectMatch 14.1  RP   {a(b*)c\1}      abbcbb  abbcbb  bb
 expectMatch 14.2  RP   {a(b*)c\1}      ac      ac      ""
 expectNomatch 14.3 RP  {a(b*)c\1}      abbcb
︙			︙
@@ -678,14 +686,15 @@
         "abbbbbbbbbbbc" abbbbbbbbbbbc b b b b b b b b b b
 # but we're fussy about border cases -- guys who want octal should use the zero
 expectError 15.9  -    {a((((((((((b\10))))))))))c}    ESUBREG
 # BREs don't have octal, EREs don't have backrefs
 expectMatch 15.10 MP   "a\\12b"        "a\nb"  "a\nb"
 expectError 15.11 b    {a\12b}         ESUBREG
 expectMatch 15.12 eAS  {a\12b}         a12b    a12b
+expectMatch 15.13 MP   {a\701b}        a\u00381b       a\u00381b
 
 
 doing 16 "expanded syntax"
 expectMatch 16.1 xP    "a b c"         "abc"   "abc"
 expectMatch 16.2 xP    "a b #oops\nc\td"       "abcd"  "abcd"
 expectMatch 16.3 x     "a\\ b\\\tc"    "a b\tc"        "a b\tc"
 expectMatch 16.4 xP    "a b\\#c"       "ab#c"  "ab#c"
︙			︙