Check-in [4d6af4f7a4]
Not logged in
Tcl 2014 Conference, Portland/OR, US, Nov 10-14
Browse the schedule online.

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
SHA1 Hash:4d6af4f7a468b71a36b9c74e7921195225f74679
Date: 2011-09-16 13:23:19
User: jan.nijtmans
Comment:IMPLEMENTATION OF TIP #388
Tags And Properties
Changes

Changes to ChangeLog




















1
2
3
4
5
6
7



















2011-09-16  Donal K. Fellows  <dkf@users.sf.net>

	* generic/tclProc.c (ProcWrongNumArgs): [Bugs 3400658,3408830]:
	Ensemble-like rewriting of error messages is complex, and TclOO (in
	combination with iTcl) hits the most tricky cases.

	* library/http/http.tcl (http::geturl): [Bug 3391977]: Ensure that the
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
2011-09-16  Jan Nijtmans  <nijtmans@users.sf.net>

	IMPLEMENTATION OF TIP #388

	* doc/Tcl.n
	* doc/re_syntax.n
	* generic/regc_lex.c
	* generic/regcomp.c
	* generic/regcustom.h
	* generic/tcl.h
	* generic/tclParse.c
	* tests/reg.test
	* tests/utf.test

2011-09-16  Jan Nijtmans  <nijtmans@users.sf.net>

	* generic/tcl.h:        Don't change Tcl_UniChar type when
	* generic/regcustom.h:  TCL_UTF_MAX == 4 (not supported anyway)

2011-09-16  Donal K. Fellows  <dkf@users.sf.net>

	* generic/tclProc.c (ProcWrongNumArgs): [Bugs 3400658,3408830]:
	Ensemble-like rewriting of error messages is complex, and TclOO (in
	combination with iTcl) hits the most tricky cases.

	* library/http/http.tcl (http::geturl): [Bug 3391977]: Ensure that the

Changes to doc/Tcl.n

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...
189
190
191
192
193
194
195
196
197


198
199
200
201
202
203
204
205
206
207
208
209
210
211
212










213
214
215
216
217
218
219
'\" Copyright (c) 1993 The Regents of the University of California.
'\" Copyright (c) 1994-1996 Sun Microsystems, Inc.
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.so man.macros
.TH Tcl n "8.5" Tcl "Tcl Built-In Commands"
.BS
.SH NAME
Tcl \- Tool Command Language
.SH SYNOPSIS
Summary of Tcl language syntax.
.BE
.SH DESCRIPTION
................................................................................
.TP 7
\e\e
Backslash
.PQ \e "" .
.TP 7
\e\fIooo\fR 
.
The digits \fIooo\fR (one, two, or three of them) give an eight-bit octal 
value for the Unicode character that will be inserted.  The upper bits of the


Unicode character will be 0.
.TP 7
\e\fBx\fIhh\fR 
.
The hexadecimal digits \fIhh\fR give an eight-bit hexadecimal value for the
Unicode character that will be inserted.  Any number of hexadecimal digits
may be present; however, all but the last two are ignored (the result is
always a one-byte quantity).  The upper bits of the Unicode character will
be 0.
.TP 7
\e\fBu\fIhhhh\fR 
.
The hexadecimal digits \fIhhhh\fR (one, two, three, or four of them) give a
sixteen-bit hexadecimal value for the Unicode character that will be
inserted.










.PP
Backslash substitution is not performed on words enclosed in braces,
except for backslash-newline as described above.
.RE
.IP "[10] \fBComments.\fR"
If a hash character
.PQ #







|







 







|
|
>
>
|



|
|
<
|
<





|
>
>
>
>
>
>
>
>
>
>







2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205

206

207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
'\" Copyright (c) 1993 The Regents of the University of California.
'\" Copyright (c) 1994-1996 Sun Microsystems, Inc.
'\"
'\" See the file "license.terms" for information on usage and redistribution
'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
'\"
.so man.macros
.TH Tcl n "8.6" Tcl "Tcl Built-In Commands"
.BS
.SH NAME
Tcl \- Tool Command Language
.SH SYNOPSIS
Summary of Tcl language syntax.
.BE
.SH DESCRIPTION
................................................................................
.TP 7
\e\e
Backslash
.PQ \e "" .
.TP 7
\e\fIooo\fR 
.
The digits \fIooo\fR (one, two, or three of them) give a eight-bit octal 
value for the Unicode character that will be inserted, in the range \fI000\fR
- \fI377\fR.  The parser will stop just before this range overflows, or when
the maximum of three digits is reached.  The upper bits of the Unicode
character will be 0.
.TP 7
\e\fBx\fIhh\fR 
.
The hexadecimal digits \fIhh\fR (one or two of them) give an eight-bit
hexadecimal value for the Unicode character that will be inserted.  The upper

bits of the Unicode character will be 0.

.TP 7
\e\fBu\fIhhhh\fR 
.
The hexadecimal digits \fIhhhh\fR (one, two, three, or four of them) give a
sixteen-bit hexadecimal value for the Unicode character that will be
inserted.  The upper bits of the Unicode character will be 0.
.TP 7
\e\fBU\fIhhhhhhhh\fR 
.
The hexadecimal digits \fIhhhhhhhh\fR (one up to eight of them) give a
twentiy-one-bit hexadecimal value for the Unicode character that will be
inserted, in the range U+0000..U+10FFFF.  The parser will stop just
before this range overflows, or when the maximum of eight digits
is reached.  The upper bits of the Unicode character will be 0.
.PP
The range U+010000..U+10FFFD is reserved for the future.
.PP
Backslash substitution is not performed on words enclosed in braces,
except for backslash-newline as described above.
.RE
.IP "[10] \fBComments.\fR"
If a hash character
.PQ #

Changes to doc/re_syntax.n

355
356
357
358
359
360
361
362
363
364
365
366
367
368



369
370
371
372
373
374
375
376
377
378
379
380
381
382
383







384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
.TP
\fB\et\fR
.
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
.
(where \fIwxyz\fR is exactly four hexadecimal digits) the Unicode
character \fBU+\fIwxyz\fR in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
.
(where \fIstuvwxyz\fR is exactly eight hexadecimal digits) reserved
for a somewhat-hypothetical Unicode extension to 32 bits



.TP
\fB\ev\fR
.
vertical tab, as in C are all available.
.TP
\fB\ex\fIhhh\fR
.
(where \fIhhh\fR is any sequence of hexadecimal digits) the character
whose hexadecimal value is \fB0x\fIhhh\fR (a single character no
matter how many hexadecimal digits are used).
.TP
\fB\e0\fR
.
the character whose value is \fB0\fR
.TP







\fB\e\fIxy\fR
.
(where \fIxy\fR is exactly two octal digits, and is not a \fIback
reference\fR (see below)) the character whose octal value is
\fB0\fIxy\fR
.TP
\fB\e\fIxyz\fR
.
(where \fIxyz\fR is exactly three octal digits, and is not a back
reference (see below)) the character whose octal value is
\fB0\fIxyz\fR
.RE
.PP
Hexadecimal digits are
.QR \fB0\fR \fB9\fR ,
.QR \fBa\fR \fBf\fR ,
and
.QR \fBA\fR \fBF\fR .







|




|
|
>
>
>





|

|
|
<





>
>
>
>
>
>
>





<
<
<
<
<
<







355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380

381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397






398
399
400
401
402
403
404
.TP
\fB\et\fR
.
horizontal tab, as in C
.TP
\fB\eu\fIwxyz\fR
.
(where \fIwxyz\fR is one up to four hexadecimal digits) the Unicode
character \fBU+\fIwxyz\fR in the local byte ordering
.TP
\fB\eU\fIstuvwxyz\fR
.
(where \fIstuvwxyz\fR is one up to eight hexadecimal digits) reserved
for a Unicode extension up to 21 bits. The digits are parsed until the
first non-hexadecimal character is encountered, the maximun of eight
hexadecimal digits are reached, or an overflow would occur in the maximum
value of \fBU+\fI10ffff\fR.
.TP
\fB\ev\fR
.
vertical tab, as in C are all available.
.TP
\fB\ex\fIhh\fR
.
(where \fIhh\fR is one or two hexadecimal digits) the character
whose hexadecimal value is \fB0x\fIhh\fR.

.TP
\fB\e0\fR
.
the character whose value is \fB0\fR
.TP
\fB\e\fIxyz\fR
.
(where \fIxyz\fR is exactly three octal digits, and is not a \fIback
reference\fR (see below)) the character whose octal value is
\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
the two-digit form is assumed.
.TP
\fB\e\fIxy\fR
.
(where \fIxy\fR is exactly two octal digits, and is not a \fIback
reference\fR (see below)) the character whose octal value is
\fB0\fIxy\fR






.RE
.PP
Hexadecimal digits are
.QR \fB0\fR \fB9\fR ,
.QR \fBa\fR \fBf\fR ,
and
.QR \fBA\fR \fBF\fR .

Changes to generic/regc_lex.c

738
739
740
741
742
743
744

745
746
747
748
749
750
751
...
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831





832
833
834
835
836
837
838
839
...
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
...
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
...
889
890
891
892
893
894
895
896
897
898





899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928




929
930
931
932
933
934
935
...
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
 ^ static int lexescape(struct vars *);
 */
static int			/* not actually used, but convenient for RETV */
lexescape(
    struct vars *v)
{
    chr c;

    static const chr alert[] = {
	CHR('a'), CHR('l'), CHR('e'), CHR('r'), CHR('t')
    };
    static const chr esc[] = {
	CHR('E'), CHR('S'), CHR('C')
    };
    const chr *save;
................................................................................
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'S');
	break;
    case CHR('t'):
	RETV(PLAIN, CHR('\t'));
	break;
    case CHR('u'):
	c = lexdigits(v, 16, 4, 4);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	RETV(PLAIN, c);
	break;
    case CHR('U'):
	c = lexdigits(v, 16, 8, 8);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}





	RETV(PLAIN, c);
	break;
    case CHR('v'):
	RETV(PLAIN, CHR('\v'));
	break;
    case CHR('w'):
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'w');
................................................................................
	break;
    case CHR('W'):
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'W');
	break;
    case CHR('x'):
	NOTE(REG_UUNPORT);
	c = lexdigits(v, 16, 1, 255);	/* REs >255 long outside spec */
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	RETV(PLAIN, c);
	break;
    case CHR('y'):
	NOTE(REG_ULOCALE);
................................................................................
	RETV(SEND, 0);
	break;
    case CHR('1'): case CHR('2'): case CHR('3'): case CHR('4'):
    case CHR('5'): case CHR('6'): case CHR('7'): case CHR('8'):
    case CHR('9'):
	save = v->now;
	v->now--;		/* put first digit back */
	c = lexdigits(v, 10, 1, 255);	/* REs >255 long outside spec */
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}

	/*
	 * Ugly heuristic (first test is "exactly 1 digit?")
	 */
................................................................................
	/*
	 * And fall through into octal number.
	 */

    case CHR('0'):
	NOTE(REG_UUNPORT);
	v->now--;		/* put first digit back */
	c = lexdigits(v, 8, 1, 3);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);





	}
	RETV(PLAIN, c);
	break;
    default:
	assert(iscalpha(c));
	FAILW(REG_EESCAPE);	/* unknown alphabetic escape */
	break;
    }
    assert(NOTREACHED);
}
 
/*
 - lexdigits - slurp up digits and return chr value
 ^ static chr lexdigits(struct vars *, int, int, int);
 */
static chr			/* chr value; errors signalled via ERR */
lexdigits(
    struct vars *v,
    int base,
    int minlen,
    int maxlen)
{
    uchr n;			/* unsigned to avoid overflow misbehavior */
    int len;
    chr c;
    int d;
    const uchr ub = (uchr) base;

    n = 0;
    for (len = 0; len < maxlen && !ATEOS(); len++) {




	c = *v->now++;
	switch (c) {
	case CHR('0'): case CHR('1'): case CHR('2'): case CHR('3'):
	case CHR('4'): case CHR('5'): case CHR('6'): case CHR('7'):
	case CHR('8'): case CHR('9'):
	    d = DIGITVAL(c);
	    break;
................................................................................
	}
	n = n*ub + (uchr)d;
    }
    if (len < minlen) {
	ERR(REG_EESCAPE);
    }

    return (chr)n;
}
 
/*
 - brenext - get next BRE token
 * This is much like EREs except for all the stupid backslashes and the
 * context-dependency of some things.
 ^ static int brenext(struct vars *, pchr);







>







 







|






|



>
>
>
>
>
|







 







|







 







|







 







|


>
>
>
>
>













|

|






|







>
>
>
>







 







|







738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
...
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
...
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
...
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
...
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
...
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
 ^ static int lexescape(struct vars *);
 */
static int			/* not actually used, but convenient for RETV */
lexescape(
    struct vars *v)
{
    chr c;
    int i;
    static const chr alert[] = {
	CHR('a'), CHR('l'), CHR('e'), CHR('r'), CHR('t')
    };
    static const chr esc[] = {
	CHR('E'), CHR('S'), CHR('C')
    };
    const chr *save;
................................................................................
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'S');
	break;
    case CHR('t'):
	RETV(PLAIN, CHR('\t'));
	break;
    case CHR('u'):
	c = (uchr) lexdigits(v, 16, 1, 4);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	RETV(PLAIN, c);
	break;
    case CHR('U'):
	i = lexdigits(v, 16, 1, 8);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	if (i > 0xFFFF) {
	    /* TODO: output a Surrogate pair
	     */
	    i = 0xFFFD;
	}
	RETV(PLAIN, (uchr) i);
	break;
    case CHR('v'):
	RETV(PLAIN, CHR('\v'));
	break;
    case CHR('w'):
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'w');
................................................................................
	break;
    case CHR('W'):
	NOTE(REG_ULOCALE);
	RETV(CCLASS, 'W');
	break;
    case CHR('x'):
	NOTE(REG_UUNPORT);
	c = (uchr) lexdigits(v, 16, 1, 2);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	RETV(PLAIN, c);
	break;
    case CHR('y'):
	NOTE(REG_ULOCALE);
................................................................................
	RETV(SEND, 0);
	break;
    case CHR('1'): case CHR('2'): case CHR('3'): case CHR('4'):
    case CHR('5'): case CHR('6'): case CHR('7'): case CHR('8'):
    case CHR('9'):
	save = v->now;
	v->now--;		/* put first digit back */
	c = (uchr) lexdigits(v, 10, 1, 255);	/* REs >255 long outside spec */
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}

	/*
	 * Ugly heuristic (first test is "exactly 1 digit?")
	 */
................................................................................
	/*
	 * And fall through into octal number.
	 */

    case CHR('0'):
	NOTE(REG_UUNPORT);
	v->now--;		/* put first digit back */
	c = (uchr) lexdigits(v, 8, 1, 3);
	if (ISERR()) {
	    FAILW(REG_EESCAPE);
	}
	if (c > 0xff) {
	    /* out of range, so we handled one digit too much */
	    v->now--;
	    c >>= 3;
	}
	RETV(PLAIN, c);
	break;
    default:
	assert(iscalpha(c));
	FAILW(REG_EESCAPE);	/* unknown alphabetic escape */
	break;
    }
    assert(NOTREACHED);
}
 
/*
 - lexdigits - slurp up digits and return chr value
 ^ static int lexdigits(struct vars *, int, int, int);
 */
static int			/* chr value; errors signalled via ERR */
lexdigits(
    struct vars *v,
    int base,
    int minlen,
    int maxlen)
{
    int n;
    int len;
    chr c;
    int d;
    const uchr ub = (uchr) base;

    n = 0;
    for (len = 0; len < maxlen && !ATEOS(); len++) {
	if (n > 0x10fff) {
	    /* Stop when continuing would otherwise overflow */
	    break;
	}
	c = *v->now++;
	switch (c) {
	case CHR('0'): case CHR('1'): case CHR('2'): case CHR('3'):
	case CHR('4'): case CHR('5'): case CHR('6'): case CHR('7'):
	case CHR('8'): case CHR('9'):
	    d = DIGITVAL(c);
	    break;
................................................................................
	}
	n = n*ub + (uchr)d;
    }
    if (len < minlen) {
	ERR(REG_EESCAPE);
    }

    return n;
}
 
/*
 - brenext - get next BRE token
 * This is much like EREs except for all the stupid backslashes and the
 * context-dependency of some things.
 ^ static int brenext(struct vars *, pchr);

Changes to generic/regcomp.c

75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
/* === regc_lex.c === */
static void lexstart(struct vars *);
static void prefixes(struct vars *);
static void lexnest(struct vars *, const chr *, const chr *);
static void lexword(struct vars *);
static int next(struct vars *);
static int lexescape(struct vars *);
static chr lexdigits(struct vars *, int, int, int);
static int brenext(struct vars *, pchr);
static void skip(struct vars *);
static chr newline(NOPARMS);
#ifdef REG_DEBUG
static const chr *ch(NOPARMS);
#endif
static chr chrnamed(struct vars *, const chr *, const chr *, pchr);







|







75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
/* === regc_lex.c === */
static void lexstart(struct vars *);
static void prefixes(struct vars *);
static void lexnest(struct vars *, const chr *, const chr *);
static void lexword(struct vars *);
static int next(struct vars *);
static int lexescape(struct vars *);
static int lexdigits(struct vars *, int, int, int);
static int brenext(struct vars *, pchr);
static void skip(struct vars *);
static chr newline(NOPARMS);
#ifdef REG_DEBUG
static const chr *ch(NOPARMS);
#endif
static chr chrnamed(struct vars *, const chr *, const chr *, pchr);

Changes to generic/regcustom.h

93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
typedef Tcl_UniChar chr;	/* The type itself. */
typedef int pchr;		/* What it promotes to. */
typedef unsigned uchr;		/* Unsigned type that will hold a chr. */
typedef int celt;		/* Type to hold chr, or NOCELT */
#define	NOCELT (-1)		/* Celt value which is not valid chr */
#define	CHR(c) (UCHAR(c))	/* Turn char literal into chr literal */
#define	DIGITVAL(c) ((c)-'0')	/* Turn chr digit into its value */
#if TCL_UTF_MAX > 3
#define	CHRBITS	32		/* Bits in a chr; must not use sizeof */
#define	CHR_MIN	0x00000000	/* Smallest and largest chr; the value */
#define	CHR_MAX	0xffffffff	/* CHR_MAX-CHR_MIN+1 should fit in uchr */
#else
#define	CHRBITS	16		/* Bits in a chr; must not use sizeof */
#define	CHR_MIN	0x0000		/* Smallest and largest chr; the value */
#define	CHR_MAX	0xffff		/* CHR_MAX-CHR_MIN+1 should fit in uchr */







|







93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
typedef Tcl_UniChar chr;	/* The type itself. */
typedef int pchr;		/* What it promotes to. */
typedef unsigned uchr;		/* Unsigned type that will hold a chr. */
typedef int celt;		/* Type to hold chr, or NOCELT */
#define	NOCELT (-1)		/* Celt value which is not valid chr */
#define	CHR(c) (UCHAR(c))	/* Turn char literal into chr literal */
#define	DIGITVAL(c) ((c)-'0')	/* Turn chr digit into its value */
#if TCL_UTF_MAX > 4
#define	CHRBITS	32		/* Bits in a chr; must not use sizeof */
#define	CHR_MIN	0x00000000	/* Smallest and largest chr; the value */
#define	CHR_MAX	0xffffffff	/* CHR_MAX-CHR_MIN+1 should fit in uchr */
#else
#define	CHRBITS	16		/* Bits in a chr; must not use sizeof */
#define	CHR_MIN	0x0000		/* Smallest and largest chr; the value */
#define	CHR_MAX	0xffff		/* CHR_MAX-CHR_MIN+1 should fit in uchr */

Changes to generic/tcl.h

2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
#define TCL_CONVERT_MULTIBYTE	(-1)
#define TCL_CONVERT_SYNTAX	(-2)
#define TCL_CONVERT_UNKNOWN	(-3)
#define TCL_CONVERT_NOSPACE	(-4)

/*
 * The maximum number of bytes that are necessary to represent a single
 * Unicode character in UTF-8. The valid values should be 3 or 6 (or perhaps 1
 * if we want to support a non-unicode enabled core). If 3, then Tcl_UniChar
 * must be 2-bytes in size (UCS-2) (the default). If 6, then Tcl_UniChar must
 * be 4-bytes in size (UCS-4). At this time UCS-2 mode is the default and
 * recommended mode. UCS-4 is experimental and not recommended. It works for
 * the core, but most extensions expect UCS-2.
 */

#ifndef TCL_UTF_MAX
#define TCL_UTF_MAX		3
#endif

/*
 * This represents a Unicode character. Any changes to this should also be
 * reflected in regcustom.h.
 */

#if TCL_UTF_MAX > 3
    /*
     * unsigned int isn't 100% accurate as it should be a strict 4-byte value
     * (perhaps wchar_t). 64-bit systems may have troubles. The size of this
     * value must be reflected correctly in regcustom.h and
     * in tclEncoding.c.
     * XXX: Tcl is currently UCS-2 and planning UTF-16 for the Unicode
     * XXX: string rep that Tcl_UniChar represents.  Changing the size







|
|
|
|
|
|











|







2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
#define TCL_CONVERT_MULTIBYTE	(-1)
#define TCL_CONVERT_SYNTAX	(-2)
#define TCL_CONVERT_UNKNOWN	(-3)
#define TCL_CONVERT_NOSPACE	(-4)

/*
 * The maximum number of bytes that are necessary to represent a single
 * Unicode character in UTF-8. The valid values should be 3, 4 or 6
 * (or perhaps 1 if we want to support a non-unicode enabled core). If 3 or
 * 4, then Tcl_UniChar must be 2-bytes in size (UCS-2) (the default). If 6,
 * then Tcl_UniChar must be 4-bytes in size (UCS-4). At this time UCS-2 mode
 * is the default and recommended mode. UCS-4 is experimental and not
 * recommended. It works for the core, but most extensions expect UCS-2.
 */

#ifndef TCL_UTF_MAX
#define TCL_UTF_MAX		3
#endif

/*
 * This represents a Unicode character. Any changes to this should also be
 * reflected in regcustom.h.
 */

#if TCL_UTF_MAX > 4
    /*
     * unsigned int isn't 100% accurate as it should be a strict 4-byte value
     * (perhaps wchar_t). 64-bit systems may have troubles. The size of this
     * value must be reflected correctly in regcustom.h and
     * in tclEncoding.c.
     * XXX: Tcl is currently UCS-2 and planning UTF-16 for the Unicode
     * XXX: string rep that Tcl_UniChar represents.  Changing the size

Changes to generic/tclParse.c

750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
...
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
...
884
885
886
887
888
889
890









891
892
893
894
895
896
897
...
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
{
    int result = 0;
    register const char *p = src;

    while (numBytes--) {
	unsigned char digit = UCHAR(*p);

	if (!isxdigit(digit)) {
	    break;
	}

	p++;
	result <<= 4;

	if (digit >= 'a') {
................................................................................
    case 't':
	result = 0x9;
	break;
    case 'v':
	result = 0xb;
	break;
    case 'x':
	count += TclParseHex(p+1, numBytes-2, &result);
	if (count == 2) {
	    /*
	     * No hexadigits -> This is just "x".
	     */

	    result = 'x';
	} else {
................................................................................
	count += TclParseHex(p+1, (numBytes > 5) ? 4 : numBytes-2, &result);
	if (count == 2) {
	    /*
	     * No hexadigits -> This is just "u".
	     */
	    result = 'u';
	}









	break;
    case '\n':
	count--;
	do {
	    p++;
	    count++;
	} while ((count < numBytes) && ((*p == ' ') || (*p == '\t')));
................................................................................
		    || (UCHAR(*p) >= '8')) {
		break;
	    }
	    count = 3;
	    result = (result << 3) + (*p - '0');
	    p++;
	    if ((numBytes == 3) || !isdigit(UCHAR(*p))	/* INTL: digit */
		    || (UCHAR(*p) >= '8')) {
		break;
	    }
	    count = 4;
	    result = UCHAR((result << 3) + (*p - '0'));
	    break;
	}








|







 







|







 







>
>
>
>
>
>
>
>
>







 







|







750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
...
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
...
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
...
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
{
    int result = 0;
    register const char *p = src;

    while (numBytes--) {
	unsigned char digit = UCHAR(*p);

	if (!isxdigit(digit) || (result > 0x10fff)) {
	    break;
	}

	p++;
	result <<= 4;

	if (digit >= 'a') {
................................................................................
    case 't':
	result = 0x9;
	break;
    case 'v':
	result = 0xb;
	break;
    case 'x':
	count += TclParseHex(p+1, (numBytes > 3) ? 2 : numBytes-2, &result);
	if (count == 2) {
	    /*
	     * No hexadigits -> This is just "x".
	     */

	    result = 'x';
	} else {
................................................................................
	count += TclParseHex(p+1, (numBytes > 5) ? 4 : numBytes-2, &result);
	if (count == 2) {
	    /*
	     * No hexadigits -> This is just "u".
	     */
	    result = 'u';
	}
	break;
    case 'U':
	count += TclParseHex(p+1, (numBytes > 9) ? 8 : numBytes-2, &result);
	if (count == 2) {
	    /*
	     * No hexadigits -> This is just "U".
	     */
	    result = 'U';
	}
	break;
    case '\n':
	count--;
	do {
	    p++;
	    count++;
	} while ((count < numBytes) && ((*p == ' ') || (*p == '\t')));
................................................................................
		    || (UCHAR(*p) >= '8')) {
		break;
	    }
	    count = 3;
	    result = (result << 3) + (*p - '0');
	    p++;
	    if ((numBytes == 3) || !isdigit(UCHAR(*p))	/* INTL: digit */
		    || (UCHAR(*p) >= '8') || (result >= 0x20)) {
		break;
	    }
	    count = 4;
	    result = UCHAR((result << 3) + (*p - '0'));
	    break;
	}

Changes to tests/reg.test

622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638








639
640
641
642
643
644
645
...
678
679
680
681
682
683
684

685
686
687
688
689
690
691
expectMatch	13.10 MP	"a\\cHb"	"a\bb"	"a\bb"
expectMatch	13.11 LMP	"a\\e"		"a\033"	"a\033"
expectMatch	13.12 P		"a\\fb"		"a\fb"	"a\fb"
expectMatch	13.13 P		"a\\nb"		"a\nb"	"a\nb"
expectMatch	13.14 P		"a\\rb"		"a\rb"	"a\rb"
expectMatch	13.15 P		"a\\tb"		"a\tb"	"a\tb"
expectMatch	13.16 P		"a\\u0008x"	"a\bx"	"a\bx"
expectError	13.17 -		{a\u008x}	EESCAPE
expectMatch	13.18 P		"a\\u00088x"	"a\b8x"	"a\b8x"
expectMatch	13.19 P		"a\\U00000008x"	"a\bx"	"a\bx"
expectError	13.20 -		{a\U0000008x}	EESCAPE
expectMatch	13.21 P		"a\\vb"		"a\vb"	"a\vb"
expectMatch	13.22 MP	"a\\x08x"	"a\bx"	"a\bx"
expectError	13.23 -		{a\xq}		EESCAPE
expectMatch	13.24 MP	"a\\x0008x"	"a\bx"	"a\bx"
expectError	13.25 -		{a\z}		EESCAPE
expectMatch	13.26 MP	"a\\010b"	"a\bb"	"a\bb"










doing 14 "back references"
# ugh
expectMatch	14.1  RP	{a(b*)c\1}	abbcbb	abbcbb	bb
expectMatch	14.2  RP	{a(b*)c\1}	ac	ac	""
expectNomatch	14.3  RP	{a(b*)c\1}	abbcb
................................................................................
	"abbbbbbbbbbbc" abbbbbbbbbbbc b b b b b b b b b b
# but we're fussy about border cases -- guys who want octal should use the zero
expectError	15.9  -	{a((((((((((b\10))))))))))c}	ESUBREG
# BREs don't have octal, EREs don't have backrefs
expectMatch	15.10 MP	"a\\12b"	"a\nb"	"a\nb"
expectError	15.11 b		{a\12b}		ESUBREG
expectMatch	15.12 eAS	{a\12b}		a12b	a12b



doing 16 "expanded syntax"
expectMatch	16.1 xP		"a b c"		"abc"	"abc"
expectMatch	16.2 xP		"a b #oops\nc\td"	"abcd"	"abcd"
expectMatch	16.3 x		"a\\ b\\\tc"	"a b\tc"	"a b\tc"
expectMatch	16.4 xP		"a b\\#c"	"ab#c"	"ab#c"







|


|



|


>
>
>
>
>
>
>
>







 







>







622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
...
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
expectMatch	13.10 MP	"a\\cHb"	"a\bb"	"a\bb"
expectMatch	13.11 LMP	"a\\e"		"a\033"	"a\033"
expectMatch	13.12 P		"a\\fb"		"a\fb"	"a\fb"
expectMatch	13.13 P		"a\\nb"		"a\nb"	"a\nb"
expectMatch	13.14 P		"a\\rb"		"a\rb"	"a\rb"
expectMatch	13.15 P		"a\\tb"		"a\tb"	"a\tb"
expectMatch	13.16 P		"a\\u0008x"	"a\bx"	"a\bx"
expectMatch	13.17 P		{a\u008x}	"a\bx"	"a\bx"
expectMatch	13.18 P		"a\\u00088x"	"a\b8x"	"a\b8x"
expectMatch	13.19 P		"a\\U00000008x"	"a\bx"	"a\bx"
expectMatch	13.20 P		{a\U0000008x}	"a\bx"	"a\bx"
expectMatch	13.21 P		"a\\vb"		"a\vb"	"a\vb"
expectMatch	13.22 MP	"a\\x08x"	"a\bx"	"a\bx"
expectError	13.23 -		{a\xq}		EESCAPE
expectMatch	13.24 MP	"a\\x08x"	"a\bx"	"a\bx"
expectError	13.25 -		{a\z}		EESCAPE
expectMatch	13.26 MP	"a\\010b"	"a\bb"	"a\bb"
expectMatch	13.27 P		"a\\U00001234x"	"a\u1234x"	"a\u1234x"
expectMatch	13.28 P		{a\U00001234x}	"a\u1234x"	"a\u1234x"
expectMatch	13.29 P		"a\\U0001234x"	"a\u1234x"	"a\u1234x"
expectMatch	13.30 P		{a\U0001234x}	"a\u1234x"	"a\u1234x"
expectMatch	13.31 P		"a\\U000012345x"	"a\u12345x"	"a\u12345x"
expectMatch	13.32 P		{a\U000012345x}	"a\u12345x"	"a\u12345x"
expectMatch	13.33 P		"a\\U1000000x"	"a\ufffd0x"	"a\ufffd0x"
expectMatch	13.34 P		{a\U1000000x}	"a\ufffd0x"	"a\ufffd0x"


doing 14 "back references"
# ugh
expectMatch	14.1  RP	{a(b*)c\1}	abbcbb	abbcbb	bb
expectMatch	14.2  RP	{a(b*)c\1}	ac	ac	""
expectNomatch	14.3  RP	{a(b*)c\1}	abbcb
................................................................................
	"abbbbbbbbbbbc" abbbbbbbbbbbc b b b b b b b b b b
# but we're fussy about border cases -- guys who want octal should use the zero
expectError	15.9  -	{a((((((((((b\10))))))))))c}	ESUBREG
# BREs don't have octal, EREs don't have backrefs
expectMatch	15.10 MP	"a\\12b"	"a\nb"	"a\nb"
expectError	15.11 b		{a\12b}		ESUBREG
expectMatch	15.12 eAS	{a\12b}		a12b	a12b
expectMatch	15.13 MP	{a\701b}	a\u00381b	a\u00381b


doing 16 "expanded syntax"
expectMatch	16.1 xP		"a b c"		"abc"	"abc"
expectMatch	16.2 xP		"a b #oops\nc\td"	"abcd"	"abcd"
expectMatch	16.3 x		"a\\ b\\\tc"	"a b\tc"	"a b\tc"
expectMatch	16.4 xP		"a b\\#c"	"ab#c"	"ab#c"

Changes to tests/utf.test

167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182












183
184
185
186
187
188
189
bsCheck \14	12
bsCheck \141	97
bsCheck b\0	98
bsCheck \x	120
bsCheck \xa	10
bsCheck \xA	10
bsCheck \x41	65
bsCheck \x541	65
bsCheck \u	117
bsCheck \uk	117
bsCheck \u41	65
bsCheck \ua	10
bsCheck \uA	10
bsCheck \340	224
bsCheck \ua1	161
bsCheck \u4e21	20001













test utf-11.1 {Tcl_UtfToUpper} {
    string toupper {}
} {}
test utf-11.2 {Tcl_UtfToUpper} {
    string toupper abc
} ABC







|








>
>
>
>
>
>
>
>
>
>
>
>







167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
bsCheck \14	12
bsCheck \141	97
bsCheck b\0	98
bsCheck \x	120
bsCheck \xa	10
bsCheck \xA	10
bsCheck \x41	65
bsCheck \x541	84
bsCheck \u	117
bsCheck \uk	117
bsCheck \u41	65
bsCheck \ua	10
bsCheck \uA	10
bsCheck \340	224
bsCheck \ua1	161
bsCheck \u4e21	20001
bsCheck \741	60
bsCheck \U	85
bsCheck \Uk	85
bsCheck \U41	65
bsCheck \Ua	10
bsCheck \UA	10
bsCheck \Ua1	161
bsCheck \U4e21	20001
bsCheck \U004e21	20001
bsCheck \U00004e21	20001
bsCheck \U00110000	65533
bsCheck \Uffffffff	65533

test utf-11.1 {Tcl_UtfToUpper} {
    string toupper {}
} {}
test utf-11.2 {Tcl_UtfToUpper} {
    string toupper abc
} ABC