Tcl Source Code

Check-in [7cefe002e7]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add two new (undocumented) flags to the Tcl_ExternalToUtf() interface. TCL_ENCODING_NO_TERMINATE rejects the default behavior of appending a terminating NUL byte to the produced Utf output. This permits use of all of the dstLen bytes provided, and simplifies the buffer size calculations demanded from callers. Perhaps some callers need or appreciate this default behavior, but for Tcl's own main use of encodings - conversions within I/O - this just gets in the way. TCL_ENCODING_CHAR_LIMIT lets the caller set a limit on the number of chars to be output to be enforced by the encoding routines themselves. Without this, callers have to check after the fact for going beyond limits and make multiple encoding calls in a trial and error approach. Full compatibility is supported. No defaults are changed, and the flags have their effect even if an encoding driver has not been written to support these flags (but greater efficiency is enjoyed if they do!). All of Tcl's own encoding drivers are updated to support this. Other encoding drivers may exist somewhere, but I cannot point to any. A TIP to document this and make it officially supported may come in time.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 7cefe002e742bc53de10591f3c3b5d363889ecf4
User & Date: dgp 2014-12-23 18:39:35
Context
2015-01-02
22:42
Now that we're using TCL_ENCODING_NO_TERMINATE - be careful about acting on the contents of dst -- t... check-in: bd7e81c0fb user: dgp tags: trunk
2014-12-23
18:44
merge trunk check-in: 526ce54517 user: dgp tags: novem
18:39
Add two new (undocumented) flags to the Tcl_ExternalToUtf() interface. TCL_ENCODING_NO_TERMINATE rej... check-in: 7cefe002e7 user: dgp tags: trunk
17:53
Support TCL_ENCODING_CHAR_LIMIT in TableToUtfProc and EscapeToUtfProc drivers. Closed-Leaf check-in: 14b0ab81bb user: dgp tags: dgp-encoding-flags
02:41
Use more suitable variable name pushers. check-in: d9a4be6a30 user: dgp tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to generic/tcl.h.

2140
2141
2142
2143
2144
2145
2146















2147
2148
2149
2150
2151


2152
2153
2154
2155
2156
2157
2158
 *				immediately upon encountering an invalid byte
 *				sequence or a source character that has no
 *				mapping in the target encoding. If clear, then
 *				the converter will skip the problem,
 *				substituting one or more "close" characters in
 *				the destination buffer and then continue to
 *				convert the source.















 */

#define TCL_ENCODING_START		0x01
#define TCL_ENCODING_END		0x02
#define TCL_ENCODING_STOPONERROR	0x04



/*
 * The following definitions are the error codes returned by the conversion
 * routines:
 *
 * TCL_OK -			All characters were converted.
 * TCL_CONVERT_NOSPACE -	The output buffer would not have been large







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>





>
>







2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
 *				immediately upon encountering an invalid byte
 *				sequence or a source character that has no
 *				mapping in the target encoding. If clear, then
 *				the converter will skip the problem,
 *				substituting one or more "close" characters in
 *				the destination buffer and then continue to
 *				convert the source.
 * TCL_ENCODING_NO_TERMINATE - 	If set, Tcl_ExternalToUtf will not append a
 *				terminating NUL byte.  Knowing that it will
 *				not need space to do so, it will fill all
 *				dstLen bytes with encoded UTF-8 content, as
 *				other circumstances permit.  If clear, the
 *				default behavior is to reserve a byte in
 *				the dst space for NUL termination, and to
 *				append the NUL byte.
 * TCL_ENCODING_CHAR_LIMIT -	If set and dstCharsPtr is not NULL, then
 *				Tcl_ExternalToUtf takes the initial value
 *				of *dstCharsPtr is taken as a limit of the
 *				maximum number of chars to produce in the
 *				encoded UTF-8 content.  Otherwise, the 
 *				number of chars produced is controlled only
 *				by other limiting factors.
 */

#define TCL_ENCODING_START		0x01
#define TCL_ENCODING_END		0x02
#define TCL_ENCODING_STOPONERROR	0x04
#define TCL_ENCODING_NO_TERMINATE	0x08
#define TCL_ENCODING_CHAR_LIMIT		0x10

/*
 * The following definitions are the error codes returned by the conversion
 * routines:
 *
 * TCL_OK -			All characters were converted.
 * TCL_CONVERT_NOSPACE -	The output buffer would not have been large

Changes to generic/tclEncoding.c.

1202
1203
1204
1205
1206
1207
1208
1209



1210
1211
1212
1213
1214
1215
1216
				 * stored in the output buffer as a result of
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const Encoding *encodingPtr;
    int result, srcRead, dstWrote, dstChars;



    Tcl_EncodingState state;

    if (encoding == NULL) {
	encoding = systemEncoding;
    }
    encodingPtr = (Encoding *) encoding;








|
>
>
>







1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
				 * stored in the output buffer as a result of
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const Encoding *encodingPtr;
    int result, srcRead, dstWrote, dstChars = 0;
    int noTerminate = flags & TCL_ENCODING_NO_TERMINATE;
    int charLimited = (flags & TCL_ENCODING_CHAR_LIMIT) && dstCharsPtr;
    int maxChars = INT_MAX;
    Tcl_EncodingState state;

    if (encoding == NULL) {
	encoding = systemEncoding;
    }
    encodingPtr = (Encoding *) encoding;

1227
1228
1229
1230
1231
1232
1233



1234
1235

1236
1237
1238
1239

1240
1241
1242





1243
1244
1245










1246

1247
1248
1249
1250
1251
1252
1253
	srcReadPtr = &srcRead;
    }
    if (dstWrotePtr == NULL) {
	dstWrotePtr = &dstWrote;
    }
    if (dstCharsPtr == NULL) {
	dstCharsPtr = &dstChars;



    }


    /*
     * If there are any null characters in the middle of the buffer, they will
     * converted to the UTF-8 null character (\xC080). To get the actual \0 at
     * the end of the destination buffer, we need to append it manually.

     */

    dstLen--;





    result = encodingPtr->toUtfProc(encodingPtr->clientData, src, srcLen,
	    flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
	    dstCharsPtr);










    dst[*dstWrotePtr] = '\0';


    return result;
}

/*
 *-------------------------------------------------------------------------
 *







>
>
>


>
|
|
|
|
>
|

|
>
>
>
>
>
|
|
|
>
>
>
>
>
>
>
>
>
>
|
>







1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
	srcReadPtr = &srcRead;
    }
    if (dstWrotePtr == NULL) {
	dstWrotePtr = &dstWrote;
    }
    if (dstCharsPtr == NULL) {
	dstCharsPtr = &dstChars;
	flags &= ~TCL_ENCODING_CHAR_LIMIT;
    } else if (charLimited) {
	maxChars = *dstCharsPtr;
    }

    if (!noTerminate) {
	/*
	 * If there are any null characters in the middle of the buffer,
	 * they will converted to the UTF-8 null character (\xC080). To get
	 * the actual \0 at the end of the destination buffer, we need to
	 * append it manually.  First make room for it...
	 */

	dstLen--;
    }
    do {
	int savedFlags = flags;
	Tcl_EncodingState savedState = *statePtr;

	result = encodingPtr->toUtfProc(encodingPtr->clientData, src, srcLen,
		flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr,
		dstCharsPtr);
	if (*dstCharsPtr <= maxChars) {
	    break;
	}
	dstLen = Tcl_UtfAtIndex(dst, maxChars) - 1 - dst + TCL_UTF_MAX;
	flags = savedFlags;
	*statePtr = savedState;
    } while (1);
    if (!noTerminate) {
	/* ...and then append it */

	dst[*dstWrotePtr] = '\0';
    }

    return result;
}

/*
 *-------------------------------------------------------------------------
 *
2102
2103
2104
2105
2106
2107
2108



2109
2110
2111
2112
2113
2114
2115
{
    int result;

    result = TCL_OK;
    dstLen -= TCL_UTF_MAX - 1;
    if (dstLen < 0) {
	dstLen = 0;



    }
    if (srcLen > dstLen) {
	srcLen = dstLen;
	result = TCL_CONVERT_NOSPACE;
    }

    *srcReadPtr = srcLen;







>
>
>







2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
{
    int result;

    result = TCL_OK;
    dstLen -= TCL_UTF_MAX - 1;
    if (dstLen < 0) {
	dstLen = 0;
    }
    if ((flags & TCL_ENCODING_CHAR_LIMIT) && srcLen > *dstCharsPtr) {
	srcLen = *dstCharsPtr;
    }
    if (srcLen > dstLen) {
	srcLen = dstLen;
	result = TCL_CONVERT_NOSPACE;
    }

    *srcReadPtr = srcLen;
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280



2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
				 * output buffer. */
    int pureNullMode)		/* Convert embedded nulls from internal
				 * representation to real null-bytes or vice
				 * versa. */
{
    const char *srcStart, *srcEnd, *srcClose;
    const char *dstStart, *dstEnd;
    int result, numChars;
    Tcl_UniChar ch;

    result = TCL_OK;

    srcStart = src;
    srcEnd = src + srcLen;
    srcClose = srcEnd;
    if ((flags & TCL_ENCODING_END) == 0) {
	srcClose -= TCL_UTF_MAX;
    }




    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    for (numChars = 0; src < srcEnd; numChars++) {
	if ((src > srcClose) && (!Tcl_UtfCharComplete(src, srcEnd - src))) {
	    /*
	     * If there is more string to follow, this will ensure that the
	     * last UTF-8 character in the source buffer hasn't been cut off.
	     */

	    result = TCL_CONVERT_MULTIBYTE;







|










>
>
>




|







2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
				 * output buffer. */
    int pureNullMode)		/* Convert embedded nulls from internal
				 * representation to real null-bytes or vice
				 * versa. */
{
    const char *srcStart, *srcEnd, *srcClose;
    const char *dstStart, *dstEnd;
    int result, numChars, charLimit = INT_MAX;
    Tcl_UniChar ch;

    result = TCL_OK;

    srcStart = src;
    srcEnd = src + srcLen;
    srcClose = srcEnd;
    if ((flags & TCL_ENCODING_END) == 0) {
	srcClose -= TCL_UTF_MAX;
    }
    if (flags & TCL_ENCODING_CHAR_LIMIT) {
	charLimit = *dstCharsPtr;
    }

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    for (numChars = 0; src < srcEnd && numChars <= charLimit; numChars++) {
	if ((src > srcClose) && (!Tcl_UtfCharComplete(src, srcEnd - src))) {
	    /*
	     * If there is more string to follow, this will ensure that the
	     * last UTF-8 character in the source buffer hasn't been cut off.
	     */

	    result = TCL_CONVERT_MULTIBYTE;
2374
2375
2376
2377
2378
2379
2380
2381
2382
2383



2384
2385
2386
2387
2388
2389
2390
2391
2392
2393
2394
2395
2396
2397
2398
2399
2400
2401
2402
2403
2404
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart;
    int result, numChars;
    Tcl_UniChar ch;




    result = TCL_OK;
    if ((srcLen % sizeof(Tcl_UniChar)) != 0) {
	result = TCL_CONVERT_MULTIBYTE;
	srcLen /= sizeof(Tcl_UniChar);
	srcLen *= sizeof(Tcl_UniChar);
    }

    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    for (numChars = 0; src < srcEnd; numChars++) {
	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}

	/*
	 * Special case for 1-byte utf chars for speed. Make sure we work with







|


>
>
>













|







2404
2405
2406
2407
2408
2409
2410
2411
2412
2413
2414
2415
2416
2417
2418
2419
2420
2421
2422
2423
2424
2425
2426
2427
2428
2429
2430
2431
2432
2433
2434
2435
2436
2437
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart;
    int result, numChars, charLimit = INT_MAX;
    Tcl_UniChar ch;

    if (flags & TCL_ENCODING_CHAR_LIMIT) {
	charLimit = *dstCharsPtr;
    }
    result = TCL_OK;
    if ((srcLen % sizeof(Tcl_UniChar)) != 0) {
	result = TCL_CONVERT_MULTIBYTE;
	srcLen /= sizeof(Tcl_UniChar);
	srcLen *= sizeof(Tcl_UniChar);
    }

    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    for (numChars = 0; src < srcEnd && numChars <= charLimit; numChars++) {
	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}

	/*
	 * Special case for 1-byte utf chars for speed. Make sure we work with
2558
2559
2560
2561
2562
2563
2564
2565
2566
2567
2568
2569
2570



2571
2572
2573
2574
2575
2576
2577
2578
2579
2580
2581
2582
2583
2584
2585
2586
2587
2588
2589
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart, *prefixBytes;
    int result, byte, numChars;
    Tcl_UniChar ch;
    const unsigned short *const *toUnicode;
    const unsigned short *pageZero;
    TableEncodingData *dataPtr = clientData;




    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    toUnicode = (const unsigned short *const *) dataPtr->toUnicode;
    prefixBytes = dataPtr->prefixBytes;
    pageZero = toUnicode[0];

    result = TCL_OK;
    for (numChars = 0; src < srcEnd; numChars++) {
	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	byte = *((unsigned char *) src);
	if (prefixBytes[byte]) {
	    src++;







|





>
>
>











|







2591
2592
2593
2594
2595
2596
2597
2598
2599
2600
2601
2602
2603
2604
2605
2606
2607
2608
2609
2610
2611
2612
2613
2614
2615
2616
2617
2618
2619
2620
2621
2622
2623
2624
2625
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart, *prefixBytes;
    int result, byte, numChars, charLimit = INT_MAX;
    Tcl_UniChar ch;
    const unsigned short *const *toUnicode;
    const unsigned short *pageZero;
    TableEncodingData *dataPtr = clientData;

    if (flags & TCL_ENCODING_CHAR_LIMIT) {
	charLimit = *dstCharsPtr;
    }
    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    toUnicode = (const unsigned short *const *) dataPtr->toUnicode;
    prefixBytes = dataPtr->prefixBytes;
    pageZero = toUnicode[0];

    result = TCL_OK;
    for (numChars = 0; src < srcEnd && numChars <= charLimit; numChars++) {
	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	byte = *((unsigned char *) src);
	if (prefixBytes[byte]) {
	    src++;
2789
2790
2791
2792
2793
2794
2795
2796
2797



2798
2799
2800
2801
2802
2803
2804
2805
2806
2807
2808
2809
2810
2811
2812
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart;
    int result, numChars;




    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    result = TCL_OK;
    for (numChars = 0; src < srcEnd; numChars++) {
	Tcl_UniChar ch;

	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	ch = (Tcl_UniChar) *((unsigned char *) src);







|

>
>
>







|







2825
2826
2827
2828
2829
2830
2831
2832
2833
2834
2835
2836
2837
2838
2839
2840
2841
2842
2843
2844
2845
2846
2847
2848
2849
2850
2851
				 * the conversion. */
    int *dstCharsPtr)		/* Filled with the number of characters that
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    const char *srcStart, *srcEnd;
    const char *dstEnd, *dstStart;
    int result, numChars, charLimit = INT_MAX;

    if (flags & TCL_ENCODING_CHAR_LIMIT) {
	charLimit = *dstCharsPtr;
    }
    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    result = TCL_OK;
    for (numChars = 0; src < srcEnd && numChars <= charLimit; numChars++) {
	Tcl_UniChar ch;

	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	ch = (Tcl_UniChar) *((unsigned char *) src);
3014
3015
3016
3017
3018
3019
3020
3021
3022
3023



3024
3025
3026
3027
3028
3029
3030
3031
3032
3033
3034
3035
3036
3037
3038
3039
3040
3041
3042
3043
3044
3045
3046
3047
3048
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    EscapeEncodingData *dataPtr = clientData;
    const char *prefixBytes, *tablePrefixBytes, *srcStart, *srcEnd;
    const unsigned short *const *tableToUnicode;
    const Encoding *encodingPtr;
    int state, result, numChars;
    const char *dstStart, *dstEnd;




    result = TCL_OK;
    tablePrefixBytes = NULL;	/* lint. */
    tableToUnicode = NULL;	/* lint. */
    prefixBytes = dataPtr->prefixBytes;
    encodingPtr = NULL;

    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    state = PTR2INT(*statePtr);
    if (flags & TCL_ENCODING_START) {
	state = 0;
    }

    for (numChars = 0; src < srcEnd; ) {
	int byte, hi, lo, ch;

	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	byte = *((unsigned char *) src);







|


>
>
>

















|







3053
3054
3055
3056
3057
3058
3059
3060
3061
3062
3063
3064
3065
3066
3067
3068
3069
3070
3071
3072
3073
3074
3075
3076
3077
3078
3079
3080
3081
3082
3083
3084
3085
3086
3087
3088
3089
3090
				 * correspond to the bytes stored in the
				 * output buffer. */
{
    EscapeEncodingData *dataPtr = clientData;
    const char *prefixBytes, *tablePrefixBytes, *srcStart, *srcEnd;
    const unsigned short *const *tableToUnicode;
    const Encoding *encodingPtr;
    int state, result, numChars, charLimit = INT_MAX;
    const char *dstStart, *dstEnd;

    if (flags & TCL_ENCODING_CHAR_LIMIT) {
	charLimit = *dstCharsPtr;
    }
    result = TCL_OK;
    tablePrefixBytes = NULL;	/* lint. */
    tableToUnicode = NULL;	/* lint. */
    prefixBytes = dataPtr->prefixBytes;
    encodingPtr = NULL;

    srcStart = src;
    srcEnd = src + srcLen;

    dstStart = dst;
    dstEnd = dst + dstLen - TCL_UTF_MAX;

    state = PTR2INT(*statePtr);
    if (flags & TCL_ENCODING_START) {
	state = 0;
    }

    for (numChars = 0; src < srcEnd && numChars <= charLimit; ) {
	int byte, hi, lo, ch;

	if (dst > dstEnd) {
	    result = TCL_CONVERT_NOSPACE;
	    break;
	}
	byte = *((unsigned char *) src);

Changes to generic/tclIO.c.

4574
4575
4576
4577
4578
4579
4580
4581
4582
4583
4584
4585
4586

4587
4588
4589
4590
4591
4592
4593
4594
4595
	    if (GotFlag(statePtr, INPUT_SAW_CR)) {
		ResetFlag(statePtr, INPUT_SAW_CR);
		if ((eol < dstEnd) && (*eol == '\n')) {
		    /*
		     * Skip the raw bytes that make up the '\n'.
		     */

		    char tmp[1 + TCL_UTF_MAX];
		    int rawRead;

		    bufPtr = gs.bufPtr;
		    Tcl_ExternalToUtf(NULL, gs.encoding, RemovePoint(bufPtr),
			    gs.rawRead, statePtr->inputEncodingFlags,

			    &gs.state, tmp, 1 + TCL_UTF_MAX, &rawRead, NULL,
			    NULL);
		    bufPtr->nextRemoved += rawRead;
		    gs.rawRead -= rawRead;
		    gs.bytesWrote--;
		    gs.charsWrote--;
		    memmove(dst, dst + 1, (size_t) (dstEnd - dst));
		    dstEnd--;
		}







|




|
>
|
<







4574
4575
4576
4577
4578
4579
4580
4581
4582
4583
4584
4585
4586
4587
4588

4589
4590
4591
4592
4593
4594
4595
	    if (GotFlag(statePtr, INPUT_SAW_CR)) {
		ResetFlag(statePtr, INPUT_SAW_CR);
		if ((eol < dstEnd) && (*eol == '\n')) {
		    /*
		     * Skip the raw bytes that make up the '\n'.
		     */

		    char tmp[TCL_UTF_MAX];
		    int rawRead;

		    bufPtr = gs.bufPtr;
		    Tcl_ExternalToUtf(NULL, gs.encoding, RemovePoint(bufPtr),
			    gs.rawRead, statePtr->inputEncodingFlags
				| TCL_ENCODING_NO_TERMINATE, &gs.state, tmp,
			    TCL_UTF_MAX, &rawRead, NULL, NULL);

		    bufPtr->nextRemoved += rawRead;
		    gs.rawRead -= rawRead;
		    gs.bytesWrote--;
		    gs.charsWrote--;
		    memmove(dst, dst + 1, (size_t) (dstEnd - dst));
		    dstEnd--;
		}
4682
4683
4684
4685
4686
4687
4688

4689
4690
4691
4692
4693
4694
4695
4696
4697

    bufPtr = gs.bufPtr;
    if (bufPtr == NULL) {
	Tcl_Panic("Tcl_GetsObj: gotEOL reached with bufPtr==NULL");
    }
    statePtr->inputEncodingState = gs.state;
    Tcl_ExternalToUtf(NULL, gs.encoding, RemovePoint(bufPtr), gs.rawRead,

	    statePtr->inputEncodingFlags, &statePtr->inputEncodingState, dst,
	    eol - dst + skip + TCL_UTF_MAX, &gs.rawRead, NULL,
	    &gs.charsWrote);
    bufPtr->nextRemoved += gs.rawRead;

    /*
     * Recycle all the emptied buffers.
     */








>
|
|







4682
4683
4684
4685
4686
4687
4688
4689
4690
4691
4692
4693
4694
4695
4696
4697
4698

    bufPtr = gs.bufPtr;
    if (bufPtr == NULL) {
	Tcl_Panic("Tcl_GetsObj: gotEOL reached with bufPtr==NULL");
    }
    statePtr->inputEncodingState = gs.state;
    Tcl_ExternalToUtf(NULL, gs.encoding, RemovePoint(bufPtr), gs.rawRead,
	    statePtr->inputEncodingFlags | TCL_ENCODING_NO_TERMINATE,
	    &statePtr->inputEncodingState, dst,
	    eol - dst + skip + TCL_UTF_MAX - 1, &gs.rawRead, NULL,
	    &gs.charsWrote);
    bufPtr->nextRemoved += gs.rawRead;

    /*
     * Recycle all the emptied buffers.
     */

5215
5216
5217
5218
5219
5220
5221
5222
5223
5224
5225
5226
5227
5228
5229
5230
5231
	}
	spaceLeft = length - offset;
	dst = objPtr->bytes + offset;
	*gsPtr->dstPtr = dst;
    }
    gsPtr->state = statePtr->inputEncodingState;
    result = Tcl_ExternalToUtf(NULL, gsPtr->encoding, raw, rawLen,
	    statePtr->inputEncodingFlags, &statePtr->inputEncodingState,
	    dst, spaceLeft+1, &gsPtr->rawRead, &gsPtr->bytesWrote,
	    &gsPtr->charsWrote);

    /*
     * Make sure that if we go through 'gets', that we reset the
     * TCL_ENCODING_START flag still. [Bug #523988]
     */

    statePtr->inputEncodingFlags &= ~TCL_ENCODING_START;







|
|
|







5216
5217
5218
5219
5220
5221
5222
5223
5224
5225
5226
5227
5228
5229
5230
5231
5232
	}
	spaceLeft = length - offset;
	dst = objPtr->bytes + offset;
	*gsPtr->dstPtr = dst;
    }
    gsPtr->state = statePtr->inputEncodingState;
    result = Tcl_ExternalToUtf(NULL, gsPtr->encoding, raw, rawLen,
	    statePtr->inputEncodingFlags | TCL_ENCODING_NO_TERMINATE,
	    &statePtr->inputEncodingState, dst, spaceLeft, &gsPtr->rawRead,
	    &gsPtr->bytesWrote, &gsPtr->charsWrote);

    /*
     * Make sure that if we go through 'gets', that we reset the
     * TCL_ENCODING_START flag still. [Bug #523988]
     */

    statePtr->inputEncodingFlags &= ~TCL_ENCODING_START;
5924
5925
5926
5927
5928
5929
5930
5931
5932
5933
5934
5935
5936
5937
5938
5939
5940
5941
5942
5943
5944
5945
5946
5947
5948
5949
5950
5951
5952
5953
5954
5955
5956
5957
5958
5959
5960
5961
5962
5963
5964
5965
5966
5967
5968
5969
5970
5971
5972
5973
5974
5975
5976
5977
5978
5979
5980






5981
5982
5983
5984
5985
5986
5987
5988
5989
5990
5991
5992
5993
5994
5995
5996
5997
    Tcl_Encoding encoding = statePtr->encoding? statePtr->encoding
	    : GetBinaryEncoding();
    Tcl_EncodingState savedState = statePtr->inputEncodingState;
    ChannelBuffer *bufPtr = statePtr->inQueueHead;
    int savedIEFlags = statePtr->inputEncodingFlags;
    int savedFlags = statePtr->flags;
    char *dst, *src = RemovePoint(bufPtr);
    int dstLimit, numBytes, srcLen = BytesLeft(bufPtr);

    /*
     * One src byte can yield at most one character.  So when the
     * number of src bytes we plan to read is less than the limit on
     * character count to be read, clearly we will remain within that
     * limit, and we can use the value of "srcLen" as a tighter limit
     * for sizing receiving buffers.
     */

    int toRead = ((charsToRead<0)||(charsToRead > srcLen)) ? srcLen : charsToRead;

    /*
     * 'factor' is how much we guess that the bytes in the source buffer will
     * expand when converted to UTF-8 chars. This guess comes from analyzing
     * how many characters were produced by the previous pass.
     */
    
    int factor = *factorPtr;
    int dstNeeded = TCL_UTF_MAX - 1 + toRead * factor / UTF_EXPANSION_FACTOR;

    (void) TclGetStringFromObj(objPtr, &numBytes);
    Tcl_AppendToObj(objPtr, NULL, dstNeeded);
    if (toRead == srcLen) {
	unsigned int size;
	dst = TclGetStringStorage(objPtr, &size) + numBytes;
	dstNeeded = size - numBytes;
    } else {
	dst = TclGetString(objPtr) + numBytes;
    }

    /*
     * This routine is burdened with satisfying several constraints.
     * It cannot append more than 'charsToRead` chars onto objPtr.
     * This is measured after encoding and translation transformations
     * are completed.  There is no precise number of src bytes that can
     * be associated with the limit.  Yet, when we are done, we must know
     * precisely the number of src bytes that were consumed to produce
     * the appended chars, so that all subsequent bytes are left in
     * the buffers for future read operations.
     *
     * The consequence is that we have no choice but to implement a
     * "trial and error" approach, where in general we may need to
     * perform transformations and copies multiple times to achieve
     * a consistent set of results.  This takes the shape of a loop.
     */

    dstLimit = dstNeeded + 1;
    while (1) {
	int dstDecoded, dstRead, dstWrote, srcRead, numChars;







	/*
	 * Perform the encoding transformation.  Read no more than
	 * srcLen bytes, write no more than dstLimit bytes.
	 */

	int code = Tcl_ExternalToUtf(NULL, encoding, src, srcLen,
		statePtr->inputEncodingFlags & (bufPtr->nextPtr
		? ~0 : ~TCL_ENCODING_END), &statePtr->inputEncodingState,
		dst, dstLimit, &srcRead, &dstDecoded, &numChars);

	/*
	 * Perform the translation transformation in place.  Read no more
	 * than the dstDecoded bytes the encoding transformation actually
	 * produced.  Capture the number of bytes written in dstWrote.
	 * Capture the number of bytes actually consumed in dstRead.
	 */







|


















|


|



|




















<

|
>
>
>
>
>
>






|
|
|
|







5925
5926
5927
5928
5929
5930
5931
5932
5933
5934
5935
5936
5937
5938
5939
5940
5941
5942
5943
5944
5945
5946
5947
5948
5949
5950
5951
5952
5953
5954
5955
5956
5957
5958
5959
5960
5961
5962
5963
5964
5965
5966
5967
5968
5969
5970
5971
5972
5973
5974
5975
5976
5977
5978

5979
5980
5981
5982
5983
5984
5985
5986
5987
5988
5989
5990
5991
5992
5993
5994
5995
5996
5997
5998
5999
6000
6001
6002
6003
    Tcl_Encoding encoding = statePtr->encoding? statePtr->encoding
	    : GetBinaryEncoding();
    Tcl_EncodingState savedState = statePtr->inputEncodingState;
    ChannelBuffer *bufPtr = statePtr->inQueueHead;
    int savedIEFlags = statePtr->inputEncodingFlags;
    int savedFlags = statePtr->flags;
    char *dst, *src = RemovePoint(bufPtr);
    int numBytes, srcLen = BytesLeft(bufPtr);

    /*
     * One src byte can yield at most one character.  So when the
     * number of src bytes we plan to read is less than the limit on
     * character count to be read, clearly we will remain within that
     * limit, and we can use the value of "srcLen" as a tighter limit
     * for sizing receiving buffers.
     */

    int toRead = ((charsToRead<0)||(charsToRead > srcLen)) ? srcLen : charsToRead;

    /*
     * 'factor' is how much we guess that the bytes in the source buffer will
     * expand when converted to UTF-8 chars. This guess comes from analyzing
     * how many characters were produced by the previous pass.
     */
    
    int factor = *factorPtr;
    int dstLimit = TCL_UTF_MAX - 1 + toRead * factor / UTF_EXPANSION_FACTOR;

    (void) TclGetStringFromObj(objPtr, &numBytes);
    Tcl_AppendToObj(objPtr, NULL, dstLimit);
    if (toRead == srcLen) {
	unsigned int size;
	dst = TclGetStringStorage(objPtr, &size) + numBytes;
	dstLimit = size - numBytes;
    } else {
	dst = TclGetString(objPtr) + numBytes;
    }

    /*
     * This routine is burdened with satisfying several constraints.
     * It cannot append more than 'charsToRead` chars onto objPtr.
     * This is measured after encoding and translation transformations
     * are completed.  There is no precise number of src bytes that can
     * be associated with the limit.  Yet, when we are done, we must know
     * precisely the number of src bytes that were consumed to produce
     * the appended chars, so that all subsequent bytes are left in
     * the buffers for future read operations.
     *
     * The consequence is that we have no choice but to implement a
     * "trial and error" approach, where in general we may need to
     * perform transformations and copies multiple times to achieve
     * a consistent set of results.  This takes the shape of a loop.
     */


    while (1) {
	int dstDecoded, dstRead, dstWrote, srcRead, numChars, code;
	int flags = statePtr->inputEncodingFlags | TCL_ENCODING_NO_TERMINATE;
	
	if (charsToRead > 0) {
	    flags |= TCL_ENCODING_CHAR_LIMIT;
	    numChars = charsToRead;
	}

	/*
	 * Perform the encoding transformation.  Read no more than
	 * srcLen bytes, write no more than dstLimit bytes.
	 */

	code = Tcl_ExternalToUtf(NULL, encoding, src, srcLen,
		flags & (bufPtr->nextPtr ? ~0 : ~TCL_ENCODING_END),
		&statePtr->inputEncodingState, dst, dstLimit, &srcRead,
		&dstDecoded, &numChars);

	/*
	 * Perform the translation transformation in place.  Read no more
	 * than the dstDecoded bytes the encoding transformation actually
	 * produced.  Capture the number of bytes written in dstWrote.
	 * Capture the number of bytes actually consumed in dstRead.
	 */
6046
6047
6048
6049
6050
6051
6052
6053
6054
6055
6056
6057
6058
6059
6060
		    /*
		     * There are chars leading the buffer before the eof
		     * char.  Adjust the dstLimit so we go back and read
		     * only those and do not encounter the eof char this
		     * time.
		     */

		    dstLimit = dstRead + TCL_UTF_MAX;
		    statePtr->flags = savedFlags;
		    statePtr->inputEncodingFlags = savedIEFlags;
		    statePtr->inputEncodingState = savedState;
		    continue;
		}
	    }








|







6052
6053
6054
6055
6056
6057
6058
6059
6060
6061
6062
6063
6064
6065
6066
		    /*
		     * There are chars leading the buffer before the eof
		     * char.  Adjust the dstLimit so we go back and read
		     * only those and do not encounter the eof char this
		     * time.
		     */

		    dstLimit = dstRead - 1 + TCL_UTF_MAX;
		    statePtr->flags = savedFlags;
		    statePtr->inputEncodingFlags = savedIEFlags;
		    statePtr->inputEncodingState = savedState;
		    continue;
		}
	    }

6072
6073
6074
6075
6076
6077
6078
6079
6080
6081
6082
6083
6084
6085
6086
6087
6088
6089
6090
6091
6092
6093
6094
6095
6096
6097
6098
6099
6100
6101
6102
6103
6104
6105
6106
6107
6108

6109
6110
6111
6112
6113
6114
6115
6116
6117
6118
6119
6120
6121
6122
6123
6124
6125
6126
6127
6128
6129
		/*
		 * There are chars we can read before we hit the bare cr.
		 * Go back with a smaller dstLimit so we get them in the
		 * next pass, compute a matching srcRead, and don't end
		 * up back here in this call.
		 */

		dstLimit = dstRead + TCL_UTF_MAX;
		statePtr->flags = savedFlags;
		statePtr->inputEncodingFlags = savedIEFlags;
		statePtr->inputEncodingState = savedState;
		continue;
	    }

	    assert(dstWrote == 0);
	    assert(dstRead == 0);

	    /*
	     * We decoded only the bare cr, and we cannot read a
	     * translated char from that alone.  We have to know what's
	     * next.  So why do we only have the one decoded char?
	     */

	    if (code != TCL_OK) {
		char buffer[TCL_UTF_MAX + 2];
		int read, decoded, count;

		/* 
		 * Didn't get everything the buffer could offer
		 */

		statePtr->flags = savedFlags;
		statePtr->inputEncodingFlags = savedIEFlags;
		statePtr->inputEncodingState = savedState;

		Tcl_ExternalToUtf(NULL, encoding, src, srcLen,
		statePtr->inputEncodingFlags & (bufPtr->nextPtr

		? ~0 : ~TCL_ENCODING_END), &statePtr->inputEncodingState,
		buffer, TCL_UTF_MAX + 2, &read, &decoded, &count);

		if (count == 2) {
		    if (buffer[1] == '\n') {
			/* \r\n translate to \n */
			dst[0] = '\n';
			bufPtr->nextRemoved += read;
		    } else {
			dst[0] = '\r';
			bufPtr->nextRemoved += srcRead;
		    }

		    dst[1] = '\0';
		    statePtr->inputEncodingFlags &= ~TCL_ENCODING_START;

		    Tcl_SetObjLength(objPtr, numBytes + 1);
		    return 1;
		}

	    } else if (statePtr->flags & CHANNEL_EOF) {







|
















|











|
>
|
|











<







6078
6079
6080
6081
6082
6083
6084
6085
6086
6087
6088
6089
6090
6091
6092
6093
6094
6095
6096
6097
6098
6099
6100
6101
6102
6103
6104
6105
6106
6107
6108
6109
6110
6111
6112
6113
6114
6115
6116
6117
6118
6119
6120
6121
6122
6123
6124
6125
6126
6127
6128

6129
6130
6131
6132
6133
6134
6135
		/*
		 * There are chars we can read before we hit the bare cr.
		 * Go back with a smaller dstLimit so we get them in the
		 * next pass, compute a matching srcRead, and don't end
		 * up back here in this call.
		 */

		dstLimit = dstRead - 1 + TCL_UTF_MAX;
		statePtr->flags = savedFlags;
		statePtr->inputEncodingFlags = savedIEFlags;
		statePtr->inputEncodingState = savedState;
		continue;
	    }

	    assert(dstWrote == 0);
	    assert(dstRead == 0);

	    /*
	     * We decoded only the bare cr, and we cannot read a
	     * translated char from that alone.  We have to know what's
	     * next.  So why do we only have the one decoded char?
	     */

	    if (code != TCL_OK) {
		char buffer[TCL_UTF_MAX + 1];
		int read, decoded, count;

		/* 
		 * Didn't get everything the buffer could offer
		 */

		statePtr->flags = savedFlags;
		statePtr->inputEncodingFlags = savedIEFlags;
		statePtr->inputEncodingState = savedState;

		Tcl_ExternalToUtf(NULL, encoding, src, srcLen,
		(statePtr->inputEncodingFlags | TCL_ENCODING_NO_TERMINATE)
		& (bufPtr->nextPtr ? ~0 : ~TCL_ENCODING_END),
		&statePtr->inputEncodingState, buffer, TCL_UTF_MAX + 1,
		&read, &decoded, &count);

		if (count == 2) {
		    if (buffer[1] == '\n') {
			/* \r\n translate to \n */
			dst[0] = '\n';
			bufPtr->nextRemoved += read;
		    } else {
			dst[0] = '\r';
			bufPtr->nextRemoved += srcRead;
		    }


		    statePtr->inputEncodingFlags &= ~TCL_ENCODING_START;

		    Tcl_SetObjLength(objPtr, numBytes + 1);
		    return 1;
		}

	    } else if (statePtr->flags & CHANNEL_EOF) {
6156
6157
6158
6159
6160
6161
6162


6163
6164
6165
6166
6167
6168
6169
6170
6171
6172
6173
6174
6175
6176
	 */

	numChars -= (dstRead - dstWrote);

	if (charsToRead > 0 && numChars > charsToRead) {

	    /* 


	     * We read more chars than allowed.  Reset limits to
	     * prevent that and try again.  Don't forget the extra
	     * padding of TCL_UTF_MAX bytes demanded by the
	     * Tcl_ExternalToUtf() call!
	     */

	    dstLimit = Tcl_UtfAtIndex(dst, charsToRead) + TCL_UTF_MAX - dst;
	    statePtr->flags = savedFlags;
	    statePtr->inputEncodingFlags = savedIEFlags;
	    statePtr->inputEncodingState = savedState;
	    continue;
	}

	if (dstWrote == 0) {







>
>






|







6162
6163
6164
6165
6166
6167
6168
6169
6170
6171
6172
6173
6174
6175
6176
6177
6178
6179
6180
6181
6182
6183
6184
	 */

	numChars -= (dstRead - dstWrote);

	if (charsToRead > 0 && numChars > charsToRead) {

	    /* 
	     * TODO: This cannot happen anymore.
	     *
	     * We read more chars than allowed.  Reset limits to
	     * prevent that and try again.  Don't forget the extra
	     * padding of TCL_UTF_MAX bytes demanded by the
	     * Tcl_ExternalToUtf() call!
	     */

	    dstLimit = Tcl_UtfAtIndex(dst, charsToRead) - 1 + TCL_UTF_MAX - dst;
	    statePtr->flags = savedFlags;
	    statePtr->inputEncodingFlags = savedIEFlags;
	    statePtr->inputEncodingState = savedState;
	    continue;
	}

	if (dstWrote == 0) {