Tcl Source Code

Check-in [d42a114238]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Added long comment explaining history and work in progress making bytearray interfaces usable.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: d42a114238e3e3c7ee99986099b6805adcf6180d
User & Date: dgp 2016-12-02 18:18:59
Context
2016-12-02
21:08
Avoid "warning: format '%llu' expects argument of type 'long long unsigned int', but argument 2 has ... check-in: 871fcf85c4 user: andy tags: trunk
21:00
Merge trunk check-in: 4ab23eded7 user: andy tags: amg-array-enum-c-api
19:02
merge trunk check-in: 237dd7902f user: dgp tags: novem
19:01
merge trunk check-in: 33041ae916 user: dgp tags: tip-445
18:18
Added long comment explaining history and work in progress making bytearray interfaces usable. check-in: d42a114238 user: dgp tags: trunk
15:39
Relocate the cmdEpoch bumps during command deletion so that command resolution caching works properl... check-in: 71fa1f9c91 user: dgp tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to generic/tclBinary.c.

151
152
153
154
155
156
157
158


159





160



161
162
163
164
165

166
167
168

























169
170
171

172







173







174
175
176
177







178



179









180
181
182
183
184
185
186
    { "hex",      BinaryDecodeHex, TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { "uuencode", BinaryDecodeUu,  TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { "base64",   BinaryDecode64,  TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { NULL, NULL, NULL, NULL, NULL, 0 }
};

/*
 * The following object type represents an array of bytes. An array of bytes


 * is not equivalent to an internationalized string. Conceptually, a string is





 * an array of 16-bit quantities organized as a sequence of properly formed



 * UTF-8 characters, while a ByteArray is an array of 8-bit quantities.
 * Accessor functions are provided to convert a ByteArray to a String or a
 * String to a ByteArray. Two or more consecutive bytes in an array of bytes
 * may look like a single UTF-8 character if the array is casually treated as
 * a string. But obtaining the String from a ByteArray is guaranteed to

 * produced properly formed UTF-8 sequences so that there is a one-to-one map
 * between bytes and characters.
 *

























 * Converting a ByteArray to a String proceeds by casting each byte in the
 * array to a 16-bit quantity, treating that number as a Unicode character,
 * and storing the UTF-8 version of that Unicode character in the String. For

 * ByteArrays consisting entirely of values 1..127, the corresponding String







 * representation is the same as the ByteArray representation.







 *
 * Converting a String to a ByteArray proceeds by getting the Unicode
 * representation of each character in the String, casting it to a byte by
 * truncating the upper 8 bits, and then storing the byte in the ByteArray.







 * Converting from ByteArray to String and back to ByteArray is not lossy, but



 * converting an arbitrary String to a ByteArray may be.









 */

static const Tcl_ObjType properByteArrayType = {
    "bytearray",
    FreeByteArrayInternalRep,
    DupByteArrayInternalRep,
    UpdateStringOfByteArray,







|
>
>
|
>
>
>
>
>
|
>
>
>
|
<
<
<
<
>
|
|
|
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|
<
|
>
|
>
>
>
>
>
>
>
|
>
>
>
>
>
>
>

|
|
|
>
>
>
>
>
>
>
|
>
>
>
|
>
>
>
>
>
>
>
>
>







151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171




172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201

202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
    { "hex",      BinaryDecodeHex, TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { "uuencode", BinaryDecodeUu,  TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { "base64",   BinaryDecode64,  TclCompileBasic1Or2ArgCmd, NULL, NULL, 0 },
    { NULL, NULL, NULL, NULL, NULL, 0 }
};

/*
 * The following object types represent an array of bytes. The intent is
 * to allow arbitrary binary data to pass through Tcl as a Tcl value
 * without loss or damage. Such values are useful for things like
 * encoded strings or Tk images to name just two.
 *
 * It's strange to have two Tcl_ObjTypes in place for this task when
 * one would do, so a bit of detail and history how we got to this point
 * and where we might go from here.
 *
 * A bytearray is an ordered sequence of bytes. Each byte is an integer
 * value in the range [0-255].  To be a Tcl value type, we need a way to
 * encode each value in the value set as a Tcl string.  The simplest
 * encoding is to represent each byte value as the same codepoint value.
 * A bytearray of N bytes is encoded into a Tcl string of N characters




 * where the codepoint of each character is the value of corresponding byte.
 * This approach creates a one-to-one map between all bytearray values
 * and a subset of Tcl string values.
 * 
 * When converting a Tcl string value to the bytearray internal rep, the
 * question arises what to do with strings outside that subset?  That is,
 * those Tcl strings containing at least one codepoint greater than 255?
 * The obviously correct answer is to raise an error!  That string value
 * does not represent any valid bytearray value. Full Stop.  The
 * setFromAnyProc signature has a completion code return value for just
 * this reason, to reject invalid inputs.
 * 
 * Unfortunately this was not the path taken by the authors of the
 * original tclByteArrayType.  They chose to accept all Tcl string values
 * as acceptable string encodings of the bytearray values that result
 * from masking away the high bits of any codepoint value at all. This
 * meant that every bytearray value had multiple accepted string
 * representations.
 *
 * The implications of this choice are truly ugly.  When a Tcl value has
 * a string representation, we are required to accept that as the true
 * value.  Bytearray values that possess a string representation cannot
 * be processed as bytearrays because we cannot know which true value
 * that bytearray represents.  The consequence is that we drag around
 * an internal rep that we cannot make any use of.  This painful price
 * is extracted at any point after a string rep happens to be generated
 * for the value.  This happens even when the troublesome codepoints
 * outside the byte range never show up.  This happens rather routinely
 * in normal Tcl operations unless we burden the script writer with the
 * cognitive burden of avoiding it.  The price is also paid by callers

 * of the C interface.  The routine
 *
 *	unsigned char *Tcl_GetByteArrayFromObj(objPtr, lenPtr)
 *
 * has a guarantee to always return a non-NULL value, but that value
 * points to a byte sequence that cannot be used by the caller to  
 * process the Tcl value absent some sideband testing that objPtr
 * is "pure".  Tcl offers no public interface to perform this test,
 * so callers either break encapsulation or are unavoidably buggy.  Tcl
 * has defined a public interface that cannot be used correctly. The
 * Tcl source code itself suffers the same problem, and has been buggy,
 * but progressively less so as more and more portions of the code have
 * been retrofitted with the required "purity testing".  The set of values
 * able to pass the purity test can be increased via the introduction of
 * a "canonical" flag marker, but the only way the broken interface itself
 * can be discarded is to start over and define the Tcl_ObjType properly.
 * Bytearrays should simply be usable as bytearrays without a kabuki
 * dance of testing.
 *
 * The Tcl_ObjType "properByteArrayType" is (nearly) a correct 
 * implementation of bytearrays.  Any Tcl value with the type
 * properByteArrayType can have its bytearray value fetched and
 * used with confidence that acting on that value is equivalent to
 * acting on the true Tcl string value.  This still implies a side
 * testing burden -- past mistakes will not let us avoid that
 * immediately, but it is at least a conventional test of type, and
 * can be implemented entirely by examining the objPtr fields, with
 * no need to query the intrep, as a canonical flag would require.
 *
 * Until Tcl_GetByteArrayFromObj() and Tcl_SetByteArrayLength() can
 * be revised to admit the possibility of returning NULL when the true
 * value is not a valid bytearray, we need a mechanism to retain
 * compatibility with the deployed callers of the broken interface.
 * That's what the retained "tclByteArrayType" provides.  In those
 * unusual circumstances where we convert an invalid bytearray value
 * to a bytearray type, it is to this legacy type.  Essentially any
 * time this legacy type gets used, it's a signal of a bug being ignored.
 * A TIP should be drafted to remove this connection to the broken past
 * so that Tcl 9 will no longer have any trace of it.  Prescribing a
 * migration path will be the key element of that work.  The internal
 * changes now in place are the limit of what can be done short of
 * interface repair.  They provide a great expansion of the histories
 * over which bytearray values can be useful in the meanwhile.
 */

static const Tcl_ObjType properByteArrayType = {
    "bytearray",
    FreeByteArrayInternalRep,
    DupByteArrayInternalRep,
    UpdateStringOfByteArray,