TIP 346: Error on Failed String Encodings

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Tcl 2017 Conference, Houston/TX, US, Oct 16-20
Send your abstracts to tclconference@googlegroups.com
by Aug 21.
Author:         Alexandre Ferrieux <alexandre.ferrieux@gmail.com>
State:          Draft
Type:           Project
Vote:           Pending
Created:        02-Feb-2009
Post-History:   
Keywords:       Tcl,encoding,convertto,strict,Unicode,String,ByteArray
Tcl-Version:    8.7

Abstract

This TIP proposes to raise an error when an encoding-based conversion loses information.

Background

Encoding-based conversions occur e.g. when writing a string to a channel. In doing so, Unicode characters are converted to sequences of bytes according to the channel's encoding. Similarly, a conversion can occur on request of the ByteArray internal representation of an object, the target encoding being ISO8859-1. In both cases, for some combinations of Unicode char and target encoding, the mapping is lossy (non-injective). For example, the "e acute" character, and many of its cousins, is mapped to a "?" in the 'ascii' target encoding. Also, Unicode chars above \u00FF get 'projected' onto their low byte in the ISO8859-1 ByteArray conversion.

This loss of information, in the first case, introduces unnoticed i18n mishandlings. In the second case, it makes it unreliable to do pure-ByteArray operations on objects unless they have no string representation. This induces unwanted and hard-to-debug performance hits on bytearray manipulations when people add debugging puts.

Proposed Change

This TIP proposes to make this loss conspicuous.

For the first use case, the idea is to introduce a -strict option to encoding convertto, that would raise an explicit error when non-mappable characters are met. Lossy conversions during channel I/O would also fail if a -strictencoding true [fconfigure option] is set. For the second case, we simply want the conversion to fail, like does the Listification of an ill-formed list. In both cases, the change consists of letting the proper internal conversion routine like SetByteArrayFromAny return TCL_ERROR.

Rationale

The second case does imply a Potential Incompatibility, since currently SBFA is documented to always return TCL_OK. However, it is felt that virtually all cases that are sensitive to this, are actually half-working in a completely hidden manner. Hence the global effect is a healthy one.

Reference Example

See Bug 1665628 https://sourceforge.net/support/tracker.php?aid=1665628 .

Copyright

This document has been placed in the public domain.

History