TIP 137: Specifying Script Encodings for [source] and tclsh

Author:         Anton Kovalenko <a_kovalenko@fromru.com>
State:          Final
Type:           Project
Vote:           Done
Created:        29-May-2003
Tcl-Version:    8.5


This TIP proposes a way to specify the encoding used to read a script, both for the source command and for the tclsh and wish shells.


Currently, all Tcl scripts are expected to be in the system encoding. The author of a script or package cannot specify a particular encoding for sourcing the script or the package's files. Common practice is to assume that the system encoding is a superset of ASCII and to avoid non-ASCII characters in scripts that target a wide audience of potential users.

Unicode literals can be specified with \uXXXX sequences, but this is neither universal nor convenient: these sequences are not substituted in {}-quoted strings, and they cannot be edited in WYSIWYG editors without special (and hence uncommon) support.
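
The limitation of \uXXXX inside braces can be seen in a short sketch. The variable names are illustrative only; the behaviour follows from Tcl's standard substitution rules, under which backslash sequences are processed in double-quoted words but not in brace-quoted ones.

```tcl
# In a double-quoted word, \u00e9 is replaced by a single character.
set quoted "caf\u00e9"

# In a brace-quoted word, no substitution happens: the backslash, the
# letter u and the four hex digits all stay in the string as typed.
set braced {caf\u00e9}

puts [string length $quoted]   ;# 4 characters
puts [string length $braced]   ;# 9 characters
```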

This TIP proposes adding support for an -encoding option to the source command and to tclsh and wish. Package authors will then be able to specify the encoding of the package files in pkgIndex.tcl, and script authors will be able to specify the script encoding when invoking tclsh (either in the first line beginning with #! or in the line containing exec tclsh "$0" "$@").
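
The intended effect of the proposed option can be approximated at the script level with existing commands. The following is only a rough sketch: the procedure name sourceWithEncoding is hypothetical, and unlike the real source command it does not update [info script] or apply source's other file-reading rules.

```tcl
# Hypothetical helper: read a file in the given encoding and evaluate
# its contents in the caller's context.
proc sourceWithEncoding {encodingName path} {
    set chan [open $path r]
    fconfigure $chan -encoding $encodingName
    # Read the whole script, making sure the channel is closed even
    # when the read fails.
    set failed [catch {read $chan} script opts]
    close $chan
    if {$failed} {
        return -options $opts $script
    }
    uplevel 1 $script
}
```

A call such as sourceWithEncoding utf-8 [file join $dir pkg.tcl] would then play the role that source -encoding utf-8 plays in this proposal.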

This TIP also proposes using utf-8 for all the standard system scripts and for pkgIndex.tcl and tclIndex files. Currently they are all assumed to be in the system encoding, which could even prevent Tcl itself from running when the system encoding is not a compatible superset of ASCII. (Message catalogs do not need this support, as they are already always loaded using the UTF-8 encoding scheme.)


Tclsh will allow the encoding to be specified on the command line in two forms: -encoding name (two separate arguments) and -e:name (a single argument). The second form is intended for scripts that begin with #!, because Unix kernels pass the extra parameters from the #! line as a single argument. The very short notation (-e:) was chosen because some Unices limit the #! line to 32 characters.
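
To make the two forms concrete, here is a minimal sketch of how a leading encoding option could be recognised in an argument list. It is written in Tcl purely for illustration; the procedure name parseEncodingArg is hypothetical, and the shell's actual argument handling is done in C and may differ.

```tcl
# Hypothetical helper: return a two-element list holding the encoding
# name ("" if none was given) and the remaining arguments.
proc parseEncodingArg {argv} {
    set first [lindex $argv 0]
    # Two-argument form: -encoding name
    if {$first eq "-encoding" && [llength $argv] >= 2} {
        return [list [lindex $argv 1] [lrange $argv 2 end]]
    }
    # Single-argument form for #! lines: -e:name
    if {[string match "-e:*" $first]} {
        return [list [string range $first 3 end] [lrange $argv 1 end]]
    }
    # No encoding option present.
    return [list "" $argv]
}

puts [parseEncodingArg {-encoding utf-8 script.tcl}]
puts [parseEncodingArg {-e:utf-8 script.tcl arg1}]
```

Because the -e: form keeps the option and its value in one word, it survives the kernel behaviour described above, where everything after the interpreter path arrives as a single argument.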

To implement all these options, this TIP proposes a new public C-level API function, Tcl_FSEvalFileEx, which is similar to Tcl_FSEvalFile but takes one extra argument that must be an encoding name or NULL (to use the system encoding).

 int Tcl_FSEvalFileEx(Tcl_Interp *interp, Tcl_Obj *pathPtr,
                      CONST char *encodingName);

Common use of these new options will look like this:

  1. In a script:

     #!/usr/bin/tclsh -e:utf-8
     do something...

    or it could be:

     # I wish to use tclsh \
     exec tclsh -encoding utf-8 "$0" "$@"

  2. In a pkgIndex.tcl:

     package ifneeded pkg 1.02 [list source -encoding utf-8 \
          [file join $dir pkg.tcl]]

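
The usage above can be tried end to end with a small sketch, assuming a Tcl version that implements the -encoding option of source (8.5 or later, per this TIP's header). The file name tip137_demo.tcl is a hypothetical scratch file created in the current directory. The demo writes its non-ASCII text with \u escapes so that the demo itself stays pure ASCII.

```tcl
# Create a UTF-8 encoded script containing a brace-quoted non-ASCII
# string. The \u escapes are substituted here because the outer word
# is double-quoted, so the file receives the real characters.
set path [file join [pwd] tip137_demo.tcl]
set chan [open $path w]
fconfigure $chan -encoding utf-8
puts $chan "set greeting {\u041f\u0440\u0438\u0432\u0435\u0442}"
close $chan

# Read the script back as UTF-8, whatever the system encoding is.
source -encoding utf-8 $path
puts [string length $greeting]   ;# 6 characters

file delete $path
```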

A partial implementation of this TIP can be found at http://sourceforge.net/projects/tcl as Patch #742683. The changes needed to source the system scripts and pkgIndex files in utf-8 are not yet included, but with the existing implementation of "-encoding" they should be easy to add.


This document is placed in the public domain.