Tcl Source Code

Ticket UUID: 418645
Title: Initial encoding selected incorrectly
Type: Bug
Version: obsolete: 8.3.3
Submitter: mkuhn
Created on: 2001-04-24 20:28:54
Subsystem: 38. Init - Library - Autoload
Assigned To: hobbs
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2001-11-21 00:36:51
Resolution: Fixed
Closed By: mkuhn
Closed on: 2001-11-20 14:51:02
Description:
The function unix/tclUnixInit.c:TclpSetInitialEncodings
contains an ugly hack to guess from the locale name the
multibyte encoding currently used on a Unix system.
This might work in some of the few special cases listed
in the provided table, but it fails badly in general.
For example under Linux (glibc 2.2), the locale de_DE
uses ISO 8859-1, the locale de_DE@euro uses ISO
8859-15, and the locale vi_VN uses UTF-8. None of these
is covered by your table.

Just extending localeTable[] is not the solution here,
because manufacturers change the encodings of locales
sometimes. Unix has an X/Open standardized API function
to determine the character set of the current locale! I
suggest that you drop the entire environment variable
parsing and table mechanics in TclpSetInitialEncodings.
Instead simply first call

  setlocale(LC_NUMERIC, "C");

such that the C library sets the locale, then call

  nl_langinfo(CODESET)

(on all platforms where langinfo.h is available) which
will return the name of the now used encoding. This
will be a string such as

  ISO-8859-1
  ISO-8859-15
  UTF-8
  EUC-JP
  KOI8-R
  SJIS

The command "locale -m" prints a list of all encodings
available on a system. These strings are unfortunately
not strictly standardized, so you will still need a
table to map these encoding names onto those used by
TCL, but the return value of nl_langinfo(CODESET) is a
far better starting point for finding the currently
used encoding than the locale name.

On some systems (for instance, all with glibc 2.2) you
do not even have to determine the encoding from the
output of nl_langinfo(CODESET): the iconv function
provides a comprehensive conversion service that can
convert from whatever encoding nl_langinfo(CODESET)
identifies into "UTF-8" directly.

The matter is of some urgency, because SuSE Linux is
going to switch the default locales of most European
Union countries to ISO 8859-15 (for support of the Euro
symbol) soon, and then your assumption that ISO 8859-1
is a good default will fail for millions of Linux
users.

X/Open spec for nl_langinfo:

http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html
User Comments: haible added on 2001-11-21 00:36:51:

Thanks Jeffrey Hobbs for the patch. That's the
standards-compliant way to do things.

I disagree with Markus' suggestion for an additional
environment variable. It may have made sense a few
years ago, but nowadays the important systems all have
good locale support. Setting LC_CTYPE is a much more
universal way to set the encoding than specialized
environment variables.

Markus, you wrote yourself a few days ago:
  "LESSCHARSET is long obsolete and should not be used.
   Less-358 tests LANG/LC_CTYPE/LC_ALL for the UTF-8
   substring and activates its UTF-8 mode accordingly."

Your TCL_DEFAULT_ENCODING is just another variant of
LESSCHARSET.

mkuhn added on 2001-11-20 21:51:02:

If you want an environment variable to override the
default encoding specified in the standard POSIX locale
variables, then feel free to introduce and document a
proper one for exactly that purpose, e.g.

  export TCL_DEFAULT_ENCODING=EUC-JP

Implementation suggestion:

If $TCL_DEFAULT_ENCODING is defined, then nl_langinfo will
not be called and TCL behaves as if nl_langinfo had returned
TCL_DEFAULT_ENCODING instead. That fixes all problems for
users of systems that don't have a locale with the required
encoding installed or that don't support nl_langinfo.

But please don't overload well-established standardized
variables such as LANG, LC_* for any strange hacks. That is
messy and dangerous and affects other applications that
interpret LANG the standard way. LANG, LC_* have only one
single purpose: to name the configuration files that
setlocale will load from the disk. If LANG, LC_* have values
that are not listed in the output of 'locale -a', then this
should cause a warning to be printed and the 'C' locale to
be used.

Example for a more detailed warning message:

"Couldn't set locale directly! I tried to instruct the C
run-time system to load localization information (locale
definitions) according to the value of the environment
variable LC_ALL, LC_CTYPE or LANG (the first of these with a
value was used). However this failed. Apparently you
specified a locale name there that is not supported by this
operating system and therefore I have no idea what default
character encoding you would like me to use externally.
Please make sure that you specify one of the locale names
listed by the command 'locale -a'. See the operating system
documentation for details."

hobbs added on 2001-11-20 16:14:58:

File Added - 13570: langinfo3.diff

hobbs added on 2001-11-20 16:14:57:

attached is revision 3 of my patch, which uses the old-style 
fallback mechanism when setlocale returns null or another 
failure occurs.  This means it should be a strict 
improvement over the original.  I have committed this 
for 8.4a4cvs.

hobbs added on 2001-11-17 07:20:40:

I added some fallback checks, because HPUX 11 uses locale 
names without -'s (and weird ones at that), and one for 
gb2312-1980 to gb2312.  I have a feeling we should maybe 
localize these mappings better, but I'm not finding a great 
resource that helps with ideas ...

hobbs added on 2001-11-17 03:57:10:

committed to 8.4a4cvs, but leaving open for further 
reference.

hobbs added on 2001-11-16 09:41:53:

File Added - 13403: langinfo2.diff

hobbs added on 2001-11-16 09:41:52:

I've attached a modification of Victor's patch which allows 
a "" return from nl_langinfo to be interpreted as iso8859-1 
(because Solaris gave that to me when I didn't have LANG 
set).  Also fixed one error (I think it was an error), 
where the ibm** encoding check compared against <= 9 instead 
of <= '9' (int vs. char).  Also moved the autoconf check to 
tcl.m4 as a macro.  Finally, I added as a debug statement 
an fprintf that identifies the returned encoding and the 
set encoding (also retrievable by [encoding system] in 
Tcl).

mkuhn added on 2001-11-07 19:26:10:

I haven't tested Victor's nl_langinfo() patch yet myself,
but it looks good to me. Can't see anything wrong with it
and I recommend its inclusion.

hobbs added on 2001-11-07 09:42:06:

File Added - 12945: langinfo.patch


Victor Wagner provides the attached patch to start making 
use of nl_langinfo.  Note that whatever we use, it has to 
be BSD-license friendly (IOW, no (L)GPL).

mkuhn added on 2001-09-11 17:48:38:

The odd setlocale(LC_NUMERIC, "C") in my original message
should of course be setlocale(LC_CTYPE, ""). [No idea why I
typed that ...]

I made a little survey of the strings that
nl_langinfo(CODESET) returns on different operating systems
or that have been proposed as locale name draft standards:

Good old 7-bit USASCII comes in the largest number of names:

ASCII = US-ASCII = ANSI_X3.4-1968 = 646 = ISO646 =
ISO_646.IRV

For ISO 8859-1 (and correspondingly the other parts) there
are only two notations in use:

ISO8859-1 = ISO-8859-1

And then there are also

UTF-8
TIS-620 = TIS620.2533 = ISO-8859-12 = ISO8859-12
EUC-JP
EUC-KR
EUC-TW
EUC-CN = GB2312
VSCII
GB18030
GBK
BIG5 = Big5
KOI8-R
KOI8-U
WINDOWS-1251
WINDOWS-1256

I think that list covers even more than what you can
expect nl_langinfo(CODESET) to return today on any
mainstream system.

I suggest that the TCL character set conversion functions
should be able to accept all of the above character encoding
names, and then the output of nl_langinfo(CODESET) can be
forwarded directly to the character set conversion routines,
such that standard text I/O under TCL happens automatically
in the correct locale-dependent character set by default.

haible added on 2001-09-10 20:16:13:

Right, it is LGPL, and is contained in GNU libiconv.
A separate download is
 
ftp://ftp.ilog.fr/pub/Users/haible/gnu/libcharset-1.1.tar.gz

Note that the LGPL is not a big restriction, because
libcharset compiles to a shared library and a text
file, so the package that uses it is not "infected" by
the LGPL.

rmax added on 2001-09-10 16:54:17:

It is LGPL and distributed as part of libiconv.
The current version can be found here:
ftp://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.7.tar.gz

andreas_kupries added on 2001-09-08 06:01:46:

The url in the last comment is invalid.
The server "clisp.cons.org" is not found.

Under what license is libcharset ?

mkuhn added on 2001-06-25 18:43:58:

Using nl_langinfo(CODESET) directly has some portability
problems, because the returned character set names are not
well standardized and some systems (FreeBSD in particular)
still do not implement it. Fortunately, Bruno Haible has
developed "libcharset", a highly portable wrapper around
nl_langinfo(CODESET) that can easily be integrated, which
I think is really what
unix/tclUnixInit.c:TclpSetInitialEncodings should be using.
Several GNU packages already determine the encoding this
way. Have a look at:

http://clisp.cons.org/~haible/packages-libcharset.html

Attachments: