Tcl Source Code

Ticket UUID: 418645
Title: Initial encoding selected incorrectly
Type: Bug
Version: obsolete: 8.3.3
Submitter: mkuhn
Created on: 2001-04-24 20:28:54
Subsystem: 38. Init - Library - Autoload
Assigned To: hobbs
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2001-11-21 00:36:51
Resolution: Fixed
Closed By: mkuhn
Closed on: 2001-11-20 14:51:02
Description:
The function unix/tclUnixInit.c:TclpSetInitialEncodings
contains an ugly hack to guess from the locale name the
multibyte encoding currently used on a Unix system.
This might work in some of the few special cases listed
in the provided table, but it fails badly in general.
For example under Linux (glibc 2.2), the locale de_DE
uses ISO 8859-1, the locale de_DE@euro uses ISO
8859-15, and the locale vi_VN uses UTF-8. None of these
is covered by your table.

Just extending localeTable[] is not the solution here,
because manufacturers change the encodings of locales
sometimes. Unix has an X/Open standardized API function
to determine the character set of the current locale! I
suggest that you drop the entire environment variable
parsing and table mechanics in TclpSetInitialEncodings.
Instead simply first call

  setlocale(LC_NUMERIC, "C");

such that the C library sets the locale, then call

  nl_langinfo(CODESET)

(on all platforms where langinfo.h is available) which
will return the name of the now used encoding. This
will be a string such as

  ISO-8859-1
  ISO-8859-15
  UTF-8
  EUC-JP
  KOI8-R
  SJIS

The command "locale -m" prints a list of all encodings
available on a system. These strings are unfortunately
not strictly standardized, so you will still need a
table to map these encoding names onto those used by
TCL, but the return value of nl_langinfo(CODESET) is a
far better starting point for finding the currently
used encoding than the locale name.

On some systems (for instance, all with glibc 2.2) you
do not even have to determine the encoding from the
output of nl_langinfo(CODESET): the iconv function
provides a comprehensive conversion service that can
convert from whatever encoding nl_langinfo(CODESET)
identifies into "UTF-8" directly.

The matter is of some urgency, because SuSE Linux is
going to switch the default locales of most European
Union countries to ISO 8859-15 (for support of the Euro
symbol) soon, and then your assumption that ISO 8859-1
is a good default will fail for millions of Linux
users.

X/Open spec for nl_langinfo:

http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html
User Comments: haible added on 2001-11-21 00:36:51:

Thanks Jeffrey Hobbs for the patch. That's the
standards-compliant way to do things.

I disagree with Markus' suggestion for an additional
environment variable. It may have made sense a few
years ago, but nowadays the important systems all have
good locale support. Setting LC_CTYPE is a much more
universal way to set the encoding than specialized
environment variables.

Markus, you wrote yourself a few days ago:
  "LESSCHARSET is long obsolete and should not be used.
   Less-358 tests LANG/LC_CTYPE/LC_ALL for the UTF-8
   substring and activates its UTF-8 mode accordingly."

Your TCL_DEFAULT_ENCODING is just another variant of
LESSCHARSET.

mkuhn added on 2001-11-20 21:51:02:

If you want an environment variable to override the
default encoding specified in the standard POSIX locale
variables, then feel free to introduce and document a
proper one for exactly that purpose, e.g.

  export TCL_DEFAULT_ENCODING=EUC-JP

Implementation suggestion:

If $TCL_DEFAULT_ENCODING is defined, then nl_langinfo will
not be called and TCL behaves as if nl_langinfo had returned
TCL_DEFAULT_ENCODING instead. That fixes all problems for
users of systems that don't have a locale with the required
encoding installed or that don't support nl_langinfo.

But please don't overload well-established standardized
variables such as LANG, LC_* for any strange hacks. That is
messy and dangerous and affects other applications that
interpret LANG the standard way. LANG, LC_* have only one
single purpose: to name the configuration files that
setlocale will load from the disk. If LANG, LC_* have values
that are not listed in the output of 'locale -a', then this
should cause a warning to be printed and the 'C' locale to
be used.

Example for a more detailed warning message:

"Couldn't set locale directly! I tried to instruct the C
run-time system to load localization information (locale
definitions) according to the value of the environment
variable LC_ALL, LC_CTYPE or LANG (the first of these with a
value was used). However this failed. Apparently you
specified a locale name there that is not supported by this
operating system and therefore I have no idea what default
character encoding you would like me to use externally.
Please make sure that you specify one of the locale names
listed by the command 'locale -a'. See the operating system
documentation for details."

hobbs added on 2001-11-20 16:14:58:

File Added - 13570: langinfo3.diff

hobbs added on 2001-11-20 16:14:57:

attached is revision 3 of my patch, which uses the old-style 
fallback mechanism when setlocale returns null or another 
failure occurs.  This means it should be a strict 
improvement over the original.  I have committed this 
for 8.4a4cvs.

hobbs added on 2001-11-17 07:20:40:

I added some fallback checks, because HPUX 11 uses locale 
names without -'s (and weird ones at that), and one for 
gb2312-1980 to gb2312.  I have a feeling we should maybe 
localize these mappings better, but I'm not finding a great 
resource that helps with ideas ...

hobbs added on 2001-11-17 03:57:10:

committed to 8.4a4cvs, but leaving open for further 
reference.

hobbs added on 2001-11-16 09:41:53:

File Added - 13403: langinfo2.diff

hobbs added on 2001-11-16 09:41:52:

I've attached a modification of Victor's patch which allows 
a "" return from nl_langinfo to be interpreted as iso8859-1 
(because Solaris gave that to me when I didn't have LANG 
set).  Also fixed one error (I think it was an error), 
where the ibm** encoding check compared against <= 9 instead 
of <= '9' (int vs. char).  Also moved the autoconf check to 
tcl.m4 as a macro.  Finally, I added as a debug statement 
an fprintf that identifies the returned encoding and the 
set encoding (also retrievable by [encoding system] in 
Tcl).

mkuhn added on 2001-11-07 19:26:10:

I haven't tested Victor's nl_langinfo() patch yet myself,
but it looks good to me. Can't see anything wrong with it
and I recommend its inclusion.

hobbs added on 2001-11-07 09:42:06:

File Added - 12945: langinfo.patch


Victor Wagner provides the attached patch to start making 
use of nl_langinfo.  Note that whatever we use, it has to 
be BSD-license friendly (IOW, no (L)GPL).

mkuhn added on 2001-09-11 17:48:38:

The odd setlocale(LC_NUMERIC, "C") in my original message
should of course be setlocale(LC_CTYPE, ""). [No idea why I
typed that ...]

I made a little survey of the strings that
nl_langinfo(CODESET) returns on different operating systems
or that have been proposed as locale name draft standards:

Good old 7-bit USASCII comes in the largest number of names:

ASCII = US-ASCII = ANSI_X3.4-1968 = 646 = ISO646 =
ISO_646.IRV

For ISO 8859-1 (and correspondingly the other parts) there
are only two notations in use:

ISO8859-1 = ISO-8859-1

And then there are also

UTF-8
TIS-620 = TIS620.2533 = ISO-8859-12 = ISO8859-12
EUC-JP
EUC-KR
EUC-TW
EUC-CN = GB2312
VSCII
GB18030
GBK
BIG5 = Big5
KOI8-R
KOI8-U
WINDOWS-1251
WINDOWS-1256

I think that list covers even more than what you can
expect nl_langinfo(CODESET) to return today on any
mainstream system.

I suggest that the TCL character set conversion functions
should be able to accept all of the above character encoding
names, and then the output of nl_langinfo(CODESET) can be
forwarded directly to the character set conversion routines,
such that standard text I/O under TCL happens automatically
in the correct locale-dependent character set by default.

haible added on 2001-09-10 20:16:13:

Right, it is LGPL, and is contained in GNU libiconv.
A separate download is
 
ftp://ftp.ilog.fr/pub/Users/haible/gnu/libcharset-1.1.tar.gz

Note that the LGPL is not a big restriction, because
libcharset compiles to a shared library and a text
file, so the package that uses it is not "infected" by
the LGPL.

rmax added on 2001-09-10 16:54:17:

It is LGPL and distributed as part of libiconv.
The current version can be found here:
ftp://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.7.tar.gz

andreas_kupries added on 2001-09-08 06:01:46:

The url in the last comment is invalid.
The server "clisp.cons.org" is not found.

Under what license is libcharset ?

mkuhn added on 2001-06-25 18:43:58:

Using nl_langinfo(CODESET) directly has some portability
problems, because the returned character set names are not
well standardized and some systems (FreeBSD in particular)
still do not implement it. Fortunately, Bruno Haible has
developed "libcharset", a highly portable wrapper around
nl_langinfo(CODESET) that can easily be integrated, which
I think is really what
unix/tclUnixInit.c:TclpSetInitialEncodings should be using.
Several GNU packages already determine the encoding this
way. Have a look at:

http://clisp.cons.org/~haible/packages-libcharset.html

Attachments: