Tcl Source Code: View Ticket

Ticket UUID:	525525
Title:	wrong file list in msgcat::mclocale
Type:	Bug	Version:	obsolete: 8.4a5
Submitter:	haible	Created on:	2002-03-04 14:35:11
Subsystem:	30. msgcat Package	Assigned To:	dgp
Priority:	5 Medium	Severity:
Status:	Closed	Last Modified:	2002-06-18 04:07:43
Resolution:	Fixed	Closed By:	dgp
		Closed on:	2002-06-17 21:07:43
Description:	When my locale name is "pt_BR.UTF-8", Tcl looks for the catalog in "pt_br.utf-8.msg" and "pt.msg". I claim that the first one should instead be "pt_br.msg". Because the .msg files are all in UTF-8 (or contain \unnnn escape sequences for non-ASCII characters), they don't depend on the locale's encoding. Brazilian users want Brazilian translations, not Portuguese translations, regardless of whether their locale is named pt_BR.ISO-8859-1 or pt_BR.UTF-8. The encoding that the translator used shouldn't matter.
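The lookup described above can be illustrated with a minimal sketch. It assumes that catalog names are formed by lowercasing the locale and dropping trailing "_"-separated parts one at a time; the proc name catalogCandidates is hypothetical and is not part of msgcat.

```tcl
# Hypothetical helper: list the .msg files that would be tried for a locale,
# most specific first, assuming a lowercase + split-on-"_" scheme.
proc catalogCandidates {locale} {
    set parts [split [string tolower $locale] _]
    set names {}
    for {set n [llength $parts]} {$n > 0} {incr n -1} {
        lappend names [join [lrange $parts 0 [expr {$n - 1}]] _].msg
    }
    return $names
}

puts [catalogCandidates pt_BR.UTF-8]  ;# pt_br.utf-8.msg pt.msg
puts [catalogCandidates pt_BR]        ;# pt_br.msg pt.msg
```

Under that assumption the codeset suffix stays glued to the territory, so "pt_br.msg" is never tried for "pt_BR.UTF-8".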
User Comments:
dgp added on 2002-06-18 04:07:43:
Corrected docs.

haible added on 2002-06-17 23:50:24:
Yes, "UK" is not an ISO 3166 country code. The British English locale is called en_GB, not en_UK.

dgp added on 2002-06-17 23:36:41:
That was the idea. Thanks for the additional data. Committed. Is there an en_UK locale? If not, we should get it out of the examples in the msgcat documentation.

haible added on 2002-06-17 22:59:27:
File Added - 25257: msgcat.w32a
Thanks. With the new table lookup that you added, it's now much easier to support country-dependent variants of languages (like Brazilian Portuguese). Find attached a table which
1. adds country information where possible (i.e. where the previous code looked up only pt.msg, the new code will look up pt_br.msg and pt.msg).
2. makes two fixes: 0809 is en_GB, not en_UK. 13 is nl, not da.

dgp added on 2002-06-17 12:26:58:
File Deleted - 25140:
File Added - 25220: msgcat.patch

dgp added on 2002-06-17 12:26:57:
Final version of the patch. Committing to msgcat 1.3, bundled with Tcl 8.4a5/b1.

dgp added on 2002-06-15 07:35:05:
File Deleted - 24997:
File Added - 25140: msgcat.patch

dgp added on 2002-06-15 07:35:04:
Updated patch folds in additional data about locale values in the Windows Registry. (Thanks!) Also cleans up some style guide issues.

haible added on 2002-06-15 03:40:15:
File Added - 25126: msgcat.w32

haible added on 2002-06-15 03:40:14:
The ConvertLocale function for Unix works as expected; thanks. For the locale determination on Woe32 systems, I've made up a longer list, using tables from unicode.org and microsoft.com. You will find it attached. Please use it at the appropriate place in the patch.

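The table lookup discussed in the comments above might be sketched as follows. This is not the committed msgcat code: the array name winLocaleTable and the proc winLocale are hypothetical, the fallback to "C" is an assumption, and so is the input format (an eight-hex-digit registry string such as 00000416). The entries 0809 and 13 come from this ticket; 0416 and 16 are taken from Microsoft's published locale identifiers.

```tcl
# Hypothetical excerpt of a Windows-locale-code -> locale-name table.
# Two-digit keys name a language only; four-digit keys add a country.
array set winLocaleTable {
    13   nl
    16   pt
    0416 pt_BR
    0809 en_GB
}

# Hypothetical lookup: try the four-digit code first, then fall back to
# the two-digit language code, then to "C" (an assumed default).
proc winLocale {registryValue} {
    global winLocaleTable
    set code [string tolower $registryValue]
    set lcid [string range $code end-3 end]
    set lang [string range $code end-1 end]
    foreach key [list $lcid $lang] {
        if {[info exists winLocaleTable($key)]} {
            return $winLocaleTable($key)
        }
    }
    return C
}

puts [winLocale 00000416]  ;# pt_BR (country-specific entry)
puts [winLocale 00000816]  ;# pt    (no 0816 entry here, language only)
puts [winLocale 00000809]  ;# en_GB
```

With a country-specific result such as pt_BR, the catalog search can try pt_br.msg before pt.msg, which is the improvement item 1 of the attached table describes.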
dgp added on 2002-06-13 06:54:03:
File Deleted - 24663:
File Added - 24997: msgcat.patch
Updated patch includes documentation changes. Still needs tests before acceptance.

dgp added on 2002-06-08 08:23:52:
File Deleted - 21556:
File Added - 24663: msgcat.patch
Here's an updated patch. Please review that it takes care of the problem reported here. The bump to version 1.3 is because the interface is subtly changed by recognizing more environment variables. With updated docs/tests we can bundle this with Tcl 8.4.

dgp added on 2002-06-08 06:12:37:
Note that I read Bug 525522 as referring to facilities of TclX which are outside of my concern. If that bug really refers to the msgcat package that is bundled with Tcl (not TclX), then please clarify that bug report.

haible added on 2002-04-20 09:01:27:
I can see three problems in the patch.
1) These question marks should be brackets.
    # Assume $value is of form: $language?_$territory??.$codeset??@modifier?
    # Convert to form: $language?_$territory??_$modifier?
2) The function ConvertLocale gives an error when the argument string contains an @ sign. You can try these tests:
    echo [msgcat::ConvertLocale "de"] -> de
    echo [msgcat::ConvertLocale "de_DE"] -> de_DE
    echo [msgcat::ConvertLocale "de.UTF-8"] -> de
    echo [msgcat::ConvertLocale "de_DE.UTF-8"] -> de_DE
    echo [msgcat::ConvertLocale "de@bayrisch"] -> de_bayrisch
    echo [msgcat::ConvertLocale "de_DE@bayrisch"] -> de_DE_bayrisch
    echo [msgcat::ConvertLocale "de.UTF-8@bayrisch"] -> de_bayrisch
    echo [msgcat::ConvertLocale "de_DE.UTF-8@bayrisch"] -> de_DE_bayrisch
3) [info exists ::env(...)] returns true for an empty environment value, right? But ::env(LANG) must be ignored if it is set to empty (see bug #525522). Similarly for LC_MESSAGES and LC_ALL.

dgp added on 2002-04-20 08:15:22:
Ah... I think that patch is wrong. It will only recognize a codeset or modifier if it has already recognized a territory, etc. Feel free to offer a correction. I may not get back to this right away.

dgp added on 2002-04-20 08:12:56:
File Added - 21556: msgcat.patch
Yes, it seems so. Here's a patch as you described. Still needs the documentation changes added. Please review.

haible added on 2002-04-20 06:55:17:
In glibc the following are in use:
- @euro (for some European countries) means a variant of the locale where the currency is the EUR.
- @cyrillic (for Serbian) means a variant of the locale which uses Cyrillic writing, Latin writing being the default.
- @nynorsk (for Norwegian) meant a variant of the language (= dialect), but now they use the language code nn.
In German locales, you could easily use the modifier @old to denote the old (20th century) orthography. In OSF/1 or Solaris locales, I've seen the @UCS suffix to denote locales where the internal wchar_t representation has been changed to Unicode. In summary, you can see that some of these modifiers warrant the creation of separate .msg files (especially the @cyrillic and @old modifiers).

dgp added on 2002-04-20 06:22:50:
Can you provide any examples of the @modifier in use? What sort of things does it indicate?

haible added on 2002-04-20 06:19:11:
If you do this: "A [split] on "." and passing only the first element of the resulting list", the locales including a modifier and a codeset will not be treated correctly. IMO the right transformation would be (assuming that the codeset of the locale doesn't matter because all catalogs are in UTF-8 anyway):
    Convert language_territory.codeset@modifier to language_territory_modifier
I.e. the algorithm is:
- Search for the first dot or @ sign. Take the substring preceding this character.
- If it was a dot, search for the next @ sign.
- If an @ sign was found, append to the result an underscore and all that follows the @ sign.

dgp added on 2002-04-20 06:01:20:
I don't plan to go that far. The [::msgcat::mclocale] command is documented to accept an argument that is a list of strings joined by "_". I don't plan to extend that. So, the only thing that needs fixing is the way msgcat uses the environment variable values to derive an argument to pass to [mclocale]. A [split] on "." and passing only the first element of the resulting list should take care of that. The message catalogs of interest are those packaged with Tcl packages and applications. All of them are of the form foo_bar_baz.msg. I don't see a compelling reason to extend that further. The message catalog files themselves are always read in the UTF-8 encoding, and then evaluated as Tcl scripts. The documentation of msgcat should probably be improved to make that clear.

haible added on 2002-04-20 05:50:36:
You are right. The most general locale name form is as you mentioned. The .msg files that it should attempt to open are, in this order:
    language_territory.codeset@modifier.msg
    language_territory@modifier.msg
    language_territory.msg
    language.msg
(Remove the codeset first because it is merely an implementation issue, thus it has the lowest "priority". Remove the modifier next, because usually a modifier belongs to a particular language_territory: for example, no_NO@nynorsk is a dialect of Norwegian, and no@nynorsk does not make much sense.) About LC_ALL, LC_MESSAGES, LANG: you are right here again; this was mentioned in bug report 525522.

dgp added on 2002-04-20 05:27:35:
Likewise, for consistency with C libraries, it would make sense for msgcat to get its default locale from ::env(LC_ALL), then ::env(LC_MESSAGES), then ::env(LANG).

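The conversion steps and the environment-variable precedence discussed in the comments above can be sketched together. This is an illustration under stated assumptions, not the patch attached to this ticket: the proc names convertLocale and defaultLocale are hypothetical, and falling back to "C" when no variable is set is an assumption.

```tcl
# Hypothetical: language[_territory][.codeset][@modifier]
#            -> language[_territory][_modifier]
proc convertLocale {value} {
    # Step 1: find the first "." or "@" and keep what precedes it.
    if {![regexp -indices {[.@]} $value match]} {
        return $value
    }
    set pos [lindex $match 0]
    set result [string range $value 0 [expr {$pos - 1}]]
    # Step 2: if it was a dot, search for the next "@" after it.
    set at $pos
    if {[string index $value $pos] eq "."} {
        set at [string first @ $value $pos]
    }
    # Step 3: if an "@" was found, append "_" and everything after it.
    if {$at >= 0} {
        append result _ [string range $value [expr {$at + 1}] end]
    }
    return $result
}

# Hypothetical: take the first NON-EMPTY of LC_ALL, LC_MESSAGES, LANG.
# [info exists] alone is not enough, since it is true for an empty value.
proc defaultLocale {} {
    foreach name {LC_ALL LC_MESSAGES LANG} {
        if {[info exists ::env($name)] && $::env($name) ne ""} {
            return [convertLocale $::env($name)]
        }
    }
    return C
}

puts [convertLocale de_DE.UTF-8@bayrisch]  ;# de_DE_bayrisch
puts [convertLocale de.UTF-8]              ;# de
puts [convertLocale pt_BR.UTF-8]           ;# pt_BR
```

The three puts lines reproduce cases from the test list earlier in the thread. The explicit empty-string check covers the point that an empty LANG, LC_MESSAGES, or LC_ALL must be ignored.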
dgp added on 2002-04-20 05:22:28:
Various man pages and online references suggest that the XPG4 canonical form for a locale (for example, a suitable value for $::env(LANG)) is
    language[_territory][.codeset][@modifier]
Systems can understand locales of other forms as well, but I believe that systems (C libraries) conforming to XPG4 all agree on what locales of the form above mean. The [::msgcat::mclocale] command doesn't deal with all of that, though. It expects an argument that is a list of strings joined by "_", like en, en_US, or en_US_funky. So it seems that when pulling values out of environment variables, the msgcat package should pass the portion of the string before the first "." on to [mclocale]. I think that change would take care of this bug.

haible added on 2002-04-18 23:57:39:
Yes, that would be fine. .msg files that are written in a particular encoding could be stored as pt_br.utf-8.msg or pt_br.iso-8859-1.msg, whereas .msg files that are encoding-independent (because they use \unnnn syntax, such as the ones produced by GNU msgfmt 0.11.1) could be stored as pt_br.msg.

dgp added on 2002-04-18 23:37:31:
Would it be sufficient if msgcat looked for the catalogs
    pt_br.utf-8.msg
    pt_br.msg
    pt.msg
in that order?

Attachments:

msgcat.w32a added by haible on 2002-06-17 22:59:27.
msgcat.patch added by dgp on 2002-06-17 12:26:58.
msgcat.w32 added by haible on 2002-06-15 03:40:15.