Tcl Source Code

Ticket UUID: 525525
Title: wrong file list in msgcat::mclocale
Type: Bug
Version: obsolete: 8.4a5
Submitter: haible
Created on: 2002-03-04 14:35:11
Subsystem: 30. msgcat Package
Assigned To: dgp
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2002-06-18 04:07:43
Resolution: Fixed
Closed By: dgp
Closed on: 2002-06-17 21:07:43
Description:
When my locale name is "pt_BR.UTF-8", Tcl looks for the catalog in
"pt_br.utf-8.msg" and "pt.msg". I claim that the first one should
instead be "pt_br.msg". Because the .msg files are all in UTF-8 (or
contain \unnnn escape sequences for non-ASCII characters), they don't
depend on the locale's encoding. Brazilian users want Brazilian
translations, not Portuguese translations, regardless of whether their
locale is named pt_BR.ISO-8859-1 or pt_BR.UTF-8. The encoding that the
translator used shouldn't matter.
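
A minimal sketch of how the reported file list can arise, assuming the locale name is only lowercased and split on "_" before the catalog names are built. The helper name catalogCandidates is made up for illustration and is not part of msgcat.

```tcl
# Hypothetical helper, not part of msgcat: list the catalog files tried for
# a locale when the name is only lowercased and split on "_".
proc catalogCandidates {locale} {
    set files {}
    set prefix ""
    foreach part [split [string tolower $locale] _] {
        # Grow the prefix one "_"-separated part at a time.
        if {$prefix eq ""} {
            set prefix $part
        } else {
            append prefix _ $part
        }
        # The most specific name goes first.
        set files [linsert $files 0 $prefix.msg]
    }
    return $files
}

puts [catalogCandidates pt_BR.UTF-8]   ;# pt_br.utf-8.msg pt.msg
puts [catalogCandidates pt_BR]         ;# pt_br.msg pt.msg
```

Because the codeset is glued to the territory, "pt_br.msg" never appears in the first list, which matches the behaviour described in the report.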
User Comments: dgp added on 2002-06-18 04:07:43:
Logged In: YES 
user_id=80530

corrected docs.

haible added on 2002-06-17 23:50:24:
Logged In: YES 
user_id=5923

Yes, "UK" is not an ISO 3166 country code. The British
English locale is called en_GB, not en_UK.

dgp added on 2002-06-17 23:36:41:
Logged In: YES 
user_id=80530


That was the idea.  Thanks for the additional data.  Committed.

Is there an en_UK locale?  If not, we should get it out
of the examples in the msgcat documentation.

haible added on 2002-06-17 22:59:27:

File Added - 25257: msgcat.w32a

Logged In: YES 
user_id=5923

Thanks. With the new table lookup that you added, it's now
much easier to support country-dependent variants of
languages (like Brazilian Portuguese). Attached is a table
which

1. adds country information where possible (i.e. where the
previous code looked up only pt.msg, the new code will look
up pt_br.msg and pt.msg).

2. contains two fixes: 0809 is en_GB, not en_UK. 13 is nl, not da.
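
A rough sketch of such a table lookup, assuming the Windows locale identifier arrives as a hexadecimal string. Only the 0809 and 13 entries come from this comment; the other entries are well-known identifiers added for illustration, and the helper name lcidLocale, the two-digit fallback, and the "C" default are assumptions, not the code in the patch.

```tcl
# Hypothetical helper: map a hexadecimal Windows locale identifier to a
# locale name. The table is an illustrative subset only.
proc lcidLocale {lcid} {
    array set table {
        0409 en_US
        0809 en_GB
        0416 pt_BR
        0816 pt_PT
        09   en
        13   nl
        16   pt
    }
    set lcid [string tolower $lcid]
    # Try the full identifier first, then only its last two digits.
    foreach key [list $lcid [string range $lcid end-1 end]] {
        if {[info exists table($key)]} {
            return $table($key)
        }
    }
    # Assumed fallback when nothing matches.
    return C
}

puts [lcidLocale 0809]   ;# en_GB
puts [lcidLocale 0416]   ;# pt_BR
puts [lcidLocale 0413]   ;# nl
```

With country information in the table, 0416 yields pt_BR, so both pt_br.msg and pt.msg become candidates rather than pt.msg alone.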

dgp added on 2002-06-17 12:26:58:

File Deleted - 25140: 



File Added - 25220: msgcat.patch

dgp added on 2002-06-17 12:26:57:
Logged In: YES 
user_id=80530

final version of patch
committing to msgcat 1.3, bundled with Tcl 8.4a5/b1

dgp added on 2002-06-15 07:35:05:

File Deleted - 24997: 



File Added - 25140: msgcat.patch

dgp added on 2002-06-15 07:35:04:
Logged In: YES 
user_id=80530

updated patch folds in additional data about locale values
in the Windows Registry.  (Thanks!)  Also cleans up some
style guide issues.

haible added on 2002-06-15 03:40:15:

File Added - 25126: msgcat.w32

haible added on 2002-06-15 03:40:14:
Logged In: YES 
user_id=5923

The ConvertLocale function for Unix works as expected;
thanks.

For the locale determination on Woe32 systems, I've made up
a longer list, using tables from unicode.org and microsoft.com.
You'll find it attached. Please use it at the appropriate place
in the patch.

dgp added on 2002-06-13 06:54:03:

File Deleted - 24663: 



File Added - 24997: msgcat.patch

Logged In: YES 
user_id=80530

Updated patch includes documentation changes.
Still need tests before acceptance.

dgp added on 2002-06-08 08:23:52:

File Deleted - 21556: 



File Added - 24663: msgcat.patch

Logged In: YES 
user_id=80530

Here's an updated patch.  Please review that
it takes care of the problem reported here.

The bump to version 1.3 is because the interface
is subtly changed by recognizing more environment
variables.

With updated docs/tests we can bundle this
with Tcl 8.4.

dgp added on 2002-06-08 06:12:37:
Logged In: YES 
user_id=80530

Note that I read Bug 525522 as referring to
facilities of TclX which are outside of my
concern.  If that bug really refers to the
msgcat *package* that is bundled with Tcl
(*not* TclX), then please clarify that bug
report.

haible added on 2002-04-20 09:01:27:
Logged In: YES 
user_id=5923

I can see three problems in the patch.

1) These question marks should be brackets.

     # Assume $value is of form: $language?_$territory??.$codeset??@modifier?
     # Convert to form: $language?_$territory??_$modifier?

2) The function ConvertLocale gives an error when the
argument string contains an @ sign. You can try these tests:

echo [msgcat::ConvertLocale "de"]                   -> de
echo [msgcat::ConvertLocale "de_DE"]                -> de_DE
echo [msgcat::ConvertLocale "de.UTF-8"]             -> de
echo [msgcat::ConvertLocale "de_DE.UTF-8"]          -> de_DE
echo [msgcat::ConvertLocale "de@bayrisch"]          -> de_bayrisch
echo [msgcat::ConvertLocale "de_DE@bayrisch"]       -> de_DE_bayrisch
echo [msgcat::ConvertLocale "de.UTF-8@bayrisch"]    -> de_bayrisch
echo [msgcat::ConvertLocale "de_DE.UTF-8@bayrisch"] -> de_DE_bayrisch

3) [info exists ::env(...)] returns true for an empty
environment value, right? But ::env(LANG) must be ignored
if it is set to empty (see bug #525522). Similarly for
LC_MESSAGES and LC_ALL.
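
A small sketch of point 3, using a stand-in array instead of the real ::env so that it can be run anywhere. The helper name hasValue and the array fakeEnv are made up for illustration.

```tcl
# Hypothetical helper: report whether an array element holds a usable,
# non-empty value. [info exists] alone is true for an empty string.
proc hasValue {arrayName key} {
    upvar 1 $arrayName arr
    expr {[info exists arr($key)] && $arr($key) ne ""}
}

array set fakeEnv {LANG "" LC_MESSAGES de_DE.UTF-8}
puts [info exists fakeEnv(LANG)]      ;# 1, although the value is empty
puts [hasValue fakeEnv LANG]          ;# 0
puts [hasValue fakeEnv LC_MESSAGES]   ;# 1
puts [hasValue fakeEnv LC_ALL]        ;# 0
```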

dgp added on 2002-04-20 08:15:22:
Logged In: YES 
user_id=80530

Ah... I think that patch is wrong.  It will only
recognize a codeset or modifier if it has already
recognized a territory, etc.  Feel free to offer
a correction.  I may not get back to this right away.

dgp added on 2002-04-20 08:12:56:

File Added - 21556: msgcat.patch

Logged In: YES 
user_id=80530

Yes, it seems so.  Here's a patch as you described.
Still needs the documentation changes added.  Please
review.

haible added on 2002-04-20 06:55:17:
Logged In: YES 
user_id=5923

In glibc the following are in use:

@euro (for some European countries) means a variant of the
locale where the currency is the EUR.

@cyrillic (for Serbian) means a variant of the locale which
uses cyrillic writing, latin writing being the default

@nynorsk (for Norwegian) meant a variant of the language (=
dialect), but now they use the language code nn.

In German locales, you could easily use the modifier @old to
denote the old (20th century) orthography.

In OSF/1 or Solaris locales, I've seen the @UCS suffix to denote
locales where the internal wchar_t representation has been
changed to Unicode.

In summary, you can see that some of these modifiers warrant
the creation of separate .msg files (especially the
@cyrillic and @old modifiers).

dgp added on 2002-04-20 06:22:50:
Logged In: YES 
user_id=80530

Can you provide any examples of the @modifier in use?
What sort of things does it indicate?

haible added on 2002-04-20 06:19:11:
Logged In: YES 
user_id=5923

If you do this: "A [split] on "." and passing only the first
element of the resulting list"
the locales including a modifier and a codeset will not be
treated correctly. IMO the right transformation would be
(assuming that the codeset of the locale doesn't matter
because all catalogs are in UTF-8 anyway):

Convert
      language_territory.codeset@modifier
to
      language_territory_modifier

I.e. the algorithm is:
  - Search for the first dot or @ sign. Take the substring preceding
    this character.
  - If it was a dot, search for the next @ sign.
  - If an @ sign was found, append to the result an underscore
    and all that follows the @ sign.
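
The three steps above can be sketched as follows. This is only an illustration under the stated assumption that the codeset is always discarded; the helper name stripCodeset is made up and this is not the ConvertLocale from the patch.

```tcl
# Hypothetical helper following the three steps listed above.
proc stripCodeset {value} {
    # Step 1: find the first "." or "@" and keep what precedes it.
    if {![regexp -indices {[.@]} $value match]} {
        return $value
    }
    set pos [lindex $match 0]
    set result [string range $value 0 [expr {$pos - 1}]]
    # Step 2: if it was a dot, look for the next "@" after it.
    if {[string index $value $pos] eq "."} {
        set pos [string first @ $value $pos]
    }
    # Step 3: if an "@" was found, append "_" and everything after it.
    if {$pos >= 0} {
        append result _ [string range $value [expr {$pos + 1}] end]
    }
    return $result
}

puts [stripCodeset de_DE.UTF-8]            ;# de_DE
puts [stripCodeset de@bayrisch]            ;# de_bayrisch
puts [stripCodeset de_DE.UTF-8@bayrisch]   ;# de_DE_bayrisch
```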

dgp added on 2002-04-20 06:01:20:
Logged In: YES 
user_id=80530

I don't plan to go that far.  The [::msgcat::mclocale]
command is documented to accept an argument that is
a list of strings joined by "_".  I don't plan to
extend that.

So, the only thing that needs fixing is the way msgcat
uses the environment variable values to derive an argument
to pass to [mclocale].  A [split] on "." and passing only
the first element of the resulting list should take care
of that.

The message catalogs of interest are those packaged with
Tcl packages and applications.  All of them are of the
form foo_bar_baz.msg .  I don't see a compelling reason
to extend that further.  The message catalog files
themselves are always read in the UTF-8 encoding, and then
evaluated as Tcl scripts.  The documentation of msgcat
should probably be improved to make that clear.

haible added on 2002-04-20 05:50:36:
Logged In: YES 
user_id=5923

You are right. The most general locale name form is as you
mentioned. The .msg files that should be tried are

   language_territory.codeset@modifier.msg
   language_territory@modifier.msg
   language_territory.msg
   language.msg

in that order. (Remove the codeset first because it is merely an
implementation issue, thus it has lowest "priority". Remove the
modifier next, because usually a modifier belongs to a particular
language_territory - for example no_NO@nynorsk is a dialect of
Norwegian, and no@nynorsk does not make much sense.)
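
A sketch of that search order, assuming the locale name follows the language[_territory][.codeset][@modifier] form. The helper name searchOrder is made up, and lowercasing the file names is an assumption taken from the lowercase catalog names quoted elsewhere in this ticket.

```tcl
# Hypothetical helper: split a locale name into its components and list
# catalog file names from most to least specific.
proc searchOrder {locale} {
    set pattern {^([^_.@]+)(_[^.@]+)?(\.[^@]+)?(@.+)?$}
    set locale [string tolower $locale]
    if {![regexp $pattern $locale -> lang terr codeset modifier]} {
        return {}
    }
    # Drop the codeset first, then the modifier, then the territory.
    set names [list $lang$terr$codeset$modifier \
                    $lang$terr$modifier \
                    $lang$terr \
                    $lang]
    set files {}
    foreach name $names {
        set file $name.msg
        # Skip duplicates that arise when a component is absent.
        if {[lsearch -exact $files $file] < 0} {
            lappend files $file
        }
    }
    return $files
}

puts [searchOrder pt_BR.UTF-8]   ;# pt_br.utf-8.msg pt_br.msg pt.msg
```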

About LC_ALL, LC_MESSAGES, LANG: you are right here again; this
was mentioned in bug report 525522.

dgp added on 2002-04-20 05:27:35:
Logged In: YES 
user_id=80530

Likewise, for consistency with C libraries, it
would make sense for msgcat to get its default
locale from ::env(LC_ALL), then ::env(LC_MESSAGES),
then ::env(LANG).
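
A minimal sketch of that precedence, assuming empty values are treated as unset. The helper name defaultLocale is made up; it takes the name of an array so that it can be tried on a stand-in for ::env.

```tcl
# Hypothetical helper: return the first non-empty value among LC_ALL,
# LC_MESSAGES and LANG in the named array, or "" if none is usable.
proc defaultLocale {envName} {
    upvar 1 $envName vars
    foreach name {LC_ALL LC_MESSAGES LANG} {
        if {[info exists vars($name)] && $vars($name) ne ""} {
            return $vars($name)
        }
    }
    return ""
}

array set sample {LANG en_US LC_MESSAGES pt_BR.UTF-8}
puts [defaultLocale sample]   ;# pt_BR.UTF-8
```

Calling it as [defaultLocale ::env] would consult the real environment.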

dgp added on 2002-04-20 05:22:28:
Logged In: YES 
user_id=80530


Various man pages and online references suggest that the
XPG4 canonical form for a locale (for example, a suitable
value for $::env(LANG)) is

  language[_territory][.codeset][@modifier]

Systems can understand locales of other forms as well,
but I believe that systems (C libraries) conforming to
XPG4 all agree on what locales of the form above mean.

The [::msgcat::mclocale] command doesn't deal with all
of that, though.  It expects an argument that is
a list of strings joined by "_", like en, en_US, or
en_US_funky.

So it seems that when pulling values out of environment
variables, the msgcat package should pass the portion
of the string before the first "." on to [mclocale].

I think that change would take care of this bug.
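
A sketch of that change, with a made-up helper name beforeDot. It keeps only the part of the value before the first ".", so anything after the dot, including an @modifier, is discarded along with the codeset.

```tcl
# Hypothetical helper: keep only the part before the first ".".
proc beforeDot {value} {
    lindex [split $value .] 0
}

puts [beforeDot pt_BR.UTF-8]        ;# pt_BR
puts [beforeDot de_DE.UTF-8@euro]   ;# de_DE
```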

haible added on 2002-04-18 23:57:39:
Logged In: YES 
user_id=5923

Yes, that would be fine.

.msg files that are written in a particular encoding could be
stored as pt_br.utf-8.msg or pt_br.iso-8859-1.msg.

.msg files that are encoding-independent (because they use
\unnnn syntax - such as the ones produced by GNU msgfmt
0.11.1), on the other hand, could be stored as pt_br.msg.

dgp added on 2002-04-18 23:37:31:
Logged In: YES 
user_id=80530

Would it be sufficient if msgcat looked for the
catalogs:

pt_br.utf-8.msg
pt_br.msg
pt.msg

in that order?
