Tcl Source Code

Ticket UUID: 950acd6df4cffdac6dfecd189864e86262c33ea5
Title: utf-8 to unicode conversion gives broken results
Type: Bug
Version: Tcl 8.6.0 (pl10)
Submitter: anonymous
Created on: 2014-02-24 09:49:39
Subsystem: 11. Conversions from String
Assigned To: nobody
Priority: 5 Medium
Severity: Critical
Status: Closed
Last Modified: 2016-08-23 13:47:57
Resolution: Invalid
Closed By: dkf
Closed on: 2016-08-23 13:47:57
Description:
As far as I understand, "encoding convertfrom $enc $text" is the right command to convert a string with a known encoding to Unicode. This works fine, except when $enc is utf-8. Please try the following script (I hope this tracker does not corrupt the Greek):

----------------------------------------------------------------------
#!/usr/bin/wish

# 1. ensure you have the right encoding to read the string
# 2. we want conversion utf-8 -> unicode
encoding system utf-8

set src "Η Σάμος είναι"

# the result of this operation is broken!
set uni [encoding convertfrom utf-8 $src]
set utf [encoding convertto utf-8 $uni]

# this works fine, it's an iso8859-7 encoded string
set grk [encoding convertfrom iso8859-7 "Ç ÓÜìïò åßíáé"]

pack [label .l1 -text $src]
pack [label .l2 -text $uni]
pack [label .l3 -text $utf]
pack [label .l4 -text $grk]
----------------------------------------------------------------------

You will see that the conversion from utf-8 is broken. Converting back is still broken. Conversion from iso8859-7 is fine.

My application only works if I exclude utf-8 explicitly (my system encoding is utf-8):

if {$encoding eq "utf-8"} {
  set dst $src ;# because convertfrom doesn't work
} else {
  set dst [encoding convertfrom $encoding $src]
}

But I don't think this is intended; according to the manual, the generic command "convertfrom" should work with all encodings.

PS: In my tests I've found that the conversion to utf-8 sometimes gives results with overlong sequences; such results are invalid according to the UTF-8 specification.
User Comments:

jan.nijtmans added on 2014-04-04 12:50:46:
Since no-one disputed dkf's excellent explanation (neither do I), this can be closed.

dkf added on 2014-02-24 11:55:49:

You're doing this backwards.

encoding convertto takes a string and produces a byte sequence that is the encoded form of the string in the specified encoding (or the system encoding if none is given). encoding convertfrom takes a byte sequence, interprets it as encoded text in the specified encoding (or the system encoding), and returns the string it decodes to. Converting a string to a particular encoding can lose information if the target encoding can't represent a character, and the reverse can lose information if the source contains a byte sequence that is not valid in the encoding.
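
To make the direction of the two commands concrete, here is a minimal sketch of a round trip. The variable names (str, bytes, back) and the sample text are illustrative assumptions, and the text is written with \u escapes so the result does not depend on how the script file itself is encoded.

```tcl
# Hypothetical sample text: "Η Σάμος" spelled with \u escapes, so the
# script does not depend on the encoding of the file it is stored in.
set str "\u0397 \u03a3\u03ac\u03bc\u03bf\u03c2"

# String -> bytes: each Greek letter becomes two bytes in UTF-8,
# the space stays a single byte.
set bytes [encoding convertto utf-8 $str]

# Bytes -> string: decoding the bytes recovers the original characters.
set back [encoding convertfrom utf-8 $bytes]

puts [string length $str]       ;# 7  (characters)
puts [string length $bytes]     ;# 13 (bytes: 6 letters * 2 + 1 space)
puts [string equal $str $back]  ;# 1
```

The point of the sketch is only the direction: convertto is applied to a character string, convertfrom to the bytes that convertto (or a binary read) produced.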

If we take all that and try the reverse operations:

% set src "Η Σάμος είναι"
Η Σάμος είναι
% encoding convertto utf-8 $src
Η Σάμος είναι
% encoding convertfrom utf-8 $src
— £¬¼¿Â µ¯½±¹
% encoding convertto utf-8 [encoding convertfrom utf-8 $src]
— £¬¼¿Â µ¯½±¹
% encoding convertfrom utf-8 [encoding convertto utf-8 $src]
Η Σάμος είναι
We can see that you really are doing things backwards. Do not think of Tcl strings as UTF-8 but rather as abstract Unicode character sequences. Let Tcl manage what encoding its strings are in (it's not constant!) and ask for them in a particular encoding only when you need that. By the time a sourced script is interpreted at all, Tcl has already done encoding handling for the entire script; everything in it is understood as an (abstract) Unicode character.
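
As an illustration of "ask for a particular encoding only when you need it", the following sketch pushes the encoding decision to the channel boundary. It assumes Tcl 8.6 (for file tempfile), and the variable names are hypothetical; it is one possible way to arrange this, not the only one.

```tcl
# Hypothetical sample text, again written with \u escapes.
set str "\u0397 \u03a3\u03ac\u03bc\u03bf\u03c2"

# file tempfile (Tcl 8.6+) returns an open channel and stores the path in $path.
set chan [file tempfile path]
fconfigure $chan -encoding utf-8    ;# the channel encodes on output
puts -nonewline $chan $str
close $chan

# Reading as text: the channel decodes, the script only ever sees characters.
set chan [open $path r]
fconfigure $chan -encoding utf-8
set text [read $chan]
close $chan

# Reading as raw bytes: only here is encoding convertfrom the right tool.
set chan [open $path rb]
set raw [read $chan]
close $chan
file delete $path

puts [string equal $text $str]                              ;# 1
puts [string equal [encoding convertfrom utf-8 $raw] $str]  ;# 1
```

In both read paths the script ends up with the same character string; convertfrom is needed only in the second one, where the data really arrived as bytes.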

Changing encoding system is probably a very bad idea; the main thing that is used for is actually controlling how Tcl and the OS understand bytes used in filenames…