Tcl Source Code

Ticket Change Details
Overview

Artifact ID: 1ef56cb2bf2a8186a10627ef1e66e1428964d052
Ticket: 950acd6df4cffdac6dfecd189864e86262c33ea5
utf-8 to unicode conversion gives broken results
User & Date: dkf 2014-02-24 11:55:49
Changes

  1. icomment:
    You're doing this backwards.<p>
    <tt>encoding convertto</tt> takes a string and produces a byte-sequence that is the encoded form of the string using the specified encoding (or the system encoding). <tt>encoding convertfrom</tt> takes a byte-sequence, interprets it as text encoded in the specified encoding (or the system encoding), and returns the decoded string. Converting a string <i>to</i> a particular encoding can lose information if the target encoding can't represent a character, and the reverse can lose information if the source contains a byte sequence that is not valid in the encoding.<p>
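    As a rough illustration of the two directions described above (a sketch, not taken from the ticket; the sample string and variable names are made up, and the comments assume Tcl 8.6 behaviour):
    <pre>
    set s "caf\u00e9"                         ;# 4 characters; the last one is U+00E9
    set bytes [encoding convertto utf-8 $s]   ;# characters to bytes
    puts [string length $s]                   ;# 4
    puts [string length $bytes]               ;# 5, because U+00E9 becomes the two bytes C3 A9
    puts [string equal $s [encoding convertfrom utf-8 $bytes]]   ;# 1: the bytes decode to the same characters

    # ASCII cannot represent U+00E9. Tcl 8.6 substitutes a replacement
    # character here; other versions may raise an error instead, hence the catch.
    if {[catch {encoding convertto ascii $s} lossy]} {
        puts "cannot encode: $lossy"
    } else {
        puts $lossy
    }
    </pre>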
    If we take all that and try the reverse operations:
    <pre>
    % set src "Η Σάμος είναι"
    Η Σάμος είναι
    % encoding convertto utf-8 $src
    Η Σάμος είναι
    % encoding convertfrom utf-8 $src
    — £¬¼¿Â µ¯½±¹
    % encoding convertto utf-8 [encoding convertfrom utf-8 $src]
    — £¬¼¿Â µ¯½±¹
    % encoding convertfrom utf-8 [encoding convertto utf-8 $src]
    Η Σάμος είναι
    </pre>
    We can see that you really are doing things backwards. <strong>Do not think of Tcl strings as UTF-8</strong> but rather as abstract Unicode character sequences. Let Tcl manage what encoding its strings are in (it's not constant!) and ask for them in a particular encoding only when you need that. By the time a <tt>source</tt>d script is interpreted at all, Tcl has already done encoding handling for the entire script; everything in it is understood as an (abstract) Unicode character.<p>
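    One way to ask for a particular encoding only at the boundary is to configure the channel and leave the string alone. The following is a minimal sketch, assuming Tcl 8.6 or later (for <tt>file tempfile</tt>); the variable names are illustrative, and the string is the Greek sample from the transcript above, written with <tt>\u</tt> escapes so that it does not depend on how the script file itself is encoded:
    <pre>
    set src "\u0397 \u03a3\u03ac\u03bc\u03bf\u03c2 \u03b5\u03af\u03bd\u03b1\u03b9"

    # Write: the channel does the character-to-byte conversion.
    set f [file tempfile path]
    fconfigure $f -encoding utf-8
    puts -nonewline $f $src
    close $f
    set nbytes [file size $path]   ;# 24: eleven two-byte Greek letters plus two spaces

    # Read: the channel does the byte-to-character conversion.
    set f [open $path r]
    fconfigure $f -encoding utf-8
    set back [read $f]
    close $f
    file delete $path

    puts [string length $src]        ;# 13 characters
    puts $nbytes                     ;# 24 bytes on disk
    puts [string equal $src $back]   ;# 1
    </pre>
    If the raw bytes are needed in a variable rather than on a channel, <tt>encoding convertto utf-8 $src</tt> produces them; they should then be written to a channel configured with <tt>-translation binary</tt> so that they are not encoded a second time.<p>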
    Changing <tt>encoding system</tt> is probably a very bad idea; its main use is actually controlling how Tcl and the OS interpret the bytes used in filenames…
    
  2. login: "dkf"
  3. mimetype: "text/html"
  4. resolution changed to: "Invalid"
  5. status changed to: "Pending"
  6. username: "dkf"