Tcl Source Code

Ticket UUID: 219341
Title: Unicode line separator \u2028 not recognized
Type: RFE Version: None
Submitter: nobody Created on: 2000-10-26 05:11:00
Subsystem: 45. Parsing and Eval Assigned To: andreas_kupries
Priority: 5 Medium Severity:
Status: Open Last Modified: 2006-09-30 00:40:34
Resolution: None Closed By:
    Closed on:
Description:
OriginalBugID: 5333 Bug
Version: 8.3.1
SubmitDate: '2000-05-03'
LastModified: '2000-05-15'
Severity: LOW
Status: UnAssn
Submitter: techsupp
ChangedBy: hobbs
OS: Windows 98
FixedDate: '2000-10-25'
ClosedDate: '2000-10-25'


Name:
doug edmunds

ReproducibleScript:
create a unicode file using \u2028 (unicode line separator)
instead of LF, CR, or CRLF.  
read file into tcl (with appropriate conversion)
line separator will not be converted.

ObservedBehavior:
if the file is a script, it breaks because all commands
run together onto one line

DesiredBehavior:
on file read, \u2028 converted into CR, LF, or CRLF, as 
appropriate for the system (unix, mac, dos/windows)
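
Until something like the desired behavior exists, a script can do the conversion itself after reading. The sketch below assumes the data has already been decoded into a Tcl string; normalizeLineSeparators is a hypothetical helper name, not a Tcl command, and the sample text is made up for illustration.

```tcl
# Hypothetical helper, not part of Tcl: replace every Unicode
# LINE SEPARATOR (U+2028) with a newline so the parser sees
# one command per line.
proc normalizeLineSeparators {text} {
    return [string map [list \u2028 \n] $text]
}

# Stand-in for the decoded contents of a script file that uses U+2028.
set raw "set a 1\u2028set b 2\u2028"

# Evaluating $raw directly is expected to fail, because U+2028 does not
# end a command; the normalized text evaluates as two commands.
set script [normalizeLineSeparators $raw]
eval $script
puts "a=$a b=$b"
```

This always maps to "\n"; any platform-specific line ending would then be left to the channel's -translation setting on output.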
User Comments:
msofer added on 2006-03-12 21:22:58:

data_type - 360894

Logged In: YES 
user_id=148712

Changing to RFE. 

Note that the conversions are done properly in 8.4.9 (and
presumably every later Tcl): reading in the utf16 file and
passing it through proc u2a (http://wiki.tcl.tk/515) shows

\uFEFFthis is a file with\u2028multiple lines,
which\u2028will be saved as\u2028utf-16 (little endian)\u2028

as it should.

So the RFE is for Tcl's parser to recognise these unicode
whitespaces. Category changed accordingly.
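
The wiki's u2a proc is not reproduced here. As a rough stand-in, a helper of the following shape (showEscapes is a hypothetical name, and this is not the wiki code) makes control and non-ASCII characters visible as \uXXXX escapes, which is enough to check that the separator survived the read.

```tcl
# Hypothetical diagnostic helper, not the wiki's u2a: print control and
# non-ASCII characters as \uXXXX escapes and leave printable ASCII alone.
proc showEscapes {text} {
    set out ""
    foreach ch [split $text ""] {
        # %c yields the character's code point as an integer.
        scan $ch %c code
        if {$code < 32 || $code > 126} {
            append out [format {\u%04X} $code]
        } else {
            append out $ch
        }
    }
    return $out
}

puts [showEscapes "this is a file with\u2028multiple lines"]
```

Note that this sketch also escapes ordinary newlines, so its output will not match the wrapped text shown above line for line.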

msofer added on 2006-03-12 02:33:17:
Logged In: YES 
user_id=148712

Chat quotes:

[16:29] dgp: The Tcl parser recognizes only 5 characters as
word- or list-element-breaking white space.
[16:30] miguel: don: I know what it does; the question is
whether it should incorporate the newer ones. And if not,
that bug should be closed as "won't fix"
[16:31] dgp: that's distinct from [string is space] and
possibly other things like regexp
[16:31] dgp: close it "won't fix"
[16:31] dgp: or recast as an RFE
[16:31] dgp: likely aimed at 9.0
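
The distinction drawn in the chat, between character classification and the parser's own separator set, can be checked interactively. This is a small sketch assuming an interpreter like the ones discussed in this ticket; no result is given for the [string is space] line because the ticket does not record one.

```tcl
set sep \u2028
set word "a${sep}b"

# Character classification: asks whether U+2028 belongs to the
# space class used by [string is].
puts "string is space: [string is space $sep]"

# List parsing: uses the parser's own fixed set of separators.
# The ticket reports 1 here, i.e. the value is not split at U+2028.
puts "llength: [llength $word]"
```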

msofer added on 2006-03-12 02:19:28:
Logged In: YES 
user_id=148712

Status? I tried the attached example in 8.4.9 and HEAD; the
output is still wrong, but different from what is described
in the comments: the line separators are printed as white
space, but they are not taken as whitespace by Tcl:

% set x "a\u2028b"
a b
% llength $x
1
% encoding system
utf-8
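
When the separator is known in advance, the string can be split on it explicitly rather than relying on list parsing. A minimal sketch using the same value as in the session above:

```tcl
set x "a\u2028b"

# split takes an explicit set of split characters, so U+2028 can be named
# directly instead of depending on the parser's whitespace rules.
set parts [split $x \u2028]
puts [llength $parts]   ;# 2
puts $parts             ;# a b
```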

andreas_kupries added on 2001-04-12 22:23:50:
Logged In: YES 
user_id=75003

More Urls:

http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html
http://www.unicode.org/unicode/reports/tr18/#End Of Line

TR18 talks about REGEXes and UNICODE.

andreas_kupries added on 2001-04-11 21:41:30:
Logged In: YES 
user_id=75003

Some relevant urls:

http://www.unicode.org/
http://www.unicode.org/unicode/reports/

http://www.unicode.org/unicode/reports/tr13/
(UAX #13: Unicode Newline Guidelines)

http://www.unicode.org/unicode/reports/tr14/
(UAX #14: Line Breaking Properties)

http://www.unicode.org/unicode/reports/tr18/
(UTR #18: Unicode Regular Expression Guidelines)

http://www.unicode.org/unicode/reports/tr6/
(UTS #6: A Standard Compression Scheme for Unicode)

andreas_kupries added on 2001-04-10 13:58:48:
Logged In: YES 
user_id=75003

I just noticed that Doug talked about utf-16 little endian!
I remember a bug report which complained that Tcl was not
picking up on the first marker characters to auto-detect
endianness in the utf-16 encoding. Possibly a reason for our
problem too.

andreas_kupries added on 2001-04-10 13:55:53:

File Added - 5183: utf219341.tar.gz

andreas_kupries added on 2001-04-10 13:55:51:
Logged In: YES 
user_id=75003

Doug Edmunds is <[email protected]>.

Here is his answer to my questions:

Look at the attached utftest.tcl file. The "appropriate
conversion" is the line:

    fconfigure $fileid -encoding unicode

needed to correctly read in a utf-16 file. A utf-8 file
doesn't need any conversion.

Since you asked about this, I have reconstructed the issue, 
using a Windows-based unicode editor (freeware) called SC
Unipad. It is available at  http://www.sharmahd.com/unipad/

With that program (which uses the unicode line-separator
instead of CR, LF, CRLF) files can be saved in a multitude
of unicode formats, including utf-8 and both utf-16 big endian
and utf-16 little endian.

Attached are 2 files, generated with that editor,
appropriately named utf8test.txt and utf16text.txt.

Also enclosed is a short tcl file which is supposed to open
these files and print some info out to a tcl console. I used
the console that opens with wish83.exe.

Here's the issue: Instead of producing

Total contents of the utf-8 file are:
this is a file with
multiple lines, which
will be saved as
utf-8

It produces
Total contents of the utf-8 file are:
this is a file withâ?¨multiple lines, whichâ?¨will be
saved asâ?¨utf-8

Similarly, using the 'appropriate conversion' the
line-separator is lost, for the utf-16 file:

this is a file with
multiple lines, which
will be saved as
utf-16 (little endian)

It looks like this:
this is a file withmultiple lines, whichwill be saved
asutf-16 (little
endian)

(but which copy-paste converts into question-marks, showing
that the characters are really retained by tcl, but just are
not being printed correctly):

What copy-from-tcl-console paste-to-email produces:
?this is a file with?multiple lines, which?will be saved
as?utf-16 (little
endian)?

I hope this helps.  Personally, I think there will always be
problems until the underlying OS is based on unicode.

andreas_kupries added on 2001-04-09 23:52:41:
Logged In: YES 
user_id=75003

What is the mentioned "appropriate conversion" ?
"utf-8" or "unicode" ?

Attachments: