Tcl Source Code: View Ticket

Ticket UUID:	219341
Title:	Unicode line separator \u2028 not recognized
Type:	RFE	Version:	None
Submitter:	nobody	Created on:	2000-10-26 05:11:00
Subsystem:	45. Parsing and Eval	Assigned To:	andreas_kupries
Priority:	5 Medium	Severity:
Status:	Open	Last Modified:	2006-09-30 00:40:34
Resolution:	None	Closed By:
		Closed on:
Description:
OriginalBugID: 5333
Bug Version: 8.3.1
SubmitDate: '2000-05-03'
LastModified: '2000-05-15'
Severity: LOW
Status: UnAssn
Submitter: techsupp
ChangedBy: hobbs
OS: Windows 98
FixedDate: '2000-10-25'
ClosedDate: '2000-10-25'
Name: doug edmunds
ReproducibleScript: Create a unicode file using \u2028 (unicode line separator) instead of LF, CR, or CRLF. Read the file into tcl (with appropriate conversion). The line separator will not be converted.
ObservedBehavior: If the file is a script, it will break because all commands run together onto one line.
DesiredBehavior: On file read, \u2028 is converted into CR, LF, or CRLF, as appropriate for the system (unix, mac, dos/windows).
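The reproduction described above can be sketched roughly as follows. This is only an illustration, not the submitter's script: the file name ls_test.txt and the three sample commands are assumptions, and utf-8 is used instead of utf-16 to keep the example short.

```tcl
# Hypothetical reproduction sketch; file name and contents are assumptions.
set path [file join [pwd] ls_test.txt]

# Write three "lines" separated by U+2028 instead of LF, CR, or CRLF.
set out [open $path w]
fconfigure $out -encoding utf-8 -translation lf
puts -nonewline $out "set a 1\u2028set b 2\u2028set c 3"
close $out

# Read the file back with the matching encoding.
set in [open $path r]
fconfigure $in -encoding utf-8
set data [read $in]
close $in
file delete $path

# Count the lines Tcl sees after the read. End-of-line translation only
# covers LF, CR, and CRLF, so U+2028 stays in the data as an ordinary character.
puts [llength [split $data \n]]
```

If the behavior reported in this ticket holds, the script reports a single line, which is why a script file written this way runs all of its commands together.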
User Comments:
msofer added on 2006-03-12 21:22:58:
data_type - 360894
Logged In: YES user_id=148712

Changing to RFE. Note that the conversions are done properly in 8.4.9 (and presumably every later Tcl): reading in the utf16 file and passing it through proc u2a (http://wiki.tcl.tk/515) shows

    \uFEFFthis is a file with\u2028multiple lines, which\u2028will be saved as\u2028utf-16 (little endian)\u2028

as it should. So the RFE is for Tcl's parser to recognise these unicode whitespaces. Category changed accordingly.

msofer added on 2006-03-12 02:33:17:
Logged In: YES user_id=148712

Chat quotes:
[16:29] dgp: The Tcl parser recognizes only 5 characters as word- or list-element-breaking white space.
[16:30] miguel: don: I know what it does; the question is if it should incorporate the newer ones. And if not, that bug should be closed as "won't fix"
[16:31] dgp: that's distinct from [string is space] and possibly other things like regexp
[16:31] dgp: close it "won't fix"
[16:31] dgp: or recast as an RFE
[16:31] dgp: likely aimed at 9.0

msofer added on 2006-03-12 02:19:28:
Logged In: YES user_id=148712

status? Tried the attached example in 8.4.9 and HEAD; the output is still wrong, but different from what is described in the comments: the line separators are printed as white space, but they are not taken as whitespace by tcl:

    % set x "a\u2028b"
    a b
    % llength $x
    1
    % encoding system
    utf-8

andreas_kupries added on 2001-04-12 22:23:50:
Logged In: YES user_id=75003

More Urls:
http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html
http://www.unicode.org/unicode/reports/tr18/#End Of Line
TR18 talks about REGEXes and UNICODE.

andreas_kupries added on 2001-04-11 21:41:30:
Logged In: YES user_id=75003

Some relevant urls:
http://www.unicode.org/
http://www.unicode.org/unicode/reports/
http://www.unicode.org/unicode/reports/tr13/ (UAX #13: Unicode Newline Guidelines)
http://www.unicode.org/unicode/reports/tr14/ (UAX #14: Line Breaking Properties)
http://www.unicode.org/unicode/reports/tr18/ (UTR #18: Unicode Regular Expression Guidelines)
http://www.unicode.org/unicode/reports/tr6/ (UTS #6: A Standard Compression Scheme for Unicode)

andreas_kupries added on 2001-04-10 13:58:48:
Logged In: YES user_id=75003

I just noted that Doug talked about utf-16 little endian! I remember a bug report which complained that Tcl was not picking up on the first marker characters to auto-detect endianness in the utf-16 encoding. Possibly a reason for our problem too.

andreas_kupries added on 2001-04-10 13:55:53:
File Added - 5183: utf219341.tar.gz

andreas_kupries added on 2001-04-10 13:55:51:
Logged In: YES user_id=75003

Doug Edmunds is <[email protected]>. Here is his answer to my questions:
___________________________
Look at the attached utftest.tcl file. The "appropriate conversion" is the line:

    fconfigure $fileid -encoding unicode

needed to correctly read in a utf-16 file. A utf-8 file doesn't need any conversion.

Since you asked about this, I have reconstructed the issue, using a Windows-based unicode editor (freeware) called SC Unipad. It is available at http://www.sharmahd.com/unipad/

With that program (which uses the unicode line-separator instead of CR, LF, LFCR) files can be saved in a multitude of unicode formats, including utf-8 and both utf-16 big endian and utf-16 little endian.

Attached are 2 files, generated with that editor, appropriately named utf8test.txt and utf16text.txt. Also enclosed is a short tcl file which is supposed to open these files and print some info out to a tcl console. I used the console that opens with wish83.exe.

Here's the issue: Instead of producing

    Total contents of the utf-8 file are:
    this is a file with
    multiple lines, which
    will be saved as
    utf-8

It produces

    Total contents of the utf-8 file are:
    ï»¿this is a file withâ?¨multiple lines, whichâ?¨will be saved asâ?¨utf-8

Similarly, using the 'appropriate conversion' the line-separator is lost, for the utf-16 file:

    this is a file with
    multiple lines, which
    will be saved as
    utf-16 (little endian)

It looks like this:

    this is a file withmultiple lines, whichwill be saved asutf-16 (little endian)

(but which copy-paste converts into question-marks, showing that the characters are really retained by tcl, but just are not being printed correctly):

What copy-from-tcl-console paste-to-email produces:

    ?this is a file with?multiple lines, which?will be saved as?utf-16 (little endian)?

I hope this helps. Personally, I think there will always be problems until the underlying OS is based on unicode.

andreas_kupries added on 2001-04-09 23:52:41:
Logged In: YES user_id=75003

What is the mentioned "appropriate conversion" ? "utf-8" or "unicode" ?
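
Until the parser itself treats \u2028 as a line break, one possible script-level workaround is to translate the separator by hand after reading. The sketch below is only an illustration of that idea, not a fix proposed in this ticket: the proc name normalizeLineSeparators and the sample strings are assumptions.

```tcl
# Hypothetical helper (name is an assumption): map U+2028 to a newline
# so that Tcl's list and script parsing treat it as a break.
proc normalizeLineSeparators {text} {
    return [string map [list \u2028 \n] $text]
}

# List parsing: the raw string is a single element (as in the session quoted
# in the comments); the normalized string splits at the former separator.
set x "a\u2028b"
puts [llength $x]
puts [llength [normalizeLineSeparators $x]]

# Script evaluation: after normalization each command sits on its own line.
set script "set a 1\u2028set b 2"
eval [normalizeLineSeparators $script]
puts "a=$a b=$b"
```

Evaluating the raw script string without the mapping would hand all of the words to a single set command, which matches the ObservedBehavior in the description.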

Attachments:

utf219341.tar.gz added by andreas_kupries on 2001-04-10 13:55:52.