Ticket UUID: 219341
Title: Unicode line separator \u2028 not recognized
Type: RFE | Version: None
Submitter: nobody | Created on: 2000-10-26 05:11:00
Subsystem: 45. Parsing and Eval | Assigned To: andreas_kupries
Priority: 5 Medium | Severity:
Status: Open | Last Modified: 2006-09-30 00:40:34
Resolution: None | Closed By:
Closed on:
Description:
OriginalBugID: 5333
Bug Version: 8.3.1
SubmitDate: '2000-05-03'
LastModified: '2000-05-15'
Severity: LOW
Status: UnAssn
Submitter: techsupp
ChangedBy: hobbs
OS: Windows 98
FixedDate: '2000-10-25'
ClosedDate: '2000-10-25'
Name: doug edmunds
ReproducibleScript: Create a Unicode file using \u2028 (Unicode line separator) instead of LF, CR, or CRLF. Read the file into Tcl (with appropriate conversion). The line separator will not be converted.
ObservedBehavior: If the file is a script, it will break because all commands run on into one line.
DesiredBehavior: On file read, \u2028 is converted into CR, LF, or CRLF, as appropriate for the system (unix, mac, dos/windows).
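For illustration, the conversion requested in DesiredBehavior can be approximated at script level. The following is a minimal sketch, not something Tcl does on its own: the proc name readWithLineSeparators and the file name u2028-demo.txt are assumptions made for this example, and only \u2028 is mapped (to LF).

```tcl
# Hypothetical helper (not part of Tcl): read a whole file in the given
# encoding and map U+2028 LINE SEPARATOR to LF.
proc readWithLineSeparators {path encoding} {
    set chan [open $path r]
    fconfigure $chan -encoding $encoding
    set data [read $chan]
    close $chan
    return [string map [list \u2028 \n] $data]
}

# Build a small UTF-8 test file that uses U+2028 instead of LF.
# The file name is an assumption made for this sketch.
set path u2028-demo.txt
set chan [open $path w]
fconfigure $chan -encoding utf-8
puts -nonewline $chan "set a 1\u2028set b 2\u2028"
close $chan

set script [readWithLineSeparators $path utf-8]
file delete $path

# After the mapping, the two commands sit on separate lines,
# so the script can be evaluated command by command.
eval $script
puts "a=$a b=$b"
```

With the mapping in place the script evaluates as two separate commands. Without it, the whole file content is handed to the parser as one line, which is the breakage described in ObservedBehavior.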
User Comments:

msofer added on 2006-03-12 21:22:58:
data_type - 360894
Logged In: YES user_id=148712

Changing to RFE. Note that the conversions are done properly in 8.4.9 (and presumably every later Tcl): reading in the utf16 file and passing it through proc u2a (http://wiki.tcl.tk/515) shows

    \uFEFFthis is a file with\u2028multiple lines, which\u2028will be saved as\u2028utf-16 (little endian)\u2028

as it should. So the RFE is for Tcl's parser to recognise these Unicode whitespace characters. Category changed accordingly.

msofer added on 2006-03-12 02:33:17:
Logged In: YES user_id=148712

Chat quotes:
[16:29] dgp: The Tcl parser recognizes only 5 characters as word- or list-element-breaking white space.
[16:30] miguel: don: I know what it does; the question is if it should incorporate the newer ones. And if not, that bug should be closed as "won't fix"
[16:31] dgp: that's distinct from [string is space] and possibly other things like regexp
[16:31] dgp: close it "won't fix"
[16:31] dgp: or recast as an RFE
[16:31] dgp: likely aimed at 9.0

msofer added on 2006-03-12 02:19:28:
Logged In: YES user_id=148712

status? Tried the attached example in 8.4.9 and HEAD; the output is still wrong, but different from what is described in the comments: the line separators are printed as white space, but they are not taken as whitespace by tcl:

    % set x "a\u2028b"
    a b
    % llength $x
    1
    % encoding system
    utf-8

andreas_kupries added on 2001-04-12 22:23:50:
Logged In: YES user_id=75003

More Urls:
http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html
http://www.unicode.org/unicode/reports/tr18/#End Of Line

TR18 talks about REGEXes and UNICODE.

andreas_kupries added on 2001-04-11 21:41:30:
Logged In: YES user_id=75003

Some relevant urls:
http://www.unicode.org/
http://www.unicode.org/unicode/reports/
http://www.unicode.org/unicode/reports/tr13/ (UAX #13: Unicode Newline Guidelines)
http://www.unicode.org/unicode/reports/tr14/ (UAX #14: Line Breaking Properties)
http://www.unicode.org/unicode/reports/tr18/ (UTR #18: Unicode Regular Expression Guidelines)
http://www.unicode.org/unicode/reports/tr6/ (UTS #6: A Standard Compression Scheme for Unicode)

andreas_kupries added on 2001-04-10 13:58:48:
Logged In: YES user_id=75003

I just noted that Doug talked about utf-16 little endian! I remember a bug report which complained that Tcl was not picking up on the first marker characters to auto-detect endianness in the utf-16 encoding. Possibly a reason for our problem too.

andreas_kupries added on 2001-04-10 13:55:53:
File Added - 5183: utf219341.tar.gz

andreas_kupries added on 2001-04-10 13:55:51:
Logged In: YES user_id=75003

Doug Edmunds is <[email protected]>. Here is his answer to my questions:
___________________________

Look at the attached utftest.tcl file. The "appropriate conversion" is the line:

    fconfigure $fileid -encoding unicode

needed to correctly read in a utf-16 file. A utf-8 file doesn't need any conversion.

Since you asked about this, I have reconstructed the issue, using a Windows-based unicode editor (freeware) called SC Unipad. It is available at http://www.sharmahd.com/unipad/

With that program (which uses the unicode line-separator instead of CR, LF, LFCR) files can be saved in a multitude of unicode formats, including utf-8 and both utf-16 big endian and utf-16 little endian.

Attached are 2 files, generated with that editor, appropriately named utf8test.txt and utf16text.txt. Also enclosed is a short tcl file which is supposed to open these files and print some info out to a tcl console. I used the console that opens with wish83.exe.

Here's the issue. Instead of producing

    Total contents of the utf-8 file are:
    this is a file with
    multiple lines, which
    will be saved as
    utf-8

it produces

    Total contents of the utf-8 file are:
    this is a file with�multiple lines, which�will be saved as�utf-8

Similarly, using the 'appropriate conversion', the line-separator is lost for the utf-16 file. Instead of

    this is a file with
    multiple lines, which
    will be saved as
    utf-16 (little endian)

it looks like this:

    this is a file withmultiple lines, whichwill be saved asutf-16 (little endian)

(but copy-paste converts the separators into question marks, showing that the characters are really retained by tcl, but just are not being printed correctly). What copy-from-tcl-console paste-to-email produces:

    ?this is a file with?multiple lines, which?will be saved as?utf-16 (little endian)?

I hope this helps. Personally, I think there will always be problems until the underlying OS is based on unicode.

andreas_kupries added on 2001-04-09 23:52:41:
Logged In: YES user_id=75003

What is the mentioned "appropriate conversion"? "utf-8" or "unicode"?
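
The parser behaviour discussed in the comments above (U+2028 is kept in the data but does not break words or list elements) can be sketched as follows. This is a minimal illustration built from the transcript quoted in the 2006-03-12 comment; results on Tcl versions other than those named in the thread are an assumption.

```tcl
set x "a\u2028b"

# The list parser does not treat U+2028 as an element separator,
# so the string is seen as a single element (reported as 1 for
# 8.4.9 and HEAD in the comment above).
puts [llength $x]

# [string is space] uses a separate character classification from
# the parser, which is the distinction made in the chat quotes.
puts [string is space \u2028]

# Splitting explicitly on U+2028 does not depend on what the parser
# recognizes as white space.
set parts [split $x \u2028]
puts [llength $parts]
```

Splitting explicitly, as in the last step, is one way for a script to handle such data while the parser's set of white space characters stays as it is.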
Attachments:
- utf219341.tar.gz added by andreas_kupries on 2001-04-10 13:55:52.