TIP 497: Full support for Unicode 11.0 and later (part 2)

Author:         Jan Nijtmans <jan.nijtmans@users.sf.net>
Author:         Jan Nijtmans <jan.nijtmans@gmail.com>
Author:         Don Porter <donald.porter@nist.gov>
State:          Draft
Type:           Project
Vote:           Pending
Created:        23-Jan-2018
Post-History:   
Discussions-To: Tcl Core list
Keywords:       Tcl
Tcl-Version:    9.0

Abstract

This TIP proposes to add full support for all characters in Unicode 10.0+, including the characters >= U+010000, and to adapt the regexp engine accordingly. The caveats remaining from TIP #389 will also be handled here.

Summary

TODO

Rationale

TODO

Specification

This document proposes:

  • Add a new objType "UTF-32", which stores a string using 32 bits per character.

  • Adapt the regexp engine to start using the "UTF-32" objType: Any string handled by regexp will first be converted to "UTF-32". (DONE in tip-497 branch)

  • Modify all APIs using Tcl_UniChar: If the string contains surrogate pairs, the "UTF-32" objType will be used.

  • Modify all functions using or producing an index: "string length" should return 1 for a string consisting of any single Unicode character, including the ones >= U+010000.
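
To illustrate why a 32-bit representation makes indices and lengths count whole characters, here is a minimal sketch in C. It is not taken from the tip-497 branch: the function name utf16_to_utf32 is hypothetical and not part of the Tcl API, and the choice to pass unpaired surrogates through unchanged is an assumption made for the example. It combines each surrogate pair into one code point >= U+010000, so the returned length is the number of characters rather than the number of 16-bit units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a Tcl API function): convert a sequence of
 * 16-bit code units into 32-bit code points. A high surrogate
 * (U+D800..U+DBFF) immediately followed by a low surrogate
 * (U+DC00..U+DFFF) is combined into one code point >= U+010000.
 * Unpaired surrogates are copied through unchanged (an assumption of
 * this sketch). dst must have room for srcLen entries.
 * Returns the number of code points written to dst.
 */
static size_t
utf16_to_utf32(const uint16_t *src, size_t srcLen, uint32_t *dst)
{
    size_t i, n = 0;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < srcLen; i++) {
        uint32_t ch = src[i];

        if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < srcLen
                && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* Combine the pair: 10 bits from each half, offset by 0x10000. */
            ch = 0x10000 + ((ch - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;    /* The low surrogate has been consumed as well. */
        }
        dst[n++] = ch;
    }
    return n;
}
```

For example, the three 16-bit units 0x0061, 0xD83D, 0xDE00 ("a" followed by U+01F600 as a surrogate pair) yield two code points, so a length computed on the 32-bit form counts the character >= U+010000 once.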

TODO: everything else that comes up

Compatibility

TODO

Reference Implementation

A reference implementation is in progress in the tip-497 branch: https://core.tcl.tk/tcl/timeline?r=tip-497

Copyright

This document has been placed in the public domain.