Tcl Source Code

Ticket UUID: 8663689908d3304a74fee525cd04aa4162e86391
Title: regexp \\w missing characters
Type: Bug
Version: 8.6.5
Submitter: jan.nijtmans
Created on: 2016-04-08 12:09:48
Subsystem: 43. Regexp
Assigned To: jan.nijtmans
Priority: 9 Immediate
Severity: Minor
Status: Closed
Last Modified: 2021-03-18 14:47:22
Resolution: Fixed
Closed By: jan.nijtmans
Closed on: 2021-03-18 14:47:22
Description:
% string is word \u203f
1
% regexp \\w \u203f
0

It appears that the character UNDERTIE (\u203f) is considered a word character by "string is word", but not by the regexp engine. In total, 9 characters are affected, all of them in the Unicode category Pc (Connector Punctuation).
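
One way to look for such disagreements is to scan the Basic Multilingual Plane and compare the two tests character by character. The following is only a sketch: the proc name wordMismatches is made up for illustration, surrogates are skipped as an assumption about what is worth testing, and the result depends on the Tcl version in use (an interpreter containing the fix would be expected to report nothing for these characters).

```tcl
# Hypothetical helper: collect BMP code points on which
# [string is word] and the regexp escape \w disagree.
proc wordMismatches {} {
    set result {}
    for {set cp 0} {$cp <= 0xFFFF} {incr cp} {
        # Skip the surrogate range; these are not characters on their own.
        if {$cp >= 0xD800 && $cp <= 0xDFFF} continue
        set ch [format %c $cp]
        set isWord [string is word -strict $ch]
        set reWord [regexp {^\w$} $ch]
        if {$isWord != $reWord} {
            lappend result [format U+%04X $cp]
        }
    }
    return $result
}

puts [wordMismatches]
```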
User Comments: jan.nijtmans added on 2021-03-18 14:47:22:

> Something in this collection of facts isn't right.

Indeed, in this case the documentation is not correct. In the pre-Unicode world, "word" characters were [a-zA-Z0-9_], that is, the alphanumeric characters and the underscore.

In the Unicode world, "word" characters are the alphanumeric characters and the "CONNECTOR_PUNCTUATION" characters (Pc). That's the underscore, but also 9 more:

UNDERTIE ‿
CHARACTER TIE ⁀
INVERTED UNDERTIE ⁔
PRESENTATION FORM FOR VERTICAL LOW LINE ︳
PRESENTATION FORM FOR VERTICAL WAVY LOW LINE ︴
DASHED LOW LINE ﹍
CENTRELINE LOW LINE ﹎
WAVY LOW LINE ﹏
FULLWIDTH LOW LINE ＿

So, the documentation should be adapted. The implementation is already doing this correctly. Done now in core-8-6-branch and up.
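
The behaviour described here can be checked with a short script. This is a sketch only: the code points are taken from the Unicode character database rather than from this ticket, and the names pcCodePoints and pcReport are made up for illustration. It prints, for each of the nine characters, the result of "string is word", of \w, and of [[:alnum:]], so the three can be compared on whatever Tcl version is at hand.

```tcl
# Code points of the nine non-ASCII Pc characters (assumed from the
# Unicode character database, not stated in the ticket itself).
set pcCodePoints {0x203F 0x2040 0x2054 0xFE33 0xFE34 0xFE4D 0xFE4E 0xFE4F 0xFF3F}

# Hypothetical helper: one row per code point with the three test results.
proc pcReport {codePoints} {
    set rows {}
    foreach cp $codePoints {
        set ch [format %c $cp]
        lappend rows [list [format U+%04X $cp] \
            [string is word -strict $ch] \
            [regexp {\w} $ch] \
            [regexp {[[:alnum:]]} $ch]]
    }
    return $rows
}

foreach row [pcReport $pcCodePoints] {
    lassign $row name isWord reWord reAlnum
    puts "$name  string-is-word=$isWord  \\w=$reWord  \[\[:alnum:\]\]=$reAlnum"
}
```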


dgp added on 2016-07-22 21:18:47:
Tcl docs (re_syntax.n) say that the "CLASS-SHORTHAND ESCAPES"
are defined:

 \w        [[:alnum:]_] (note underscore)

By this description, \u203f ought to match
regexp {\w} only if it matches {[[:alnum:]]}.

% info patch
8.6.6
% regexp {\w} \u203f
1
% regexp {[[:alnum:]]} \u203f
0

Something in this collection of facts isn't right.

dgp added on 2016-04-08 12:33:32:
Please notify Tom Lane (user 'tgl') about this change.

jan.nijtmans added on 2016-04-08 12:29:58:
Fixed in all active branches