Ticket UUID: | 8663689908d3304a74fee525cd04aa4162e86391 | |||
Title: | regexp \\w missing characters | |||
Type: | Bug | Version: | 8.6.5 | |
Submitter: | jan.nijtmans | Created on: | 2016-04-08 12:09:48 | |
Subsystem: | 43. Regexp | Assigned To: | jan.nijtmans | |
Priority: | 9 Immediate | Severity: | Minor | |
Status: | Closed | Last Modified: | 2021-03-18 14:47:22 | |
Resolution: | Fixed | Closed By: | jan.nijtmans | |
Closed on: | 2021-03-18 14:47:22 | |||
Description: |
% string is word \u203f
1
% regexp \\w \u203f
0
It appears that the character UNDERTIE (\u203f) is considered a word character by "string is word", but not by the regexp engine. In total, 9 characters are missing: all the characters in Unicode category Pc (Punctuation, Connector) other than the underscore. | |||
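The discrepancy described above can be checked across the whole Basic Multilingual Plane rather than for a single character. The following is a minimal sketch, not part of the original report: the proc name `wordMismatches` is hypothetical, and restricting the scan to U+0000..U+FFFF (skipping surrogates) is an assumption made to keep it portable across Tcl 8.6 builds.

```tcl
# Report every BMP code point on which [string is word] and the
# regexp escape \w disagree. The proc name is illustrative only.
proc wordMismatches {} {
    set mismatches {}
    for {set cp 0} {$cp <= 0xFFFF} {incr cp} {
        # Skip the surrogate range; these are not characters on their own.
        if {$cp >= 0xD800 && $cp <= 0xDFFF} continue
        set ch [format %c $cp]
        set byString [string is word $ch]
        set byRegexp [regexp {^\w$} $ch]
        if {$byString != $byRegexp} {
            lappend mismatches [format U+%04X $cp]
        }
    }
    return $mismatches
}

puts [wordMismatches]
```

On a build affected by this ticket the list would be expected to contain U+203F; on a build where the two agree it should be empty.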
User Comments: |
jan.nijtmans added on 2021-03-18 14:47:22:
> Something in this collection of facts isn't right.

Indeed, in this case the documentation is not correct. In the pre-Unicode world, "word" characters were [a-zA-Z0-9_]. This is indeed the alphanumeric characters and the underscore. In the Unicode world, "word" characters are the alphanumeric characters and the "CONNECTOR_PUNCTUATION" characters (Pc). That's the underscore, but also 9 more:

    UNDERTIE ‿
    CHARACTER TIE ⁀
    INVERTED UNDERTIE ⁔
    PRESENTATION FORM FOR VERTICAL LOW LINE ︳
    PRESENTATION FORM FOR VERTICAL WAVY LOW LINE ︴
    DASHED LOW LINE ﹍
    CENTRELINE LOW LINE ﹎
    WAVY LOW LINE ﹏
    FULLWIDTH LOW LINE ＿

So, the documentation should be adapted. The implementation is already doing this correctly. Done now in core-8-6-branch and up.

dgp added on 2016-07-22 21:18:47:
Tcl docs (re_syntax.n) say that the "CLASS-SHORTHAND ESCAPES" are defined:

    \w    [[:alnum:]_] (note underscore)

By this description, \u203f ought to match regexp {\w} only if it matches {[[:alnum:]]}.

    % info patch
    8.6.6
    % regexp {\w} \u203f
    1
    % regexp {[[:alnum:]]} \u203f
    0

Something in this collection of facts isn't right.

dgp added on 2016-04-08 12:33:32:
Please notify Tom Lane (user 'tgl') about this change.

jan.nijtmans added on 2016-04-08 12:29:58:
Fixed in all active branches |
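The gap between the documented definition of \w as [[:alnum:]_] and the actual behaviour can be listed directly. This is a rough sketch under the same assumptions as any BMP-only scan; the proc name `extraWordChars` is hypothetical and does not come from the Tcl sources or this ticket.

```tcl
# Collect BMP code points that match \w but do not match the documented
# equivalent [[:alnum:]_]. The proc name is illustrative only.
proc extraWordChars {} {
    set extra {}
    for {set cp 0} {$cp <= 0xFFFF} {incr cp} {
        # Skip the surrogate range; these are not characters on their own.
        if {$cp >= 0xD800 && $cp <= 0xDFFF} continue
        set ch [format %c $cp]
        if {[regexp {^\w$} $ch] && ![regexp {^[[:alnum:]_]$} $ch]} {
            lappend extra [format U+%04X $cp]
        }
    }
    return $extra
}

puts [extraWordChars]
```

If the explanation in the comments holds for the build being tested, the output should correspond to the nine connector-punctuation characters listed there; this depends on the Tcl version and its Unicode tables.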