Tcl Source Code: View Ticket

Ticket UUID:	2251175
Title:	Expand ({*}), constants, and list syntax
Type:	Bug	Version:	obsolete: 8.6a3
Submitter:	lars_h	Created on:	2008-11-09 21:14:04
Subsystem:	45. Parsing and Eval	Assigned To:	dgp
Priority:	3 Low	Severity:
Status:	Closed	Last Modified:	2012-05-02 00:28:55
Resolution:	Fixed	Closed By:	dgp
		Closed on:	2012-05-01 17:28:55
Description:	The syntax {} uses when reparsing a literate constant appears to deviate from the usual list syntax in that it does not perform backslash substitution. Short example: % set L1 [list \{] ; # List intrep \{ % set L2 {\{} ; # No intrep \{ % set L3 [list {}$L1 {}$L2 {}{\{} {}"\\\{"] \{ \{ {\{} \{ In all of these cases, the string being reparsed by {} has a backslash as first character and a left brace as second character (i.e., the canonical string representation for the length 1 list whose only element is the left brace), but when that string is brace-delimited then something goes horribly wrong; it appears the {} list parser thinks the backslash is just an ordinary character: % set L4 {{a b} {\\} \\ "c \n d"} {a b} {\\} \\ "c \n d" % list {}$L4 {}{{a b} {\\} \\ "c \n d"} {a b} {\\} \\ {c d} {a b} {\\} {\\} {c \n d} This may have been supported by the following language in the dodekalogue: [5] Argument expansion. If a word starts with the string {} followed by a non-whitespace character, then the leading {} is removed and the rest of the word is parsed and substituted as any other word. After substitution, the word is parsed again without substitutions, and its words are added to the command being substituted. That is not consistent with TIP#157 however, since this clearly specifies the {}-reparsing as "After substitution, the word is then parsed as a _list_ (as if with Tcl_SplitList() or Tcl_GetListFromObj)". Backslash substitution is a necessary part of list syntax, since the string representation for list elements which are not balanced with respect to braces requires backslash substitution. I also note that there doesn't seem to be any tests for {*} which involves applying it to lists with a string reperesentation that contains backslashes.
User Comments:	dgp added on 2012-05-02 00:28:55: allow_comments - 1 The TclFindElement improvements were completed some time ago and are now in use. ferrieux added on 2008-12-02 06:24:41: Yes. In an ideal world we'd have a single list parser controlled by flags telling what to do with # ; \n etc. Not an evening session project though :-} About your idea of limiting the runtime expansion to the tricky elements: I had considered it but dismissed it on the grounds that it would push back the cursor towards the complex end of the spectrum, which you convinced me to avoid ;-) As to the suboptimality of re-scanning the tokens to find naked backslashes, it shouldn't stop us because it is already here... dgp added on 2008-12-02 05:41:29: backport for 8.5.6 committed. leaving open as reminder to seek more internals improvements. There's something striking about the fact that the definition of lists and elements and the definition of commands and words are defined to have so much in common, but are implemented completely separately. dgp added on 2008-12-02 05:35:22: File Added - 303665: 2251175.patch Here's the backport patch. File Added: 2251175.patch dgp added on 2008-12-02 05:31:04: Thanks for the work on this. I see that tclCompile.c is now back to the same code as before patches here; back to Revision 1.160, and the fix for this bug is now completely changes to tclParse.c. Committing a few code standards changes to HEAD. The remaining improvement I would make (low priority) would be to break up literal lists into simple parts and complex parts so the simple parts can be expanded in the parser, leaving TCL_TOKEN_EXPAND_WORD only for the elements that need it. This might involve an improved variant of TclFindElement() that reports back the data we need so we don't have to make a second pass scanning for backslashes. Can you confirm my reading of the logs that so far there's been no attempt to backport a fix to the core-8-5-branch ? ferrieux added on 2008-11-28 22:35:48: Fixing status finger-slip ;-) ferrieux added on 2008-11-28 22:31:46: Committed. dgp added on 2008-11-27 13:41:38: It will be at least several days before I can review. Since you've got tests and I trust you've run them, go ahead and commit. I'll review from there (and others will too). ferrieux added on 2008-11-26 07:18:11: File Added - 302973: expand2.patch Attaching expand2.patch implementing Solution2, i.e. detecting naked backslashes in to-expand literals and disabling parse-time expansion in that case. Test suite updated to acknowledge the fact. Also, your example now yields: % proc \{ {} {puts [info frame 1]} % {}[list \{] type eval line -1 cmd {{}[list \{]} level 1 % {}{\{} type eval line -1 cmd {{}{\{}} level 1 Notice the patch is much smaller than its apparent size due to the large number of '-', since it includes the reversion of expand1c.patch. Please review. File Added: expand2.patch ferrieux added on 2008-11-25 05:44:20: OK so you're voting for Solution 2 after all... Why not just say so ;-) Thanks for the journey in the parser though. No regrets... dgp added on 2008-11-25 05:40:57: What I would like is a solution that keeps TCL_TOKEN_EXPAND_WORD out of the picture for the overwhelmingly common case where breaking into equivalent TCL_TOKEN_SIMPLE_WORDs can be done without error. For the cases like those pointed out in the bug report, I would just punt the whole problem to runtime, just the way the whole "not a valid list" error reporting is handled. I think that combination can be done with no new Tcl_Token types. Schedule is pulling me away again for a week or more, but I will contribute something here besides griping, I promise. ferrieux added on 2008-11-19 07:06:27: Just committed the simplifications induced by further analyzing dead branches. tclCompCmds.c is pristine again. tclParse.c's secondary cases eliminated as well (were not WORD tokens). tclCompile.c pruned as you suggested. In total only 3 instances of collapsing remain: TclCompileTokens, TclWordKnownAtCompileTime, TclSubstTokens. Of these, I am only able to produce a realistic use-case of the first. However, I thing they should be kept in case of future evolutions (like new compiled commands using them). As you can see, the residual size of the patch is getting small. However I'm still interested in your opinion re Solution 2. ferrieux added on 2008-11-19 06:10:58: You are right about 1469-1506: they are under the SIMPLE_WORD case, while we carefully demoted from SIMPLE_WORD to WORD the containers of UNCOLLAPSED tokens (to avoid ending up collapsing in all CompCmds). Removing them. Now about your suggestion: this is what I called Solution2 earlier in the thread. Indeed, what makes {}$x work from the beginning is the fact that there is no parse-time expansion, it is runtime expansion. So Solution2 would consist of removing that optimization that is parse-time literal expansion, and always generate EXPAND tokens even for literals. We'll be losing some performance for large expanded constants. Dunno how important they are in real life though... Are you advocating Solution 2 ? dgp added on 2008-11-19 04:10:06: On a different track, when I expressed reservations about complexity, I was referring to the addition of a new Tcl_Token type. I still need to find time for a more thorough review, but my impression is that such an addition should not be necessary to fix the bug. This example has always worked: % set l {a \n b} a \n b % list {}$l a { } b and there was no need for the additional token type to enable that, so I would think the same capabilities could make list {}{a \n b} work as well without the new type. dgp added on 2008-11-19 04:03:55: It appears that running the current test suite does not reach line 1469 or line 1506 of tclCompile.c. Do you have any cases that would do so? If these branches aren't really necessary, it would be nice to remove them. dgp added on 2008-11-19 02:26:42: Do the new tests cover all the branches added by the patch ? ferrieux added on 2008-11-18 05:30:44: Committed as per Donald's suggestion. Leaving open for review. ferrieux added on 2008-11-16 18:18:18: File Added - 301633: expand1c.patch OK. The reason I did this was to limit the range of changes. Assuming checks for TCL_TOKEN_TEXT were spread all over the code, I had the hope that some of them could work "transparently" for uncollapsed text (assuming they are done by bit-masking). Now in hindsight it appears that all clients of text tokens must, by definition, handle differently normal and to-collapse test. So your argument wins. Updating the patch. File Added: expand1c.patch nijtmans added on 2008-11-16 15:05:33: My only remark is that things like: case TCL_TOKEN_TEXT\|TCL_TOKEN_MUSTCOLLAPSE: look confusing: it tests for the new token type (a variant of TCL_TOKEN_TEXT). I would not introduce the concept of 'modifiers' for tokens, especially not if this will be backported to Tcl 8.5. Why not just use a new token type, like TCL_TOKEN_UNCOLLAPSED_TEXT (bikeshed warning: I don't mind the exact name) Then in tcl.h we get: #define TCL_TOKEN_UNCOLLAPSED_TEXT 512 The above case TCL_TOKEN_TEXT\|TCL_TOKEN_MUSTCOLLAPSE: will simplify to case TCL_TOKEN_UNCOLLAPSED_TEXT: Expressions like originally: if (bodyToken[i]->type == TCL_TOKEN_TEXT) { which you changed to: if (bodyToken[i]->type & TCL_TOKEN_TEXT) { would become: if (bodyToken[i]->type & (TCL_TOKEN_TEXT\|TCL_TOKEN_UNCOLLAPSED_TEXT)) { This would make the intention of the patch more clear. It makes the remark you put in tcl.h unneccessary, and the case statement less confusing. Apart form that, it looks fine to me. My $.02 Regards, Jan Nijtmans ferrieux added on 2008-11-16 04:29:28: File Deleted - 301046: ferrieux added on 2008-11-15 07:28:25: File Added - 301423: expand1b.patch Fixed. Attaching the completed patch + tests (including parser tests). File Added: expand1b.patch ferrieux added on 2008-11-14 01:30:28: Oops just found one extra failing case: % proc h {} {return [list {}{a \n b}]} % h a {\n} b I know what's happening, I remember I forgot to update that part (CompileTokens) because it was a bit late :-) Will fix it shortly. ferrieux added on 2008-11-14 00:55:39: I have to admit I too knew nothing of it 4 days ago... I know Miguel is under awful load, but really I'd appreciate a Dirac of review activity here :-} (if one of you can ^Z him...) In the meantime I'll write a few extra tests. andreas_kupries added on 2008-11-14 00:04:15: Looking over the patch I see no reason to object. I should qualify that this looks to touch areas (parser) I do not know much about. Giving back to dgp. ferrieux added on 2008-11-13 00:19:07: Sorry if I have given the impression of a high complexity; indeed the patch is very small. My long paragraph was there only to explain how I had wedged a bit in Tcl_Tokens without extending the sizeof(), and while staying fast (unlike a bitfield). The short story is: - set this bit on expanded TCL_TOKEN_TEXT's with backslashes - in both clients (compile and eval), detect this bit after TIP280 data extraction, and collapse As for the line number "1" I do not know the details, but I'm not really surprised because the patch tries hard to stay on rock-solid "ROM source", meaning storing real offsets in the original string. So line "1" is at least consistent with that ;-) Andreas, your green light ? -Alex dgp added on 2008-11-12 23:54:07: andreas should have look too dgp added on 2008-11-12 23:50:26: I've not looked at the code of expand1.patch, but the description sounds more complex than I would prefer, at least as my first impulse. Applying and testing it I see: % proc \{ {} {puts [info frame 1]} % {}[list \{] type eval line -1 cmd {{}[list \{]} level 1 % {}{\{} type eval line 1 cmd {{}{\{}} level 1 which has what appears to be the proper behavior. There's no proper literals in the original script to point to, so we fall back to the "eval" reporting. I don't understand why one form reports line 1 and the other reports line -1, but since I don't understand, I don't want to make a claim of incorrectness either. If there's a reasonably obvious simpler way to achieve this, I think I'd prefer it, but if not, or if not with reasonable short term effort, then committing expand1.patch is fine with me so that we have some fix in place and we can debate whether or not it is the best fix at a more leisurely pace. Some additional tests are also in order. Thanks for tackling this Alexandre. dgp added on 2008-11-12 09:53:44: No evening session tonight. Confirmed the bug was introduced by the first TIP 280 commit on 2006-11-28. ferrieux added on 2008-11-12 07:01:28: File Added - 301046: expand1.patch Please find attached a patch implementing Solution (1): delayed collapse. It passes the test suite, including [info frame], since it respects the tying between TIP280-data and the original (ROM) source, while doing the necessary collapse just before use, which is (a) before literal creation in TclCompileScript and (b) before string appendage in TclEvalEx. One word about complexity: I did my best to minimize the impact for the "rest of us" never expanding a literal ;-) For this, I kept the size of the Tcl_Token struct unchanged. Only I exploited the fact (documented in comments in tcl.h) that TCL_TOKEN_* values were actually bits so that class tests can be done easily, that is, by bitwise AND. The funny fact is that nowhere in the core is this actually done, all tests are done with "==". For this patch I had to define a modifier bit in this field, defined only for type TCL_TOKEN_TEXT. So I did change all existing checks "==TCL_TOKEN_TEXT" into "&TCL_TOKEN_TEXT", which allows for graceful ignorance of the modifier everywhere it needs be. I believe the perf impact of this change should be between zero and epsilon ;-) Please review. File Added: expand1.patch dgp added on 2008-11-12 06:17:16: Confirmed bug introduce sometime between 8.5a5 and 8.5a6: % info patch 8.5a5 % {expand}{\{} invalid command name "{" % info patch 8.5a6 % {expand}{\{} invalid command name "\{" ferrieux added on 2008-11-12 05:04:10: I have not checked, but I found this in tclParse.c's history revision 1.53 date: 2007/05/30 18:12:59; author: dgp; state: Exp; lines: +138 -53 * generic/tclBasic.c: Removed code that dealt with * generic/tclCompile.c: TCL_TOKEN_EXPAND_WORD tokens representing * generic/tclCompile.h: expanded literal words. These sections were mostly in place to enable [info frame] to discover line information in expanded literals. Since the parser now generates a token for each post-expansion word referring to the right location in the original script string, [info frame] gets all the data it needs. * generic/tclInt.h: Revised the parser so that it never produces * generic/tclParse.c: TCL_TOKEN_EXPAND_WORD tokens when parsing an * tests/parse.test: expanded literal word; that is, something like {}{x y z}. Instead, generate the series of TCL_TOKEN_SIMPLE_WORD tokens to represent the words that expansion of the literal string produces. [RFE 1725186] So, it looks like prior to this date, things were going as per what I called "Solution (2)". Since that implies runtime expansion, which involves Tcl_GetListFromObj, I'd risk a guess that the bug appeared at that time :-{ So the dilemma becomes: (1) complexify Tcl_Token handling a bit to flag must-collapse-backslash tokens, to be collapsed by the client. (2) go back to situation as of revision 1.53 of tclParse.c and bury [RFE 1725186] What do you think ? dgp added on 2008-11-12 04:41:23: Thanks for reporting this. I acknowledge this is important to resolve one way or another before another release. I've raised priority to signal that. In return please acknowledge that this will require care and attention, and that will have to wait until I return from the conference I am currently attending. The next step I will/would take is to look at the original implementation of {expand} before TIP 280 was an issue. Did it have the same troubles? pspjuth added on 2008-11-10 23:48:34: In patch 1870327 I also encountered the problem of compiling dynamic source. I did some ugly workaround. I'm interested in any solution to this problem. https://sourceforge.net/tracker/?func=detail&aid=1870327&group_id=10894&atid=310894 ferrieux added on 2008-11-10 21:43:23: Progress report: I now realize that this approach has problems. Maintaining proper line info while violating the "source in ROM" principle is tricky. I can think of two alternatives: (1) Delay the TclCopyAndCollapse by adding to tokens a flag bit TCL_TOKEN_COLLAPSE, so that only the clients of the parse tree (eval and compile) call TCAC at the latest possible moment (i.e. just when building the objv[]) and still operate on source-ROM for line info. (2) Consider compile-time expansion of literals as a rather useless optimization, and generate a runtime-expand instead. Don, Miguel, please advise !!! ferrieux added on 2008-11-10 18:51:56: File Added - 300840: expand.patch Attaching a patch fixing the issue, using the latter (lazier) merthod: a linked list of individual ckallocs. However, this patch does* break tests info-33.[13], (info frame). I suspect the reason is that it introduces text pointer values that are outside the source, and that get recorded in the bytecode. This can also be seen by ::tcl::unsupported::disassemble on the generated proc of these tests. I will look into these details now, but I'd appreciate Miguel to take over ;-) File Added: expand.patch ferrieux added on 2008-11-10 16:42:28: Hmm, maybe the reason is that it is not easy to tie any new allocations to the result of the parsing. Indeed the rest of tclParse.c arranges for token cells to be allocated/reallocated in a contiguous array, with their member strings pointing into the original source. So, if we (correctly) decide to call TclCopyAndCollapse for each "found element", we'll need to spit out the resulting backslash-subst'ed strings in some space that will die at Tcl_FreeParse time. Maybe we need to complexify the Tcl_Parse struct a bit for that (eg by keeping a single growing malloc'ed block similar to the token array but for massaged strings). Or maybe this is premature optimization again, and a linked list of individual mallocs will do... Miguel, do you confirm/infirm ? ferrieux added on 2008-11-10 15:30:15: Ouch :-) First, a simpler way of summing up the bug is to notice the violation of the invariant list {}{SOMETHING} == [lrange {SOMETHING} 0 end] Second, we witness this violation only for expanded literals, that is, those that are expanded directly within the token stream of the parser. Third, this indeed is due to the fact that the relevant part of the code in tclParse.c only calls TclFindElement instead of Tcl_GetListFromObj or equivalent. This makes a whole lot of a difference of course, since Tcl_GetListFromObj..SetListFromAny calls both TclFindElement and* TclCopyAndCollapse, which does the necessary backslash subst. Miguel, why this ? Premature optimization ;-) ? -Alex

Attachments:

2251175.patch [download] added by dgp on 2008-12-02 05:35:22. [details]
expand2.patch [download] added by ferrieux on 2008-11-26 07:18:11. [details]
expand1c.patch [download] added by ferrieux on 2008-11-16 18:18:18. [details]
expand1b.patch [download] added by ferrieux on 2008-11-15 07:28:24. [details]