Tcl Source Code

View Ticket
Login
Ticket UUID: 2251175
Title: Expand ({*}), constants, and list syntax
Type: Bug Version: obsolete: 8.6a3
Submitter: lars_h Created on: 2008-11-09 21:14:04
Subsystem: 45. Parsing and Eval Assigned To: dgp
Priority: 3 Low Severity:
Status: Closed Last Modified: 2012-05-02 00:28:55
Resolution: Fixed Closed By: dgp
    Closed on: 2012-05-01 17:28:55
Description:
The syntax {*} uses when reparsing a literate constant appears to deviate from the usual list syntax in that it does not perform backslash substitution.

Short example:
% set L1 [list \{] ; # List intrep
\{
% set L2 {\{} ; # No intrep
\{
% set L3 [list {*}$L1 {*}$L2 {*}{\{} {*}"\\\{"]
\{ \{ {\{} \{

In all of these cases, the string being reparsed by {*} has a backslash as first character and a left brace as second character (i.e., the canonical string representation for the length 1 list whose only element is the left brace), but when that string is brace-delimited then something goes horribly wrong; it appears the {*} list parser thinks the backslash is just an ordinary character:

% set L4 {{a b} {\\} \\ "c \n d"}
{a b} {\\} \\ "c \n d"
% list {*}$L4 {*}{{a b} {\\} \\ "c \n d"}
{a b} {\\} \\ {c 
 d} {a b} {\\} {\\} {c \n d}

This may have been supported by the following language in the dodekalogue:
[5] Argument expansion. If a word starts with the string {*} followed by a non-whitespace character, then the leading {*} is removed and the rest of the word is parsed and substituted as any other word. After substitution, the word is parsed again without substitutions, and its words are added to the command being substituted.

That is not consistent with TIP#157 however, since this clearly specifies the {*}-reparsing as "After substitution, the word is then parsed as a _list_ (as if with Tcl_SplitList() or Tcl_GetListFromObj)". Backslash substitution is a necessary part of list syntax, since the string representation for list elements which are not balanced with respect to braces requires backslash substitution. 

I also note that there doesn't seem to be any tests for {*} which involves applying it to lists with a string reperesentation that contains backslashes.
User Comments: dgp added on 2012-05-02 00:28:55:

allow_comments - 1

The TclFindElement improvements were completed some time ago
and are now in use.

ferrieux added on 2008-12-02 06:24:41:
Yes. In an ideal world we'd have a single list parser controlled by flags telling what to do with # ; \n etc.
Not an evening session project though :-}

About your idea of limiting the runtime expansion to the tricky elements: I had considered it but dismissed it on the grounds that it would push back the cursor towards the complex end of the spectrum, which you convinced me to avoid ;-)
As to the suboptimality of re-scanning the tokens to find naked backslashes, it shouldn't stop us because it is already here...

dgp added on 2008-12-02 05:41:29:
backport for 8.5.6 committed.

leaving open as reminder to
seek more internals improvements.

There's something striking about
the fact that the definition of 
lists and elements and the definition
of commands and words are defined to
have so much in common, but are
implemented completely separately.

dgp added on 2008-12-02 05:35:22:

File Added - 303665: 2251175.patch


Here's the backport patch.
File Added: 2251175.patch

dgp added on 2008-12-02 05:31:04:
Thanks for the work on this.  I see that
tclCompile.c is now back to the same code
as before patches here; back to Revision 1.160,
and the fix for this bug is now completely
changes to tclParse.c.

Committing a few code standards changes to HEAD.

The remaining improvement I would make (low priority)
would be to break up literal lists into simple
parts and complex parts so the simple parts can
be expanded in the parser, leaving TCL_TOKEN_EXPAND_WORD
only for the elements that need it.  This might
involve an improved variant of TclFindElement() that
reports back the data we need so we don't have to
make a second pass scanning for backslashes.

Can you confirm my reading of the logs that so far
there's been no attempt to backport a fix to the
core-8-5-branch ?

ferrieux added on 2008-11-28 22:35:48:
Fixing status finger-slip ;-)

ferrieux added on 2008-11-28 22:31:46:
Committed.

dgp added on 2008-11-27 13:41:38:
It will be at least several days
before I can review.  Since you've
got tests and I trust you've run them,
go ahead and commit.  I'll review
from there (and others will too).

ferrieux added on 2008-11-26 07:18:11:

File Added - 302973: expand2.patch

Attaching expand2.patch implementing Solution2, i.e. detecting naked backslashes in to-expand literals and disabling parse-time expansion in that case. Test suite updated to acknowledge the fact.
Also, your example now yields:
% proc \{ {} {puts [info frame 1]}
% {*}[list \{]
type eval line -1 cmd {{*}[list \{]} level 1
% {*}{\{}
type eval line -1 cmd {{*}{\{}} level 1

Notice the patch is much smaller than its apparent size due to the large number of '-', since it includes the reversion of expand1c.patch.

Please review.
File Added: expand2.patch

ferrieux added on 2008-11-25 05:44:20:
OK so you're voting for Solution 2 after all... Why not just say so ;-)
Thanks for the journey in the parser though. No regrets...

dgp added on 2008-11-25 05:40:57:
What I would like is a solution that
keeps TCL_TOKEN_EXPAND_WORD out of the
picture for the overwhelmingly common
case where breaking into equivalent
TCL_TOKEN_SIMPLE_WORDs can be done without
error.

For the cases like those pointed out in
the bug report, I would just punt the whole
problem to runtime, just the way the whole
"not a valid list" error reporting is handled.

I think that combination can be done with no
new Tcl_Token types.

Schedule is pulling me away again for a week or
more, but I will contribute something here
besides griping, I promise.

ferrieux added on 2008-11-19 07:06:27:
Just committed the simplifications induced by further analyzing dead branches. tclCompCmds.c is pristine again. tclParse.c's secondary cases eliminated as well (were not WORD tokens). tclCompile.c pruned as you suggested. In total only 3 instances of collapsing remain: TclCompileTokens, TclWordKnownAtCompileTime, TclSubstTokens. Of these, I am only able to produce a realistic use-case of the first. However, I thing they should be kept in case of future evolutions (like new compiled commands using them).
As you can see, the residual size of the patch is getting small. However I'm still interested in your opinion re Solution 2.

ferrieux added on 2008-11-19 06:10:58:
You are right about 1469-1506: they are under the SIMPLE_WORD case, while we carefully demoted from SIMPLE_WORD to WORD the containers of UNCOLLAPSED tokens (to avoid ending up collapsing in all CompCmds). Removing them.

Now about your suggestion: this is what I called Solution2 earlier in the thread. Indeed, what makes {*}$x work from the beginning is the fact that there is no parse-time expansion, it is runtime expansion. So Solution2 would consist of removing that optimization that is parse-time literal expansion, and always generate EXPAND tokens even for literals. We'll be losing some performance for large expanded constants. Dunno how important they are in real life though... Are you advocating Solution 2 ?

dgp added on 2008-11-19 04:10:06:
On a different track, when I expressed
reservations about complexity, I was
referring to the addition of a new
Tcl_Token type.  I still need to find
time for a more thorough review, but
my impression is that such an addition
should not be necessary to fix the bug.

This example has always worked:

% set l {a \n b}
a \n b
% list {*}$l
a {
} b

and there was no need for
the additional token type
to enable that, so I would think
the same capabilities could make

list {*}{a \n b}

work as well without the new type.

dgp added on 2008-11-19 04:03:55:
It appears that running the current
test suite does not reach line 1469
or line 1506 of tclCompile.c.  Do you
have any cases that would do so?  If
these branches aren't really necessary,
it would be nice to remove them.

dgp added on 2008-11-19 02:26:42:
Do the new tests cover all the
branches added by the patch ?

ferrieux added on 2008-11-18 05:30:44:
Committed as per Donald's suggestion.
Leaving open for review.

ferrieux added on 2008-11-16 18:18:18:

File Added - 301633: expand1c.patch

OK. The reason I did this was to limit the range of changes. Assuming
checks for TCL_TOKEN_TEXT were spread all over the code, I had the
hope that some of them could work "transparently" for uncollapsed text
(assuming they are done by bit-masking).

Now in hindsight it appears that all clients of text tokens must, by
definition, handle differently normal and to-collapse test.
So your argument wins. Updating the patch.
File Added: expand1c.patch

nijtmans added on 2008-11-16 15:05:33:
My only remark is that things like:
    case TCL_TOKEN_TEXT|TCL_TOKEN_MUSTCOLLAPSE:
look confusing: it tests for the new token type
(a variant of TCL_TOKEN_TEXT). I would not introduce
the concept of 'modifiers' for tokens, especially not if
this will be backported to Tcl 8.5. Why not just use
a new token type, like TCL_TOKEN_UNCOLLAPSED_TEXT
(bikeshed warning: I don't mind the exact name)

Then in tcl.h we get:
     #define TCL_TOKEN_UNCOLLAPSED_TEXT 512

The above
    case TCL_TOKEN_TEXT|TCL_TOKEN_MUSTCOLLAPSE:
will simplify to
    case TCL_TOKEN_UNCOLLAPSED_TEXT:

Expressions like originally:
    if (bodyToken[i]->type == TCL_TOKEN_TEXT) {
which you changed to:
    if (bodyToken[i]->type & TCL_TOKEN_TEXT) {
would become:
    if (bodyToken[i]->type & (TCL_TOKEN_TEXT|TCL_TOKEN_UNCOLLAPSED_TEXT)) {

This would make the intention of the patch more clear. It makes the
remark you put in tcl.h unneccessary, and the case statement
less confusing.

Apart form that, it looks fine to me.

My $.02

Regards,
          Jan Nijtmans

ferrieux added on 2008-11-16 04:29:28:

File Deleted - 301046:

ferrieux added on 2008-11-15 07:28:25:

File Added - 301423: expand1b.patch

Fixed. Attaching the completed patch + tests (including parser tests).


File Added: expand1b.patch

ferrieux added on 2008-11-14 01:30:28:
Oops just found one extra failing case:

 % proc h {} {return [list {*}{a \n b}]}
 % h
 a {\n} b

I know what's happening, I remember I forgot to update that part (CompileTokens) because it was a bit late :-)
Will fix it shortly.

ferrieux added on 2008-11-14 00:55:39:
I have to admit I too knew nothing of it 4 days ago...
I know Miguel is under awful load, but really I'd appreciate a Dirac of review activity here :-}
(if one of you can ^Z him...)
In the meantime I'll write a few extra tests.

andreas_kupries added on 2008-11-14 00:04:15:
Looking over the patch I see no reason to object.
I should qualify that this looks to touch areas (parser) I do not know much about.
Giving back to dgp.

ferrieux added on 2008-11-13 00:19:07:
Sorry if I have given the impression of a high complexity; indeed the patch is very small.
My long paragraph was there only to explain how I had wedged a bit in Tcl_Tokens without extending the sizeof(), and while staying fast (unlike a bitfield).
The short story is:

   - set this bit on expanded TCL_TOKEN_TEXT's with backslashes

   - in both clients (compile and eval), detect this bit *after* TIP280 data extraction, and collapse

As for the line number "1" I do not know the details, but I'm not really surprised because the patch tries hard to stay on rock-solid "ROM source", meaning storing real offsets in the original string. So line "1" is at least consistent with that ;-)

Andreas, your green light ?

-Alex

dgp added on 2008-11-12 23:54:07:
andreas should have look too

dgp added on 2008-11-12 23:50:26:
I've not looked at the code of
expand1.patch, but the description
sounds more complex than I would
prefer, at least as my first impulse.

Applying and testing it I see:

% proc \{ {} {puts [info frame 1]}
% {*}[list \{]
type eval line -1 cmd {{*}[list \{]} level 1
% {*}{\{}
type eval line 1 cmd {{*}{\{}} level 1

which has what appears to be the proper
behavior.  There's no proper literals in the
original script to point to, so we fall
back to the "eval" reporting.  I don't
understand why one form reports line 1
and the other reports line -1, but since
I don't understand, I don't want to make
a claim of incorrectness either.

If there's a reasonably obvious simpler
way to achieve this, I think I'd prefer it,
but if not, or if not with reasonable
short term effort, then committing 
expand1.patch is fine with me so that
we have *some* fix in place and we
can debate whether or not it is the
best fix at a more leisurely pace.

Some additional tests are also in order.

Thanks for tackling this Alexandre.

dgp added on 2008-11-12 09:53:44:
No evening session tonight.

Confirmed the bug was introduced
by the first TIP 280 commit on
2006-11-28.

ferrieux added on 2008-11-12 07:01:28:

File Added - 301046: expand1.patch

Please find attached a patch implementing Solution (1): delayed collapse.

It passes the test suite, including [info frame], since it respects the tying between TIP280-data and the original (ROM) source, while doing the necessary collapse just before use, which is (a) before literal creation in TclCompileScript and (b) before string appendage in TclEvalEx.

One word about complexity: I did my best to minimize the impact for the "rest of us" never expanding a literal ;-)
For this, I kept the size of the Tcl_Token struct unchanged. Only I exploited the fact (documented in comments in tcl.h) that TCL_TOKEN_* values were actually bits so that class tests can be done easily, that is, by bitwise AND. The funny fact is that nowhere in the core is this actually done, all tests are done with "==". For this patch I had to define a modifier bit in this field, defined only for type TCL_TOKEN_TEXT. So I did change all existing checks "==TCL_TOKEN_TEXT" into "&TCL_TOKEN_TEXT", which allows for graceful ignorance of the modifier everywhere it needs be. I believe the perf impact of this change should be between zero and epsilon ;-)

Please review.
File Added: expand1.patch

dgp added on 2008-11-12 06:17:16:
Confirmed bug introduce sometime
between 8.5a5 and 8.5a6:

% info patch
8.5a5
% {expand}{\{}
invalid command name "{"

% info patch
8.5a6
% {expand}{\{}
invalid command name "\{"

ferrieux added on 2008-11-12 05:04:10:
I have not checked, but I found this in tclParse.c's history

revision 1.53
date: 2007/05/30 18:12:59;  author: dgp;  state: Exp;  lines: +138 -53
        * generic/tclBasic.c:   Removed code that dealt with
        * generic/tclCompile.c: TCL_TOKEN_EXPAND_WORD tokens representing
        * generic/tclCompile.h: expanded literal words.  These sections were
        mostly in place to enable [info frame] to discover line information
        in expanded literals.  Since the parser now generates a token for
        each post-expansion word referring to the right location in the
        original script string, [info frame] gets all the data it needs.

        * generic/tclInt.h:     Revised the parser so that it never produces
        * generic/tclParse.c:   TCL_TOKEN_EXPAND_WORD tokens when parsing an
        * tests/parse.test:     expanded literal word; that is, something like
        {*}{x y z}.  Instead, generate the series of TCL_TOKEN_SIMPLE_WORD
        tokens to represent the words that expansion of the literal string
        produces.  [RFE 1725186]

So, it looks like prior to this date, things were going as per what I called "Solution (2)". Since that implies runtime expansion, which involves Tcl_GetListFromObj, I'd risk a guess that the bug appeared at that time :-{

So the dilemma becomes:

   (1) complexify Tcl_Token handling a bit to flag must-collapse-backslash tokens, to be collapsed by the client.

   (2) go back to situation as of revision 1.53 of tclParse.c and bury [RFE 1725186]

What do you think ?

dgp added on 2008-11-12 04:41:23:
Thanks for reporting this.

I acknowledge this is important
to resolve one way or another
before another release.
I've raised priority to signal that.

In return please acknowledge that
this will require care and attention,
and that will have to wait until I
return from the conference I am
currently attending.

The next step I will/would take is
to look at the original implementation
of {expand} before TIP 280 was an
issue.  Did it have the same troubles?

pspjuth added on 2008-11-10 23:48:34:
In patch 1870327 I also encountered the problem of
compiling dynamic source.  I did some ugly workaround.
I'm interested in any solution to this problem.  

https://sourceforge.net/tracker/?func=detail&aid=1870327&group_id=10894&atid=310894

ferrieux added on 2008-11-10 21:43:23:
Progress report: I now realize that this approach has problems. Maintaining proper line info while violating the "source in ROM" principle is tricky.
I can think of two alternatives:
(1) Delay the TclCopyAndCollapse by adding to tokens a flag bit TCL_TOKEN_COLLAPSE, so that only the clients of the parse tree (eval and compile) call TCAC at the latest possible moment (i.e. just when building the objv[]) and still operate on source-ROM for line info.
(2) Consider compile-time expansion of literals as a rather useless optimization, and generate a runtime-expand instead.

Don, Miguel, please advise !!!

ferrieux added on 2008-11-10 18:51:56:

File Added - 300840: expand.patch

Attaching a patch fixing the issue, using the latter (lazier) merthod: a linked list of individual ckallocs.

However, this patch *does* break tests info-33.[13], (info frame).
I suspect the reason is that it introduces text pointer values that are outside the source, and that get recorded in the bytecode.
This can also be seen by ::tcl::unsupported::disassemble on the generated proc of these tests.
I will look into these details now, but I'd appreciate Miguel to take over ;-)


File Added: expand.patch

ferrieux added on 2008-11-10 16:42:28:
Hmm, maybe the reason is that it is not easy to tie any new allocations to the result of the parsing. Indeed the rest of tclParse.c arranges for token cells to be allocated/reallocated in a contiguous array, with their member strings pointing into the original source. So, if we (correctly) decide to call TclCopyAndCollapse for each "found element", we'll need to spit out the resulting backslash-subst'ed strings in some space that will die at Tcl_FreeParse time. Maybe we need to complexify the Tcl_Parse struct a bit for that (eg by keeping a single growing malloc'ed block similar to the token array but for massaged strings). Or maybe this is premature optimization again, and a linked list of individual mallocs will do...

Miguel, do you confirm/infirm ?

ferrieux added on 2008-11-10 15:30:15:
Ouch :-)

First, a simpler way of summing up the bug is to notice the violation of the invariant
 
   list {*}{SOMETHING}  ==  [lrange {SOMETHING} 0 end]

Second, we witness this violation only for expanded literals, that is, those that are expanded directly within the token stream of the parser.

Third, this indeed is due to the fact that the relevant part of the code in tclParse.c only calls TclFindElement instead of Tcl_GetListFromObj or equivalent.
This makes a whole lot of a difference of course, since Tcl_GetListFromObj..SetListFromAny calls both TclFindElement *and* TclCopyAndCollapse, which does the necessary backslash subst.

Miguel, why this ? Premature optimization ;-) ?

-Alex

Attachments: