TIP 407: The String Representation of Tcl Lists: the Gory Details

Author:         Donal K. Fellows <dkf@users.sf.net>
Author:         Kevin Kenny <kennykb@acm.org>
Author:         Don Porter <dgp@users.sf.net>
State:          Draft
Type:           Informative
Vote:           No voting
Created:        06-Aug-2012


This document explains some of the details behind Tcl lists; it exists to expose documentation that was previously present only as comments in Tcl's source code.

String Representation of Lists

The routines [in tclUtil.c] implement the conversions of strings to and from Tcl lists. To understand their operation, the rules of parsing and generating the string representation of lists must be known. Here we describe them in one place.

A list is made up of zero or more elements. Any string is a list if it is made up of alternating substrings of element-separating ASCII whitespace and properly formatted elements.

The ASCII characters which can make up the whitespace between list elements are:

    \u0009    \t    TAB
    \u000A    \n    NEWLINE
    \u000B    \v    VERTICAL TAB
    \u000C    \f    FORM FEED
    \u000D    \r    CARRIAGE RETURN
    \u0020          SPACE

NOTE: This set differs from the sets used in other places where Tcl defines a role for "whitespace".
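
As a rough illustration of the whitespace set listed above, the following C sketch tests a character against those six values and counts the elements of a string whose elements are all "bare" (containing no braces, quotes, or backslashes). The names is_list_space() and count_bare_elements() are hypothetical and do not appear in tclUtil.c; the sketch deliberately ignores every quoting rule and is not a substitute for the real parser.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if c is one of the six ASCII characters
 * listed above as element-separating whitespace, 0 otherwise. */
int is_list_space(char c)
{
    switch (c) {
    case '\t': case '\n': case '\v': case '\f': case '\r': case ' ':
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: counts the elements of s, ASSUMING that no element
 * contains a brace, quote, or backslash. Under that assumption an element
 * is simply a maximal run of non-whitespace characters. */
size_t count_bare_elements(const char *s)
{
    size_t count = 0;
    int in_element = 0;

    for (; *s != '\0'; s++) {
        if (is_list_space(*s)) {
            in_element = 0;          /* separator ends any current element */
        } else if (!in_element) {
            in_element = 1;          /* first character of a new element */
            count++;
        }
    }
    return count;
}
```

Under these assumptions, a string consisting only of separators (or the empty string) yields zero elements, and any mix of the six separator characters between bare words counts as a single gap.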

The interpretation of a formatted substring as a list element follows rules similar to the parsing of the words of a command in a Tcl script. Backslash substitution plays a key role, and is defined exactly as it is in command parsing. The same routine, TclParseBackslash(), is used in both command parsing and list parsing.

NOTE: This means that if and when backslash substitution rules ever change for command parsing, the interpretation of strings as lists also changes.

Backslash substitution replaces an "escape sequence" of one or more characters starting with

    \u005c    \    BACKSLASH

with a single character. The one-character escape sequence case happens only when BACKSLASH is the last character in the string. In all other cases, the escape sequence is at least two characters long.

The formatted substrings are interpreted as element values according to the following cases:

This collection of parsing rules is implemented in the routine TclFindElement().

In order to produce lists that can be parsed by these rules, we need the ability to distinguish characters that are part of a list element value from characters providing syntax that defines the structure of the list. This means that our code that generates lists must at a minimum be able to produce escape sequences for the 10 characters identified above that have significance to a list parser.

Canonical Lists

In addition to the basic rules for parsing strings into Tcl lists, there are additional properties to be met by the set of list values that are generated by Tcl. Such list values are often said to be in "canonical form":

Within these constraints, the canonical list generation routines TclScanElement() and TclConvertElement() attempt to generate the string for any list that is easiest to read. When an element value is itself acceptable as the formatted substring, it is usually used (CONVERT_NONE). When some quoting or escaping is required, use of BRACEs (CONVERT_BRACE) is usually preferred over the use of escape sequences (CONVERT_ESCAPE). There are some exceptions to both of these preferences for reasons of code simplicity, efficiency, and continuation of historical habits. Canonical lists never use the QUOTE formatting to delimit their elements because that form of quoting does not nest, which makes construction of nested lists far too much trouble. Canonical lists always use only a single SPACE character for element-separating whitespace.

Future Considerations

When a list element requires quoting or escaping due to a CLOSE BRACKET character or an internal QUOTE character, a strange formatting mode is recommended. For example, if the value "a{b]c}d" is converted by the usual modes:

    CONVERT_BRACE:    a{b]c}d    => {a{b]c}d}
    CONVERT_ESCAPE:   a{b]c}d    => a\{b\]c\}d

we get perfectly usable formatted list elements. However, this is not what Tcl releases have been producing. Instead, we have:

    CONVERT_MASK:     a{b]c}d    => a{b\]c}d

where the CLOSE BRACKET is escaped, but the BRACEs are not. The same effect can be seen replacing ] with " in this example. There does not appear to be any functional or aesthetic purpose for this strange additional mode. The sole purpose I can see for preserving it is to keep generating the same formatted lists programmers have become accustomed to, and perhaps written tests to expect. That is, compatibility only. The additional code complexity required to support this mode is significant. The lines of code supporting it are delimited in the routines marked with #if COMPAT directives. This makes it easy to experiment with eliminating this formatting mode simply with "#define COMPAT 0" above. I believe this is worth considering.
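
To make the difference between the three outputs concrete, here is a minimal C sketch that reproduces them for the example value. It is not the code in TclConvertElement(): the enum, the function name sketch_format(), and the buffer handling are all hypothetical, and the sketch only knows about the characters that appear in this example (the BRACEs, CLOSE BRACKET, and QUOTE). It also assumes the braces in the value are balanced, as they are here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the three formatting modes discussed above. */
enum sketch_mode { SKETCH_BRACE, SKETCH_ESCAPE, SKETCH_MASK };

/* Writes a formatted copy of src into dst (capacity dstsize).
 * Returns 0 on success, -1 if dst might be too small.
 * ASSUMPTION: only '{', '}', ']' and '"' are treated as special. */
int sketch_format(const char *src, enum sketch_mode mode,
                  char *dst, size_t dstsize)
{
    size_t n = 0;
    /* Worst case: every character escaped, or two added braces, plus NUL. */
    size_t need = 2 * strlen(src) + 3;

    if (dstsize < need) {
        return -1;
    }
    if (mode == SKETCH_BRACE) {
        dst[n++] = '{';
    }
    for (; *src != '\0'; src++) {
        int escape = 0;

        if (mode == SKETCH_ESCAPE) {
            /* Escape the braces as well as the bracket or quote. */
            escape = (*src == '{' || *src == '}' ||
                      *src == ']' || *src == '"');
        } else if (mode == SKETCH_MASK) {
            /* Escape only the bracket or quote; leave the braces bare. */
            escape = (*src == ']' || *src == '"');
        }
        if (escape) {
            dst[n++] = '\\';
        }
        dst[n++] = *src;
    }
    if (mode == SKETCH_BRACE) {
        dst[n++] = '}';
    }
    dst[n] = '\0';
    return 0;
}
```

Called with the value a{b]c}d and a sufficiently large buffer, the three modes give the three formatted strings shown above. The sketch also shows why the third mode costs extra code: it is a separate escaping policy that sits between the other two.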

Another consideration is the treatment of QUOTE characters in list elements. TclConvertElement() must have the ability to produce the escape sequence \" so that when a list element begins with a QUOTE we do not confuse that first character with a QUOTE used as list syntax to define list structure. However, that is the only place where QUOTE characters need quoting. In this way, handling QUOTE could really be much more like the way we handle HASH which also needs quoting and escaping only in particular situations. Following up this could increase the set of list elements that can use the CONVERT_NONE formatting mode.

More speculative is that the demands of canonical list form require brace balance for the list as a whole, while the current implementation achieves this by establishing brace balance for every element.

Finally, a reminder that the rules for parsing and formatting lists are closely tied together with the rules for parsing and evaluating scripts, and will need to evolve in sync.

Origin of Document and Copyright Notice

This document is based "near-verbatim" on comments in generic/tclUtil.c in the Tcl source code distribution. http://core.tcl.tk/tcl/doc/trunk/generic/tclUtil.c

Copyright (c) 1987-1993 The Regents of the University of California.
Copyright (c) 1994-1998 Sun Microsystems, Inc.
Copyright (c) 2001 by Kevin B. Kenny. All rights reserved.

This document is made available under the same license as Tcl.

[I, Kevin B. Kenny, dedicate any and all copyright interest in TIP #407 to the public domain. I make this dedication for the benefit of the public at large and to the detriment of my heirs and successors. I intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to TIP #407 under copyright law.]