TIP 407: The String Representation of Tcl Lists: the Gory Details

Author:         Donal K. Fellows <dkf@users.sf.net>
Author:         Kevin Kenny <kennykb@acm.org>
Author:         Don Porter <dgp@users.sf.net>
State:          Draft
Type:           Informative
Vote:           No voting
Created:        06-Aug-2012


This document explains some of the details behind Tcl lists. It exists to expose documentation that was previously present only as comments in Tcl's source code.

String Representation of Lists

The routines [in tclUtil.c] implement the conversions of strings to and from Tcl lists. To understand their operation, the rules of parsing and generating the string representation of lists must be known. Here we describe them in one place.

A list is made up of zero or more elements. Any string is a list if it is made up of alternating substrings of element-separating ASCII whitespace and properly formatted elements.

The ASCII characters which can make up the whitespace between list elements are:

    \u0009    \t    TAB
    \u000A    \n    NEWLINE
    \u000B    \v    VERTICAL TAB
    \u000C    \f    FORM FEED
    \u000D    \r    CARRIAGE RETURN
    \u0020          SPACE

NOTE: This set differs from other places where Tcl defines a role for "whitespace" in the following ways:

  • Unlike command parsing, here NEWLINE is just another whitespace character; its role as a command terminator in a script has no importance here.

  • Unlike command parsing, the BACKSLASH NEWLINE sequence is not considered to be a whitespace character.

  • Other Unicode whitespace characters (recognized by string is space or Tcl_UniCharIsSpace()) do not play any role as element separators in Tcl lists.

  • The NUL byte ought not to appear, as it is not in strings properly encoded for Tcl, but if it is present, it is not treated as separating whitespace or as a string terminator. It is just another character in a list element.
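As a small illustration of the separator rules above, the following sketch builds one string that uses each of the six ASCII separators and one that uses a non-ASCII space. The variable names are illustrative only, and U+2003 EM SPACE is an arbitrary choice of a Unicode whitespace character outside the separator set.

```tcl
# One string containing TAB, NEWLINE, VERTICAL TAB, FORM FEED,
# CARRIAGE RETURN and SPACE between single-letter elements.
set ascii "a\tb\nc\vd\fe\rf g"
puts [llength $ascii]    ;# 7

# EM SPACE is whitespace to Unicode, but not a list element separator,
# so it is just another character of the single element.
set wide "a\u2003b"
puts [llength $wide]     ;# 1
```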

The interpretation of a formatted substring as a list element follows rules similar to the parsing of the words of a command in a Tcl script. Backslash substitution plays a key role, and is defined exactly as it is in command parsing. The same routine, TclParseBackslash(), is used in both command parsing and list parsing.

NOTE: This means that if and when backslash substitution rules ever change for command parsing, the interpretation of strings as lists also changes.

Backslash substitution replaces an "escape sequence" of one or more characters starting with

    \u005c    \    BACKSLASH

with a single character. The one-character escape sequence case happens only when BACKSLASH is the last character in the string. In all other cases, the escape sequence is at least two characters long.
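A brief sketch of backslash substitution as seen by the list parser follows. The sample strings are made up for illustration; the first is written in braces so that the script parser leaves the backslashes alone and the list parser is the one that interprets them.

```tcl
# The list parser, not the script parser, performs these substitutions.
set str {x\u0041y a\tb}
puts [llength $str]                      ;# 2
puts [lindex $str 0]                     ;# xAy
puts [string length [lindex $str 1]]     ;# 3: a, TAB, b

# The one-character case: a lone BACKSLASH at the very end of the string.
set tail "a\\"                           ;# two characters: a and BACKSLASH
puts [string length [lindex $tail 0]]    ;# 2: the BACKSLASH stands for itself
```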

The formatted substrings are interpreted as element values according to the following cases:

  • If the first character of a formatted substring is

        \u007b    {    OPEN BRACE

    then the end of the substring is the matching

        \u007d    }    CLOSE BRACE

    character, where matching is determined by counting nesting levels, and not including any brace characters that are contained within a backslash escape sequence in the nesting count. Having found the matching brace, all characters between the braces are the string value of the element. If no matching close brace is found before the end of the string, the string is not a Tcl list. If the character following the close brace is not an element separating whitespace character, or the end of the string, then the string is not a Tcl list.

    NOTE: this differs from a brace-quoted word in the parsing of a Tcl command only in its treatment of the backslash-newline sequence. In a list element, the literal characters in the backslash-newline sequence become part of the element value. In a script word, conversion to a single SPACE character is done.

    NOTE: Most list element values can be represented by a formatted substring using brace quoting. The exceptions are any element value that includes an unbalanced brace not in a backslash escape sequence, and any value that ends with a backslash not itself in a backslash escape sequence.

  • If the first character of a formatted substring is

        \u0022    "    QUOTE

    then the end of the substring is the next QUOTE character, not counting any QUOTE characters that are contained within a backslash escape sequence. If no next QUOTE is found before the end of the string, the string is not a Tcl list. If the character following the closing QUOTE is not an element separating whitespace character, or the end of the string, then the string is not a Tcl list. Having found the limits of the substring, the element value is produced by performing backslash substitution on the character sequence between the open and close QUOTEs.

    NOTE: Any element value can be represented by this style of formatting, given suitable choice of backslash escape sequences.

  • All other formatted substrings are terminated by the next element separating whitespace character in the string. Having found the limits of the substring, the element value is produced by performing backslash substitution on it.

    NOTE: Any element value can be represented by this style of formatting, given suitable choice of backslash escape sequences, with one exception. The empty string cannot be represented as a list element without the use of either braces or quotes to delimit it.
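The three cases above can be tried out with a short sketch like the one below. The sample strings and variable names are illustrative, and the exact wording of the error messages is left to whatever the Tcl build in use reports.

```tcl
# Three formatted substrings that all encode the element value "a b".
set braced {{a b}}
set quoted {"a b"}
set bare   {a\ b}
foreach form [list $braced $quoted $bare] {
    puts [list [llength $form] [lindex $form 0]]   ;# 1 {a b} each time
}

# Braces keep backslash sequences literal; QUOTEs and bare words substitute.
puts [string length [lindex {{a\tb}} 0]]   ;# 4: a, BACKSLASH, t, b
puts [string length [lindex {"a\tb"} 0]]   ;# 3: a, TAB, b

# BACKSLASH NEWLINE inside braces stays as two literal characters.
set bsnl "\{a\\\nb\}"
puts [string length [lindex $bsnl 0]]      ;# 4: a, BACKSLASH, NEWLINE, b

# Strings that are not lists: each catch reports 1 plus a parser message.
foreach bad [list "\{a b" "\{a\}b" "\"a\"b" "\"a b"] {
    puts "[catch {llength $bad} msg]: $msg"
}
```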

This collection of parsing rules is implemented in the routine TclFindElement().

In order to produce lists that can be parsed by these rules, we need the ability to distinguish characters that are part of a list element value from characters that provide the syntax defining the structure of the list. This means that our code that generates lists must at a minimum be able to produce escape sequences for the 10 characters identified above that have significance to a list parser: the six whitespace characters, BACKSLASH, OPEN BRACE, CLOSE BRACE, and QUOTE.
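As a rough check of this requirement, the sketch below places each of the characters significant to the list parser inside an element value and confirms that the string produced by the list command parses back to the same value. The variable names are illustrative, and the test values are arbitrary.

```tcl
# The characters with meaning to the list parser: six separators,
# BACKSLASH, OPEN BRACE, CLOSE BRACE, and QUOTE.
set specials [list "\t" "\n" "\v" "\f" "\r" " " "\\" "\{" "\}" "\""]

foreach ch $specials {
    foreach value [list "x${ch}y" "${ch}x" "x${ch}"] {
        set formatted [list $value]      ;# string form of a one-element list
        if {[llength $formatted] != 1 || [lindex $formatted 0] ne $value} {
            error "round trip failed"
        }
    }
}
puts "all [llength $specials] characters round-trip"
```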

Canonical Lists

In addition to the basic rules for parsing strings into Tcl lists, the list values generated by Tcl itself satisfy further properties. Such list values are often said to be in "canonical form":

  • When any canonical list is evaluated as a Tcl script, it is a script of either zero commands (an empty list) or exactly one command. The command word is exactly the first element of the list, and each argument word is exactly one of the following elements of the list. This means that any characters that have special meaning during script evaluation need special treatment when canonical lists are produced:

    * Whitespace between elements may not include NEWLINE.

    * The command terminating character,

        \u003b    ;    SEMICOLON

    must be BRACEd, QUOTEd, or escaped so that it does not terminate the command prematurely.

    * Any of the characters that begin substitutions in scripts,

        \u0024    $    DOLLAR
        \u005b    [    OPEN BRACKET
        \u005c    \    BACKSLASH

    need to be BRACEd or escaped.

    * In any list where the first character of the first element is

        \u0023    #    HASH

    that HASH character must be BRACEd, QUOTEd, or escaped so that it does not convert the command into a comment.

    * Any list element that contains the character sequence BACKSLASH NEWLINE cannot be formatted with BRACEs. The BACKSLASH character must be represented by an escape sequence, and unless QUOTEs are used, the NEWLINE must be as well.

  • It is also guaranteed that one can use a canonical list as a building block of a larger script within command substitution, as in this example:

            set script "puts \[[list $cmd $arg]]"; eval $script

    To support this usage, any appearance of the character

        \u005d    ]    CLOSE BRACKET

    in a list element must be BRACEd, QUOTEd, or escaped.

  • Finally it is guaranteed that enclosing a canonical list in braces produces a new value that is also a canonical list. This new list has length 1, and its only element is the original canonical list. This same guarantee also makes it possible to construct scripts where an argument word is given a list value by enclosing the canonical form of that list in braces:

            set script "puts {[list $one $two $three]}"; eval $script

    This sort of coding was once fairly common, though it's become more idiomatic to see the following instead:

            set script [list puts [list $one $two $three]]; eval $script

    In order to support this guarantee, every canonical list must have balance when counting those braces that are not in escape sequences.
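The guarantees listed above can be exercised with a sketch along the following lines. The variable names, the sample values, and the procedure named #hello are made up for illustration.

```tcl
# A value full of characters that are special to the script parser.
set arg "1;\$x \[y\] #z"
set cmd [list set result $arg]
eval $cmd                           ;# evaluated as exactly one command
puts [expr {$result eq $arg}]       ;# 1

# A leading HASH in the first element must not turn the command into a comment.
proc #hello {} { return "called" }
puts [eval [list #hello]]           ;# called

# A canonical list used inside command substitution: the CLOSE BRACKET in
# the element must not end the substitution early.
set script "set out \[[list string length {a]b}]\]"
eval $script
puts $out                           ;# 3

# Enclosing a canonical list in braces gives a one-element list.
set inner [list a "b c" "d\{"]
set outer "{$inner}"
puts [llength $outer]                         ;# 1
puts [expr {[lindex $outer 0] eq $inner}]     ;# 1
```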

Within these constraints, the canonical list generation routines TclScanElement() and TclConvertElement() attempt to generate the string for any list that is easiest to read. When an element value is itself acceptable as the formatted substring, it is usually used (CONVERT_NONE). When some quoting or escaping is required, use of BRACEs (CONVERT_BRACE) is usually preferred over the use of escape sequences (CONVERT_ESCAPE). There are some exceptions to both of these preferences for reasons of code simplicity, efficiency, and continuation of historical habits. Canonical lists never use the QUOTE formatting to delimit their elements because that form of quoting does not nest, which makes construction of nested lists far too much trouble. Canonical lists always use only a single SPACE character for element-separating whitespace.
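A few sample values show these preferences at work. This is only a sketch; the values are arbitrary, and the comments show the string form produced by the list command for each.

```tcl
puts [list abc]           ;# abc     the value itself is acceptable
puts [list "a b"]         ;# {a b}   braces preferred over escapes
puts [list ""]            ;# {}      the empty string needs delimiters
puts [list "a\{b"]        ;# a\{b    unbalanced brace, so escapes are required

# Regenerating a list from its elements yields single SPACE separators.
set messy "a \t b\n   c"
puts [list {*}$messy]     ;# a b c
```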

Future Considerations

When a list element requires quoting or escaping due to a CLOSE BRACKET character or an internal QUOTE character, a strange formatting mode is recommended. For example, if the value "a{b]c}d" is converted by the usual modes:

    CONVERT_BRACE:    a{b]c}d    => {a{b]c}d}
    CONVERT_ESCAPE:   a{b]c}d    => a\{b\]c\}d

we get perfectly usable formatted list elements. However, this is not what Tcl releases have been producing. Instead, we have:

    CONVERT_MASK:     a{b]c}d    => a{b\]c}d

where the CLOSE BRACKET is escaped, but the BRACEs are not. The same effect can be seen replacing ] with " in this example. There does not appear to be any functional or aesthetic purpose for this strange additional mode. The sole purpose I can see for preserving it is to keep generating the same formatted lists programmers have become accustomed to, and perhaps written tests to expect. That is, compatibility only. The additional code complexity required to support this mode is significant. The lines of code supporting it are delimited in the routines by #if COMPAT directives. This makes it easy to experiment with eliminating this formatting mode simply by setting "#define COMPAT 0" in tclUtil.c. I believe this is worth considering.
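To see which form a given Tcl build produces, and to confirm that all three forms discussed above parse to the same element, a sketch such as this one may help. The variable names are illustrative, and no particular output form is assumed for the list command.

```tcl
set value {a{b]c}d}

# The string form chosen by this Tcl build; it may differ between builds.
set formatted [list $value]
puts $formatted
puts [expr {[lindex $formatted 0] eq $value}]    ;# 1

# The brace, escape, and mixed forms are all valid encodings of the value.
set forms [list {{a{b]c}d}} {a\{b\]c\}d} {a{b\]c}d}]
foreach form $forms {
    puts [expr {[lindex $form 0] eq $value}]     ;# 1 each time
}
```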

Another consideration is the treatment of QUOTE characters in list elements. TclConvertElement() must have the ability to produce the escape sequence \" so that when a list element begins with a QUOTE we do not confuse that first character with a QUOTE used as list syntax to define list structure. However, that is the only place where QUOTE characters need quoting. In this way, handling QUOTE could really be much more like the way we handle HASH which also needs quoting and escaping only in particular situations. Following up this could increase the set of list elements that can use the CONVERT_NONE formatting mode.
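The contrast between QUOTE and HASH handling can be observed with a sketch like the following. The sample values are arbitrary, and the printed forms are whatever the Tcl build in use generates; only the round trip is relied upon here.

```tcl
# QUOTE at the start of an element versus QUOTE in the middle.
set lead "\"abc"
set mid  "ab\"c"
puts [list $lead]
puts [list $mid]

# HASH at the start of the first element versus HASH elsewhere.
puts [list "#ab"]
puts [list "a#b"]

# Whatever form is chosen, parsing recovers the original values.
foreach value [list $lead $mid "#ab" "a#b"] {
    puts [expr {[lindex [list $value] 0] eq $value}]   ;# 1 each time
}
```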

More speculative is that the demands of canonical list form require brace balance for the list as a whole, while the current implementation achieves this by establishing brace balance for every element.

Finally, a reminder that the rules for parsing and formatting lists are closely tied together with the rules for parsing and evaluating scripts, and will need to evolve in sync.

Origin of Document and Copyright Notice

This document is based "near-verbatim" on comments in generic/tclUtil.c in the Tcl source code distribution. http://core.tcl.tk/tcl/doc/trunk/generic/tclUtil.c

Copyright (c) 1987-1993 The Regents of the University of California.
Copyright (c) 1994-1998 Sun Microsystems, Inc.
Copyright (c) 2001 by Kevin B. Kenny. All rights reserved.

This document is made available under the same license as Tcl.

[I, Kevin B. Kenny, dedicate any and all copyright interest in TIP #407 to the public domain. I make this dedication for the benefit of the public at large and to the detriment of my heirs and successors. I intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to TIP #407 under copyright law.]