TIP 475: Add [string insert] Command and C API

Login
Bounty program for improvements to Tcl and certain Tcl packages.
Author: Andy Goth (andrew.m.goth@gmail.com)
State: Draft
Type: Project
Vote: Pending
Created: 22-Sep-2017
Post-History:
Keywords: Tcl,string,insert
Tcl-Version: 8.7

Abstract

This TIP proposes a [string insert] subcommand for inserting a substring at a given index. This new [string insert] command is to be the string analogue of [linsert], much like how [string replace] resembles [lreplace].

This TIP additionally proposes a Tcl_ReplaceObj() public function to provide C-level access to the underlying functionality used to implement [string insert] and [string replace].

Rationale

[string insert] Command

Substring insertion is a basic string operation not directly available in current Tcl. Substring insertion can be synthesized from existing string commands, but the numerous legal forms of indexing yield significant difficulty. A novice user cannot be expected to know, much less implement, all possible index formats. Thus it is reasonable to provide a standard substring insertion command.

The current design of [string replace] expressly (albeit inexplicably) prevents its use for performing string insertion. TIP 323 originally proposed to extend [string replace] to allow string insertion, but this aspect of TIP 323 was withdrawn for the sake of compatibility.

Clarification: TIP 475 proposes no changes to the semantics of [string replace].

To mirror the behavior of [linsert], [string insert] at end should append to the string. This is in conflict with [string replace], were it to be extended to permit replacing an empty substring.

[lreplace] of an empty range at end inserts elements immediately before the final element, whereas appending requires the start index to be end+1. The hypothetical extended [string replace] would mirror [lreplace] and therefore would not have the end-relative indexing semantics needed to implement [string insert].

In conclusion, it is most straightforward to simply provide a [string insert] command with the same semantics as [linsert].

Tcl_ReplaceObj() Function

The proposed [string insert] cannot be implemented in terms of [string replace], even if [string replace] were extended to permit replacing an empty substring. However, this fact only argues against implementing [string insert] as a Tcl script which calls [string replace]. It is indeed possible to implement both [string insert] and [string replace] in terms of a common C function supporting the superset of required capabilities.

Not only is it possible, but it is also preferable to take this approach. Minimizing the number of code paths increases the potential benefit of any given optimization, improves instruction cache utilization, and reduces both the possibility for bugs and the number of interactions to be audited.

Furthermore, this new function should be publicly exported using the stubs interface so extension code can call it without having to enter the interpreter. The desirability of providing C APIs is made clear by the existence of the the FlightAware [array] enumeration C API project.

Current Behavior

Currently available methods for string insertion are awkward and do not handle end-relative and TIP 176-style indexing without extraordinary effort. To demonstrate the surprising degree of complexity, the following is a pure Tcl script reference implementation intended to handle all possible index formats and corner cases:

# Pure Tcl implementation of [string insert] command.
proc ::tcl::string::insert {string index insertString} {
    # Convert end-relative and TIP 176 indexes to simple integers.
    if {[regexp -expanded {
        ^(end(?![\t\n\v\f\r ])      # "end" is never followed by whitespace
        |[\t\n\v\f\r ]*[+-]?\d+)    # m, with optional leading whitespace
        (?:([+-])                   # op, omitted when index is "end"
        ([+-]?\d+))?                # n, omitted when index is "end"
        [\t\n\v\f\r ]*$             # optional whitespace (unless "end")
    } $index _ m op n]} {
        # Convert first index to an integer.
        switch $m {
            end     {set index [string length $string]}
            default {scan $m %d index}
        }

        # Add or subtract second index, if provided.
        switch $op {
            + {set index [expr {$index + $n}]}
            - {set index [expr {$index - $n}]}
        }
    } elseif {![string is integer -strict $index]} {
        # Reject invalid indexes.
        return -code error "bad index \"$index\": must be\
                integer?\[+-\]integer? or end?\[+-\]integer?"
    }

    # Concatenate the pre-insert, insertion, and post-insert strings.
    string cat [string range $string 0 [expr {$index - 1}]] $insertString\
               [string range $string $index end]
}

# Bind [string insert] to [::tcl::string::insert].
namespace ensemble configure string -map [dict replace\
        [namespace ensemble configure string -map]\
        insert ::tcl::string::insert]

More sample implementations can be found on the Additional String Functions page of the Tcler's Wiki, but at time of writing, they do not handle end-relative indexing nor can be used to append to a string. Since they are implemented in terms of [string replace] and do not perform any index arithmetic of their own, they actually do support TIP 176 indexes.

Compatibility Considerations

The existence of a command named [string insert] breaks any existing code that assumes [string in] is an unambiguous abbreviation for [string index]. Two options exist:

  1. Special-case [string in] to mean [string index].
  2. Take no special action, in which case [string in] becomes an error.

This TIP proposes option #2 because abbreviations are not guaranteed to be stable in the long term. This TIP targets Tcl 8.7, the first alpha version of which was recently released. Thus there are three reasons why compatibility is deemphasized in this situation.

Specification

Add a new [string insert] command:

string insert string index insertString

Returns a copy of string with insertString inserted at the index'th character. index may be specified as described in the STRING INDICES section.

If index is start-relative, the first character inserted in the returned string will be at the specified index. If index is end-relative, the last character inserted in the returned string will be at the specified index.

If index is at or before the start of string (e.g., index is 0), insertString is prepended to string. If index is at or after the end of string (e.g., index is end), insertString is appended to string.

Add a new Tcl_ReplaceObj() stubs-exported function:

Tcl_Obj *
Tcl_ReplaceObj(interp, objPtr, startIndex, removeCount, insertObjPtr)

Tcl_Interp *interp (in) If non-NULL, interpreter in which error result messages are stored.
Tcl_Obj *objPtr (in/out) Points to a value to manipulate.
int startIndex (in) The index of the first character to insert, replace, or remove.
int removeCount (in) The number of characters to replace or remove.
Tcl_Obj *insertObjPtr (in/out) The value to insert or replace. If NULL, interpreted as empty string.

Tcl_ReplaceObj inserts, replaces, or removes a substring in objPtr and returns a pointer to the new or updated string object. If objPtr or insertObjPtr are unshared, one of them will be modified in place, and its address will be returned. If this optimization is unavailable, a new string object will be allocated to store the result, its address will be returned, and neither objPtr nor insertObjPtr will be modified. Should a memory allocation error occur, Tcl_ReplaceObj returns NULL and (if interp is non-NULL) places error information in the result value of interp.

The choice of operation performed by Tcl_ReplaceObj is determined by removeCount and the length of insertObjPtr. To insert, removeCount is zero, and insertObjPtr is the insertion substring. To replace, removeCount is the number of characters to replace, and insertObjPtr is the replacement substring. To remove, removeCount is the number of characters to remove, and insertObjPtr is NULL or empty string.

startIndex is the index of the first character in objPtr to insert, replace, or remove. Negative startIndex is taken to be zero, and large startIndex is automatically limited to be no greater than the length of objPtr. In a similar manner, negative removeCount is taken to be zero, and large removeCount is automatically limited such that the sum of startIndex and removeCount is no greater than the length of objPtr. Consequently, if startIndex is at the end of objPtr, insertObjPtr will be appended to objPtr, and removeCount will be ignored.

Implement both the bytecoded and non-bytecoded [string insert] and [string replace] commands in terms of Tcl_ReplaceObj().

Reference Implementation

A pure Tcl reference implementation is given above.

The amg-string-insert branch in the Tcl Fossil repository provides an optimized C implementation, complete with documentation, bytecode compilation, and a full set of test cases exercising all code paths.

This C implementation creates a new strinsert bytecode, also known as STR_INSERT or INST_STR_INSERT in varying contexts. The decision was made to keep strinsert separate from the existing strreplace opcode due to differences in end-relative indexing.

Implementing [string insert] in terms of strreplace would require modifying the interface to strreplace and would therefore break compatibility with tclquadcode and potentially other projects that depend on the [tcl::unsupported::assemble] and [tcl::unsupported::disassemble] commands.

Despite the use of the word "separate" above, the strinsert and strreplace opcodes are both implemented in terms of the common Tcl_ReplaceObj() function, as are the non-bytecoded [string insert] and [string replace] commands. This function was written by distilling common code found across the baseline implementations of the various subroutines that now call it.

Open Issues

The Tcl_ReplaceObj() function header comment includes this TODO note:

Memory allocation failure is only checked when concatenating shared, non-pure byte array, non-pure Unicode character array strings. Need to commit to a consistent memory allocation failure handling policy.

This comment is describing the case in which Tcl_ReplaceObj() calls TclStringCatObjv() to concatenate its two arguments. TclStringCatObjv() returns TCL_ERROR on memory allocation failure, and Tcl_ReplaceObj() dutifully translates that to a NULL return, with detailed error information in the interpreter result.

In every other circumstance, Tcl_ReplaceObj() does not check for memory allocation errors. Many, if not all, of the underlying functions it uses call Tcl_Panic() on allocation error, so Tcl_ReplaceObj() has no opportunity to handle errors itself. Therefore, when Tcl_ReplaceObj()'s documentation claims it notifies its caller of memory allocation errors, it is making a promise it can only rarely keep.

Copyright

This document is placed in public domain.

History