Tcl Source Code

View Ticket
Login
Ticket UUID: 0f1ddc0df7fb734cf1cba12ac7c529d29fefb842
Title: exec fails on encoding errors present in the output of the executed command
Type: Bug Version: 9.0
Submitter: anonymous Created on: 2023-11-27 15:57:26
Subsystem: 16. Commands A-H Assigned To: nobody
Priority: 5 Medium Severity: Important
Status: Closed Last Modified: 2024-05-13 13:11:22
Resolution: Fixed Closed By: oehhar
    Closed on: 2024-05-13 13:11:22
Description:

If executing the diff program with command exec on two files with different encodings, the output of diff cannot be read with Tcl9. The following error message is generated: invalid or incomplete multibyte or wide character

The problem occurres using tkdiff and can be reproduced by the following script: puts "Using Tcl [info patchlevel]" set cmd "diff iso8859.txt utf8.txt" catch {eval ::exec $cmd} ret puts "ret=$ret"

Test files will be attached. Paul

User Comments: oehhar added on 2024-05-13 13:11:22:

Paul, I allowed me to change the title of the ticket.

I hope this is ok. Take care, Harald


oehhar added on 2024-05-13 12:28:03:

Nathan, sorry. The process for fixing was done and accepted by the author of the ticket.

The reason for this are many voices fearing incompatibilites and code breakage due to errors thrown, were there were no errors before. The drivers for this are people maintaining a lot of code within large organizations like Don Porter.

It was also made clear, that exec has limitations and open could be used, if the codepage profile is important.

I can understand your agument and I also was convinced of "strict anywhere". But this has changed as a community view.

As a consequence, I close the ticket and set the status to "fixed".

Thank you for your comprehension, Harald


jan.nijtmans added on 2024-05-13 10:08:53:

..... and Rolf, of course ....


jan.nijtmans added on 2024-05-13 10:06:24:

> I propose that exec retain its newfound strict powers

I'll discuss this with Harald, Brian, Steve and Ashok. My guess is that your proposal will not be accepted for 9.0b2.


pooryorick added on 2024-05-13 09:08:00:

I've reopened this issue and changed the resolution to "wont fix" in order to further discuss this. "replace" is a minefield, and saving unsuspecting scripts from stepping on a mine when using exec was a big benefit of the decision to use "strict" by default. glob was only excluded from TIP 657 because no one had the wherewithal to fix it at the time of that TIP. Bad data due to bad encoding is a big problem. Bad data cause problems down the line after it's too late to do anything about it, especially when they get into persistent storage and are processed much later. It is quite common for naive scripts to use exec to obtain binary data. In this case the earlier there is a notable error the better, as it allows the author to correct the script. This case is much more common than the mixed-encoding problem raised in the current issue. For programs like the diff program mentioned in this issue that are just doing their job in some encoding-agnostic way, open can always be used instead. Therefore, there is no real need to strip exec of its strict behaviour.

I propose that exec retain its newfound strict powers, and that Tcl developers focus on adding further knobs to exec if that is deemed worthwhile.


oehhar added on 2023-12-07 08:57:26:

I want to thank Ashok for this work.

We had that open for TCL 9.0 in the weekly wrap-up and Ashok volunteered to look into this.

We should always remember, that behind each commit, there are hours of work.

Thank you all, guys, Harald


apnadkarni added on 2023-12-07 03:11:33:
Fixed in [43d35b1d8b].

anonymous added on 2023-12-05 21:56:21:
Works fine with the test script and inside tkdiff.

Paul

apnadkarni added on 2023-12-05 17:12:52:

Code ready for review, will merge if no objections in a day or two.

Paul's sample test output is now

% catch {eval ::exec $cmd} ret
1
% puts "ret=$ret"
ret=2c2
< ����
---
> äöüß
child process exited abnormally

which I think is correct (the catch raises exception because of diff output as expected)


apnadkarni added on 2023-12-05 14:06:31:
WIP in [apn-bug-0f1ddc0df7]. This intentionally fixes only exec to use the replace profile. In particular, open is unchanged (strict default) as the -profile option can be used there to if necessary.

Still WIP because stderr output behaves as if strict profile is in use.

oehhar added on 2023-12-01 16:40:51:

Hi Jan, thanks, that you jump into the boat. Things move, if this happens.

My opinion:

  • In TCL9, any encoding profile default should change to replace, if it is tcl8. So, I agree with the change.
  • I don't think, this handles the initial issue. The exec profile must be set somewhere else to strict.

Take care, Harald


jan.nijtmans added on 2023-12-01 12:03:40:

If we decide that the best profile for env/glob/argv handling is "replace", we could also make one step further and change the default flag for Tcl_UtfToExternalDString/Tcl_ExternalToUtfDString. That's easier than replace all such calls with the *Ex() alternative, specifying TCL_ENCODING_PROFILE_REPLACE separately everywhere.

How do we think about: [0d84761f9fecedd4|this]?


jan.nijtmans added on 2023-12-01 11:48:21:
> This was already the case in 657 for glob, env as Rob mentions

glob, env currently still uses the "tcl8" profile, since no-one decided to change that. Command line arguments to tclsh/wish still use "tcl8" profile as well.

apnadkarni added on 2023-11-28 03:01:55:
I also share Rolf's opinion that commands where there is no option to configure the encoding profile used by the command should use the replace profile. This was already the case in 657 for glob, env as Rob mentions, and I would argue should be the case for exec as well.

As an aside, I would think the same would apply to *arguments* passed to exec as well as command line passed to tclsh (argv, argv0) though that may be a more minor issue and not worth bothering over.

pointsman added on 2023-11-27 20:07:27:

Perhaps the better workaround for the given example would be: set fd [open |$cmd] fconfigure $fd -profile replace puts "ret:[read $fd]"

While it is true that exec is simple-minded wrt to the encoding of the stdin, stdout (and stderr) of the started process the core problem here are the changes in the I/O system of Tcl 9, especially TIP 657.

Please note that this is an improvement. With Tcl 8.6 the code example provided by Paul show a wrong result without a chance that the programm / programmer is able to notice.

But perhaps it would be better if exec would open the channels always with -profile replace. TIP 657 already has the exception of the glob command (and access to the env array). This would be another one, for reasons.


dkf added on 2023-11-27 16:47:26:

The exec command is notably very simple-minded in that it uses the system encoding for everything including for the encoding of channels used to communicate with the subprocess, and it always assumes that output is readable text in that encoding. It very much is known that this isn't always so! However, "fixing" this tends to make exec much harder to use in the simple cases.

The workaround is to use a pipeline, perhaps like this:

  set f [open |$cmd rb]
  # Binary output can be any byte sequence; no encoding problems possible
  set output [read $f]
  catch {close $f} err
Since you have access to the channel above (unlike inside exec which handles all of that internally) you can configure it any way you want to handle the input or output you expect. (It also makes it easier to capture stderr of the pipeline without losing stdout, etc.)

Note that both exec and open |... are wrappers around the same subsystem.


Attachments: