Tcl Source Code

2022-09-11
16:43 Closed ticket [fb642c54bc]: Incorrect download of compressed encoded data plus 6 other changes artifact: d64d866566 user: kjnash
2018-03-27
12:35 Add attachment test_download.tcl to ticket [fb642c54bc] artifact: e4927d03d2 user: gerhardr
12:19 Add attachment test_download.tcl to ticket [fb642c54bc] artifact: fcbc5c1def user: gerhardr
2018-03-26
16:40 Pending ticket [fb642c54bc]: Incorrect download of compressed encoded data plus 5 other changes artifact: 8dc0563eaf user: sebres
12:12 Closed ticket [fb642c54bc]. artifact: ada8006fc6 user: sebres
2018-03-25
15:46 Ticket [fb642c54bc]: 3 changes artifact: a2d5d9d6ad user: bll
13:52 Add attachment test_download.tcl to ticket [fb642c54bc] artifact: fbdf3b5bd1 user: gerhardr
13:12 New ticket [fb642c54bc] Incorrect download of compressed encoded data. artifact: a9b7cac406 user: gerhardr

Ticket UUID: fb642c54bc58b31daafba9ae495ded4b0417d9bc
Title: Incorrect download of compressed encoded data
Type: Bug Version: 8.6.6
Submitter: gerhardr Created on: 2018-03-25 13:12:25
Subsystem: 29. http Package Assigned To: nobody
Priority: 5 Medium Severity: Minor
Status: Closed Last Modified: 2022-09-11 16:43:20
Resolution: Fixed Closed By: kjnash
    Closed on: 2022-09-11 16:43:20
Description:
Some compressed files are not completely downloaded.
Test script (requires wget to compare the http downloaded file):
----------------------------------------------------- 
package require http
#
# 1st part from Tcl 8.6 manpage example
#
proc httpcopy { url file {chunk 4096} } {
  set out [open $file w]
  set token [::http::geturl $url -channel $out \
      -progress httpCopyProgress -blocksize $chunk]
  close $out
  
  # This ends the line started by httpCopyProgress
  puts stderr ""
  
  upvar #0 $token state
  set max 0
  foreach {name value} $state(meta) {
    if {[string length $name] > $max} {
      set max [string length $name]
    }
    if {[regexp -nocase ^location$ $name]} {
      # Handle URL redirects
      puts stderr "Location:$value"
      return [httpcopy [string trim $value] $file $chunk]
    }
  }
  incr max
  foreach {name value} $state(meta) {
    puts [format "%-*s %s" $max $name: $value]
  }
  
  return $token
}
proc httpCopyProgress {args} {
  puts -nonewline stderr .
  flush stderr
}

#
# === Here starts my additional testing code ===
#
if {[llength $argv]} {
  set url [lindex $argv 0]
} else {
  set url "http://someonewhocares.org/hosts/hosts"
}
set org "outfile1.txt"
set out "outfile2.txt"

puts "Loading file $org from $url using wget"
catch {exec wget $url -O $org}

puts "Loading file $out from $url via httpcopy"
httpcopy $url $out
puts "HTTP file copy size is [file size $out], wget filesize is [file size $org]"
----------------------------------------------------- 
Output example
$ tclsh test_download.tcl 
Loading file outfile1.txt from http://someonewhocares.org/hosts/hosts using wget
Loading file outfile2.txt from http://someonewhocares.org/hosts/hosts via httpcopy
.......
Date:                Fri, 23 Mar 2018 21:23:21 GMT
Server:              Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips
content-disposition: attachment: filename=hosts
cache-control:       public, max-age=86400
Last-Modified:       Thu, 22 Mar 2018 08:13:42 GMT
Vary:                Accept-Encoding
Content-Encoding:    gzip
Connection:          close
Transfer-Encoding:   chunked
Content-Type:        text/plain
HTTP file copy size is 342964, wget filesize is 416015
User Comments: kjnash added on 2022-09-11 16:43:20:
This ticket raises three separate issues:

1. [open $file wb]

The "b" flag is equivalent to
    fconfigure $out -translation binary
and in current http this is done automatically if the content-type is binary, if -binary is set, or if the stacked channel includes decompression - using the following code:

    if {$state(-binary) || [IsBinaryContentType $state(type)]} {
        # Turn off conversions for non-text data.
        set state(binary) 1
    }
    if {[info exists state(-channel)]} {
        if {$state(binary) || [llength [ContentEncoding $token]]} {
            fconfigure $state(-channel) -translation binary
        }
        ...
    }
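
To illustrate why the translation setting matters when a channel receives raw bytes, here is a minimal sketch that writes the same four bytes through a text-mode and a binary-mode channel. The proc name writeAndMeasure and the file names are made up for this example, and the text-mode channel is pinned to UTF-8 with LF line endings (an assumption) so the result does not depend on platform defaults.

```tcl
# Hypothetical helper: write $data to $path using the given open mode
# and return the resulting file size in bytes.
proc writeAndMeasure {path mode data} {
    set ch [open $path $mode]
    if {$mode eq "w"} {
        # Assumption for the demo: fix the text-mode settings so the
        # outcome is the same on every platform.
        fconfigure $ch -encoding utf-8 -translation lf
    }
    puts -nonewline $ch $data
    close $ch
    return [file size $path]
}

# Four raw bytes, two of them >= 0x80 (as in any compressed stream).
set data [binary format H* 1f8b08ff]

set textSize [writeAndMeasure demo_text_mode.bin w  $data]
set binSize  [writeAndMeasure demo_binary_mode.bin wb $data]
puts "text mode: $textSize bytes, binary mode: $binSize bytes"

file delete demo_text_mode.bin demo_binary_mode.bin
```

In text mode each byte at or above 0x80 is re-encoded as a two-byte UTF-8 sequence, so the file no longer matches the data that was sent. In binary mode the bytes are written unchanged.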

2. "Accept-Encoding identity"

This is necessary to tell the server not to use compression.  The http::geturl option -zip 0 should do this but does not - this bug is unrelated to this ticket (which always used the default -zip 1) and was fixed in commit [3cee774ebf] of branch http-bugfixes-2022H2.
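
As a sketch of the header-based workaround discussed in this ticket, the following proc asks the server for an uncompressed body and writes it to a binary-mode file. The proc name fetchIdentity is hypothetical. It uses only ::http::geturl with -channel and -headers, plus ::http::ncode and ::http::cleanup.

```tcl
package require http

# Hypothetical helper: download $url into $file without compression.
# Returns the numeric HTTP status code.
proc fetchIdentity {url file} {
    set out [open $file wb]
    try {
        # "Accept-Encoding identity" asks the server to send the body as is.
        set token [::http::geturl $url -channel $out \
            -headers {Accept-Encoding identity}]
    } finally {
        close $out
    }
    set code [::http::ncode $token]
    ::http::cleanup $token
    return $code
}

# Usage (needs network access, so it is left commented out):
# puts [fetchIdentity http://someonewhocares.org/hosts/hosts outfile2.txt]
```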

3. using gzip

The response is truncated if gzip is used.

The error is the same as the truncation issue seen in ticket [3610253] for a chunked+gzip response written directly to a -channel.  That bug is now fixed.
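
One way to spot this kind of truncation, assuming the raw gzip body is available separately (for example saved by another tool), is to compare the decompressed length with the ISIZE field in the last four bytes of the gzip stream. The sketch below is illustrative only: the proc name gzipDeclaredSize and the sample data are made up, and ISIZE holds the size modulo 2^32.

```tcl
# Hypothetical helper: return the uncompressed size recorded in the
# gzip trailer (ISIZE, a little-endian unsigned 32-bit value).
proc gzipDeclaredSize {gz} {
    binary scan [string range $gz end-3 end] iu size
    return $size
}

# Sample data standing in for a downloaded hosts file.
set plain [string repeat "127.0.0.1 ads.example.invalid\n" 1000]
set gz    [zlib gzip $plain]

set declared  [gzipDeclaredSize $gz]
set full      [zlib gunzip $gz]
# Simulate a download that lost its tail after decompression.
set truncated [string range $full 0 end-1000]

puts "declared:  $declared"
puts "full:      [string length $full]"
puts "truncated: [string length $truncated]"
```

A decompressed file shorter than the declared size, like the simulated truncated value here, points to data lost during or after decompression.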

sebres (claiming to be [email protected]) added on 2018-03-26 16:40:47:

Reopened because of https://groups.google.com/d/msg/comp.lang.tcl/7mNIrCZH2Ks/dbOyZcYABQAJ:

> ```diff
> - set out [open $file w]
> + set out [open $file wb]
> ```

Confirmed: downloading with -headers {Accept-Encoding identity} together with "set out [open $file wb]" gives the same result as wget.

But without the geturl header {Accept-Encoding identity} the download file size differs by ca. 70kB. .... "HTTP file copy size is 342957, wget filesize is 416135"

> Please reopen if I'm wrong.

Don't know if this can be seen as expected behavior. The man page also explains the option -binary, but even setting it to true and opening the file in binary mode does not download the complete file.

From my point of view it is really hard to find out why geturl behaves as it does. Improving the man pages, or adding gzip encoding to the manpage example, might give a clearer picture.

BTW: Nevertheless my problem is solved, many thanks to clt :-)

It looks like some response headers are still not handled correctly if "Accept-Encoding" is not "identity" (or the transfer is "chunked", as in my case).
Currently, I've no time to dig deeper.


sebres added on 2018-03-26 12:12:35:

You're trying to write the file using the current system encoding (which is UTF-8, I assume).

So just change:

- set out [open $file w]
+ set out [open $file wb]
And you'll get it correctly.

Note, this server does not provide the charset (just says text/plain in content-type, but it says nothing about what it is). So the target encoding is undefined, wget will use binary here (so writes as is).
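
To make the missing-charset point concrete, here is a small sketch that extracts the charset parameter from a Content-Type value and returns an empty string when the server does not provide one. The proc name charsetOf is hypothetical and the parsing is deliberately simplified, so it is not a full media-type parser.

```tcl
# Hypothetical helper: return the lower-cased charset parameter of a
# Content-Type header value, or "" if none is given.
proc charsetOf {contentType} {
    if {[regexp -nocase {;\s*charset\s*=\s*"?([^\s;"]+)} $contentType -> cs]} {
        return [string tolower $cs]
    }
    return ""
}

puts "text/plain               -> '[charsetOf {text/plain}]'"
puts "text/html; charset=UTF-8 -> '[charsetOf {text/html; charset=UTF-8}]'"
```

When the result is empty, as for this server's plain text/plain, the caller has no declared target encoding and writing the bytes unchanged is the safe choice.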


bll added on 2018-03-25 15:46:53:
Relevant discussion at:

https://groups.google.com/forum/#!topic/comp.lang.tcl/7mNIrCZH2Ks

Attachments: