Tcl Source Code

Ticket UUID: 3165071
Title: http::geturl fails for at least 1 RSS feed
Type: Bug
Version: obsolete: 8.6b1
Submitter:
Created on: 2011-01-25 04:10:04
Subsystem: 29. http Package
Assigned To: patthoyts
Priority: 5 Medium
Severity: Minor
Status: Closed
Last Modified: 2022-09-10 12:04:28
Resolution: Fixed
Closed By: kjnash
Closed on: 2022-09-10 12:04:28
Description:
The following script fails when run on tcl-8.6.0.0b4:

  package require http
  package require tdom

  set url {http://www.npr.org/templates/rss/podlayer.php?id=13}
  set file "[ clock seconds ].xml"
  set out [ open $file w ]
  http::geturl $url -channel $out
  close $out
  set channel [ tDOM::xmlOpenFile $file ]
  set doc [ dom parse -channel $channel ]
  chan close $channel

The [dom parse ...] fails because the file retrieved by the [http::geturl ...] is missing a chunk of data starting somewhere around line 67.

_____________________________________________________________________________________________

Configuration information:
----------------------------------------

OS: Windows 7 (Version 6.1 (Build 7600))
Tcl: ActiveTcl 8.6.0.0b4

  % info tclversion
  8.6
  % info patchlevel
  8.6b1.2
  % package require http
  2.8.2
  % package require tdom
  0.8.3
User Comments: kjnash added on 2022-09-10 12:04:28:
(1) The encoding issue (when the server does not specify content-type or charset) is a duplicate of bug [2998307] and now has a fix.

(2) The URL http://www.npr.org/templates/rss/podlayer.php?id=13 no longer exists, and so it is not possible to reproduce the "missing chunk of text" bug.  Perhaps this was a bug in Tcl 8.6.0.0b4 - but it seems to have gone away.  Or at least I have never seen it in Tcl 8.6 or 8.7 (i.e. it "works for me").

added on 2011-03-09 06:09:02:
Notes: 
1) The attached XML file labeled "valid version of the RSS feed" was generated by running the code on Tcl 8.5.9.
2) The attached XML file labeled "corrupted version of the RSS feed" was generated by running the code on Tcl 8.6.0.0b4 (ActiveTcl version #).

added on 2011-03-09 06:00:47:
Thanks for the response.

I'll take a look at the specified URLs.

One thing that confuses me about this issue is that the application I excerpted the code from has been working fine on Tcl 8.5.6, 8.5.8, and 8.5.9; but when I tried it with an 8.6 beta release, I found that a chunk of data was missing near the 67th line in the temp file of captured data.

Is the behavior of http::geturl expected to be different in 8.6?

patthoyts added on 2011-03-07 04:38:16:
See also http://diveintomark.org/archives/2004/02/13/xml-media-types and RFC3023. Given no explicit content-encoding header and a type of text/xml we should be (and are) treating the data as iso8859-1.

patthoyts added on 2011-03-07 04:31:12:
Your problem is in handling the encoding. You should probably be setting the file to binary; otherwise we apply the channel encoding to the data as we write it to the disk file. In this case you likely later expect it to be UTF-8.

However, when we retrieve the resource using the http package we handle the encoding as declared on the remote site. In this case there is no http content-encoding header and its type is text/xml so we are treating the inbound stream as iso8859-1. The file internally declares xml utf-8 but we will never look inside the file.
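
The effect described above can be reproduced locally without the http package. The following is only a sketch of the mechanism, not the http package's actual code path: it assumes the server sends UTF-8 bytes, that they are decoded as iso8859-1, and that the output file channel would use utf-8 (the real channel encoding depends on the platform).

```tcl
# UTF-8 bytes for "é" (U+00E9), standing in for what the server sends.
set wire [encoding convertto utf-8 "\u00e9"]

# Decode them as iso8859-1: each byte becomes one character.
set decoded [encoding convertfrom iso8859-1 $wire]

# What a utf-8 text channel (assumed here) would write for those characters.
set viaText [encoding convertto utf-8 $decoded]

# What a binary channel would write: one byte per character, unchanged.
set viaBinary [encoding convertto iso8859-1 $decoded]

binary scan $wire H* wireHex
binary scan $viaText H* textHex
binary scan $viaBinary H* binaryHex
puts "wire:   $wireHex"
puts "text:   $textHex"
puts "binary: $binaryHex"
```

Under these assumptions the text path turns the original two bytes into four, while the binary path writes the bytes exactly as they arrived.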

I suggest you use [open $file wb] and see how that works.
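
A minimal sketch of that suggestion, assuming Tcl 8.6 (for `try`) and the http package. The helper name `fetchToFile` and its error handling are hypothetical and not part of the original script; it has not been verified against the feed in this ticket.

```tcl
package require http

# Hypothetical helper: download $url into $path, with the output file
# opened in binary mode so no channel encoding is applied on write.
proc fetchToFile {url path} {
    set out [open $path wb]
    try {
        set token [http::geturl $url -channel $out]
        set status [http::status $token]
        set code [http::ncode $token]
        http::cleanup $token
    } finally {
        # Close the file even if the transfer raised an error.
        close $out
    }
    if {$status ne "ok" || $code != 200} {
        error "download failed: status=$status code=$code"
    }
    return $path
}
```

It would be called as `fetchToFile $url $file` in place of the open / http::geturl / close sequence in the description. The http package also documents a `-binary` option for http::geturl, which may be worth trying alongside the binary output file.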

added on 2011-01-25 11:19:28:

File Added - 399714: 1295925860.xml

added on 2011-01-25 11:18:43:

File Added - 399713: 1295925929.xml

added on 2011-01-25 11:10:05:

File Added - 399712: http-tcl-8.6.0.0b4-defect.tcl
