Tcl Source Code

View Ticket
Login
Ticket UUID: 2895565
Title: Busyloop in vwait in innocent HTTP code
Type: Bug Version: obsolete: 8.5.7
Submitter: ferrieux Created on: 2009-11-10 23:02:12
Subsystem: 02. Event Loops Assigned To: andreas_kupries
Priority: 9 Immediate Severity:
Status: Closed Last Modified: 2009-11-13 00:41:46
Resolution: Fixed Closed By: andreas_kupries
    Closed on: 2009-11-12 17:41:46
Description:
As reported in http://groups.google.com/group/comp.lang.tcl/tree/browse_frm/thread/6deca54954e61cfd/94ab3376153994d0?rnum=1&_done=%2Fgroup%2Fcomp.lang.tcl%2Fbrowse_frm%2Fthread%2F6deca54954e61cfd%2F432b7b5576b7ec5c%3F#doc_19398a625d24ac4d 

In 8.5 HEAD:

    set file_id [open myfile.html w] 
    http::geturl http://www.yahoo.com -channel $file_id \ 
        -blocksize 4096 -timeout 60000 
    close $file_id 

This code, in addition to only downloading an incomplete file, busyloops during the whole timeout.
It seems the 0-millisecond timer at line 7978 of TclIO.c fires repeatedly.
User Comments: andreas_kupries added on 2009-11-13 00:41:46:

allow_comments - 1

Confirmed the existence of this bug on the 8.4 branch, and that the same
patch fixes it.
Fix committed to the branch.

andreas_kupries added on 2009-11-13 00:26:18:
Confirmed the existence of this bug on the 8.6 branch, and that the same patch fixes it.
Fix committed to HEAD.

andreas_kupries added on 2009-11-13 00:05:40:
The simpler patch passes the whole testsuite, not only io-53.11.
Committed to 8.5 branch.
Keeping the entry open until the ports to 8.4 and 8.6 are done.

andreas_kupries added on 2009-11-12 23:59:39:

File Added - 350616: fcopy-fix-simpler.patch

dgp added on 2009-11-12 22:48:15:
test io-53.11 committed to 8-5-branch for
testing this bug.

Confirmed the test fails, then applying
fcopy-fix.patch makes it succeed.

andreas_kupries added on 2009-11-12 07:52:04:
Attached a smaller testcase, using just fcopy, and the input file I used. And a patch which apparently fixes the issue, although currently not fully understood why.

More investigation tomorrow.

andreas_kupries added on 2009-11-12 07:51:14:

File Added - 350538: input

andreas_kupries added on 2009-11-12 07:49:57:

File Added - 350537: trial.tcl

andreas_kupries added on 2009-11-12 07:49:16:

File Added - 350536: fcopy-fix.patch

ferrieux added on 2009-11-12 07:47:09:
Very simple repro, portable and without http:

set f [open aa w]
fconfigure $f -translation binary
puts $f [string repeat \n 20]
close $f
set f [open aa r]
fconfigure $f -translation binary
set g [open bb w]
fconfigure $g -translation crlf -encoding utf-8
fcopy $f $g -size 4 -command done
vwait forever
puts DONE

dgp added on 2009-11-12 05:11:35:
tests io-52.10 and chan-io-52.10 test
for [chan copy]'s peculiar reaction to
-encoding binary channels.

andreas_kupries added on 2009-11-12 04:49:58:
Direct link to the bug mentioned in the chat logs
http://sourceforge.net/support/tracker.php?aid=219210

andreas_kupries added on 2009-11-12 04:36:59:
[13:25]dgpback for just a sec
[13:26]dgpin light of the surprising but documented (!) meaning of -encoding binary in the context of [fcopy] ...
[13:27]dgp...appears a possible resolution of 2895565 is some flavor of "Not A Bug"
[13:27]dgpat least not a bug in [fcopy]
[13:28]akuLike 'you can't mix in/out translations/encodings in that way' ?
[13:28]dgpFor ugly historical reasons "-encoding binary" doesn't mean what you think it does.
[13:29]dgprevise your use of [fcopy]
[13:29]akuNote: The bug as reported appears for -encoding != binary
[13:30]akuWell, at least out channel != binary. I can't say if the in channel (socket of http) is bianry or not.
[13:30]dgpok, probably multiple bugs
[13:30]dgphave to come back to it; afk again

andreas_kupries added on 2009-11-12 04:15:52:
[12:59]dgpsomebody named kupries documented this on 2001-05-19 
[13:00]dgpfor Tcl 8.4 ;/
[13:00]aku:}
[13:01]dgp209210
[13:02]akuUh, private artifact ? I can't see it.
[13:02]dgpwhat bug is that?
[13:02]akuI am logged in
[13:03]akuHm, could that be a scripts bug number ?
[13:03]akuscripts -> scriptics
[13:04]dgpdon't think so; they were 4 digits
[13:04]akuHm ... Not an AS bug number either. (although I already worked for them at the time, just remote from germany)
[13:06]dgp219210
[13:06]akuAh, yeah, thtat looks like it
[13:07]akuIt seems the special treatment is because I wanted to keep backward compat somehow, for binary encodings
[13:08]akuand then fumbled things ...
[13:08]aku;/
[13:09]   * dgp called away

andreas_kupries added on 2009-11-12 04:00:10:
more notes, copied from the chat.

[12:33]dgpaku?
[12:34]aku?
[12:34]dgpFor a channel with ChannelState field "encoding == NULL"
[12:34]dgpthat's meant to be interpreted as "-encoding binary" ?
[12:34]akuAFAIK yes
[12:35]dgpI think the binary transfer rules of CopyData are broken then too.
[12:36]dgpIf the out channel is -encoding binary
[12:36]dgpand the in channel is not.
[12:37]dgpthen the in channel reads with DoReadChars putting the data into the internal Tcl encoding in the process.
[12:37]dgpCopyData then just copies these bytes out
[12:37]dgpand that's not what -encoding binary is supposed to mean, AIUI
[12:38]aku... Looking ... Ok, yes ... The DoWriteChars has this special recoding thing going on, and DoWrite doesn't ... 
[12:39]dgpI'm making guesses about what DoWrite() does I suppose.
[12:39]aku(DoWriteChars line 3357 (8.5-branch))
[12:39]akuNo, I looked into DoWrite, and does EOL translation only, see my notes in the bug report
[12:41]akuso, it seems we have to forget in/outBinary, and use only sameEncoding to switch between the Do* functions.
[12:41]akuAnd the Do*Chars then internally handle -encoding binary
[12:41]akuif needed
[12:42]dgpthe other wrinkle is that [fcopy] docs say the -size argument is the number of bytes to copy
[12:43]dgpIf that number of bytes leaves us in the middle of a multi-byte character, my interpretation is the partial character does not get copied.
[12:44]akuew ... I believe it may be easier to treat this as doc bug, bytes only if -encoding binary on input, otherwise chars in the encoding.
[12:44]Out of a cloud of smoke, emiliano appears!
[12:45]dgpchars are usually more sensible things to count at the script level.
[12:45]kostix is away (Automatically away due to idle)
[12:45]dgpand for -encoding binary channels chars are bytes
[12:45]akuOh ... I just read this ... If only the output channel is set to  encoding  "binary"  the  system
       will  write  the internal UTF-8 representation of the incoming characters
[12:45]dgpoh
[12:46]akuIOW, fcopy seems to have a special interpretation for -encoding binary on the output, which the implementation is not honoring
[12:46]dgpno that's what the implementationj is doing.
[12:46]dgpJust bizarre that a -encoding binary configuration on a channel means something different to a [puts] than it does to an [fcopy]
[12:47]akuSure ? There is this GetByteArrayFromObj going on in DoWriteChars
[12:47]akuAnd for the input as well it seems, have you fcopy.n open ?
[12:47]dgpwhen outBinary is true, DoWrite() is called
[12:47]akuRight! Thay may now explain why that is done
[12:48]dgpsome archeology in order
[12:48]akuThere is also ... If only the input channel is set to encoding "binary" the system will assume that the
       incoming bytes are valid UTF-8 characters
[12:48]dgpDid the docs come first, or just record the botch rather than fix it.
[12:48]akuThat explain the DoRead on input inBinary
[12:48]dgpCause this is just too weird
[12:48]akudocs vs code - timeline ordering - I do not remember
[12:49]akuQuite
[12:49]dgpI'll explore a bit more
[12:49]akuThanks
[12:49]akuShall I copy this discussion and add to the bug rpeort, as additional notes ?
[12:49]dgpok
[12:50]dgp[fcopy] is now [chan copy] ?
[12:50]akuIIRC yes
[12:50]dgpand [chan] is 8.5 or 8.6?
[12:50]aku8.5 already
[12:50]dgpOne option (not sure I like it yet)....
[12:51]dgp...is to just leave [fcopy] untouched.
[12:51]dgpdeclare it irretrievably confused.
[12:51]dgpand start over and get [chan copy] sensibly correct.
[12:52]akuI have to admit, I do not like that either. OTOH it may be our best bet going forward, leaving the fcopy compatible, and fixing the bug is to change fcopy for chan copy in newer code

andreas_kupries added on 2009-11-11 23:54:53:
While I made no testcase ... From the description I remember on the chat the data is coming with EOL markers LF, and is stored to a file using EOL markers CR+LF. This means that during write the 4096 byte _expand_ to 4165 IIRC, then the use of the DoWrite* results to count how much is left to read goes wonky ... That should be something we can make a testcase for. I do not remember if we encodings of input and output also played a role in forcing the wonky code-path. Even if yes, that is also something we should be able to emulate in a test case.

dgp added on 2009-11-11 22:33:47:
Has the investigation so far led to a test case
that does not require the http package and external
servers?  Some quirky use of [fcopy] perhaps?

andreas_kupries added on 2009-11-11 09:08:46:
Ok, here my notes after reading CopyData and recursing into the related functions, i.e. Do*.

DoReadChars
        Request: 'toRead' - Number of characters.
        Result:  Number of characters read
        Storage: A Tcl_Obj
        Does EOL translation (ReadBytes, ReadChars branches)
        Does encoding conversion -> Result is UTF8, ReadChars branch

DoRead
        Request: 'toRead' - Number of _bytes_
        Result: Number of bytes read  - (Comment says chars, is in error)
        Storage: char* array (bytes).
        NOTE!           No encoding conversion applied.
        Does            EOL translation (GetInput + CopyAndTranslateBuffer)

DoWriteChars
        Request:        #bytes to write
        Result:         #bytes written
        Storage:        char* array, utf8 encoded characters
        Does encoding conversion, does EOL translation
        (WriteChars (E, EOL), WriteBytes (EOL)).

DoWrite
        Request:        #bytes to write.
        Result:         #bytes written
        Storage:        char* array
        Does EOL translation. No encodings applied.

Use in CopyData seems to be byte-oriented all the way, making trouble
for DoReadChars, DoWriteChars.

I readily see places where #chars are compared to #bytes. Gah.  And
then there is the possibility of EOL translations changing sizes
unpredictably.

I believe, whatever we have written should have no bearing on the
number of things 'toRead' left ... We are to copy N chars/bytes, and
it does not matter how many chars/bytes get written to the destination
for K chars/bytes of the N for the counting down of how much is left.

~~~~

We should maybe try to

   typedef IntChars int;
   typedef IntBytes int;

and see if the compiler warns if we assign variables of different Int*
types to each other. If fear it will not warn.

Or change the varnames to reflect exactly what type of things they
count.

Or maybe split CopyData into four routines, one for each cominbation
of in/outBinary//sameEncoding. Possibly easier, factor the read and
write sections of CopyData into separate functions, and then split
each of these into two, based on the encoding/binary settings. Then we
could write each function to count properly, without lots of
conditionals inside to switch between the possible variants.

This part of tclIO feels a bit like one of the stables of augias(sp?).

ferrieux added on 2009-11-11 07:07:37:
The number of bytes is probably larger because of CRLF translation (this happens only on Windows, and the output file is opened in unadorned w mode, hence transl is crlf).
What I can't grok is why we are taking the number of written bytes to decrement the number of remaining bytes to read. That doesn't make sense. Fix seems trivial but I'd like to understand...

andreas_kupries added on 2009-11-11 06:45:38:
Hm ... non-binary case only ? Could this be some byte vs char misinterpretation somewhere ?

andreas_kupries added on 2009-11-11 06:45:03:
Why is DoWriteChars() returning a number which is larger than the size of the buffer it got ?

ferrieux added on 2009-11-11 06:34:00:
In TclIO.c +8710: I see that in the binary->non-binary case, we correctly call DoWriteChars, but we then use the returned number of bytes (which is 4165 here for an initial 4096 buffer). Then when we decrement the total count to read by this value, we get a negative result:

if (outBinary || sameEncoding) {
    sizeb = DoWrite(outStatePtr->topChanPtr, buffer, sizeb);
} else {
    sizeb = DoWriteChars(outStatePtr->topChanPtr, buffer, sizeb);
}

if (inBinary || sameEncoding) {
    /*
     * Both read and write counted bytes.
     */

    size = sizeb;

ferrieux added on 2009-11-11 06:21:58:
Turns out the tight loop involves CopyData, the worker from [fcopy].
Once established, it is entered with csPtr->toRead == -4 !

patthoyts added on 2009-11-11 06:21:56:
I can confirm this on 8.5 branch. It important that the channel not be binary. When I used [open output wb] the download succeeded. Adding -timeout and -command options do not affect the fault but do allow the job to be timedout which should help in testing.
ie: http::geturl $url -channel $f -timeout 10000 -command [list close $f]

Attachments: