Tcl Source Code

View Ticket
Login
Ticket UUID: f583715154c6f33d621418e8c9f6dfae5cc5ebc4
Title: Fileevent runs in the wrong thread
Type: Bug Version: 8.6.10
Submitter: sbron Created on: 2020-01-10 12:14:24
Subsystem: 49. Threading Assigned To: jan.nijtmans
Priority: 5 Medium Severity: Severe
Status: Closed Last Modified: 2020-04-02 14:42:55
Resolution: Fixed Closed By: sebres
    Closed on: 2020-04-02 14:42:55
Description:

The script shown below sets up an async socket in a helper thread and then transfers the channel to the main thread. In the main thread, a writable file event handler is then configured on the channel. However, when this event handler is invoked, it runs in the helper thread rather than the main thread.

package require Thread
set tid [thread::create]

puts "Main thread: [thread::id]"
puts "Helper thread: $tid"

set fd [thread::send $tid {
    set fd [socket -async 8.8.8.8 53]
    thread::detach $fd
    set fd
}]

thread::attach $fd
fileevent $fd writable {
    puts "Fileevent thread: [thread::id]"
    exit
}

vwait forever

The output of the script is something like:

Main thread: tid0x7f868f4b5680
Helper thread: tid0x7f868ddbb700
Fileevent thread: tid0x7f868ddbb700

System: openSUSE Leap 15.1 x86_64

I initially opened this issue against the thread extension (d481f15516). However, it appears to be an issue in the Tcl core caused by the use of an internal event handler to deal with the async connect.

I believe this is a different issue than 815e246806.

I created a patch that reconfigures the internal event handler in the new thread when the channel is transferred. It fixes the problem in all scenarios I could come up with sofar.

User Comments: sebres added on 2020-04-02 14:42:55:
tests merged too, thus closing the ticket

sbron added on 2020-04-01 15:01:10:

The test script I provided was not sporadic on my machines. It failed consistently. I can confirm that it now passes on the most recent revision of Tcl's trunk.


sebres added on 2020-04-01 13:42:34:

[c082062e57] improves stability a bit (so it works on my windows boxes now).
Because another (windows-related) bug [b6d0d8cc2c] is still open (and could bother here with too many half-closed sockets in TIME_WAIT state), I added a small workaround that decreasing iteration count to 1/4 of supplied value.

I'd wait for travis (no idea that it has usable thread module, so would not be ignored by thread constraint) and merge it later if passed.


sebres added on 2020-04-01 12:20:46:

[9b7bd94769] provides tests covering this and few other RC similar situations around socket thread transfer (note it is under thread constraint).

Don has merged bug-f583715154-v2 to 8.6, so I'll close another leaf (bug-f583715154) too.


jan.nijtmans added on 2020-04-01 07:35:53:

Just tested the above demo script on Windows (latest core-8-6-branch without any patches). The result was:

Main thread: tid00000000000032E0
Helper thread: tid000000000000170C
Fileevent thread: tid00000000000032E0
This means that - indeed - Windows is not affected. I tested the patch on MacOS too: works as expected same as on other UNIX.

Conclusion: Still +1 for merge to core-8-6-branch.


jan.nijtmans added on 2020-04-01 06:48:15:
@sebres: I like your solution. I also tested it with the script supplied by Schelte Bron, and confirming this fixes the problem. So +1 for merging to core-8-6-branch and up.

If you would like to add a test-case, that would be super!

sebres added on 2020-03-31 20:12:39:

Commit [532506ef71] in branch bug-f583715154-v2 should ultimately fix that now (I'd suggest to add some threaded test covering the simplest SF-case)...

New introduced TcpThreadActionProc considers a transfer of asynchronously connecting channel between threads (handles detach and attach immediately by its execution), so it should prevent against unexpected state change in-between.

Test and review are welcome.


sebres added on 2020-03-31 16:55:22:

There are still few new findings about this:

1. 8.5 is not affected because since 8.6 a new (*nix) notifier got introduced, which handlers are not accurate now related to channel managed thread switch. Proposed patch does not really fix that (just retards), so one would simply get a segfault or a panic like "FlushChannel: damaged channel list" later.

2. 8.7 got already an attempt (by pooryorick) to fix similar issue, but also this has a same behavior here, if test continues (SF or panic on Nth iteration mostly larger than 500th).

3. it is hardly possible to fix that properly without some fixes addressed in [815e246806] (and others), just because if two threads are working on same channel or state (basically they do that, as managed thread switches in the interim state), it will be blamed to fail (so just matter of time when it gets broken) - as its ref-counts, internal descriptors or "handlers" registered outside of tcl get misused by one or the other thread.

4. I understand that this is a bug (and should be fixed or at least prohibit detach during async-connect), but without the fix a correct handling being to create event firstly in helper (so fulfill the asynchronous connect process) and only then transmit it to the other thread, but it would be the proper way to handle such async operations (not the common fix for the bug).


sebres added on 2020-03-31 14:21:31:

Although suggested fix seems to work, but I am unsure we really need to reassign the async-callback handler (Tcl_DeleteFileHandler/Tcl_CreateFileHandler) always, I'm trying to figure better solution out.

Let alone 8.5 seems to handle that correctly (and I don't see such handler reaction there).

As for suggested test, it is too sporadic on my side (I guess the helper thread mostly accomplished the context switch of the channel on my machine), so here is a bit better reproducible test script (that also repeats connection up to 100 times):

set ::helper [thread::create]
set ::master [thread::id]
puts "Main thread:\t\t $::master"
puts "Helper thread:\t\t $::helper"

set ::count 0 thread::send -async $::helper [list set ::master $::master] thread::send -async $::helper { set ::helper [thread::id] proc iteration {} { set fd [socket -async 127.0.0.1 0] thread::detach $fd thread::send -async $::master [list test $fd] } iteration } proc test {fd} { thread::attach $fd fileevent $fd writable [list apply {{fd} { puts "Fileevent thread:\t [thread::id]" if {[thread::id] ne $::master} { thread::send -async $::master {puts "ERROR: invalid thread,\t $::master is expecting"; set ::count 100} exit; # return } close $fd incr ::count thread::send -async $::helper iteration }} $fd] } puts **wait** while {$::count < 100} { vwait ::count } thread::release -wait $::helper


sbron added on 2020-03-27 16:13:27:

I thought I read somewhere that only the unix version uses an event handler for async connections, but I cannot find that quote now. Maybe I misunderstood.


jan.nijtmans added on 2020-03-25 17:03:11:

Branch bug-f583715154 created with your proposed fix.

At first sight: Looks good: Solid explanation. Just wondering whether tclWinSock.c should have the same modifications as tclUnixSock.c. I think it should.

Thanks for the bug report and fix. I will have a further look.


Attachments: