Tcl Source Code

Ticket UUID: 1503729
Title: TclpDlopen latent bug now crashes after SunOS linker patch
Type: Bug
Version: obsolete: 8.4.13
Submitter: kenstir
Created on: 2006-06-09 21:09:08
Subsystem: 40. Dynamic Loading
Assigned To: dkf
Priority: 5 Medium
Severity:
Status: Closed
Last Modified: 2006-06-14 05:54:42
Resolution: Fixed
Closed By: dkf
Closed on: 2006-06-13 22:54:42
Description:
There is a long-standing bug in tclLoadDl.c which is
exacerbated by recent Solaris linker patches. 
Basically, after a failed dlopen(), you must call
dlerror() right away, before any further dynamic
linking activity.  Otherwise, you risk the dlerror
string being corrupted.  It doesn't seem to be as
simple as dlerror() returning NULL, because that
wouldn't cause the crash.
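
A minimal sketch of the "call dlerror() right away" pattern described above, assuming a POSIX <dlfcn.h>. The helper try_load and its buffer parameters are hypothetical names for illustration only; this is not the code in tclLoadDl.c.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/*
 * try_load is a hypothetical helper, not part of Tcl. It calls dlopen()
 * and, on failure, copies the dlerror() text into errbuf before doing
 * anything else. Returns the library handle, or NULL on failure.
 */
static void *
try_load(const char *path, char *errbuf, size_t errlen)
{
    void *handle;
    const char *msg;

    assert(path != NULL && errbuf != NULL && errlen > 0);
    errbuf[0] = '\0';

    handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        /* Fetch the message immediately; later dl* calls may replace it. */
        msg = dlerror();

        /* dlerror() may return NULL, so substitute a fallback string. */
        snprintf(errbuf, errlen, "%s",
                 msg != NULL ? msg : "unknown dlopen error");
    }
    return handle;
}
```

Copying the text into a caller-owned buffer means the message stays valid even if some later code performs more dynamic linking.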

SYMPTOM OF CRASH

$ ./tclsh
% load xxx
Segmentation Fault (core dumped)
puccini:~/build/rel50/src/vendor/tcl/tcl8.4.2/unix $ pstack core
core 'core' of 26650:   ./tclsh
 ff0331b4 strlen   (ffbee770, 14, 4, ffbee860, 1, 10) + 1c
 ff30ec98 Tcl_AppendResult (22c08, ff357dd8, 1, ff357df0, ff3df8f8, 0) + 1c
 ff3268fc TclpDlopen (22c08, 33458, ffbeea34, ffbeed9c, ff326830, 42048) + cc
 ff2f50c0 Tcl_FSLoadFile (22c08, 33458, ff34de08, ffbeeb2c, ffbeea3c, ffbeea34) + 54
 ff2fb234 Tcl_LoadObjCmd (22c08, 0, 2, 260d4, 0, ff345a9c) + 530
 ff2a9868 TclEvalObjvInternal (24748, 2, 0, 0, 0, 1) + 188
 ff2d5eb0 TclExecuteByteCode (ff34df24, ff34df2c, 2d014, 0, 260d4, 1) + 688
 ff2d54e8 TclCompEvalObj (0, 163, ff345a9c, 2cfa8, 2c990, 22c08) + 184
 ff2aa848 Tcl_EvalObjEx (0, 0, 20000, ff345a9c, 22c08, 2c990) + 60
 ff2e5010 Tcl_RecordAndEvalObj (20000, 2ca98, 20000, 22c08, 2c990, ff345a9c) + b8
 ff2fbdb4 Tcl_Main (1, 22c08, 1082c, ffbef334, 222b0, 2) + 4b0
 0001080c main     (1, ffbef334, ffbef33c, 20800, 0, 0) + 24
 000107c0 _start   (0, 0, 0, 0, 0, 0) + f8

EXPECTED BEHAVIOR

$ ./tclsh
% load xxx
couldn't load file "xxx": ld.so.1: tclsh: fatal: xxx: open failed: No such file or directory

SYSTEM PATCH INFORMATION

Linker patch 109147-40 (latest as of this writing)
exhibits the problem.  Linker patch 109147-34 does not.
I am unsure of other versions.  In order to see the
problem you have to compile tclLoadDl.c optimized with
the Sun compiler.
User Comments:

dkf added on 2006-06-14 05:54:42:

Fixed on both HEAD and 8.4 branch

dkf added on 2006-06-14 05:06:12:

That's OK; as a maintainer I know those functions well
enough for both of us. :-)

That using a local variable stops the compiler from doing
the wrong thing is good enough for me; will apply that
trick. Thanks for your help in testing in an environment not
familiar to me any more.

dkf added on 2006-06-14 05:06:11:

data_type - 110894

kenstir added on 2006-06-13 23:17:42:

Sorry; I meant "long-ish time" not "long time".  I didn't
walk through all the code between the call to dlopen() and
dlerror() but it calls Tcl_DStringFree, Tcl_GetString, and
Tcl_AppendResult, which themselves may call other functions
and I am not familiar with their internals.  I know they
don't call dl* routines in my crashing scenario because I
traced it.

This is a compiler bug.  Pulling dlerror() out into a local
variable fixes the crash.  The crash happens with Forte 6
and not with Forte 6 Update 2 or with Studio 11.
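
A rough sketch of the local-variable workaround, assuming a POSIX <dlfcn.h>. Here append_strings is a hypothetical stand-in for a NULL-terminated variadic appender in the style of Tcl_AppendResult(), and report_load_failure is likewise an invented name; neither is the actual Tcl source, and the sketch does not reproduce the Forte 6 miscompilation itself.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a NULL-terminated variadic appender.
 * Appends each string argument to dst (which must already hold a
 * NUL-terminated string), truncating once dst is full.
 */
static void
append_strings(char *dst, size_t dstlen, ...)
{
    va_list ap;
    const char *s;
    size_t used;

    assert(dst != NULL && dstlen > 0);
    used = strlen(dst);
    va_start(ap, dstlen);
    while ((s = va_arg(ap, const char *)) != NULL) {
        size_t room = dstlen - used - 1;
        size_t n = strlen(s);

        if (n > room) {
            n = room;
        }
        memcpy(dst + used, s, n);
        used += n;
        dst[used] = '\0';
    }
    va_end(ap);
}

/*
 * Hypothetical error reporter, meant to be called after a failed dlopen().
 */
static void
report_load_failure(char *result, size_t resultlen, const char *path)
{
    /*
     * Evaluate dlerror() into a local variable first, instead of calling
     * it inline in the variadic argument list.
     */
    const char *errorStr = dlerror();

    if (errorStr == NULL) {
        /* Without this, NULL would end the argument list early. */
        errorStr = "unknown error";
    }
    append_strings(result, resultlen, "couldn't load file \"", path,
                   "\": ", errorStr, (const char *) NULL);
}
```

The only difference from the inline form is where dlerror() is evaluated, which is why it works as a low-risk kludge for an optimizer problem.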

dkf added on 2006-06-13 16:51:36:

Tcl most certainly isn't waiting a long time between calling
dlopen() and dlerror(); it only makes a few calls in between,
which perform minor memory management and are unlikely to
cause any OS traps at all (malloc implementations being the
way they are).

As you note, the problem is the compiler. According to the
Sun documentation, the pragma should mean that TclpDlopen()
doesn't get optimized - not a big deal from our perspective
and surely not that hard for a compiler to do! - and
therefore the bug is definitely compiler-caused. So not our
fault! :-)

A workaround might be to try to compile that file with gcc
by hand...? Messy though. Another possibility might be to
put the result of dlerror() into a local variable before
passing it to Tcl_AppendResult(); if that stops the compiler
from going wrong, please reopen this issue and let me know
so that we can add a suitable kludge...

kenstir added on 2006-06-13 01:20:12:

I take it back. The Tcl code was not the root cause of the
crash.  Though it is probably bad style to wait a long-ish
time after calling dlopen() and before calling dlerror(), it
is not an error unless you call some other dl* function in
between.  I verified with truss that Tcl was not.

The real problem appears to be an optimizer bug in the Sun
Forte 6 (cc: Sun WorkShop 6 2000/06/19 C 5.1 Patch
109491-02) compiler exacerbated by the linker patch.  The
linker patch included a patch to a system header file which
did this:

#pragma unknown_control_flow(dlopen, dlsym, dlclose, dlerror)

With this change, the compiler generated different (and
apparently bad) assembler code.

kenstir added on 2006-06-10 04:09:08:

File Added - 181157: tclLoadDl.c.patch
