Ticket UUID: 1503729
Title: TclpDlopen latent bug now crashes after SunOS linker patch
Type: Bug | Version: obsolete: 8.4.13
Submitter: kenstir | Created on: 2006-06-09 21:09:08
Subsystem: 40. Dynamic Loading | Assigned To: dkf
Priority: 5 Medium | Severity:
Status: Closed | Last Modified: 2006-06-14 05:54:42
Resolution: Fixed | Closed By: dkf
Closed on: 2006-06-13 22:54:42
Description:
There is a long-standing bug in tclLoadDl.c which is exacerbated by recent Solaris linker patches. Basically, after a failed dlopen(), you must call dlerror() right away, before any further dynamic linking activity. Otherwise, you risk the dlerror string being corrupted. It doesn't seem to be as simple as dlerror() returning NULL, because that wouldn't cause the crash.

SYMPTOM OF CRASH

    $ ./tclsh
    % load xxx
    Segmentation Fault (core dumped)
    puccini:~/build/rel50/src/vendor/tcl/tcl8.4.2/unix $ pstack core
    core 'core' of 26650: ./tclsh
     ff0331b4 strlen (ffbee770, 14, 4, ffbee860, 1, 10) + 1c
     ff30ec98 Tcl_AppendResult (22c08, ff357dd8, 1, ff357df0, ff3df8f8, 0) + 1c
     ff3268fc TclpDlopen (22c08, 33458, ffbeea34, ffbeed9c, ff326830, 42048) + cc
     ff2f50c0 Tcl_FSLoadFile (22c08, 33458, ff34de08, ffbeeb2c, ffbeea3c, ffbeea34) + 54
     ff2fb234 Tcl_LoadObjCmd (22c08, 0, 2, 260d4, 0, ff345a9c) + 530
     ff2a9868 TclEvalObjvInternal (24748, 2, 0, 0, 0, 1) + 188
     ff2d5eb0 TclExecuteByteCode (ff34df24, ff34df2c, 2d014, 0, 260d4, 1) + 688
     ff2d54e8 TclCompEvalObj (0, 163, ff345a9c, 2cfa8, 2c990, 22c08) + 184
     ff2aa848 Tcl_EvalObjEx (0, 0, 20000, ff345a9c, 22c08, 2c990) + 60
     ff2e5010 Tcl_RecordAndEvalObj (20000, 2ca98, 20000, 22c08, 2c990, ff345a9c) + b8
     ff2fbdb4 Tcl_Main (1, 22c08, 1082c, ffbef334, 222b0, 2) + 4b0
     0001080c main (1, ffbef334, ffbef33c, 20800, 0, 0) + 24
     000107c0 _start (0, 0, 0, 0, 0, 0) + f8

EXPECTED BEHAVIOR

    $ ./tclsh
    % load xxx
    couldn't load file "xxx": ld.so.1: tclsh: fatal: xxx: open failed: No such file or directory

SYSTEM PATCH INFORMATION

Linker patch 109147-40 (latest as of this writing) exhibits the problem. Linker patch 109147-34 does not. I am unsure of other versions. In order to see the problem you have to compile tclLoadDl.c optimized with the Sun compiler.
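For illustration only, the rule stated in the description (fetch the dlerror() text immediately after a failed dlopen(), before any other dynamic linking activity) might be sketched as below. The helper name load_or_explain and its buffer-based interface are hypothetical and are not taken from tclLoadDl.c; only dlopen(), dlerror(), and snprintf() are real library calls.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Tcl): try to dlopen() `path`.
 * On failure, copy the dlerror() text into `buf` right away, before any
 * other dl* call has a chance to replace it.
 * Returns the library handle, or NULL if loading failed.
 */
static void *
load_or_explain(const char *path, char *buf, size_t buflen)
{
    void *handle;
    const char *msg;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        /* Capture the message first; it is only valid until the next dl* call. */
        msg = dlerror();
        if (msg == NULL) {
            /* Defensive: never hand a NULL pointer to string formatting. */
            msg = "unknown dynamic loading error";
        }
        snprintf(buf, buflen, "couldn't load file \"%s\": %s", path, msg);
    }
    return handle;
}
```

A caller would pass a buffer such as `char buf[512];` and report `buf` when the returned handle is NULL. On some systems the program has to be linked with -ldl.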
User Comments:

dkf added on 2006-06-14 05:54:42:
Logged In: YES user_id=79902

Fixed on both HEAD and 8.4 branch

dkf added on 2006-06-14 05:06:12:
Logged In: YES user_id=79902

That's OK; as a maintainer I know those functions well enough for both of us. :-) That using a local variable stops the compiler from doing the wrong thing is good enough for me; will apply that trick. Thanks for your help in testing in an environment not familiar to me any more.

dkf added on 2006-06-14 05:06:11:
data_type - 110894

kenstir added on 2006-06-13 23:17:42:
Logged In: YES user_id=246646

Sorry; I meant "long-ish time" not "long time". I didn't walk through all the code between the call to dlopen() and dlerror() but it calls Tcl_DStringFree, Tcl_GetString, and Tcl_AppendResult, which themselves may call other functions and I am not familiar with their internals. I know they don't call dl* routines in my crashing scenario because I traced it.

This is a compiler bug. Pulling dlerror() out into a local variable fixes the crash. The crash happens with Forte 6 and not with Forte 6 Update 2 or with Studio 11.

dkf added on 2006-06-13 16:51:36:
Logged In: YES user_id=79902

Tcl most certainly isn't waiting a long time between calling dlopen() and dlerror(); it only does a few calls between to perform minor memory management and which are unlikely to cause any OS traps at all (malloc implementations being the way they are).

As you note, the problem is the compiler. According to the Sun documentation, the pragma should mean that TclpDlOpen() doesn't get optimized - not a big deal from our perspective and surely not that hard for a compiler to do! - and therefore the bug is definitely compiler-caused. So not our fault! :-)

A workaround might be to try to compile that file with gcc by hand...? Messy though.

Another possibility might be to put the result of dlerror() into a local variable before passing it to Tcl_AppendResult(); if that stops the compiler from going wrong, please reopen this issue and let me know so that we can add a suitable kludge...

kenstir added on 2006-06-13 01:20:12:
Logged In: YES user_id=246646

I take it back. The Tcl code was not the root cause of the crash. Though it is probably bad style to wait a long-ish time after calling dlopen() and before calling dlerror(), it is not an error unless you call some other dl* function in between. I verified with truss that Tcl was not.

The real problem appears to be an optimizer bug in the Sun Forte 6 (cc: Sun WorkShop 6 2000/06/19 C 5.1 Patch 109491-02) compiler exacerbated by the linker patch. The linker patch included a patch to a system header file which did this:

    #pragma unknown_control_flow(dlopen, dlsym, dlclose, dlerror)

With this change, the compiler generated different (and apparently bad) assembler code.

kenstir added on 2006-06-10 04:09:08:
File Added - 181157: tclLoadDl.c.patch
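As a rough sketch of the workaround discussed in the comments, the result of dlerror() can be stored in a local variable before it is handed to a NULL-terminated variadic appender. The code below is an assumption-laden illustration, not the patch that was applied: append_strings is a hypothetical stand-in for a function shaped like Tcl_AppendResult(), and report_load_failure and errorStr are invented names.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a NULL-terminated variadic appender such as
 * Tcl_AppendResult(); this is NOT the Tcl function. Appends each string
 * argument to `dst`, truncating if the buffer would overflow.
 */
static void
append_strings(char *dst, size_t dstlen, ...)
{
    va_list ap;
    const char *s;
    size_t used;

    assert(dst != NULL && dstlen > 0);
    used = strlen(dst);
    va_start(ap, dstlen);
    while ((s = va_arg(ap, const char *)) != NULL) {
        size_t n = strlen(s);
        if (n > dstlen - 1 - used) {
            n = dstlen - 1 - used;      /* keep room for the terminator */
        }
        memcpy(dst + used, s, n);
        used += n;
    }
    va_end(ap);
    dst[used] = '\0';
}

/*
 * Hypothetical illustration of the workaround: returns 0 if `path` loads,
 * or -1 after writing an error message into `dst`.
 */
static int
report_load_failure(const char *path, char *dst, size_t dstlen)
{
    void *handle = dlopen(path, RTLD_NOW | RTLD_GLOBAL);
    const char *errorStr;

    if (handle != NULL) {
        dlclose(handle);                /* demo only: nothing to report */
        return 0;
    }
    /* Evaluate dlerror() into a local first, not inside the argument list. */
    errorStr = dlerror();
    if (errorStr == NULL) {
        errorStr = "unknown dynamic loading error";
    }
    dst[0] = '\0';
    append_strings(dst, dstlen, "couldn't load file \"", path, "\": ",
            errorStr, (char *) NULL);
    return -1;
}
```

Keeping the call out of the variadic argument list also makes it easy to guard against a NULL return before any code calls strlen() on the pointer.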
Attachments:
- tclLoadDl.c.patch added by kenstir on 2006-06-10 04:09:08.