Tcl Source Code

Ticket UUID: 3613497
Title: tclThreadAlloc crashes under heavy load
Type: Bug
Version: current: 8.5.14
Submitter: gneumann
Created on: 2013-05-17 08:32:53
Subsystem: 41. Memory Allocation
Assigned To: msofer
Priority: 5 Medium
Severity: Minor
Status: Open
Last Modified: 2015-05-01 14:21:30
Resolution: None
Closed By: nobody
Closed on:
Description:
Wolfgang Winkler <[email protected]> experienced crashes in NaviServer under heavy load. In their setup they start NaviServer with 60 connection threads (plus a few background threads), where every thread contains a Tcl interp. The backtrace below shows that at the time of the crash, two threads are within GetBlocks() and one thread is within MoveObjs() in tclThreadAlloc.c, while the crashing thread is accessing previously malloc'ed memory (he is using Tcl 8.5.12).

The problem could be prevented by replacing Tcl's "zippy" malloc with the system malloc (see attached patch, compile with -DSYSTEM_MALLOC). We have also experienced several crashes in threadAlloc in our own environment, but at the time I was not able to reproduce the crash reliably, nor to attribute the problem clearly to threadAlloc. We switched our production system to SYSTEM_MALLOC + TCMalloc in September and have not experienced crashes since (but we also switched the architecture at the same time, from POWER to x86_64, so I cannot exclude a subtle concurrency problem in gcc on POWER). Wolfgang has a setup in which he can reproduce the crash; with SYSTEM_MALLOC, his system runs stably.

Although the patch (against 8.5.14) is not a general fix, I would recommend adding the SYSTEM_MALLOC part to the Tcl sources, since it makes it easier to try out different allocators and to use tools like valgrind for memory debugging.
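To illustrate the idea of a compile-time allocator switch, here is a minimal sketch. It is not the attached patch and does not reflect how tclThreadAlloc.c is structured; every demo_* name is invented for illustration, and the "custom" branch is only a stand-in for a pooled allocator such as zippy.

```c
#include <assert.h>   /* used by the usage check, not by the allocator */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch, NOT the attached patch. All demo_* names are
 * invented. Defining SYSTEM_MALLOC routes every request straight to the
 * platform allocator, so tools that intercept malloc/free see each block.
 */
#ifdef SYSTEM_MALLOC

static void *demo_alloc(size_t n) { return malloc(n); }
static void demo_free(void *p) { free(p); }
static const char *demo_allocator_name(void) { return "system"; }

#else

/* Stand-in for a custom allocator: a header precedes each user block.
 * The union keeps the user pointer suitably aligned. */
typedef union { size_t size; long double pad; void *ptr; } DemoHeader;

static void *demo_alloc(size_t n) {
    DemoHeader *h = malloc(sizeof(DemoHeader) + n);
    if (h == NULL) return NULL;
    h->size = n;          /* remember the requested size */
    return h + 1;         /* user memory starts after the header */
}

static void demo_free(void *p) {
    if (p != NULL) free((DemoHeader *)p - 1);  /* step back to the header */
}

static const char *demo_allocator_name(void) { return "custom"; }

#endif

/* Callers only ever use demo_alloc/demo_free, so the switch is invisible
 * to them. Returns 1 if a block of n bytes can be written and read back. */
static int demo_roundtrip(size_t n) {
    unsigned char *p = demo_alloc(n);
    if (p == NULL) return 0;
    memset(p, 0x5A, n);
    int ok = (n == 0) || (p[0] == 0x5A && p[n - 1] == 0x5A);
    demo_free(p);
    return ok;
}
```

Compiling the same file once with and once without -DSYSTEM_MALLOC gives two binaries that differ only in the allocator, which is what makes side-by-side comparison under load or under valgrind practical.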


(gdb) bt
#0  0x00007f3c6fd371b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3c6fd39fc0 in *__GI_abort () at abort.c:92
#2  0x00007f3c70f0592a in Panic () from /usr/local/naviserver/lib/libnsd.so
#3  0x00007f3c705ea143 in Tcl_PanicVA (format=0x7f3c70f3ce6d "received fatal signal %d", argList=0x7f3c68daf660) at /root/sourcen/tcl8.5.12/unix/../generic/tclPanic.c:91
#4  0x00007f3c705ea2f7 in Tcl_Panic (format=0x7f3c70f3ce6d "received fatal signal %d") at /root/sourcen/tcl8.5.12/unix/../generic/tclPanic.c:130
#5  <signal handler called>
#6  0x00007f3c6fd84788 in *__GI___strcasecmp (s1=0x7f3c70f385e7 "connsperthread", s2=0x2020090a7b207d5d <Address 0x2020090a7b207d5d out of bounds>) at strcasecmp.c:65
#7  0x00007f3c70f11156 in Ns_SetFindCmp () from /usr/local/naviserver/lib/libnsd.so
#8  0x00007f3c70ef39df in ConfigGet () from /usr/local/naviserver/lib/libnsd.so
#9  0x00007f3c70ef410d in Ns_ConfigIntRange () from /usr/local/naviserver/lib/libnsd.so
#10 0x00007f3c70f0a18f in NsConnThread () from /usr/local/naviserver/lib/libnsd.so
#11 0x00007f3c7087067e in NsThreadMain () from /usr/local/naviserver/lib/libnsthread.so
#12 0x00007f3c70871029 in ThreadMain () from /usr/local/naviserver/lib/libnsthread.so
#13 0x00007f3c6f8eb8ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#14 0x00007f3c6fdd492d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()
(gdb) info threads
  12 Thread 32642  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  11 Thread 32641  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  10 Thread 32647  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  9 Thread 32639  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:84
  8 Thread 32654  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  7 Thread 32657  GetBlocks (cachePtr=0x4a76610, bucket=2) at /root/sourcen/tcl8.5.12/unix/../generic/tclThreadAlloc.c:919
  6 Thread 32658  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  5 Thread 32640  0x00007f3c6fdce1a3 in select () at ../sysdeps/unix/syscall-template.S:82
  4 Thread 32656  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  3 Thread 32655  GetBlocks (cachePtr=0x4602510, bucket=1) at /root/sourcen/tcl8.5.12/unix/../generic/tclThreadAlloc.c:919
  2 Thread 32653  0x00007f3c7060ccc4 in MoveObjs (fromPtr=0x7f3c7086b360, toPtr=0x4601f30, numMove=776) at /root/sourcen/tcl8.5.12/unix/../generic/tclThreadAlloc.c:706
* 1 Thread 32652  0x00007f3c6fd371b5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
User Comments:
dgp added on 2015-05-01 14:21:30:
It appears that the submitted patch causes a

-DSYSTEM_MALLOC

build of Tcl to have the same effect that a build leaving out

-DUSE_THREAD_ALLOC=1

already achieves.

We don't need two knobs to do the same thing.

It would be useful, of course, to have a repeatable
demo of the crash for analysis and fix.
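As a starting point for such a demo, the general shape of a multi-threaded allocation stress harness might look like the sketch below. This is an assumption-laden illustration, not a reproduction of the reported crash: it calls the system malloc/free, and all names (WorkerState, stress_worker, run_stress) are invented here. To exercise Tcl's allocator instead, the malloc/free calls would presumably be swapped for Tcl_Alloc/Tcl_Free in a program linked against a threaded Tcl build.

```c
#include <assert.h>   /* used by the usage check */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 64
#define ROUNDS      20000
#define SLOTS       64

/* Hypothetical per-thread state: an id and a count of corrupted blocks. */
typedef struct { int id; long errors; } WorkerState;

/* Each worker randomly allocates and frees blocks of varying size. Every
 * block is filled with a thread-specific byte and verified before it is
 * freed, so a block handed out twice shows up as a mismatch. */
static void *stress_worker(void *arg) {
    WorkerState *st = arg;
    unsigned char tag = (unsigned char)(st->id + 1);
    unsigned char *slot[SLOTS] = {0};
    size_t len[SLOTS] = {0};
    unsigned int seed = (unsigned int)st->id * 2654435761u + 1u;

    for (int r = 0; r < ROUNDS; r++) {
        seed = seed * 1103515245u + 12345u;      /* simple LCG step */
        unsigned int i = (seed >> 16) % SLOTS;
        if (slot[i] != NULL) {
            for (size_t k = 0; k < len[i]; k++) {
                if (slot[i][k] != tag) { st->errors++; break; }
            }
            free(slot[i]);
            slot[i] = NULL;
        } else {
            seed = seed * 1103515245u + 12345u;
            len[i] = 1 + (seed >> 16) % 4096;
            slot[i] = malloc(len[i]);
            if (slot[i] != NULL) memset(slot[i], tag, len[i]);
        }
    }
    for (int i = 0; i < SLOTS; i++) free(slot[i]);  /* free(NULL) is a no-op */
    return NULL;
}

/* Runs nthreads workers concurrently. Returns the total number of
 * corrupted blocks, or -1 if the threads could not all be started. */
static long run_stress(int nthreads) {
    pthread_t tid[MAX_THREADS];
    WorkerState st[MAX_THREADS];
    long total = 0;
    int started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS) return -1;
    for (int i = 0; i < nthreads; i++) {
        st[i].id = i;
        st[i].errors = 0;
        if (pthread_create(&tid[i], NULL, stress_worker, &st[i]) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += st[i].errors;
    }
    return started == nthreads ? total : -1;
}
```

A call such as run_stress(60) would mirror the 60 connection threads mentioned in the report. A correct allocator should yield a result of 0, while a crash or a non-zero count would point to heap corruption.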

Since this bug was submitted, some code changes have
gone into tclThreadAlloc.c. It is hard to know whether
they had an impact on this matter.

gneumann added on 2013-05-17 15:32:54:

File Added - 463391: tclThreadAlloc.patch

Attachments: