2010-01-12 09:23:17

by Andrew Athan

Subject: Futex hang/lockup problem in 2.6.30+ on AMD64


After some investigation I believe I am experiencing a problem similar
to the one described in this posting:
http://sourceware.org/ml/libc-help/2009-10/msg00026.html, in that the
poster suspects a problem in the futex implementation in 2.6.30 and
above kernels. In my case, the problem is not a soft lockup in the
kernel, but it does result in an application lockup, with all threads
waiting on futexes.

For me this problem began to appear once I upgraded my Debian
squeeze/testing x86_64 installation (AMD) to a new kernel. I'm not
sure what the prior kernel version was. The same software running on
different machines with earlier kernels (lenny) does not seem to
experience the problem.

I'm really not sure if this is a libc or kernel problem, but due to
the stack trace, which shows what appears to be a hang on the internal
__lock of the condition variable, it appears likely this is not an
application bug. Memory does not appear to be corrupt (I store
sentinels around the mutexes, and they have retained their values).
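
(The sentinel arrangement is essentially the following -- an
illustrative sketch with invented names, not the application's actual
code:)

#include <assert.h>
#include <pthread.h>

#define CANARY 0xDEADBEEFu            /* arbitrary illustrative value */

struct guarded_mutex {
    unsigned int before;              /* sentinel preceding the mutex */
    pthread_mutex_t mutex;
    unsigned int after;               /* sentinel following the mutex */
};

static void guarded_init(struct guarded_mutex *g)
{
    g->before = CANARY;
    pthread_mutex_init(&g->mutex, NULL);
    g->after = CANARY;
}

static void guarded_check(const struct guarded_mutex *g)
{
    /* If either sentinel changed, something scribbled over this
       region of memory. */
    assert(g->before == CANARY && g->after == CANARY);
}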

It appears that the cond var's __lock indicates there are waiters
even though there are (or should be) none (assuming I'm interpreting
the __lock value of 2 correctly). Since the __lock in question is a
futex primitive, and it must be held regardless of other libc/nptl
state variables, I don't believe this is a libc problem.
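
(For reference, my reading of the lock word: glibc's low-level lock on
x86 uses 0 = unlocked, 1 = locked with no waiters, and 2 = locked with
waiters recorded. A C approximation of the acquire path -- a sketch
following the assembly quoted later in this message, not the actual
glibc source:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Lock word states (x86 nptl low-level lock):
 *   0 = unlocked
 *   1 = locked, no waiters
 *   2 = locked, waiters present (contention seen)
 * So a persistent value of 2 means "held, with at least one thread
 * asleep (or going to sleep) in FUTEX_WAIT". */
static void lll_lock_sketch(int *lock)
{
    /* Fast path: 0 -> 1 when uncontended. */
    if (__sync_val_compare_and_swap(lock, 0, 1) != 0) {
        /* Contended: publish the waiter state (2), then sleep for as
         * long as the word still reads 2.  This mirrors the xchgl
         * loop in __lll_lock_wait quoted below. */
        while (__sync_lock_test_and_set(lock, 2) != 0)
            syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
    }
}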

The problem occurs rarely, but inevitably, and sometimes only after
several hours of normal program operation. I have not yet
successfully created a reduced test program that can faithfully
reproduce the hang in a short timeframe.

The application contains a thread pool where threads perform many
operations between pthread calls but can be summarized as one of three
cases below. Due to the design of the thread pool, threads
round-robin or at least are randomly assigned a workload (in contrast
to having one constant broadcast thread).

case 1: while (1) { *A* pthread_mutex_lock(&m); pthread_mutex_unlock(&m); }
case 2: pthread_mutex_lock(&m); pthread_cond_wait(&cv, &m); pthread_mutex_unlock(&m);
case 3: pthread_mutex_lock(&m); *B* pthread_cond_broadcast(&cv); pthread_mutex_unlock(&m);

The application becomes hung with all threads but one stuck at *A*,
and one thread at *B*.
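
For concreteness, here is a minimal sketch of that pattern in plain
pthreads C (illustrative only: the random role assignment stands in
for the real thread pool's scheduling, and this is not a confirmed
reproducer):

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready;

static void *worker(void *arg)
{
    unsigned seed = (unsigned) (uintptr_t) arg;
    for (;;) {
        switch (rand_r(&seed) % 3) {
        case 0:                          /* case 1: lock/unlock only */
            /* *A* */
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
            break;
        case 1:                          /* case 2: wait for a broadcast */
            pthread_mutex_lock(&m);
            while (!ready)
                pthread_cond_wait(&cv, &m);
            ready = 0;
            pthread_mutex_unlock(&m);
            break;
        case 2:                          /* case 3: broadcast */
            pthread_mutex_lock(&m);
            ready = 1;
            /* *B* */
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&m);
            break;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t t[8];
    for (long i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, worker, (void *) (i + 1));
    pthread_join(t[0], NULL);            /* run until it (maybe) hangs */
    return 0;
}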

The stack trace and other details appear below. I've saved the core
file in case I can provide additional information.


$ uname -a
Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
GNU/Linux

I rebuilt Debian's eglibc-2.10.2 from source with the -g flag to get a
better trace. Here is the ldd output for the application:

linux-vdso.so.1 => (0x00007fff149ff000)
libboost_python.so.1.40.0 => ./libboost_python.so.1.40.0
(0x00007f1f2c55a000)
libpython2.5.so.1.0 => /usr/lib/libpython2.5.so.1.0 (0x00007f1f2c1e1000)
libACEXML_Parser.so.5.4.0 => /var/ACE/libACEXML_Parser.so.5.4.0
(0x00007f1f2bfbf000)
libACEXML.so.5.4.0 => /var/ACE/libACEXML.so.5.4.0 (0x00007f1f2bd77000)
libACE.so.5.4.0 => /var/ACE/libACE.so.5.4.0 (0x00007f1f2acc3000)
libdl.so.2 => /lib/libdl.so.2 (0x00007f1f2aabf000)
libpthread.so.0 =>
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
(0x00007f1f2a8a2000)
librt.so.1 => /lib/librt.so.1 (0x00007f1f2a69a000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f1f2a38a000)
libm.so.6 => /lib/libm.so.6 (0x00007f1f2a107000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f1f29ef1000)
libc.so.6 => /lib/libc.so.6 (0x00007f1f29b9d000)
libutil.so.1 => /lib/libutil.so.1 (0x00007f1f29999000)
/lib64/ld-linux-x86-64.so.2 (0x00007f1f2c7b1000)


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
GDB BACKTRACE
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

See below for the source of the last couple of stack frames.

All threads except thread 4 are waiting for a lock on the "external"
mutex being used in conjunction with the condition variable. The
owner of that lock is 25521, which, sure enough, is thread 4. However,
thread 4 appears to be waiting on the internal __lock of the condition
variable. Since that variable appears to have no waiters and the
other threads' traces are not inside any pthread calls associated with
that __lock, it seems reasonable to conclude that there is either a
pthread or futex problem.



Thread 7 (Thread 25524):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c7f7f5e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c7f7f5e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x7f9c80af9660) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x7f9c80af9660, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x7f9c80af9660, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork
(this=0x7f9c80af9660, workEV=...)
at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c7f7f6260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i
(this=0x7f9c80000bc0) at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke
(this=0x7f9c80000bc0) at Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x7f9c80000bc0) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()

Thread 6 (Thread 25523):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c7fff6e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c7fff6e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x7f9c80ab8e40) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x7f9c80ab8e40, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x7f9c80ab8e40, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork
(this=0x7f9c80ab8e40, workEV=...)
at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c7fff7260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i
(this=0x7f9c80000970) at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke
(this=0x7f9c80000970) at Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x7f9c80000970) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()

Thread 5 (Thread 25522):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c84e14e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c84e14e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x7f9c80407020) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x7f9c80407020, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x7f9c80407020, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork
(this=0x7f9c80407020, workEV=...)
at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c84e15260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i
(this=0x7f9c80000bc0) at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke
(this=0x7f9c80000bc0) at Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x7f9c80000bc0) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()

Thread 4 (Thread 25521):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b2854d0 in pthread_cond_broadcast@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_broadcast.S:118
#2 0x00007f9c9c2b87c7 in ACE_OS::cond_broadcast (cv=0x1dc4500)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6/ace/OS_NS_Thread.inl:294
#3 0x00007f9c9c2b5325 in ACE_Condition<ACE_Thread_Mutex>::broadcast
(this=0x1dc4500)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6/ace/Condition_T.inl:81
#4 0x00007f9c9c2e229e in TTWork::GeneratorSelect::generate
(this=0x1dc38f0, nextGenTime=...,
maxWait=0x7f9c856161c0) at TTWork.cpp:814
#5 0x00007f9c9c2e38f2 in TTWork::Dispatcher::generate (this=0x13b5c60,
maxWait=0x7f9c85616220, min=0x7f9c85616260)
at TTWork.cpp:300
#6 0x00007f9c9c2e3a9b in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c85616260)
at TTWork.cpp:331
#7 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#8 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#9 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#10 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i
(this=0x7f9c80000970) at Thread_Adapter.cpp:150
#11 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke
(this=0x7f9c80000970) at Thread_Adapter.cpp:93
#12 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x7f9c80000970) at
Base_Thread_Adapter.cpp:131
#13 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#14 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#15 0x0000000000000000 in ?? ()

Thread 3 (Thread 25520):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c85e16e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c85e16e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x7f9c78177200) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x7f9c78177200, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x7f9c78177200, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork
(this=0x7f9c78177200, workEV=...)
at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c85e17260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i (this=0x13b5b20)
at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke (this=0x13b5b20) at
Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x13b5b20) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()

Thread 2 (Thread 25519):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c86617e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c86617e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x2ee6240) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x2ee6240, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x2ee6240, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork (this=0x2ee6240,
workEV=...) at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c86618260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i (this=0x1dc2cb0)
at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke (this=0x1dc2cb0) at
Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x1dc2cb0) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()

Thread 1 (Thread 25518):
#0 __lll_lock_wait () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f9c9b282e79 in _L_lock_949 () from
/home/root/eglibc-2.10.2/build-tree/amd64-libc/nptl/libpthread.so.0
#2 0x00007f9c9b282c9b in __pthread_mutex_lock (mutex=0x1dc3960) at
pthread_mutex_lock.c:61
#3 0x00007f9c9c545021 in ACE_OS::mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:1296
#4 0x00007f9c9c545061 in ACE_OS::thread_mutex_lock (m=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/OS_NS_Thread.inl:4443
#5 0x00007f9c9c54508f in ACE_Thread_Mutex::acquire (this=0x1dc3960)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Thread_Mutex.inl:57
#6 0x00007f9c9c5410e2 in ACE_Guard<ACE_Thread_Mutex>::acquire
(this=0x7f9c86e18e90)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:9
#7 0x00007f9c9c541123 in ACE_Guard (this=0x7f9c86e18e90, l=...)
at /opt/ttdev/ACE/v5.4/x86_64.linux2.6-testing/ace/Guard_T.inl:35
#8 0x00007f9c9c2e1da6 in TTWork::GeneratorSelect::reselect
(this=0x1dc38f0, wi=0x7f9c78463100) at TTWork.cpp:873
#9 0x00007f9c9c2e1e92 in TTWork::WorkItemHandle::clearReadyMask
(this=0x7f9c78463100, mask=1, resel=true)
at TTWork.cpp:1061
#10 0x00007f9c9c2eaea2 in TTWork::NetServiceTCP::doTheWork
(this=0x7f9c78463100, workEV=...)
at TTWorkNetServiceTCP.cpp:278
#11 0x00007f9c9c2eb354 in TTWork::NetServiceTCP::doWork
(this=0x7f9c78463100, workEV=...)
at TTWorkNetServiceTCP.cpp:351
#12 0x00007f9c9c2dfccb in TTWork::Dispatcher::dispatch (this=0x13b5c60)
at TTWork.cpp:234
#13 0x00007f9c9c2e3a4f in TTWork::Dispatcher::dispatchGenerate
(this=0x13b5c60, maxWait=0x0, min=0x7f9c86e19260)
at TTWork.cpp:324
#14 0x00007f9c9c2e44fd in TTWork::DispatcherTask::runTask
(this=0x13b6ec0) at TTWork.cpp:1580
#15 0x00007f9c9c2e4fee in TTWork::Task::svc (this=0x13b6ec0) at
TTWork.cpp:50
#16 0x00007f9c9b865344 in ACE_Task_Base::svc_run (args=0x13b6ee8) at
Task.cpp:210
#17 0x00007f9c9b7dcb0f in ACE_Thread_Adapter::invoke_i (this=0x1dc2a60)
at Thread_Adapter.cpp:150
#18 0x00007f9c9b7dcbb9 in ACE_Thread_Adapter::invoke (this=0x1dc2a60) at
Thread_Adapter.cpp:93
#19 0x00007f9c9b78c0e3 in ace_thread_adapter (args=0x1dc2a60) at
Base_Thread_Adapter.cpp:131
#20 0x00007f9c9b28073a in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#21 0x00007f9c9a64169d in clone () from /lib/libc.so.6
#22 0x0000000000000000 in ?? ()



+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DETAILS
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Note the => markers in the stack traces below; they mark the PC location.




THREAD 4 -- hung in futex call getting internal __lock while holding
external mutex
--------------------------------------------------

Caller's view of the condition variable...
(gdb) p cv
$4 = (ACE_cond_t *) 0x1dc4500
(gdb) p *cv
$5 = {__data = {__lock = 2, __futex = 0, __total_seq = 0, __wakeup_seq =
0, __woken_seq = 0, __mutex = 0x0,
__nwaiters = 0, __broadcast_seq = 0}, __size = "\002", '\000'
<repeats 46 times>, __align = 2}


C code from glibc/nptl:

int
__pthread_cond_broadcast (cond)
     pthread_cond_t *cond;
{
  int pshared = (cond->__data.__mutex == (void *) ~0l)
                ? LLL_SHARED : LLL_PRIVATE;
  /* Make sure we are alone.  */
  lll_lock (cond->__data.__lock, pshared);

  /* Are there any waiters to be woken?  */
  if (cond->__data.__total_seq > cond->__data.__wakeup_seq)
    {
      /* Yes.  Mark them all as woken.  */
      cond->__data.__wakeup_seq = cond->__data.__total_seq;
      cond->__data.__woken_seq = cond->__data.__total_seq;
      ...


Lowest stack frame from gdb (what was actually compiled is evidently a
hand-coded assembly version of the above):

.globl __pthread_cond_broadcast
.type __pthread_cond_broadcast, @function
.align 16
__pthread_cond_broadcast:

/* Get internal lock. */
movl $1, %esi
xorl %eax, %eax
LOCK
#if cond_lock == 0
cmpxchgl %esi, (%rdi)
#else
cmpxchgl %esi, cond_lock(%rdi)
#endif
jnz 1f

2: addq $cond_futex, %rdi
movq total_seq-cond_futex(%rdi), %r9
cmpq wakeup_seq-cond_futex(%rdi), %r9
jna 4f

/* Cause all currently waiting threads to recognize they are
woken up. */
movq %r9, wakeup_seq-cond_futex(%rdi)
movq %r9, woken_seq-cond_futex(%rdi)
addq %r9, %r9
movl %r9d, (%rdi)
incl broadcast_seq-cond_futex(%rdi)

/* Get the address of the mutex used. */
movq dep_mutex-cond_futex(%rdi), %r8

/* Unlock. */
LOCK
decl cond_lock-cond_futex(%rdi)
jne 7f

8: cmpq $-1, %r8
je 9f

/* XXX: The kernel so far doesn't support requeue to PI futex. */
/* XXX: The kernel only supports FUTEX_CMP_REQUEUE to the same
type of futex (private resp. shared). */
testl $(PI_BIT | PS_BIT), MUTEX_KIND(%r8)
jne 9f

/* Wake up all threads. */
#ifdef __ASSUME_PRIVATE_FUTEX
movl $(FUTEX_CMP_REQUEUE|FUTEX_PRIVATE_FLAG), %esi
#else
movl %fs:PRIVATE_FUTEX, %esi
orl $FUTEX_CMP_REQUEUE, %esi
#endif
movl $SYS_futex, %eax
movl $1, %edx
movl $0x7fffffff, %r10d
syscall

/* For any kind of error, which mainly is EAGAIN, we try again
with WAKE. The general test also covers running on old
kernels. */
cmpq $-4095, %rax
jae 9f

10: xorl %eax, %eax
retq

.align 16
/* Unlock. */
4: LOCK
decl cond_lock-cond_futex(%rdi)
jne 5f

6: xorl %eax, %eax
retq

/* Initial locking failed. */
1:
#if cond_lock != 0
addq $cond_lock, %rdi
#endif
cmpq $-1, dep_mutex-cond_lock(%rdi)
movl $LLL_PRIVATE, %eax
movl $LLL_SHARED, %esi
cmovne %eax, %esi
=> callq __lll_lock_wait
#if cond_lock != 0
subq $cond_lock, %rdi
#endif
jmp 2b


..................................................
next stack down
..................................................

#ifdef NOT_IN_libc
.globl __lll_lock_wait
.type __lll_lock_wait,@function
.hidden __lll_lock_wait
.align 16
__lll_lock_wait:
cfi_startproc
pushq %r10
cfi_adjust_cfa_offset(8)
pushq %rdx
cfi_adjust_cfa_offset(8)
cfi_offset(%r10, -16)
cfi_offset(%rdx, -24)
xorq %r10, %r10 /* No timeout. */
movl $2, %edx
LOAD_FUTEX_WAIT (%esi)

cmpl %edx, %eax /* NB: %edx == 2 */
jne 2f

1: movl $SYS_futex, %eax
syscall

=> movl %edx, %eax
xchgl %eax, (%rdi) /* NB: lock is implied */

testl %eax, %eax
jnz 1b
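
(For completeness, the matching release path in C form -- again an
approximation rather than the exact glibc code. The holder swaps 0
into the lock word and issues FUTEX_WAKE only if the old value was 2,
i.e. only if contention was recorded:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void lll_unlock_sketch(int *lock)
{
    /* Atomically swap 0 into the lock word; the old value says
     * whether anyone was waiting: 1 -> nothing to do, 2 -> wake one
     * sleeper via the futex. */
    if (__sync_lock_test_and_set(lock, 0) == 2)
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}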




OTHER THREADS -- waiting to get the external mutex
--------------------------------------------------
Caller's view of the mutex

(gdb) p m
$2 = (ACE_thread_mutex_t *) 0x1dc3960
(gdb) p *m
$3 = {__data = {__lock = 2, __count = 0, __owner = 25521, __nusers = 1,
__kind = 0, __spins = 0, __list = {
__prev = 0x0, __next = 0x0}},

Lower stack levels:

int
__pthread_mutex_lock (mutex)
     pthread_mutex_t *mutex;
{
  assert (sizeof (mutex->__size) >= sizeof (mutex->__data));

  unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
  if (__builtin_expect (type & ~PTHREAD_MUTEX_KIND_MASK_NP, 0))
    return __pthread_mutex_lock_full (mutex);

  pid_t id = THREAD_GETMEM (THREAD_SELF, tid);

  if (__builtin_expect (type, PTHREAD_MUTEX_TIMED_NP)
      == PTHREAD_MUTEX_TIMED_NP)
    {
    simple:
      /* Normal mutex.  */
=>    LLL_MUTEX_LOCK (mutex);
      assert (mutex->__data.__owner == 0);

..................................................
next stack down
..................................................
#ifdef NOT_IN_libc
.globl __lll_lock_wait
.type __lll_lock_wait,@function
.hidden __lll_lock_wait
.align 16
__lll_lock_wait:
cfi_startproc
pushq %r10
cfi_adjust_cfa_offset(8)
pushq %rdx
cfi_adjust_cfa_offset(8)
cfi_offset(%r10, -16)
cfi_offset(%rdx, -24)
xorq %r10, %r10 /* No timeout. */
movl $2, %edx
LOAD_FUTEX_WAIT (%esi)

cmpl %edx, %eax /* NB: %edx == 2 */
jne 2f

1: movl $SYS_futex, %eax
syscall

=> movl %edx, %eax
xchgl %eax, (%rdi) /* NB: lock is implied */


2010-01-12 14:50:26

by Cong Wang

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

On Tue, Jan 12, 2010 at 04:18:07AM -0500, Andrew Athan wrote:
>
> After some investigation I believe I am experiencing a problem similar
> to the one described in this posting:
> http://sourceware.org/ml/libc-help/2009-10/msg00026.html, in that the
> poster suspects a problem in the futex implementation in 2.6.30 and
> above kernels. In my case, the problem is not a soft lockup in the
> kernel, but it does result in an application lockup, with all threads
> waiting on futexes.
>
> For me this problem began to appear once I upgraded my Debian
> squeeze/testing x86_64 installation (AMD) to a new kernel. I'm not
> sure what the prior kernel version was. The same software running on
> different machines with earlier kernels (lenny) does not seem to
> experience the problem.
>
> I'm really not sure if this is a libc or kernel problem, but due to
> the stack trace, which shows what appears to be a hang on the internal
> __lock of the condition variable, it appears likely this is not an
> application bug. Memory does not appear to be corrupt (I store
> sentinels around the mutexes, and they have retained their values).
>
> It appears that the cond var's __lock indicates there are waiters
> even though there are (or should be) none (assuming I'm interpreting
> the __lock value of 2 correctly). Since the __lock in question is a
> futex primitive, and it must be held regardless of other libc/nptl
> state variables, I don't believe this is a libc problem.
>
> The problem occurs rarely, but inevitably, and sometimes only after
> several hours of normal program operation. I have not yet
> successfully created a reduced test program that can faithfully
> reproduce the hang in a short timeframe.
>
> The application contains a thread pool where threads perform many
> operations between pthread calls but can be summarized as one of three
> cases below. Due to the design of the thread pool, threads
> round-robin or at least are randomly assigned a workload (in contrast
> to having one constant broadcast thread).
>
> case 1: while (1) { *A* pthread_mutex_lock(&m); pthread_mutex_unlock(&m); }
> case 2: pthread_mutex_lock(&m); pthread_cond_wait(&cv, &m); pthread_mutex_unlock(&m);
> case 3: pthread_mutex_lock(&m); *B* pthread_cond_broadcast(&cv); pthread_mutex_unlock(&m);
>
> The application becomes hung with all threads but one stuck at *A*,
> and one thread at *B*.
>
> The stack trace and other details appear below. I've saved the core
> file in case I can provide additional information.
>
>
> $ uname -a
> Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
> GNU/Linux

Hmm, thanks for reporting this here.

Adding futex experts to Cc...


--
Live like a child, think like the god.

2010-01-12 14:55:46

by Peter Zijlstra

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:

> > $ uname -a
> > Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
> > GNU/Linux

Does a recent kernel work?

2010-01-12 15:00:59

by Cong Wang

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>
>> > $ uname -a
>> > Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
>> > GNU/Linux
>
> Does a recent kernel work?
>
>

Ah, I just wanted to ask the same question, adding the original reporter
Gong Cheng into Cc...

Gong, could you reproduce it on the latest kernel? And what is your .config?

Thanks!

2010-01-12 16:30:43

by Andrew Athan

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

Américo Wang wrote:
> On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra <[email protected]> wrote:
>
>> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>>
>>
>>>> $ uname -a
>>>> Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
>>>> GNU/Linux
>>>>
>> Does a recent kernel work?
>>
>>
>>
>
> Ah, I just wanted to ask the same question, adding the original reporter
> Gong Cheng into Cc...
>
> Gong, could you reproduce it on the latest kernel? And what is your .config?
>
> Thanks!
>
Due to the remote location of the hardware, I haven't been able to test
a more recent (or older) kernel. Remote hands put a KVM on the box
as of an hour ago, so I hope to have some information for you in a day
or two.

A.

2010-01-12 17:59:43

by Gong Cheng

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

Hi guys,
It has been a little while since we were hit by the problem, and I haven't touched that area since, so forgive me if I cannot answer all your questions.
Basically, after switching from 2.6.31.1 to 2.6.31.4, the problem seems to have gone away.
It was very reproducible for us on 2.6.31.1 (if coupled with glibc 2.5.34.1 or any version around that; I believe any version after the private futex work in glibc).
After the switch, we might have had another similar occurrence, but it was not clear whether it was the same problem. Since it didn't occur again, we let it go.
We are now on 2.6.32.1, and it is still running fine. So the problem is either completely gone with the newer kernels, or it is just much less likely to happen.
That is pretty much all I can provide; I hope it helps.

-gong




----- Original Message ----
From: Américo Wang <[email protected]>
To: Peter Zijlstra <[email protected]>
Cc: Andrew Athan <[email protected]>; [email protected]; Darren Hart <[email protected]>; Thomas Gleixner <[email protected]>; Ingo Molnar <[email protected]>; Gong Cheng <[email protected]>
Sent: Tue, January 12, 2010 7:00:57 AM
Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>
>> > $ uname -a
>> > Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
>> > GNU/Linux
>
> Does a recent kernel work?
>
>

Ah, I just wanted to ask the same question, adding the original reporter
Gong Cheng into Cc...

Gong, could you reproduce it on the latest kernel? And what is your .config?

Thanks!

2010-01-13 16:01:49

by Cong Wang

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

On Tue, Jan 12, 2010 at 09:53:01AM -0800, Gong Cheng wrote:
> Hi guys,
> It has been a little while since we were hit by the problem, and I haven't touched that area since, so forgive me if I cannot answer all your questions.
> Basically, after switching from 2.6.31.1 to 2.6.31.4, the problem seems to have gone away.
> It was very reproducible for us on 2.6.31.1 (if coupled with glibc 2.5.34.1 or any version around that; I believe any version after the private futex work in glibc).
> After the switch, we might have had another similar occurrence, but it was not clear whether it was the same problem. Since it didn't occur again, we let it go.
> We are now on 2.6.32.1, and it is still running fine. So the problem is either completely gone with the newer kernels, or it is just much less likely to happen.
> That is pretty much all I can provide; I hope it helps.
>

OK, thanks for the confirmation.

--
Live like a child, think like the god.

2010-01-20 05:00:31

by Andrew Athan

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

Andrew Athan wrote:
> Américo Wang wrote:
>> On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra
>> <[email protected]> wrote:
>>
>>> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>>>
>>>
>>>>> $ uname -a
>>>>> Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
>>>>> GNU/Linux
>>>>>
>>> Does a recent kernel work?
>>>
>>>
>>>
>>
>> Ah, I just wanted to ask the same question, adding the original reporter
>> Gong Cheng into Cc...
>>
>> Gong, could you reproduce it on the latest kernel? And what is your
>> .config?
>>
>> Thanks!
>>
> Due to the remote location of the hardware, I haven't been able to test
> a more recent (or older) kernel. Remote hands put a KVM on the
> box as of an hour ago, so I hope to have some information for you in a
> day or two.
>
> A.
>


I wanted to report that although I have had no luck (so far) running
anything more recent than 2.6.30, I was able to revert to 2.6.26.
Unfortunately, the application hang still occurs. I also saw a similar
hang of the application running on a 32-bit Intel box, also under
2.6.26. So far, the hang *always* involves threads stuck on
pthread_cond_broadcast()'s condition variable's internal lock while
other threads are waiting on the outer "public" lock. These other
threads are *not* yet in (nor about to call) pthread_cond_wait(). I
saw a message from Darren Hart (subject "Re: Problems with futex") in
response to someone who apparently was having futex problems in 2.6.27,
so I'm still operating under the assumption that this is not an
application bug.

Over the next couple of days, I will be running a version of the
application in which I have replaced the pthread_cond calls with
simpler locks, in the hope that it won't hang (I'm hoping the
underlying pthreads implementation will then use a different set of
futex opcodes).
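
Roughly, the substitution I have in mind looks like this (a sketch
only, assuming a POSIX semaphore is an acceptable stand-in: sem_wait()
and sem_post() operate on the semaphore value's futex directly, with
no condvar-style internal lock):

#include <semaphore.h>

static sem_t work_ready;        /* sem_init(&work_ready, 0, 0) at startup */

/* Waiter side -- replaces lock / pthread_cond_wait / unlock. */
static void wait_for_work(void)
{
    sem_wait(&work_ready);      /* plain FUTEX_WAIT underneath */
}

/* Wakeup side -- replaces lock / pthread_cond_broadcast / unlock,
 * approximating the broadcast with one post per waiting thread. */
static void announce_work(int nwaiters)
{
    while (nwaiters-- > 0)
        sem_post(&work_ready);  /* plain FUTEX_WAKE underneath */
}

Shared state would still be guarded by the existing mutex; only the
wait/broadcast pair changes.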

Andrew Athan

2010-01-20 17:54:39

by Darren Hart

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

Andrew Athan wrote:
> Andrew Athan wrote:
>> Américo Wang wrote:
>>> On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra
>>> <[email protected]> wrote:
>>>
>>>> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>>>>
>>>>
>>>>>> $ uname -a
>>>>>> Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009 x86_64
>>>>>> GNU/Linux
>>>>>>
>>>> Does a recent kernel work?
>>>>
>>>>
>>>>
>>>
>>> Ah, I just wanted to ask the same question, adding the original reporter
>>> Gong Cheng into Cc...
>>>
>>> Gong, could you reproduce it on the latest kernel? And what is your
>>> .config?
>>>
>>> Thanks!
>>>
>> Due to the remote location of the hardware, I haven't been able to test
>> a more recent (or older) kernel. Remote hands put a KVM on the
>> box as of an hour ago, so I hope to have some information for you in a
>> day or two.
>>
>> A.
>>
>
>
> I wanted to report that although I have had no luck (so far) running
> anything more recent than 2.6.30, I was able to revert to 2.6.26.
> Unfortunately, the application hang still occurs. I also saw a similar
> hang of the application running on a 32-bit Intel box, also under
> 2.6.26. So far, the hang *always* involves threads stuck on
> pthread_cond_broadcast()'s condition variable's internal lock while
> other threads are waiting on the outer "public" lock.


Are you using real-time scheduling policy or priority inheritance
(PTHREAD_PRIO_INHERIT)? It is possible to suffer an unbounded priority
inversion on the internal condvar data lock in the current distro
implementations of glibc.
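
For reference, an application opts into priority inheritance per mutex
via its attributes -- a minimal sketch (whether this applies here
depends on your setup):

#include <pthread.h>

/* Create a mutex with the priority-inheritance protocol. */
static int make_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

Note this covers only the user-visible mutex; the condvar's internal
lock is not a PI lock in current glibc, which is where the inversion
described above can occur.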


> These other
> threads are *not* yet in (nor about to call) pthread_cond_wait(). I
> saw a message from Darren Hart (subject "Re: Problems with futex") in
> response to someone who apparently was having futex problems in 2.6.27,
> so I'm still operating under the assumption that this is not an
> application bug.

Those all turned out to be application issues, with one exception that
had already been fixed upstream.


> Over the next couple of days, I will be running a version of the
> application in which I have replaced the pthread_cond calls with
> simpler locks, in the hope that it won't hang (I'm hoping the
> underlying pthreads implementation will then use a different set of
> futex opcodes).
>
> Andrew Athan
>


--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-01-28 17:46:56

by Andrew Athan

Subject: Re: Futex hang/lockup problem in 2.6.30+ on AMD64

Darren Hart wrote:
> Andrew Athan wrote:
>> Andrew Athan wrote:
>>> Américo Wang wrote:
>>>> On Tue, Jan 12, 2010 at 10:55 PM, Peter Zijlstra
>>>> <[email protected]> wrote:
>>>>
>>>>> On Tue, 2010-01-12 at 22:52 +0800, Américo Wang wrote:
>>>>>
>>>>>
>>>>>>> $ uname -a
>>>>>>> Linux UK22 2.6.30-2-amd64 #1 SMP Fri Sep 25 22:16:56 UTC 2009
>>>>>>> x86_64
>>>>>>> GNU/Linux
>>>>>>>
>>>>> Does a recent kernel work?
>>>>>
>>>>>
>>>>>
>>>>
>>>> Ah, I just wanted to ask the same question, adding the original
>>>> reporter
>>>> Gong Cheng into Cc...
>>>>
>>>> Gong, could you reproduce it on the latest kernel? And what is your
>>>> .config?
>>>>
>>>> Thanks!
>>>>
>>> Due to the remote location of the hardware, I haven't been able to
>>> test a more recent (or older) kernel. Remote hands put a KVM
>>> on the box as of an hour ago, so I hope to have some information for
>>> you in a day or two.
>>>
>>> A.
>>>
>>
>>
>> I wanted to report that although I have had no luck (so far) running
>> anything more recent than 2.6.30, I was able to revert to 2.6.26.
>> Unfortunately, the application hang still occurs. I also saw a
>> similar hang of the application running on a 32-bit Intel box, also
>> under 2.6.26. So far, the hang *always* involves threads stuck on
>> pthread_cond_broadcast()'s condition variable's internal lock while
>> other threads are waiting on the outer "public" lock.
>
>
> Are you using real-time scheduling policy or priority inheritance
> (PTHREAD_PRIO_INHERIT)? It is possible to suffer an unbounded priority
> inversion on the internal condvar data lock in the current distro
> implementations of glibc.
>
>
>> These other threads are *not* yet in (nor about to call)
>> pthread_cond_wait(). I saw a message from Darren Hart (subject "Re:
>> Problems with futex") in response to someone who apparently was
>> having futex problems in 2.6.27, so I'm still operating under the
>> assumption that this is not an application bug.
>
> Those all turned out to be application issues with one exception which
> had already been fixed upstream.
>
>
>> Over the next couple of days, I will be running a version of the
>> application in which I have replaced the pthread_cond calls with
>> simpler locks, in the hope that it won't hang (I'm hoping the
>> underlying pthreads implementation will then use a different set of
>> futex opcodes).
>>
>> Andrew Athan
>>
>
>

I wanted to report that this application hang is certainly related to
the pthread_cond_* calls. With them in place, it consistently hangs;
without them, it consistently does not. Whether pthread_cond_* is
misbehaving due to memory corruption or some other application bug is,
I suppose, an open question.

We have now experienced several lockups where even a kill -9 of the
application won't get rid of it. Does this say anything about the
nature of the hang?

By the way, majordomo stopped sending me emails as of 1/17, so I have
not seen any updates to this thread sent after that date. Not sure why
this happened, as I never asked to be unsubscribed. I've resubscribed,
but I'm not sure I will get anything. Please make sure I am directly
Cc'ed on any responses.

carlinux138:~# uname -a
Linux carlinux138.thinktradellc.com 2.6.26-2-686 #1 SMP Sun Jun 21
04:57:38 UTC 2009 i686 GNU/Linux

(I have to go look up the best way to give a system config snapshot,
e.g., all major library versions, etc. ...)

Thanks,
Andrew Athan