2008-11-07 09:45:58

by Michael Dressel

Subject: pthread_mutex_lock hangs on unlocked mutex

Hi,

(I'm not subscribed to the list, please CC me.)

in our software three processes are using several pthread mutexes.
Sometimes a process hangs inside pthread_mutex_lock even though the mutex is
not locked. I can tell it's not locked because another process is still
running and locking and unlocking the mutex.
When I connect with gdb to the hanging process I find:

#0 0xffffe430 in __kernel_vsyscall ()
#1 0xf7d5e7a9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0xf7d59c75 in _L_lock_288 () from /lib/libpthread.so.0
#3 0xf7d596c5 in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0xf7e444b6 in pthread_mutex_lock () from /lib/libc.so.6
#5 0x080fb338 in <myfunction> (mutexP=0x70f4165c)

And the mutexP looks like this:
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = 1,
__nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
__size = '\0' <repeats 12 times>, "\001\000\000\000\000\000\000\000\000\000\000",
__align = 0}

I guess the process had to wait for the mutex but when the mutex was unlocked
the signal to the waiting process got lost.

Our software was built on a SUSE 10.1
kernel 2.6.16.13-4-default
glibc-32bit-2.4-27
system as a 32-bit binary. On this system the problem does not occur.

The system showing the problem is a dual-processor SUSE 11.0
kernel 2.6.25.18-0.2-default
glibc-32bit-2.8-14.1
system. The problem also occurs when the second core is disabled in the
BIOS. Some more details of the system are listed below.

When the process hangs it does not recover on its own. But if
I send a kill -STOP <pid>; kill -CONT <pid> sequence to the hanging
process it does continue.
Attaching and then detaching gdb has the same effect.

Is there any way, either in configuring the system or pthread, to prevent
this problem?

Cheers,
Michael

openSUSE 11.0 (X86-64)

Linux 2.6.25.18-0.2-default #1 SMP 2008-10-21 16:30:26 +0200 x86_64 x86_64 x86_64 GNU/Linux

glibc-32bit-2.8-14.1

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
stepping : 11
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips : 4658.50
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:


2008-11-07 10:47:20

by Bart Van Assche

Subject: Re: pthread_mutex_lock hangs on unlocked mutex

On Fri, Nov 7, 2008 at 9:43 AM, Michael Dressel
<[email protected]> wrote:
> in our software three processes are using several pthread mutexes.

LKML is a mailing list to discuss kernel issues. What you report is
most likely a userspace issue. Please check your application with the
Valgrind tools memcheck, helgrind and/or drd first.

Bart.

2008-11-13 05:43:40

by Ian Kent

Subject: Re: pthread_mutex_lock hangs on unlocked mutex

On Fri, 7 Nov 2008, Michael Dressel wrote:

> Hi,
>
> (I'm not subscribed to the list, please CC me.)
>
> in our software three processes are using several pthread mutexes.
> Sometimes a process hangs inside pthread_mutex_lock even though the mutex is
> not locked. I can tell it's not locked because another process is still
> running and locking and unlocking the mutex.

pthreads is implemented in glibc.
If you really think there is a bug in the pthreads implementation then
the glibc maintainers will require you to produce a simple example program
which demonstrates the bug before it's accepted as a bug.

When you say processes you mean threads, right?

If you can't produce such an example program and you can prove (to
yourself) that there are no use-after-free or execution-order issues with
your code then your only option is to develop a workaround.

Your code wouldn't happen to be doing thread synchronization with
pthread_cond_wait()/pthread_cond_signal() would it?

2008-11-13 08:04:48

by Michael Dressel

Subject: Re: pthread_mutex_lock hangs on unlocked mutex

On Thu, 13 Nov 2008, Ian Kent wrote:

> On Fri, 7 Nov 2008, Michael Dressel wrote:
>
>> Hi,
>>
>> (I'm not subscribed to the list, please CC me.)
>>
>> in our software three processes are using several pthread mutexes.
>> Sometimes a process hangs inside pthread_mutex_lock even though the mutex is
>> not locked. I can tell it's not locked because another process is still
>> running and locking and unlocking the mutex.
>
> pthreads is implemented in glibc.
> If you really think there is a bug in the pthreads implementation then
> the glibc maintainers will require you to produce a simple example program
> which demonstrates the bug before it's accepted as a bug.

Yes.
I have not found any report related exactly to my problem in the
mailing lists or bug reports. But to be sure I didn't overlook something
I posted my problem. It looks like it's unique to me.

I failed to produce a simple example demonstrating the problem. In our
code we use timers and real time signals and we change process masks
with sigprocmask. If there is a bug at all (I don't think so) a
program to demonstrate that bug would potentially have to do all of
these things and would therefore not be simple.

Following Bart Van Assche's suggestion, I ran valgrind (the default tool
and helgrind) but did not find anything obviously related to my problem.

>
> When you say processes you mean threads, right?
>

No. We don't use threads. The mutexes are used between processes. I used
them because they feature recursion.
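
For reference, POSIX only defines cross-process use of a mutex when it is
initialised with the PTHREAD_PROCESS_SHARED attribute and placed in memory
that every participating process maps. The thread does not show how the
mutexes here were set up, so the following is only a minimal sketch of such
a setup combined with the recursive type. It rests on assumptions: the names
shared_state, shared_state_create, bump and run_contention_demo are
hypothetical, and anonymous shared memory inherited across fork() stands in
for whatever shared segment the real application uses.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared state: one mutex and the counter it protects. */
struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

/* Place the state in anonymous shared memory and initialise the mutex
 * as both process-shared and recursive. Returns NULL on failure. */
static struct shared_state *shared_state_create(void)
{
    struct shared_state *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(s, sizeof *s);
        return NULL;
    }
    int rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0)
        rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    if (rc != 0) {
        munmap(s, sizeof *s);
        return NULL;
    }
    s->counter = 0;
    return s;
}

/* Increment the shared counter, taking the lock twice per iteration to
 * exercise recursion. Lock failures are treated as fatal in this demo. */
static void bump(struct shared_state *s, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        int rc = pthread_mutex_lock(&s->lock);
        assert(rc == 0);
        rc = pthread_mutex_lock(&s->lock);   /* nested lock by the owner */
        assert(rc == 0);
        (void)rc;
        s->counter++;
        pthread_mutex_unlock(&s->lock);
        pthread_mutex_unlock(&s->lock);
    }
}

/* Fork one child; parent and child both bump the counter.
 * Returns the final counter value, or -1 on error. */
long run_contention_demo(int iterations)
{
    struct shared_state *s = shared_state_create();
    if (s == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(s, sizeof *s);
        return -1;
    }
    if (pid == 0) {                          /* child process */
        bump(s, iterations);
        _exit(0);
    }

    bump(s, iterations);                     /* parent process */

    int status = 0;
    long total = -1;
    if (waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        total = s->counter;

    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return total;
}
```

Built with -pthread, run_contention_demo(n) returns 2 * n when both
processes complete, since each one performs n increments under the lock.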

> If you can't produce such an example program and you can prove (to
> yourself) that there are no use-after-free or execution-order issues with
> your code then your only option is to develop a workaround.
>

I found a workaround. We now use ordinary semaphores, which is possible
because we don't use multiple threads. To provide recursion I had to
implement a per-process counter. This would not work if the semaphore
were needed during signal handler execution, but that does not
happen in our application.
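
The workaround described above might look roughly like the sketch below. It
is not the code from the application, which was never posted. It assumes a
POSIX unnamed semaphore created with sem_init() and pshared set to 1, placed
in anonymous shared memory. All names (shared_area, rlock, runlock,
run_semaphore_demo) are hypothetical. The recursion depth is an ordinary
static variable, so each process gets its own copy after fork(); as noted
above, this is only valid for single-threaded processes that never take the
lock from a signal handler.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical shared area: one semaphore and the counter it protects. */
struct shared_area {
    sem_t sem;
    long counter;
};

/* Per-process recursion depth. It lives in private memory, so it is
 * not shared between processes. Not thread-safe, not signal-safe. */
static int depth;

/* Acquire the lock; only the outermost call touches the semaphore. */
int rlock(sem_t *sem)
{
    if (depth == 0) {
        while (sem_wait(sem) == -1) {
            if (errno != EINTR)              /* retry if a signal interrupted us */
                return -1;
        }
    }
    depth++;
    return 0;
}

/* Release the lock; only the outermost call posts the semaphore. */
int runlock(sem_t *sem)
{
    assert(depth > 0);
    if (--depth == 0)
        return sem_post(sem);
    return 0;
}

/* Increment the shared counter with a nested acquire per iteration. */
static int bump_area(struct shared_area *a, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        if (rlock(&a->sem) != 0 || rlock(&a->sem) != 0)
            return -1;
        a->counter++;
        if (runlock(&a->sem) != 0 || runlock(&a->sem) != 0)
            return -1;
    }
    return 0;
}

/* Fork one child; parent and child both bump the counter.
 * Returns the final counter value, or -1 on error. */
long run_semaphore_demo(int iterations)
{
    struct shared_area *a = mmap(NULL, sizeof *a, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED)
        return -1;
    a->counter = 0;
    if (sem_init(&a->sem, 1, 1) != 0) {      /* pshared = 1, initial value 1 */
        munmap(a, sizeof *a);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        sem_destroy(&a->sem);
        munmap(a, sizeof *a);
        return -1;
    }
    if (pid == 0)                            /* child process */
        _exit(bump_area(a, iterations) == 0 ? 0 : 1);

    int ok = (bump_area(a, iterations) == 0);

    int status = 0;
    long total = -1;
    if (waitpid(pid, &status, 0) == pid && ok &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        total = a->counter;

    sem_destroy(&a->sem);
    munmap(a, sizeof *a);
    return total;
}
```

Unlike a recursive mutex, the semaphore does not record an owner, so nothing
stops a process from calling runlock() without a matching rlock() other than
the assert on the local depth counter.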

> Your code wouldn't happen to be doing thread synchronization with
> pthread_cond_wait()/pthread_cond_signal() would it?

No since we don't use multiple threads.

The reason I wondered whether this is a kernel issue (maybe a configuration
one) is that sending a STOP/CONT signal sequence to the hanging
process got it going again. So at least it is not a classical deadlock.

Cheers,
Michael