LinuxLists.cc - Recursion bug in -rt

2005-12-14 22:21:39

by Dinakar Guniguntala

[permalink] [raw]

Subject: Recursion bug in -rt

Hi David,

I hit this bug with -rt22-rf11

==========================================
[ BUG: lock recursion deadlock detected! |
------------------------------------------
already locked: [f7abbc94] {futex}
.. held by: testpi-3: 4595 [f7becdd0, 59]
... acquired at: futex_wait_robust+0x142/0x1f3
------------------------------
| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
------------------------------

#001: [f7abbc94] {futex}
... acquired at: futex_wait_robust+0x142/0x1f3

-{current task's backtrace}----------------->
[<c0103e04>] dump_stack+0x1e/0x20 (20)
[<c0136bc2>] check_deadlock+0x2d7/0x334 (44)
[<c01379bc>] task_blocks_on_lock+0x2c/0x224 (36)
[<c03f29c5>] __down_interruptible+0x37c/0x95d (160)
[<c013aebf>] down_futex+0xa3/0xe7 (40)
[<c013ebc5>] futex_wait_robust+0x142/0x1f3 (72)
[<c013f35c>] do_futex+0x9a/0x109 (40)
[<c013f4dd>] sys_futex+0x112/0x11e (68)
[<c0102f03>] sysenter_past_esp+0x54/0x75 (-8116)
------------------------------
| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
------------------------------

#001: [f7abbc94] {futex}
... acquired at: futex_wait_robust+0x142/0x1f3

---------------------------------------------------------------------

futex.c -> futex_wait_robust

if ((curval & FUTEX_PID) == current->pid) {
ret = -EAGAIN;
goto out_unlock;
}

rt.c -> down_futex

if (!owner_task || owner_task == current) {
up(sem);
up_read(&current->mm->mmap_sem);
return -EAGAIN;
}

I noticed that both the above checks below have been removed in your
patch. I do understand that the futex_wait_robust path has been
made similar to the futex_wait path, but I think we are not taking
PI into consideration. Basically it looks like we still need to check
if the current task has become owner. or are we missing a lock somewhere ?

I added the down_futex check above and my test has been
running for hours without the oops. Without this check it
used to oops within minutes.

Patch that works for me attached below. Thoughts?

-Dinakar

Attachments:

(No filename) (2.16 kB)
check_current.patch (486.00 B)
Download all attachments

2005-12-15 01:03:18

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,
you may be correct, since reverting the change in down_futex seems to
work.
However, I'm wondering if you've hit a bug that Dave Carlson is
reporting that
he's tracked down to an inline in the glibc patch.

I'll explain what the kernel code is intending to do and you can let me
know what you think.

The changes to the latest robust futex patch are to close an
SMP race condition that Dave Carlson reported. The changes
were intended to make the robust wait/wake code behave like
the regular futex_wait/wake code, which doesn't see this race condition.

The change allowed the library to pass in the value that it
saw in the lock when it tried to lock the pthread_mutex. The kernel
then locks the mmap_sem and robust_sem to protect the kernel
lock and checks the value in the pthread_mutex again.

If the value has not changed since the time the library
saw it then the kernel continues to wait for the lock.

Reverting the check in down_futex would mean that the library
saw that the thread trying to lock the lock actually owned the lock in
user space and still tried to wait for the lock.

The library then passed this value into the kernel which checked
to see if the value had changed since the kernel acquired its locks.

The kernel determined that the value had not changed
and called down_futex to block for the lock.

The down futex code used the value in the lock to
find the owning task. The task it found was the current task!

Now there are two problems I see with the reversion:

1) Why did the library call into the kernel if the calling thread
owned the lock?

2) The reversion would have the kernel pass back EAGAIN
if it found that the the thread trying to lock the lock was
in fact the owning thread. EAGAIN would cause the
library to retry to lock the lock it already owns!

So I'm wondering if the error you're seeing is caused
by a bug in the library. I'm trying to get a glibc patch from Dave
Carlson that he says fixes his problem.

Can you run strace and see if you are getting any EAGAINs
back from mutex_lock? EAGAINs passed to user space is an
error, the library should try again as long as
the value it got back from wait_robust is EAGAIN.

What do you think?

David

On Dec 14, 2005, at 2:39 PM, Dinakar Guniguntala wrote:

> Hi David,
>
> I hit this bug with -rt22-rf11
>
> ==========================================
> [ BUG: lock recursion deadlock detected! |
> ------------------------------------------
> already locked: [f7abbc94] {futex}
> .. held by: testpi-3: 4595 [f7becdd0, 59]
> ... acquired at: futex_wait_robust+0x142/0x1f3
> ------------------------------
> | showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
> ------------------------------
>
> #001: [f7abbc94] {futex}
> ... acquired at: futex_wait_robust+0x142/0x1f3
>
> -{current task's backtrace}----------------->
> [<c0103e04>] dump_stack+0x1e/0x20 (20)
> [<c0136bc2>] check_deadlock+0x2d7/0x334 (44)
> [<c01379bc>] task_blocks_on_lock+0x2c/0x224 (36)
> [<c03f29c5>] __down_interruptible+0x37c/0x95d (160)
> [<c013aebf>] down_futex+0xa3/0xe7 (40)
> [<c013ebc5>] futex_wait_robust+0x142/0x1f3 (72)
> [<c013f35c>] do_futex+0x9a/0x109 (40)
> [<c013f4dd>] sys_futex+0x112/0x11e (68)
> [<c0102f03>] sysenter_past_esp+0x54/0x75 (-8116)
> ------------------------------
> | showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
> ------------------------------
>
> #001: [f7abbc94] {futex}
> ... acquired at: futex_wait_robust+0x142/0x1f3
>
> ---------------------------------------------------------------------
>
> futex.c -> futex_wait_robust
>
> if ((curval & FUTEX_PID) == current->pid) {
> ret = -EAGAIN;
> goto out_unlock;
> }
>
> rt.c -> down_futex
>
> if (!owner_task || owner_task == current) {
> up(sem);
> up_read(&current->mm->mmap_sem);
> return -EAGAIN;
> }
>
> I noticed that both the above checks below have been removed in your
> patch. I do understand that the futex_wait_robust path has been
> made similar to the futex_wait path, but I think we are not taking
> PI into consideration. Basically it looks like we still need to check
> if the current task has become owner. or are we missing a lock
> somewhere ?
>
> I added the down_futex check above and my test has been
> running for hours without the oops. Without this check it
> used to oops within minutes.
>
> Patch that works for me attached below. Thoughts?
>
> -Dinakar
>
>
> <check_current.patch>

2005-12-15 19:00:45

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,
after further testing and investigation I believe you original
assessment was correct.
The problem you are seeing is not a library problem.
The changes to down_futex need to be reverted. There is a new patch at

http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf2.

that reverts the changes to down_futex.

Thanks for testing this.

David

Dinakar Guniguntala wrote:

>Hi David,
>
>I hit this bug with -rt22-rf11
>
>==========================================
>[ BUG: lock recursion deadlock detected! |
>------------------------------------------
>already locked: [f7abbc94] {futex}
>.. held by: testpi-3: 4595 [f7becdd0, 59]
>... acquired at: futex_wait_robust+0x142/0x1f3
>------------------------------
>| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
>------------------------------
>
>#001: [f7abbc94] {futex}
>... acquired at: futex_wait_robust+0x142/0x1f3
>
>-{current task's backtrace}----------------->
> [<c0103e04>] dump_stack+0x1e/0x20 (20)
> [<c0136bc2>] check_deadlock+0x2d7/0x334 (44)
> [<c01379bc>] task_blocks_on_lock+0x2c/0x224 (36)
> [<c03f29c5>] __down_interruptible+0x37c/0x95d (160)
> [<c013aebf>] down_futex+0xa3/0xe7 (40)
> [<c013ebc5>] futex_wait_robust+0x142/0x1f3 (72)
> [<c013f35c>] do_futex+0x9a/0x109 (40)
> [<c013f4dd>] sys_futex+0x112/0x11e (68)
> [<c0102f03>] sysenter_past_esp+0x54/0x75 (-8116)
>------------------------------
>| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
>------------------------------
>
>#001: [f7abbc94] {futex}
>... acquired at: futex_wait_robust+0x142/0x1f3
>
>---------------------------------------------------------------------
>
>futex.c -> futex_wait_robust
>
> if ((curval & FUTEX_PID) == current->pid) {
> ret = -EAGAIN;
> goto out_unlock;
> }
>
>rt.c -> down_futex
>
> if (!owner_task || owner_task == current) {
> up(sem);
> up_read(&current->mm->mmap_sem);
> return -EAGAIN;
> }
>
>I noticed that both the above checks below have been removed in your
>patch. I do understand that the futex_wait_robust path has been
>made similar to the futex_wait path, but I think we are not taking
>PI into consideration. Basically it looks like we still need to check
>if the current task has become owner. or are we missing a lock somewhere ?
>
>I added the down_futex check above and my test has been
>running for hours without the oops. Without this check it
>used to oops within minutes.
>
>Patch that works for me attached below. Thoughts?
>
> -Dinakar
>
>
>
>
>------------------------------------------------------------------------
>
>Index: linux-2.6.14-rt22-rayrt5/kernel/rt.c
>===================================================================
>--- linux-2.6.14-rt22-rayrt5.orig/kernel/rt.c 2005-12-15 02:15:13.000000000 +0530
>+++ linux-2.6.14-rt22-rayrt5/kernel/rt.c 2005-12-15 02:18:29.000000000 +0530
>@@ -3001,7 +3001,7 @@
> * if the owner can't be found return try again.
> */
>
>- if (!owner_task) {
>+ if (!owner_task || owner_task == current) {
> up(sem);
> up_read(&current->mm->mmap_sem);
> return -EAGAIN;
>
>

2005-12-15 19:27:14

by Dinakar Guniguntala

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Wed, Dec 14, 2005 at 05:03:13PM -0800, david singleton wrote:
> Dinakar,
> you may be correct, since reverting the change in down_futex seems
> to work.

Well it did not. It ran for much longer than previously but still
hung up.

Well we have a very basic problem in the current implementation.

Currently if a thread calls pthread_mutex_lock on the same lock
(normal, non recursive lock) twice without unlocking in between, the
application hangs. Which is the right behaviour.
However if the same thing is done with a non recursive robust mutex,
it manages to acquire the lock.

I see many problems here (I am assuming that the right behaviour
with robust mutexes is for application to ultimately block
indefinitely in the kernel)

1. In down_futex we do the following check

if (owner_task != current)
down_try_futex(lock, owner_task->thread_info __EIP__);

In the above scenario, the thread would have acquired the uncontended
lock first time around in userspace. The second time it tries to
acquire the same mutex, because of the above check, does not
call down_try_futex and hence will not initialize the lock structure
in the kernel. It then goes to __down_interruptible where it is
granted the lock, which is wrong.

So IMO the above check is not right. However removing this check
is not the end of story. This time it gets to task_blocks_on_lock
and tries to grab the task->pi_lock of the owvner which is itself
and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
is not set). So it looks like we need to add some check to
prevent this below in case lock_owner happens to be current.

_raw_spin_lock(&lock_owner(lock)->task->pi_lock);

> However, I'm wondering if you've hit a bug that Dave Carlson is
> reporting that he's tracked down to an inline in the glibc patch.

Yes I noticed that, basically it looks like INLINE_SYSCALL, on error
returns -1 with the error code in errno. Whereas we expect the
return code to be the same as the kernel return code. Are you referring
to this or something else ??

However even with all of the above fixes (remove the check in down_futex
as mentioned above, add a check in task_blocks_on_lock and the glibc
changes) my test program continues to hang up the system, though it
takes a lot longer to recreate the problem now

[snip]

> 1) Why did the library call into the kernel if the calling thread
> owned the lock?

This is something I still havent figured out and leads me to believe
that we still have a tiny race race somewhere

-Dinakar

2005-12-15 19:35:21

by Dinakar Guniguntala

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Thu, Dec 15, 2005 at 11:00:49AM -0800, David Singleton wrote:
> Dinakar,
> after further testing and investigation I believe you original
> assessment was correct.
> The problem you are seeing is not a library problem.
> The changes to down_futex need to be reverted. There is a new patch at
>
> http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf2.
>
> that reverts the changes to down_futex.
>

David, See my previous mail. IMO this patch is not right and
besides it does not fix the hang that I am seeing either.

I am continuing to debug the hang. Will keep you posted

-Dinakar

> Dinakar Guniguntala wrote:
>
> >Hi David,
> >
> >I hit this bug with -rt22-rf11
> >
> >==========================================
> >[ BUG: lock recursion deadlock detected! |
> >------------------------------------------
> >already locked: [f7abbc94] {futex}
> >.. held by: testpi-3: 4595 [f7becdd0, 59]
> >... acquired at: futex_wait_robust+0x142/0x1f3
> >------------------------------
> >| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
> >------------------------------
> >
> >#001: [f7abbc94] {futex}
> >... acquired at: futex_wait_robust+0x142/0x1f3
> >
> >-{current task's backtrace}----------------->
> >[<c0103e04>] dump_stack+0x1e/0x20 (20)
> >[<c0136bc2>] check_deadlock+0x2d7/0x334 (44)
> >[<c01379bc>] task_blocks_on_lock+0x2c/0x224 (36)
> >[<c03f29c5>] __down_interruptible+0x37c/0x95d (160)
> >[<c013aebf>] down_futex+0xa3/0xe7 (40)
> >[<c013ebc5>] futex_wait_robust+0x142/0x1f3 (72)
> >[<c013f35c>] do_futex+0x9a/0x109 (40)
> >[<c013f4dd>] sys_futex+0x112/0x11e (68)
> >[<c0102f03>] sysenter_past_esp+0x54/0x75 (-8116)
> >------------------------------
> >| showing all locks held by: | (testpi-3/4595 [f7becdd0, 59]):
> >------------------------------
> >
> >#001: [f7abbc94] {futex}
> >... acquired at: futex_wait_robust+0x142/0x1f3
> >
> >---------------------------------------------------------------------
> >
> >futex.c -> futex_wait_robust
> >
> > if ((curval & FUTEX_PID) == current->pid) {
> > ret = -EAGAIN;
> > goto out_unlock;
> > }
> >
> >rt.c -> down_futex
> >
> > if (!owner_task || owner_task == current) {
> > up(sem);
> > up_read(&current->mm->mmap_sem);
> > return -EAGAIN;
> > }
> >
> >I noticed that both the above checks below have been removed in your
> >patch. I do understand that the futex_wait_robust path has been
> >made similar to the futex_wait path, but I think we are not taking
> >PI into consideration. Basically it looks like we still need to check
> >if the current task has become owner. or are we missing a lock somewhere ?
> >
> >I added the down_futex check above and my test has been
> >running for hours without the oops. Without this check it
> >used to oops within minutes.
> >
> >Patch that works for me attached below. Thoughts?
> >
> > -Dinakar
> >
> >
> >
> >
> >------------------------------------------------------------------------
> >
> >Index: linux-2.6.14-rt22-rayrt5/kernel/rt.c
> >===================================================================
> >--- linux-2.6.14-rt22-rayrt5.orig/kernel/rt.c 2005-12-15
> >02:15:13.000000000 +0530
> >+++ linux-2.6.14-rt22-rayrt5/kernel/rt.c 2005-12-15
> >02:18:29.000000000 +0530
> >@@ -3001,7 +3001,7 @@
> > * if the owner can't be found return try again.
> > */
> >
> >- if (!owner_task) {
> >+ if (!owner_task || owner_task == current) {
> > up(sem);
> > up_read(&current->mm->mmap_sem);
> > return -EAGAIN;
> >
> >
>

2005-12-15 20:40:01

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,
two items.

1) Yes, the INLINE_SYSCALL is the problem that Dave Carlson is seeing.

2) Are you saying that a recursive pthread_mutex enters the kernel on
subsequent lock opertions when it already owns the lock?

I believe the library should take care of recursive pthread mutexes and
not enter the kernel on subsequent calls. The futex_wait function is
designed to block on lock owned by another thread.

Let me go spend some time looking at the library code and see what it is
doing for recursive mutexes.

David

>On Wed, Dec 14, 2005 at 05:03:13PM -0800, david singleton wrote:
>
>
>>Dinakar,
>>you may be correct, since reverting the change in down_futex seems
>>to work.
>>
>>
>
>Well it did not. It ran for much longer than previously but still
>hung up.
>
>Well we have a very basic problem in the current implementation.
>
>Currently if a thread calls pthread_mutex_lock on the same lock
>(normal, non recursive lock) twice without unlocking in between, the
>application hangs. Which is the right behaviour.
>However if the same thing is done with a non recursive robust mutex,
>it manages to acquire the lock.
>
>I see many problems here (I am assuming that the right behaviour
>with robust mutexes is for application to ultimately block
>indefinitely in the kernel)
>
>1. In down_futex we do the following check
>
> if (owner_task != current)
> down_try_futex(lock, owner_task->thread_info __EIP__);
>
> In the above scenario, the thread would have acquired the uncontended
> lock first time around in userspace. The second time it tries to
> acquire the same mutex, because of the above check, does not
> call down_try_futex and hence will not initialize the lock structure
> in the kernel. It then goes to __down_interruptible where it is
> granted the lock, which is wrong.
>
> So IMO the above check is not right. However removing this check
> is not the end of story. This time it gets to task_blocks_on_lock
> and tries to grab the task->pi_lock of the owvner which is itself
> and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
> is not set). So it looks like we need to add some check to
> prevent this below in case lock_owner happens to be current.
>
> _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>
>
>
>
>>However, I'm wondering if you've hit a bug that Dave Carlson is
>>reporting that he's tracked down to an inline in the glibc patch.
>>
>>
>
>Yes I noticed that, basically it looks like INLINE_SYSCALL, on error
>returns -1 with the error code in errno. Whereas we expect the
>return code to be the same as the kernel return code. Are you referring
>to this or something else ??
>
>However even with all of the above fixes (remove the check in down_futex
>as mentioned above, add a check in task_blocks_on_lock and the glibc
>changes) my test program continues to hang up the system, though it
>takes a lot longer to recreate the problem now
>
>[snip]
>
>
>
>>1) Why did the library call into the kernel if the calling thread
>>owned the lock?
>>
>>
>
>This is something I still havent figured out and leads me to believe
>that we still have a tiny race race somewhere
>
> -Dinakar
>
>

2005-12-16 00:02:44

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,

I believe the problem we have is that the library is not checking
to see if the mutex is a recursive mutex, and then checking to see
if the recursive mutex is already owned by the calling thread. If a
recursive mutex
is owned by the calling thread the library should increment the lock
count (POSIX says recursive mutexes must be unlocked as
many times as they are locked) and return 0 (success) to the caller.

I don't see anywhere in the library where the check for a recursive
mutex is being made. The mutex_trylock code just does a compare
and exchange on the lock to see if it can get it. A recursive mutex
will fail this test and erroneously enter the kernel.

I believe we need the down_futex change and a patch
to glibc in which recursive mutexes are handled in the library.

I'm talking to the library people now to see if that's the way
it's supposed to work for recursive mutexes. If it is we'll have
to provide a new glibc patch along with the kernel patch.

I think I'll have to provide a new glibc patch anyways to
fix the INLINE_SYSCALL problem.

Does this make sense?

David

On Dec 15, 2005, at 11:44 AM, Dinakar Guniguntala wrote:

> On Wed, Dec 14, 2005 at 05:03:13PM -0800, david singleton wrote:
>> Dinakar,
>> you may be correct, since reverting the change in down_futex seems
>> to work.
>
> Well it did not. It ran for much longer than previously but still
> hung up.
>
> Well we have a very basic problem in the current implementation.
>
> Currently if a thread calls pthread_mutex_lock on the same lock
> (normal, non recursive lock) twice without unlocking in between, the
> application hangs. Which is the right behaviour.
> However if the same thing is done with a non recursive robust mutex,
> it manages to acquire the lock.
>
> I see many problems here (I am assuming that the right behaviour
> with robust mutexes is for application to ultimately block
> indefinitely in the kernel)
>
> 1. In down_futex we do the following check
>
> if (owner_task != current)
> down_try_futex(lock, owner_task->thread_info __EIP__);
>
> In the above scenario, the thread would have acquired the
> uncontended
> lock first time around in userspace. The second time it tries to
> acquire the same mutex, because of the above check, does not
> call down_try_futex and hence will not initialize the lock structure
> in the kernel. It then goes to __down_interruptible where it is
> granted the lock, which is wrong.
>
> So IMO the above check is not right. However removing this check
> is not the end of story. This time it gets to task_blocks_on_lock
> and tries to grab the task->pi_lock of the owvner which is itself
> and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
> is not set). So it looks like we need to add some check to
> prevent this below in case lock_owner happens to be current.
>
> _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>
>
>> However, I'm wondering if you've hit a bug that Dave Carlson is
>> reporting that he's tracked down to an inline in the glibc patch.
>
> Yes I noticed that, basically it looks like INLINE_SYSCALL, on error
> returns -1 with the error code in errno. Whereas we expect the
> return code to be the same as the kernel return code. Are you referring
> to this or something else ??
>
> However even with all of the above fixes (remove the check in
> down_futex
> as mentioned above, add a check in task_blocks_on_lock and the glibc
> changes) my test program continues to hang up the system, though it
> takes a lot longer to recreate the problem now
>
> [snip]
>
>> 1) Why did the library call into the kernel if the calling thread
>> owned the lock?
>
> This is something I still havent figured out and leads me to believe
> that we still have a tiny race race somewhere
>
> -Dinakar

2005-12-16 18:25:09

by Dinakar Guniguntala

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Thu, Dec 15, 2005 at 04:02:36PM -0800, david singleton wrote:
> Dinakar,
>
> I believe the problem we have is that the library is not checking
> to see if the mutex is a recursive mutex, and then checking to see
> if the recursive mutex is already owned by the calling thread. If a
> recursive mutex
> is owned by the calling thread the library should increment the lock
> count (POSIX says recursive mutexes must be unlocked as
> many times as they are locked) and return 0 (success) to the caller.

Sorry I seem to have confused you. I am _not_ talking about recursive
mutexes here.

I have two testcases. Here is a code snippet of testcase I

void* test_thread (void* arg)
{
pthread_mutex_lock(&child_mutex);
printf("test_thread got lock\n");

pthread_mutex_lock(&child_mutex);
printf("test_thread got lock 2nd time !!\n");

printf("test_thread exiting\n");
return NULL;
}

main ()
{
...

if (pthread_mutexattr_init(&mutexattr) != 0) {
printf("Failed to init mutexattr\n");
};
if (pthread_mutexattr_setprotocol(&mutexattr,
PTHREAD_PRIO_INHERIT) != 0) {
printf("Can't set protocol prio inherit\n");
}
if (pthread_mutexattr_getprotocol(&mutexattr, &protocol) != 0) {
printf("Can't get mutexattr protocol\n");
} else {
printf("protocol in mutexattr is %d\n", protocol);
}
if ((retc = pthread_mutex_init(&child_mutex, &mutexattr)) != 0) {
printf("Failed to init mutex: %d\n", retc);
}

...
}

Clearly what the application is doing here is wrong. However,

1. In the case of normal (non-robust) non recursive mutexes, the
behaviour when we make the second pthread_mutex_lock call is for glibc
to make a futex_wait call which will block forever.
(Which is the right behaviour)

2. In the case of a robust/PI non recursive mutex, the current
behaviour is the glibc makes a futex_wait_robust call (which is right)
The kernel (2.6.15-rc5-rt1) rt_mutex lock is currently unowned and
since we do not call down_try_futex if current == owner_task, we end
up grabbing the lock in __down_interruptible and returning succesfully !

3. Adding the check below in down_futex is also wrong

if (!owner_task || owner_task == current) {
up(sem);
up_read(&current->mm->mmap_sem);
return -EAGAIN;
}

This will result in glibc calling the kernel continuosly in a
loop and we will end up context switching to death

I guess we need to cleanup this path to ensure that the application
blocks forever.

I also have testcase II which does not do anything illegal as the
above one, instead it exercises the PI boosting code path in the
kernel and that is where I see the system hang up yet again
and this is the race that I am currently investigating

Hope that clearls up things a bit

-Dinakar

>
> I don't see anywhere in the library where the check for a recursive
> mutex is being made. The mutex_trylock code just does a compare
> and exchange on the lock to see if it can get it. A recursive mutex
> will fail this test and erroneously enter the kernel.
>
> I believe we need the down_futex change and a patch
> to glibc in which recursive mutexes are handled in the library.
>
> I'm talking to the library people now to see if that's the way
> it's supposed to work for recursive mutexes. If it is we'll have
> to provide a new glibc patch along with the kernel patch.
>
> I think I'll have to provide a new glibc patch anyways to
> fix the INLINE_SYSCALL problem.
>
> Does this make sense?
>
> David
>
> On Dec 15, 2005, at 11:44 AM, Dinakar Guniguntala wrote:
>
> >On Wed, Dec 14, 2005 at 05:03:13PM -0800, david singleton wrote:
> >>Dinakar,
> >>you may be correct, since reverting the change in down_futex seems
> >>to work.
> >
> >Well it did not. It ran for much longer than previously but still
> >hung up.
> >
> >Well we have a very basic problem in the current implementation.
> >
> >Currently if a thread calls pthread_mutex_lock on the same lock
> >(normal, non recursive lock) twice without unlocking in between, the
> >application hangs. Which is the right behaviour.
> >However if the same thing is done with a non recursive robust mutex,
> >it manages to acquire the lock.
> >
> >I see many problems here (I am assuming that the right behaviour
> >with robust mutexes is for application to ultimately block
> >indefinitely in the kernel)
> >
> >1. In down_futex we do the following check
> >
> > if (owner_task != current)
> > down_try_futex(lock, owner_task->thread_info __EIP__);
> >
> > In the above scenario, the thread would have acquired the
> >uncontended
> > lock first time around in userspace. The second time it tries to
> > acquire the same mutex, because of the above check, does not
> > call down_try_futex and hence will not initialize the lock structure
> > in the kernel. It then goes to __down_interruptible where it is
> > granted the lock, which is wrong.
> >
> > So IMO the above check is not right. However removing this check
> > is not the end of story. This time it gets to task_blocks_on_lock
> > and tries to grab the task->pi_lock of the owvner which is itself
> > and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
> > is not set). So it looks like we need to add some check to
> > prevent this below in case lock_owner happens to be current.
> >
> > _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >
> >
> >>However, I'm wondering if you've hit a bug that Dave Carlson is
> >>reporting that he's tracked down to an inline in the glibc patch.
> >
> >Yes I noticed that, basically it looks like INLINE_SYSCALL, on error
> >returns -1 with the error code in errno. Whereas we expect the
> >return code to be the same as the kernel return code. Are you referring
> >to this or something else ??
> >
> >However even with all of the above fixes (remove the check in
> >down_futex
> >as mentioned above, add a check in task_blocks_on_lock and the glibc
> >changes) my test program continues to hang up the system, though it
> >takes a lot longer to recreate the problem now
> >
> >[snip]
> >
> >>1) Why did the library call into the kernel if the calling thread
> >>owned the lock?
> >
> >This is something I still havent figured out and leads me to believe
> >that we still have a tiny race race somewhere
> >
> > -Dinakar
>

2005-12-16 21:26:40

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,

I believe this patch will give you the behavior you expect.

http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf3

With this patch your app that tries to lock a non-recursive
pthread_mutex twice
will not get EAGAIN back, it will hang.

I should point out that you will still get the new, and I had hoped,
added
benefit feature of being able to detect when you app would deadlock
itself
by turning on DEBUG_DEADLOCKS. It will give you the same
behavior that you originally posted. The kernel will inform you
of the deadlock.

If you need the POSIX behavior of having an app hang if it tries
to lock the same, non-recursive pthread_mutex twice then you need to
turn off
DEBUG_DEADLOCKS.

If you can send on the Priority Boost test I can start looking at it
as well.

David

>On Thu, Dec 15, 2005 at 04:02:36PM -0800, david singleton wrote:
>
>
>>Dinakar,
>>
>> I believe the problem we have is that the library is not checking
>>to see if the mutex is a recursive mutex, and then checking to see
>>if the recursive mutex is already owned by the calling thread. If a
>>recursive mutex
>>is owned by the calling thread the library should increment the lock
>>count (POSIX says recursive mutexes must be unlocked as
>>many times as they are locked) and return 0 (success) to the caller.
>>
>>
>
>Sorry I seem to have confused you. I am _not_ talking about recursive
>mutexes here.
>
>I have two testcases. Here is a code snippet of testcase I
>
> void* test_thread (void* arg)
> {
> pthread_mutex_lock(&child_mutex);
> printf("test_thread got lock\n");
>
> pthread_mutex_lock(&child_mutex);
> printf("test_thread got lock 2nd time !!\n");
>
> printf("test_thread exiting\n");
> return NULL;
> }
>
> main ()
> {
> ...
>
> if (pthread_mutexattr_init(&mutexattr) != 0) {
> printf("Failed to init mutexattr\n");
> };
> if (pthread_mutexattr_setprotocol(&mutexattr,
> PTHREAD_PRIO_INHERIT) != 0) {
> printf("Can't set protocol prio inherit\n");
> }
> if (pthread_mutexattr_getprotocol(&mutexattr, &protocol) != 0) {
> printf("Can't get mutexattr protocol\n");
> } else {
> printf("protocol in mutexattr is %d\n", protocol);
> }
> if ((retc = pthread_mutex_init(&child_mutex, &mutexattr)) != 0) {
> printf("Failed to init mutex: %d\n", retc);
> }
>
> ...
> }
>
>
>Clearly what the application is doing here is wrong. However,
>
>1. In the case of normal (non-robust) non recursive mutexes, the
>behaviour when we make the second pthread_mutex_lock call is for glibc
>to make a futex_wait call which will block forever.
>(Which is the right behaviour)
>
>2. In the case of a robust/PI non recursive mutex, the current
>behaviour is the glibc makes a futex_wait_robust call (which is right)
>The kernel (2.6.15-rc5-rt1) rt_mutex lock is currently unowned and
>since we do not call down_try_futex if current == owner_task, we end
>up grabbing the lock in __down_interruptible and returning succesfully !
>
>3. Adding the check below in down_futex is also wrong
>
> if (!owner_task || owner_task == current) {
> up(sem);
> up_read(&current->mm->mmap_sem);
> return -EAGAIN;
> }
>
> This will result in glibc calling the kernel continuosly in a
> loop and we will end up context switching to death
>
>I guess we need to cleanup this path to ensure that the application
>blocks forever.
>
>I also have testcase II which does not do anything illegal as the
>above one, instead it exercises the PI boosting code path in the
>kernel and that is where I see the system hang up yet again
>and this is the race that I am currently investigating
>
>Hope that clearls up things a bit
>
> -Dinakar
>
>
>

2005-12-19 11:39:22

by Dinakar Guniguntala

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Fri, Dec 16, 2005 at 01:26:44PM -0800, David Singleton wrote:
> Dinakar,
>
> I believe this patch will give you the behavior you expect.
>
> http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf3

Not really, the reason, quoting from my previous mail

So IMO the above check is not right. However removing this check
is not the end of story. This time it gets to task_blocks_on_lock
and tries to grab the task->pi_lock of the owvner which is itself
and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
is not set). So it looks like we need to add some check to
prevent this below in case lock_owner happens to be current.

_raw_spin_lock(&lock_owner(lock)->task->pi_lock);

The patch that works for me is attached below

-Dinakar

Attachments:

(No filename) (785.00 B)
check_current.patch (1.14 kB)
Download all attachments

2005-12-19 20:11:32

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar Guniguntala wrote:

Dinakar,
the new patch is available at

http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf4

thanks for the patch and the testing.

David

>On Fri, Dec 16, 2005 at 01:26:44PM -0800, David Singleton wrote:
>
>
>>Dinakar,
>>
>> I believe this patch will give you the behavior you expect.
>>
>> http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf3
>>
>>
>
>Not really, the reason, quoting from my previous mail
>
> So IMO the above check is not right. However removing this check
> is not the end of story. This time it gets to task_blocks_on_lock
> and tries to grab the task->pi_lock of the owvner which is itself
> and results in a system hang. (Assuming CONFIG_DEBUG_DEADLOCKS
> is not set). So it looks like we need to add some check to
> prevent this below in case lock_owner happens to be current.
>
> _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>
>The patch that works for me is attached below
>
> -Dinakar
>
>
>
>
>------------------------------------------------------------------------
>
>Index: linux-2.6.14-rt22-rayrt5/kernel/rt.c
>===================================================================
>--- linux-2.6.14-rt22-rayrt5.orig/kernel/rt.c 2005-12-15 02:15:13.000000000 +0530
>+++ linux-2.6.14-rt22-rayrt5/kernel/rt.c 2005-12-19 15:51:26.000000000 +0530
>@@ -1042,7 +1042,8 @@
> return;
> }
> #endif
>- _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>+ if (current != lock_owner(lock)->task)
>+ _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> plist_add(&waiter->pi_list, &lock_owner(lock)->task->pi_waiters);
> /*
> * Add RT tasks to the head:
>@@ -1055,7 +1056,8 @@
> */
> if (task->prio < lock_owner(lock)->task->prio)
> pi_setprio(lock, lock_owner(lock)->task, task->prio);
>- _raw_spin_unlock(&lock_owner(lock)->task->pi_lock);
>+ if (current != lock_owner(lock)->task)
>+ _raw_spin_unlock(&lock_owner(lock)->task->pi_lock);
> }
>
> /*
>@@ -3016,8 +3018,7 @@
> * the first waiter and we'll just block on the down_interruptible.
> */
>
>- if (owner_task != current)
>- down_try_futex(lock, owner_task->thread_info __EIP__);
>+ down_try_futex(lock, owner_task->thread_info __EIP__);
>
> /*
> * we can now drop the locks because the rt_mutex is held.
>
>

2005-12-20 13:20:35

[permalink] [raw]

Subject: Re: Recursion bug in -rt

* David Singleton <[email protected]> wrote:

> Dinakar,
> after further testing and investigation I believe you original
> assessment was correct.
> The problem you are seeing is not a library problem.
> The changes to down_futex need to be reverted. There is a new patch at
>
> http://source.mvista.com/~dsingleton/patch-2.6.15-rc5-rt2-rf2.
>
> that reverts the changes to down_futex.

hm, i'm looking at -rf4 - these changes look fishy:

- _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
+ if (current != lock_owner(lock)->task)
+ _raw_spin_lock(&lock_owner(lock)->task->pi_lock);

why is this done?

Ingo

2005-12-20 15:33:51

by Dinakar Guniguntala

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
>
> hm, i'm looking at -rf4 - these changes look fishy:
>
> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> + if (current != lock_owner(lock)->task)
> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>
> why is this done?
>

Ingo, this is to prevent a kernel hang due to application error.

Basically when an application does a pthread_mutex_lock twice on a
_nonrecursive_ mutex with robust/PI attributes the whole system hangs.
Ofcourse the application clearly should not be doing anything like
that, but it should not end up hanging the system either

-Dinakar

2005-12-20 17:44:14

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Dinakar Guniguntala wrote:

> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
> >
> > hm, i'm looking at -rf4 - these changes look fishy:
> >
> > - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> > + if (current != lock_owner(lock)->task)
> > + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >
> > why is this done?
> >
>
> Ingo, this is to prevent a kernel hang due to application error.
>
> Basically when an application does a pthread_mutex_lock twice on a
> _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> Ofcourse the application clearly should not be doing anything like
> that, but it should not end up hanging the system either
>

Hmm, reading the comment on the function, wouldn't it be more natural to
use
if(task != lock_owner(lock)->task)
as it assumes that task->pi_lock is locked, not that current->pi_lock is
locked.

By the way:
task->pi_lock is taken. lock_owner(lock)->task->pi_lock will be taken.
What if the task lock_owner(lock)->task tries to lock another futex,
(lock2) with which has lock_owner(lock2)->task==task.
Can't you promote a user space futex deadlock into a kernel spin deadlock
this way?

Esben

> -Dinakar
>
>

2005-12-20 19:34:16

by Steven Rostedt

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 2005-12-20 at 18:43 +0100, Esben Nielsen wrote:
>
> On Tue, 20 Dec 2005, Dinakar Guniguntala wrote:
>
> > On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
> > >
> > > hm, i'm looking at -rf4 - these changes look fishy:
> > >
> > > - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> > > + if (current != lock_owner(lock)->task)
> > > + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> > >
> > > why is this done?
> > >
> >
> > Ingo, this is to prevent a kernel hang due to application error.
> >
> > Basically when an application does a pthread_mutex_lock twice on a
> > _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> > Ofcourse the application clearly should not be doing anything like
> > that, but it should not end up hanging the system either
> >
>
> Hmm, reading the comment on the function, wouldn't it be more natural to
> use
> if(task != lock_owner(lock)->task)
> as it assumes that task->pi_lock is locked, not that current->pi_lock is
> locked.
>
> By the way:
> task->pi_lock is taken. lock_owner(lock)->task->pi_lock will be taken.
> What if the task lock_owner(lock)->task tries to lock another futex,
> (lock2) with which has lock_owner(lock2)->task==task.
> Can't you promote a user space futex deadlock into a kernel spin deadlock
> this way?

Yes!

The locking code of the pi locks in the rt.c code is VERY dependent on
the order of locks taken. It works by assuming the order of locks taken
will not themselves cause a deadlock. I just recently submitted a patch
to Ingo because I found that mutex_trylock can cause a deadlock, since
it is not bound to the order of locks.

So, to answer your question more formal this time. If the futex code
uses the pi_lock code in rt.c and the futex causes a deadlock, then the
kernel can deadlock too.

The benefit of this locking order is that we got rid of the global
pi_lock, and that was worth the problems you face today.

-- Steve

2005-12-20 20:42:46

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Steven Rostedt wrote:

> On Tue, 2005-12-20 at 18:43 +0100, Esben Nielsen wrote:
> >
> > On Tue, 20 Dec 2005, Dinakar Guniguntala wrote:
> >
> > > On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
> > > >
> > > > hm, i'm looking at -rf4 - these changes look fishy:
> > > >
> > > > - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> > > > + if (current != lock_owner(lock)->task)
> > > > + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> > > >
> > > > why is this done?
> > > >
> > >
> > > Ingo, this is to prevent a kernel hang due to application error.
> > >
> > > Basically when an application does a pthread_mutex_lock twice on a
> > > _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> > > Ofcourse the application clearly should not be doing anything like
> > > that, but it should not end up hanging the system either
> > >
> >
> > Hmm, reading the comment on the function, wouldn't it be more natural to
> > use
> > if(task != lock_owner(lock)->task)
> > as it assumes that task->pi_lock is locked, not that current->pi_lock is
> > locked.
> >
> > By the way:
> > task->pi_lock is taken. lock_owner(lock)->task->pi_lock will be taken.
> > What if the task lock_owner(lock)->task tries to lock another futex,
> > (lock2) with which has lock_owner(lock2)->task==task.
> > Can't you promote a user space futex deadlock into a kernel spin deadlock
> > this way?
>
> Yes!
>
> The locking code of the pi locks in the rt.c code is VERY dependent on
> the order of locks taken. It works by assuming the order of locks taken
> will not themselves cause a deadlock. I just recently submitted a patch
> to Ingo because I found that mutex_trylock can cause a deadlock, since
> it is not bound to the order of locks.
>
> So, to answer your question more formal this time. If the futex code
> uses the pi_lock code in rt.c and the futex causes a deadlock, then the
> kernel can deadlock too.
>
This is ok for kernel mutexes, which are supposed not to cause deadlocks.
But user space deadlocks must not cause kernel deadlocks. Therefore the
robust futex code _must_ be fixed.

> The benefit of this locking order is that we got rid of the global
> pi_lock, and that was worth the problems you face today.
>

I believe this problem can be solved in the pi_lock code - but it will
require quite a bit of recoding. I started on it a little while ago but
didn't at all get the time to get into anything to even compile :-(
I don't have time to finish any code at all but I guess I can plant an
the idea instead:

When resolving the mutex chain (task A locks mutex 1 owned by B blocked
on 2 owned by C etc) for PI boosting and also when finding deadlocks,
release _all_ locks before going to the next step in the chain. Use
get_task_struct on the next task in the chain, release the locks,
take the pi_locks in a fixed order (sort by address forinstance), do the
PI boosting, get_task_struct on the next lock, release all the locks, do
the put_task_struct() on the previous task etc.
This a lot more expensive approach as it involves double as many spinlock
operations and get/put_task_struct() calls , but it has the added benifit
of reducing the overall system latency for long locking chains as there
are spots where interrupts and preemption will be enabled for each step in
the chain. Remember also that deep locking chains should be considered an
exeption so the code doesn't have to be optimal wrt. performance.

I added my feeble attempts to implement this below. I have no chance of
ever getting time finish it :-(

> -- Steve
>
>

This is a "near" replacement for task_blocks_on_lock(). It leaves with all
spinlocks unlocked. A lot of other code needs to be changed as well to
accept the new right order of locking.

static void
task_blocks_on_lock2(struct rt_mutex_waiter *waiter, struct thread_info *ti,
struct rt_mutex *lock __EIP_DECL__)
{
task_t *owner, *blocked_task = ti->task;

#ifdef CONFIG_RT_DEADLOCK_DETECT
/* mark the current thread as blocked on the lock */
waiter->eip = eip;
#endif

blocked_task->blocked_on = waiter;
waiter->lock = lock;
waiter->ti = blocked_task->thread_info;
waiter->on_pi_waiters = 0;
plist_init(&waiter->pi_list, blocked_task->prio);
plist_init(&waiter->list, blocked_task->prio);

SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&blocked_task->pi_lock));
SMP_TRACE_BUG_ON_LOCKED(!spin_is_locked(&lock->wait_lock));

plist_add(&waiter->list, &lock->wait_list);
set_lock_owner_pending(lock);

owner = lock_owner_task(lock);
/* Put code to grab the lock and return on success here */

/* Code for waiting: */
get_task_struct(owner);
get_task_struct(blocked_task);
blocked_task->blocked_on = waiter;

_raw_spin_unlock(&lock->wait_lock);
_raw_spin_unlock(&current->pi_lock);

while(owner) {
/* No spin-locks held here! */

struct rt_mutex *this_lock;
task_t *next_owner = NULL;

if(owner-blocked_task<0) {
_raw_spin_lock(&owner->pi_lock);
_raw_spin_lock(&blocked_task->pi_lock);
}
else {
_raw_spin_lock(&blocked_task->pi_lock);
_raw_spin_lock(&owner->pi_lock);
}

if(!blocked_task->blocked_on ||
!blocked_task->blocked_on->ti) {
/* The lock have been released!! */
_raw_spin_unlock(&blocked_task->pi_lock);
_raw_spin_unlock(&owner->pi_lock);
put_task_struct(owner);
break;
}
waiter = blocked_task->blocked_on;
this_lock = waiter->lock;
_raw_spin_lock(&this_lock->wait_lock);
if(lock_owner_task(this_lock)!=owner) {
/* Ups, owner have changed
redo it with the new owner */
next_owner = lock_owner_task(this_lock);
if(next_owner)
get_task_struct(next_owner);
_raw_spin_unlock(&this_lock->wait_lock);
_raw_spin_unlock(&blocked_task->pi_lock);
_raw_spin_unlock(&owner->pi_lock);
put_task_struct(owner);
owner = next_owner;
continue;
}

/* Prio might have changed - it does so in the
previous loop. So the lists needs to be resorted
*/
plist_del(&waiter->list, &lock->wait_list);
plist_init(&waiter->list, blocked_task->prio);
plist_add(&waiter->list, &lock->wait_list);

if(waiter->on_pi_waiters) {
plist_del(&waiter->pi_list,
&owner->pi_waiters);
waiter->on_pi_waiters = 0;
}

/* We only need the highest priority waiter on
owner->pi_list */
if(plist_first_entry(&lock->wait_list,
struct rt_mutex_waiter, pi_list)
== waiter) {
plist_init(&waiter->pi_list, blocked_task->prio);
plist_add(&waiter->pi_list,
&owner->pi_waiters);
waiter->on_pi_waiters = 1;
}

_raw_spin_unlock(&this_lock->wait_lock);
_raw_spin_unlock(&blocked_task->pi_lock);
if(DEADLOCK_DETECT || owner->prio > blocked_task->prio) {
if(owner->prio > blocked_task->prio)
mutex_setprio(owner,blocked_task->prio);
if(owner->blocked_on) {
waiter = owner->blocked_on;
_raw_spin_lock(&waiter->lock->wait_lock);
next_owner = lock_owner_task(waiter->lock);
if(next_owner)
get_task_struct(next_owner);
_raw_spin_unlock(&waiter->lock->wait_lock);
}
}
put_task_struct(blocked_task);
blocked_task = owner;
owner = next_owner;
}
BUG_ON(!blocked_task);
put_task_struct(blocked_task);
}

2005-12-20 21:20:37

by Steven Rostedt

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 2005-12-20 at 21:42 +0100, Esben Nielsen wrote:

> >
> This is ok for kernel mutexes, which are supposed not to cause deadlocks.
> But user space deadlocks must not cause kernel deadlocks. Therefore the
> robust futex code _must_ be fixed.

And it should be fixed in the futex side. Not in rt.c.

>
> > The benefit of this locking order is that we got rid of the global
> > pi_lock, and that was worth the problems you face today.
> >
>
> I believe this problem can be solved in the pi_lock code - but it will
> require quite a bit of recoding. I started on it a little while ago but
> didn't at all get the time to get into anything to even compile :-(
> I don't have time to finish any code at all but I guess I can plant an
> the idea instead:

I removed the global pi_lock in two sleepless days, and it was all on
the theory that the locks themselves would not deadlock. That was the
only sanity I was able to hang on to. That was complex enough, and very
scary to get right (scary since it _had_ to be done right). And it was
complex enough to keep a highly skilled programmer living in Hungary
from doing it himself (not to say he couldn't do it, just complex enough
for him to put it off for quite some time). And one must also worry
about the BKL which is a separate beast all together.

So making it any more complex is IMO out of the question.

>
> When resolving the mutex chain (task A locks mutex 1 owned by B blocked
> on 2 owned by C etc) for PI boosting and also when finding deadlocks,
> release _all_ locks before going to the next step in the chain. Use
> get_task_struct on the next task in the chain, release the locks,
> take the pi_locks in a fixed order (sort by address forinstance), do the
> PI boosting, get_task_struct on the next lock, release all the locks, do
> the put_task_struct() on the previous task etc.
> This a lot more expensive approach as it involves double as many spinlock
> operations and get/put_task_struct() calls , but it has the added benifit
> of reducing the overall system latency for long locking chains as there
> are spots where interrupts and preemption will be enabled for each step in
> the chain. Remember also that deep locking chains should be considered an
> exeption so the code doesn't have to be optimal wrt. performance.

The long lock holding is only by the lock being grabbed and the owner
grabbing it. All other locks don't need to be held for long periods of
time. There's lots of issues if you release these two locks. How do you
deal with the mutex being released while going up the chain?

>
> I added my feeble attempts to implement this below. I have no chance of
> ever getting time finish it :-(

I doubt they were feeble, but just proof that this approach is far too
complex. As I said, if you don't want futex to deadlock the kernel, the
API for futex should have deadlocking checks, since the only way this
can deadlock the system, is if two threads are in the kernel at the same
time.

Also, the ones who are paying me to help out the -rt kernel, don't care
if we let the user side deadlock the system. They deliver a complete
package, kernel and user apps, and nothing else is to be running on the
system. This means that if the user app deadlocks, it doesn't matter if
the kernel deadlocks or not, because the deadlocking of the user app
means the system has failed. And I have a feeling that a lot of other
users of -rt feel the same way.

-- Steve

2005-12-20 21:55:23

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Dec 20, 2005, at 1:20 PM, Steven Rostedt wrote:

> On Tue, 2005-12-20 at 21:42 +0100, Esben Nielsen wrote:
>
>>>
>> This is ok for kernel mutexes, which are supposed not to cause
>> deadlocks.
>> But user space deadlocks must not cause kernel deadlocks. Therefore
>> the
>> robust futex code _must_ be fixed.
>
> And it should be fixed in the futex side. Not in rt.c.

Yes. I believe you are right about this. The bad news is POSIX
specifies
that an app that calls lock_mutex twice on the same mutex hangs.
EINVAL or EWOULDBLOCK are not POSIX compliant.

Let me think on this a while and see if I can come up with an idea that
will let the app hang itself (and be POSIX compliant) and not hang the
kernel.

Perhaps I'll just detect it and hang them on a waitqueue waiting
for the user to get impatient and hit control C . . . .

David

>
>>
>>> The benefit of this locking order is that we got rid of the global
>>> pi_lock, and that was worth the problems you face today.
>>>
>>
>> I believe this problem can be solved in the pi_lock code - but it will
>> require quite a bit of recoding. I started on it a little while ago
>> but
>> didn't at all get the time to get into anything to even compile :-(
>> I don't have time to finish any code at all but I guess I can plant an
>> the idea instead:
>
> I removed the global pi_lock in two sleepless days, and it was all on
> the theory that the locks themselves would not deadlock. That was the
> only sanity I was able to hang on to. That was complex enough, and
> very
> scary to get right (scary since it _had_ to be done right). And it was
> complex enough to keep a highly skilled programmer living in Hungary
> from doing it himself (not to say he couldn't do it, just complex
> enough
> for him to put it off for quite some time). And one must also worry
> about the BKL which is a separate beast all together.
>
> So making it any more complex is IMO out of the question.
>
>>
>> When resolving the mutex chain (task A locks mutex 1 owned by B
>> blocked
>> on 2 owned by C etc) for PI boosting and also when finding deadlocks,
>> release _all_ locks before going to the next step in the chain. Use
>> get_task_struct on the next task in the chain, release the locks,
>> take the pi_locks in a fixed order (sort by address forinstance), do
>> the
>> PI boosting, get_task_struct on the next lock, release all the locks,
>> do
>> the put_task_struct() on the previous task etc.
>> This a lot more expensive approach as it involves double as many
>> spinlock
>> operations and get/put_task_struct() calls , but it has the added
>> benifit
>> of reducing the overall system latency for long locking chains as
>> there
>> are spots where interrupts and preemption will be enabled for each
>> step in
>> the chain. Remember also that deep locking chains should be
>> considered an
>> exeption so the code doesn't have to be optimal wrt. performance.
>
> The long lock holding is only by the lock being grabbed and the owner
> grabbing it. All other locks don't need to be held for long periods of
> time. There's lots of issues if you release these two locks. How do
> you
> deal with the mutex being released while going up the chain?
>
>>
>> I added my feeble attempts to implement this below. I have no chance
>> of
>> ever getting time finish it :-(
>
> I doubt they were feeble, but just proof that this approach is far too
> complex. As I said, if you don't want futex to deadlock the kernel,
> the
> API for futex should have deadlocking checks, since the only way this
> can deadlock the system, is if two threads are in the kernel at the
> same
> time.
>
> Also, the ones who are paying me to help out the -rt kernel, don't care
> if we let the user side deadlock the system. They deliver a complete
> package, kernel and user apps, and nothing else is to be running on the
> system. This means that if the user app deadlocks, it doesn't matter
> if
> the kernel deadlocks or not, because the deadlocking of the user app
> means the system has failed. And I have a feeling that a lot of other
> users of -rt feel the same way.
>
> -- Steve
>
>
> _______________________________________________
> robustmutexes mailing list
> [email protected]
> https://lists.osdl.org/mailman/listinfo/robustmutexes

2005-12-20 22:43:37

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Steven Rostedt wrote:

> On Tue, 2005-12-20 at 21:42 +0100, Esben Nielsen wrote:
>
> > >
> > This is ok for kernel mutexes, which are supposed not to cause deadlocks.
> > But user space deadlocks must not cause kernel deadlocks. Therefore the
> > robust futex code _must_ be fixed.
>
> And it should be fixed in the futex side. Not in rt.c.
>
> >
> > > The benefit of this locking order is that we got rid of the global
> > > pi_lock, and that was worth the problems you face today.
> > >
> >
> > I believe this problem can be solved in the pi_lock code - but it will
> > require quite a bit of recoding. I started on it a little while ago but
> > didn't at all get the time to get into anything to even compile :-(
> > I don't have time to finish any code at all but I guess I can plant an
> > the idea instead:
>
> I removed the global pi_lock in two sleepless days, and it was all on
> the theory that the locks themselves would not deadlock. That was the
> only sanity I was able to hang on to. That was complex enough, and very
> scary to get right (scary since it _had_ to be done right).
Gosh I wish I had such nights available.... :-( And then some equipment to
test the result on.

> And it was
> complex enough to keep a highly skilled programmer living in Hungary
> from doing it himself (not to say he couldn't do it, just complex enough
> for him to put it off for quite some time).
My guess is that he focuses on other areas. The global pi_lock worked - it
just didn't scale.

> And one must also worry
> about the BKL which is a separate beast all together.

The BKL is the scary point from my point of view. The rest of it is really
not very Linux specific and is really just elaborating the textbook
implementations of mutexes.

>
> So making it any more complex is IMO out of the question.

Well the lock grabbing was making it very complex, too :-)

>
> >
> > When resolving the mutex chain (task A locks mutex 1 owned by B blocked
> > on 2 owned by C etc) for PI boosting and also when finding deadlocks,
> > release _all_ locks before going to the next step in the chain. Use
> > get_task_struct on the next task in the chain, release the locks,
> > take the pi_locks in a fixed order (sort by address forinstance), do the
> > PI boosting, get_task_struct on the next lock, release all the locks, do
> > the put_task_struct() on the previous task etc.
> > This a lot more expensive approach as it involves double as many spinlock
> > operations and get/put_task_struct() calls , but it has the added benifit
> > of reducing the overall system latency for long locking chains as there
> > are spots where interrupts and preemption will be enabled for each step in
> > the chain. Remember also that deep locking chains should be considered an
> > exeption so the code doesn't have to be optimal wrt. performance.
>
> The long lock holding is only by the lock being grabbed and the owner
> grabbing it. All other locks don't need to be held for long periods of
> time.

Preemption is disabled all along because you always hold at least one
spinlock.

> There's lots of issues if you release these two locks. How do you
> deal with the mutex being released while going up the chain?
>

You have to recheck your conditions again once you have retaken the
necesary locks. I do that in the code below.

> >
> > I added my feeble attempts to implement this below. I have no chance of
> > ever getting time finish it :-(
>
> I doubt they were feeble, but just proof that this approach is far too
> complex. As I said, if you don't want futex to deadlock the kernel, the
> API for futex should have deadlocking checks, since the only way this
> can deadlock the system, is if two threads are in the kernel at the same
> time.

That that code needs to traverse the locks. The PI code need to traverse
the locks. It would be simpler to do it in one go... Ofcourse, all
robust futex calls could take one global mutex akin to the old pi_lock you
removed, to fix it now, but that is a hack.

>
> Also, the ones who are paying me to help out the -rt kernel, don't care
> if we let the user side deadlock the system. They deliver a complete
> package, kernel and user apps, and nothing else is to be running on the
> system. This means that if the user app deadlocks, it doesn't matter if
> the kernel deadlocks or not, because the deadlocking of the user app
> means the system has failed. And I have a feeling that a lot of other
> users of -rt feel the same way.

1) If PREEMPT_RT ever goes into the main line kernel this kind of aproach
will _not_ work.
2) A for me for using Linux 2.6-rt over some specialized RTOS would be
that I could have my RT task running without any risk of being destroyed
by even buggy normal tasks.
What if you run, say apache, compiled with robust futexes and it
deadlocks? Normally I would simply restart apache. Customers could easily
accept that I restart apache once in a while - they wouldn't even notice.
But they can't accept that the system should reboot.

Esben

>
> -- Steve
>
>

2005-12-20 22:57:24

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, david singleton wrote:

>
> On Dec 20, 2005, at 1:20 PM, Steven Rostedt wrote:
>
> > On Tue, 2005-12-20 at 21:42 +0100, Esben Nielsen wrote:
> >
> >>>
> >> This is ok for kernel mutexes, which are supposed not to cause
> >> deadlocks.
> >> But user space deadlocks must not cause kernel deadlocks. Therefore
> >> the
> >> robust futex code _must_ be fixed.
> >
> > And it should be fixed in the futex side. Not in rt.c.
>
> Yes. I believe you are right about this. The bad news is POSIX
> specifies
> that an app that calls lock_mutex twice on the same mutex hangs.
> EINVAL or EWOULDBLOCK are not POSIX compliant.
>
> Let me think on this a while and see if I can come up with an idea that
> will let the app hang itself (and be POSIX compliant) and not hang the
> kernel.
>
> Perhaps I'll just detect it and hang them on a waitqueue waiting
> for the user to get impatient and hit control C . . . .
>

The same lock taken twice is just a special case of deadlocking. It would
be very hard to check for the general case in the futex code without
"fixing" the rt_mutex. Not that the rt_mutex code is broken - it just
doesn't handle deadlocks very well as it wasn't supposed to. But as the
futex indirectly exposes the rt_mutex to userspace it becomes a problem.

The only _hack_ I can see is to force all robust futex calls to go through
one global lock to prevent the futex deadlocks becomming rt_mutex
deadlocks which again can turn into spin-lock deadlocks.

I instead argue for altering the premisses for the rt_mutex such
they can handle deadlocks without turning them into spin-lock deadlocks
blocking the whole system. Then a futex deadlock will become a rt_mutex
deadlock which can be handled.

Esben

>
>
> David
>
> >
> >>
> >>> The benefit of this locking order is that we got rid of the global
> >>> pi_lock, and that was worth the problems you face today.
> >>>
> >>
> >> I believe this problem can be solved in the pi_lock code - but it will
> >> require quite a bit of recoding. I started on it a little while ago
> >> but
> >> didn't at all get the time to get into anything to even compile :-(
> >> I don't have time to finish any code at all but I guess I can plant an
> >> the idea instead:
> >
> > I removed the global pi_lock in two sleepless days, and it was all on
> > the theory that the locks themselves would not deadlock. That was the
> > only sanity I was able to hang on to. That was complex enough, and
> > very
> > scary to get right (scary since it _had_ to be done right). And it was
> > complex enough to keep a highly skilled programmer living in Hungary
> > from doing it himself (not to say he couldn't do it, just complex
> > enough
> > for him to put it off for quite some time). And one must also worry
> > about the BKL which is a separate beast all together.
> >
> > So making it any more complex is IMO out of the question.
> >
> >>
> >> When resolving the mutex chain (task A locks mutex 1 owned by B
> >> blocked
> >> on 2 owned by C etc) for PI boosting and also when finding deadlocks,
> >> release _all_ locks before going to the next step in the chain. Use
> >> get_task_struct on the next task in the chain, release the locks,
> >> take the pi_locks in a fixed order (sort by address forinstance), do
> >> the
> >> PI boosting, get_task_struct on the next lock, release all the locks,
> >> do
> >> the put_task_struct() on the previous task etc.
> >> This a lot more expensive approach as it involves double as many
> >> spinlock
> >> operations and get/put_task_struct() calls , but it has the added
> >> benifit
> >> of reducing the overall system latency for long locking chains as
> >> there
> >> are spots where interrupts and preemption will be enabled for each
> >> step in
> >> the chain. Remember also that deep locking chains should be
> >> considered an
> >> exeption so the code doesn't have to be optimal wrt. performance.
> >
> > The long lock holding is only by the lock being grabbed and the owner
> > grabbing it. All other locks don't need to be held for long periods of
> > time. There's lots of issues if you release these two locks. How do
> > you
> > deal with the mutex being released while going up the chain?
> >
> >>
> >> I added my feeble attempts to implement this below. I have no chance
> >> of
> >> ever getting time finish it :-(
> >
> > I doubt they were feeble, but just proof that this approach is far too
> > complex. As I said, if you don't want futex to deadlock the kernel,
> > the
> > API for futex should have deadlocking checks, since the only way this
> > can deadlock the system, is if two threads are in the kernel at the
> > same
> > time.
> >
> > Also, the ones who are paying me to help out the -rt kernel, don't care
> > if we let the user side deadlock the system. They deliver a complete
> > package, kernel and user apps, and nothing else is to be running on the
> > system. This means that if the user app deadlocks, it doesn't matter
> > if
> > the kernel deadlocks or not, because the deadlocking of the user app
> > means the system has failed. And I have a feeling that a lot of other
> > users of -rt feel the same way.
> >
> > -- Steve
> >
> >
> > _______________________________________________
> > robustmutexes mailing list
> > [email protected]
> > https://lists.osdl.org/mailman/listinfo/robustmutexes
>

2005-12-20 22:59:49

by Steven Rostedt

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Esben Nielsen wrote:
> On Tue, 20 Dec 2005, Steven Rostedt wrote:
> > On Tue, 2005-12-20 at 21:42 +0100, Esben Nielsen wrote:

> > I removed the global pi_lock in two sleepless days, and it was all on
> > the theory that the locks themselves would not deadlock. That was the
> > only sanity I was able to hang on to. That was complex enough, and very
> > scary to get right (scary since it _had_ to be done right).
> Gosh I wish I had such nights available.... :-( And then some equipment to
> test the result on.

My wife was nice enough to handle the kids during that time ;)

>
> > And it was
> > complex enough to keep a highly skilled programmer living in Hungary
> > from doing it himself (not to say he couldn't do it, just complex enough
> > for him to put it off for quite some time).
> My guess is that he focuses on other areas. The global pi_lock worked - it
> just didn't scale.

True.

>
> > And one must also worry
> > about the BKL which is a separate beast all together.
>
> The BKL is the scary point from my point of view. The rest of it is really
> not very Linux specific and is really just elaborating the textbook
> implementations of mutexes.

Yeah, but I always found that a lot of text book implementations don't
apply very well to the real world without some sort of hack.

>
> >
> > So making it any more complex is IMO out of the question.
>
> Well the lock grabbing was making it very complex, too :-)

True, and that was part of my point.

>
> >
> > >
> > > When resolving the mutex chain (task A locks mutex 1 owned by B blocked
> > > on 2 owned by C etc) for PI boosting and also when finding deadlocks,
> > > release _all_ locks before going to the next step in the chain. Use
> > > get_task_struct on the next task in the chain, release the locks,
> > > take the pi_locks in a fixed order (sort by address forinstance), do the
> > > PI boosting, get_task_struct on the next lock, release all the locks, do
> > > the put_task_struct() on the previous task etc.
> > > This a lot more expensive approach as it involves double as many spinlock
> > > operations and get/put_task_struct() calls , but it has the added benifit
> > > of reducing the overall system latency for long locking chains as there
> > > are spots where interrupts and preemption will be enabled for each step in
> > > the chain. Remember also that deep locking chains should be considered an
> > > exeption so the code doesn't have to be optimal wrt. performance.
> >
> > The long lock holding is only by the lock being grabbed and the owner
> > grabbing it. All other locks don't need to be held for long periods of
> > time.
>
> Preemption is disabled all along because you always hold at least one
> spinlock.

But you still need to worry about SMP.

>
> > There's lots of issues if you release these two locks. How do you
> > deal with the mutex being released while going up the chain?
> >
>
> You have to recheck your conditions again once you have retaken the
> necesary locks. I do that in the code below.

Sorry, I didn't quite take the time to thoroughly go through your code.
I really should before responding, but I'm a tad busy at the moment,
which I'm sure you understand.

>
> > >
> > > I added my feeble attempts to implement this below. I have no chance of
> > > ever getting time finish it :-(
> >
> > I doubt they were feeble, but just proof that this approach is far too
> > complex. As I said, if you don't want futex to deadlock the kernel, the
> > API for futex should have deadlocking checks, since the only way this
> > can deadlock the system, is if two threads are in the kernel at the same
> > time.
>
> That that code needs to traverse the locks. The PI code need to traverse
> the locks. It would be simpler to do it in one go... Ofcourse, all
> robust futex calls could take one global mutex akin to the old pi_lock you
> removed, to fix it now, but that is a hack.
>
> >
> > Also, the ones who are paying me to help out the -rt kernel, don't care
> > if we let the user side deadlock the system. They deliver a complete
> > package, kernel and user apps, and nothing else is to be running on the
> > system. This means that if the user app deadlocks, it doesn't matter if
> > the kernel deadlocks or not, because the deadlocking of the user app
> > means the system has failed. And I have a feeling that a lot of other
> > users of -rt feel the same way.
>
> 1) If PREEMPT_RT ever goes into the main line kernel this kind of aproach
> will _not_ work.

:) I know!

That last paragraph came out when I was in the middle of debugging my
kernel. It suddenly dawned on me that the one who pays me must be
satisfied first. And you got me in one of my working moments.

There's a lot I do just to help out -rt that doesn't directly go along
with what I'm paid to do, and if at the time I was doing that, I would
not have responded as I did ;)

> 2) A for me for using Linux 2.6-rt over some specialized RTOS would be
> that I could have my RT task running without any risk of being destroyed
> by even buggy normal tasks.
> What if you run, say apache, compiled with robust futexes and it
> deadlocks? Normally I would simply restart apache. Customers could easily
> accept that I restart apache once in a while - they wouldn't even notice.
> But they can't accept that the system should reboot.
>

Point taken. But as far as I'm concerned, the code to deal with this
really needs to be satisfied by futex. I'm not saying it's "not my
problem", I would gladly modify rt.c to help. But I would not sacrifice
performance, or add much more complexity to do so.

-- Steve

2005-12-20 23:13:17

by Steven Rostedt

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Esben Nielsen wrote:

> >
>
> The same lock taken twice is just a special case of deadlocking. It would
> be very hard to check for the general case in the futex code without
> "fixing" the rt_mutex. Not that the rt_mutex code is broken - it just
> doesn't handle deadlocks very well as it wasn't supposed to. But as the
> futex indirectly exposes the rt_mutex to userspace it becomes a problem.
>
> The only _hack_ I can see is to force all robust futex calls to go through
> one global lock to prevent the futex deadlocks becomming rt_mutex
> deadlocks which again can turn into spin-lock deadlocks.
>
> I instead argue for altering the premisses for the rt_mutex such
> they can handle deadlocks without turning them into spin-lock deadlocks
> blocking the whole system. Then a futex deadlock will become a rt_mutex
> deadlock which can be handled.
>

For the type of deadlock you are talking about is the following:

P1 -- grabs futex A (no system call)
P2 -- grabs futex B (no system call)

P1 -- tries to grab futex B (system call to block and boost P2)
But holds no other kernel rt_mutex!
P2 -- tries to grab futex A (system call to block and boost P1)
spinning deadlock here,

So, before P2 blocks on P1, can there be a circular check t see if this is
a deadlock. You don't need to worry about other kernel rt_mutexes, you
only need to worry about blocked process.

Is this feasible?

-- Steve

2005-12-20 23:56:23

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Tue, 20 Dec 2005, Steven Rostedt wrote:

>
> On Tue, 20 Dec 2005, Esben Nielsen wrote:
>
> > >
> >
> > The same lock taken twice is just a special case of deadlocking. It would
> > be very hard to check for the general case in the futex code without
> > "fixing" the rt_mutex. Not that the rt_mutex code is broken - it just
> > doesn't handle deadlocks very well as it wasn't supposed to. But as the
> > futex indirectly exposes the rt_mutex to userspace it becomes a problem.
> >
> > The only _hack_ I can see is to force all robust futex calls to go through
> > one global lock to prevent the futex deadlocks becomming rt_mutex
> > deadlocks which again can turn into spin-lock deadlocks.
> >
> > I instead argue for altering the premisses for the rt_mutex such
> > they can handle deadlocks without turning them into spin-lock deadlocks
> > blocking the whole system. Then a futex deadlock will become a rt_mutex
> > deadlock which can be handled.
> >
>
> For the type of deadlock you are talking about is the following:
>
> P1 -- grabs futex A (no system call)
> P2 -- grabs futex B (no system call)
>
> P1 -- tries to grab futex B (system call to block and boost P2)
> But holds no other kernel rt_mutex!
> P2 -- tries to grab futex A (system call to block and boost P1)
> spinning deadlock here,
>
> So, before P2 blocks on P1, can there be a circular check t see if this is
> a deadlock. You don't need to worry about other kernel rt_mutexes, you
> only need to worry about blocked process.
>
> Is this feasible?
Ofcourse it is - but it the exact same kind of traversal you do in the
rt_mutex part to resolve the PI boosting. Thus by making the futex code do
this by it's own you essentially just move the complexity in there and
make the total more complex. Notice, that the futex code can't rely on
user-space flag - it can't be trusted for this (opening a local DOS
attack). So it has to rely on the rt_mutex structure to do this - and
therefore also the locks in there to protect against simultanious
unlocking.

Esben

>
> -- Steve
>

2005-12-22 04:37:43

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Dec 20, 2005, at 3:55 PM, Esben Nielsen wrote:

> On Tue, 20 Dec 2005, Steven Rostedt wrote:
>
>>
>> On Tue, 20 Dec 2005, Esben Nielsen wrote:
>>
>>>>
>>>
>>> The same lock taken twice is just a special case of deadlocking. It
>>> would
>>> be very hard to check for the general case in the futex code without
>>> "fixing" the rt_mutex. Not that the rt_mutex code is broken - it just
>>> doesn't handle deadlocks very well as it wasn't supposed to. But as
>>> the
>>> futex indirectly exposes the rt_mutex to userspace it becomes a
>>> problem.
>>>
>>> The only _hack_ I can see is to force all robust futex calls to go
>>> through
>>> one global lock to prevent the futex deadlocks becomming rt_mutex
>>> deadlocks which again can turn into spin-lock deadlocks.
>>>
>>> I instead argue for altering the premisses for the rt_mutex such
>>> they can handle deadlocks without turning them into spin-lock
>>> deadlocks
>>> blocking the whole system. Then a futex deadlock will become a
>>> rt_mutex
>>> deadlock which can be handled.
>>>
>>
>> For the type of deadlock you are talking about is the following:
>>
>> P1 -- grabs futex A (no system call)
>> P2 -- grabs futex B (no system call)
>>
>> P1 -- tries to grab futex B (system call to block and boost P2)
>> But holds no other kernel rt_mutex!
>> P2 -- tries to grab futex A (system call to block and boost P1)
>> spinning deadlock here,
>>
>> So, before P2 blocks on P1, can there be a circular check t see if
>> this is
>> a deadlock. You don't need to worry about other kernel rt_mutexes,
>> you
>> only need to worry about blocked process.
>>
>> Is this feasible?
> Ofcourse it is - but it the exact same kind of traversal you do in the
> rt_mutex part to resolve the PI boosting. Thus by making the futex
> code do
> this by it's own you essentially just move the complexity in there and
> make the total more complex. Notice, that the futex code can't rely on
> user-space flag - it can't be trusted for this (opening a local DOS
> attack). So it has to rely on the rt_mutex structure to do this - and
> therefore also the locks in there to protect against simultanious
> unlocking.

Correct. And it's one of the reasons why I chose the rt_mutex
to back the pthread_mutex. First we add robustness and
then with the rt_mutex we get:

1) priority queuing. Tasks blocking on pthread_mutexes
are queued in priority order, and the highest priority task is the
first to be woken,
with or without priority inheritance.

2) priority inheritance, if so desired.

3) With DEBUG_DEADLOCKS on we get deadlock detection.

I was hoping people would use the deadlock detection to
fix incorrect apps, not file bugs against the kernel.

The real difficulty lies with POSIX stating that a non-recursive
mutex will hang if locked twice. I would prefer to return
-EWOULDDEADLOCK and not hang the app or the kernel,
and we could in the case of robust mutexes, but POSIX
priority inheriting mutexes could not, and be POSIX compliant.

It would be simple to detect a simple deadlock, it's much more
difficult to detect circular deadlocks, which is why I left it up
to the rt_mutex's deadlock detect code.

I understand the position that misbehaved apps should not
be able to hang the kernel though.

And this is where I ask two things:

1) The really smart people helping with a brilliant solution.

2) Anyone has a nice 4-way or 8-way for debugging purposes?

I think I've pretty much exhausted UP debugging.

David

>
>>
>> -- Steve
>>
>

2006-01-03 01:54:46

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,
can you try patch-2.6.15-rc7-rt3-rf1 on
http://source.mvista.com/~dsingleton/ and see if it works for your
tests?

This new patch creates a 'futex_deadlock' semaphore that we hang
applications that
are deadlocking themselves. This method will only hang the
application, not the system,
as no other locks are held, like the mmap_sem, just the futex_deadlock
semaphore.

NOTE: for pthread_mutexes that are robust but NOT POSIX priority
inheriting I return -EWOULDDEADLOCK,
since there is no POSIX specfication for robust pthread_mutexes yet.
POSIX PI pthread_mutexes will hang
on the futex_deadlock semaphore.

Let me know how it works.

David

On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:

> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
>>
>> hm, i'm looking at -rf4 - these changes look fishy:
>>
>> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>> + if (current != lock_owner(lock)->task)
>> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>
>> why is this done?
>>
>
> Ingo, this is to prevent a kernel hang due to application error.
>
> Basically when an application does a pthread_mutex_lock twice on a
> _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> Ofcourse the application clearly should not be doing anything like
> that, but it should not end up hanging the system either
>
> -Dinakar
>

2006-01-05 02:14:59

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

Dinakar,
I've got a more complete patch for deadlock detection for robust
futexes.

The patch is at
http://source.mvista.com/~dsingleton/patch-2.6.15-rc7-rt3-rf2

This patch handles pthread_mutex deadlocks in two ways:

1) POSIX states that non-recursive pthread_mutexes hang if locked twice.
This patch hangs the deadlocking thread on a waitqueue. It releases
all other kernel locks, like the mmap_sem and robust_sem before waiting
on the waitqueue so as not to hang the kernel.

2) pthread_mutexes that are only robust, not PRIO_INHERIT, do not
hang. They have a new error code returned to them, -EWOULDDEADLOCK.
Since there is no POSIX specification for robust pthread_mutexes yet,
returning EWOULDDEADLOCK to a robust mutex is more in the spirit
of robustness. For robust pthread_mutexes we clean up if the holding
thread dies and we return EWOULDDEADLOCK to inform the
application that it is trying to deadlock itself.

The patch handles both simple and circular deadlocks. This is
something I've been wanting to do for a while, export deadlock
detection to all flavors of kernels. The patch provides the correct
deadlock behavior while not hanging the system.

It's also easier to see if a POSIX compliant app has deadlocked itself.
the 'ps' command will show that the wait channel of a deadlocked
application is waiting at 'futex_deadlock'.

Let me know if it passes all your tests.

David

On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:

> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
>>
>> hm, i'm looking at -rf4 - these changes look fishy:
>>
>> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>> + if (current != lock_owner(lock)->task)
>> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>
>> why is this done?
>>
>
> Ingo, this is to prevent a kernel hang due to application error.
>
> Basically when an application does a pthread_mutex_lock twice on a
> _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> Ofcourse the application clearly should not be doing anything like
> that, but it should not end up hanging the system either
>
> -Dinakar
>

2006-01-05 09:47:21

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Wed, 4 Jan 2006, david singleton wrote:

> Dinakar,
> I've got a more complete patch for deadlock detection for robust
> futexes.
>
> The patch is at
> http://source.mvista.com/~dsingleton/patch-2.6.15-rc7-rt3-rf2
>
> This patch handles pthread_mutex deadlocks in two ways:
>
> 1) POSIX states that non-recursive pthread_mutexes hang if locked twice.
> This patch hangs the deadlocking thread on a waitqueue. It releases
> all other kernel locks, like the mmap_sem and robust_sem before waiting
> on the waitqueue so as not to hang the kernel.
>
> 2) pthread_mutexes that are only robust, not PRIO_INHERIT, do not
> hang. They have a new error code returned to them, -EWOULDDEADLOCK.
> Since there is no POSIX specification for robust pthread_mutexes yet,
> returning EWOULDDEADLOCK to a robust mutex is more in the spirit
> of robustness. For robust pthread_mutexes we clean up if the holding
> thread dies and we return EWOULDDEADLOCK to inform the
> application that it is trying to deadlock itself.
>
> The patch handles both simple and circular deadlocks. This is
> something I've been wanting to do for a while, export deadlock
> detection to all flavors of kernels. The patch provides the correct
> deadlock behavior while not hanging the system.

Have you tested this on SMP - without CONFIG_DEBUG_DEADLOCKS? As far as I
can see check_futex_deadlock() can easily crash the system as there are
_no_ spinlocks protecting access to task->blocked_on,
task->blocked_on->lock etc. As far as I can see the tasks and locks
structures can be freed at any point while you are in
check_futex_deadlock().

Furthermore, check_futex_deadlock is recursive. You can easily from
userspace trick the system into a kernel stack overflow by making a many
nested futexes.

I am working on a new way to do priority inheritance for nested locks in
rt_mutex such you do not risc deadlocking the system on raw-spinlocks
when you have a rt_mutex deadlock. But it wont have deadlock detection without
CONFIG_DEBUG_DEADLOCKS. On the other hand it would be possible to make a
deadlock scanner finding deadlocks in the system after they have happened.
With a suitable /proc interface it could even be done in userspace.

My patch to the rt_mutex is far from finished. I haven't even compiled a
kernel with it yet. I spend the little time I have between my
family goes to bed and I simply have to go to bed myself writing a
unittest framework for the rt_mutex and have both the original and the
patched rt_mutex parsing all my tests. But I need more tests to hammer
out the details about task->state forinstance. If anyone is interrested I
would be happy to send what I got right now.

Esben

>
> It's also easier to see if a POSIX compliant app has deadlocked itself.
> the 'ps' command will show that the wait channel of a deadlocked
> application is waiting at 'futex_deadlock'.
>
> Let me know if it passes all your tests.
>
> David
>
>
>
>
> On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:
>
> > On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
> >>
> >> hm, i'm looking at -rf4 - these changes look fishy:
> >>
> >> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >> + if (current != lock_owner(lock)->task)
> >> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >>
> >> why is this done?
> >>
> >
> > Ingo, this is to prevent a kernel hang due to application error.
> >
> > Basically when an application does a pthread_mutex_lock twice on a
> > _nonrecursive_ mutex with robust/PI attributes the whole system hangs.
> > Ofcourse the application clearly should not be doing anything like
> > that, but it should not end up hanging the system either
> >
> > -Dinakar
> >
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-01-05 17:11:39

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Jan 5, 2006, at 1:43 AM, Esben Nielsen wrote:

>
>
> On Wed, 4 Jan 2006, david singleton wrote:
>
>> Dinakar,
>> I've got a more complete patch for deadlock detection for robust
>> futexes.
>>
>> The patch is at
>> http://source.mvista.com/~dsingleton/patch-2.6.15-rc7-rt3-rf2
>>
>> This patch handles pthread_mutex deadlocks in two ways:
>>
>> 1) POSIX states that non-recursive pthread_mutexes hang if locked
>> twice.
>> This patch hangs the deadlocking thread on a waitqueue. It releases
>> all other kernel locks, like the mmap_sem and robust_sem before
>> waiting
>> on the waitqueue so as not to hang the kernel.
>>
>> 2) pthread_mutexes that are only robust, not PRIO_INHERIT, do not
>> hang. They have a new error code returned to them, -EWOULDDEADLOCK.
>> Since there is no POSIX specification for robust pthread_mutexes yet,
>> returning EWOULDDEADLOCK to a robust mutex is more in the spirit
>> of robustness. For robust pthread_mutexes we clean up if the holding
>> thread dies and we return EWOULDDEADLOCK to inform the
>> application that it is trying to deadlock itself.
>>
>> The patch handles both simple and circular deadlocks. This is
>> something I've been wanting to do for a while, export deadlock
>> detection to all flavors of kernels. The patch provides the correct
>> deadlock behavior while not hanging the system.
>
>
> Have you tested this on SMP - without CONFIG_DEBUG_DEADLOCKS? As far
> as I
> can see check_futex_deadlock() can easily crash the system as there are
> _no_ spinlocks protecting access to task->blocked_on,
> task->blocked_on->lock etc. As far as I can see the tasks and locks
> structures can be freed at any point while you are in
> check_futex_deadlock().
>

You are right. I didn't extract enough of the deadlock check routines
to get
the locking. I'll fix that.

> Furthermore, check_futex_deadlock is recursive. You can easily from
> userspace trick the system into a kernel stack overflow by making a
> many
> nested futexes.

The rt deadlock check is also recursive, but it stops at a depth of 20,
deciding something must be corrupt to have a task blocked on more
than 20 locks.

We should also set a limit. We can either just hang an ill-behaved
app on the waitqueue or return an error code to the application.
Any suggestions on which would be best, hang an illbehaved app
or just return it an error?

David

>
> I am working on a new way to do priority inheritance for nested locks
> in
> rt_mutex such you do not risc deadlocking the system on raw-spinlocks
> when you have a rt_mutex deadlock. But it wont have deadlock detection
> without
> CONFIG_DEBUG_DEADLOCKS. On the other hand it would be possible to make
> a
> deadlock scanner finding deadlocks in the system after they have
> happened.
> With a suitable /proc interface it could even be done in userspace.
>
> My patch to the rt_mutex is far from finished. I haven't even compiled
> a
> kernel with it yet. I spend the little time I have between my
> family goes to bed and I simply have to go to bed myself writing a
> unittest framework for the rt_mutex and have both the original and the
> patched rt_mutex parsing all my tests. But I need more tests to hammer
> out the details about task->state forinstance. If anyone is
> interrested I
> would be happy to send what I got right now.
>
> Esben
>
>>
>> It's also easier to see if a POSIX compliant app has deadlocked
>> itself.
>> the 'ps' command will show that the wait channel of a deadlocked
>> application is waiting at 'futex_deadlock'.
>>
>> Let me know if it passes all your tests.
>>
>> David
>>
>>
>>
>>
>> On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:
>>
>>> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
>>>>
>>>> hm, i'm looking at -rf4 - these changes look fishy:
>>>>
>>>> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>>> + if (current != lock_owner(lock)->task)
>>>> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>>>
>>>> why is this done?
>>>>
>>>
>>> Ingo, this is to prevent a kernel hang due to application error.
>>>
>>> Basically when an application does a pthread_mutex_lock twice on a
>>> _nonrecursive_ mutex with robust/PI attributes the whole system
>>> hangs.
>>> Ofcourse the application clearly should not be doing anything like
>>> that, but it should not end up hanging the system either
>>>
>>> -Dinakar
>>>
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>

2006-01-05 17:48:48

by Esben Nielsen

[permalink] [raw]

Subject: Re: Recursion bug in -rt

On Thu, 5 Jan 2006, david singleton wrote:

>
> On Jan 5, 2006, at 1:43 AM, Esben Nielsen wrote:
>
> >
> >
> > On Wed, 4 Jan 2006, david singleton wrote:
> >
> >> Dinakar,
> >> I've got a more complete patch for deadlock detection for robust
> >> futexes.
> >>
> >> The patch is at
> >> http://source.mvista.com/~dsingleton/patch-2.6.15-rc7-rt3-rf2
> >>
> >> This patch handles pthread_mutex deadlocks in two ways:
> >>
> >> 1) POSIX states that non-recursive pthread_mutexes hang if locked
> >> twice.
> >> This patch hangs the deadlocking thread on a waitqueue. It releases
> >> all other kernel locks, like the mmap_sem and robust_sem before
> >> waiting
> >> on the waitqueue so as not to hang the kernel.
> >>
> >> 2) pthread_mutexes that are only robust, not PRIO_INHERIT, do not
> >> hang. They have a new error code returned to them, -EWOULDDEADLOCK.
> >> Since there is no POSIX specification for robust pthread_mutexes yet,
> >> returning EWOULDDEADLOCK to a robust mutex is more in the spirit
> >> of robustness. For robust pthread_mutexes we clean up if the holding
> >> thread dies and we return EWOULDDEADLOCK to inform the
> >> application that it is trying to deadlock itself.
> >>
> >> The patch handles both simple and circular deadlocks. This is
> >> something I've been wanting to do for a while, export deadlock
> >> detection to all flavors of kernels. The patch provides the correct
> >> deadlock behavior while not hanging the system.
> >
> >
> > Have you tested this on SMP - without CONFIG_DEBUG_DEADLOCKS? As far
> > as I
> > can see check_futex_deadlock() can easily crash the system as there are
> > _no_ spinlocks protecting access to task->blocked_on,
> > task->blocked_on->lock etc. As far as I can see the tasks and locks
> > structures can be freed at any point while you are in
> > check_futex_deadlock().
> >
>
> You are right. I didn't extract enough of the deadlock check routines
> to get
> the locking. I'll fix that.
>
> > Furthermore, check_futex_deadlock is recursive. You can easily from
> > userspace trick the system into a kernel stack overflow by making a
> > many
> > nested futexes.
>
> The rt deadlock check is also recursive, but it stops at a depth of 20,
> deciding something must be corrupt to have a task blocked on more
> than 20 locks.
>
> We should also set a limit. We can either just hang an ill-behaved
> app on the waitqueue or return an error code to the application.
> Any suggestions on which would be best, hang an illbehaved app
> or just return it an error?
>

Per default make the tasks block. You can always make the return code an
option on each futex, but it has to be off as default. Here is why I think
so:

Having an return code would require you do the deadlock detection "up
front" in the down() operation. That takes a lot of CPU cycles .Ofcourse,
if you return an error code, the application could use the info for something
constructive, but I am afraid must applications wont do anything constructive
about it anyway (i.e. such the application continues to run) - such error
handling code would be hard to write.

What is needed in most application is that the stuff simply
deadlocks with the tasks blocked on the various locks. Then you can go in and
trace the locks "postmortem".

With the current setup, where you deadlock detection has to be performed
"up front" because the rt_mutex can make spinlock-deadlocks, the error
code will be a natural thing. But when rt_mutex is "fixed" or
replaced with something else, this feature will force the kernel to run
deadlock detection "up front" even though it isn't used for anything
usefull.

Esben

> David
>
> >
> > I am working on a new way to do priority inheritance for nested locks
> > in
> > rt_mutex such you do not risc deadlocking the system on raw-spinlocks
> > when you have a rt_mutex deadlock. But it wont have deadlock detection
> > without
> > CONFIG_DEBUG_DEADLOCKS. On the other hand it would be possible to make
> > a
> > deadlock scanner finding deadlocks in the system after they have
> > happened.
> > With a suitable /proc interface it could even be done in userspace.
> >
> > My patch to the rt_mutex is far from finished. I haven't even compiled
> > a
> > kernel with it yet. I spend the little time I have between my
> > family goes to bed and I simply have to go to bed myself writing a
> > unittest framework for the rt_mutex and have both the original and the
> > patched rt_mutex parsing all my tests. But I need more tests to hammer
> > out the details about task->state forinstance. If anyone is
> > interrested I
> > would be happy to send what I got right now.
> >
> > Esben
> >
> >>
> >> It's also easier to see if a POSIX compliant app has deadlocked
> >> itself.
> >> the 'ps' command will show that the wait channel of a deadlocked
> >> application is waiting at 'futex_deadlock'.
> >>
> >> Let me know if it passes all your tests.
> >>
> >> David
> >>
> >>
> >>
> >>
> >> On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:
> >>
> >>> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
> >>>>
> >>>> hm, i'm looking at -rf4 - these changes look fishy:
> >>>>
> >>>> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >>>> + if (current != lock_owner(lock)->task)
> >>>> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
> >>>>
> >>>> why is this done?
> >>>>
> >>>
> >>> Ingo, this is to prevent a kernel hang due to application error.
> >>>
> >>> Basically when an application does a pthread_mutex_lock twice on a
> >>> _nonrecursive_ mutex with robust/PI attributes the whole system
> >>> hangs.
> >>> Ofcourse the application clearly should not be doing anything like
> >>> that, but it should not end up hanging the system either
> >>>
> >>> -Dinakar
> >>>
> >>
> >> -
> >> To unsubscribe from this list: send the line "unsubscribe
> >> linux-kernel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at http://www.tux.org/lkml/
> >>
> >
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-01-05 18:26:04

by David Singleton

[permalink] [raw]

Subject: Re: Recursion bug in -rt

>> The rt deadlock check is also recursive, but it stops at a depth of
>> 20,
>> deciding something must be corrupt to have a task blocked on more
>> than 20 locks.
>>
>> We should also set a limit. We can either just hang an ill-behaved
>> app on the waitqueue or return an error code to the application.
>> Any suggestions on which would be best, hang an illbehaved app
>> or just return it an error?
>>
>
> Per default make the tasks block. You can always make the return code
> an
> option on each futex, but it has to be off as default. Here is why I
> think
> so:
>
> Having an return code would require you do the deadlock detection "up
> front" in the down() operation. That takes a lot of CPU cycles
> .Ofcourse,
> if you return an error code, the application could use the info for
> something
> constructive, but I am afraid must applications wont do anything
> constructive
> about it anyway (i.e. such the application continues to run) - such
> error
> handling code would be hard to write.

Yes, I agree. I'll hang the app on a new waitqueue that will let
the user
see with 'ps' that they are hung in futex_ill_behaved waitqueue or some
other name
that makes it easy to see where the app is stopped.

thanks.

David

>
> What is needed in most application is that the stuff simply
> deadlocks with the tasks blocked on the various locks. Then you can go
> in and
> trace the locks "postmortem".
>
> With the current setup, where you deadlock detection has to be
> performed
> "up front" because the rt_mutex can make spinlock-deadlocks, the error
> code will be a natural thing. But when rt_mutex is "fixed" or
> replaced with something else, this feature will force the kernel to
> run
> deadlock detection "up front" even though it isn't used for anything
> usefull.
>
> Esben
>
>> David
>>
>>>
>>> I am working on a new way to do priority inheritance for nested locks
>>> in
>>> rt_mutex such you do not risc deadlocking the system on raw-spinlocks
>>> when you have a rt_mutex deadlock. But it wont have deadlock
>>> detection
>>> without
>>> CONFIG_DEBUG_DEADLOCKS. On the other hand it would be possible to
>>> make
>>> a
>>> deadlock scanner finding deadlocks in the system after they have
>>> happened.
>>> With a suitable /proc interface it could even be done in userspace.
>>>
>>> My patch to the rt_mutex is far from finished. I haven't even
>>> compiled
>>> a
>>> kernel with it yet. I spend the little time I have between my
>>> family goes to bed and I simply have to go to bed myself writing a
>>> unittest framework for the rt_mutex and have both the original and
>>> the
>>> patched rt_mutex parsing all my tests. But I need more tests to
>>> hammer
>>> out the details about task->state forinstance. If anyone is
>>> interrested I
>>> would be happy to send what I got right now.
>>>
>>> Esben
>>>
>>>>
>>>> It's also easier to see if a POSIX compliant app has deadlocked
>>>> itself.
>>>> the 'ps' command will show that the wait channel of a deadlocked
>>>> application is waiting at 'futex_deadlock'.
>>>>
>>>> Let me know if it passes all your tests.
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>> On Dec 20, 2005, at 7:50 AM, Dinakar Guniguntala wrote:
>>>>
>>>>> On Tue, Dec 20, 2005 at 02:19:56PM +0100, Ingo Molnar wrote:
>>>>>>
>>>>>> hm, i'm looking at -rf4 - these changes look fishy:
>>>>>>
>>>>>> - _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>>>>> + if (current != lock_owner(lock)->task)
>>>>>> + _raw_spin_lock(&lock_owner(lock)->task->pi_lock);
>>>>>>
>>>>>> why is this done?
>>>>>>
>>>>>
>>>>> Ingo, this is to prevent a kernel hang due to application error.
>>>>>
>>>>> Basically when an application does a pthread_mutex_lock twice on a
>>>>> _nonrecursive_ mutex with robust/PI attributes the whole system
>>>>> hangs.
>>>>> Ofcourse the application clearly should not be doing anything like
>>>>> that, but it should not end up hanging the system either
>>>>>
>>>>> -Dinakar
>>>>>
>>>>
>>>> -
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-kernel" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at http://www.tux.org/lkml/
>>>>
>>>
>>
>> -
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>

2006-01-07 02:40:12

by David Singleton

[permalink] [raw]

Subject: robust futex deadlock detection patch

Here is a new patch that provides both futex deadlock detection and
prevents ill-behaved and
malicious apps from deadlocking the kernel through the robust futex
interface.

http://source.mvista.com/~dsingleton/patch-2.6.15-rt2-rf1

Deadlock detection is done 'up front' for both POSIX and robust
pthread_mutexes. Non-recursive
POSIX mutexes will hang if deadlocked, as defined by the POSIX spec.
The wait channel they
are hung on is 'futex_deadlock'. This wait channel makes it easy to
spot that your POSIX app
has deadlocked itself via the 'ps' command.

Robust futexes will have -EDEADLK returned to them since there is no
POSIX specification for
robust mutexes, yet, and returning -EDEADLK is more in the spirit of
robustness. Robust
mutexes are cleaned up by the kernel after a thread dies and they also
report to the app if
it is deadlocking itself.

Deadlock detection is something I have wanted to provide for both debug
and production kernels
for a while. It was previously available through DEBUG_DEADLOCKS. I
needed to add the
deadlock dection code for both production and debug kernels to prevent
applications hanging
the kernel.

David

2006-01-07 19:49:54

by Ulrich Drepper

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

> Robust futexes will have -EDEADLK returned to them since there is no
> POSIX specification for
> robust mutexes, yet, and returning -EDEADLK is more in the spirit of
> robustness.

I disagree. Robustness is only an additional mutex attribute on top of
the mutex type. I.e., there can be robust
normal/error-checking/recursive mutexes. What kind it is can be told to
the kernel at the time the robust mutex is registered with the kernel.

This is the way I'll write up the proposal for robust mutexes for
inclusion in the next POSIX revision.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

Attachments:

signature.asc (251.00 B)
OpenPGP digital signature

2006-01-09 09:24:47

by Esben Nielsen

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

On Fri, 6 Jan 2006, david singleton wrote:

>
> Here is a new patch that provides both futex deadlock detection and
> prevents ill-behaved and
> malicious apps from deadlocking the kernel through the robust futex
> interface.
>
> http://source.mvista.com/~dsingleton/patch-2.6.15-rt2-rf1
>
> Deadlock detection is done 'up front' for both POSIX and robust
> pthread_mutexes. Non-recursive
> POSIX mutexes will hang if deadlocked, as defined by the POSIX spec.
> The wait channel they
> are hung on is 'futex_deadlock'. This wait channel makes it easy to
> spot that your POSIX app
> has deadlocked itself via the 'ps' command.
>
> Robust futexes will have -EDEADLK returned to them since there is no
> POSIX specification for
> robust mutexes, yet, and returning -EDEADLK is more in the spirit of
> robustness. Robust
> mutexes are cleaned up by the kernel after a thread dies and they also
> report to the app if
> it is deadlocking itself.
>
> Deadlock detection is something I have wanted to provide for both debug
> and production kernels
> for a while. It was previously available through DEBUG_DEADLOCKS. I
> needed to add the
> deadlock dection code for both production and debug kernels to prevent
> applications hanging
> the kernel.
>

I am a little bit confused when I read check_futex_deadlock():
It takes a parameter struct thread_info *ti and immediately do
struct task_struct *task = ti->task. Now we have the usual pair
(thread_info *ti, task_t *task) corresponding to the same process. Later
on in the function you do ti = lock_owner(lock), but do not update task.
Was this intented?

Anyway, I can't see that you have locked the necesary raw_spin_locks.
Forinstance lock_owner(lock) must be called with the lock->wait_lock taken
and task->blocked_on needs task->pi_lock locked.
To avoid deadlocks in all the deadlock detection you have to do the loop
something like

for(owner = current; owner; ) {
raw_spin_lock(&owner->pi_lock);
if(owner->task->blocked_on) {
lock = owner->task->blocked_on->lock;
raw_spin_lock(&lock->wait_lock);
owner2 = lock_owner(lock);
if(owner2) {
get_task_struct(owner2->task);
raw_spin_unlock(&lock->wait_lock)
raw_spin_unlock(&owner->pi_lock);
}
put_task_struct(owner->task);
owner = owner2;
if(owner2==current) DEADLOCK
}

Esben

> David
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2006-01-09 20:00:48

by David Singleton

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

Esben Nielsen wrote:

>
>I am a little bit confused when I read check_futex_deadlock():
>It takes a parameter struct thread_info *ti and immediately do
>struct task_struct *task = ti->task. Now we have the usual pair
>(thread_info *ti, task_t *task) corresponding to the same process. Later
>on in the function you do ti = lock_owner(lock), but do not update task.
>Was this intented?
>
>

Whoops. You are right. I've fixed the update to the task structure.
The check_futex_deadlock()
code now mirrors the existing check_deadlock() code.

>Anyway, I can't see that you have locked the necesary raw_spin_locks.
>Forinstance lock_owner(lock) must be called with the lock->wait_lock taken
>and task->blocked_on needs task->pi_lock locked.
>
>

Actually those locks are grabbed in the down_try_futex code. I hold
both of those
locks across the check for deadlocks and into __down_interruptible.
Those locks
need to be held for the check for deadlocks and holding
them from down_try_futex to down_interruptible garuantees that a thread
that enters the kernel to block on a lock will block on the lock. There
is no
window any more between dropping the robust_sem and mmap_sem
and calling down_interruptible.

This also fixed an SMP problem that Dave Carlson has been seeing on earlier
patches.

The new patch is at

http://source.mvista.com/~dsingleton/patch-2.6.15-rt2-rf2

David

>To avoid deadlocks in all the deadlock detection you have to do the loop
>something like
>
>for(owner = current; owner; ) {
> raw_spin_lock(&owner->pi_lock);
> if(owner->task->blocked_on) {
> lock = owner->task->blocked_on->lock;
> raw_spin_lock(&lock->wait_lock);
> owner2 = lock_owner(lock);
> if(owner2) {
> get_task_struct(owner2->task);
> raw_spin_unlock(&lock->wait_lock)
> raw_spin_unlock(&owner->pi_lock);
> }
> put_task_struct(owner->task);
> owner = owner2;
> if(owner2==current) DEADLOCK
>}
>
>
>Esben
>
>
>
>
>>David
>>
>>-
>>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>the body of a message to [email protected]
>>More majordomo info at http://vger.kernel.org/majordomo-info.html
>>Please read the FAQ at http://www.tux.org/lkml/
>>
>>
>>
>
>
>
>
>

2006-01-09 20:20:12

by Esben Nielsen

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

On Mon, 9 Jan 2006, David Singleton wrote:

> Esben Nielsen wrote:
>
> >
> >I am a little bit confused when I read check_futex_deadlock():
> >It takes a parameter struct thread_info *ti and immediately do
> >struct task_struct *task = ti->task. Now we have the usual pair
> >(thread_info *ti, task_t *task) corresponding to the same process. Later
> >on in the function you do ti = lock_owner(lock), but do not update task.
> >Was this intented?
> >
> >
>
> Whoops. You are right. I've fixed the update to the task structure.
> The check_futex_deadlock()
> code now mirrors the existing check_deadlock() code.
>
> >Anyway, I can't see that you have locked the necesary raw_spin_locks.
> >Forinstance lock_owner(lock) must be called with the lock->wait_lock taken
> >and task->blocked_on needs task->pi_lock locked.
> >
> >
>
> Actually those locks are grabbed in the down_try_futex code. I hold
> both of those
> locks across the check for deadlocks and into __down_interruptible.
> Those locks
> need to be held for the check for deadlocks and holding
> them from down_try_futex to down_interruptible garuantees that a thread
> that enters the kernel to block on a lock will block on the lock. There
> is no
> window any more between dropping the robust_sem and mmap_sem
> and calling down_interruptible.
>

You only take the spinlocks corresponding to the current lock. What about
the next locks in the chain? Remember those locks might not be
futexes but a lock inside the kernel, taken in system calls. I.e. the
robust_sem might not protect you.
If there are n locks you need to lock n lock->wait_lock and n
owner->task->pi_lock as you traverse the locks. That is what I tried to
sketch in the examble below.

Esben

> This also fixed an SMP problem that Dave Carlson has been seeing on earlier
> patches.
>
> The new patch is at
>
> http://source.mvista.com/~dsingleton/patch-2.6.15-rt2-rf2
>
> David
>
> >To avoid deadlocks in all the deadlock detection you have to do the loop
> >something like
> >
> >for(owner = current; owner; ) {
> > raw_spin_lock(&owner->pi_lock);
> > if(owner->task->blocked_on) {
> > lock = owner->task->blocked_on->lock;
> > raw_spin_lock(&lock->wait_lock);
> > owner2 = lock_owner(lock);
> > if(owner2) {
> > get_task_struct(owner2->task);
> > raw_spin_unlock(&lock->wait_lock)
> > raw_spin_unlock(&owner->pi_lock);
> > }
> > put_task_struct(owner->task);
> > owner = owner2;
> > if(owner2==current) DEADLOCK
> >}
> >
> >
> >Esben
> >
> >
> >
> >
> >>David
> >>
> >>-
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to [email protected]
> >>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at http://www.tux.org/lkml/
> >>
> >>
> >>
> >
> >
> >
> >
> >
>

2006-01-09 21:08:48

by Steven Rostedt

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

On Mon, 2006-01-09 at 21:16 +0100, Esben Nielsen wrote:

> You only take the spinlocks corresponding to the current lock. What about
> the next locks in the chain? Remember those locks might not be
> futexes but a lock inside the kernel, taken in system calls. I.e. the
> robust_sem might not protect you.
> If there are n locks you need to lock n lock->wait_lock and n
> owner->task->pi_lock as you traverse the locks. That is what I tried to
> sketch in the examble below.

The thing about this is that you can't (if the kernel is not buggy)
deadlock on the kernel spin locks. As long as you protect the locks in
the futex from deadlocking you are fine. This is because you don't grab
a futex after grabbing a kernel spin lock. All kernel spin locks
__must__ be released before going back to user land, and that's where
you grab the futexes.

-- Steve

2006-01-09 21:19:45

by Esben Nielsen

[permalink] [raw]

Subject: Re: robust futex deadlock detection patch

On Mon, 9 Jan 2006, Steven Rostedt wrote:

> On Mon, 2006-01-09 at 21:16 +0100, Esben Nielsen wrote:
>
> > You only take the spinlocks corresponding to the current lock. What about
> > the next locks in the chain? Remember those locks might not be
> > futexes but a lock inside the kernel, taken in system calls. I.e. the
> > robust_sem might not protect you.
> > If there are n locks you need to lock n lock->wait_lock and n
> > owner->task->pi_lock as you traverse the locks. That is what I tried to
> > sketch in the examble below.
>
> The thing about this is that you can't (if the kernel is not buggy)
> deadlock on the kernel spin locks. As long as you protect the locks in
> the futex from deadlocking you are fine. This is because you don't grab
> a futex after grabbing a kernel spin lock. All kernel spin locks
> __must__ be released before going back to user land, and that's where
> you grab the futexes.
>
Yes, but the deadlock-detect code doesn't know what is a futex and what
isn't. Therefore when it starts to traverse rt_mutexes which aren't
futexes it has to be _very_ carefull taking the right spinlocks. On the
other hand when traversing the futexes it has to be carefull not to
deadlocks on those very same spinlocks.

And yes, you _can_ traverse the lock chain without deadlocking the kernel.
The code I sketched in the previouse mail should do it.

Anyway, I am trying to rewrite rt_mutex to remove the spinlock deadlock
altogether. My girlfriend is away with our daugther a few days so I might
get time to do finish it :-)

> -- Steve
>
>

Esben