2007-05-07 15:14:14

by Alexey Kuznetsov

Subject: [RFC][PATCH] multiple bugs in PI futexes

Hello!

1. New entries can be added to tsk->pi_state_list after the task has completed
exit_pi_state_list(). The result is memory leaks and deadlocks.

2. handle_mm_fault() is called under a spinlock (see the sketch below).
The result is obvious.

3. The state machine is broken. The kernel thinks it owns the futex after
it has released all the locks. Ergo, it corrupts the futex. The result is that
two processes think they took the same futex.

All the bugs are trivially reproduced when running glibc's tst-robustpi7
test long enough.
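
To illustrate bug 2, the offending pattern is essentially the following
(a minimal sketch, not the literal code path; futex_handle_fault() is the
futex.c helper that ends up in handle_mm_fault()):

	spin_lock(&hb->lock);
	/*
	 * The cmpxchg on the user space futex word faulted, so we try to
	 * fault the page in while still holding the hash bucket spinlock.
	 * handle_mm_fault() can sleep (memory allocation, disk I/O), and
	 * sleeping in atomic context deadlocks or oopses.
	 */
	if (futex_handle_fault((unsigned long)uaddr, attempt))
		ret = -EFAULT;
	spin_unlock(&hb->lock);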

The patch is not quite good (RFC!), because:

1. There is one case where I did not figure out how to handle the
page fault correctly. I would do it by releasing the taken rtmutex
and hb->lock and retrying the futex operation from the very beginning.
It is quite ugly. Probably the state machine can be fixed somehow.

2. Before this patch I had one unexplained oops inside rtmutex,
in plist_del. I did _not_ fix this, but it does not want to reproduce anymore.
Probably the stronger locking made some race window too narrow.

Alexey



diff --git a/kernel/futex.c b/kernel/futex.c
index 5a270b5..4207034 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -396,10 +396,13 @@ static struct task_struct * futex_find_g
p = NULL;
goto out_unlock;
}
+#if 0
+ /* The test is meaningless. Should I explain why? --ANK */
if (p->exit_state != 0) {
p = NULL;
goto out_unlock;
}
+#endif
get_task_struct(p);
out_unlock:
rcu_read_unlock();
@@ -505,6 +508,18 @@ lookup_pi_state(u32 uval, struct futex_h
if (!p)
return -ESRCH;

+ /* Task state transitions are protected only by tasklist_lock
+ * and by nothing else. A better solution would be
+ * to use ->pi_lock, but that would have to be done in do_exit() as well.
+ * So I use tasklist_lock for now.
+ */
+ read_lock(&tasklist_lock);
+ if (p->exit_state) {
+ read_unlock(&tasklist_lock);
+ put_task_struct(p);
+ return -ESRCH;
+ }
+
pi_state = alloc_pi_state();

/*
@@ -522,6 +537,8 @@ lookup_pi_state(u32 uval, struct futex_h
pi_state->owner = p;
spin_unlock_irq(&p->pi_lock);

+ read_unlock(&tasklist_lock);
+
put_task_struct(p);

me->pi_state = pi_state;
@@ -1149,6 +1166,7 @@ static int futex_lock_pi(u32 __user *uad
if (unlikely(ret != 0))
goto out_release_sem;

+ retry_unlocked:
hb = queue_lock(&q, -1, NULL);

retry_locked:
@@ -1279,13 +1297,48 @@ static int futex_lock_pi(u32 __user *uad
list_add(&q.pi_state->list, &current->pi_state_list);
spin_unlock_irq(&current->pi_lock);

+ for (;;) {
+ newval = (uval & FUTEX_OWNER_DIED) | newtid;
+ pagefault_disable();
+ curval = futex_atomic_cmpxchg_inatomic(uaddr,
+ uval, newval);
+ pagefault_enable();
+
+ /* Of course, this is wrong. If we fault here,
+ * we must:
+ * 1. Probably, drop rtmutex, if we still hold it.
+ * 2. drop all the locks.
+ * 3. handle fault
+ * 4. restart from the very beginning.
+ *
+ * I am not doing this just because all the code
+ * dealing with faults (handle_mm_fault under
+ * spinlock, isn't it cool?) used to be very wrong
+ * and my fix would be overwritten in any
+ * case. It seems this needs more qualification
+ * than I have. Ingo, please...
+ */
+ if (curval == -EFAULT) {
+ ret = -EFAULT;
+ break;
+ }
+ if (curval == uval)
+ break;
+ uval = curval;
+ }
+
/* Unqueue and drop the lock */
unqueue_me_pi(&q, hb);
up_read(&curr->mm->mmap_sem);
+#if 0
/*
* We own it, so we have to replace the pending owner
* TID. This must be atomic as we have preserve the
* owner died bit here.
+ *
+ * Why do you think you "own" it? You have just dropped
+ * all the locks and rtmutex. No traces of your right
+ * remained :-) --ANK
*/
ret = get_user(uval, uaddr);
while (!ret) {
@@ -1298,6 +1351,7 @@ static int futex_lock_pi(u32 __user *uad
break;
uval = curval;
}
+#endif
} else {
/*
* Catch the rare case, where the lock was released
@@ -1331,16 +1385,19 @@ static int futex_lock_pi(u32 __user *uad
* non-atomically. Therefore, if get_user below is not
* enough, we need to handle the fault ourselves, while
* still holding the mmap_sem.
+ *
+ * ... and hb->lock. :-) --ANK
*/
+ queue_unlock(&q, hb);
+
if (attempt++) {
if (futex_handle_fault((unsigned long)uaddr, attempt)) {
ret = -EFAULT;
- goto out_unlock_release_sem;
+ goto out_release_sem;
}
- goto retry_locked;
+ goto retry_unlocked;
}

- queue_unlock(&q, hb);
up_read(&curr->mm->mmap_sem);

ret = get_user(uval, uaddr);
@@ -1382,9 +1439,9 @@ retry:
goto out;

hb = hash_futex(&key);
+retry_unlocked:
spin_lock(&hb->lock);

-retry_locked:
/*
* To avoid races, try to do the TID -> 0 atomic transition
* again. If it succeeds then we can return without waking
@@ -1446,16 +1503,19 @@ pi_faulted:
* non-atomically. Therefore, if get_user below is not
* enough, we need to handle the fault ourselves, while
* still holding the mmap_sem.
+ *
+ * ... and hb->lock. --ANK
*/
+ spin_unlock(&hb->lock);
+
if (attempt++) {
if (futex_handle_fault((unsigned long)uaddr, attempt)) {
ret = -EFAULT;
- goto out_unlock;
+ goto out;
}
- goto retry_locked;
+ goto retry_unlocked;
}

- spin_unlock(&hb->lock);
up_read(&current->mm->mmap_sem);

ret = get_user(uval, uaddr);


2007-05-23 07:27:00

by Ingo Molnar

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes


* Alexey Kuznetsov <[email protected]> wrote:

> Hello!
>
> 1. New entries can be added to tsk->pi_state_list after the task has completed
> exit_pi_state_list(). The result is memory leaks and deadlocks.
>
> 2. handle_mm_fault() is called under a spinlock. The result is obvious.
>
> 3. The state machine is broken. The kernel thinks it owns the futex after
> it has released all the locks. Ergo, it corrupts the futex. The result is that
> two processes think they took the same futex.
>
> All the bugs are trivially reproduced when running glibc's tst-robustpi7
> test long enough.
>
> The patch is not quite good (RFC!), because:
>
> 1. There is one case where I did not figure out how to handle the
> page fault correctly. I would do it by releasing the taken rtmutex
> and hb->lock and retrying the futex operation from the very beginning.
> It is quite ugly. Probably the state machine can be fixed somehow.
>
> 2. Before this patch I had one unexplained oops inside rtmutex,
> in plist_del. I did _not_ fix this, but it does not want to reproduce anymore.
> Probably the stronger locking made some race window too narrow.

thanks for the fixes - they all look good and we'll check it in -rt.
We'll try to find a solution for the remaining problem too. Could your
#2 crash be explained via any of the bugs you fixed? (i.e. memory
corruption?) I'd exclude genuine rtmutex.c breakage for now because
that's the basis of all locking in -rt - but maybe the futex interfacing
upsets something ...

Ingo

2007-05-23 11:52:36

by Alexey Kuznetsov

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

Hello!

> #2 crash be explained via any of the bugs you fixed? (i.e. memory
> corruption?)

Yes, I found the reason; it is really fixed by taking tasklist_lock.
It happens after a task struct with a not-yet-cleared pi_state_list is freed
and the list of futex_pi_states is corrupted.


Meanwhile... two more bugs were found.

The first chunk: the bug results in a self-inflicted deadlock inside glibc.
Sometimes futex_lock_pi() returns -ESRCH when it is not expected,
and glibc enters a for(;;) sleep() loop to simulate a deadlock. This problem
is quite obvious and I think the patch is right. Though it looks like
each "if" in futex_lock_pi() has grown some stupid special-case "else if". :-)


The second chunk: sometimes futex_lock_pi() returns -EDEADLK
when nobody has the lock. The reason is also obvious (see the comment
in the patch), but the correct fix is far beyond my comprehension.
I guess someone already saw this; the chunk:

if (rt_mutex_trylock(&q.pi_state->pi_mutex))
	ret = 0;

is obviously from the same opera. But it does not work, because the
rtmutex really is taken at this point: the wake_futex_pi() of the previous
owner reassigned it to us. My fix works, but it looks very stupid.
I would think about removing the shift of ownership in wake_futex_pi()
and doing all the work in the context of the process taking the lock.

Both bugs show up when running glibc's tst-robustpi8 long enough.

Yes, all of this is about the pre-May-8 futex code. It seems the updates
did not change anything in the logic, but I am not sure.


--- kernel/futex.c.intermediate 2007-05-23 14:48:27.000000000 +0400
+++ kernel/futex.c 2007-05-23 14:58:06.000000000 +0400
@@ -1244,8 +1244,21 @@ static int futex_lock_pi(u32 __user *uad
if (unlikely(curval != uval))
goto retry_locked;
ret = 0;
- }
- goto out_unlock_release_sem;
+ } else if (ret == -ESRCH) {
+ /* Process could exit right now, so that robust list
+ * was processed after we got uval. Retry. */
+ pagefault_disable();
+ curval = futex_atomic_cmpxchg_inatomic(uaddr,
+ uval, uval);
+ pagefault_enable();
+ if (unlikely(curval == -EFAULT))
+ goto uaddr_faulted;
+ if (unlikely(curval != uval)) {
+ printk("RETRY %x %x %x\n", current->pid, uval, curval);
+ goto retry_locked;
+ }
+ }
+ goto out_unlock_release_sem;
}

/*
@@ -1361,6 +1374,22 @@ static int futex_lock_pi(u32 __user *uad
if (ret && q.pi_state->owner == curr) {
if (rt_mutex_trylock(&q.pi_state->pi_mutex))
ret = 0;
+ /* Holy crap... Now futex lock returns -EDEADLK
+ * sometimes, because ownership was passed to us during
+ * the unlock by the previous owner. Who wrote this?
+ * Please, fix this correctly. For now:
+ */
+ if (ret == -EDEADLK) {
+ pagefault_disable();
+ uval = futex_atomic_cmpxchg_inatomic(uaddr,
+ 0, 0);
+ pagefault_enable();
+ if (uval != -EFAULT &&
+ (uval&FUTEX_TID_MASK) == current->pid) {
+ printk("ALERT1 %x\n", uval);
+ ret = 0;
+ }
+ }
}
/* Unqueue and drop the lock */
unqueue_me_pi(&q, hb);

2007-06-05 16:25:26

by Thomas Gleixner

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

Hi,

On Wed, 2007-05-23 at 15:51 +0400, Alexey Kuznetsov wrote:
> The first chunk: the bug results in a self-inflicted deadlock inside glibc.
> Sometimes futex_lock_pi() returns -ESRCH when it is not expected,
> and glibc enters a for(;;) sleep() loop to simulate a deadlock. This problem
> is quite obvious and I think the patch is right. Though it looks like
> each "if" in futex_lock_pi() has grown some stupid special-case "else if". :-)

Hmm, what does "not expected" mean? -ESRCH is returned when the owner task
is not found.

> + } else if (ret == -ESRCH) {
> + /* Process could exit right now, so that robust list
> + * was processed after we got uval. Retry. */
> + pagefault_disable();
> + curval = futex_atomic_cmpxchg_inatomic(uaddr,
> + uval, uval);

When we retrieve uval, we hold the hash bucket lock and we keep it down
to this point, so the robust list processing is stuck on the hash bucket
lock. Also using uval is wrong. The user space variable holds "newval",
which was written there before:

curval = cmpxchg_futex_value_locked(uaddr, uval, newval);

So you just go back up to retry_locked or into the fault handling path
(which is unlikely).

This does not really explain why you prevent the -ESRCH return value
in the next cycle, except for the case where we go through the fault
handling path.

> The second chunk: sometimes futex_lock_pi() returns -EDEADLK
> when nobody has the lock. The reason is also obvious (see the comment
> in the patch), but the correct fix is far beyond my comprehension.
> I guess someone already saw this; the chunk:
>
> if (rt_mutex_trylock(&q.pi_state->pi_mutex))
> 	ret = 0;
>
> is obviously from the same opera. But it does not work, because the
> rtmutex really is taken at this point: the wake_futex_pi() of the previous
> owner reassigned it to us.

No, -EDEADLK is returned from here:

ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);

The rtmutex code only returns -EDEADLK when the lock is already held by
the task or when it runs into a circular rtmutex dependency (e.g. an ABBA
deadlock) during the PI walk.

Also, wake_futex_pi() does not assign the ownership of the rtmutex; it
assigns the ownership of the futex pi state and unlocks the rtmutex,
which sets the pending owner. The pending owner is the highest-priority
waiter. The ownership assignment of the rtmutex happens inside the
rtmutex code, not in wake_futex_pi().

The trylock covers the following situation:

rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1) returns -ETIMEDOUT;

Now the task gets blocked on:
spin_lock(q.lock_ptr);

because on the other CPU the futex is unlocked. The rtmutex code does
not find a waiter and unlocks the rtmutex with no new owner. The hash
bucket lock is released and the task which is blocked on the hash bucket
lock has a chance to grab it.
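
As a timeline (a rough sketch, the interleaving shown is illustrative):

	/*
	 * CPU 0 (us, in futex_lock_pi)         CPU 1 (current owner)
	 *
	 * rt_mutex_timed_lock() -> -ETIMEDOUT
	 * spin_lock(q.lock_ptr) ...blocks      unlocks the futex; the rtmutex
	 *                                      has no waiters left, so it is
	 *                                      released with no new owner
	 * ...finally gets hb->lock
	 * if (rt_mutex_trylock(&q.pi_state->pi_mutex))
	 *         ret = 0;   the trylock succeeds and we own the lock
	 *                    despite the earlier timeout
	 */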

> My fix works, but it looks very stupid.
> I would think about removing the shift of ownership in wake_futex_pi()
> and doing all the work in the context of the process taking the lock.

I have no clue why it works. The only explanation is that an existing
deadlock is resolved by the exiting task.

> if (ret && q.pi_state->owner == curr) {

This happens only when the rtmutex was unlocked by another task between
the return from the rtmutex code and reacquiring the hash bucket lock.

> if (rt_mutex_trylock(&q.pi_state->pi_mutex))
> ret = 0;

If we can get the rtmutex here, we are the owner of the lock and need to
fix up ownership in the user space variable further down.

> + /* Holy crap... Now futex lock returns -EDEADLK
> + * sometimes, because ownership was passed to us during
> + * the unlock by the previous owner. Who wrote this?

I admit that I was one of the authors :)

> + * Please, fix this correctly. For now:
> + */

The fix is to remove it and to find the real cause of the problem.

I've been running the glibc tests for hours w/o tripping over it.

tglx


2007-06-05 17:40:44

by Alexey Kuznetsov

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

Hello!

> Hmm, what means not expected ? -ESRCH is returned, when the owner task
> is not found.

This is not supposed to happen with robust futexes.

glibc aborts (which is correct) or, for a build with debugging disabled,
enters a simulated deadlock (which is confusing).


> lock. Also using uval is wrong.

Yup. You are right.

This means those RETRY messages could be spurious. I must rerun the test.


> This does not really explain, why you do prevent the -ESRCH return value
> in the next cycle,

Because the right curval is refetched: it already has the FUTEX_OWNER_DIED bit set
and we successfully take the lock.


> No, -EDEADLK is returned from here:
>
> ret = rt_mutex_timed_lock(&q.pi_state->pi_mutex, to, 1);

Of course. It is the only place where ret is set. :-)

>
> The rtmutex code only returns -EDEADLK, when the lock is already held by
> the task

This case.

> The fix is to remove it and to find the real cause of the problem.
>
> I've been running the glibc tests for hours w/o tripping over it.

OK.

You need to run only tst-robustpi8 in a loop. It should trigger quickly,
within a few minutes on an 8-way SMP box here.

If you want, I can insert whatever debugging printks you need
and run the test here.

Alexey

2007-06-05 18:48:38

by Thomas Gleixner

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

On Tue, 2007-06-05 at 21:39 +0400, Alexey Kuznetsov wrote:
> Hello!
>
> > Hmm, what does "not expected" mean? -ESRCH is returned when the owner task
> > is not found.
>
> This is not supposed to happen with robust futexes.

Hmm, right.

> > This does not really explain why you prevent the -ESRCH return value
> > in the next cycle,
>
> Because the right curval is refetched: it already has the FUTEX_OWNER_DIED bit set
> and we successfully take the lock.

Ok, handle_futex_death() is punching the OWNER_DIED bit into the futex
without the hash bucket lock. We might as well grab the hash bucket lock
right there to avoid this. I'll look for a sane solution.

> > The rtmutex code only returns -EDEADLK, when the lock is already held by
> > the task
>
> This case.

Sorry, I was not clear here: not the user space lock; the rtmutex must
be held, or a deadlock situation against another rtmutex must be
detected. There is no way that the exiting code assigns the ownership
of the rtmutex. It solely calls rt_mutex_unlock(), which makes the
highest-priority waiter the _PENDING_ owner, which means the pending
owner still needs to acquire it for real.
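
Schematically (a simplified sketch, not the exact rtmutex internals):

	/*
	 * exiting / unlocking task:
	 *	rt_mutex_unlock(&pi_state->pi_mutex);
	 *	    wakes the highest-priority waiter and makes it the
	 *	    _PENDING_ owner only
	 *
	 * woken waiter (the task blocked in futex_lock_pi()):
	 *	still has to take the rtmutex for real when it gets to run;
	 *	until it does, it owns neither the rtmutex nor the futex.
	 */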

> You need to run only tst-robustpi8 in a loop. It should trigger quickly,
> within a few minutes on an 8-way SMP box here.

My largest box is a 4-way, and it has been running in a 'while true' loop for hours.

> If you want, I can insert whatever debugging printks you need
> and run the test here.

I'll fix up some things in the code first and then add a couple of
debug printks to nail this EDEADLK problem.

tglx


2007-06-05 19:16:07

by Thomas Gleixner

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

On Tue, 2007-06-05 at 20:48 +0200, Thomas Gleixner wrote:
> > > This does not really explain why you prevent the -ESRCH return value
> > > in the next cycle,
> >
> > Because the right curval is refetched: it already has the FUTEX_OWNER_DIED bit set
> > and we successfully take the lock.
>
> Ok, handle_futex_death() is punching the OWNER_DIED bit into the futex
> without the hash bucket lock. We might as well grab the hash bucket lock
> right there to avoid this. I'll look for a sane solution.

We actually need to do something about this, as we might loop forever
there. The robust cleanup code can fail (e.g. due to list corruption),
in which case we would see exit_state != 0 while the OWNER_DIED bit never
gets set, so we would be stuck in a busy loop. I think I have an idea how
to fix this.

tglx


2007-06-05 21:01:50

by Alexey Kuznetsov

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

Hello!

> We actually need to do something about this, as we might loop forever
> there. The robust cleanup code can fail (e.g. due to list corruption),
> in which case we would see exit_state != 0 while the OWNER_DIED bit never
> gets set, so we would be stuck in a busy loop.

Yes...

It is possible to take read_lock(&tasklist_lock) before:

inc_preempt_count();
curval = futex_atomic_cmpxchg_inatomic(uaddr, uval, newval);
dec_preempt_count();

and drop it after lookup_pi_state().

In this case the exiting task will set FUTEX_OWNER_DIED but will
spin in exit_notify(); we will find a valid pi_state and go the slow path,
taking the rtmutex.

Alexey

2007-06-05 21:13:19

by Thomas Gleixner

Subject: Re: [RFC][PATCH] multiple bugs in PI futexes

On Wed, 2007-06-06 at 01:00 +0400, Alexey Kuznetsov wrote:
> Hello!
>
> > We actually need to do something about this, as we might loop forever
> > there. The robust cleanup code can fail (e.g. due to list corruption),
> > in which case we would see exit_state != 0 while the OWNER_DIED bit never
> > gets set, so we would be stuck in a busy loop.
>
> Yes...
>
> It is possible to take read_lock(&tasklist_lock) before:

We really want to avoid the tasklist_lock altogether. It's not trivial, but
I think it's doable. I'll send you a patch to test tomorrow morning, when my
brain has recovered from looking at that code :)

tglx