2005-12-23 16:38:19

by Jack Steiner

Subject: [PATCH] - Fix memory ordering problem in wake_futex()


Here is a fix for an ugly race condition that occurs in wake_futex() on IA64.

On IA64, locks are released using a "st.rel" instruction. This ensures that
preceding "stores" are visible before the lock is released but does NOT prevent
a "store" that follows the "st.rel" from becoming visible before the "st.rel".
The result is that the task that owns the futex_q continues prematurely.

The failure I saw was that the task that owned the futex_q resumed prematurely and
was context-switched off of the cpu. The task's switch_stack occupied the same
space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
switch_stack. When the task resumed, it ran with a corrupted ar.bspstore.
Things went downhill from there.

Without the fix, the application fails roughly every 10 minutes. With
the fix, it ran 16 hours without a failure.

----
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr
in wake_futex() can become visible before wake_up_all() clears the lock in the
futex_q.


Signed-off-by: Jack Steiner <[email protected]>





Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c 2005-12-22 15:05:43.821889257 -0600
+++ linux/kernel/futex.c 2005-12-22 15:30:21.617973325 -0600
@@ -287,7 +287,13 @@ static void wake_futex(struct futex_q *q
 	/*
 	 * The waiting task can free the futex_q as soon as this is written,
 	 * without taking any locks. This must come last.
+	 *
+	 * A memory barrier is required here to prevent the following store
+	 * to lock_ptr from getting ahead of the wakeup. Clearing the lock
+	 * at the end of wake_up_all() does not prevent this store from
+	 * moving.
 	 */
+	wmb();
 	q->lock_ptr = NULL;
 }
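
To make the race concrete, here is a minimal C sketch of the two sides
involved, paraphrased from the 2.6-era futex code rather than quoted
verbatim (the waiter side is summarized in the comment):

	/* Waker side: wake_futex(), simplified. */
	static void wake_futex(struct futex_q *q)
	{
		list_del_init(&q->list);
		wake_up_all(&q->waiters);	/* locks, then unlocks, the wait queue */
		/*
		 * On IA64 the spin_unlock inside wake_up_all() is an st.rel,
		 * a one-way barrier, so without an explicit barrier this
		 * store may become visible BEFORE that unlock.
		 */
		q->lock_ptr = NULL;		/* waiter may now free the futex_q */
	}

	/*
	 * Waiter side: the futex_q lives on the waiter's kernel stack.  Once
	 * the waiter sees q->lock_ptr == NULL it returns from futex_wait()
	 * and reuses that stack memory, so any store that reaches the
	 * futex_q after that point lands in whatever now occupies the same
	 * addresses, in the failure above a switch_stack.
	 */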


2005-12-23 17:02:33

by Joe Seigh

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Jack Steiner wrote:
> Here is a fix for an ugly race condition that occurs in wake_futex() on IA64.
>
> On IA64, locks are released using a "st.rel" instruction. This ensures that
> preceding "stores" are visible before the lock is released but does NOT prevent
> a "store" that follows the "st.rel" from becoming visible before the "st.rel".
> The result is that the task that owns the futex_q continues prematurely.
>
> The failure I saw was that the task that owned the futex_q resumed prematurely and
> was context-switched off of the cpu. The task's switch_stack occupied the same
> space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
> switch_stack. When the task resumed, it ran with a corrupted ar.bspstore.
> Things went downhill from there.
>
> Without the fix, the application fails roughly every 10 minutes. With
> the fix, it ran 16 hours without a failure.
>

I'm not sure I understand. Mutex semantics allow for memory accesses to
be moved into the critical region but not vice versa. This is true for Java,
and it's pretty much agreed by all the "experts" that it would be allowed in
Posix if there were such a thing as a formal definition of mutex semantics
in Posix. It would also seem to be the reason why Intel designed the
release semantics that way, so the hardware, not just the compiler, could
move code into the critical region for better performance.

So I suspect the problem is something else.


--
Joe Seigh
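
Joe's point, that accesses are allowed to move into the critical region,
can be illustrated with a small sketch (hypothetical lock and variables,
not kernel code):

	/*
	 * Release semantics are one-way ("roach motel"): accesses before
	 * the unlock may not move past it, but accesses after it MAY move
	 * into the critical region.
	 */
	spin_lock(&lock);
	shared_data = 1;	/* guaranteed visible before the unlock */
	spin_unlock(&lock);	/* st.rel on IA64: one-way barrier only */
	done_flag = 1;		/* may become visible BEFORE the unlock */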

2005-12-23 20:49:23

by Olof Johansson

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

On Fri, Dec 23, 2005 at 10:38:16AM -0600, Jack Steiner wrote:
>
> Here is a fix for an ugly race condition that occurs in wake_futex() on IA64.
>
> On IA64, locks are released using a "st.rel" instruction. This ensures that
> preceding "stores" are visible before the lock is released but does NOT prevent
> a "store" that follows the "st.rel" from becoming visible before the "st.rel".
> The result is that the task that owns the futex_q continues prematurely.
>
> The failure I saw was that the task that owned the futex_q resumed prematurely and
> was context-switched off of the cpu. The task's switch_stack occupied the same
> space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
> switch_stack. When the task resumed, it ran with a corrupted ar.bspstore.
> Things went downhill from there.
>
> Without the fix, the application fails roughly every 10 minutes. With
> the fix, it ran 16 hours without a failure.

So what happened to what the comment 10 lines above your patch says?

/*
* The lock in wake_up_all() is a crucial memory barrier after
* the list_del_init() and also before assigning to q->lock_ptr.
*/

On PPC64, the spinlock unlock path has a sync in there for the very
purpose of adding the write barrier. Maybe the ia64 unlock path is
missing something similar?


-Olof

2005-12-23 21:32:24

by Jack Steiner

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

On Fri, Dec 23, 2005 at 02:48:22PM -0600, Olof Johansson wrote:
> On Fri, Dec 23, 2005 at 10:38:16AM -0600, Jack Steiner wrote:
> >
> > Here is a fix for an ugly race condition that occurs in wake_futex() on IA64.
> >
> > On IA64, locks are released using a "st.rel" instruction. This ensures that
> > preceding "stores" are visible before the lock is released but does NOT prevent
> > a "store" that follows the "st.rel" from becoming visible before the "st.rel".
> > The result is that the task that owns the futex_q continues prematurely.
> >
> > The failure I saw was that the task that owned the futex_q resumed prematurely and
> > was context-switched off of the cpu. The task's switch_stack occupied the same
> > space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
> > switch_stack. When the task resumed, it ran with a corrupted ar.bspstore.
> > Things went downhill from there.
> >
> > Without the fix, the application fails roughly every 10 minutes. With
> > the fix, it ran 16 hours without a failure.
>
> So what happened to what the comment 10 lines above your patch says?
>
> /*
> * The lock in wake_up_all() is a crucial memory barrier after
> * the list_del_init() and also before assigning to q->lock_ptr.
> */
>
> On PPC64, the spinlock unlock path has a sync in there for the very
> purpose of adding the write barrier. Maybe the ia64 unlock path is
> missing something similar?

On IA64, the "sync" instructions are actually part of the ld.acq or st.rel
instructions that are used to set/clear spinlocks.

The program order of the instructions is:

cmpxchg.acq // set lock
..
.. do things
..
st.rel // release lock

st NULL -> lock_ptr // set flag to allow futex_q to be freed


IA64 implements fencing of ld.acq or st.rel instructions as one-directional
barriers.

For example, st.rel allows stores that FOLLOW the st.rel to be reordered
and become visible before the st.rel.
(I hope this picture survives the various mailers)


	   ld  |
	   st  |   earlier accesses may NOT
	       v   move below the st.rel
	st.rel ------------------------------------ fence (one way only)
	       ^
	   ld  |   later accesses MAY
	   st  |   move above the st.rel

I don't understand PPC64 but I suspect it has different memory ordering semantics.

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2005-12-23 22:00:24

by Olof Johansson

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

On Fri, Dec 23, 2005 at 03:32:16PM -0600, Jack Steiner wrote:

> On IA64, the "sync" instructions are actually part of the ld.acq or st.rel
> instructions that are used to set/clear spinlocks.
[...]
> IA64 implements fencing of ld.acq or st.rel instructions as one-directional
> barriers.

So ia64 spin_unlock doesn't do store-store ordering across it. I'm
surprised this is the first time it has caused problems. Other
architectures seem to order:

* sparc64 does a membar StoreStore|LoadStore
* powerpc does lwsync or sync, depending on arch
* alpha does an mb();

* x86 is in-order

So, sounds to me like you need to fix your lock primitives, not add
barriers to generic code?


-Olof
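
What Olof suggests would look roughly like the hypothetical sketch below;
this is illustrative only, since the real ia64 unlock path is hand-written
assembly and the type and field names here are assumptions:

	/* Hypothetical spinlock with an unlock that orders stores in BOTH
	 * directions, unlike a bare st.rel. */
	typedef struct { volatile unsigned int lock; } my_spinlock_t;

	static inline void my_spin_unlock(my_spinlock_t *x)
	{
		smp_mb();	/* earlier stores are visible before the release */
		x->lock = 0;	/* plain store releases the lock */
		smp_mb();	/* later stores cannot move above the release,
				 * which is exactly what a one-way st.rel permits */
	}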

2005-12-23 22:23:27

by Manfred Spraul

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Jack wrote:

>On IA64, locks are released using a "st.rel" instruction. This ensures that
>preceding "stores" are visible before the lock is released but does NOT prevent
>a "store" that follows the "st.rel" from becoming visible before the "st.rel".
>The result is that the task that owns the futex_q continues prematurely.
>
>The failure I saw was that the task that owned the futex_q resumed prematurely and
>was context-switched off of the cpu. The task's switch_stack occupied the same
>space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
>switch_stack.
>
Bad race.
Unfortunately the scenario that you describe is quite frequent:
- autoremove_wake_function()
- ipc/sem.c (search for IN_WAKEUP)
- ipc/msg.c appears to be correct, there are smp_wmb() calls.

--
Manfred

2005-12-23 22:52:52

by Manfred Spraul

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Manfred Spraul wrote:

> Jack wrote:
>
>> On IA64, locks are released using a "st.rel" instruction. This ensures
>> that preceding "stores" are visible before the lock is released but does
>> NOT prevent a "store" that follows the "st.rel" from becoming visible
>> before the "st.rel". The result is that the task that owns the futex_q
>> continues prematurely.
>>
>> The failure I saw was that the task that owned the futex_q resumed
>> prematurely and was context-switched off of the cpu. The task's
>> switch_stack occupied the same space as the futex_q. The store to
>> q->lock_ptr overwrote the ar.bspstore in the switch_stack.
>>
I'm stupid - first I was certain that your description was wrong, I
started a long mail, but then I confused myself and accepted your
description.

Your description can't be correct: the store to q->lock_ptr can't
overwrite ar.bspstore. As long as q->lock_ptr is not NULL, unqueue_me
will spin and prevent a race. But the other race seems to be possible:
q->lock_ptr is set to NULL early, and then the spin_unlock_irqrestore()
in __wake_up could overwrite ar.bspstore.
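
Manfred's point hinges on the waiter-side retry loop. A sketch of
unqueue_me() from kernels of that era (paraphrased, not verbatim) shows
why a non-NULL lock_ptr keeps the waiter from racing ahead:

	/* Waiter side, simplified from kernel/futex.c.  The futex_q lives
	 * on the waiter's kernel stack. */
	static int unqueue_me(struct futex_q *q)
	{
		spinlock_t *lock_ptr;
		int ret = 0;

	retry:
		lock_ptr = q->lock_ptr;
		if (lock_ptr != NULL) {
			spin_lock(lock_ptr);
			/* The waker may change q->lock_ptr between the read
			 * above and the spin_lock(), so re-check. */
			if (unlikely(lock_ptr != q->lock_ptr)) {
				spin_unlock(lock_ptr);
				goto retry;
			}
			if (!list_empty(&q->list)) {
				list_del(&q->list);
				ret = 1;
			}
			spin_unlock(lock_ptr);
		}
		/*
		 * Only when lock_ptr reads NULL does the waiter bypass the
		 * lock and return, reusing the stack memory the futex_q
		 * occupied.  So the store that can land in freed stack space
		 * is not the q->lock_ptr store itself, but the unlock inside
		 * __wake_up() sinking below it.
		 */
		return ret;
	}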

2005-12-23 23:49:08

by Robin Holt

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

On Fri, Dec 23, 2005 at 03:59:16PM -0600, Olof Johansson wrote:
> On Fri, Dec 23, 2005 at 03:32:16PM -0600, Jack Steiner wrote:
>
> > On IA64, the "sync" instructions are actually part of the ld.acq or st.rel
> > instructions that are used to set/clear spinlocks.
> [...]
> > IA64 implements fencing of ld.acq or st.rel instructions as one-directional
> > barriers.
>
> So ia64 spin_unlock doesn't do store-store ordering across it. I'm
> surprised this is the first time this causes problems. Other architectures
> seem to order:
>
> * sparc64 does a membar StoreStore|LoadStore
> * powerpc does lwsync or sync, depending on arch
> * alpha does an mb();
>
> * x86 is in-order
>
> So, sounds to me like you need to fix your lock primitives, not add
> barriers to generic code?

I don't think this is a case which is handled by the typical lock
primitives. Here we essentially have two things being unlocked in
close succession. The first is the wait queue, the second the futex_q.

There is nothing in the typical unlock path which would require unlocks
to be ordered with respect to each other. However, in this case, the
futex_q expects to finish processing the wake_up_all before releasing
the lock_ptr. That is a requirement of wake_futex and not the locking
primitives. If wake_futex() requires it, then it should be responsible
for enforcing that requirement.

I suppose a step in the right direction would be doing a volatile store
to q->lock_ptr. I haven't looked, but that should at least prevent the
clearing of lock_ptr from moving ahead of the wait-queue unlock.

Jack, can you repeat your testing with a cast to volatile on the
q->lock_ptr line? After looking at it some more, shouldn't the
spinlock_t *lock_ptr member in the struct futex_q definition be volatile?


Thanks,
Robin Holt
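
The change Robin is asking Jack to test would look something like the
sketch below. One caveat worth keeping in mind: volatile constrains only
the compiler's reordering, it does not by itself emit a CPU memory
barrier.

	/* In wake_futex(): prevent the compiler from reordering this store
	 * relative to other volatile accesses. */
	*(spinlock_t * volatile *)&q->lock_ptr = NULL;

	/* Or, in the struct futex_q definition: make the pointer member
	 * itself volatile (a volatile pointer to spinlock_t, not a pointer
	 * to volatile spinlock_t). */
	struct futex_q {
		/* ... other members ... */
		spinlock_t * volatile lock_ptr;
	};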

2005-12-24 03:45:17

by Jack Steiner

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

On Fri, Dec 23, 2005 at 11:23:11PM +0100, Manfred Spraul wrote:
> Jack wrote:
>
> >On IA64, locks are released using a "st.rel" instruction. This ensures
> >that preceding "stores" are visible before the lock is released but does
> >NOT prevent a "store" that follows the "st.rel" from becoming visible
> >before the "st.rel". The result is that the task that owns the futex_q
> >continues prematurely.
> >
> >The failure I saw was that the task that owned the futex_q resumed
> >prematurely and was context-switched off of the cpu. The task's
> >switch_stack occupied the same space as the futex_q. The store to
> >q->lock_ptr overwrote the ar.bspstore in the switch_stack.
> >
> Bad race.
> Unfortunately the scenario that you describe is quite frequent:
> - autoremove_wake_function()
> - ipc/sem.c (search for IN_WAKEUP)
> - ipc/msg.c appears to be correct, there are smp_wmb() calls.

Yuck. I agree - both of these look incorrect.

Also, I should have used smp_wmb(), not wmb(). Thanks for
pointing that out.

I wonder how many other spots have the same problem. IIRC, we ran into
similar problems in the tty driver a few years ago but I have not
seen any problems recently.

>
> --
> Manfred

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2005-12-24 13:45:28

by Jack Steiner

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

This patch is identical to the first patch except that I used smp_wmb() instead
of wmb(). Ordering doesn't matter on non-SMP kernels.


Here is a fix for an ugly race condition that occurs in wake_futex() on IA64.

On IA64, locks are released using a "st.rel" instruction. This ensures that
preceding "stores" are visible before the lock is released but does NOT prevent
a "store" that follows the "st.rel" from becoming visible before the "st.rel".
The result is that the task that owns the futex_q continues prematurely.

The failure I saw was that the task that owned the futex_q resumed prematurely and
was context-switched off of the cpu. The task's switch_stack occupied the same
space as the futex_q. The store to q->lock_ptr overwrote the ar.bspstore in the
switch_stack. When the task resumed, it ran with a corrupted ar.bspstore.
Things went downhill from there.

Without the fix, the application fails roughly every 10 minutes. With
the fix, it ran 16 hours without a failure.

----
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr
in wake_futex() can become visible before wake_up_all() clears the lock in the
futex_q.


Signed-off-by: Jack Steiner <[email protected]>





Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c 2005-12-22 15:05:43.821889257 -0600
+++ linux/kernel/futex.c 2005-12-22 15:30:21.617973325 -0600
@@ -287,7 +287,13 @@ static void wake_futex(struct futex_q *q
 	/*
 	 * The waiting task can free the futex_q as soon as this is written,
 	 * without taking any locks. This must come last.
+	 *
+	 * A memory barrier is required here to prevent the following store
+	 * to lock_ptr from getting ahead of the wakeup. Clearing the lock
+	 * at the end of wake_up_all() does not prevent this store from
+	 * moving.
 	 */
+	smp_wmb();
 	q->lock_ptr = NULL;
 }

2005-12-24 18:13:45

by Olof Johansson

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Hi,

On Sat, Dec 24, 2005 at 07:45:23AM -0600, Jack Steiner wrote:
> This patch is identical to the first patch except I used smp_wmb() instead
> of wmb(). Ordering doen't matter on non-SMP kernels.

Ok, I guess I was wrong -- there's no guarantee that protects stuff from
bleeding into a critical region from after it. The comments on lines 54-58
of kernel/wait.c seem to imply this. Nevermind the fact that most other
architectures seem to protect it anyway. :-)

However, while you're at it, please do fix the comment earlier in the
function that implies the unlock does indeed provide enough of a barrier,
since it seems to be incorrect and misleading.


-Olof

2005-12-25 16:02:41

by Manfred Spraul

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Again replying to myself, sorry:

Jack Steiner wrote:

>On Fri, Dec 23, 2005 at 11:23:11PM +0100, Manfred Spraul wrote:
>
>
>>Bad race.
>>Unfortuantely the scenario that you describe is quite frequent:
>>- autoremove_wake_function()
>>
>>
Not a bug, but for very subtle reasons:
There are no writes to the wait_queue structure except the
list_del_init(), and for list_del_init() no memory barriers are required
because finish_wait() uses list_empty_careful(), i.e. the spin_lock() is
only bypassed if both write operations from list_del_init() have completed.
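
The code Manfred is describing looked roughly like this in kernels of that
era (paraphrased from kernel/wait.c, not quoted verbatim):

	void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);
		/*
		 * list_empty_careful() tests both ->next and ->prev, so it
		 * only reports "empty" once BOTH stores made by the waker's
		 * list_del_init() are visible.  Otherwise we take the wait
		 * queue lock and do the removal ourselves, so there is no
		 * window where a late store can hit freed stack memory.
		 */
		if (!list_empty_careful(&wait->task_list)) {
			spin_lock_irqsave(&q->lock, flags);
			list_del_init(&wait->task_list);
			spin_unlock_irqrestore(&q->lock, flags);
		}
	}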

>>- ipc/sem.c (search for IN_WAKEUP)
>>
>>
fixed in -rc7.


--
Manfred

2005-12-27 16:31:07

by Jack Steiner

Subject: Re: [PATCH] - Fix memory ordering problem in wake_futex()

Hi Linus,


Here is a fix for an ugly race condition that occurs in wake_futex(). The
failure was detected on IA64 but may also occur on other architectures.

On IA64, locks are released using a "st.rel" instruction. This ensures that
preceding "stores" are visible before the lock is released but does NOT prevent
a "store" that follows the "st.rel" from becoming visible before the "st.rel".

The failure I saw was that a task that owned a futex_q resumed prematurely and
was context-switched off of the cpu. The task's switch_stack occupied the same
space as the futex_q. The store to q->lock_ptr in futex_wait() overwrote the
ar.bspstore in the switch_stack. When the task resumed, it ran with a corrupted
ar.bspstore. Things went downhill from there.

Without the fix, the application fails roughly every 10 minutes. With
the fix, it ran over 16 hours without a failure.

----
Fix a memory ordering problem that occurs on IA64. The "store" to q->lock_ptr
in wake_futex() can become visible before wake_up_all() clears the lock in the
futex_q.



Signed-off-by: Jack Steiner <[email protected]>





Index: linux/kernel/futex.c
===================================================================
--- linux.orig/kernel/futex.c 2005-12-24 15:09:23.381357908 -0600
+++ linux/kernel/futex.c 2005-12-24 15:14:26.362119396 -0600
@@ -262,15 +262,18 @@ static void wake_futex(struct futex_q *q
 	list_del_init(&q->list);
 	if (q->filp)
 		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
-	/*
-	 * The lock in wake_up_all() is a crucial memory barrier after the
-	 * list_del_init() and also before assigning to q->lock_ptr.
-	 */
 	wake_up_all(&q->waiters);
+
 	/*
 	 * The waiting task can free the futex_q as soon as this is written,
 	 * without taking any locks. This must come last.
+	 *
+	 * A memory barrier is required here to prevent the following store
+	 * to lock_ptr from getting ahead of the wakeup. Clearing the lock
+	 * at the end of wake_up_all() is not a write barrier on all
+	 * architectures.
 	 */
+	smp_wmb();
 	q->lock_ptr = NULL;
 }