It was observed occasionally on PowerPC systems that a reader had not
been woken up even though its waiter->task had been cleared.

One probable cause of this missed wakeup is that waiter->task and the
task state have not been properly synchronized: the release-acquire
pair of two different locks in the wakeup code path does not provide a
full memory barrier guarantee. An smp_mb() is therefore added after
waiter->task is set to NULL to provide a proper memory barrier for
synchronization.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/rwsem-xadd.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index e795908..b3c588c 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -209,6 +209,23 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
smp_store_release(&waiter->task, NULL);
}
+ /*
+ * To avoid missed wakeup of reader, we need to make sure
+ * that task state and waiter->task are properly synchronized.
+ *
+ * wakeup sleep
+ * ------ -----
+ * __rwsem_mark_wake: rwsem_down_read_failed*:
+ * [S] waiter->task [S] set_current_state(state)
+ * MB MB
+ * try_to_wake_up:
+ * [L] state [L] waiter->task
+ *
+ * For the wakeup path, the original lock release-acquire pair
+ * does not provide enough guarantee of proper synchronization.
+ */
+ smp_mb();
+
adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
if (list_empty(&sem->wait_list)) {
/* hit end of list above */
--
1.8.3.1
On Tue, 2018-04-10 at 13:22 -0400, Waiman Long wrote:
> It was observed occasionally in PowerPC systems that there was reader
> who had not been woken up but that its waiter->task had been cleared.
>
> One probable cause of this missed wakeup may be the fact that the
> waiter->task and the task state have not been properly synchronized as
> the lock release-acquire pair of different locks in the wakeup code path
> does not provide a full memory barrier guarantee. So smp_store_mb()
> is now used to set waiter->task to NULL to provide a proper memory
> barrier for synchronization.
>
> Signed-off-by: Waiman Long <[email protected]>
That looks right... nothing in either lock or unlock will prevent a
store from going past a later load.
> ---
> kernel/locking/rwsem-xadd.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index e795908..b3c588c 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -209,6 +209,23 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> smp_store_release(&waiter->task, NULL);
> }
>
> + /*
> + * To avoid missed wakeup of reader, we need to make sure
> + * that task state and waiter->task are properly synchronized.
> + *
> + * wakeup sleep
> + * ------ -----
> + * __rwsem_mark_wake: rwsem_down_read_failed*:
> + * [S] waiter->task [S] set_current_state(state)
> + * MB MB
> + * try_to_wake_up:
> + * [L] state [L] waiter->task
> + *
> + * For the wakeup path, the original lock release-acquire pair
> + * does not provide enough guarantee of proper synchronization.
> + */
> + smp_mb();
> +
> adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> if (list_empty(&sem->wait_list)) {
> /* hit end of list above */
On 04/10/2018 01:22 PM, Waiman Long wrote:
> It was observed occasionally in PowerPC systems that there was reader
> who had not been woken up but that its waiter->task had been cleared.
>
> One probable cause of this missed wakeup may be the fact that the
> waiter->task and the task state have not been properly synchronized as
> the lock release-acquire pair of different locks in the wakeup code path
> does not provide a full memory barrier guarantee. So smp_store_mb()
> is now used to set waiter->task to NULL to provide a proper memory
> barrier for synchronization.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/locking/rwsem-xadd.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index e795908..b3c588c 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -209,6 +209,23 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> smp_store_release(&waiter->task, NULL);
> }
>
> + /*
> + * To avoid missed wakeup of reader, we need to make sure
> + * that task state and waiter->task are properly synchronized.
> + *
> + * wakeup sleep
> + * ------ -----
> + * __rwsem_mark_wake: rwsem_down_read_failed*:
> + * [S] waiter->task [S] set_current_state(state)
> + * MB MB
> + * try_to_wake_up:
> + * [L] state [L] waiter->task
> + *
> + * For the wakeup path, the original lock release-acquire pair
> + * does not provide enough guarantee of proper synchronization.
> + */
> + smp_mb();
> +
> adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> if (list_empty(&sem->wait_list)) {
> /* hit end of list above */
Ping!
Any thoughts on this patch?
I am also wondering whether there is a cheaper way to apply the memory
barrier only on architectures that need it.
Cheers,
Longman
Hi Waiman,
On Mon, Apr 23, 2018 at 12:46:12PM -0400, Waiman Long wrote:
> On 04/10/2018 01:22 PM, Waiman Long wrote:
> > It was observed occasionally in PowerPC systems that there was reader
> > who had not been woken up but that its waiter->task had been cleared.
Can you provide more details about these observations? (links to LKML
posts, traces, applications used/micro-benchmarks, ...)
> >
> > One probable cause of this missed wakeup may be the fact that the
> > waiter->task and the task state have not been properly synchronized as
> > the lock release-acquire pair of different locks in the wakeup code path
> > does not provide a full memory barrier guarantee.
I guess that by the "pair of different locks" you mean (sem->wait_lock,
p->pi_lock), right? BTW, __rwsem_down_write_failed_common() is calling
wake_up_q() _before_ releasing the wait_lock: did you intend to exclude
this callsite? (why?)
> So smp_store_mb()
> > is now used to set waiter->task to NULL to provide a proper memory
> > barrier for synchronization.
Mmh; the patch is not introducing an smp_store_mb()... My guess is that
you are thinking of the sequence:
smp_store_release(&waiter->task, NULL);
[...]
smp_mb(); /* added with your patch */
or what am I missing?
> >
> > Signed-off-by: Waiman Long <[email protected]>
> > ---
> > kernel/locking/rwsem-xadd.c | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> > index e795908..b3c588c 100644
> > --- a/kernel/locking/rwsem-xadd.c
> > +++ b/kernel/locking/rwsem-xadd.c
> > @@ -209,6 +209,23 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
> > smp_store_release(&waiter->task, NULL);
> > }
> >
> > + /*
> > + * To avoid missed wakeup of reader, we need to make sure
> > + * that task state and waiter->task are properly synchronized.
> > + *
> > + * wakeup sleep
> > + * ------ -----
> > + * __rwsem_mark_wake: rwsem_down_read_failed*:
> > + * [S] waiter->task [S] set_current_state(state)
> > + * MB MB
> > + * try_to_wake_up:
> > + * [L] state [L] waiter->task
> > + *
> > + * For the wakeup path, the original lock release-acquire pair
> > + * does not provide enough guarantee of proper synchronization.
> > + */
> > + smp_mb();
> > +
> > adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> > if (list_empty(&sem->wait_list)) {
> > /* hit end of list above */
>
> Ping!
>
> Any thought on this patch?
>
> I am wondering if there is a cheaper way to apply the memory barrier
> just on architectures that need it.
try_to_wake_up() does:
raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();
if (!(p->state & state))
My understanding is that this smp_mb__after_spinlock() provides us with
the guarantee you described above. The smp_mb__after_spinlock() should
represent a 'cheaper way' to provide such a guarantee.
If this understanding is correct, the remaining question would be about
whether you want to rely on (and document) the smp_mb__after_spinlock()
in the callsite in question (the comment in wake_up_q()
/*
* wake_up_process() implies a wmb() to pair with the queueing
* in wake_q_add() so as not to miss wakeups.
*/
does not appear to be sufficient...).
Andrea
>
> Cheers,
> Longman
>
On 04/23/2018 04:55 PM, Andrea Parri wrote:
> Hi Waiman,
>
> On Mon, Apr 23, 2018 at 12:46:12PM -0400, Waiman Long wrote:
>> On 04/10/2018 01:22 PM, Waiman Long wrote:
>>> It was observed occasionally in PowerPC systems that there was reader
>>> who had not been woken up but that its waiter->task had been cleared.
> Can you provide more details about these observations? (links to LKML
> posts, traces, applications used/micro-benchmarks, ...)
These were customer-reported problems on the RHEL7 kernel. We have not
been able to reproduce them in-house, but from the symptoms it does
look like a memory synchronization issue, which I believe is also
present in the upstream code.
>>> One probable cause of this missed wakeup may be the fact that the
>>> waiter->task and the task state have not been properly synchronized as
>>> the lock release-acquire pair of different locks in the wakeup code path
>>> does not provide a full memory barrier guarantee.
> I guess that by the "pair of different locks" you mean (sem->wait_lock,
> p->pi_lock), right? BTW, __rwsem_down_write_failed_common() is calling
> wake_up_q() _before_ releasing the wait_lock: did you intend to exclude
> this callsite? (why?)
Yes, I am talking about sem->wait_lock and p->pi_lock.
Right, there is an alternative reader wakeup path in
__rwsem_down_write_failed_common() in addition to the unlock wakeup. I
didn't purposely exclude it; I think it has a similar issue. I will
clarify that in the next version.
>
>> So smp_store_mb()
>>> is now used to set waiter->task to NULL to provide a proper memory
>>> barrier for synchronization.
> Mmh; the patch is not introducing an smp_store_mb()... My guess is that
> you are thinking at the sequence:
>
> smp_store_release(&waiter->task, NULL);
> [...]
> smp_mb(); /* added with your patch */
>
> or what am I missing?
I actually thought about doing that at the beginning. The reasons why I
changed my mind were:
1) If there are multiple readers ready to be woken up, that would issue
multiple smp_mb()s instead of just one here.
2) smp_store_mb() is implemented as a store followed by an smp_mb(), so
it may not be able to ensure that setting the reader's waiter->task to
NULL is the last operation seen by other CPUs, as noted in the comment
above smp_store_release().
For these two reasons, I decided to add a single smp_mb() at the end.
Cheers,
Longman
On Mon, Apr 23, 2018 at 10:55:14PM +0200, Andrea Parri wrote:
> > > + /*
> > > + * To avoid missed wakeup of reader, we need to make sure
> > > + * that task state and waiter->task are properly synchronized.
> > > + *
> > > + * wakeup sleep
> > > + * ------ -----
> > > + * __rwsem_mark_wake: rwsem_down_read_failed*:
> > > + * [S] waiter->task [S] set_current_state(state)
> > > + * MB MB
> > > + * try_to_wake_up:
> > > + * [L] state [L] waiter->task
> > > + *
> > > + * For the wakeup path, the original lock release-acquire pair
> > > + * does not provide enough guarantee of proper synchronization.
> > > + */
> > > + smp_mb();
> > > +
> > > adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
> > > if (list_empty(&sem->wait_list)) {
> > > /* hit end of list above */
> >
> try_to_wake_up() does:
>
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> smp_mb__after_spinlock();
> if (!(p->state & state))
>
> My understanding is that this smp_mb__after_spinlock() provides us with
> the guarantee you described above. The smp_mb__after_spinlock() should
> represent a 'cheaper way' to provide such a guarantee.
Right, I don't see what problem is being fixed here either. The scenario
in the comment is already closed by the smp_mb__after_spinlock() you
mention.
And it is fine to rely on that, we do in other places.
> If this understanding is correct, the remaining question would be about
> whether you want to rely on (and document) the smp_mb__after_spinlock()
> in the callsite in question (the comment in wake_up_q()
>
> /*
> * wake_up_process() implies a wmb() to pair with the queueing
> * in wake_q_add() so as not to miss wakeups.
> */
>
So that comment is about the ordering required for wake_q_add() vs
wake_up_q(). But yes, wmb is a little confusing. I suppose I was
thinking of the NULL store vs the wakeup (store), but that doesn't
really make much sense.
And wake_up_process() being a full mb means it also implies a wmb, if
that is all that is required for the scenario at hand.
On 04/24/2018 05:15 AM, Peter Zijlstra wrote:
> On Mon, Apr 23, 2018 at 10:55:14PM +0200, Andrea Parri wrote:
>>>> + /*
>>>> + * To avoid missed wakeup of reader, we need to make sure
>>>> + * that task state and waiter->task are properly synchronized.
>>>> + *
>>>> + * wakeup sleep
>>>> + * ------ -----
>>>> + * __rwsem_mark_wake: rwsem_down_read_failed*:
>>>> + * [S] waiter->task [S] set_current_state(state)
>>>> + * MB MB
>>>> + * try_to_wake_up:
>>>> + * [L] state [L] waiter->task
>>>> + *
>>>> + * For the wakeup path, the original lock release-acquire pair
>>>> + * does not provide enough guarantee of proper synchronization.
>>>> + */
>>>> + smp_mb();
>>>> +
>>>> adjustment = woken * RWSEM_ACTIVE_READ_BIAS - adjustment;
>>>> if (list_empty(&sem->wait_list)) {
>>>> /* hit end of list above */
>> try_to_wake_up() does:
>>
>> raw_spin_lock_irqsave(&p->pi_lock, flags);
>> smp_mb__after_spinlock();
>> if (!(p->state & state))
>>
>> My understanding is that this smp_mb__after_spinlock() provides us with
>> the guarantee you described above. The smp_mb__after_spinlock() should
>> represent a 'cheaper way' to provide such a guarantee.
> Right, I don't see what problem is being fixed here either. The scenario
> in the comment is already closed by the smp_mb__after_spinlock() you
> mention.
>
> And it is fine to rely on that, we do in other places.
Right, I missed the smp_mb__after_spinlock(). So the upstream code is
fine after all. Sorry for the noise.
Cheers,
Longman