On systems with weaker memory ordering (e.g. Power), commit dbfb089d360b
("sched: Fix loadavg accounting race") causes increasing values of load
average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
CPU not always seeing the write to task->sched_contributes_to_load in
__schedule(). Missing that, we fail to decrement nr_uninterruptible when
waking up a task which incremented nr_uninterruptible when it slept.
The rq->lock serialization is insufficient across different rq->locks.
Add an smp_wmb() to __schedule() and an smp_rmb() before the read in
ttwu_do_activate().
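In outline, the ordering being enforced is (a sketch of the two sides, not
the literal code):

    __schedule()                             ttwu_do_activate()
      p->sched_contributes_to_load = X         R1 = p->on_rq
      smp_wmb()                                smp_rmb()
      p->on_rq = Y                             R2 = p->sched_contributes_to_load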
Fixes: dbfb089d360b ("sched: Fix loadavg accounting race")
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: <[email protected]>
Signed-off-by: Phil Auld <[email protected]>
---
kernel/sched/core.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ca80df205ce..ced7074716eb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
lockdep_assert_held(&rq->lock);
+ /* Pairs with smp_wmb in __schedule() */
+ smp_rmb();
if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;
@@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
+ /*
+ * Make sure the previous write is ordered before p->on_rq etc so
+ * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
+ */
+ smp_wmb();
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
--
2.18.0
Hi Peter,
On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
> On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > ("sched: Fix loadavg accounting race") causes increasing values of load
> > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > CPU not always seeing the write to task->sched_contributes_to_load in
> > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > waking up a task which incremented nr_uninterruptible when it slept.
> >
> > The rq->lock serialization is insufficient across different rq->locks.
> >
> > Add smp_wmb() to schedule and smp_rmb() before the read in
> > ttwu_do_activate().
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 4ca80df205ce..ced7074716eb 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> >
> > lockdep_assert_held(&rq->lock);
> >
> > + /* Pairs with smp_wmb in __schedule() */
> > + smp_rmb();
> > if (p->sched_contributes_to_load)
> > rq->nr_uninterruptible--;
> >
>
> Is this really needed ?! (this question is a big fat clue the comment is
> insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> and hence the p->sched_contributes_to_load must already happen after.
>
Yes, it is needed. We've got idle Power systems with a load average of 530.21.
calc_load_tasks is 530, and the sum of both nr_uninterruptible and
calc_load_active across all the runqueues is 530. The load average is
basically monotonically non-decreasing. With the patch this no longer happens.
We need the sched_contributes_to_load store to "happen before" the p->on_rq
store so that it's seen on the other CPU in the wakeup path.
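To make that concrete, here is a rough userspace analogue of the guarantee
the patch is after, with C11 fences standing in for smp_wmb()/smp_rmb()
(illustration only, obviously not the kernel code; the variable names just
mirror the fields involved):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Stand-ins for the task fields involved in the race. */
static int contributes_to_load;         /* plain store, like the bitfield */
static atomic_int on_rq = 1;

static void *schedule_side(void *arg)
{
        (void)arg;
        contributes_to_load = 1;                        /* prev->sched_contributes_to_load = ... */
        atomic_thread_fence(memory_order_release);      /* stands in for the smp_wmb() */
        atomic_store_explicit(&on_rq, 0,
                              memory_order_relaxed);    /* deactivate_task(): p->on_rq = 0 */
        return NULL;
}

static void *wakeup_side(void *arg)
{
        (void)arg;
        /* ttwu() only proceeds to enqueue once it sees the task off the rq. */
        while (atomic_load_explicit(&on_rq, memory_order_relaxed))
                ;
        atomic_thread_fence(memory_order_acquire);      /* stands in for the smp_rmb() */
        if (contributes_to_load)                        /* must now see the store above */
                printf("decrement nr_uninterruptible\n");
        else
                printf("store not visible -> nr_uninterruptible leaks\n");
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, schedule_side, NULL);
        pthread_create(&b, NULL, wakeup_side, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Without the two fences this is the classic message-passing pattern, which
Power is allowed to reorder, and that failure mode is exactly the accounting
leak we're seeing.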
> > @@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
> > !(prev_state & TASK_NOLOAD) &&
> > !(prev->flags & PF_FROZEN);
> >
> > + /*
> > + * Make sure the previous write is ordered before p->on_rq etc so
> > + * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
> > + */
> > + smp_wmb();
> > if (prev->sched_contributes_to_load)
> > rq->nr_uninterruptible++;
>
> That comment is terrible, look at all the other barrier comments around
> there for clues; in effect you're worrying about:
>
> p->sched_contributes_to_load = X R1 = p->on_rq
> WMB RMB
> p->on_rq = Y R2 = p->sched_contributes_to_load
>
> Right?
The only way I can see that decrement being missed is if the write to
sched_contributes_to_load is not being seen on the wakeup CPU.
Before the previous patch, the _state condition was checked again on the
wakeup CPU, and that check is ordered.
>
>
> Bah bah bah.. I so detest having to add barriers here for silly
> accounting. Let me think about this a little.
>
>
Thanks,
Phil
--
On Thu, Jul 08, 2021 at 09:54:06AM +0200 Peter Zijlstra wrote:
> On Thu, Jul 08, 2021 at 09:48:03AM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 08, 2021 at 09:26:26AM +0200, Peter Zijlstra wrote:
> > > On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > > > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > > > ("sched: Fix loadavg accounting race") causes increasing values of load
> > > > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > > > CPU not always seeing the write to task->sched_contributes_to_load in
> > > > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > > > waking up a task which incremented nr_uninterruptible when it slept.
> > > >
> > > > The rq->lock serialization is insufficient across different rq->locks.
> > > >
> > > > Add smp_wmb() to schedule and smp_rmb() before the read in
> > > > ttwu_do_activate().
> > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 4ca80df205ce..ced7074716eb 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > > >
> > > > lockdep_assert_held(&rq->lock);
> > > >
> > > > + /* Pairs with smp_wmb in __schedule() */
> > > > + smp_rmb();
> > > > if (p->sched_contributes_to_load)
> > > > rq->nr_uninterruptible--;
> > > >
> > >
> > > Is this really needed ?! (this question is a big fat clue the comment is
> > > insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> > > and hence the p->sched_contributes_to_load must already happen after.
> > >
> > > > @@ -5084,6 +5086,11 @@ static void __sched notrace __schedule(bool preempt)
> > > > !(prev_state & TASK_NOLOAD) &&
> > > > !(prev->flags & PF_FROZEN);
> > > >
> > > > + /*
> > > > + * Make sure the previous write is ordered before p->on_rq etc so
> > > > + * that it is visible to other cpus in the wakeup path (ttwu_do_activate()).
> > > > + */
> > > > + smp_wmb();
> > > > if (prev->sched_contributes_to_load)
> > > > rq->nr_uninterruptible++;
> > >
> > > That comment is terrible, look at all the other barrier comments around
> > > there for clues; in effect you're worrying about:
> > >
> > > p->sched_contributes_to_load = X R1 = p->on_rq
> > > WMB RMB
> > > p->on_rq = Y R2 = p->sched_contributes_to_load
> > >
> > > Right?
> > >
> > >
> > > Bah bah bah.. I so detest having to add barriers here for silly
> > > accounting. Let me think about this a little.
> >
> > I got the below:
> >
> > __schedule() ttwu()
> >
> > rq_lock() raw_spin_lock(&p->pi_lock)
> > smp_mb__after_spinlock(); smp_mb__after_spinlock();
> >
> > p->sched_contributes_to_load = X; if (READ_ONCE(p->on_rq) && ...)
> > goto unlock;
> > smp_acquire__after_ctrl_dep();
> >
> > smp_cond_load_acquire(&p->on_cpu, !VAL)
> >
> > deactivate_task()
> > p->on_rq = 0;
> >
> > context_switch()
> > finish_task_switch()
> > finish_task()
> > smp_store_release(p->on_cpu, 0);
> >
> > ttwu_queue()
> > rq_lock()
> > ttwu_do_activate()
> > if (p->sched_contributes_to_load)
> > ...
> > rq_unlock()
> > raw_spin_unlock(&p->pi_lock);
> > finish_lock_switch()
> > rq_unlock();
> >
> >
> >
> > The only way for ttwu() to end up in an enqueue, is if it did a
> > LOAD-ACQUIRE on ->on_cpu,
>
> That's not completely true; there's the WF_ON_CPU case, but in that
> scenario we IPI the CPU doing __schedule and it becomes simple UP/PO and
> everything must trivially work.
>
> > but that orders with the STORE-RELEASE on the
> > same, which ensures the p->sched_contributes_to_load LOAD must happen
> > after the STORE.
> >
> > What am I missing? Your Changelog/comments provide insufficient clues..
>
Sorry... I don't have a nice diagram. I'm still looking at what all those
macros actually mean on the various architectures.
"Works great in practice but how does it work in theory?" :)
Using what you have above I get the same thing. It looks like it should be
ordered, but in practice it's not, and ordering it "more", as I did in the
patch, fixes it.
Is it possible that the bit field is causing some of the assumptions about
ordering in those various macros to be off?
I notice that in all the comments around smp_mb__after_spinlock() etc. it's
always WRITE_ONCE()/READ_ONCE() on the variables in question, but we can't
do that with the bit field.
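For instance, something like the following (layout simplified, names only for
illustration) has no READ_ONCE()/WRITE_ONCE() equivalent:

struct flags {
        unsigned int sched_contributes_to_load:1;
        unsigned int other_flag:1;      /* shares the containing word */
};

static void mark_contributes(struct flags *f)
{
        /*
         * READ_ONCE()/WRITE_ONCE() are built around a cast of &(x), and
         * taking the address of a bit-field is not allowed, so
         * WRITE_ONCE(f->sched_contributes_to_load, 1) simply doesn't
         * compile.  All we can do is a plain store, which the compiler
         * implements as a read-modify-write of the containing word:
         */
        f->sched_contributes_to_load = 1;
}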
I appreciate your time on this.
Cheers,
Phil
--
On Fri, Jul 09, 2021 at 01:38:20PM +0200 Peter Zijlstra wrote:
> On Thu, Jul 08, 2021 at 09:25:45AM -0400, Phil Auld wrote:
> > Hi Peter,
> >
> > On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
> > > On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > > > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > > > ("sched: Fix loadavg accounting race") causes increasing values of load
> > > > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > > > CPU not always seeing the write to task->sched_contributes_to_load in
> > > > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > > > waking up a task which incremented nr_uninterruptible when it slept.
> > > >
> > > > The rq->lock serialization is insufficient across different rq->locks.
> > > >
> > > > Add smp_wmb() to schedule and smp_rmb() before the read in
> > > > ttwu_do_activate().
> > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 4ca80df205ce..ced7074716eb 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > > >
> > > > lockdep_assert_held(&rq->lock);
> > > >
> > > > + /* Pairs with smp_wmb in __schedule() */
> > > > + smp_rmb();
> > > > if (p->sched_contributes_to_load)
> > > > rq->nr_uninterruptible--;
> > > >
> > >
> > > Is this really needed ?! (this question is a big fat clue the comment is
> > > insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> > > and hence the p->sched_contributes_to_load must already happen after.
> > >
> >
> > Yes, it is needed. We've got idle power systems with load average of 530.21.
> > Calc_load_tasks is 530, and the sum of both nr_uninterruptible and
> > calc_load_active across all the runqueues is 530. Basically monotonically
> > non-decreasing load average. With the patch this no longer happens.
>
> Have you tried without the rmb here? Do you really need both barriers?
>
No and no. It was originally a READ_ONCE(), which didn't compile on the bit
field. I was trying to pair them up, but I did think that was overkill. I'll
try that as soon as I get back to it.
Thanks,
Phil
--
Hi Peter,
On Fri, Jul 09, 2021 at 02:57:10PM +0200 Peter Zijlstra wrote:
> On Thu, Jul 08, 2021 at 10:54:58AM -0400, Phil Auld wrote:
> > Sorry... I don't have a nice diagram. I'm still looking at what all those
> > macros actually mean on the various architectures.
>
> Don't worry about other architectures, let's focus on Power, because
> that's the case where you can reproduce funnies. Now Power only has 2
> barrier ops (not quite true, but close enough for all this):
>
> - SYNC is the full barrier
>
> - LWSYNC is a TSO like barrier
>
> Pretty much everything (LOAD-ACQUIRE, STORE-RELEASE, WMB, RMB) uses
> LWSYNC. Only MB results in SYNC.
>
> Power is 'funny' because their spinlocks are weaker than everybody
> else's, but AFAICT that doesn't seem relevant here.
>
Thanks.
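Just so I have the mapping straight (taking this from your description, not
from the arch headers):

    smp_mb()                                  -> SYNC
    smp_rmb(), smp_wmb()                      -> LWSYNC
    smp_load_acquire(), smp_store_release()   -> LWSYNC + the access itself

i.e. nothing in these paths short of a full smp_mb() emits a SYNC.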
> > Using what you have above I get the same thing. It looks like it should be
> > ordered but in practice it's not, and ordering it "more" as I did in the
> > patch, fixes it.
>
> And you're running Linus' tree, not some franken-kernel from RHT, right?
> As asked in that other email, can you try with just the WMB added? I
> really don't believe that RMB you added can make a difference.
So, no. Right now the reproducer is on the franken-kernel :(
As far as I can tell the relevant code paths (schedule, barriers, wakeup,
etc.) are all current and the same. I traced through your diagram and
it all matches exactly.
I have a suspicion that Linus's tree may hide it. I believe this is tickled
by NFS I/O, which I _think_ is affected by the unbound workqueue changes
that may make it less likely to do the wakeup on a different CPU. But
that's just speculation.
The issue is that the systems under test here are in a partner's lab
to which I have no direct access.
I will try to get an upstream build on there, if possible, as soon
as I can.
>
> Also, can you try with TTWU_QUEUE disabled (without any additional
> barriers added), that simplifies the wakeup path a lot.
>
Will do.
> > Is it possible that the bit field is causing some of the assumptions about
> > ordering in those various macros to be off?
>
> *should* not matter...
>
> prev->sched_contributes_to_load = X;
>
> smp_store_release(&prev->on_cpu, 0);
> asm("LWSYNC" : : : "memory");
> WRITE_ONCE(prev->on_cpu, 0);
>
> due to that memory clobber, the compiler must emit whatever stores are
> required for the bitfield prior to the LWSYNC.
>
> > I notice in all the comments about smp_mb__after_spinlock etc, it's always
> > WRITE_ONCE/READ_ONCE on the variables in question but we can't do that with
> > the bit field.
>
> Yeah, but both ->on_rq and ->sched_contributes_to_load are 'normal'
> stores. That said, given that ttwu() does a READ_ONCE() on ->on_rq, we
> should match that with WRITE_ONCE()...
>
> So I think we should do the below, but I don't believe it'll make a
> difference. Let me stare more.
>
I'm out of the office for the next week+, so don't stare too hard. I'll try to
get the tests you asked for as soon as I get back in the (home) office.
I'm not sure the below will make a difference either, but will try it too.
Thanks again for the help. And sorry for the timing.
Cheers,
Phil
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca9a523c9a6c..da93551b298d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1973,12 +1973,12 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> enqueue_task(rq, p, flags);
>
> - p->on_rq = TASK_ON_RQ_QUEUED;
> + WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
> }
>
> void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> - p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> + WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);
>
> dequeue_task(rq, p, flags);
> }
> @@ -5662,11 +5662,11 @@ static bool try_steal_cookie(int this, int that)
> if (p->core_occupation > dst->idle->core_occupation)
> goto next;
>
> - p->on_rq = TASK_ON_RQ_MIGRATING;
> + WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
> deactivate_task(src, p, 0);
> set_task_cpu(p, this);
> activate_task(dst, p, 0);
> - p->on_rq = TASK_ON_RQ_QUEUED;
> + WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
>
> resched_curr(dst);
>
>
--
On Fri, Jul 09, 2021 at 01:38:20PM +0200 Peter Zijlstra wrote:
> On Thu, Jul 08, 2021 at 09:25:45AM -0400, Phil Auld wrote:
> > Hi Peter,
> >
> > On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
> > > On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > > > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > > > ("sched: Fix loadavg accounting race") causes increasing values of load
> > > > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > > > CPU not always seeing the write to task->sched_contributes_to_load in
> > > > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > > > waking up a task which incremented nr_uninterruptible when it slept.
> > > >
> > > > The rq->lock serialization is insufficient across different rq->locks.
> > > >
> > > > Add smp_wmb() to schedule and smp_rmb() before the read in
> > > > ttwu_do_activate().
> > >
> > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > index 4ca80df205ce..ced7074716eb 100644
> > > > --- a/kernel/sched/core.c
> > > > +++ b/kernel/sched/core.c
> > > > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > > >
> > > > lockdep_assert_held(&rq->lock);
> > > >
> > > > + /* Pairs with smp_wmb in __schedule() */
> > > > + smp_rmb();
> > > > if (p->sched_contributes_to_load)
> > > > rq->nr_uninterruptible--;
> > > >
> > >
> > > Is this really needed ?! (this question is a big fat clue the comment is
> > > insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> > > and hence the p->sched_contributes_to_load must already happen after.
> > >
> >
> > Yes, it is needed. We've got idle power systems with load average of 530.21.
> > Calc_load_tasks is 530, and the sum of both nr_uninterruptible and
> > calc_load_active across all the runqueues is 530. Basically monotonically
> > non-decreasing load average. With the patch this no longer happens.
>
> Have you tried without the rmb here? Do you really need both barriers?
>
You're right here. (I see now that you were asking about the rmb specifically
in the first question.) The rmb is not needed.
I was unable to reproduce it with the upstream kernel. I still think it is
a problem, though, since the code in question is all the same. The recent
changes to unbound workqueues, which make it more likely to run on the
submitting CPU, may be masking the problem, since it obviously requires
multiple CPUs to hit.
If I can isolate those changes I can try to revert them in upstream and
see if I can get it there.
I suppose pulling those changes back could get us past this but
I'm not a fan of just hiding it by making it harder to hit.
I've not gotten to the disable-TTWU_QUEUE test; that's next...
Cheers,
Phil
--
On Fri, Jul 23, 2021 at 09:38:44AM -0400 Phil Auld wrote:
> On Fri, Jul 09, 2021 at 01:38:20PM +0200 Peter Zijlstra wrote:
> > On Thu, Jul 08, 2021 at 09:25:45AM -0400, Phil Auld wrote:
> > > Hi Peter,
> > >
> > > On Thu, Jul 08, 2021 at 09:26:26AM +0200 Peter Zijlstra wrote:
> > > > On Wed, Jul 07, 2021 at 03:04:57PM -0400, Phil Auld wrote:
> > > > > On systems with weaker memory ordering (e.g. power) commit dbfb089d360b
> > > > > ("sched: Fix loadavg accounting race") causes increasing values of load
> > > > > average (via rq->calc_load_active and calc_load_tasks) due to the wakeup
> > > > > CPU not always seeing the write to task->sched_contributes_to_load in
> > > > > __schedule(). Missing that we fail to decrement nr_uninterruptible when
> > > > > waking up a task which incremented nr_uninterruptible when it slept.
> > > > >
> > > > > The rq->lock serialization is insufficient across different rq->locks.
> > > > >
> > > > > Add smp_wmb() to schedule and smp_rmb() before the read in
> > > > > ttwu_do_activate().
> > > >
> > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > > > index 4ca80df205ce..ced7074716eb 100644
> > > > > --- a/kernel/sched/core.c
> > > > > +++ b/kernel/sched/core.c
> > > > > @@ -2992,6 +2992,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > > > >
> > > > > lockdep_assert_held(&rq->lock);
> > > > >
> > > > > + /* Pairs with smp_wmb in __schedule() */
> > > > > + smp_rmb();
> > > > > if (p->sched_contributes_to_load)
> > > > > rq->nr_uninterruptible--;
> > > > >
> > > >
> > > > Is this really needed ?! (this question is a big fat clue the comment is
> > > > insufficient). AFAICT try_to_wake_up() has a LOAD-ACQUIRE on p->on_rq
> > > > and hence the p->sched_contributes_to_load must already happen after.
> > > >
> > >
> > > Yes, it is needed. We've got idle power systems with load average of 530.21.
> > > Calc_load_tasks is 530, and the sum of both nr_uninterruptible and
> > > calc_load_active across all the runqueues is 530. Basically monotonically
> > > non-decreasing load average. With the patch this no longer happens.
> >
> > Have you tried without the rmb here? Do you really need both barriers?
> >
>
> You're right here. (I see now that you were asking about the rmb specifically
> in the first question) The rmb is not needed.
>
> I was unable to reproduce it with the upstream kernel. I still think it is
> a problem though since the code in question is all the same. The recent
> changes to unbound workqueues which make it more likely to run on the
> submitting cpu may be masking the problem since it obviously requires
> multiple cpus to hit.
>
> If I can isolate those changes I can try to revert them in upstream and
> see if I can get it there.
>
> I suppose pulling those changes back could get us past this but
> I'm not a fan of just hiding it by making it harder to hit.
>
> I've not gotten to the disable TTWU_QUEUE test, that's next...
>
ETOOMANYTREES
Sorry for the noise. I was using the wrong tree to compare with upstream.
The offending tree is missing f97bb5272d9e ("sched: Fix data-race in wakeup").
I thought the loadavg increase sounded familiar...
Cheers,
Phil
>
> Cheers,
> Phil
>
> --
>
--