Hi,
I got cc'd on an internal bug report filed against 5.8 and 5.9 kernels
that loadavg was "exploding" on arm64 on machines acting as build
servers. It happened on at least two different arm64 variants. That setup
is complex to replicate but fortunately can be reproduced by running
hackbench-process-pipes while heavily overcommitting a machine with 96
logical CPUs and then checking if loadavg drops afterwards. With an
MMTests clone, I reproduced it as follows
./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
Load should drop to 10 after about 10 minutes and it does on x86-64 but
remained at around 200+ on arm64.
The reproduction case simply hammers the scenario where a task can be
descheduling while also being woken by another task at the same time. It
takes a long time to run but it makes the problem very obvious. The
expectation is that loadavg drops back down after hackbench has been
running and saturating the machine for a long time and then stops.
Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
accounting race in the generic case. Later it was documented why the
ordering of when p->sched_contributes_to_load is read/updated relative
to p->on_cpu matters. This is critical when a task is descheduling at the same
time it is being activated on another CPU. While the load/stores happen
under the RQ lock, the RQ lock on its own does not give any guarantees
on the task state.
Over the weekend I convinced myself that it must be because the arm64
implementations of smp_load_acquire and smp_store_release do not appear
to provide acquire/release semantics, because I didn't find anything
arm64-specific that was playing with p->state behind the scheduler's back
(I could have missed it if it was in an assembly portion as I can't
reliably read arm assembler). Similarly, it's not clear why the arm64 implementation
does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
implementation. Even when it was introduced, the arm64 implementation
differed significantly from the arm implementation in terms of what
barriers it used for non-obvious reasons.
Unfortunately, making that work similarly to the arch-independent version
did not help, and it does not help that I know nothing about the arm64
memory model.
I'll be looking again today to see can I find a mistake in the ordering for
how sched_contributes_to_load is handled but again, the lack of knowledge
on the arm64 memory model means I'm a bit stuck and a second set of eyes
would be nice :(
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > I'll be looking again today to see can I find a mistake in the ordering for
> > how sched_contributes_to_load is handled but again, the lack of knowledge
> > on the arm64 memory model means I'm a bit stuck and a second set of eyes
> > would be nice :(
> >
>
> This morning, it's not particularly clear what orders the visibility of
> sched_contributes_to_load in the same way as other task fields in the
> schedule vs try_to_wake_up paths. I thought the rq lock would have ordered
> them but something is clearly off or loadavg would not be getting screwed.
> It could be done with an rmb and wmb (still testing, it hasn't blown up so
> far) but that's far too heavy. smp_load_acquire/smp_store_release on it
> might be sufficient although it's less clear whether arm64 gives the
> necessary guarantees.
>
And smp_* can't be used anyway because sched_contributes_to_load is a
bit field that is not protected by a specific lock, so it's "special".
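
To illustrate what makes it "special" (a hypothetical userspace sketch,
not kernel code): the one-bit members share a single storage unit, so a
plain store to one of them is compiled as a read-modify-write of the
whole unit, and a bit field cannot even have its address taken to feed
it to smp_load_acquire/smp_store_release:

struct flags {
        unsigned a:1;   /* think sched_contributes_to_load */
        unsigned b:1;   /* think sched_remote_wakeup */
};

static void set_a(struct flags *f)
{
        /*
         * Compiles to roughly: load the word, set bit 0, store the
         * word back. An unserialized concurrent write to f->b can be
         * lost in the process.
         */
        f->a = 1;

        /*
         * smp_store_release(&f->a, 1) would not even compile because
         * taking the address of a bit field is invalid C.
         */
}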
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> Similarly, it's not clear why the arm64 implementation
> does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> implementation. Even when it was introduced, the arm64 implementation
> differed significantly from the arm implementation in terms of what
> barriers it used for non-obvious reasons.
This is because ARM64's smp_cond_load_acquire() implementation uses
smp_load_acquire() directly, as opposed to the generic version that uses
READ_ONCE().
This is because ARM64 has a load-acquire instruction, which is highly
optimized, and generally considered cheaper than the smp_rmb() from
smp_acquire__after_ctrl_dep().
Or so I've been led to believe.
On Mon, Nov 16, 2020 at 01:46:57PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > Similarly, it's not clear why the arm64 implementation
> > does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> > implementation. Even when it was introduced, the arm64 implementation
> > differed significantly from the arm implementation in terms of what
> > barriers it used for non-obvious reasons.
>
> This is because ARM64's smp_cond_load_acquire() implementation uses
> smp_load_acquire() directly, as opposed to the generic version that uses
> READ_ONCE().
>
> This is because ARM64 has a load-acquire instruction, which is highly
> optimized, and generally considered cheaper than the smp_rmb() from
> smp_acquire__after_ctrl_dep().
>
> Or so I've been led to believe.
Fair enough. Either way, barriering sched_contributes_to_load "works"
but it's clumsy and may not be guaranteed to be correct. The bits
should have been protected by the rq lock but sched_remote_wakeup is
updated outside of the lock, which might be leading to the adjacent fields
(like sched_contributes_to_load) getting corrupted as per the
"anti-guarantees" in memory-barriers.txt. The rq lock could be conditionally
acquired in __ttwu_queue_wakelist for WF_MIGRATED and explicitly cleared in
sched_ttwu_pending (not tested if this works) but it would also suck to
acquire a remote lock when that's what we're explicitly trying to avoid
in that path.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 01:53:55PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> > On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > > I'll be looking again today to see can I find a mistake in the ordering for
> > > how sched_contributes_to_load is handled but again, the lack of knowledge
> > > on the arm64 memory model means I'm a bit stuck and a second set of eyes
> > > would be nice :(
> > >
> >
> > This morning, it's not particularly clear what orders the visibility of
> > sched_contributes_to_load in the same way as other task fields in the
> > schedule vs try_to_wake_up paths. I thought the rq lock would have ordered
> > them but something is clearly off or loadavg would not be getting screwed.
> > It could be done with an rmb and wmb (still testing, it hasn't blown up so
> > far) but that's far too heavy. smp_load_acquire/smp_store_release on it
> > might be sufficient although it's less clear whether arm64 gives the
> > necessary guarantees.
> >
> > (This is still at the chucking-out-ideas stage as I haven't context
> > switched back in all the memory barrier rules).
>
> IIRC it should be so ordered by ->on_cpu.
>
> We have:
>
> schedule()
> prev->sched_contributes_to_load = X;
> smp_store_release(prev->on_cpu, 0);
>
>
> on the one hand, and:
Ah, my bad, ttwu() itself will of course wait for !p->on_cpu before we
even get here.
> sched_ttwu_pending()
> if (WARN_ON_ONCE(p->on_cpu))
> smp_cond_load_acquire(&p->on_cpu)
>
> ttwu_do_activate()
> if (p->sched_contributes_to_load)
> ...
>
> on the other (for the remote case, which is the only 'interesting' one).
Also see the "Notes on Program-Order guarantees on SMP systems."
comment.
On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> I got cc'd on an internal bug report filed against 5.8 and 5.9 kernels
> that loadavg was "exploding" on arm64 on machines acting as build
> servers. It happened on at least two different arm64 variants. That setup
> is complex to replicate but fortunately can be reproduced by running
> hackbench-process-pipes while heavily overcommitting a machine with 96
> logical CPUs and then checking if loadavg drops afterwards. With an
> MMTests clone, I reproduced it as follows
>
> ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
> for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
>
> Load should drop to 10 after about 10 minutes and it does on x86-64 but
> remained at around 200+ on arm64.
Do you think you could use this to bisect the problem? Also, are you able
to reproduce the issue on any other arm64 machines, or just this one?
> The reproduction case simply hammers the scenario where a task can be
> descheduling while also being woken by another task at the same time. It
> takes a long time to run but it makes the problem very obvious. The
> expectation is that loadavg drops back down after hackbench has been
> running and saturating the machine for a long time and then stops.
>
> Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
> accounting race in the generic case. Later it was documented why the
> ordering of when p->sched_contributes_to_load is read/updated relative
> to p->on_cpu matters. This is critical when a task is descheduling at the same
> time it is being activated on another CPU. While the load/stores happen
> under the RQ lock, the RQ lock on its own does not give any guarantees
> on the task state.
>
> Over the weekend I convinced myself that it must be because the arm64
> implementations of smp_load_acquire and smp_store_release do not appear
> to provide acquire/release semantics, because I didn't find anything
> arm64-specific that was playing with p->state behind the scheduler's back
> (I could have missed it if it was in an assembly portion as I can't
> reliably read arm assembler). Similarly, it's not clear why the arm64 implementation
> does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> implementation. Even when it was introduced, the arm64 implementation
> differed significantly from the arm implementation in terms of what
> barriers it used for non-obvious reasons.
Why would you expect smp_acquire__after_ctrl_dep() to be called as part of
the smp_load_acquire() implementation?
FWIW, arm64 has special instructions for acquire and release (and they
actually provide more order than is strictly needed by Linux), so we
just map acquire/release to those instructions directly. Since these
instructions are not available on most 32-bit cores, the arm implementation
just uses the fence-based implementation.
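
For reference, the two smp_cond_load_acquire() implementations look
roughly like the sketch below (simplified and paraphrased from memory,
so not an exact copy of either header):

/* asm-generic flavour: spin with relaxed loads, then upgrade to ACQUIRE */
#define smp_cond_load_acquire(ptr, cond_expr) ({                  \
        typeof(*ptr) VAL = smp_cond_load_relaxed(ptr, cond_expr); \
        smp_acquire__after_ctrl_dep();                            \
        VAL;                                                      \
})

/* arm64 flavour: every load in the loop is already a load-acquire (LDAR),
 * so no trailing barrier is needed */
#define smp_cond_load_acquire(ptr, cond_expr) ({                  \
        typeof(*ptr) VAL;                                         \
        for (;;) {                                                \
                VAL = smp_load_acquire(ptr);                      \
                if (cond_expr)                                    \
                        break;                                    \
                __cmpwait_relaxed(ptr, VAL);                      \
        }                                                         \
        VAL;                                                      \
})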
Anyway, setting all that aside, I do agree with you that the bitfield usage
in task_struct looks pretty suspicious. For example, in __schedule() we
have:
rq_lock(rq, &rf);
smp_mb__after_spinlock();
...
prev_state = prev->state;
if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
...
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
where deactivate_task() updates p->on_rq directly:
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
so this is _not_ ordered wrt sched_contributes_to_load. But then over in
__ttwu_queue_wakelist() we have:
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
which can be invoked on the try_to_wake_up() path if p->on_rq is first read
as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
updates can race and cause the flags to be corrupted?
Then again, I went through the list of observed KCSAN splats and don't
see this race showing up in there, so perhaps it's serialised by something
I haven't spotted.
Will
On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > I got cc'd on an internal bug report filed against 5.8 and 5.9 kernels
> > that loadavg was "exploding" on arm64 on machines acting as build
> > servers. It happened on at least two different arm64 variants. That setup
> > is complex to replicate but fortunately can be reproduced by running
> > hackbench-process-pipes while heavily overcommitting a machine with 96
> > logical CPUs and then checking if loadavg drops afterwards. With an
> > MMTests clone, I reproduced it as follows
> >
> > ./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
> > for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
> >
> > Load should drop to 10 after about 10 minutes and it does on x86-64 but
> > remained at around 200+ on arm64.
>
> Do you think you could use this to bisect the problem? Also, are you able
> to reproduce the issue on any other arm64 machines, or just this one?
>
I didn't bisect it as I was assuming it was related to c6e7bd7afaeb
("sched/core: Optimize ttwu() spinning on p->on_cpu"), which is something I
would still like to preserve and which was responsible for a loadavg glitch
fixed by dbfb089d360b ("sched: Fix loadavg accounting race") and d136122f5845
("sched: Fix race against ptrace_freeze_trace()").
While *I* can only reproduce it on one machine, I have a bug report saying
it affects others. It's not a single machine issue or a single ARM variant.
> > The reproduction case simply hammers the scenario where a task can be
> > descheduling while also being woken by another task at the same time. It
> > takes a long time to run but it makes the problem very obvious. The
> > expectation is that loadavg drops back down after hackbench has been
> > running and saturating the machine for a long time and then stops.
> >
> > Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
> > accounting race in the generic case. Later it was documented why the
> > ordering of when p->sched_contributes_to_load is read/updated relative
> > to p->on_cpu matters. This is critical when a task is descheduling at the same
> > time it is being activated on another CPU. While the load/stores happen
> > under the RQ lock, the RQ lock on its own does not give any guarantees
> > on the task state.
> >
> > Over the weekend I convinced myself that it must be because the arm64
> > implementations of smp_load_acquire and smp_store_release do not appear
> > to provide acquire/release semantics, because I didn't find anything
> > arm64-specific that was playing with p->state behind the scheduler's back
> > (I could have missed it if it was in an assembly portion as I can't
> > reliably read arm assembler). Similarly, it's not clear why the arm64 implementation
> > does not call smp_acquire__after_ctrl_dep in the smp_load_acquire
> > implementation. Even when it was introduced, the arm64 implementation
> > differed significantly from the arm implementation in terms of what
> > barriers it used for non-obvious reasons.
>
> Why would you expect smp_acquire__after_ctrl_dep() to be called as part of
> the smp_load_acquire() implementation?
>
I wouldn't, I should have said smp_cond_load_acquire.
> FWIW, arm64 has special instructions for acquire and release (and they
> actually provide more order than is strictly needed by Linux), so we
> just map acquire/release to those instructions directly. Since these
> instructions are not available on most 32-bit cores, the arm implementation
> just uses the fence-based implementation.
>
Ok, makes sense. I think this was a red herring anyway as it's now looking
more like a sched_contributes_to_load ordering issue.
> Anyway, setting all that aside, I do agree with you that the bitfield usage
> in task_struct looks pretty suspicious. For example, in __schedule() we
> have:
>
> rq_lock(rq, &rf);
> smp_mb__after_spinlock();
> ...
> prev_state = prev->state;
>
> if (!preempt && prev_state) {
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> prev->sched_contributes_to_load =
> (prev_state & TASK_UNINTERRUPTIBLE) &&
> !(prev_state & TASK_NOLOAD) &&
> !(prev->flags & PF_FROZEN);
> ...
> deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
>
> where deactivate_task() updates p->on_rq directly:
>
> p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
>
It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
document ttwu()") which changed it. Not sure why that is and didn't
think about it too deep as it didn't appear to be directly related to
the problem and didn't have ordering consequences.
> so this is _not_ ordered wrt sched_contributes_to_load. But then over in
> __ttwu_queue_wakelist() we have:
>
> p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
>
> which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> updates can race and cause the flags to be corrupted?
>
I think this is at least one possibility. I think at least that one
should only be explicitly set on WF_MIGRATED and explicitly cleared in
sched_ttwu_pending. While I haven't audited it fully, it might be enough
to avoid a double write outside of the rq lock on the bitfield but I
still need to think more about the ordering of sched_contributes_to_load
and whether it's ordered by p->on_cpu or not.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:
> > Anyway, setting all that aside, I do agree with you that the bitfield usage
> > in task_struct looks pretty suspicious. For example, in __schedule() we
> > have:
> >
> > rq_lock(rq, &rf);
> > smp_mb__after_spinlock();
> > ...
> > prev_state = prev->state;
> >
> > if (!preempt && prev_state) {
> > if (signal_pending_state(prev_state, prev)) {
> > prev->state = TASK_RUNNING;
> > } else {
> > prev->sched_contributes_to_load =
> > (prev_state & TASK_UNINTERRUPTIBLE) &&
> > !(prev_state & TASK_NOLOAD) &&
> > !(prev->flags & PF_FROZEN);
> > ...
> > deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> >
> > where deactivate_task() updates p->on_rq directly:
> >
> > p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> >
>
> It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> document ttwu()") which changed it. Not sure why that is and didn't
> think about it too deep as it didn't appear to be directly related to
> the problem and didn't have ordering consequences.
I'm confused; that commit didn't change deactivate_task(). Anyway,
->on_rq should be strictly under rq->lock. That said, since there is a
READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
WRITE_ONCE().
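
i.e. something like this untested sketch of the deactivate_task() store
(illustrative only):

        /* pair the ->on_rq store with the READ_ONCE() readers */
        WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING);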
> > __ttwu_queue_wakelist() we have:
> >
> > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> >
> > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > updates can race and cause the flags to be corrupted?
> >
>
> I think this is at least one possibility. I think at least that one
> should only be explicitly set on WF_MIGRATED and explicitly cleared in
> sched_ttwu_pending. While I haven't audited it fully, it might be enough
> to avoid a double write outside of the rq lock on the bitfield but I
> still need to think more about the ordering of sched_contributes_to_load
> and whether it's ordered by p->on_cpu or not.
The scenario you're worried about is something like:
CPU0                                    CPU1

schedule()
  prev->sched_contributes_to_load = X;
  deactivate_task(prev);

                                        try_to_wake_up()
                                          if (p->on_rq &&) // false
                                          if (smp_load_acquire(&p->on_cpu) && // true
                                              ttwu_queue_wakelist())
                                                p->sched_remote_wakeup = Y;

  smp_store_release(prev->on_cpu, 0);
And then the stores of X and Y clobber one another.. Hummph, seems
reasonable. One quick thing to test would be something like this:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..9844e541c94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,9 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
+ unsigned :0;
unsigned sched_remote_wakeup:1;
+ unsigned :0;
#ifdef CONFIG_PSI
unsigned sched_psi_wake_requeue:1;
#endif
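
For anyone not familiar with the trick: a zero-width bit field forces the
next member to start in a new storage unit, so the intent is that the
rq->lock protected flags and sched_remote_wakeup no longer share a word
and their stores cannot clobber each other. Roughly (illustrative layout
only, not the real task_struct):

struct example_flags {
        unsigned sched_contributes_to_load:1;   /* under rq->lock */
        unsigned sched_migrated:1;              /* under rq->lock */
        unsigned :0;    /* close the current storage unit */
        unsigned sched_remote_wakeup:1;         /* written without rq->lock */
        unsigned :0;    /* and keep whatever follows separate */
};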
On Mon, Nov 16, 2020 at 01:58:03PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 01:53:55PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> > > On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > > > I'll be looking again today to see can I find a mistake in the ordering for
> > > > how sched_contributes_to_load is handled but again, the lack of knowledge
> > > > on the arm64 memory model means I'm a bit stuck and a second set of eyes
> > > > would be nice :(
> > > >
> > >
> > > This morning, it's not particularly clear what orders the visibility of
> > > sched_contributes_to_load in the same way as other task fields in the
> > > schedule vs try_to_wake_up paths. I thought the rq lock would have ordered
> > > them but something is clearly off or loadavg would not be getting screwed.
> > > It could be done with an rmb and wmb (still testing, it hasn't blown up so
> > > far) but that's far too heavy. smp_load_acquire/smp_store_release on it
> > > might be sufficient although it's less clear whether arm64 gives the
> > > necessary guarantees.
> > >
> > > (This is still at the chucking-out-ideas stage as I haven't context
> > > switched back in all the memory barrier rules).
> >
> > IIRC it should be so ordered by ->on_cpu.
> >
> > We have:
> >
> > schedule()
> > prev->sched_contributes_to_load = X;
> > smp_store_release(prev->on_cpu, 0);
> >
> >
> > on the one hand, and:
>
> Ah, my bad, ttwu() itself will of course wait for !p->on_cpu before we
> even get here.
>
Sortof, it will either have called smp_load_acquire(&p->on_cpu) or
smp_cond_load_acquire(&p->on_cpu, !VAL) before hitting one of the paths
leading to ttwu_do_activate. Either way, it's covered.
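
i.e. the relevant part of try_to_wake_up() at the time, heavily abridged
and from memory, so treat it as a sketch rather than a faithful quote:

        if (smp_load_acquire(&p->on_cpu) &&
            ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
                goto unlock;    /* remote case: IPI, then sched_ttwu_pending() */

        smp_cond_load_acquire(&p->on_cpu, !VAL);

        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        ...
        ttwu_queue(p, cpu, wake_flags); /* direct case */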
> > sched_ttwu_pending()
> > if (WARN_ON_ONCE(p->on_cpu))
> > smp_cond_load_acquire(&p->on_cpu)
> >
> > ttwu_do_activate()
> > if (p->sched_contributes_to_load)
> > ...
> >
> > on the other (for the remote case, which is the only 'interesting' one).
>
But this side is interesting because I'm having trouble convincing
myself it's 100% correct for sched_contributes_to_load. The write of
prev->sched_contributes_to_load in the schedule() path has a big gap
before it hits the smp_store_release(prev->on_cpu).
On the ttwu path, we have
if (smp_load_acquire(&p->on_cpu) &&
ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
goto unlock;
ttwu_queue_wakelist queues task on the wakelist, sends IPI
and on the receiver side it calls ttwu_do_activate and reads
sched_contributes_to_load
sched_ttwu_pending() is not necessarily using the same rq lock so no
protection there. The smp_load_acquire() has just been hit but it still
leaves a gap between when sched_contributes_to_load is written and a
parallel read of sched_contributes_to_load.
So while we might be able to avoid a smp_rmb() before the read of
sched_contributes_to_load and rely on p->on_cpu ordering there,
we may still need a smp_wmb() after nr_interruptible() increments
instead of waiting until the smp_store_release() is hit while a task
is scheduling. That would be a real memory barrier on arm64 and a plain
compiler barrier on x86-64.
> Also see the "Notes on Program-Order guarantees on SMP systems."
> comment.
I did, it was the on_cpu ordering for the blocking case that had me
looking at the smp_store_release and smp_cond_load_acquire in arm64 in
the first place thinking that something in there must be breaking the
on_cpu ordering. I'm re-reading it every so often while trying to figure
out where the gap is or whether I'm imagining things.
Not fully tested but did not instantly break either
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..877eaeba45ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
- prev->sched_contributes_to_load =
+ int acct_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
- if (prev->sched_contributes_to_load)
+ prev->sched_contributes_to_load = acct_load;
+ if (acct_load) {
rq->nr_uninterruptible++;
+ /*
+ * Pairs with p->on_cpu ordering, either a
+ * smp_load_acquire or smp_cond_load_acquire
+ * in the ttwu path before ttwu_do_activate
+ * p->sched_contributes_to_load. It's only
+ * after the nr_interruptible update happens
+ * that the ordering is critical.
+ */
+ smp_wmb();
+ }
+
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote:
> > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> > document ttwu()") which changed it. Not sure why that is and didn't
> > think about it too deep as it didn't appear to be directly related to
> > the problem and didn't have ordering consequences.
>
> I'm confused; that commit didn't change deactivate_task(). Anyway,
> ->on_rq should be strictly under rq->lock. That said, since there is a
> READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
> WRITE_ONCE().
>
It didn't change deactivate_task but it did this
- WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
- dequeue_task(rq, p, DEQUEUE_NOCLOCK);
+ deactivate_task(rq, p, DEQUEUE_NOCLOCK);
which makes that write a
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
As activate_task is also a plain store and I didn't spot a relevant
ordering problem that would impact loadavg, I concluded it was not
immediately relevant, just a curiosity.
> > > __ttwu_queue_wakelist() we have:
> > >
> > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > >
> > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > > updates can race and cause the flags to be corrupted?
> > >
> >
> > I think this is at least one possibility. I think at least that one
> > should only be explicitly set on WF_MIGRATED and explicitly cleared in
> > sched_ttwu_pending. While I haven't audited it fully, it might be enough
> > to avoid a double write outside of the rq lock on the bitfield but I
> > still need to think more about the ordering of sched_contributes_to_load
> > and whether it's ordered by p->on_cpu or not.
>
> The scenario you're worried about is something like:
>
> CPU0 CPU1
>
> schedule()
> prev->sched_contributes_to_load = X;
> deactivate_task(prev);
>
> try_to_wake_up()
> if (p->on_rq &&) // false
> if (smp_load_acquire(&p->on_cpu) && // true
> ttwu_queue_wakelist())
> p->sched_remote_wakeup = Y;
>
> smp_store_release(prev->on_cpu, 0);
>
Yes, mostly because of what memory-barriers.txt warns about for bitfields
if they are not protected by the same lock.
> And then the stores of X and Y clobber one another.. Hummph, seems
> reasonable. One quick thing to test would be something like this:
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7abbdd7f3884..9844e541c94c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,9 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> + unsigned :0;
> unsigned sched_remote_wakeup:1;
> + unsigned :0;
> #ifdef CONFIG_PSI
> unsigned sched_psi_wake_requeue:1;
> #endif
I'll test this after the smp_wmb() test completes. While a clobbering may
be the issue, I also think the gap between the rq->nr_uninterruptible++
and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 03:29:46PM +0000, Mel Gorman wrote:
> I did, it was the on_cpu ordering for the blocking case that had me
> looking at the smp_store_release and smp_cond_load_acquire in arm64 in
> the first place thinking that something in there must be breaking the
> on_cpu ordering. I'm re-reading it every so often while trying to figure
> out where the gap is or whether I'm imagining things.
>
> Not fully tested but did not instantly break either
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..877eaeba45ac 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> - prev->sched_contributes_to_load =
> + int acct_load =
> (prev_state & TASK_UNINTERRUPTIBLE) &&
> !(prev_state & TASK_NOLOAD) &&
> !(prev->flags & PF_FROZEN);
>
> - if (prev->sched_contributes_to_load)
> + prev->sched_contributes_to_load = acct_load;
> + if (acct_load) {
> rq->nr_uninterruptible++;
>
> + /*
> + * Pairs with p->on_cpu ordering, either a
> + * smp_load_acquire or smp_cond_load_acquire
> + * in the ttwu path before ttwu_do_activate
> + * p->sched_contributes_to_load. It's only
> + * after the nr_interruptible update happens
> + * that the ordering is critical.
> + */
> + smp_wmb();
> + }
> +
> /*
> * __schedule() ttwu()
> * prev_state = prev->state; if (p->on_rq && ...)
>
This passed the test. Load averages taken once a minute after the test
completed showed
950.21 977.17 990.69 1/853 2117
349.00 799.32 928.69 1/859 2439
128.18 653.85 870.56 1/861 2736
47.08 534.84 816.08 1/860 3029
17.29 437.50 765.00 1/865 3357
6.35 357.87 717.13 1/865 3653
2.33 292.74 672.24 1/861 3709
0.85 239.46 630.17 1/859 3711
0.31 195.87 590.73 1/857 3713
0.11 160.22 553.76 1/853 3715
With 5.10-rc3, it got stuck with a load average of 244 after the test
completed even though the machine was idle.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 03:29:46PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:58:03PM +0100, Peter Zijlstra wrote:
> > > sched_ttwu_pending()
> > > if (WARN_ON_ONCE(p->on_cpu))
> > > smp_cond_load_acquire(&p->on_cpu)
> > >
> > > ttwu_do_activate()
> > > if (p->sched_contributes_to_load)
> > > ...
> > >
> > > on the other (for the remote case, which is the only 'interesting' one).
> >
>
> But this side is interesting because I'm having trouble convincing
> myself it's 100% correct for sched_contributes_to_load. The write of
> prev->sched_contributes_to_load in the schedule() path has a big gap
> before it hits the smp_store_release(prev->on_cpu).
>
> On the ttwu path, we have
>
> if (smp_load_acquire(&p->on_cpu) &&
> ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> goto unlock;
>
> ttwu_queue_wakelist queues task on the wakelist, sends IPI
> and on the receiver side it calls ttwu_do_activate and reads
> sched_contributes_to_load
>
> sched_ttwu_pending() is not necessarily using the same rq lock so no
> protection there. The smp_load_acquire() has just been hit but it still
> leaves a gap between when sched_contributes_to_load is written and a
> parallel read of sched_contributes_to_load.
>
> So while we might be able to avoid a smp_rmb() before the read of
> sched_contributes_to_load and rely on p->on_cpu ordering there,
> we may still need a smp_wmb() after nr_interruptible() increments
> instead of waiting until the smp_store_release() is hit while a task
> is scheduling. That would be a real memory barrier on arm64 and a plain
> compiler barrier on x86-64.
I'm mighty confused by your words here; and the patch below. What actual
scenario are you worried about?
If we take the WF_ON_CPU path, we IPI the CPU the task is ->on_cpu on.
So the IPI lands after the schedule() that clears ->on_cpu on the very
same CPU.
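
That is, __ttwu_queue_wakelist() is roughly the following (paraphrased
from memory, not an exact copy), and in the WF_ON_CPU case the caller
passes cpu == task_cpu(p):

static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
        struct rq *rq = cpu_rq(cpu);

        p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

        WRITE_ONCE(rq->ttwu_pending, 1);
        __smp_call_single_queue(cpu, &p->wake_entry.llist);     /* IPIs @cpu */
}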
>
> > Also see the "Notes on Program-Order guarantees on SMP systems."
> > comment.
>
> I did, it was the on_cpu ordering for the blocking case that had me
> looking at the smp_store_release and smp_cond_load_acquire in arm64 in
> the first place thinking that something in there must be breaking the
> on_cpu ordering. I'm re-reading it every so often while trying to figure
> out where the gap is or whether I'm imagining things.
>
> Not fully tested but did not instantly break either
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d2003a7d5ab5..877eaeba45ac 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
> if (signal_pending_state(prev_state, prev)) {
> prev->state = TASK_RUNNING;
> } else {
> - prev->sched_contributes_to_load =
> + int acct_load =
> (prev_state & TASK_UNINTERRUPTIBLE) &&
> !(prev_state & TASK_NOLOAD) &&
> !(prev->flags & PF_FROZEN);
>
> - if (prev->sched_contributes_to_load)
> + prev->sched_contributes_to_load = acct_load;
> + if (acct_load) {
> rq->nr_uninterruptible++;
>
> + /*
> + * Pairs with p->on_cpu ordering, either a
> + * smp_load_acquire or smp_cond_load_acquire
> + * in the ttwu path before ttwu_do_activate
> + * p->sched_contributes_to_load. It's only
> + * after the nr_interruptible update happens
> + * that the ordering is critical.
> + */
> + smp_wmb();
> + }
Sorry, I can't follow, at all.
On Mon, Nov 16, 2020 at 03:52:32PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote:
> > > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> > > document ttwu()") which changed it. Not sure why that is and didn't
> > > think about it too deep as it didn't appear to be directly related to
> > > the problem and didn't have ordering consequences.
> >
> > I'm confused; that commit didn't change deactivate_task(). Anyway,
> > ->on_rq should be strictly under rq->lock. That said, since there is a
> > READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
> > WRITE_ONCE().
> >
>
> It didn't change deactivate_task but it did this
>
> - WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
> - dequeue_task(rq, p, DEQUEUE_NOCLOCK);
> + deactivate_task(rq, p, DEQUEUE_NOCLOCK);
>
> which makes that write a
>
> p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
>
> As activate_task is also a plain store and I didn't spot a relevant
> ordering problem that would impact loadavg, I concluded it was not
> immediately relevant, just a curiosity.
That's the move_queued_task() case, which is irrelevant for the issue at
hand.
> > > > __ttwu_queue_wakelist() we have:
> > > >
> > > > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > > >
> > > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > > > updates can race and cause the flags to be corrupted?
> > > >
> > >
> > > I think this is at least one possibility. I think at least that one
> > > should only be explicitly set on WF_MIGRATED and explicitly cleared in
> > > sched_ttwu_pending. While I haven't audited it fully, it might be enough
> > > to avoid a double write outside of the rq lock on the bitfield but I
> > > still need to think more about the ordering of sched_contributes_to_load
> > > and whether it's ordered by p->on_cpu or not.
> >
> > The scenario you're worried about is something like:
> >
> > CPU0 CPU1
> >
> > schedule()
> > prev->sched_contributes_to_load = X;
> > deactivate_task(prev);
> >
> > try_to_wake_up()
> > if (p->on_rq &&) // false
> > if (smp_load_acquire(&p->on_cpu) && // true
> > ttwu_queue_wakelist())
> > p->sched_remote_wakeup = Y;
> >
> > smp_store_release(prev->on_cpu, 0);
> >
>
> Yes, mostly because of what memory-barriers.txt warns about for bitfields
> if they are not protected by the same lock.
I'm not sure memory-barriers.txt is relevant; that's simply two racing
stores and 'obviously' buggered.
> > And then the stores of X and Y clobber one another.. Hummph, seems
> > reasonable. One quick thing to test would be something like this:
> >
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 7abbdd7f3884..9844e541c94c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -775,7 +775,9 @@ struct task_struct {
> > unsigned sched_reset_on_fork:1;
> > unsigned sched_contributes_to_load:1;
> > unsigned sched_migrated:1;
> > + unsigned :0;
> > unsigned sched_remote_wakeup:1;
> > + unsigned :0;
> > #ifdef CONFIG_PSI
> > unsigned sched_psi_wake_requeue:1;
> > #endif
>
> I'll test this after the smp_wmb() test completes. While a clobbering may
> be the issue, I also think the gap between the rq->nr_uninterruptible++
> and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate.
I really don't understand what you wrote in that email...
On Mon, Nov 16, 2020 at 05:54:15PM +0100, Peter Zijlstra wrote:
> > > And then the stores of X and Y clobber one another.. Hummph, seems
> > > reasonable. One quick thing to test would be something like this:
> > >
> > >
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 7abbdd7f3884..9844e541c94c 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -775,7 +775,9 @@ struct task_struct {
> > > unsigned sched_reset_on_fork:1;
> > > unsigned sched_contributes_to_load:1;
> > > unsigned sched_migrated:1;
> > > + unsigned :0;
> > > unsigned sched_remote_wakeup:1;
> > > + unsigned :0;
> > > #ifdef CONFIG_PSI
> > > unsigned sched_psi_wake_requeue:1;
> > > #endif
> >
> > I'll test this after the smp_wmb() test completes. While a clobbering may
> > be the issue, I also think the gap between the rq->nr_uninterruptible++
> > and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate.
>
> I really don't understand what you wrote in that email...
Sorry :(. I tried writing a changelog showing where I think the race
might be. I'll queue up your patch to test whether it's sched_migrated
and sched_remote_wakeup potentially being clobbered.
--8<--
sched: Fix loadavg accounting race on arm64
An internal bug report was filed against 5.8 and 5.9 kernels stating that
loadavg was "exploding" on arm64 on machines acting as build servers. It
happened on at least two different arm64 variants. That setup is complex to
replicate but can be reproduced by running hackbench-process-pipes while
heavily overcommitting a machine with 96 logical CPUs and then checking
if loadavg drops afterwards. With an MMTests clone, reproduce it as follows
./run-mmtests.sh --config configs/config-workload-hackbench-process-pipes --no-monitor testrun; \
for i in `seq 1 60`; do cat /proc/loadavg; sleep 60; done
The reproduction case simply hammers the case where a task can be
descheduling while also being woken by another task at the same time.
After the test completes, load avg should reach 0 within a few minutes.
Commit dbfb089d360b ("sched: Fix loadavg accounting race") fixed a loadavg
accounting race in the generic case. Later it was documented why the ordering
of when p->sched_contributes_to_load is read/updated relative to p->on_cpu matters.
This is critical when a task is descheduling at the same time it is being
activated on another CPU. While the load/stores happen under the RQ lock,
the RQ lock on its own does not give any guarantees on the task state.
The problem appears to be because the schedule and wakeup paths rely
on being ordered by ->on_cpu for some fields as documented in core.c
under "Notes on Program-Order guarantees on SMP systems". However,
this can happen
CPU 0                     CPU 1                     CPU 2

__schedule()
  prev->sched_contributes_to_load = 1
  rq->nr_uninterruptible++;
  rq_unlock_irq

                          try_to_wake_up
                            smp_load_acquire(&p->on_cpu) && ttwu_queue_wakelist(p) == true
                            ttwu_queue_wakelist
                              ttwu_queue_cond (true)
                                __ttwu_queue_wakelist

                                                    sched_ttwu_pending
                                                      ttwu_do_activate
                                                        if (p->sched_contributes_to_load)
                                                          (wrong value observed, load drifts)

finish_task
  smp_store_release(X->on_cpu, 0)
There is a gap between when rq->nr_uninterruptible is written
to and when p->on_cpu is updated with smp_store_release(). During
that window, a parallel waker may observe the incorrect value for
p->sched_contributes_to_load and fail to decrement rq->nr_uninterruptible
and the loadavg starts drifting.
This patch adds a write barrier after nr_uninterruptible is updated
with the matching read barrier done when reading p->on_cpu in the ttwu
path. With the patch applied, the load averages taken every minute after
the hackbench test case completes is
950.21 977.17 990.69 1/853 2117
349.00 799.32 928.69 1/859 2439
128.18 653.85 870.56 1/861 2736
47.08 534.84 816.08 1/860 3029
17.29 437.50 765.00 1/865 3357
6.35 357.87 717.13 1/865 3653
2.33 292.74 672.24 1/861 3709
0.85 239.46 630.17 1/859 3711
0.31 195.87 590.73 1/857 3713
0.11 160.22 553.76 1/853 3715
Without the patch, the load average stabilised at 244 on an otherwise
idle machine.
Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Signed-off-by: Mel Gorman <[email protected]>
Cc: [email protected] # v5.8+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7d5ab5..877eaeba45ac 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4459,14 +4459,26 @@ static void __sched notrace __schedule(bool preempt)
if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
- prev->sched_contributes_to_load =
+ int acct_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev->flags & PF_FROZEN);
- if (prev->sched_contributes_to_load)
+ prev->sched_contributes_to_load = acct_load;
+ if (acct_load) {
rq->nr_uninterruptible++;
+ /*
+ * Pairs with p->on_cpu ordering, either a
+ * smp_load_acquire or smp_cond_load_acquire
+ * in the ttwu path before ttwu_do_activate
+ * p->sched_contributes_to_load. It's only
+ * after the nr_interruptible update happens
+ * that the ordering is critical.
+ */
+ smp_wmb();
+ }
+
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 05:49:28PM +0100, Peter Zijlstra wrote:
> > So while we might be able to avoid a smp_rmb() before the read of
> > sched_contributes_to_load and rely on p->on_cpu ordering there,
> > we may still need a smp_wmb() after nr_interruptible() increments
> > instead of waiting until the smp_store_release() is hit while a task
> > is scheduling. That would be a real memory barrier on arm64 and a plain
> > compiler barrier on x86-64.
>
Wish I'd read this before sending the changelog
> I'm mighty confused by your words here; and the patch below. What actual
> scenario are you worried about?
>
The wrong one apparently. Even if the IRQ is released, the IPI would
deliver to the CPU that should observe the correct value or take the
other path when smp_cond_load_acquire(&p->on_cpu, !VAL) waits for the
schedule to finish so I'm now both confused and wondering why smp_wmb
made a difference at all.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 05:24:44PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 05:49:28PM +0100, Peter Zijlstra wrote:
> > > So while we might be able to avoid a smp_rmb() before the read of
> > > sched_contributes_to_load and rely on p->on_cpu ordering there,
> > > we may still need a smp_wmb() after nr_interruptible() increments
> > > instead of waiting until the smp_store_release() is hit while a task
> > > is scheduling. That would be a real memory barrier on arm64 and a plain
> > > compiler barrier on x86-64.
> >
>
> Wish I read this before sending the changelog
>
> > I'm mighty confused by your words here; and the patch below. What actual
> > scenario are you worried about?
> >
>
> The wrong one apparently. Even if the IRQ is released, the IPI would
> deliver to the CPU that should observe the correct value or take the
> other path when smp_cond_load_acquire(&p->on_cpu, !VAL) waits for the
> schedule to finish so I'm now both confused and wondering why smp_wmb
> made a difference at all.
Probably still worth trying Peter's hack to pad the bitfields though, as I
think that's still a plausible issue (and one which would appear to be
fixed by that smp_wmb() too).
Will
On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote:
> > I think this is at least one possibility. I think at least that one
> > should only be explicitly set on WF_MIGRATED and explicitly cleared in
> > sched_ttwu_pending. While I haven't audited it fully, it might be enough
> > to avoid a double write outside of the rq lock on the bitfield but I
> > still need to think more about the ordering of sched_contributes_to_load
> > and whether it's ordered by p->on_cpu or not.
>
> The scenario you're worried about is something like:
>
> CPU0 CPU1
>
> schedule()
> prev->sched_contributes_to_load = X;
> deactivate_task(prev);
>
> try_to_wake_up()
> if (p->on_rq &&) // false
> if (smp_load_acquire(&p->on_cpu) && // true
> ttwu_queue_wakelist())
> p->sched_remote_wakeup = Y;
>
> smp_store_release(prev->on_cpu, 0);
>
Yes.
> And then the stores of X and Y clobber one another.. Hummph, seems
> reasonable. One quick thing to test would be something like this:
>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7abbdd7f3884..9844e541c94c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,9 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> + unsigned :0;
> unsigned sched_remote_wakeup:1;
> + unsigned :0;
> #ifdef CONFIG_PSI
> unsigned sched_psi_wake_requeue:1;
> #endif
And this works.
986.01 1008.17 1013.15 2/855 1212
362.19 824.70 949.75 1/856 1564
133.19 674.65 890.32 1/864 1958
49.04 551.89 834.61 2/871 2339
18.33 451.54 782.41 1/867 2686
6.77 369.37 733.45 1/866 2929
2.55 302.16 687.55 1/864 2931
0.97 247.18 644.52 1/860 2933
0.48 202.23 604.20 1/849 2935
I should have gone with this after rereading the warning about bit fields
having to be protected by the same lock in the "anti-guarantees" section
of memory-barriers.txt :(
sched_psi_wake_requeue can probably stay with the other three fields
given they are under the rq lock but sched_remote_wakeup needs to move
out.
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> I'll be looking again today to see can I find a mistake in the ordering for
> how sched_contributes_to_load is handled but again, the lack of knowledge
> on the arm64 memory model means I'm a bit stuck and a second set of eyes
> would be nice :(
>
This morning, it's not particularly clear what orders the visibility of
sched_contributes_to_load in the same way as other task fields in the
schedule vs try_to_wake_up paths. I thought the rq lock would have ordered
them but something is clearly off or loadavg would not be getting screwed.
It could be done with an rmb and wmb (still testing, it hasn't blown up so
far) but that's far too heavy. smp_load_acquire/smp_store_release on it
might be sufficient although it's less clear whether arm64 gives the
necessary guarantees.

(This is still at the chucking-out-ideas stage as I haven't context
switched back in all the memory barrier rules).
--
Mel Gorman
SUSE Labs
On Mon, Nov 16, 2020 at 11:49:38AM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 09:10:54AM +0000, Mel Gorman wrote:
> > I'll be looking again today to see can I find a mistake in the ordering for
> > how sched_contributes_to_load is handled but again, the lack of knowledge
> > on the arm64 memory model means I'm a bit stuck and a second set of eyes
> > would be nice :(
> >
>
> This morning, it's not particularly clear what orders the visibility of
> sched_contributes_to_load in the same way as other task fields in the
> schedule vs try_to_wake_up paths. I thought the rq lock would have ordered
> them but something is clearly off or loadavg would not be getting screwed.
> It could be done with an rmb and wmb (still testing, it hasn't blown up so
> far) but that's far too heavy. smp_load_acquire/smp_store_release on it
> might be sufficient although it's less clear whether arm64 gives the
> necessary guarantees.
>
> (This is still at the chucking-out-ideas stage as I haven't context
> switched back in all the memory barrier rules).
IIRC it should be so ordered by ->on_cpu.
We have:
schedule()
prev->sched_contributes_to_load = X;
smp_store_release(prev->on_cpu, 0);
on the one hand, and:
sched_ttwu_pending()
if (WARN_ON_ONCE(p->on_cpu))
smp_cond_load_acquire(&p->on_cpu)
ttwu_do_activate()
if (p->sched_contributes_to_load)
...
on the other (for the remote case, which is the only 'interesting' one).
On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote:
> And this works.
Yay!
> sched_psi_wake_requeue can probably stay with the other three fields
> given they are under the rq lock but sched_remote_wakeup needs to move
> out.
I _think_ we can move the bit into the unserialized section below.
It's a bit cheeky, but it should work I think because the only time we
actually use this bit, we're guaranteed the task isn't actually running,
so current doesn't exist.
I suppose the question is whether this is worth saving 31 bits over...
How's this?
---
Subject: sched: Fix data-race in wakeup
From: Peter Zijlstra <[email protected]>
Date: Tue Nov 17 09:08:41 CET 2020
Mel reported that on some ARM64 platforms loadavg goes bananas and
tracked it down to the following data race:
CPU0                                    CPU1

schedule()
  prev->sched_contributes_to_load = X;
  deactivate_task(prev);

                                        try_to_wake_up()
                                          if (p->on_rq &&) // false
                                          if (smp_load_acquire(&p->on_cpu) && // true
                                              ttwu_queue_wakelist())
                                                p->sched_remote_wakeup = Y;

  smp_store_release(prev->on_cpu, 0);
where both p->sched_contributes_to_load and p->sched_remote_wakeup are
in the same word, and thus the stores X and Y race (and can clobber
one another's data).
Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") the p->on_cpu handoff serialized access to
p->sched_remote_wakeup (just as it still does with
p->sched_contributes_to_load) that commit broke that by calling
ttwu_queue_wakelist() with p->on_cpu != 0.
However, due to
p->XXX                                  ttwu()
schedule()                                if (p->on_rq && ...) // false
  smp_mb__after_spinlock()                if (smp_load_acquire(&p->on_cpu) &&
  deactivate_task()                           ttwu_queue_wakelist())
    p->on_rq = 0;                               p->sched_remote_wakeup = X;
We can be sure any 'current' store is complete and 'current' is
guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
the current flags word.
Note: while the observed failure was loadavg accounting gone wrong due
to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
is also possible where schedule() clobbers p->sched_remote_wakeup,
this could result in enqueue_entity() wrecking ->vruntime and causing
scheduling artifacts.
Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/sched.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
- unsigned sched_remote_wakeup:1;
#ifdef CONFIG_PSI
unsigned sched_psi_wake_requeue:1;
#endif
@@ -785,6 +784,18 @@ struct task_struct {
/* Unserialized, strictly 'current' */
+ /*
+ * p->in_iowait = 1; ttwu()
+ * schedule() if (p->on_rq && ..) // false
+ * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
+ * deactivate_task() ttwu_queue_wakelist())
+ * p->on_rq = 0; p->sched_remote_wakeup = X;
+ *
+ * Guarantees all stores of 'current' are visible before
+ * ->sched_remote_wakeup gets used.
+ */
+ unsigned sched_remote_wakeup:1;
+
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote:
>
> > And this works.
>
> Yay!
>
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
>
> I _think_ we can move the bit into the unserialized section below.
>
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
>
> I suppose the question is whether this is worth saving 31 bits over...
>
> How's this?
>
> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <[email protected]>
> Date: Tue Nov 17 09:08:41 CET 2020
>
> Mel reported that on some ARM64 platforms loadavg goes bananas and
> tracked it down to the following data race:
>
> CPU0 CPU1
>
> schedule()
> prev->sched_contributes_to_load = X;
> deactivate_task(prev);
>
> try_to_wake_up()
> if (p->on_rq &&) // false
> if (smp_load_acquire(&p->on_cpu) && // true
> ttwu_queue_wakelist())
> p->sched_remote_wakeup = Y;
>
> smp_store_release(prev->on_cpu, 0);
(nit: I suggested this race over at [1] ;)
> where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> in the same word, and thus the stores X and Y race (and can clobber
> one another's data).
>
> Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> p->sched_remote_wakeup (just as it still does with
> p->sched_contributes_to_load) that commit broke that by calling
> ttwu_queue_wakelist() with p->on_cpu != 0.
>
> However, due to
>
> p->XXX ttwu()
> schedule() if (p->on_rq && ...) // false
> smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) &&
> deactivate_task() ttwu_queue_wakelist())
> p->on_rq = 0; p->sched_remote_wakeup = X;
>
> We can be sure any 'current' store is complete and 'current' is
> guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> the current flags word.
>
> Note: while the observed failure was loadavg accounting gone wrong due
> to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> is also possible where schedule() clobbers p->sched_remote_wakeup,
> this could result in enqueue_entity() wrecking ->vruntime and causing
> scheduling artifacts.
>
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/sched.h | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> - unsigned sched_remote_wakeup:1;
> #ifdef CONFIG_PSI
> unsigned sched_psi_wake_requeue:1;
> #endif
> @@ -785,6 +784,18 @@ struct task_struct {
>
> /* Unserialized, strictly 'current' */
>
> + /*
> + * p->in_iowait = 1; ttwu()
> + * schedule() if (p->on_rq && ..) // false
> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> + * deactivate_task() ttwu_queue_wakelist())
> + * p->on_rq = 0; p->sched_remote_wakeup = X;
> + *
> + * Guarantees all stores of 'current' are visible before
> + * ->sched_remote_wakeup gets used.
I'm still not sure this is particularly clear -- don't we want to highlight
that the store of p->on_rq is unordered wrt the update to
p->sched_contributes_to_load() in deactivate_task()?
I dislike bitfields with a passion, but the fix looks good:
Acked-by: Will Deacon <[email protected]>
Now the million dollar question is why KCSAN hasn't run into this. Hrmph.
Will
[1] https://lore.kernel.org/r/20201116131102.GA29992@willie-the-truck
On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > Subject: sched: Fix data-race in wakeup
> > From: Peter Zijlstra <[email protected]>
> > Date: Tue Nov 17 09:08:41 CET 2020
> >
> > Mel reported that on some ARM64 platforms loadavg goes bananas and
> > tracked it down to the following data race:
> >
> > CPU0 CPU1
> >
> > schedule()
> > prev->sched_contributes_to_load = X;
> > deactivate_task(prev);
> >
> > try_to_wake_up()
> > if (p->on_rq &&) // false
> > if (smp_load_acquire(&p->on_cpu) && // true
> > ttwu_queue_wakelist())
> > p->sched_remote_wakeup = Y;
> >
> > smp_store_release(prev->on_cpu, 0);
>
> (nit: I suggested this race over at [1] ;)
Ah, I'll amend and get you a Debugged-by line or something ;-)
> > where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> > in the same word, and thus the stores X and Y race (and can clobber
> > one another's data).
> >
> > Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> > spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> > p->sched_remote_wakeup (just as it still does with
> > p->sched_contributes_to_load) that commit broke that by calling
> > ttwu_queue_wakelist() with p->on_cpu != 0.
> >
> > However, due to
> >
> > p->XXX ttwu()
> > schedule() if (p->on_rq && ...) // false
> > smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) &&
> > deactivate_task() ttwu_queue_wakelist())
> > p->on_rq = 0; p->sched_remote_wakeup = X;
> >
> > We can be sure any 'current' store is complete and 'current' is
> > guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> > the current flags word.
> >
> > Note: while the observed failure was loadavg accounting gone wrong due
> > to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> > is also possible where schedule() clobbers p->sched_remote_wakeup;
> > this could result in enqueue_entity() wrecking ->vruntime and causing
> > scheduling artifacts.
> >
> > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > Reported-by: Mel Gorman <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> > ---
> > include/linux/sched.h | 13 ++++++++++++-
> > 1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -775,7 +775,6 @@ struct task_struct {
> > unsigned sched_reset_on_fork:1;
> > unsigned sched_contributes_to_load:1;
> > unsigned sched_migrated:1;
> > - unsigned sched_remote_wakeup:1;
> > #ifdef CONFIG_PSI
> > unsigned sched_psi_wake_requeue:1;
> > #endif
> > @@ -785,6 +784,18 @@ struct task_struct {
> >
> > /* Unserialized, strictly 'current' */
> >
> > + /*
> > + * p->in_iowait = 1; ttwu()
> > + * schedule() if (p->on_rq && ..) // false
> > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> > + * deactivate_task() ttwu_queue_wakelist())
> > + * p->on_rq = 0; p->sched_remote_wakeup = X;
> > + *
> > + * Guarantees all stores of 'current' are visible before
> > + * ->sched_remote_wakeup gets used.
>
> I'm still not sure this is particularly clear -- don't we want to highlight
> that the store of p->on_rq is unordered wrt the update to
> p->sched_contributes_to_load() in deactivate_task()?
I can explicitly call that out I suppose.
> I dislike bitfields with a passion, but the fix looks good:
I don't particularly hate them, they're just a flag field with names on
(in this case).
> Acked-by: Will Deacon <[email protected]>
Thanks!
> Now the million dollar question is why KCSAN hasn't run into this. Hrmph.
kernel/sched/Makefile:KCSAN_SANITIZE := n
might have something to do with that, I suppose.
And poking at this reminded me of an older email from TJ that seems to
have stagnated.
---
Subject: sched: Fix rq->nr_iowait ordering
From: Peter Zijlstra <[email protected]>
Date: Thu, 24 Sep 2020 13:50:42 +0200
schedule() ttwu()
deactivate_task(); if (p->on_rq && ...) // false
atomic_dec(&task_rq(p)->nr_iowait);
if (prev->in_iowait)
atomic_inc(&rq->nr_iowait);
Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.
Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.
Fixes: Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2949,7 +2949,12 @@ ttwu_do_activate(struct rq *rq, struct t
#ifdef CONFIG_SMP
if (wake_flags & WF_MIGRATED)
en_flags |= ENQUEUE_MIGRATED;
+ else
#endif
+ if (p->in_iowait) {
+ delayacct_blkio_end(p);
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
activate_task(rq, p, en_flags);
ttwu_do_wakeup(rq, p, wake_flags, rf);
@@ -3336,11 +3341,6 @@ try_to_wake_up(struct task_struct *p, un
if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
goto unlock;
- if (p->in_iowait) {
- delayacct_blkio_end(p);
- atomic_dec(&task_rq(p)->nr_iowait);
- }
-
#ifdef CONFIG_SMP
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
@@ -3411,6 +3411,11 @@ try_to_wake_up(struct task_struct *p, un
cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
+ if (p->in_iowait) {
+ delayacct_blkio_end(p);
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
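To make the symptom concrete, here is a toy userspace analogue of the
interleaving described in the changelog above (plain C with pthreads and
C11 atomics; the sleeps only stage the interleaving, and nothing in it is
scheduler code). When the wakeup side's decrement runs before the sleeping
side's increment, an observer can sample the counter below its true value,
which is the "dodgy IO-wait numbers" effect:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int nr_iowait;

/* ttwu() analogue: decrements as soon as it runs */
static void *waker(void *arg)
{
	atomic_fetch_sub(&nr_iowait, 1);
	return NULL;
}

/* schedule() analogue: the matching increment arrives late */
static void *sleeper(void *arg)
{
	usleep(2000);
	atomic_fetch_add(&nr_iowait, 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, waker, NULL);
	pthread_create(&b, NULL, sleeper, NULL);
	usleep(1000);
	/* Samples taken in this window typically report -1 */
	printf("nr_iowait observed as %d\n", atomic_load(&nr_iowait));
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("nr_iowait settles back to %d\n", atomic_load(&nr_iowait));
	return 0;
}

The sampled value eventually settles, which matches the changelog's point:
the numbers are only transiently bogus, and the fix is to delay the
decrement further so the window does not open in the first place.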
On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > > /* Unserialized, strictly 'current' */
> > >
> > > + /*
> > > + * p->in_iowait = 1; ttwu()
> > > + * schedule() if (p->on_rq && ..) // false
> > > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> > > + * deactivate_task() ttwu_queue_wakelist())
> > > + * p->on_rq = 0; p->sched_remote_wakeup = X;
> > > + *
> > > + * Guarantees all stores of 'current' are visible before
> > > + * ->sched_remote_wakeup gets used.
> >
> > I'm still not sure this is particularly clear -- don't we want to highlight
> > that the store of p->on_rq is unordered wrt the update to
> > p->sched_contributes_to_load() in deactivate_task()?
How's this then? It still doesn't explicitly call out the specific race,
but does mention the more fundamental issue that wakelist queueing
doesn't respect the regular rules anymore.
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
- unsigned sched_remote_wakeup:1;
#ifdef CONFIG_PSI
unsigned sched_psi_wake_requeue:1;
#endif
@@ -785,6 +784,21 @@ struct task_struct {
/* Unserialized, strictly 'current' */
+ /*
+ * This field must not be in the scheduler word above due to wakelist
+ * queueing no longer being serialized by p->on_cpu. However:
+ *
+ * p->XXX = X; ttwu()
+ * schedule() if (p->on_rq && ..) // false
+ * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
+ * deactivate_task() ttwu_queue_wakelist())
+ * p->on_rq = 0; p->sched_remote_wakeup = Y;
+ *
+ * guarantees all stores of 'current' are visible before
+ * ->sched_remote_wakeup gets used, so it can be in this word.
+ */
+ unsigned sched_remote_wakeup:1;
+
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;
On Tue, Nov 17, 2020 at 10:46:21AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > > > /* Unserialized, strictly 'current' */
> > > >
> > > > + /*
> > > > + * p->in_iowait = 1; ttwu()
> > > > + * schedule() if (p->on_rq && ..) // false
> > > > + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> > > > + * deactivate_task() ttwu_queue_wakelist())
> > > > + * p->on_rq = 0; p->sched_remote_wakeup = X;
> > > > + *
> > > > + * Guarantees all stores of 'current' are visible before
> > > > + * ->sched_remote_wakeup gets used.
> > >
> > > I'm still not sure this is particularly clear -- don't we want to highlight
> > > that the store of p->on_rq is unordered wrt the update to
> > > p->sched_contributes_to_load() in deactivate_task()?
>
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> - unsigned sched_remote_wakeup:1;
> #ifdef CONFIG_PSI
> unsigned sched_psi_wake_requeue:1;
> #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>
> /* Unserialized, strictly 'current' */
>
> + /*
> + * This field must not be in the scheduler word above due to wakelist
> + * queueing no longer being serialized by p->on_cpu. However:
> + *
> + * p->XXX = X; ttwu()
> + * schedule() if (p->on_rq && ..) // false
> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> + * deactivate_task() ttwu_queue_wakelist())
> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
> + *
> + * guarantees all stores of 'current' are visible before
> + * ->sched_remote_wakeup gets used, so it can be in this word.
> + */
> + unsigned sched_remote_wakeup:1;
Much better, thanks!
Will
On Tue, Nov 17, 2020 at 10:38:29AM +0100, Peter Zijlstra wrote:
> Subject: sched: Fix rq->nr_iowait ordering
> From: Peter Zijlstra <[email protected]>
> Date: Thu, 24 Sep 2020 13:50:42 +0200
>
> schedule() ttwu()
> deactivate_task(); if (p->on_rq && ...) // false
> atomic_dec(&task_rq(p)->nr_iowait);
> if (prev->in_iowait)
> atomic_inc(&rq->nr_iowait);
>
> Allows nr_iowait to be decremented before it gets incremented,
> resulting in more dodgy IO-wait numbers than usual.
>
> Note that because we can now do ttwu_queue_wakelist() before
> p->on_cpu==0, we lose the natural ordering and have to further delay
> the decrement.
>
> Fixes: Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
s/Fixes: Fixes:/Fixes:/
Ok, there is a very minor hazard that someone might try to "fix" the
duplicated logic, but git blame should help. Otherwise it makes sense;
I've received more than one "bug" report complaining that a number was
larger than expected even when no other problem was present, so
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
>
> I _think_ we can move the bit into the unserialized section below.
>
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
>
Putting the bit there has the added advantage that, if the bit existed
on its own, it would be very special in terms of how it should be
treated; adding a bit adjacent to it would be potentially hazardous.
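As a toy illustration of that hazard, the sketch below (plain userspace C
with pthreads; the field names are stand-ins and none of it is scheduler
code) shows how a store to one bit field can be silently undone by a
concurrent store to its neighbour, because each bit-field update typically
compiles to a read-modify-write of the whole containing word:

#include <pthread.h>
#include <stdio.h>

struct flags {
	unsigned contributes_to_load:1;	/* written by the "schedule()" side */
	unsigned remote_wakeup:1;	/* written by the "ttwu()" side */
};

static struct flags f;
static volatile int go;

static void *sched_side(void *arg)
{
	while (!go)
		;
	f.contributes_to_load = 1;	/* store X */
	return NULL;
}

static void *ttwu_side(void *arg)
{
	while (!go)
		;
	f.remote_wakeup = 1;		/* store Y: RMW of the same word */
	return NULL;
}

int main(void)
{
	int i, lost = 0;

	for (i = 0; i < 20000; i++) {
		pthread_t a, b;

		f.contributes_to_load = 0;
		f.remote_wakeup = 0;
		go = 0;
		pthread_create(&a, NULL, sched_side, NULL);
		pthread_create(&b, NULL, ttwu_side, NULL);
		go = 1;
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		/*
		 * If ttwu_side() loaded the word before sched_side()'s
		 * store and wrote it back afterwards, store X is gone.
		 */
		if (f.contributes_to_load == 0)
			lost++;
	}
	printf("lost %d of 20000 updates\n", lost);
	return 0;
}

Build with something like gcc -O2 -pthread; the race window is narrow, so
the count may well be zero on a given run, but any non-zero result is
exactly the clobbering described in the changelog.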
> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <[email protected]>
> Date: Tue Nov 17 09:08:41 CET 2020
>
> <SNIP>
>
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Thanks, testing completed successfully! With the suggested alternative
comment above sched_remote_wakeup;
Reviewed-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On 17/11/20 09:46, Peter Zijlstra wrote:
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
> unsigned sched_reset_on_fork:1;
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
> - unsigned sched_remote_wakeup:1;
> #ifdef CONFIG_PSI
> unsigned sched_psi_wake_requeue:1;
> #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>
> /* Unserialized, strictly 'current' */
>
> + /*
> + * This field must not be in the scheduler word above due to wakelist
> + * queueing no longer being serialized by p->on_cpu. However:
> + *
> + * p->XXX = X; ttwu()
> + * schedule() if (p->on_rq && ..) // false
> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> + * deactivate_task() ttwu_queue_wakelist())
> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
> + *
> + * guarantees all stores of 'current' are visible before
> + * ->sched_remote_wakeup gets used, so it can be in this word.
> + */
Isn't the control dep between that ttwu() p->on_rq read and
p->sched_remote_wakeup write "sufficient"? That should be giving the right
ordering for the rest of ttwu() wrt. those 'current' bits, considering they
are written before that smp_mb__after_spinlock().
In any case, consider me convinced:
Reviewed-by: Valentin Schneider <[email protected]>
> + unsigned sched_remote_wakeup:1;
> +
> /* Bit to tell LSMs we're in execve(): */
> unsigned in_execve:1;
> unsigned in_iowait:1;
On 17/11/20 12:52, Valentin Schneider wrote:
> On 17/11/20 09:46, Peter Zijlstra wrote:
>> How's this then? It still doesn't explicitly call out the specific race,
>> but does mention the more fundamental issue that wakelist queueing
>> doesn't respect the regular rules anymore.
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -775,7 +775,6 @@ struct task_struct {
>> unsigned sched_reset_on_fork:1;
>> unsigned sched_contributes_to_load:1;
>> unsigned sched_migrated:1;
>> - unsigned sched_remote_wakeup:1;
>> #ifdef CONFIG_PSI
>> unsigned sched_psi_wake_requeue:1;
>> #endif
>> @@ -785,6 +784,21 @@ struct task_struct {
>>
>> /* Unserialized, strictly 'current' */
>>
>> + /*
>> + * This field must not be in the scheduler word above due to wakelist
>> + * queueing no longer being serialized by p->on_cpu. However:
>> + *
>> + * p->XXX = X; ttwu()
>> + * schedule() if (p->on_rq && ..) // false
>> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
>> + * deactivate_task() ttwu_queue_wakelist())
>> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
>> + *
>> + * guarantees all stores of 'current' are visible before
>> + * ->sched_remote_wakeup gets used, so it can be in this word.
>> + */
>
> Isn't the control dep between that ttwu() p->on_rq read and
> p->sched_remote_wakeup write "sufficient"?
smp_acquire__after_ctrl_dep() that is, since we need
->on_rq load => 'current' bits load + store
> That should be giving the right
> ordering for the rest of ttwu() wrt. those 'current' bits, considering they
> are written before that smp_mb__after_spinlock().
>
> In any case, consider me convinced:
>
> Reviewed-by: Valentin Schneider <[email protected]>
>
>> + unsigned sched_remote_wakeup:1;
>> +
>> /* Bit to tell LSMs we're in execve(): */
>> unsigned in_execve:1;
>> unsigned in_iowait:1;
On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
> >> + /*
> >> + * This field must not be in the scheduler word above due to wakelist
> >> + * queueing no longer being serialized by p->on_cpu. However:
> >> + *
> >> + * p->XXX = X; ttwu()
> >> + * schedule() if (p->on_rq && ..) // false
> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> >> + * deactivate_task() ttwu_queue_wakelist())
> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
> >> + *
> >> + * guarantees all stores of 'current' are visible before
> >> + * ->sched_remote_wakeup gets used, so it can be in this word.
> >> + */
> >
> > Isn't the control dep between that ttwu() p->on_rq read and
> > p->sched_remote_wakeup write "sufficient"?
>
> smp_acquire__after_ctrl_dep() that is, since we need
> ->on_rq load => 'current' bits load + store
I don't think we need that extra barrier; after all, there will be a
complete schedule() between waking the task and it actually becoming
current.
On 17/11/20 16:13, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>
>> >> + /*
>> >> + * This field must not be in the scheduler word above due to wakelist
>> >> + * queueing no longer being serialized by p->on_cpu. However:
>> >> + *
>> >> + * p->XXX = X; ttwu()
>> >> + * schedule() if (p->on_rq && ..) // false
>> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
>> >> + * deactivate_task() ttwu_queue_wakelist())
>> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
>> >> + *
>> >> + * guarantees all stores of 'current' are visible before
>> >> + * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> + */
>> >
>> > Isn't the control dep between that ttwu() p->on_rq read and
>> > p->sched_remote_wakeup write "sufficient"?
>>
>> smp_acquire__after_ctrl_dep() that is, since we need
>> ->on_rq load => 'current' bits load + store
>
> I don't think we need that extra barrier; after all, there will be a
> complete schedule() between waking the task and it actually becoming
> current.
Apologies for the messy train of thought; what I was trying to say is that
we already have the following, which AIUI is sufficient:
* p->XXX = X; ttwu()
* schedule() if (p->on_rq && ..) // false
* smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
* deactivate_task() ttwu_queue_wakelist()
* p->on_rq = 0; p->sched_remote_wakeup = Y;
On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
>
> On 17/11/20 16:13, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
> >
> >> >> + /*
> >> >> + * This field must not be in the scheduler word above due to wakelist
> >> >> + * queueing no longer being serialized by p->on_cpu. However:
> >> >> + *
> >> >> + * p->XXX = X; ttwu()
> >> >> + * schedule() if (p->on_rq && ..) // false
> >> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
> >> >> + * deactivate_task() ttwu_queue_wakelist())
> >> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
> >> >> + *
> >> >> + * guarantees all stores of 'current' are visible before
> >> >> + * ->sched_remote_wakeup gets used, so it can be in this word.
> >> >> + */
> >> >
> >> > Isn't the control dep between that ttwu() p->on_rq read and
> >> > p->sched_remote_wakeup write "sufficient"?
> >>
> >> smp_acquire__after_ctrl_dep() that is, since we need
> >> ->on_rq load => 'current' bits load + store
> >
> > I don't think we need that extra barrier; after all, there will be a
> > complete schedule() between waking the task and it actually becoming
> > current.
>
> Apologies for the messy train of thought; what I was trying to say is that
> we already have the following, which AIUI is sufficient:
>
> * p->XXX = X; ttwu()
> * schedule() if (p->on_rq && ..) // false
> * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
> * deactivate_task() ttwu_queue_wakelist()
> * p->on_rq = 0; p->sched_remote_wakeup = Y;
>
Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
not required here either ;-)
The reason I had the ->on_cpu thing in there is because it shows we
violate the regular ->on_cpu handoff rules, not for the acquire.
The only ordering that matters on the RHS of that thing is the ->on_rq
load to p->sched_remote_wakeup store ctrl dep. That, combined with the
LHS, guarantees there is a strict order on the stores.
Makes sense?
On 18/11/20 08:05, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
>>
>> On 17/11/20 16:13, Peter Zijlstra wrote:
>> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>> >
>> >> >> + /*
>> >> >> + * This field must not be in the scheduler word above due to wakelist
>> >> >> + * queueing no longer being serialized by p->on_cpu. However:
>> >> >> + *
>> >> >> + * p->XXX = X; ttwu()
>> >> >> + * schedule() if (p->on_rq && ..) // false
>> >> >> + * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
>> >> >> + * deactivate_task() ttwu_queue_wakelist())
>> >> >> + * p->on_rq = 0; p->sched_remote_wakeup = Y;
>> >> >> + *
>> >> >> + * guarantees all stores of 'current' are visible before
>> >> >> + * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> >> + */
>> >> >
>> >> > Isn't the control dep between that ttwu() p->on_rq read and
>> >> > p->sched_remote_wakeup write "sufficient"?
>> >>
>> >> smp_acquire__after_ctrl_dep() that is, since we need
>> >> ->on_rq load => 'current' bits load + store
>> >
>> > I don't think we need that extra barrier; after all, there will be a
>> > complete schedule() between waking the task and it actually becoming
>> > current.
>>
>> Apologies for the messy train of thought; what I was trying to say is that
>> we already have the following, which AIUI is sufficient:
>>
>> * p->XXX = X; ttwu()
>> * schedule() if (p->on_rq && ..) // false
>> * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
>> * deactivate_task() ttwu_queue_wakelist()
>> * p->on_rq = 0; p->sched_remote_wakeup = Y;
>>
>
> Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
> not required here either ;-)
>
> The reason I had the ->on_cpu thing in there is because it shows we
> violate the regular ->on_cpu handoff rules, not for the acquire.
>
Gotcha
> The only ordering that matters on the RHS of that thing is the ->on_rq
> load to p->sched_remote_wakeup store ctrl dep. That, combined with the
> LHS, guarantees there is a strict order on the stores.
>
> Makes sense?
Yep, thanks!
On Tue, Nov 17, 2020 at 10:29AM +0100, Peter Zijlstra wrote:
[...]
> > Now the million dollar question is why KCSAN hasn't run into this. Hrmph.
>
> kernel/sched/Makefile:KCSAN_SANITIZE := n
>
> might have something to do with that, I suppose.
For the record, I tried to reproduce this data race. I found a
read/write race on this bitfield, but not yet that write/write race
(perhaps I wasn't running the right workload).
| read to 0xffff8d4e2ce39aac of 1 bytes by task 5269 on cpu 3:
| __sched_setscheduler+0x4a9/0x1070 kernel/sched/core.c:5297
| sched_setattr kernel/sched/core.c:5512 [inline]
| ...
|
| write to 0xffff8d4e2ce39aac of 1 bytes by task 5268 on cpu 1:
| __schedule+0x296/0xab0 kernel/sched/core.c:4462 prev->sched_contributes_to_load =
| schedule+0xd1/0x130 kernel/sched/core.c:4601
| ...
|
| Full report: https://paste.debian.net/hidden/07a50732/
Getting to the above race also required some effort as 1) I kept hitting
other unrelated data races in the scheduler and had to silence those
first to be able to make progress, and 2) had to enable KCSAN only for
scheduler code so that all other data races were ignored. Then I let
syzkaller
run for a few minutes.
Also note our default KCSAN config is suboptimal. For serious debugging,
I'd recommend the same config that rcutorture uses with the --kcsan
flag, specifically:
CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n,
CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n
to get the full picture.
However, as a first step, it'd be nice to eventually remove the
KCSAN_SANITIZE := n from kernel/sched/Makefile when things are less
noisy (so that syzbot and default builds can start finding more serious
issues, too).
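Put together, a local debugging setup along those lines might look like
the sketch below; only the Makefile line and the two CONFIG_KCSAN_*
options come from this thread, the rest is an assumption rather than the
exact rcutorture --kcsan configuration:

# kernel/sched/Makefile: drop the opt-out so scheduler code is
# instrumented again, i.e. delete (or comment out) this line:
#   KCSAN_SANITIZE := n

# .config fragment (assumes a KCSAN-capable compiler):
CONFIG_KCSAN=y
CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n
CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n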
Thanks,
-- Marco
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: ec618b84f6e15281cc3660664d34cd0dd2f2579e
Gitweb: https://git.kernel.org/tip/ec618b84f6e15281cc3660664d34cd0dd2f2579e
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 24 Sep 2020 13:50:42 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 17 Nov 2020 13:15:28 +01:00
sched: Fix rq->nr_iowait ordering
schedule() ttwu()
deactivate_task(); if (p->on_rq && ...) // false
atomic_dec(&task_rq(p)->nr_iowait);
if (prev->in_iowait)
atomic_inc(&rq->nr_iowait);
Allows nr_iowait to be decremented before it gets incremented,
resulting in more dodgy IO-wait numbers than usual.
Note that because we can now do ttwu_queue_wakelist() before
p->on_cpu==0, we lose the natural ordering and have to further delay
the decrement.
Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d2003a7..9f0ebfb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2501,7 +2501,12 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
#ifdef CONFIG_SMP
if (wake_flags & WF_MIGRATED)
en_flags |= ENQUEUE_MIGRATED;
+ else
#endif
+ if (p->in_iowait) {
+ delayacct_blkio_end(p);
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
activate_task(rq, p, en_flags);
ttwu_do_wakeup(rq, p, wake_flags, rf);
@@ -2888,11 +2893,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags))
goto unlock;
- if (p->in_iowait) {
- delayacct_blkio_end(p);
- atomic_dec(&task_rq(p)->nr_iowait);
- }
-
#ifdef CONFIG_SMP
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
@@ -2963,6 +2963,11 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
+ if (p->in_iowait) {
+ delayacct_blkio_end(p);
+ atomic_dec(&task_rq(p)->nr_iowait);
+ }
+
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
set_task_cpu(p, cpu);
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: f97bb5272d9e95d400d6c8643ebb146b3e3e7842
Gitweb: https://git.kernel.org/tip/f97bb5272d9e95d400d6c8643ebb146b3e3e7842
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 17 Nov 2020 09:08:41 +01:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Tue, 17 Nov 2020 13:15:27 +01:00
sched: Fix data-race in wakeup
Mel reported that on some ARM64 platforms loadavg goes bananas and
Will tracked it down to the following race:
CPU0 CPU1
schedule()
prev->sched_contributes_to_load = X;
deactivate_task(prev);
try_to_wake_up()
if (p->on_rq &&) // false
if (smp_load_acquire(&p->on_cpu) && // true
ttwu_queue_wakelist())
p->sched_remote_wakeup = Y;
smp_store_release(prev->on_cpu, 0);
where both p->sched_contributes_to_load and p->sched_remote_wakeup are
in the same word, and thus the stores X and Y race (and can clobber
one another's data).
Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") the p->on_cpu handoff serialized access to
p->sched_remote_wakeup (just as it still does with
p->sched_contributes_to_load) that commit broke that by calling
ttwu_queue_wakelist() with p->on_cpu != 0.
However, due to
p->XXX = X ttwu()
schedule() if (p->on_rq && ...) // false
smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) &&
deactivate_task() ttwu_queue_wakelist())
p->on_rq = 0; p->sched_remote_wakeup = Y;
We can be sure any 'current' store is complete and 'current' is
guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
the current flags word.
Note: while the observed failure was loadavg accounting gone wrong due
to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
is also possible where schedule() clobbers p->sched_remote_wakeup;
this could result in enqueue_entity() wrecking ->vruntime and causing
scheduling artifacts.
Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Mel Gorman <[email protected]>
Debugged-by: Will Deacon <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d383cf0..0e91b45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -769,7 +769,6 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
- unsigned sched_remote_wakeup:1;
#ifdef CONFIG_PSI
unsigned sched_psi_wake_requeue:1;
#endif
@@ -779,6 +778,21 @@ struct task_struct {
/* Unserialized, strictly 'current' */
+ /*
+ * This field must not be in the scheduler word above due to wakelist
+ * queueing no longer being serialized by p->on_cpu. However:
+ *
+ * p->XXX = X; ttwu()
+ * schedule() if (p->on_rq && ..) // false
+ * smp_mb__after_spinlock(); if (smp_load_acquire(&p->on_cpu) && //true
+ * deactivate_task() ttwu_queue_wakelist())
+ * p->on_rq = 0; p->sched_remote_wakeup = Y;
+ *
+ * guarantees all stores of 'current' are visible before
+ * ->sched_remote_wakeup gets used, so it can be in this word.
+ */
+ unsigned sched_remote_wakeup:1;
+
/* Bit to tell LSMs we're in execve(): */
unsigned in_execve:1;
unsigned in_iowait:1;