2020-07-02 17:49:42

by Dave Jones

Subject: weird loadavg on idle machine post 5.7

When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
idle machine (that usually sees loadavg hover in the 0.xx range)
that it was consistently above 1.00 even when there was nothing running.
All that perf showed was the kernel was spending time in the idle loop
(and running perf).

For the first hour or so after boot, everything seems fine, but over
time loadavg creeps up, and once it's established a new baseline, it
never seems to ever drop below that again.

One morning I woke up to find loadavg at '7.xx', after almost as many
hours of uptime, which makes me wonder if perhaps this is triggered
by something in cron. I have a bunch of scripts that fire off
every hour that involve thousands of shortlived runs of iptables/ipset,
but running them manually didn't seem to automatically trigger the bug.

Given it took a few hours of runtime to confirm good/bad, bisecting this
took the last two weeks. I did it four different times, the first
producing bogus results from over-eager 'good', but the last two runs
both implicated this commit:

commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
Author: Peter Zijlstra <[email protected]>
Date: Sun May 24 21:29:55 2020 +0100

sched/core: Optimize ttwu() spinning on p->on_cpu

Both Rik and Mel reported seeing ttwu() spend significant time on:

smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared, this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Jirka Hladky <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: [email protected]
Cc: Hillf Danton <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Unfortunately it doesn't revert cleanly on top of rc3 so I haven't
confirmed 100% that it's the cause yet, but the two separate bisects
seem promising.

I don't see any obvious correlation between what's changing there and
the symptoms (other than "scheduler magic") but maybe those closer to
this have ideas what could be going awry?

Dave


2020-07-02 19:47:27

by Dave Jones

Subject: Re: weird loadavg on idle machine post 5.7

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).

Unfortunate typo there, I meant 5.8-rc2, and just confirmed the bug persists in
5.8-rc3.

Dave

2020-07-02 21:18:06

by Paul Gortmaker

Subject: Re: weird loadavg on idle machine post 5.7

[weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
>
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
>
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron. I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
>
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first

I've seen pretty much the same thing - I was helping paulmck test
rcu-dev for something hopefully unrelated when I first saw it, and
assumed it came in with the sched-core merge, so I was using a commit
just below that merge as "good" to attempt a bisect.

> producing bogus results from over-eager 'good', but the last two runs

Yeah - it sucks. I was using Paul's TREE03 rcu-torture for loading and
even after a two hour test I'd still get "false good" results. Only
after 7h was I quite confident that good was really good.

> both implicated this commit:
>
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <[email protected]>
> Date: Sun May 24 21:29:55 2020 +0100
>
> sched/core: Optimize ttwu() spinning on p->on_cpu

I was down to 10 commits roughly above and below this guy before hearing
you were working the same problem.

I just got this guy to reveal a false load after a 2h test as well.
I want to let the one underneath soak overnight just to also confirm it
is "good" - so that is pending.

What I can add is that it looks like we are "leaking" an instance into
calc_load_tasks -- which isn't anything new -- see when tglx fixed it
before in d60585c5766. Unfortunately we don't have some low-overhead
leak checks on that... ?

Anyway, if I "fix" the leak, then everything seems back to normal:

(gdb) p calc_load_tasks
$2 = {counter = 1}
(gdb) set variable calc_load_tasks = { 0 }
(gdb) p calc_load_tasks
$4 = {counter = 0}
(gdb) continue
Continuing.

[ ... watching decay on resumed target ....]

10:13:14 up 9:54, 4 users, load average: 0.92, 0.98, 1.15
10:13:54 up 9:55, 4 users, load average: 0.47, 0.86, 1.10
10:15:17 up 9:56, 4 users, load average: 0.12, 0.65, 1.00
10:19:20 up 10:00, 4 users, load average: 0.00, 0.28, 0.76
10:26:07 up 10:07, 4 users, load average: 0.00, 0.06, 0.48
10:32:48 up 10:14, 4 users, load average: 0.00, 0.00, 0.29

Obviously that isn't a fix, but it shows it is an accounting thing.
I've also used gdb to snoop all the cfs->avg fields and they look as
expected for a completely idle machine. Nothing hiding in avg_rt or
avg_dl either.
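
To make the arithmetic concrete, here is a rough user-space rendering of
the fixed-point decay the kernel does (calc_load() and the constants are
lifted from kernel/sched/loadavg.c; the main() driver loop is purely for
illustration): a single task stuck in calc_load_tasks converges the
1-minute average to ~1.00, and it only decays once the count goes back
to zero, which is exactly the behaviour above.

#include <stdio.h>

#define FSHIFT   11
#define FIXED_1  (1 << FSHIFT)
#define EXP_1    1884           /* 1/exp(5sec/1min) as fixed-point */

static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
        unsigned long newload = load * exp + active * (FIXED_1 - exp);

        if (active >= load)
                newload += FIXED_1 - 1;

        return newload / FIXED_1;
}

int main(void)
{
        unsigned long avenrun = 0;
        int i;

        /* One phantom task in calc_load_tasks, sampled every 5s for ~10min. */
        for (i = 0; i < 120; i++)
                avenrun = calc_load(avenrun, EXP_1, 1 * FIXED_1);
        printf("stuck count=1:  %lu.%02lu\n", avenrun >> FSHIFT,
               ((avenrun & (FIXED_1 - 1)) * 100) >> FSHIFT);

        /* Zero the count (what the gdb poke above did) and watch it decay. */
        for (i = 0; i < 120; i++)
                avenrun = calc_load(avenrun, EXP_1, 0);
        printf("after zeroing:  %lu.%02lu\n", avenrun >> FSHIFT,
               ((avenrun & (FIXED_1 - 1)) * 100) >> FSHIFT);

        return 0;
}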

>
> Both Rik and Mel reported seeing ttwu() spend significant time on:
>
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> Attempt to avoid this by queueing the wakeup on the CPU that owns the
> p->on_cpu value. This will then allow the ttwu() to complete without
> further waiting.
>
> Since we run schedule() with interrupts disabled, the IPI is
> guaranteed to happen after p->on_cpu is cleared, this is what makes it
> safe to queue early.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> Cc: Jirka Hladky <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: [email protected]
> Cc: Hillf Danton <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
>
> Unfortunately it doesn't revert cleanly on top of rc3 so I haven't
> confirmed 100% that it's the cause yet, but the two separate bisects
> seem promising.

I've not tried the revert (yet) - but Kyle saw me boring people on
#kernel with the details of bisecting this and gave me the heads-up you
were looking at it too (thanks Kyle!). So I figured I'd better add
what I'd seen so far.

I'm testing with what is largely a defconfig, plus KVM_INTEL (needed for
paulmck TREE03 rcu-torture), plus I enabled KGDB and DEBUG_INFO after a
while so I could poke and prod - but was reproducing it before that.

For completeness, the test was:

tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 120 \
--configs TREE03 --trust-make

...on a 24-core, 2013-vintage Xeon v2 COTS box. As above, the 120m runs
seemed to give between 60-75% confidence of not getting a false good.
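
In case it helps anyone else watching these long soaks: a trivial watcher
along the lines of the below (plain C, reads /proc/loadavg; the one-minute
sample interval and the 1.00 threshold are arbitrary choices of mine, not
part of the rcu-torture setup) is enough to flag the "pinned above 1.00
while idle" signature in a log instead of eyeballing uptime.

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        for (;;) {
                double one, five, fifteen;
                FILE *f = fopen("/proc/loadavg", "r");

                if (!f || fscanf(f, "%lf %lf %lf", &one, &five, &fifteen) != 3) {
                        perror("/proc/loadavg");
                        return 1;
                }
                fclose(f);

                /* timestamp, 1/5/15 minute averages, marker when 1-min >= 1.0 */
                printf("%ld %.2f %.2f %.2f%s\n", (long)time(NULL),
                       one, five, fifteen,
                       one >= 1.0 ? "  <-- suspicious if idle" : "");
                fflush(stdout);
                sleep(60);
        }
}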

Anyway - so that is all I know so far...

Paul.
--

>
> I don't see any obvious correlation between what's changing there and
> the symptoms (other than "scheduler magic") but maybe those closer to
> this have ideas what could be going awry?
>
> Dave

2020-07-02 21:39:24

by Mel Gorman

Subject: Re: weird loadavg on idle machine post 5.7

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
>
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
>
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron. I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
>
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first
> producing bogus results from over-eager 'good', but the last two runs
> both implicated this commit:
>
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <[email protected]>
> Date: Sun May 24 21:29:55 2020 +0100
>
> sched/core: Optimize ttwu() spinning on p->on_cpu
>
> Both Rik and Mel reported seeing ttwu() spend significant time on:
>
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> Attempt to avoid this by queueing the wakeup on the CPU that owns the
> p->on_cpu value. This will then allow the ttwu() to complete without
> further waiting.
>
> Since we run schedule() with interrupts disabled, the IPI is
> guaranteed to happen after p->on_cpu is cleared, this is what makes it
> safe to queue early.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Peter, I'm not supremely confident about this but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued? In the final stages of a
task switch we have

prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
finish_task(prev);
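
(finish_task() itself being, trimmed down from kernel/sched/core.c of that
era, essentially just the release of on_cpu:)

static inline void finish_task(struct task_struct *prev)
{
#ifdef CONFIG_SMP
        /* Pairs with the smp_cond_load_acquire() in try_to_wake_up(). */
        smp_store_release(&prev->on_cpu, 0);
#endif
}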

finish_task is when p->on_cpu is cleared after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient so if the check below is true and ttwu_queue_wakelist
is used then sched_contributes_to_load was based on a transient value
and potentially wrong.

	if (smp_load_acquire(&p->on_cpu) &&
	    ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
		goto unlock;

sched_contributes_to_load determines the value of rq->nr_uninterruptible,
which feeds into the load value, so it's a partial fit. The race would not
happen very often on a normal desktop, which would explain why it takes
a long time for the value to get screwed up, so it appears to fit.
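
For reference, the macro being evaluated against that transient state is
(include/linux/sched.h, as also quoted in the patches further down the
thread):

#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
					 (task->flags & PF_FROZEN) == 0 && \
					 (task->state & TASK_NOLOAD) == 0)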

I'm thinking that the !!task_contributes_to_load(p) should still happen
after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
held to stabilise p->state against a parallel wakeup or updating the
task rq. I do not see any hazards with respect to smp_rmb and the value
of p->state in this particular path but I've confused myself enough in
the various scheduler and wakeup paths that I don't want to bet money on
it late in the evening

It builds, not booted, it's for discussion but maybe Dave is feeling brave!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
}

#ifdef CONFIG_SMP
- p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;
-
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);

+ /*
+ * p is off the cpu and pi_lock is held so p->state is stable
+ * for calculating whether it contributes to load.
+ */
+ p->sched_contributes_to_load = !!task_contributes_to_load(p);
+ p->state = TASK_WAKING;
+
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;

--
Mel Gorman
SUSE Labs

2020-07-02 23:12:06

by Michal Kubecek

Subject: Re: weird loadavg on idle machine post 5.7

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
>
> It builds, not booted, it's for discussion but maybe Dave is feeling brave!
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca5db40392d4..52c73598b18a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> }
>
> #ifdef CONFIG_SMP
> - p->sched_contributes_to_load = !!task_contributes_to_load(p);
> - p->state = TASK_WAKING;
> -
> /*
> * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
> * possible to, falsely, observe p->on_cpu == 0.
> @@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> */
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> + /*
> + * p is off the cpu and pi_lock is held so p->state is stable
> + * for calculating whether it contributes to load.
> + */
> + p->sched_contributes_to_load = !!task_contributes_to_load(p);
> + p->state = TASK_WAKING;
> +
> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
> if (task_cpu(p) != cpu) {
> wake_flags |= WF_MIGRATED;
>
> --

I felt brave but something is probably wrong: I tried to boot three
times and all three attempts crashed during boot, all three in scheduler
related functions:

set_next_entity()
check_preempt_wakeup()
finish_task_switch()

I left my USB-to-serial adapter in the office but I'll try again
tomorrow on a test machine with serial console.

Michal

2020-07-02 23:27:31

by Dave Jones

Subject: Re: weird loadavg on idle machine post 5.7

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:

> I'm thinking that the !!task_contributes_to_load(p) should still happen
> after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
> held to stabilise p->state against a parallel wakeup or updating the
> task rq. I do not see any hazards with respect to smp_rmb and the value
> of p->state in this particular path but I've confused myself enough in
> the various scheduler and wakeup paths that I don't want to bet money on
> it late in the evening
>
> It builds, not booted, it's for discussion but maybe Dave is feeling brave!

stalls, and then panics during boot :(


[ 16.933212] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 69.572840] watchdog: BUG: soft lockup - CPU#3 stuck for 44s! [kworker/u8:0:7]
[ 69.572849] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Not tainted 5.8.0-rc3-firewall+ #2
[ 69.572852] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[ 69.572861] Workqueue: 0x0 (events_power_efficient)
[ 69.572877] RIP: 0010:finish_task_switch+0x71/0x1a0
[ 69.572884] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[ 69.572887] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[ 69.572893] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[ 69.572895] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[ 69.572898] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[ 69.572901] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[ 69.572904] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[ 69.572909] FS: 0000000000000000(0000) GS:ffff946557580000(0000) knlGS:0000000000000000
[ 69.572912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 69.572917] CR2: 00007f29b26abc30 CR3: 000000020812d001 CR4: 00000000001606e0
[ 69.572919] Call Trace:
[ 69.572937] __schedule+0x28d/0x570
[ 69.572946] ? _cond_resched+0x15/0x30
[ 69.572954] schedule+0x38/0xa0
[ 69.572962] worker_thread+0xaa/0x3c0
[ 69.572968] ? process_one_work+0x3c0/0x3c0
[ 69.572972] kthread+0x116/0x130
[ 69.572977] ? __kthread_create_on_node+0x180/0x180
[ 69.572982] ret_from_fork+0x22/0x30
[ 69.572988] Kernel panic - not syncing: softlockup: hung tasks
[ 69.572993] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Tainted: G L 5.8.0-rc3-firewall+ #2
[ 69.572995] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[ 69.572998] Workqueue: 0x0 (events_power_efficient)
[ 69.573001] Call Trace:
[ 69.573004] <IRQ>
[ 69.573010] dump_stack+0x57/0x70
[ 69.573016] panic+0xfb/0x2cb
[ 69.573024] watchdog_timer_fn.cold.12+0x7d/0x96
[ 69.573030] ? softlockup_fn+0x30/0x30
[ 69.573035] __hrtimer_run_queues+0x100/0x280
[ 69.573041] hrtimer_interrupt+0xf4/0x210
[ 69.573049] __sysvec_apic_timer_interrupt+0x5d/0xf0
[ 69.573055] asm_call_on_stack+0x12/0x20
[ 69.573058] </IRQ>
[ 69.573064] sysvec_apic_timer_interrupt+0x6d/0x80
[ 69.573069] asm_sysvec_apic_timer_interrupt+0xf/0x20
[ 69.573078] RIP: 0010:finish_task_switch+0x71/0x1a0
[ 69.573082] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[ 69.573085] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[ 69.573088] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[ 69.573090] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[ 69.573092] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[ 69.573094] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[ 69.573096] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[ 69.573106] __schedule+0x28d/0x570
[ 69.573113] ? _cond_resched+0x15/0x30
[ 69.573119] schedule+0x38/0xa0
[ 69.573125] worker_thread+0xaa/0x3c0
[ 69.573130] ? process_one_work+0x3c0/0x3c0
[ 69.573134] kthread+0x116/0x130
[ 69.573149] ? __kthread_create_on_node+0x180/0x180
[ 69.792344] ret_from_fork+0x22/0x30


2020-07-03 09:05:17

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:

> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <[email protected]>

> Peter, I'm not supremely confident about this but could it be because
> "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> happens while a task is still being dequeued? In the final stages of a
> task switch we have
>
> prev_state = prev->state;
> vtime_task_switch(prev);
> perf_event_task_sched_in(prev, current);
> finish_task(prev);
>
> finish_task is when p->on_cpu is cleared after the state is updated.
> With the patch, we potentially update sched_contributes_to_load while
> p->state is transient so if the check below is true and ttwu_queue_wakelist
> is used then sched_contributes_to_load was based on a transient value
> and potentially wrong.

I'm not seeing it. Once a task hits schedule(), p->state doesn't change,
except through wakeup.

And while dequeue depends on p->state, it doesn't change it.

At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
started, which means we've (at least) stopped executing the task -- we
started or finished schedule().

Let me stare at this more...

2020-07-03 10:41:19

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Fri, Jul 03, 2020 at 11:02:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
>
> > > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > > Author: Peter Zijlstra <[email protected]>
>
> > Peter, I'm not supremely confident about this but could it be because
> > "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> > happens while a task is still being dequeued? In the final stages of a
> > task switch we have
> >
> > prev_state = prev->state;
> > vtime_task_switch(prev);
> > perf_event_task_sched_in(prev, current);
> > finish_task(prev);
> >
> > finish_task is when p->on_cpu is cleared after the state is updated.
> > With the patch, we potentially update sched_contributes_to_load while
> > p->state is transient so if the check below is true and ttwu_queue_wakelist
> > is used then sched_contributes_to_load was based on a transient value
> > and potentially wrong.
>
> I'm not seeing it. Once a task hits schedule(), p->state doesn't change,
> except through wakeup.
>
> And while dequeue depends on p->state, it doesn't change it.
>
> At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
> started, which means we've (at least) stopped executing the task -- we
> started or finished schedule().
>
> Let me stare at this more...

So ARM/Power/etc.. can speculate the load such that the
task_contributes_to_load() value is from before ->on_rq.

The compiler might similarly re-order things -- although I've not found it
doing so with the few builds I looked at.

So I think at the very least we should do something like this. But I've
no idea how to reproduce this problem.

Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
this, and by not resetting p->sched_contributes_to_load it would skew
accounting even worse.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcd56f04b706..cba8a56d0f7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2799,9 +2799,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
}

#ifdef CONFIG_SMP
- p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;
-
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
@@ -2823,6 +2820,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_rmb();

+ p->sched_contributes_to_load = !!task_contributes_to_load(p);
+ p->state = TASK_WAKING;
+
/*
* If the owning (remote) CPU is still in the middle of schedule() with
* this task as prev, considering queueing p on the remote CPUs wake_list

2020-07-03 13:29:15

by Paul Gortmaker

Subject: Re: weird loadavg on idle machine post 5.7

[Re: weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 17:15) Paul Gortmaker wrote:

> [weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

[...]

> > both implicated this commit:
> >
> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <[email protected]>
> > Date: Sun May 24 21:29:55 2020 +0100
> >
> > sched/core: Optimize ttwu() spinning on p->on_cpu
>
> I was down to 10 commits roughly above and below this guy before hearing
> you were working the same problem.
>
> I just got this guy to reveal a false load after a 2h test as well.
> I want to let the one underneath soak overnight just to also confirm it
> is "good" - so that is pending.

As per above, I ran a 12h test overnight on d505b8af5891 and it seems
fine. Every other "bad" bisect point failed in 7h or less. So my
testing seems to give the same result as Dave.

Paul.
--

root@t5610:/home/paul/git/linux-head#
[1]- Done nohup
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 720 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
root@t5610:/home/paul/git/linux-head# uptime
09:10:56 up 13:12, 2 users, load average: 0.00, 0.00, 0.14
root@t5610:/home/paul/git/linux-head# cat /proc/version
Linux version 5.7.0-rc6-00029-gd505b8af5891 (paul@t5610) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2), GNU ld (GNU Binutils for Ubuntu) 2.34) #2 SMP Thu Jul 2 18:55:40 EDT 2020

2020-07-03 20:55:28

by Dave Jones

Subject: Re: weird loadavg on idle machine post 5.7

On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:

> So ARM/Power/etc.. can speculate the load such that the
> task_contributes_to_load() value is from before ->on_rq.
>
> The compiler might similarly re-order things -- although I've not found it
> doing so with the few builds I looked at.
>
> So I think at the very least we should do something like this. But I've
> no idea how to reproduce this problem.
>
> Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
> this, and by not resetting p->sched_contributes_to_load it would skew
> accounting even worse.

looked promising the first few hours, but as soon as it hit four hours
of uptime, loadavg spiked and is now pinned to at least 1.00

Dave

2020-07-06 15:01:34

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
>
> > So ARM/Power/etc.. can speculate the load such that the
> > task_contributes_to_load() value is from before ->on_rq.
> >
> > The compiler might similarly re-order things -- although I've not found it
> > doing so with the few builds I looked at.
> >
> > So I think at the very least we should do something like this. But I've
> > no idea how to reproduce this problem.
> >
> > Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
> > this, and by not resetting p->sched_contributes_to_load it would skew
> > accounting even worse.
>
> looked promising the first few hours, but as soon as it hit four hours
> of uptime, loadavg spiked and is now pinned to at least 1.00

OK, lots of cursing later, I now have the below...

The TL;DR is that while schedule() doesn't change p->state once it
starts, it does read it quite a bit, and ttwu() will actually change it
to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
reads it to do loadavg accounting, things go sideways.

The below is extra complicated by the fact that I've had to scrounge up
a bunch of load-store ordering without actually adding barriers. It adds
yet another control dependency to ttwu(), so take that C standard :-)

I've booted it, and built a few kernels with it and checked loadavg
drops to 0 after each build, so from that pov all is well, but since
I'm not confident I can reproduce the issue, I can't tell this actually
fixes anything, except maybe phantoms of my imagination.

---
include/linux/sched.h | 4 ---
kernel/sched/core.c | 67 +++++++++++++++++++++++++++++++++++++--------------
2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bd073a10224..e26c8bbeda00 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;

#define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

-#define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0 && \
- (task->state & TASK_NOLOAD) == 0)
-
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..f245444b4b15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
- if (task_contributes_to_load(p))
- rq->nr_uninterruptible--;
-
enqueue_task(rq, p, flags);

p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;

- if (task_contributes_to_load(p))
- rq->nr_uninterruptible++;
-
dequeue_task(rq, p, flags);
}

@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,

lockdep_assert_held(&rq->lock);

-#ifdef CONFIG_SMP
if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;

+#ifdef CONFIG_SMP
if (wake_flags & WF_MIGRATED)
en_flags |= ENQUEUE_MIGRATED;
#endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* A similar smb_rmb() lives in try_invoke_on_locked_down_task().
*/
smp_rmb();
- if (p->on_rq && ttwu_remote(p, wake_flags))
+ if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
goto unlock;

if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
}

#ifdef CONFIG_SMP
- p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;
-
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
+ *
+ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+ * schedule()'s deactivate_task() has 'happened' and p will no longer
+ * care about it's own p->state. See the comment in __schedule().
*/
- smp_rmb();
+ smp_acquire__after_ctrl_dep();
+
+ /*
+ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+ * == 0), which means we need to do an enqueue, change p->state to
+ * TASK_WAKING such that we can unlock p->pi_lock before doing the
+ * enqueue, such as ttwu_queue_wakelist().
+ */
+ p->state = TASK_WAKING;

/*
* If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
+ unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
local_irq_disable();
rcu_note_context_switch(preempt);

+ prev_state = prev->state;
+
/*
- * Make sure that signal_pending_state()->signal_pending() below
- * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
- * done by the caller to avoid the race with signal_wake_up().
+ * __set_current_state(@state)
+ * schedule() signal_wake_up()
+ * prev_state = p->state set_tsk_thread_flag(p, TIF_SIGPENDING)
+ * wake_up_state()
+ * LOCK rq->lock LOCK p->pi_state
+ * smp_mb__after_spinlock() smp_mb__after_spinlock()
+ * if (signal_pending_state() if (p->state & @state)
+ *
*
- * The membarrier system call requires a full memory barrier
+ * Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
@@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
update_rq_clock(rq);

switch_count = &prev->nivcsw;
- if (!preempt && prev->state) {
- if (signal_pending_state(prev->state, prev)) {
+ /*
+ * We must re-load p->state in case ttwu_runnable() changed it
+ * before we acquired rq->lock.
+ */
+ if (!preempt && prev_state && prev_state == prev->state) {
+ if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
+ prev->sched_contributes_to_load =
+ (prev_state & (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)) == TASK_UNINTERRUPTIBLE &&
+ (prev->flags & PF_FROZEN) == 0;
+
+ if (prev->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
+ * LOCK rq->lock goto out;
+ * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
+ * p->on_rq = 0; p->state = TASK_WAKING;
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

if (prev->in_iowait) {

2020-07-06 21:23:42

by Dave Jones

Subject: Re: weird loadavg on idle machine post 5.7

On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
> >
> > looked promising the first few hours, but as soon as it hit four hours
> > of uptime, loadavg spiked and is now pinned to at least 1.00
>
> OK, lots of cursing later, I now have the below...
>
> The TL;DR is that while schedule() doesn't change p->state once it
> starts, it does read it quite a bit, and ttwu() will actually change it
> to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
> reads it to do loadavg accounting, things go sideways.
>
> The below is extra complicated by the fact that I've had to scrounge up
> a bunch of load-store ordering without actually adding barriers. It adds
> yet another control dependency to ttwu(), so take that C standard :-)

Man this stuff is subtle. I could've read this a hundred times and not
even come close to approaching this.

Basically me reading scheduler code:
http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg

> I've booted it, and built a few kernels with it and checked loadavg
> drops to 0 after each build, so from that pov all is well, but since
> I'm not confident I can reproduce the issue, I can't tell this actually
> fixes anything, except maybe phantoms of my imagination.

Five hours in, looking good so far. I think you nailed it.

Dave

2020-07-06 23:57:19

by Valentin Schneider

Subject: Re: weird loadavg on idle machine post 5.7


On 06/07/20 15:59, Peter Zijlstra wrote:
> OK, lots of cursing later, I now have the below...
>
> The TL;DR is that while schedule() doesn't change p->state once it
> starts, it does read it quite a bit, and ttwu() will actually change it
> to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
> reads it to do loadavg accounting, things go sideways.
>
> The below is extra complicated by the fact that I've had to scrounge up
> a bunch of load-store ordering without actually adding barriers. It adds
> yet another control dependency to ttwu(), so take that C standard :-)
>
> I've booted it, and built a few kernels with it and checked loadavg
> drops to 0 after each build, so from that pov all is well, but since
> I'm not confident I can reproduce the issue, I can't tell this actually
> fixes anything, except maybe phantoms of my imagination.
>

As you said on IRC, the one apparent race would lead to "negative"
rq->nr_uninterruptible accounting, i.e. we'd skip some increments and so
end up with more decrements. As for the described issue, I think we were
both expecting "positive" accounting, i.e. more increments than decrements,
leading to an artificially inflated loadavg.

In any case, this should get rid of any existing race. I'll need some more
time (and more aperol spritz) to go through it all though.

> @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> *
> * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
> * __schedule(). See the comment for smp_mb__after_spinlock().
> + *
> + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> + * schedule()'s deactivate_task() has 'happened' and p will no longer
> + * care about it's own p->state. See the comment in __schedule().
> */
> - smp_rmb();
> + smp_acquire__after_ctrl_dep();

Apologies for asking again, but I'm foolishly hopeful I'll someday be able
to grok those things without half a dozen tabs open with documentation and
Paul McKenney papers.

Do I get it right that the 'acquire' part hints this is equivalent to
issuing a load-acquire on whatever was needed to figure out whether or not
to take the branch (in this case, p->on_rq, amongst other things); IOW
ensures any memory access appearing later in program order has to happen
after the load?

That at least explains to me the load->{load,store} wording in
smp_acquire__after_ctrl_dep().

> +
> + /*
> + * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
> + * == 0), which means we need to do an enqueue, change p->state to
> + * TASK_WAKING such that we can unlock p->pi_lock before doing the
> + * enqueue, such as ttwu_queue_wakelist().
> + */
> + p->state = TASK_WAKING;
>
> /*
> * If the owning (remote) CPU is still in the middle of schedule() with

2020-07-07 07:50:10

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Mon, Jul 06, 2020 at 05:20:57PM -0400, Dave Jones wrote:
> On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote:
> > On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> > > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
> > >
> > > looked promising the first few hours, but as soon as it hit four hours
> > > of uptime, loadavg spiked and is now pinned to at least 1.00
> >
> > OK, lots of cursing later, I now have the below...
> >
> > The TL;DR is that while schedule() doesn't change p->state once it
> > starts, it does read it quite a bit, and ttwu() will actually change it
> > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
> > reads it to do loadavg accounting, things go sideways.
> >
> > The below is extra complicated by the fact that I've had to scrounge up
> > a bunch of load-store ordering without actually adding barriers. It adds
> > yet another control dependency to ttwu(), so take that C standard :-)
>
> Man this stuff is subtle. I could've read this a hundred times and not
> even come close to approaching this.
>
> Basically me reading scheduler code:
> http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg

Heh, that one made me nearly spill my tea, much funnies :-)

But yes, Dave Chinner also complained about this for the previous fix.
I've written this:

https://lore.kernel.org/lkml/[email protected]/

to help with that. But clearly I'll need to update that patch again
after this little adventure.

> > I've booted it, and built a few kernels with it and checked loadavg
> > drops to 0 after each build, so from that pov all is well, but since
> > I'm not confident I can reproduce the issue, I can't tell this actually
> > fixes anything, except maybe phantoms of my imagination.
>
> Five hours in, looking good so far. I think you nailed it.

\o/ hooray! Thanks for testing Dave!

2020-07-07 08:21:06

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote:

> > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> > *
> > * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
> > * __schedule(). See the comment for smp_mb__after_spinlock().
> > + *
> > + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> > + * schedule()'s deactivate_task() has 'happened' and p will no longer
> > + * care about it's own p->state. See the comment in __schedule().
> > */
> > - smp_rmb();
> > + smp_acquire__after_ctrl_dep();
>
> Apologies for asking again, but I'm foolishly hopeful I'll someday be able
> to grok those things without half a dozen tabs open with documentation and
> Paul McKenney papers.
>
> Do I get it right that the 'acquire' part hints this is equivalent to
> issuing a load-acquire on whatever was needed to figure out whether or not
> the take the branch (in this case, p->on_rq, amongst other things); IOW
> ensures any memory access appearing later in program order has to happen
> after the load?
>
> That at least explains to me the load->{load,store} wording in
> smp_acquire__after_ctrl_dep().

Yes.

So the thing is that hardware MUST NOT speculate stores, or rather, if
it does, it must take extreme measures to ensure they do not become
visible in any way shape or form, since speculative stores lead to
instant OOTA problems.

Therefore we can say that branches order stores and if the branch
condition depends on a load, we get a load->store order. IOW the load
must complete before we can resolve the branch, which in turn enables
the store to become visible/happen.

If we then add an smp_rmb() to the branch to order load->load, we end up
with a load->{load,store} ordering, which is equivalent to a
load-acquire.
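
For reference, the generic definition really is just that extra rmb;
roughly, from include/asm-generic/barrier.h:

#ifndef smp_acquire__after_ctrl_dep
#define smp_acquire__after_ctrl_dep()	smp_rmb()
#endif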

The reason to do it like that is that a load-acquire would otherwise
require an smp_mb(), since for many platforms that's the only barrier
that has load->store ordering.

The down-side of doing it like this, as Paul will be quick to point out,
is that the C standard doesn't recognise control dependencies and thus
the compiler would be within its rights to 'optimize' our conditional away.

We're relying on the compilers not having done this in the past and
there being sufficient compiler people interested in compiling Linux to
keep this from happening.


Anyway, this patch is basically:

    LOAD p->state              LOAD-ACQUIRE p->on_rq == 0
    MB
    STORE p->on_rq, 0          STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the p->state load.
Just a wee bit complicated due to not actually adding any barriers while
adding additional ordering.

Anyway, let me now endeavour to write a coherent Changelog for this mess
:-(

2020-07-07 09:23:10

by Qais Yousef

Subject: Re: weird loadavg on idle machine post 5.7

On 07/06/20 16:59, Peter Zijlstra wrote:

[...]

> @@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
> local_irq_disable();
> rcu_note_context_switch(preempt);
>
> + prev_state = prev->state;
> +
> /*
> - * Make sure that signal_pending_state()->signal_pending() below
> - * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> - * done by the caller to avoid the race with signal_wake_up().
> + * __set_current_state(@state)
> + * schedule() signal_wake_up()
> + * prev_state = p->state set_tsk_thread_flag(p, TIF_SIGPENDING)
> + * wake_up_state()
> + * LOCK rq->lock LOCK p->pi_state
> + * smp_mb__after_spinlock() smp_mb__after_spinlock()
> + * if (signal_pending_state() if (p->state & @state)
> + *
> *
> - * The membarrier system call requires a full memory barrier
> + * Also, the membarrier system call requires a full memory barrier
> * after coming from user-space, before storing to rq->curr.
> */
> rq_lock(rq, &rf);
> @@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
> update_rq_clock(rq);
>
> switch_count = &prev->nivcsw;
> - if (!preempt && prev->state) {
> - if (signal_pending_state(prev->state, prev)) {
> + /*
> + * We must re-load p->state in case ttwu_runnable() changed it
> + * before we acquired rq->lock.
> + */
> + if (!preempt && prev_state && prev_state == prev->state) {

I think the compiler won't optimize `prev_state == prev->state` out because of
the smp_mb__after_spinlock() which implies a compiler barrier. Still not sure
if it's worth making the prev->state accesses use READ_ONCE()?

Thanks

--
Qais Yousef

2020-07-07 09:49:25

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Tue, Jul 07, 2020 at 10:20:05AM +0100, Qais Yousef wrote:
> On 07/06/20 16:59, Peter Zijlstra wrote:

> > + if (!preempt && prev_state && prev_state == prev->state) {
>
> I think the compiler won't optimize `prev_state == prev->state` out because of
> the smp_mb__after_spinlock() which implies a compiler barrier. Still not sure
> if it's worth making the prev->state accesses use READ_ONCE()?

task_struct::state is one of the very rare (and ancient) variables
that's declared volatile.

We should probably clean that up some day, but so far I've not attempted
to do such a thing.
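
For reference, the declaration in question (struct task_struct in
include/linux/sched.h as of this thread):

	/* -1 unrunnable, 0 runnable, >0 stopped: */
	volatile long			state;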

2020-07-07 10:24:14

by Valentin Schneider

Subject: Re: weird loadavg on idle machine post 5.7


On 07/07/20 09:17, Peter Zijlstra wrote:
> On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote:
>
>> > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>> > *
>> > * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
>> > * __schedule(). See the comment for smp_mb__after_spinlock().
>> > + *
>> > + * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
>> > + * schedule()'s deactivate_task() has 'happened' and p will no longer
>> > + * care about it's own p->state. See the comment in __schedule().
>> > */
>> > - smp_rmb();
>> > + smp_acquire__after_ctrl_dep();
>>
>> Apologies for asking again, but I'm foolishly hopeful I'll someday be able
>> to grok those things without half a dozen tabs open with documentation and
>> Paul McKenney papers.
>>
>> Do I get it right that the 'acquire' part hints this is equivalent to
>> issuing a load-acquire on whatever was needed to figure out whether or not
>> the take the branch (in this case, p->on_rq, amongst other things); IOW
>> ensures any memory access appearing later in program order has to happen
>> after the load?
>>
>> That at least explains to me the load->{load,store} wording in
>> smp_acquire__after_ctrl_dep().
>
> Yes.
>
> So the thing is that hardware MUST NOT speculate stores, or rather, if
> it does, it must take extreme measures to ensure they do not become
> visible in any way shape or form, since speculative stores lead to
> instant OOTA problems.
>
> Therefore we can say that branches order stores and if the branch
> condition depends on a load, we get a load->store order. IOW the load
> must complete before we can resolve the branch, which in turn enables
> the store to become visible/happen.
>

Right, I think that point is made clear in memory-barriers.txt.

> If we then add an smp_rmb() to the branch to order load->load, we end up
> with a load->{load,store} ordering, which is equivalent to a
> load-acquire.
>
> The reason to do it like that is that a load-acquire would otherwise
> require an smp_mb(), since for many platforms that's the only barrier
> that has load->store ordering.
>
> The down-side of doing it like this, as Paul will be quick to point out,
> is that the C standard doesn't recognise control dependencies and thus
> the compiler would be within its rights to 'optimize' our conditional away.
>

Yikes!

> We're relying on the compilers not having done this in the past and
> there being sufficient compiler people interested in compiling Linux to
> keep this from happening.
>
>
> Anyway, this patch is basically:
>
>    LOAD p->state              LOAD-ACQUIRE p->on_rq == 0
>    MB
>    STORE p->on_rq, 0          STORE p->state, TASK_WAKING
>
> which ensures the TASK_WAKING store happens after the p->state load.
> Just a wee bit complicated due to not actually adding any barriers while
> adding additional ordering.
>

Your newer changelog also helped in that regard.

Thanks a ton for the writeup!

> Anyway, let me now endeavour to write a coherent Changelog for this mess
> :-(

2020-07-07 10:33:26

by Peter Zijlstra

Subject: Re: weird loadavg on idle machine post 5.7

On Tue, Jul 07, 2020 at 10:17:19AM +0200, Peter Zijlstra wrote:
> Anyway, let me now endeavour to write a coherent Changelog for this mess

I'll go stick this in sched/urgent and update that other documentation
patch (again)..

---
Subject: sched: Fix loadavg accounting race
From: Peter Zijlstra <[email protected]>
Date: Fri, 3 Jul 2020 12:40:33 +0200

The recent commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;

up before:

smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change it's ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to avoid weak hardware from re-ordering things for us. This can fairly
easily be achieved by relying on the control-dependency already in
place.

The second problem, which makes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make this same true for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to be a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state after. However, we
know that if it changed, we no longer have to worry about the blocking
path. This gives us the required ordering, if we block, we did the
prev->state load before an (effective) smp_mb() and the p->on_rq store
needs not change.

With this we end up with the effective ordering:

    LOAD p->state              LOAD-ACQUIRE p->on_rq == 0
    MB
    STORE p->on_rq, 0          STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <[email protected]>
Reported-by: Paul Gortmaker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Dave Jones <[email protected]>
Tested-by: Paul Gortmaker <[email protected]>
---
include/linux/sched.h | 4 --
kernel/sched/core.c | 67 ++++++++++++++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 20 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;

#define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

-#define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0 && \
- (task->state & TASK_NOLOAD) == 0)
-
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct r

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
- if (task_contributes_to_load(p))
- rq->nr_uninterruptible--;
-
enqueue_task(rq, p, flags);

p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, stru
{
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;

- if (task_contributes_to_load(p))
- rq->nr_uninterruptible++;
-
dequeue_task(rq, p, flags);
}

@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct t

lockdep_assert_held(&rq->lock);

-#ifdef CONFIG_SMP
if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;

+#ifdef CONFIG_SMP
if (wake_flags & WF_MIGRATED)
en_flags |= ENQUEUE_MIGRATED;
#endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, un
* A similar smb_rmb() lives in try_invoke_on_locked_down_task().
*/
smp_rmb();
- if (p->on_rq && ttwu_remote(p, wake_flags))
+ if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
goto unlock;

if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, un
}

#ifdef CONFIG_SMP
- p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;
-
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, un
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
+ *
+ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+ * schedule()'s deactivate_task() has 'happened' and p will no longer
+ * care about it's own p->state. See the comment in __schedule().
*/
- smp_rmb();
+ smp_acquire__after_ctrl_dep();
+
+ /*
+ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+ * == 0), which means we need to do an enqueue, change p->state to
+ * TASK_WAKING such that we can unlock p->pi_lock before doing the
+ * enqueue, such as ttwu_queue_wakelist().
+ */
+ p->state = TASK_WAKING;

/*
* If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(b
{
struct task_struct *prev, *next;
unsigned long *switch_count;
+ unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -4104,12 +4108,22 @@ static void __sched notrace __schedule(b
local_irq_disable();
rcu_note_context_switch(preempt);

+ /* See deactivate_task() below. */
+ prev_state = prev->state;
+
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
- * done by the caller to avoid the race with signal_wake_up().
+ * done by the caller to avoid the race with signal_wake_up():
+ *
+ * __set_current_state(@state) signal_wake_up()
+ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
+ * wake_up_state(p, state)
+ * LOCK rq->lock LOCK p->pi_state
+ * smp_mb__after_spinlock() smp_mb__after_spinlock()
+ * if (signal_pending_state()) if (p->state & @state)
*
- * The membarrier system call requires a full memory barrier
+ * Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
@@ -4120,10 +4134,31 @@ static void __sched notrace __schedule(b
update_rq_clock(rq);

switch_count = &prev->nivcsw;
- if (!preempt && prev->state) {
- if (signal_pending_state(prev->state, prev)) {
+ /*
+ * We must re-load prev->state in case ttwu_remote() changed it
+ * before we acquired rq->lock.
+ */
+ if (!preempt && prev_state && prev_state == prev->state) {
+ if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
+ prev->sched_contributes_to_load =
+ (prev_state & TASK_UNINTERRUPTIBLE) &&
+ !(prev_state & TASK_NOLOAD) &&
+ !(prev->flags & PF_FROZEN);
+
+ if (prev->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
+ * LOCK rq->lock goto out;
+ * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
+ * p->on_rq = 0; p->state = TASK_WAKING;
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

if (prev->in_iowait) {

Subject: [tip: sched/urgent] sched: Fix loadavg accounting race

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7
Gitweb: https://git.kernel.org/tip/dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 03 Jul 2020 12:40:33 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 08 Jul 2020 11:38:49 +02:00

sched: Fix loadavg accounting race

The recent commit:

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;

up before:

smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change it's ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to avoid weak hardware from re-ordering things for us. This can fairly
easily be achieved by relying on the control-dependency already in
place.

The second problem, which makes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make this same true for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to be a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state after. However, we
know that if it changed, we no longer have to worry about the blocking
path. This gives us the required ordering, if we block, we did the
prev->state load before an (effective) smp_mb() and the p->on_rq store
needs not change.

With this we end up with the effective ordering:

    LOAD p->state              LOAD-ACQUIRE p->on_rq == 0
    MB
    STORE p->on_rq, 0          STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <[email protected]>
Reported-by: Paul Gortmaker <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Dave Jones <[email protected]>
Tested-by: Paul Gortmaker <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 +---
kernel/sched/core.c | 67 +++++++++++++++++++++++++++++++-----------
2 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327..6833729 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;

#define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

-#define task_contributes_to_load(task) ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0 && \
- (task->state & TASK_NOLOAD) == 0)
-
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40..950ac45 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1311,9 +1311,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
- if (task_contributes_to_load(p))
- rq->nr_uninterruptible--;
-
enqueue_task(rq, p, flags);

p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1323,9 +1320,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;

- if (task_contributes_to_load(p))
- rq->nr_uninterruptible++;
-
dequeue_task(rq, p, flags);
}

@@ -2236,10 +2230,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,

lockdep_assert_held(&rq->lock);

-#ifdef CONFIG_SMP
if (p->sched_contributes_to_load)
rq->nr_uninterruptible--;

+#ifdef CONFIG_SMP
if (wake_flags & WF_MIGRATED)
en_flags |= ENQUEUE_MIGRATED;
#endif
@@ -2583,7 +2577,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* A similar smb_rmb() lives in try_invoke_on_locked_down_task().
*/
smp_rmb();
- if (p->on_rq && ttwu_remote(p, wake_flags))
+ if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
goto unlock;

if (p->in_iowait) {
@@ -2592,9 +2586,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
}

#ifdef CONFIG_SMP
- p->sched_contributes_to_load = !!task_contributes_to_load(p);
- p->state = TASK_WAKING;
-
/*
* Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
* possible to, falsely, observe p->on_cpu == 0.
@@ -2613,8 +2604,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*
* Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
* __schedule(). See the comment for smp_mb__after_spinlock().
+ *
+ * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+ * schedule()'s deactivate_task() has 'happened' and p will no longer
+ * care about it's own p->state. See the comment in __schedule().
*/
- smp_rmb();
+ smp_acquire__after_ctrl_dep();
+
+ /*
+ * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+ * == 0), which means we need to do an enqueue, change p->state to
+ * TASK_WAKING such that we can unlock p->pi_lock before doing the
+ * enqueue, such as ttwu_queue_wakelist().
+ */
+ p->state = TASK_WAKING;

/*
* If the owning (remote) CPU is still in the middle of schedule() with
@@ -4097,6 +4100,7 @@ static void __sched notrace __schedule(bool preempt)
{
struct task_struct *prev, *next;
unsigned long *switch_count;
+ unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
@@ -4113,12 +4117,22 @@ static void __sched notrace __schedule(bool preempt)
local_irq_disable();
rcu_note_context_switch(preempt);

+ /* See deactivate_task() below. */
+ prev_state = prev->state;
+
/*
* Make sure that signal_pending_state()->signal_pending() below
* can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
- * done by the caller to avoid the race with signal_wake_up().
+ * done by the caller to avoid the race with signal_wake_up():
+ *
+ * __set_current_state(@state) signal_wake_up()
+ * schedule() set_tsk_thread_flag(p, TIF_SIGPENDING)
+ * wake_up_state(p, state)
+ * LOCK rq->lock LOCK p->pi_state
+ * smp_mb__after_spinlock() smp_mb__after_spinlock()
+ * if (signal_pending_state()) if (p->state & @state)
*
- * The membarrier system call requires a full memory barrier
+ * Also, the membarrier system call requires a full memory barrier
* after coming from user-space, before storing to rq->curr.
*/
rq_lock(rq, &rf);
@@ -4129,10 +4143,31 @@ static void __sched notrace __schedule(bool preempt)
update_rq_clock(rq);

switch_count = &prev->nivcsw;
- if (!preempt && prev->state) {
- if (signal_pending_state(prev->state, prev)) {
+ /*
+ * We must re-load prev->state in case ttwu_remote() changed it
+ * before we acquired rq->lock.
+ */
+ if (!preempt && prev_state && prev_state == prev->state) {
+ if (signal_pending_state(prev_state, prev)) {
prev->state = TASK_RUNNING;
} else {
+ prev->sched_contributes_to_load =
+ (prev_state & TASK_UNINTERRUPTIBLE) &&
+ !(prev_state & TASK_NOLOAD) &&
+ !(prev->flags & PF_FROZEN);
+
+ if (prev->sched_contributes_to_load)
+ rq->nr_uninterruptible++;
+
+ /*
+ * __schedule() ttwu()
+ * prev_state = prev->state; if (READ_ONCE(p->on_rq) && ...)
+ * LOCK rq->lock goto out;
+ * smp_mb__after_spinlock(); smp_acquire__after_ctrl_dep();
+ * p->on_rq = 0; p->state = TASK_WAKING;
+ *
+ * After this, schedule() must not care about p->state any more.
+ */
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

if (prev->in_iowait) {