2017-09-06 05:14:02

by Andy Lutomirski

Subject: Abysmal scheduler performance in Linus' tree?

I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
today, and I'm seeing abysmal scheduler performance. Running make -j4
ends up with all the tasks on CPU 3 most of the time (on my
4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU
0, but plain while true; do true; done puts the infinite loop on CPU 3
right along with the make -j4 tasks.

This is on Fedora 26, and I don't think I'm doing anything weird.
systemd has enabled the cpu controller, but it doesn't seem to have
configured anything or created any non-root cgroups.

Just a heads up. I haven't tried to diagnose it at all.
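
A minimal editorial sketch, not part of the original report: a spinner that keeps a
CPU busy and prints, once per second, which CPU the scheduler has placed it on, which
makes the pile-up described above easy to watch. It assumes Linux and glibc's
sched_getcpu().

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	time_t last = 0;

	/* burn CPU, much like "while true; do true; done" */
	for (;;) {
		time_t now = time(NULL);

		if (now != last) {	/* report at most once per second */
			printf("running on CPU %d\n", sched_getcpu());
			fflush(stdout);
			last = now;
		}
	}
}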


2017-09-06 08:25:25

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Tue, Sep 05, 2017 at 10:13:39PM -0700, Andy Lutomirski wrote:
> I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
> today, and I'm seeing abysmal scheduler performance. Running make -j4
> ends up with all the tasks on CPU 3 most of the time (on my
> 4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU
> 0, but plain while true; do true; done puts the infinite loop on CPU 3
> right along with the make -j4 tasks.
>
> This is on Fedora 26, and I don't think I'm doing anything weird.
> systemd has enabled the cpu controller, but it doesn't seem to have
> configured anything or created any non-root cgroups.
>
> Just a heads up. I haven't tried to diagnose it at all.

"make O=defconfig-build -j80" results in:

%Cpu0 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 88.7 us, 11.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 86.8 us, 13.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 89.7 us, 10.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 96.3 us, 3.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 95.3 us, 4.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 94.4 us, 5.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 91.7 us, 8.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 94.3 us, 5.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 96.2 us, 3.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu12 : 91.5 us, 8.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 97.2 us, 2.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 89.7 us, 10.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 93.4 us, 6.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu18 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu19 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu20 : 94.4 us, 5.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu21 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu22 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu23 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu24 : 91.6 us, 8.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu25 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu26 : 93.4 us, 5.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu27 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu28 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu29 : 88.8 us, 11.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu30 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu31 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu32 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu33 : 93.4 us, 6.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu34 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu35 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu36 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu37 : 97.2 us, 2.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu38 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu39 : 92.6 us, 7.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

Do you have a .config somewhere? Are you running with the systemd? Is it
creating cpu cgroups?

Any specifics on your setup?

2017-09-06 08:59:23

by Andy Lutomirski

Subject: Re: Abysmal scheduler performance in Linus' tree?



> On Sep 6, 2017, at 1:25 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Tue, Sep 05, 2017 at 10:13:39PM -0700, Andy Lutomirski wrote:
>> I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
>> today, and I'm seeing abysmal scheduler performance. Running make -j4
>> ends up with all the tasks on CPU 3 most of the time (on my
>> 4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU
>> 0, but plain while true; do true; done puts the infinite loop on CPU 3
>> right along with the make -j4 tasks.
>>
>> This is on Fedora 26, and I don't think I'm doing anything weird.
>> systemd has enabled the cpu controller, but it doesn't seem to have
>> configured anything or created any non-root cgroups.
>>
>> Just a heads up. I haven't tried to diagnose it at all.
>
> "make O=defconfig-build -j80" results in:
>
> %Cpu0 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu1 : 88.7 us, 11.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu2 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu3 : 86.8 us, 13.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu4 : 89.7 us, 10.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu5 : 96.3 us, 3.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu6 : 95.3 us, 4.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu7 : 94.4 us, 5.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu8 : 91.7 us, 8.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu9 : 94.3 us, 5.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu10 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu11 : 96.2 us, 3.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu12 : 91.5 us, 8.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu13 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu14 : 97.2 us, 2.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu15 : 89.7 us, 10.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu16 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu17 : 93.4 us, 6.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu18 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu19 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu20 : 94.4 us, 5.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu21 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu22 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu23 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu24 : 91.6 us, 8.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu25 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu26 : 93.4 us, 5.7 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
> %Cpu27 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu28 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu29 : 88.8 us, 11.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu30 : 90.6 us, 9.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu31 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu32 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu33 : 93.4 us, 6.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu34 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu35 : 93.5 us, 6.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu36 : 90.7 us, 9.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu37 : 97.2 us, 2.8 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu38 : 92.5 us, 7.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu39 : 92.6 us, 7.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
> Do you have a .config somewhere?

I'll attach tomorrow. I'll also test in a VM.

> Are you running with the systemd? Is it
> creating cpu cgroups?

Yes systemd, no cgroups.

>
> Any specifics on your setup?

On further fiddling, I only see this after a suspend and resume cycle.

2017-09-06 09:03:41

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, Sep 06, 2017 at 01:59:14AM -0700, Andy Lutomirski wrote:
> > Any specifics on your setup?
>
> On further fiddling, I only see this after a suspend and resume cycle.

Ah, ok. That's not something I otherwise test. Let's see if I can force
this brick of mine through a suspend-resume cycle :-)

2017-09-06 09:15:49

by Chris Wilson

Subject: Re: Abysmal scheduler performance in Linus' tree?

Quoting Andy Lutomirski (2017-09-06 06:13:39)
> I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
> today, and I'm seeing abysmal scheduler performance. Running make -j4
> ends up with all the tasks on CPU 3 most of the time (on my
> 4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU
> 0, but plain while true; do true; done puts the infinite loop on CPU 3
> right along with the make -j4 tasks.
>
> This is on Fedora 26, and I don't think I'm doing anything weird.
> systemd has enabled the cpu controller, but it doesn't seem to have
> configured anything or created any non-root cgroups.
>
> Just a heads up. I haven't tried to diagnose it at all.

There is an issue on !llc machines arising from numa_wake_wide() where
processes are not spread widely across the low-power cores:
https://patchwork.kernel.org/patch/9875581/

The patch we are using to fix the regression is
https://cgit.freedesktop.org/drm-intel/commit/?h=topic/core-for-CI&id=6c362d9657293d700a8299acbeb2f1e24378f488
-Chris

2017-09-06 09:24:33

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, Sep 06, 2017 at 10:15:33AM +0100, Chris Wilson wrote:
> Quoting Andy Lutomirski (2017-09-06 06:13:39)
> > I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree
> > today, and I'm seeing abysmal scheduler performance. Running make -j4
> > ends up with all the tasks on CPU 3 most of the time (on my
> > 4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU
> > 0, but plain while true; do true; done puts the infinite loop on CPU 3
> > right along with the make -j4 tasks.
> >
> > This is on Fedora 26, and I don't think I'm doing anything weird.
> > systemd has enabled the cpu controller, but it doesn't seem to have
> > configured anything or created any non-root cgroups.
> >
> > Just a heads up. I haven't tried to diagnose it at all.
>
> There is an issue on !llc machines arising from numa_wake_wide() where
> processes are not spread widely across the low-power cores:
> https://patchwork.kernel.org/patch/9875581/
>
> The patch we are using to fix the regression is
> https://cgit.freedesktop.org/drm-intel/commit/?h=topic/core-for-CI&id=6c362d9657293d700a8299acbeb2f1e24378f488


Urgh, that's the broken one; note how it has:

+static void get_llc_stats(struct llc_stats *stats, int cpu)
+{
+ struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+
+ if (!sds) {
+ memset(&stats, 0, sizeof(*stats));
^

0-day reported that this exploded with gcc-4.8 but worked with newer
compilers. Either way, it's fairly broken code.

The one that we committed and hit Linus' tree already (if my git-foo is
any good today) is:


commit 90001d67be2fa2acbe3510d1f64fa6533efa30ef
Author: Peter Zijlstra <[email protected]>
Date: Mon Jul 31 17:50:05 2017 +0200

sched/fair: Fix wake_affine() for !NUMA_BALANCING

In commit:

3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")

Rik changed wake_affine to consider NUMA information when balancing
between LLC domains.

There are a number of problems here which this patch tries to address:

- LLC < NODE; in this case we'd use the wrong information to balance
- !NUMA_BALANCING: in this case, the new code doesn't do any
balancing at all
- re-computes the NUMA data for every wakeup, this can mean iterating
up to 64 CPUs for every wakeup.
- default affine wakeups inside a cache

We address these by saving the load/capacity values for each
sched_domain during regular load-balance and using these values in
wake_affine_llc(). The obvious down-side to using cached values is
that they can be too old and poorly reflect reality.

But this way we can use LLC wide information and thus not rely on
assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
we have to aggregate two nodes (or even cache domains) worth of CPUs
for each wakeup.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Josef Bacik <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
[ Minor readability improvements. ]
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7d065abc7a47..d7b6dab956ec 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,6 +71,14 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+
+ /*
+ * Some variables from the most recent sd_lb_stats for this domain,
+ * used by wake_affine().
+ */
+ unsigned long nr_running;
+ unsigned long load;
+ unsigned long capacity;
};

struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7f1c3b797f8..8d5868771cb3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2658,59 +2658,6 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
}
}

-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- */
-static inline bool numa_wake_affine(struct sched_domain *sd,
- struct task_struct *p, int this_cpu,
- int prev_cpu, int sync)
-{
- struct numa_stats prev_load, this_load;
- s64 this_eff_load, prev_eff_load;
-
- update_numa_stats(&prev_load, cpu_to_node(prev_cpu));
- update_numa_stats(&this_load, cpu_to_node(this_cpu));
-
- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync) {
- unsigned long current_load = task_h_load(current);
-
- if (this_load.load > current_load)
- this_load.load -= current_load;
- else
- this_load.load = 0;
- }
-
- /*
- * In low-load situations, where this_cpu's node is idle due to the
- * sync cause above having dropped this_load.load to 0, move the task.
- * Moving to an idle socket will not create a bad imbalance.
- *
- * Otherwise check if the nodes are near enough in load to allow this
- * task to be woken on this_cpu's node.
- */
- if (this_load.load > 0) {
- unsigned long task_load = task_h_load(p);
-
- this_eff_load = 100;
- this_eff_load *= prev_load.compute_capacity;
-
- prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
- prev_eff_load *= this_load.compute_capacity;
-
- this_eff_load *= this_load.load + task_load;
- prev_eff_load *= prev_load.load - task_load;
-
- return this_eff_load <= prev_eff_load;
- }
-
- return true;
-}
#else
static void task_tick_numa(struct rq *rq, struct task_struct *curr)
{
@@ -2724,14 +2671,6 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
{
}

-#ifdef CONFIG_SMP
-static inline bool numa_wake_affine(struct sched_domain *sd,
- struct task_struct *p, int this_cpu,
- int prev_cpu, int sync)
-{
- return true;
-}
-#endif /* !SMP */
#endif /* CONFIG_NUMA_BALANCING */

static void
@@ -5428,20 +5367,115 @@ static int wake_wide(struct task_struct *p)
return 1;
}

+struct llc_stats {
+ unsigned long nr_running;
+ unsigned long load;
+ unsigned long capacity;
+ int has_capacity;
+};
+
+static bool get_llc_stats(struct llc_stats *stats, int cpu)
+{
+ struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
+
+ if (!sds)
+ return false;
+
+ stats->nr_running = READ_ONCE(sds->nr_running);
+ stats->load = READ_ONCE(sds->load);
+ stats->capacity = READ_ONCE(sds->capacity);
+ stats->has_capacity = stats->nr_running < per_cpu(sd_llc_size, cpu);
+
+ return true;
+}
+
+/*
+ * Can a task be moved from prev_cpu to this_cpu without causing a load
+ * imbalance that would trigger the load balancer?
+ *
+ * Since we're running on 'stale' values, we might in fact create an imbalance
+ * but recomputing these values is expensive, as that'd mean iteration 2 cache
+ * domains worth of CPUs.
+ */
+static bool
+wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
+ int this_cpu, int prev_cpu, int sync)
+{
+ struct llc_stats prev_stats, this_stats;
+ s64 this_eff_load, prev_eff_load;
+ unsigned long task_load;
+
+ if (!get_llc_stats(&prev_stats, prev_cpu) ||
+ !get_llc_stats(&this_stats, this_cpu))
+ return false;
+
+ /*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current LLC.
+ */
+ if (sync) {
+ unsigned long current_load = task_h_load(current);
+
+ /* in this case load hits 0 and this LLC is considered 'idle' */
+ if (current_load > this_stats.load)
+ return true;
+
+ this_stats.load -= current_load;
+ }
+
+ /*
+ * The has_capacity stuff is not SMT aware, but by trying to balance
+ * the nr_running on both ends we try and fill the domain at equal
+ * rates, thereby first consuming cores before siblings.
+ */
+
+ /* if the old cache has capacity, stay there */
+ if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
+ return false;
+
+ /* if this cache has capacity, come here */
+ if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
+ return true;
+
+ /*
+ * Check to see if we can move the load without causing too much
+ * imbalance.
+ */
+ task_load = task_h_load(p);
+
+ this_eff_load = 100;
+ this_eff_load *= prev_stats.capacity;
+
+ prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+ prev_eff_load *= this_stats.capacity;
+
+ this_eff_load *= this_stats.load + task_load;
+ prev_eff_load *= prev_stats.load - task_load;
+
+ return this_eff_load <= prev_eff_load;
+}
+
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
int prev_cpu, int sync)
{
int this_cpu = smp_processor_id();
- bool affine = false;
+ bool affine;

/*
- * Common case: CPUs are in the same socket, and select_idle_sibling()
- * will do its thing regardless of what we return:
+ * Default to no affine wakeups; wake_affine() should not effect a task
+ * placement the load-balancer feels inclined to undo. The conservative
+ * option is therefore to not move tasks when they wake up.
*/
- if (cpus_share_cache(prev_cpu, this_cpu))
- affine = true;
- else
- affine = numa_wake_affine(sd, p, this_cpu, prev_cpu, sync);
+ affine = false;
+
+ /*
+ * If the wakeup is across cache domains, try to evaluate if movement
+ * makes sense, otherwise rely on select_idle_siblings() to do
+ * placement inside the cache domain.
+ */
+ if (!cpus_share_cache(prev_cpu, this_cpu))
+ affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);

schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
if (affine) {
@@ -7121,6 +7155,7 @@ struct sg_lb_stats {
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
struct sched_group *local; /* Local group in this sd */
+ unsigned long total_running;
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
@@ -7140,6 +7175,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
*sds = (struct sd_lb_stats){
.busiest = NULL,
.local = NULL,
+ .total_running = 0UL,
.total_load = 0UL,
.total_capacity = 0UL,
.busiest_stat = {
@@ -7575,6 +7611,7 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
*/
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
+ struct sched_domain_shared *shared = env->sd->shared;
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
@@ -7631,6 +7668,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

next_group:
/* Now, start updating sd_lb_stats */
+ sds->total_running += sgs->sum_nr_running;
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;

@@ -7646,6 +7684,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
env->dst_rq->rd->overload = overload;
}

+ if (!shared)
+ return;
+
+ /*
+ * Since these are sums over groups they can contain some CPUs
+ * multiple times for the NUMA domains.
+ *
+ * Currently only wake_affine_llc() and find_busiest_group()
+ * uses these numbers, only the last is affected by this problem.
+ *
+ * XXX fix that.
+ */
+ WRITE_ONCE(shared->nr_running, sds->total_running);
+ WRITE_ONCE(shared->load, sds->total_load);
+ WRITE_ONCE(shared->capacity, sds->total_capacity);
}

/**
@@ -7875,6 +7928,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
if (!sds.busiest || busiest->sum_nr_running == 0)
goto out_balanced;

+ /* XXX broken for overlapping NUMA groups */
sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
/ sds.total_capacity;
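
As an editorial aside, here is a stand-alone sketch of the effective-load
comparison that wake_affine_llc() performs in the patch above. The
imbalance_pct of 117 and the load/capacity figures are made-up numbers for
illustration, not values taken from the thread.

#include <stdio.h>

int main(void)
{
	/* all values are made up for illustration */
	long imbalance_pct = 117;
	long prev_capacity = 1024, this_capacity = 1024;
	long prev_load = 2048, this_load = 1536, task_load = 256;

	/*
	 * The previous LLC gets a margin of (imbalance_pct - 100) / 2
	 * percent before the wakeup is allowed to pull the task over,
	 * mirroring the comparison at the end of wake_affine_llc().
	 */
	long this_eff_load = 100 * prev_capacity * (this_load + task_load);
	long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
			     this_capacity * (prev_load - task_load);

	printf("pull task to the waking LLC: %s\n",
	       this_eff_load <= prev_eff_load ? "yes" : "no");
	return 0;
}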


2017-09-06 10:44:27

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, Sep 06, 2017 at 11:18:46AM +0100, Chris Wilson wrote:

> > +static void get_llc_stats(struct llc_stats *stats, int cpu)
> > +{
> > + struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
> > +
> > + if (!sds) {
> > + memset(&stats, 0, sizeof(*stats));
>
> Yes, I even sent you a mail about it ;)

Bah, too much email, sorry :-(

> > + /*
> > + * The has_capacity stuff is not SMT aware, but by trying to balance
> > + * the nr_running on both ends we try and fill the domain at equal
> > + * rates, thereby first consuming cores before siblings.
> > + */
> > +
> > + /* if the old cache has capacity, stay there */
> > + if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
> > + return false;
> > +
> > + /* if this cache has capacity, come here */
> > + if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
> > + return true;
>
> This is still not working as intended; it should be
>
> if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> return true;
>
> to fix the regression.

Argh, you're quite right. Let me do a patch for that.
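
To make the off-by-one concrete, a stand-alone editorial sketch of just the
"come here" test, with made-up numbers: when both LLCs are equally busy, the
original comparison still pulls the task towards the waking LLC, while the
corrected one leaves it where it is.

#include <stdbool.h>
#include <stdio.h>

struct llc { unsigned long nr_running; bool has_capacity; };

/* the test as merged: biased, pulls even when loads are equal */
static bool pull_broken(struct llc waking, struct llc prev)
{
	return waking.has_capacity && waking.nr_running < prev.nr_running + 1;
}

/* the corrected test: only pull when the waking LLC is clearly emptier */
static bool pull_fixed(struct llc waking, struct llc prev)
{
	return waking.has_capacity && waking.nr_running + 1 < prev.nr_running;
}

int main(void)
{
	struct llc prev   = { .nr_running = 4, .has_capacity = true };
	struct llc waking = { .nr_running = 4, .has_capacity = true };

	printf("broken rule pulls: %d, fixed rule pulls: %d\n",
	       pull_broken(waking, prev), pull_fixed(waking, prev));
	return 0;
}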

2017-09-06 10:51:36

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, Sep 06, 2017 at 12:44:20PM +0200, Peter Zijlstra wrote:
> > > + /* if this cache has capacity, come here */
> > > + if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
> > > + return true;
> >
> > This is still not working as intended; it should be
> >
> > if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> > return true;
> >
> > to fix the regression.
>
> Argh, you're quite right. Let me do a patch for that.

---
Subject: sched/fair: Fix wake_affine_llc() balance rules
From: Peter Zijlstra <[email protected]>
Date: Wed Sep 6 12:45:45 CEST 2017

Chris reported that the SMT balance rules got the +1 on the wrong
side, resulting in a bias towards the current LLC, which the
load-balancer would then try and undo.

Reported-by: Chris Wilson <[email protected]>
Fixes: 90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5435,7 +5435,7 @@ wake_affine_llc(struct sched_domain *sd,
return false;

/* if this cache has capacity, come here */
- if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
+ if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
return true;

/*

2017-09-06 16:14:29

by Peter Zijlstra

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, Sep 06, 2017 at 11:03:33AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 06, 2017 at 01:59:14AM -0700, Andy Lutomirski wrote:
> > > Any specifics on your setup?
> >
> > On further fiddling, I only see this after a suspend and resume cycle.
>
> Ah, ok. That's not something I otherwise test. Let's see if I can force
> this brick of mine through a suspend-resume cycle :-)

I'd be suspicious of these here commits:

77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
09e0dd8e0f2e ("sched/topology: Avoid pointless rebuild")
bbdacdfed2f5 ("sched/debug: Optimize sched_domain sysctl generation")

I tested them with regular hotplug, but suspend resume always is a tad
funny.

2017-09-07 06:16:04

by Mike Galbraith

Subject: Re: Abysmal scheduler performance in Linus' tree?

On Wed, 2017-09-06 at 18:14 +0200, Peter Zijlstra wrote:
> On Wed, Sep 06, 2017 at 11:03:33AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 06, 2017 at 01:59:14AM -0700, Andy Lutomirski wrote:
> > > > Any specifics on your setup?
> > >
> > > On further fiddling, I only see this after a suspend and resume cycle.
> >
> > Ah, ok. That's not something I otherwise test. Let's see if I can force
> > this brick of mine through a suspend-resume cycle :-)
>
> I'd be suspicious of these here commits:
>
> 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")

Seems to be that one.  After suspend/resume, cpu7 (i4790+SMT) is online
with NULL domain.  offline/online it, it grows proper domains.

-Mike

(on vacation, about to go explore that unfamiliar 'outside' place)  
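
The offline/online workaround Mike mentions can also be driven from a program
by writing to the CPU hotplug sysfs files; a small editorial sketch, assuming
root privileges and that cpu7 is the affected CPU.

#include <stdio.h>

/* write 0 or 1 to /sys/devices/system/cpu/cpuN/online (needs root) */
static int set_cpu_online(int cpu, int online)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", online);
	return fclose(f);
}

int main(void)
{
	set_cpu_online(7, 0);	/* take the CPU down ... */
	set_cpu_online(7, 1);	/* ... and back up; sched domains get rebuilt */
	return 0;
}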

2017-09-07 07:31:42

by Ingo Molnar

Subject: Re: Abysmal scheduler performance in Linus' tree?


* Mike Galbraith <[email protected]> wrote:

> On Wed, 2017-09-06 at 18:14 +0200, Peter Zijlstra wrote:
> > On Wed, Sep 06, 2017 at 11:03:33AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 06, 2017 at 01:59:14AM -0700, Andy Lutomirski wrote:
> > > > > Any specifics on your setup?
> > > >
> > > > On further fiddling, I only see this after a suspend and resume cycle.
> > >
> > > Ah, ok. That's not something I otherwise test. Let's see if I can force
> > > this brick of mine through a suspend-resume cycle :-)
> >
> > I'd be suspicious of these here commits:
> >
> > 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
>
> Seems to be that one.  After suspend/resume, cpu7 (i4790+SMT) is online
> with NULL domain.  offline/online it, it grows proper domains.

Thanks, that's really useful, should narrow things down!

> (on vacation, about to go explore that unfamiliar 'outside' place)

Be careful, it's a dangerous world for kernel hackers! ;-)

Thanks,

Ingo

Subject: [tip:sched/urgent] sched/fair: Fix wake_affine_llc() balancing rules

Commit-ID: a731ebe6f17bff9e7ca12ef227f9da4d5bdf8425
Gitweb: http://git.kernel.org/tip/a731ebe6f17bff9e7ca12ef227f9da4d5bdf8425
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 6 Sep 2017 12:51:31 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 Sep 2017 09:29:31 +0200

sched/fair: Fix wake_affine_llc() balancing rules

Chris Wilson reported that the SMT balance rules got the +1 on the
wrong side, resulting in a bias towards the current LLC, which the
load-balancer would then try and undo.

Reported-by: Chris Wilson <[email protected]>
Tested-by: Chris Wilson <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d58687..9dd2ce1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5435,7 +5435,7 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
return false;

/* if this cache has capacity, come here */
- if (this_stats.has_capacity && this_stats.nr_running < prev_stats.nr_running+1)
+ if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
return true;

/*

2017-09-07 09:13:51

by Peter Zijlstra

Subject: [PATCH] sched/cpuset/pm: Fix cpuset vs suspend-resume

On Thu, Sep 07, 2017 at 08:15:29AM +0200, Mike Galbraith wrote:

> > I'd be suspicious of these here commits:
> >
> > 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
>
> Seems to be that one.  After suspend/resume, cpu7 (i4790+SMT) is online
> with NULL domain.  offline/online it, it grows proper domains.

Shiny.. and *urgh*.

So if I offline cpus 4-39 (otherwise too much output) and then do:

# echo processors > /sys/power/pm_test
# echo disk > /sys/power/state

I get:

[ 568.315315] Disabling non-boot CPUs ...
[ 568.329059] CPU0 attaching NULL sched-domain.
[ 568.333942] CPU1 attaching NULL sched-domain.
[ 568.338805] CPU2 attaching NULL sched-domain.
[ 568.343675] CPU3 attaching NULL sched-domain.
[ 568.348645] CPU0 attaching sched-domain(s):
[ 568.353322] domain-0: span=0,2-3 level=MC
[ 568.357924] groups: 0:{ span=0 }, 2:{ span=2 }, 3:{ span=3 }
[ 568.364454] CPU2 attaching sched-domain(s):
[ 568.369125] domain-0: span=0,2-3 level=MC
[ 568.373699] groups: 2:{ span=2 }, 3:{ span=3 }, 0:{ span=0 }
[ 568.380224] CPU3 attaching sched-domain(s):
[ 568.384896] domain-0: span=0,2-3 level=MC
[ 568.389471] groups: 3:{ span=3 }, 0:{ span=0 }, 2:{ span=2 }
[ 568.448127] smpboot: CPU 1 is now offline

[ 568.462083] CPU0 attaching NULL sched-domain.
[ 568.466972] CPU2 attaching NULL sched-domain.
[ 568.471850] CPU3 attaching NULL sched-domain.
[ 568.476855] CPU0 attaching sched-domain(s):
[ 568.481551] domain-0: span=0,3 level=MC
[ 568.485986] groups: 0:{ span=0 }, 3:{ span=3 }
[ 568.491178] CPU3 attaching sched-domain(s):
[ 568.495864] domain-0: span=0,3 level=MC
[ 568.500252] groups: 3:{ span=3 }, 0:{ span=0 }
[ 568.522159] smpboot: CPU 2 is now offline

[ 568.537075] CPU0 attaching NULL sched-domain.
[ 568.541969] CPU3 attaching NULL sched-domain.
[ 568.546947] CPU0 attaching NULL sched-domain.
[ 568.558865] smpboot: CPU 3 is now offline

[ 568.563955] PM: hibernation debug: Waiting for 5 seconds.



[ 573.570008] Enabling non-boot CPUs ...
[ 573.594298] CPU0 attaching NULL sched-domain.
[ 573.599216] CPU0 attaching sched-domain(s):
[ 573.603891] domain-0: span=0-1 level=MC
[ 573.608281] groups: 0:{ span=0 }, 1:{ span=1 }
[ 573.613455] CPU1 attaching sched-domain(s):
[ 573.618127] domain-0: span=0-1 level=MC
[ 573.622509] groups: 1:{ span=1 }, 0:{ span=0 }
[ 573.627668] span: 0-1 (max cpu_capacity = 1024)
[ 573.632799] CPU1 is up

[ 573.650551] CPU0 attaching NULL sched-domain.
[ 573.655421] CPU1 attaching NULL sched-domain.
[ 573.660340] CPU0 attaching sched-domain(s):
[ 573.665012] domain-0: span=0-2 level=MC
[ 573.669402] groups: 0:{ span=0 }, 1:{ span=1 }, 2:{ span=2 }
[ 573.675930] CPU1 attaching sched-domain(s):
[ 573.680601] domain-0: span=0-2 level=MC
[ 573.684982] groups: 1:{ span=1 }, 2:{ span=2 }, 0:{ span=0 }
[ 573.691496] CPU2 attaching sched-domain(s):
[ 573.696166] domain-0: span=0-2 level=MC
[ 573.700546] groups: 2:{ span=2 }, 0:{ span=0 }, 1:{ span=1 }
[ 573.707064] span: 0-2 (max cpu_capacity = 1024)
[ 573.712198] CPU2 is up

[ 573.730022] CPU3 is up

[ 575.913686] Restarting tasks ... done.

It's failing to rebuild domains for CPU3.

And I can see how that happens: note how cpuset_cpu_active() doesn't
build an identity domain for the last CPU but relies on
cpuset_update_active_cpus() to create the proper full cpuset
configuration.

_However_, since cpuset_cpu_inactive() did _NOT_ call
cpuset_update_active_cpus(), top_cpuset.effective_cpus wasn't updated,
and now cpuset_hotplug_workfn() will find the active mask matching and
not in fact update anything.

So this was already broken if you'd had an actual cpuset configuration:
it would never restore the proper cpuset domains and would retain the
identity domain setup created during resume.

What's worse, we don't seem to wait for completion of the
cpuset_hotplug_workfn() before unfreezing userspace.

---
Subject: sched/cpuset/pm: Fix cpuset vs suspend-resume

Cpusets vs suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask, and thus it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers that do use
cpusets tend to not suspend-resume much.

An additional problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 16 +++++++++++++++-
kernel/power/process.c | 5 ++++-
kernel/sched/core.c | 7 +++----
4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e74655d941b7..a1e6a33a4b03 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -51,7 +51,9 @@ static inline void cpuset_dec(void)

extern int cpuset_init(void);
extern void cpuset_init_smp(void);
+extern void cpuset_force_rebuild(void);
extern void cpuset_update_active_cpus(void);
+extern void cpuset_wait_for_hotplug(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -164,11 +166,15 @@ static inline bool cpusets_enabled(void) { return false; }
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}

+static inline void cpuset_force_rebuild(void) { }
+
static inline void cpuset_update_active_cpus(void)
{
partition_sched_domains(1, NULL, NULL);
}

+static inline void cpuset_wait_for_hotplug(void) { }
+
static inline void cpuset_cpus_allowed(struct task_struct *p,
struct cpumask *mask)
{
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2f4039bafebb..0513ee39698b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2267,6 +2267,13 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs)
mutex_unlock(&cpuset_mutex);
}

+static bool force_rebuild;
+
+void cpuset_force_rebuild(void)
+{
+ force_rebuild = true;
+}
+
/**
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
@@ -2341,8 +2348,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
}

/* rebuild sched domains if cpus_allowed has changed */
- if (cpus_updated)
+ if (cpus_updated || force_rebuild) {
+ force_rebuild = false;
rebuild_sched_domains();
+ }
}

void cpuset_update_active_cpus(void)
@@ -2355,6 +2364,11 @@ void cpuset_update_active_cpus(void)
schedule_work(&cpuset_hotplug_work);
}

+void cpuset_wait_for_hotplug(void)
+{
+ flush_work(&cpuset_hotplug_work);
+}
+
/*
* Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
* Call this routine anytime after node_states[N_MEMORY] changes.
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 78672d324a6e..50f25cb370c6 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -20,8 +20,9 @@
#include <linux/workqueue.h>
#include <linux/kmod.h>
#include <trace/events/power.h>
+#include <linux/cpuset.h>

-/*
+/*
* Timeout for stopping processes
*/
unsigned int __read_mostly freeze_timeout_msecs = 20 * MSEC_PER_SEC;
@@ -202,6 +203,8 @@ void thaw_processes(void)
__usermodehelper_set_disable_depth(UMH_FREEZING);
thaw_workqueues();

+ cpuset_wait_for_hotplug();
+
read_lock(&tasklist_lock);
for_each_process_thread(g, p) {
/* No other threads should have PF_SUSPEND_TASK set */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 703f5831738e..7ba7084ffc0a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5558,16 +5558,15 @@ static void cpuset_cpu_active(void)
* operation in the resume sequence, just build a single sched
* domain, ignoring cpusets.
*/
- num_cpus_frozen--;
- if (likely(num_cpus_frozen)) {
- partition_sched_domains(1, NULL, NULL);
+ partition_sched_domains(1, NULL, NULL);
+ if (--num_cpus_frozen)
return;
- }
/*
* This is the last CPU online operation. So fall through and
* restore the original sched domains by considering the
* cpuset configurations.
*/
+ cpuset_force_rebuild();
}
cpuset_update_active_cpus();
}
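
As an editorial aside, the non-obvious half of the patch above is the one-shot
force_rebuild flag: the async hotplug work only rebuilds sched domains when it
sees the cpumask change, so the resume path has to force a rebuild even when
the masks already match (the other half, cpuset_wait_for_hotplug(), simply
flushes that work before tasks are thawed). A minimal single-threaded sketch of
the flag logic, with the kernel calls replaced by printf:

#include <stdbool.h>
#include <stdio.h>

static bool force_rebuild;

static void cpuset_force_rebuild(void)
{
	force_rebuild = true;
}

/* stand-in for cpuset_hotplug_workfn(): only acts on a change or the flag */
static void hotplug_workfn(bool cpus_updated)
{
	if (cpus_updated || force_rebuild) {
		force_rebuild = false;
		printf("rebuild_sched_domains()\n");
	} else {
		printf("nothing to do\n");
	}
}

int main(void)
{
	hotplug_workfn(false);	/* pre-fix resume: masks match, nothing happens */
	cpuset_force_rebuild();	/* last frozen CPU comes back online */
	hotplug_workfn(false);	/* post-fix resume: domains are rebuilt */
	return 0;
}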

2017-09-07 09:26:25

by Peter Zijlstra

Subject: Re: [PATCH] sched/cpuset/pm: Fix cpuset vs suspend-resume

On Thu, Sep 07, 2017 at 11:13:38AM +0200, Peter Zijlstra wrote:
> Subject: sched/cpuset/pm: Fix cpuset vs suspend-resume
>
> Cpusets vs suspend-resume is _completely_ broken. And it got noticed
> because it now resulted in non-cpuset usage breaking too.
>
> On suspend cpuset_cpu_inactive() doesn't call into
> cpuset_update_active_cpus() because it doesn't want to move tasks about,
> there is no need, all tasks are frozen and won't run again until after
> we've resumed everything.
>
> But this means that when we finally do call into
> cpuset_update_active_cpus() after resuming the last frozen cpu in
> cpuset_cpu_active(), the top_cpuset will not have any difference with
> the cpu_active_mask, and thus it will not in fact do _anything_.
>
> So the cpuset configuration will not be restored. This was largely
> hidden because we would unconditionally create identity domains and
> mobile users would not in fact use cpusets much. And servers that do use
> cpusets tend to not suspend-resume much.
>
> An additional problem is that we'd not in fact wait for the cpuset work to
> finish before resuming the tasks, allowing spurious migrations outside
> of the specified domains.
>
> Fix the rebuild by introducing cpuset_force_rebuild() and fix the
> ordering with cpuset_wait_for_hotplug().
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Reported-by: Andy Lutomirski <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

TJ, I _think_ it was commit:

deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")

That wrecked things, but there have been so many changes in this area it is
really hard to tell. Note how before that commit it would
unconditionally rebuild the domains, and you 'optimized' that ;-)

That commit also introduced the work to do the async rebuild and failed
to do that flush on resume.

In any case, I think we should put a fixes tag on this commit such that
it gets picked up into stable kernels. Not sure anybody will try and
backport it into 4 year old kernels, but who knows.

Subject: [tip:sched/urgent] sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs

Commit-ID: 50e76632339d4655859523a39249dd95ee5e93e7
Gitweb: http://git.kernel.org/tip/50e76632339d4655859523a39249dd95ee5e93e7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 7 Sep 2017 11:13:38 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 7 Sep 2017 11:45:21 +0200

sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask, and thus it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers that do use
cpusets tend to not suspend-resume much.

An additional problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 16 +++++++++++++++-
kernel/power/process.c | 5 ++++-
kernel/sched/core.c | 7 +++----
4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index e74655d..a1e6a33 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -51,7 +51,9 @@ static inline void cpuset_dec(void)

extern int cpuset_init(void);
extern void cpuset_init_smp(void);
+extern void cpuset_force_rebuild(void);
extern void cpuset_update_active_cpus(void);
+extern void cpuset_wait_for_hotplug(void);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -164,11 +166,15 @@ static inline bool cpusets_enabled(void) { return false; }
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}

+static inline void cpuset_force_rebuild(void) { }
+
static inline void cpuset_update_active_cpus(void)
{
partition_sched_domains(1, NULL, NULL);
}

+static inline void cpuset_wait_for_hotplug(void) { }
+
static inline void cpuset_cpus_allowed(struct task_struct *p,
struct cpumask *mask)
{
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 2f4039b..0513ee3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -2267,6 +2267,13 @@ retry:
mutex_unlock(&cpuset_mutex);
}

+static bool force_rebuild;
+
+void cpuset_force_rebuild(void)
+{
+ force_rebuild = true;
+}
+
/**
* cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
*
@@ -2341,8 +2348,10 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
}

/* rebuild sched domains if cpus_allowed has changed */
- if (cpus_updated)
+ if (cpus_updated || force_rebuild) {
+ force_rebuild = false;
rebuild_sched_domains();
+ }
}

void cpuset_update_active_cpus(void)
@@ -2355,6 +2364,11 @@ void cpuset_update_active_cpus(void)
schedule_work(&cpuset_hotplug_work);
}

+void cpuset_wait_for_hotplug(void)
+{
+ flush_work(&cpuset_hotplug_work);
+}
+
/*
* Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
* Call this routine anytime after node_states[N_MEMORY] changes.
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 78672d3..50f25cb 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -20,8 +20,9 @@
#include <linux/workqueue.h>
#include <linux/kmod.h>
#include <trace/events/power.h>
+#include <linux/cpuset.h>

-/*
+/*
* Timeout for stopping processes
*/
unsigned int __read_mostly freeze_timeout_msecs = 20 * MSEC_PER_SEC;
@@ -202,6 +203,8 @@ void thaw_processes(void)
__usermodehelper_set_disable_depth(UMH_FREEZING);
thaw_workqueues();

+ cpuset_wait_for_hotplug();
+
read_lock(&tasklist_lock);
for_each_process_thread(g, p) {
/* No other threads should have PF_SUSPEND_TASK set */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d2c7ff..136a76d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5556,16 +5556,15 @@ static void cpuset_cpu_active(void)
* operation in the resume sequence, just build a single sched
* domain, ignoring cpusets.
*/
- num_cpus_frozen--;
- if (likely(num_cpus_frozen)) {
- partition_sched_domains(1, NULL, NULL);
+ partition_sched_domains(1, NULL, NULL);
+ if (--num_cpus_frozen)
return;
- }
/*
* This is the last CPU online operation. So fall through and
* restore the original sched domains by considering the
* cpuset configurations.
*/
+ cpuset_force_rebuild();
}
cpuset_update_active_cpus();
}

2017-09-07 11:03:15

by Rafael J. Wysocki

Subject: Re: [PATCH] sched/cpuset/pm: Fix cpuset vs suspend-resume

On Thursday, September 7, 2017 11:26:16 AM CEST Peter Zijlstra wrote:
> On Thu, Sep 07, 2017 at 11:13:38AM +0200, Peter Zijlstra wrote:
> > Subject: sched/cpuset/pm: Fix cpuset vs suspend-resume
> >
> > Cpusets vs suspend-resume is _completely_ broken. And it got noticed
> > because it now resulted in non-cpuset usage breaking too.
> >
> > On suspend cpuset_cpu_inactive() doesn't call into
> > cpuset_update_active_cpus() because it doesn't want to move tasks about,
> > there is no need, all tasks are frozen and won't run again until after
> > we've resumed everything.
> >
> > But this means that when we finally do call into
> > cpuset_update_active_cpus() after resuming the last frozen cpu in
> > cpuset_cpu_active(), the top_cpuset will not have any difference with
> > the cpu_active_mask, and thus it will not in fact do _anything_.
> >
> > So the cpuset configuration will not be restored. This was largely
> > hidden because we would unconditionally create identity domains and
> > mobile users would not in fact use cpusets much. And servers that do use
> > cpusets tend to not suspend-resume much.
> >
> > An additional problem is that we'd not in fact wait for the cpuset work to
> > finish before resuming the tasks, allowing spurious migrations outside
> > of the specified domains.
> >
> > Fix the rebuild by introducing cpuset_force_rebuild() and fix the
> > ordering with cpuset_wait_for_hotplug().
> >
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Reported-by: Andy Lutomirski <[email protected]>
> > Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>
> TJ, I _think_ it was commit:
>
> deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
>
> That wrecked things, but there have been so many changes in this area it is
> really hard to tell. Note how before that commit it would
> unconditionally rebuild the domains, and you 'optimized' that ;-)
>
> That commit also introduced the work to do the async rebuild and failed
> to do that flush on resume.
>
> In any case, I think we should put a fixes tag on this commit such that
> it gets picked up into stable kernels. Not sure anybody will try and
> backport it into 4 year old kernels, but who knows.
>

Many thanks for fixing this!


2017-09-07 20:38:15

by Tejun Heo

Subject: Re: [PATCH] sched/cpuset/pm: Fix cpuset vs suspend-resume

Hello,

On Thu, Sep 07, 2017 at 11:26:16AM +0200, Peter Zijlstra wrote:
> TJ, I _think_ it was commit:
>
> deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")

Heh, that's a while ago.

> That wrecked things, but there have been so many changes in this area it is
> really hard to tell. Note how before that commit it would
> unconditionally rebuild the domains, and you 'optimized' that ;-)
>
> That commit also introduced the work to do the async rebuild and failed
> to do that flush on resume.
>
> In any case, I think we should put a fixes tag on this commit such that
> it gets picked up into stable kernels. Not sure anybody will try and
> backport it into 4 year old kernels, but who knows.

I see & sounds good to me. I see that the patch already got applied
but FWIW,

Acked-by: Tejun Heo <[email protected]>

Thanks!

--
tejun