2020-10-29 18:24:21

by Daniel Jordan

[permalink] [raw]
Subject: [PATCH] cpuset: fix race between hotplug work and later CPU offline

One of our machines keeled over trying to rebuild the scheduler domains.
Mainline produces the same splat:

BUG: unable to handle page fault for address: 0000607f820054db
CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
Workqueue: events cpuset_hotplug_workfn
RIP: build_sched_domains
Call Trace:
partition_sched_domains_locked
rebuild_sched_domains_locked
cpuset_hotplug_workfn

It happens with cgroup2 and exclusive cpusets only. This reproducer
triggers it on an 8-cpu vm and works most effectively with no
preexisting child cgroups:

cd $UNIFIED_ROOT
mkdir cg1
echo 4-7 > cg1/cpuset.cpus
echo root > cg1/cpuset.cpus.partition

# with smt/control reading 'on',
echo off > /sys/devices/system/cpu/smt/control

RIP maps to

sd->shared = *per_cpu_ptr(sdd->sds, sd_id);

from sd_init(). sd_id is calculated earlier in the same function:

cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
sd_id = cpumask_first(sched_domain_span(sd));

tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
value from per_cpu_ptr() above.

The problem is a race between cpuset_hotplug_workfn() and a later
offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
when N is still online, the offline clears N from cpu_sibling_map, and
then the worker uses the stale effective masks that still have N to
generate the scheduling domains, leading the worker to read
N's empty cpu_sibling_map in sd_init().

rebuild_sched_domains_locked() prevented the race during the cgroup2
cpuset series up until the Fixes commit changed its check. Make the
check more robust so that it can detect an offline CPU in any exclusive
cpuset's effective mask, not just the top one.

Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
Signed-off-by: Daniel Jordan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Prateek Sood <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---

I think the right thing to do long-term is make the hotplug work
synchronous, fixing the lockdep splats of past attempts, and then take
these checks out of rebuild_sched_domains_locked, but this fixes the
immediate issue and is small enough for stable. Open to suggestions.

Prateek, are you planning on picking up your patches again?

kernel/cgroup/cpuset.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 57b5b5d0a5fd..ac3124010b2a 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
*/
static void rebuild_sched_domains_locked(void)
{
+ struct cgroup_subsys_state *pos_css;
struct sched_domain_attr *attr;
cpumask_var_t *doms;
+ struct cpuset *cs;
int ndoms;

lockdep_assert_cpus_held();
@@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;

- if (top_cpuset.nr_subparts_cpus &&
- !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
- return;
+ if (top_cpuset.nr_subparts_cpus) {
+ rcu_read_lock();
+ cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+ if (!is_partition_root(cs)) {
+ pos_css = css_rightmost_descendant(pos_css);
+ continue;
+ }
+ if (!cpumask_subset(cs->effective_cpus,
+ cpu_active_mask)) {
+ rcu_read_unlock();
+ return;
+ }
+ }
+ rcu_read_unlock();
+ }

/* Generate domain masks and attrs */
ndoms = generate_sched_domains(&doms, &attr);

base-commit: 23859ae44402f4d935b9ee548135dd1e65e2cbf4
--
2.29.0


2020-11-10 16:47:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] cpuset: fix race between hotplug work and later CPU offline

On Thu, Oct 29, 2020 at 02:18:45PM -0400, Daniel Jordan wrote:
> One of our machines keeled over trying to rebuild the scheduler domains.
> Mainline produces the same splat:
>
> BUG: unable to handle page fault for address: 0000607f820054db
> CPU: 2 PID: 149 Comm: kworker/1:1 Not tainted 5.10.0-rc1-master+ #6
> Workqueue: events cpuset_hotplug_workfn
> RIP: build_sched_domains
> Call Trace:
> partition_sched_domains_locked
> rebuild_sched_domains_locked
> cpuset_hotplug_workfn
>
> It happens with cgroup2 and exclusive cpusets only. This reproducer
> triggers it on an 8-cpu vm and works most effectively with no
> preexisting child cgroups:
>
> cd $UNIFIED_ROOT
> mkdir cg1
> echo 4-7 > cg1/cpuset.cpus
> echo root > cg1/cpuset.cpus.partition
>
> # with smt/control reading 'on',
> echo off > /sys/devices/system/cpu/smt/control
>
> RIP maps to
>
> sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
>
> from sd_init(). sd_id is calculated earlier in the same function:
>
> cpumask_and(sched_domain_span(sd), cpu_map, tl->mask(cpu));
> sd_id = cpumask_first(sched_domain_span(sd));
>
> tl->mask(cpu), which reads cpu_sibling_map on x86, returns an empty mask
> and so cpumask_first() returns >= nr_cpu_ids, which leads to the bogus
> value from per_cpu_ptr() above.
>
> The problem is a race between cpuset_hotplug_workfn() and a later
> offline of CPU N. cpuset_hotplug_workfn() updates the effective masks
> when N is still online, the offline clears N from cpu_sibling_map, and
> then the worker uses the stale effective masks that still have N to
> generate the scheduling domains, leading the worker to read
> N's empty cpu_sibling_map in sd_init().
>
> rebuild_sched_domains_locked() prevented the race during the cgroup2
> cpuset series up until the Fixes commit changed its check. Make the
> check more robust so that it can detect an offline CPU in any exclusive
> cpuset's effective mask, not just the top one.

*groan*, what a mess...

> Fixes: 0ccea8feb980 ("cpuset: Make generate_sched_domains() work with partition")
> Signed-off-by: Daniel Jordan <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Li Zefan <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Prateek Sood <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Waiman Long <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
>
> I think the right thing to do long-term is make the hotplug work
> synchronous, fixing the lockdep splats of past attempts, and then take
> these checks out of rebuild_sched_domains_locked, but this fixes the
> immediate issue and is small enough for stable. Open to suggestions.
>
> Prateek, are you planning on picking up your patches again?

Yeah, that might help, but those deadlocks were nasty iirc :/

> kernel/cgroup/cpuset.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 57b5b5d0a5fd..ac3124010b2a 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
> */
> static void rebuild_sched_domains_locked(void)
> {
> + struct cgroup_subsys_state *pos_css;
> struct sched_domain_attr *attr;
> cpumask_var_t *doms;
> + struct cpuset *cs;
> int ndoms;
>
> lockdep_assert_cpus_held();
> @@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
> !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> return;

So you argued above that effective_cpus was stale, I suppose the above
one works because its an equality test instead of a subset? Does that
wants a comment?

> - if (top_cpuset.nr_subparts_cpus &&
> - !cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
> - return;
> + if (top_cpuset.nr_subparts_cpus) {
> + rcu_read_lock();
> + cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> + if (!is_partition_root(cs)) {
> + pos_css = css_rightmost_descendant(pos_css);
> + continue;
> + }
> + if (!cpumask_subset(cs->effective_cpus,
> + cpu_active_mask)) {
> + rcu_read_unlock();
> + return;
> + }
> + }
> + rcu_read_unlock();
> + }
>
> /* Generate domain masks and attrs */
> ndoms = generate_sched_domains(&doms, &attr);

2020-11-10 20:37:17

by Daniel Jordan

[permalink] [raw]
Subject: Re: [PATCH] cpuset: fix race between hotplug work and later CPU offline

Peter Zijlstra <[email protected]> writes:
> On Thu, Oct 29, 2020 at 02:18:45PM -0400, Daniel Jordan wrote:
>> rebuild_sched_domains_locked() prevented the race during the cgroup2
>> cpuset series up until the Fixes commit changed its check. Make the
>> check more robust so that it can detect an offline CPU in any exclusive
>> cpuset's effective mask, not just the top one.
>
> *groan*, what a mess...

Ah, the joys of cpu hotplug!

>> I think the right thing to do long-term is make the hotplug work
>> synchronous, fixing the lockdep splats of past attempts, and then take
>> these checks out of rebuild_sched_domains_locked, but this fixes the
>> immediate issue and is small enough for stable. Open to suggestions.
>>
>> Prateek, are you planning on picking up your patches again?
>
> Yeah, that might help, but those deadlocks were nasty iirc :/

It might end up being too invasive to be worth it, but I'm being
optimistic for now.

>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index 57b5b5d0a5fd..ac3124010b2a 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -983,8 +983,10 @@ partition_and_rebuild_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
>> */
>> static void rebuild_sched_domains_locked(void)
>> {
>> + struct cgroup_subsys_state *pos_css;
>> struct sched_domain_attr *attr;
>> cpumask_var_t *doms;
>> + struct cpuset *cs;
>> int ndoms;
>>
>> lockdep_assert_cpus_held();
>> @@ -999,9 +1001,21 @@ static void rebuild_sched_domains_locked(void)
>> !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
>> return;
>
> So you argued above that effective_cpus was stale, I suppose the above
> one works because its an equality test instead of a subset?

Yep, fortunately enough.

> Does that wants a comment?

Ok, I'll change the comments to this absent other ideas.

/*
* If we have raced with CPU hotplug, return early to avoid
* passing doms with offlined cpu to partition_sched_domains().
* Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
*
* With no CPUs in any subpartitions, top_cpuset's effective CPUs
* should be the same as the active CPUs, so checking only top_cpuset
* is enough to detect racing CPU offlines.
*/
if (!top_cpuset.nr_subparts_cpus &&
!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
return;

/*
* With subpartition CPUs, however, the effective CPUs of a partition
* root should be only a subset of the active CPUs. Since a CPU in any
* partition root could be offlined, all must be checked.
*/
if (top_cpuset.nr_subparts_cpus) {
rcu_read_lock();
...


Thanks for looking.