2021-02-04 01:07:40

by Alexey Klimov

Subject: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

When a CPU is offlined and onlined via device_offline() and device_online(),
userspace gets a uevent notification. If, after receiving the "online" uevent,
userspace calls sched_setaffinity() on some task to move it to the recently
onlined CPU, the call often fails with -EINVAL. Userspace needs to wait around
5..30 ms after receiving the uevent before sched_setaffinity() succeeds for
the recently onlined CPU.
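
Without a kernel-side fix, a uevent consumer can only retry. A minimal sketch
of such a workaround follows. The helper retry_on_einval(), the callback type
affinity_op and the stand-in fake_setaffinity() are hypothetical names used
for illustration only; a real caller would pass a callback that wraps
sched_setaffinity().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns 0 on success, -1 with errno set. */
typedef int (*affinity_op)(void *arg);

/*
 * Call op() up to max_tries times, sleeping delay_ms between attempts
 * for as long as it fails with EINVAL. Returns 0 on success, -1 otherwise
 * (errno is left as set by the last attempt).
 */
int retry_on_einval(affinity_op op, void *arg, int max_tries, long delay_ms)
{
	struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
	int i;

	for (i = 0; i < max_tries; i++) {
		if (op(arg) == 0)
			return 0;
		if (errno != EINVAL)
			return -1;	/* unrelated failure, do not retry */
		if (i + 1 < max_tries)
			nanosleep(&ts, NULL);
	}
	return -1;
}

/*
 * Stand-in for sched_setaffinity(), for illustration only: fails with
 * EINVAL until *remaining_failures drops to zero.
 */
int fake_setaffinity(void *arg)
{
	int *remaining_failures = arg;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		errno = EINVAL;
		return -1;
	}
	return 0;
}
```

With a delay in the 5..30 ms range mentioned above, a handful of attempts
would normally be enough, but the right values are a guess on any given
system, which is why fixing the ordering in the kernel is preferable.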

If the in_mask argument for sched_setaffinity() contains only the recently
onlined CPU, the call often fails with the following flow:

  sched_setaffinity()
    cpuset_cpus_allowed()
      guarantee_online_cpus()      <-- cs->effective_cpus mask does not
                                       contain recently onlined cpu
    cpumask_and()                  <-- final new_mask is empty
    __set_cpus_allowed_ptr()
      cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
      returns -EINVAL
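
The mask arithmetic in this flow can be sketched with plain 64-bit bitmasks.
This is a simplified model, not kernel code: model_any_and() and
model_setaffinity() are illustrative names, NR_CPU_IDS is assumed to be 64,
and the model picks the first matching CPU rather than distributing as
cpumask_any_and_distribute() does.

```c
#include <errno.h>
#include <stdint.h>

#define NR_CPU_IDS 64	/* assumption: at most 64 CPUs, one bit per CPU */

/* Model of cpumask_any_and(): first CPU set in both masks, or NR_CPU_IDS. */
unsigned int model_any_and(uint64_t a, uint64_t b)
{
	uint64_t both = a & b;
	unsigned int cpu;

	for (cpu = 0; cpu < NR_CPU_IDS; cpu++)
		if (both & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPU_IDS;
}

/*
 * Simplified model of the sched_setaffinity() path: in_mask is first
 * restricted to the cpuset's effective CPUs, then a destination CPU is
 * looked up among the active ones.
 */
int model_setaffinity(uint64_t in_mask, uint64_t effective_cpus,
		      uint64_t active_mask)
{
	uint64_t new_mask = in_mask & effective_cpus;	/* cpumask_and() */

	if (model_any_and(active_mask, new_mask) >= NR_CPU_IDS)
		return -EINVAL;
	return 0;
}
```

For example, with CPU 3 already active (active_mask 0xF) but effective_cpus
still 0x7, a request for CPU 3 alone ends with an empty new_mask and -EINVAL.
Once effective_cpus becomes 0xF, the same request succeeds.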

The cpusets used in guarantee_online_cpus() are updated from a workqueue by
cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
callback sched_cpu_activate(). Hence the update may not yet be observable by
sched_setaffinity() if it is called immediately after the uevent.
The out of line uevent can be avoided if we ensure that cpuset_hotplug_work
has run to completion, using cpuset_wait_for_hotplug(), after onlining the
cpu in cpu_up() and in cpuhp_smt_enable().
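
The ordering problem can be shown with a toy, single-threaded model. All
names below (struct model_state and the model_* functions) are illustrative
and not kernel identifiers, and the model assumes at most 64 CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy single-threaded model; every name here is illustrative. */
struct model_state {
	uint64_t active;	/* stands in for cpu_active_mask */
	uint64_t effective;	/* stands in for cs->effective_cpus */
	bool work_pending;	/* cpuset_hotplug_work queued but not yet run */
};

/* Hotplug callback: marks the CPU active and only queues the cpuset update. */
void model_cpu_activate(struct model_state *s, unsigned int cpu)
{
	s->active |= UINT64_C(1) << cpu;
	s->work_pending = true;
}

/* Stands in for waiting on the queued work: run it if it is still pending. */
void model_wait_for_hotplug(struct model_state *s)
{
	if (s->work_pending) {
		s->effective = s->active;
		s->work_pending = false;
	}
}

/* What userspace would observe if it reacted to the uevent at this instant. */
bool model_can_pin(const struct model_state *s, unsigned int cpu)
{
	return (s->effective >> cpu) & 1;
}
```

In this model, a uevent emitted right after model_cpu_activate() reaches
userspace while model_can_pin() is still false. Emitting it only after
model_wait_for_hotplug() closes that window, which is the ordering the patch
aims for.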

Co-analyzed-by: Joshua Baker <[email protected]>
Signed-off-by: Alexey Klimov <[email protected]>
---

The previous RFC patch and discussion are here:
https://lore.kernel.org/lkml/[email protected]/

Commit a49e4629b5ed ("cpuset: Make cpuset hotplug synchronous") would also
have gotten rid of the early uevent, but it was reverted because of deadlocks.

The nature of this bug is also described here (with different consequences):
https://lore.kernel.org/lkml/[email protected]/

Reproducer: https://gitlab.com/0xeafffffe/xlam

With these changes applied, the reproducer code runs without issues.
The idea is to avoid the situation where userspace receives the event for an
onlined CPU that is not ready to take tasks for a while after the uevent.


kernel/cpu.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 4e11e91010e1..ea728e75a74d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -15,6 +15,7 @@
 #include <linux/sched/smt.h>
 #include <linux/unistd.h>
 #include <linux/cpu.h>
+#include <linux/cpuset.h>
 #include <linux/oom.h>
 #include <linux/rcupdate.h>
 #include <linux/export.h>
@@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
 	err = _cpu_up(cpu, 0, target);
 out:
 	cpu_maps_update_done();
+
+	/* To avoid out of line uevent */
+	if (!err)
+		cpuset_wait_for_hotplug();
+
 	return err;
 }

@@ -2062,8 +2068,6 @@ static void cpuhp_offline_cpu_device(unsigned int cpu)
 	struct device *dev = get_cpu_device(cpu);

 	dev->offline = true;
-	/* Tell user space about the state change */
-	kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
 }

 static void cpuhp_online_cpu_device(unsigned int cpu)
@@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
 	struct device *dev = get_cpu_device(cpu);

 	dev->offline = false;
-	/* Tell user space about the state change */
-	kobject_uevent(&dev->kobj, KOBJ_ONLINE);
 }

 int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 {
-	int cpu, ret = 0;
+	struct device *dev;
+	cpumask_var_t mask;
+	int cpu, ret;
+
+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;

+	ret = 0;
 	cpu_maps_update_begin();
 	for_each_online_cpu(cpu) {
 		if (topology_is_primary_thread(cpu))
@@ -2100,17 +2108,32 @@ int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
 		 * serialized against the regular offline usage.
 		 */
 		cpuhp_offline_cpu_device(cpu);
+		cpumask_set_cpu(cpu, mask);
 	}
 	if (!ret)
 		cpu_smt_control = ctrlval;
 	cpu_maps_update_done();
+
+	/* Tell user space about the state changes */
+	for_each_cpu(cpu, mask) {
+		dev = get_cpu_device(cpu);
+		kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
+	}
+
+	free_cpumask_var(mask);
 	return ret;
 }

 int cpuhp_smt_enable(void)
 {
-	int cpu, ret = 0;
+	struct device *dev;
+	cpumask_var_t mask;
+	int cpu, ret;

+	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	ret = 0;
 	cpu_maps_update_begin();
 	cpu_smt_control = CPU_SMT_ENABLED;
 	for_each_present_cpu(cpu) {
@@ -2122,8 +2145,20 @@ int cpuhp_smt_enable(void)
 			break;
 		/* See comment in cpuhp_smt_disable() */
 		cpuhp_online_cpu_device(cpu);
+		cpumask_set_cpu(cpu, mask);
 	}
 	cpu_maps_update_done();
+
+	/* To avoid out of line uevents */
+	cpuset_wait_for_hotplug();
+
+	/* Tell user space about the state changes */
+	for_each_cpu(cpu, mask) {
+		dev = get_cpu_device(cpu);
+		kobject_uevent(&dev->kobj, KOBJ_ONLINE);
+	}
+
+	free_cpumask_var(mask);
 	return ret;
 }
 #endif
--
2.30.0


2021-02-04 13:47:28

by Peter Zijlstra

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On Thu, Feb 04, 2021 at 12:50:34PM +0000, Alexey Klimov wrote:
> On Thu, Feb 4, 2021 at 9:46 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
> > > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> > > err = _cpu_up(cpu, 0, target);
> > > out:
> > > cpu_maps_update_done();
> > > +
> > > + /* To avoid out of line uevent */
> > > + if (!err)
> > > + cpuset_wait_for_hotplug();
> > > +
> > > return err;
> > > }
> > >
> >
> > > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> > > struct device *dev = get_cpu_device(cpu);
> > >
> > > dev->offline = false;
> > > - /* Tell user space about the state change */
> > > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> > > }
> > >
> >
> > One consequence of this is that you'll now get a bunch of notifications
> > across things like suspend/hibernate.
>
> The patch doesn't change the number of kobject_uevent()s. The
> userspace will get the same number of uevents as before the patch (at
> least if I can rely on my eyes).

bringup_hibernate_cpu() didn't used to generate an event, it does now.
Same for bringup_nonboot_cpus().

Also, looking again, you don't seem to be reinstating the OFFLINE event
you took out.


2021-02-04 23:40:06

by Peter Zijlstra

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
> @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> err = _cpu_up(cpu, 0, target);
> out:
> cpu_maps_update_done();
> +
> + /* To avoid out of line uevent */
> + if (!err)
> + cpuset_wait_for_hotplug();
> +
> return err;
> }
>

> @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> struct device *dev = get_cpu_device(cpu);
>
> dev->offline = false;
> - /* Tell user space about the state change */
> - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> }
>

One consequence of this is that you'll now get a bunch of notifications
across things like suspend/hibernate.

2021-02-05 00:10:31

by Alexey Klimov

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On Thu, Feb 4, 2021 at 9:46 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
> > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> > err = _cpu_up(cpu, 0, target);
> > out:
> > cpu_maps_update_done();
> > +
> > + /* To avoid out of line uevent */
> > + if (!err)
> > + cpuset_wait_for_hotplug();
> > +
> > return err;
> > }
> >
>
> > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> > struct device *dev = get_cpu_device(cpu);
> >
> > dev->offline = false;
> > - /* Tell user space about the state change */
> > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> > }
> >
>
> One consequence of this is that you'll now get a bunch of notifications
> across things like suspend/hibernate.

The patch doesn't change the number of kobject_uevent()s. The
userspace will get the same number of uevents as before the patch (at
least if I can rely on my eyes).
Or is there a concern that now the uevents are sent in a row
sequentially which might abuse userspace uevents handling machinery?

Best regards,
Alexey

2021-02-05 01:58:29

by Daniel Jordan

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

Peter Zijlstra <[email protected]> writes:

> On Thu, Feb 04, 2021 at 12:50:34PM +0000, Alexey Klimov wrote:
>> On Thu, Feb 4, 2021 at 9:46 AM Peter Zijlstra <[email protected]> wrote:
>> >
>> > On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
>> > > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
>> > > err = _cpu_up(cpu, 0, target);
>> > > out:
>> > > cpu_maps_update_done();
>> > > +
>> > > + /* To avoid out of line uevent */
>> > > + if (!err)
>> > > + cpuset_wait_for_hotplug();
>> > > +
>> > > return err;
>> > > }
>> > >
>> >
>> > > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
>> > > struct device *dev = get_cpu_device(cpu);
>> > >
>> > > dev->offline = false;
>> > > - /* Tell user space about the state change */
>> > > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
>> > > }
>> > >
>> >
>> > One consequence of this is that you'll now get a bunch of notifications
>> > across things like suspend/hibernate.
>>
>> The patch doesn't change the number of kobject_uevent()s. The
>> userspace will get the same number of uevents as before the patch (at
>> least if I can rely on my eyes).
>
> bringup_hibernate_cpu() didn't used to generate an event, it does now.
> Same for bringup_nonboot_cpus().

Both of those call cpu_up(), which only gets a cpuset_wait_for_hotplug()
in this patch. No new events generated from that, right, it's just a
wrapper for a flush_work()?

> Also, looking again, you don't seem to be reinstating the OFFLINE event
> you took out.

It seems to be reinstated in cpuhp_smt_disable()?

2021-02-05 02:04:25

by Daniel Jordan

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

Alexey Klimov <[email protected]> writes:

> When a CPU is offlined and onlined via device_offline() and device_online(),
> userspace gets a uevent notification. If, after receiving the "online" uevent,
> userspace calls sched_setaffinity() on some task to move it to the recently
> onlined CPU, the call often fails with -EINVAL. Userspace needs to wait around
> 5..30 ms after receiving the uevent before sched_setaffinity() succeeds for
> the recently onlined CPU.
>
> If the in_mask argument for sched_setaffinity() contains only the recently
> onlined CPU, the call often fails with the following flow:
>
>   sched_setaffinity()
>     cpuset_cpus_allowed()
>       guarantee_online_cpus()      <-- cs->effective_cpus mask does not
>                                        contain recently onlined cpu
>     cpumask_and()                  <-- final new_mask is empty
>     __set_cpus_allowed_ptr()
>       cpumask_any_and_distribute() <-- returns dest_cpu equal to nr_cpu_ids
>       returns -EINVAL
>
> The cpusets used in guarantee_online_cpus() are updated from a workqueue by
> cpuset_update_active_cpus(), which in turn is called from the cpu hotplug
> callback sched_cpu_activate(). Hence the update may not yet be observable by
> sched_setaffinity() if it is called immediately after the uevent.
> The out of line uevent can be avoided if we ensure that cpuset_hotplug_work
> has run to completion, using cpuset_wait_for_hotplug(), after onlining the
> cpu in cpu_up() and in cpuhp_smt_enable().

Nice writeup. I just have some nits, patch looks ok otherwise.

> @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> err = _cpu_up(cpu, 0, target);
> out:
> cpu_maps_update_done();
> +
> + /* To avoid out of line uevent */

Not sure this will make sense out of context. Maybe,

/*
* Wait for cpuset updates to cpumasks to finish. Later on this path
* may generate uevents whose consumers rely on the updates.
*/

> @@ -2062,8 +2068,6 @@ static void cpuhp_offline_cpu_device(unsigned int cpu)
> struct device *dev = get_cpu_device(cpu);
>
> dev->offline = true;
> }
>
> @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> struct device *dev = get_cpu_device(cpu);
>
> dev->offline = false;
> }

You could get rid of these functions and just put the few remaining bits
in the callers. They each have only one.

2021-02-05 11:39:58

by Qais Yousef

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On 02/04/21 10:46, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
> > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> > err = _cpu_up(cpu, 0, target);
> > out:
> > cpu_maps_update_done();
> > +
> > + /* To avoid out of line uevent */
> > + if (!err)
> > + cpuset_wait_for_hotplug();
> > +
> > return err;
> > }
> >
>
> > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> > struct device *dev = get_cpu_device(cpu);
> >
> > dev->offline = false;
> > - /* Tell user space about the state change */
> > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> > }
> >
>
> One consequence of this is that you'll now get a bunch of notifications
> across things like suspend/hibernate.

And the resume latency will incur 5-30ms * nr_cpu_ids.

Since you just care about device_online(), isn't cpu_device_up() a better place
for the wait? This function is a special helper for device_online(), which
leaves the suspend/resume and kexec paths free of this unnecessary wait.
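
As a rough worked example of that bound (a sketch only: the function name is
hypothetical and it assumes every onlined CPU pays the full wait, which is a
worst case rather than a measurement):

```c
/* Added delay in ms if each of nr_cpus onlining operations waits wait_ms. */
unsigned long model_added_resume_latency_ms(unsigned int nr_cpus,
					    unsigned int wait_ms)
{
	return (unsigned long)nr_cpus * wait_ms;
}
```

For 8 CPUs and the 5..30 ms range from the commit message, this gives
between 40 and 240 ms of extra resume time.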

Thanks

--
Qais Yousef

2021-02-11 13:57:24

by Alexey Klimov

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On Fri, Feb 5, 2021 at 11:22 AM Qais Yousef <[email protected]> wrote:
>
> On 02/04/21 10:46, Peter Zijlstra wrote:
> > On Thu, Feb 04, 2021 at 01:01:57AM +0000, Alexey Klimov wrote:
> > > @@ -1281,6 +1282,11 @@ static int cpu_up(unsigned int cpu, enum cpuhp_state target)
> > > err = _cpu_up(cpu, 0, target);
> > > out:
> > > cpu_maps_update_done();
> > > +
> > > + /* To avoid out of line uevent */
> > > + if (!err)
> > > + cpuset_wait_for_hotplug();
> > > +
> > > return err;
> > > }
> > >
> >
> > > @@ -2071,14 +2075,18 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> > > struct device *dev = get_cpu_device(cpu);
> > >
> > > dev->offline = false;
> > > - /* Tell user space about the state change */
> > > - kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> > > }
> > >
> >
> > One consequence of this is that you'll now get a bunch of notifications
> > across things like suspend/hibernate.
>
> And the resume latency will incur 5-30ms * nr_cpu_ids.
>
> Since you just care about device_online(), isn't cpu_device_up() a better place
> for the wait? This function is a special helper for device_online(), which
> leaves the suspend/resume and kexec paths free of this unnecessary wait.

Yup, the same idea here once Peter mentioned bringup_nonboot_cpus()
and bringup_hibernate_cpu().

Best regards,
Alexey

2021-02-11 14:27:26

by Alexey Klimov

Subject: Re: [PATCH] cpu/hotplug: wait for cpuset_hotplug_work to finish on cpu onlining

On Fri, Feb 5, 2021 at 12:41 AM Daniel Jordan
<[email protected]> wrote:
>
> Peter Zijlstra <[email protected]> writes:

[...]

> >> > One consequence of this is that you'll now get a bunch of notifications
> >> > across things like suspend/hibernate.
> >>
> >> The patch doesn't change the number of kobject_uevent()s. The
> >> userspace will get the same number of uevents as before the patch (at
> >> least if I can rely on my eyes).
> >
> > bringup_hibernate_cpu() didn't used to generate an event, it does now.
> > Same for bringup_nonboot_cpus().
>
> Both of those call cpu_up(), which only gets a cpuset_wait_for_hotplug()
> in this patch. No new events generated from that, right, it's just a
> wrapper for a flush_work()?
>
> > Also, looking again, you don't seem to be reinstating the OFFLINE event
> > you took out.
>
> It seems to be reinstated in cpuhp_smt_disable()?

Peter, what Daniel said.
cpuset_wait_for_hotplug() doesn't generate an event.

The offline event was moved below in the same function:

+
+ /* Tell user space about the state changes */
+ for_each_cpu(cpu, mask) {
+ dev = get_cpu_device(cpu);
+ kobject_uevent(&dev->kobj, KOBJ_OFFLINE);
+ }
+
+ free_cpumask_var(mask);

Daniel,
thanks for your comments. I'll update the patch and resend.

Best regards,
Alexey