CPU hotplug may be disabled via cpu_hotplug_enable()/cpu_hotplug_disable().
While it is disabled, cpu_up() and cpu_down() fail with -EBUSY. Callers of
cpu_up()/cpu_down() therefore have to handle this case specially: since the
disablement is usually temporary, -EBUSY has to be treated as a transient
condition and the operation retried. No in-kernel user of cpu_up()/cpu_down()
implements such handling, and in user space at least the rt-tools
implementation does not deal with it either.
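To illustrate the burden, a user-space caller would need retry logic
roughly like the following sketch (hypothetical code, not part of this
patch; set_cpu_online(), the retry count and the backoff are made up
for illustration):

/*
 * Hypothetical sketch: retry handling a user-space caller would need
 * today.  Retry count and backoff are arbitrary choices.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_cpu_online(int cpu, int online)
{
	char path[64];
	int fd, err, tries;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/online", cpu);

	for (tries = 0; tries < 10; tries++) {
		fd = open(path, O_WRONLY);
		if (fd < 0)
			return -errno;
		if (write(fd, online ? "1" : "0", 1) == 1) {
			close(fd);
			return 0;	/* success */
		}
		err = errno;		/* save before close() clobbers it */
		close(fd);
		if (err != EBUSY)	/* only EBUSY is transient here */
			return -err;
		sleep(1);		/* wait for hotplug to be re-enabled */
	}
	return -EBUSY;
}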
Instead of creating exceptions for every caller, the problem should be
dealt with at the source. One could argue that -EBUSY justifies such
special handling, but where possible we should provide interfaces that
are as simple as possible and do not fail merely because a resource is
temporarily held.
A heavy user of cpu_hotplug_enable()/cpu_hotplug_disable() is
pci_device_probe(), which causes errors when CPU cores are brought up or
down while PCI devices are being probed. Probing PCI devices while CPUs
go down/up is a perfectly valid situation.
The problem was observed on an x86 board: partitioning the system into
RT/non-RT CPU sets (part of which brings CPUs down/up via sysfs) failed
whenever PCI devices were being probed at the same time. This is
confusing for user space, as the dependency on PCI device probing is not
apparent.
Fix this behavior by waiting for CPU hotplug to become available again.
Return -EBUSY only after hotplug has remained disabled for about 10 seconds.
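With this change, an offline request such as

  # echo 0 > /sys/devices/system/cpu/cpu1/online

(cpu1 chosen only as an example) blocks for up to roughly 10 seconds
while probing keeps hotplug disabled, instead of failing immediately
with -EBUSY.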
Fixes: 1ddd45f8d76f ("PCI: Use cpu_hotplug_disable() instead of get_online_cpus()")
Signed-off-by: Matija Glavinic Pecotic <[email protected]>
Reviewed-by: Alexander Sverdlin <[email protected]>
---
kernel/cpu.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 40 insertions(+), 10 deletions(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 9c706af..6ff3c1a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -31,6 +31,7 @@
#include <linux/relay.h>
#include <linux/slab.h>
#include <linux/percpu-rwsem.h>
+#include <linux/wait.h>
#include <trace/events/power.h>
#define CREATE_TRACE_POINTS
@@ -278,11 +279,22 @@ void cpu_maps_update_done(void)
}
/*
- * If set, cpu_up and cpu_down will return -EBUSY and do nothing.
+ * If set, cpu_up and cpu_down will retry up to cpu_hotplug_retries
+ * times and eventually return -EBUSY if still disabled.
* Should always be manipulated under cpu_add_remove_lock
*/
static int cpu_hotplug_disabled;
+/*
+ * Waitqueue to wait on until cpu_hotplug_disabled is cleared.
+ */
+static DECLARE_WAIT_QUEUE_HEAD(wait_cpu_hp_enabled);
+
+/*
+ * Number of times cpu_up/cpu_down wait for cpu hotplug to be re-enabled.
+ */
+static int cpu_hotplug_retries = 10;
+
#ifdef CONFIG_HOTPLUG_CPU
DEFINE_STATIC_PERCPU_RWSEM(cpu_hotplug_lock);
@@ -341,7 +353,7 @@ static void lockdep_release_cpus_lock(void)
/*
* Wait for currently running CPU hotplug operations to complete (if any) and
- * disable future CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
+ * briefly disable CPU hotplug (from sysfs). The 'cpu_add_remove_lock' protects
* the 'cpu_hotplug_disabled' flag. The same lock is also acquired by the
* hotplug path before performing hotplug operations. So acquiring that lock
* guarantees mutual exclusion from any currently running hotplug operations.
@@ -366,6 +378,7 @@ void cpu_hotplug_enable(void)
 	cpu_maps_update_begin();
 	__cpu_hotplug_enable();
 	cpu_maps_update_done();
+	wake_up(&wait_cpu_hp_enabled);
 }
EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
@@ -1043,11 +1056,21 @@ static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
{
-	int err;
+	int err = -EBUSY, retries = cpu_hotplug_retries;
-	cpu_maps_update_begin();
-	err = cpu_down_maps_locked(cpu, target);
-	cpu_maps_update_done();
+	while (retries--) {
+		wait_event_timeout(wait_cpu_hp_enabled,
+				   !cpu_hotplug_disabled,
+				   HZ);
+		cpu_maps_update_begin();
+		if (cpu_hotplug_disabled) {
+			cpu_maps_update_done();
+			continue;
+		}
+		err = _cpu_down(cpu, 0, target);
+		cpu_maps_update_done();
+		break;
+	}
return err;
}
@@ -1171,7 +1194,7 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
{
- int err = 0;
+	int err = 0, retries = cpu_hotplug_retries;
if (!cpu_possible(cpu)) {
pr_err("can't online cpu %d because it is not configured as may-hotadd at boot time\n",
@@ -1186,9 +1209,16 @@ static int do_cpu_up(unsigned int cpu, enum cpuhp_state target)
if (err)
return err;
-	cpu_maps_update_begin();
-
-	if (cpu_hotplug_disabled) {
+	for (;;) {
+		wait_event_timeout(wait_cpu_hp_enabled,
+				   !cpu_hotplug_disabled,
+				   HZ);
+		cpu_maps_update_begin();
+		if (!cpu_hotplug_disabled || !--retries)
+			break;
+		cpu_maps_update_done();
+	}
+	if (cpu_hotplug_disabled) {
 		err = -EBUSY;
 		goto out;
 	}
--
2.10.2
Matija,
Matija Glavinic Pecotic <[email protected]> writes:
> A heavy user of cpu_hotplug_enable()/cpu_hotplug_disable() is
> pci_device_probe(), which causes errors when CPU cores are brought up or
> down while PCI devices are being probed. Probing PCI devices while CPUs
> go down/up is a perfectly valid situation.
So what? User space has to handle -EBUSY properly, and it was possible
even before that PCI commit for an online/offline operation request to
return -EBUSY.
> The problem was observed on an x86 board: partitioning the system into
> RT/non-RT CPU sets (part of which brings CPUs down/up via sysfs) failed
> whenever PCI devices were being probed at the same time. This is
I have no idea why you need to offline/online CPUs to partition a
system. There are surely more sensible ways to do that, but that's not
part of this discussion.
> confusing for user space, as the dependency on PCI device probing is not
> apparent.
What's confusing about an EBUSY return code? It's pretty universally used
in situations where a facility is temporarily busy. If it's not
sufficiently documented why EBUSY can be returned and what it means, then
the documentation needs to be improved.
> Fix this behavior by waiting for CPU hotplug to become available again.
> Return -EBUSY only after hotplug has remained disabled for about 10 seconds.
There is nothing to fix here, really. Fix the user space handling which
fails to handle EBUSY. Just because this commit made it more likely that
EBUSY is returned does not justify this horrid hack.
Thanks,
tglx
Hello Thomas,
On 02/03/2020 07:08 PM, Thomas Gleixner wrote:
> So what? User space has to handle -EBUSY properly, and it was possible
> even before that PCI commit for an online/offline operation request to
> return -EBUSY.
> What's confusing about an EBUSY return code? It's pretty universally used
> in situations where a facility is temporarily busy. If it's not
> sufficiently documented why EBUSY can be returned and what it means, then
> the documentation needs to be improved.
It is true that this was happening before your work in the PCI subsystem;
I should have referenced the original commit which made cpu_up/cpu_down
return EBUSY. I agree there is nothing to fix in your patch.
The fact that EBUSY exists and is commonly used doesn't justify it in
every situation. We do not have this problem only in user space, but in
the kernel as well: no user of cpu_up/cpu_down takes possible temporary
unavailability into account. Taken to the extreme, we could start
returning EBUSY whenever a resource or facility is held, which would
make every interface a candidate for returning it. As I see it, EBUSY
has its place in non-blocking APIs; others should try (hard) not to
return it. Handling it is a further topic of its own: how long should
the timeout be before giving up? Let's say we know that for CPUs it is
the 10 seconds I proposed. Passing the responsibility for selecting the
timeout to the users will spread that policy out across every subsystem,
yielding situations where it works for some and not for others,
depending on the timeout chosen.
I do not like these kinds of waits either, but I was not able to think
of anything better to improve this situation. I still believe it should
be improved, and once/if CPU hotplug is able to drop
cpu_hotplug_enable/disable, this workaround can be removed as well.
> I have no idea why you need to offline/online CPUs to partition a
> system. There are surely more sensible ways to do that, but that's not
> part of this discussion.
I'd be happy to make it part of the discussion.
We are using partrt from
https://github.com/OpenEneaLinux/rt-tools/tree/master/partrt;
cpu_up/cpu_down is part of it and, as far as I know, it is there to
force timer migration and has no other usage known to me. In the
meantime, since we started with core isolation, we have changed how we
treat isolated cores. We now boot with isolcpus=cpu-list
nohz_full=cpu-list rcu_nocbs=cpu-list, and we are currently on Linux
4.19. Earlier we had a different setup where we wanted to use the cores
during startup and partition later, but that proved problematic and not
in line with how things are going in this area.
Do you think we do not need to toggle them under these conditions?
Thanks,
Matija
Matija,
Matija Glavinic Pecotic <[email protected]> writes:
> On 02/03/2020 07:08 PM, Thomas Gleixner wrote:
> The fact that EBUSY exists and is commonly used doesn't justify it in
> every situation. We do not have this problem only in user space, but in
> the kernel as well: no user of cpu_up/cpu_down takes possible temporary
There is no point in caring about this, simply because if you look at
the five call sites on x86:
- Three are debug stuff which handle the error return and do not care
  whether it's EBUSY or not.
- One is the boot-time CPU onlining, which ignores any error code on
  purpose because there are enough reasons why this can fail and we
  want to at least get up to init.
- The last one is the sysfs interface, which you are tripping over.
> unavailability into account. Taken to the extreme, we could start
> returning EBUSY whenever a resource or facility is held, which would
> make every interface a candidate for returning it. As I see it, EBUSY
> has its place in non-blocking APIs; others should try (hard) not to
> return it. Handling it is a further topic of its own: how long should
> the timeout be before giving up? Let's say we know that for CPUs it is
> the 10 seconds I proposed. Passing the responsibility for selecting the
> timeout to the users will spread that policy out across every subsystem,
> yielding situations where it works for some and not for others,
> depending on the timeout chosen.
We don't know what the proper timeout is for everyone. You picked one
that fits your expectations, but it's bound to break other people's
expectations.
That's the exact reason why these kinds of heuristics are bad and
horrible and should be avoided in general.
>> I have no idea why you need to offline/online CPUs to partition a
>> system. There are surely more sensible ways to do that, but that's not
>> part of this discussion.
>
> I'd be happy to make it part of the discussion.
>
> We are using partrt from
> https://github.com/OpenEneaLinux/rt-tools/tree/master/partrt;
> cpu_up/cpu_down is part of it and, as far as I know, it is there to
> force timer migration and has no other usage known to me. In the
> meantime, since we started with core isolation, we have changed how we
> treat isolated cores. We now boot with isolcpus=cpu-list
> nohz_full=cpu-list rcu_nocbs=cpu-list, and we are currently on Linux
> 4.19. Earlier we had a different setup where we wanted to use the cores
> during startup and partition later, but that proved problematic and not
> in line with how things are going in this area.
>
> Do you think we do not need to toggle them under these conditions?
If you have those isolation thingies on the kernel command line, there
is no point in doing the CPU up/down dance. It's not providing you
anything useful except steering interrupts away, which you can do on
the kernel command line as well with 'irqaffinity=...'.
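For example, with an illustrative CPU list (the actual list depends on
the machine), such a command line could look like:

  isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3 irqaffinity=0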
Thanks,
tglx
Hello Thomas,
On 02/04/2020 01:30 PM, Thomas Gleixner wrote:
> If you have those isolation thingies on the kernel command line, there
> is no point in doing the CPU up/down dance. It's not providing you
> anything useful except steering interrupts away, which you can do on
> the kernel command line as well with 'irqaffinity=...'.
Thanks for the info.
Regards,
Matija