From: Vincent Donnefort <[email protected]>
This patch-set intends mainly to fix HP rollback, which is currently broken,
due to an inconsistent "state" usage and an issue with CPUHP_AP_ONLINE_IDLE.
It also improves the "fail" interface, which can now be reset and will reject
CPUHP_BRINGUP_CPU, a step that can't always fail.
Vincent Donnefort (4):
cpu/hotplug: Allowing to reset fail injection
cpu/hotplug: CPUHP_BRINGUP_CPU exception in fail injection
cpu/hotplug: Add cpuhp_invoke_callback_range()
cpu/hotplug: Fix CPU down rollback
kernel/cpu.c | 173 +++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 110 insertions(+), 63 deletions(-)
--
2.7.4
From: Vincent Donnefort <[email protected]>
After the AP brought itself down to CPUHP_TEARDOWN_CPU, the BP will finish
the job. The steps left are as followed:
+--------------------+
| CPUHP_TEARDOWN_CPU | -> If fails state is CPUHP_TEARDOWN_CPU
+--------------------+
| ATOMIC STATES | -> Cannot Fail
+--------------------+
| CPUHP_BRINGUP_CPU | -> Cannot fail
+--------------------+
| ... |
| ... | -> Can fail and rollback
| CPUHP_OFFLINE |
+--------------------+
There are, in case of failure two possibilities:
1. Failure in CPUHP_TEARDOWN_CPU, then st->state will still be
CPUHP_TEARDOWN_CPU
2. Failure between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.
For 2. The rollback will always run in the CPUHP_BRINGUP_CPU bringup
callback (bringup_cpu()) which kicks the AP back on. The latter will then
end-up setting st->state to CPUHP_AP_ONLINE_IDLE.
Checking for CPUHP_AP_ONLINE_IDLE allows then to rollback in all cases.
Signed-off-by: Vincent Donnefort <[email protected]>
---
kernel/cpu.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index aebec3a..8b3f84e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1078,7 +1078,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
* to do the further cleanups.
*/
ret = cpuhp_down_callbacks(cpu, st, target);
- if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+ if (ret && (st->state == CPUHP_AP_ONLINE_IDLE ||
+ st->state == CPUHP_TEARDOWN_CPU) &&
+ st->state < prev_state) {
+ /*
+ * After an error the state can be either:
+ * - CPUHP_TEARDOWN_CPU if this step failed.
+ * - CPUHP_AP_ONLINE_IDLE if any other failed.
+ */
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}
--
2.7.4
On Mon, Jan 11, 2021 at 05:10:47PM +0000, [email protected] wrote:
> From: Vincent Donnefort <[email protected]>
>
> After the AP brought itself down to CPUHP_TEARDOWN_CPU, the BP will finish
> the job. The steps left are as followed:
>
> +--------------------+
> | CPUHP_TEARDOWN_CPU | -> If fails state is CPUHP_TEARDOWN_CPU
> +--------------------+
> | ATOMIC STATES | -> Cannot Fail
> +--------------------+
> | CPUHP_BRINGUP_CPU | -> Cannot fail
> +--------------------+
> | ... |
> | ... | -> Can fail and rollback
These are the PREPARE/DEAD states, right? It would be _really_ daft for
a DEAD notifier to fail. But yeah, I suppose that if it does, it will
indeed end up in CPUHP_AP_ONLINE_IDLE.
Do we want to WARN when a DEAD notifier fails?
On Thu, Jan 21, 2021 at 03:57:03PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 11, 2021 at 05:10:47PM +0000, [email protected] wrote:
> > From: Vincent Donnefort <[email protected]>
> >
> > After the AP brought itself down to CPUHP_TEARDOWN_CPU, the BP will finish
> > the job. The steps left are as followed:
> >
> > +--------------------+
> > | CPUHP_TEARDOWN_CPU | -> If fails state is CPUHP_TEARDOWN_CPU
> > +--------------------+
> > | ATOMIC STATES | -> Cannot Fail
> > +--------------------+
> > | CPUHP_BRINGUP_CPU | -> Cannot fail
> > +--------------------+
> > | ... |
> > | ... | -> Can fail and rollback
>
> These are the PREPARE/DEAD states, right? It would be _really_ daft for
> a DEAD notifier to fail. But yeah, I suppose that if it does, it will
> indeed end up in CPUHP_AP_ONLINE_IDLE.
>
> Do we want to WARN when a DEAD notifier fails?
>
>
Indeed, I couldn't find a dead callback which can return an error. So I
suppose we could go for another strategy here, that would be to not allow
failure for the dead states (i.e states < BRINGUP_CPU when hotunplug). In
that case, the fail interface should probably disallow selecting those
states and a WARN here would be good.
--
Vincent