The DYING/STARTING callbacks are not expected to fail. However, as reported
by Derek, drivers such as tboot are still free to return errors within
those sections. In that case, there's nothing the hotplug machinery can do,
so let's just proceed and log the failures.
Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
Reported-by: Derek Dolney <[email protected]>
Signed-off-by: Vincent Donnefort <[email protected]>
---
v1 -> v2:
- Commit message rewording.
- More details in the warnings.
- Some variable renaming
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bbad5e375d3b..c3617683459e 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
return true;
}
-static int cpuhp_invoke_callback_range(bool bringup,
- unsigned int cpu,
- struct cpuhp_cpu_state *st,
- enum cpuhp_state target)
+static int _cpuhp_invoke_callback_range(bool bringup,
+ unsigned int cpu,
+ struct cpuhp_cpu_state *st,
+ enum cpuhp_state target,
+ bool nofail)
{
enum cpuhp_state state;
- int err = 0;
+ int ret = 0;
while (cpuhp_next_state(bringup, &state, st, target)) {
+ int err;
+
err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
- if (err)
+ if (!err)
+ continue;
+
+ if (nofail) {
+ pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
+ cpu, bringup ? "UP" : "DOWN",
+ cpuhp_get_step(st->state)->name,
+ st->state, err);
+ ret = -1;
+ } else {
+ ret = err;
break;
+ }
}
- return err;
+ return ret;
+}
+
+static inline int cpuhp_invoke_callback_range(bool bringup,
+ unsigned int cpu,
+ struct cpuhp_cpu_state *st,
+ enum cpuhp_state target)
+{
+ return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
+}
+
+static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
+ unsigned int cpu,
+ struct cpuhp_cpu_state *st,
+ enum cpuhp_state target)
+{
+ WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
}
static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
@@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
int err, cpu = smp_processor_id();
- int ret;
/* Ensure this CPU doesn't handle any more interrupts. */
err = __cpu_disable();
@@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
*/
WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
- /* Invoke the former CPU_DYING callbacks */
- ret = cpuhp_invoke_callback_range(false, cpu, st, target);
-
/*
+ * Invoke the former CPU_DYING callbacks
* DYING must not fail!
*/
- WARN_ON_ONCE(ret);
+ cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
/* Give up timekeeping duties */
tick_handover_do_timer();
@@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
{
struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
- int ret;
rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
cpumask_set_cpu(cpu, &cpus_booted_once_mask);
- ret = cpuhp_invoke_callback_range(true, cpu, st, target);
/*
* STARTING must not fail!
*/
- WARN_ON_ONCE(ret);
+ cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
}
/*
--
2.36.1.124.g0e6072fb45-goog
On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
>
I'm confused. Why isn't this a driver bug?
On Thu, May 26, 2022 at 09:24:28AM +0100, Vincent Donnefort wrote:
> On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
> > On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> > > The DYING/STARTING callbacks are not expected to fail. However, as reported
> > > by Derek, drivers such as tboot are still free to return errors within
> > > those sections. In that case, there's nothing the hotplug machinery can do,
> > > so let's just proceed and log the failures.
> > >
> >
> > I'm confused. Why isn't this a driver bug?
>
> It is a entirely a driver bug which has been reported already. but 453e41085183
> (cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
> thought it would be worth to revert to the original one which is to not break
> the entire up/down for a single driver error.
Ah I see. Fair enough I suppose.
I tested this patch on the 5.12 commit that broke suspend and also on
the latest git 5.18 branch and this is good, suspend and resume are
working again.
Derek
On 5/23/22 12:05 PM, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
>
> Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
> Reported-by: Derek Dolney <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> ---
>
> v1 -> v2:
> - Commit message rewording.
> - More details in the warnings.
> - Some variable renaming
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bbad5e375d3b..c3617683459e 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
> return true;
> }
>
> -static int cpuhp_invoke_callback_range(bool bringup,
> - unsigned int cpu,
> - struct cpuhp_cpu_state *st,
> - enum cpuhp_state target)
> +static int _cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target,
> + bool nofail)
> {
> enum cpuhp_state state;
> - int err = 0;
> + int ret = 0;
>
> while (cpuhp_next_state(bringup, &state, st, target)) {
> + int err;
> +
> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> - if (err)
> + if (!err)
> + continue;
> +
> + if (nofail) {
> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> + cpu, bringup ? "UP" : "DOWN",
> + cpuhp_get_step(st->state)->name,
> + st->state, err);
> + ret = -1;
> + } else {
> + ret = err;
> break;
> + }
> }
>
> - return err;
> + return ret;
> +}
> +
> +static inline int cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
> +}
> +
> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
> }
>
> static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> int err, cpu = smp_processor_id();
> - int ret;
>
> /* Ensure this CPU doesn't handle any more interrupts. */
> err = __cpu_disable();
> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
> */
> WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>
> - /* Invoke the former CPU_DYING callbacks */
> - ret = cpuhp_invoke_callback_range(false, cpu, st, target);
> -
> /*
> + * Invoke the former CPU_DYING callbacks
> * DYING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>
> /* Give up timekeeping duties */
> tick_handover_do_timer();
> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
> {
> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
> - int ret;
>
> rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
> cpumask_set_cpu(cpu, &cpus_booted_once_mask);
> - ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>
> /*
> * STARTING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
> }
>
> /*
>
On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
> > The DYING/STARTING callbacks are not expected to fail. However, as reported
> > by Derek, drivers such as tboot are still free to return errors within
> > those sections. In that case, there's nothing the hotplug machinery can do,
> > so let's just proceed and log the failures.
> >
>
> I'm confused. Why isn't this a driver bug?
It is a entirely a driver bug which has been reported already. but 453e41085183
(cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
thought it would be worth to revert to the original one which is to not break
the entire up/down for a single driver error.
Tested by: Derek Dolney <[email protected]>
Suspend and resume with tboot are working again with this patch. Tested
on 5.12 broken commit and latest git 5.18.
On 5/23/22 12:05 PM, Vincent Donnefort wrote:
> The DYING/STARTING callbacks are not expected to fail. However, as reported
> by Derek, drivers such as tboot are still free to return errors within
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
>
> Fixes: 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
> Reported-by: Derek Dolney <[email protected]>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> ---
>
> v1 -> v2:
> - Commit message rewording.
> - More details in the warnings.
> - Some variable renaming
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index bbad5e375d3b..c3617683459e 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -663,21 +663,51 @@ static bool cpuhp_next_state(bool bringup,
> return true;
> }
>
> -static int cpuhp_invoke_callback_range(bool bringup,
> - unsigned int cpu,
> - struct cpuhp_cpu_state *st,
> - enum cpuhp_state target)
> +static int _cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target,
> + bool nofail)
> {
> enum cpuhp_state state;
> - int err = 0;
> + int ret = 0;
>
> while (cpuhp_next_state(bringup, &state, st, target)) {
> + int err;
> +
> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> - if (err)
> + if (!err)
> + continue;
> +
> + if (nofail) {
> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> + cpu, bringup ? "UP" : "DOWN",
> + cpuhp_get_step(st->state)->name,
> + st->state, err);
> + ret = -1;
> + } else {
> + ret = err;
> break;
> + }
> }
>
> - return err;
> + return ret;
> +}
> +
> +static inline int cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + return _cpuhp_invoke_callback_range(bringup, cpu, st, target, false);
> +}
> +
> +static inline void cpuhp_invoke_callback_range_nofail(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target)
> +{
> + WARN_ON_ONCE(_cpuhp_invoke_callback_range(bringup, cpu, st, target, true));
> }
>
> static inline bool can_rollback_cpu(struct cpuhp_cpu_state *st)
> @@ -999,7 +1029,6 @@ static int take_cpu_down(void *_param)
> struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> enum cpuhp_state target = max((int)st->target, CPUHP_AP_OFFLINE);
> int err, cpu = smp_processor_id();
> - int ret;
>
> /* Ensure this CPU doesn't handle any more interrupts. */
> err = __cpu_disable();
> @@ -1012,13 +1041,11 @@ static int take_cpu_down(void *_param)
> */
> WARN_ON(st->state != (CPUHP_TEARDOWN_CPU - 1));
>
> - /* Invoke the former CPU_DYING callbacks */
> - ret = cpuhp_invoke_callback_range(false, cpu, st, target);
> -
> /*
> + * Invoke the former CPU_DYING callbacks
> * DYING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(false, cpu, st, target);
>
> /* Give up timekeeping duties */
> tick_handover_do_timer();
> @@ -1296,16 +1323,14 @@ void notify_cpu_starting(unsigned int cpu)
> {
> struct cpuhp_cpu_state *st = per_cpu_ptr(&cpuhp_state, cpu);
> enum cpuhp_state target = min((int)st->target, CPUHP_AP_ONLINE);
> - int ret;
>
> rcu_cpu_starting(cpu); /* Enables RCU usage on this CPU. */
> cpumask_set_cpu(cpu, &cpus_booted_once_mask);
> - ret = cpuhp_invoke_callback_range(true, cpu, st, target);
>
> /*
> * STARTING must not fail!
> */
> - WARN_ON_ONCE(ret);
> + cpuhp_invoke_callback_range_nofail(true, cpu, st, target);
> }
>
> /*
>
FYI there is also now a patch to fix the driver bug in testing by the
tboot devs at the moment, you could monitor the progress here:
https://sourceforge.net/p/tboot/mailman/message/37659164/
I tested this patch and it works for me.
Derek
On 5/26/22 6:15 AM, Peter Zijlstra wrote:
> On Thu, May 26, 2022 at 09:24:28AM +0100, Vincent Donnefort wrote:
>> On Wed, May 25, 2022 at 06:52:48PM +0200, Peter Zijlstra wrote:
>>> On Mon, May 23, 2022 at 05:05:36PM +0100, Vincent Donnefort wrote:
>>>> The DYING/STARTING callbacks are not expected to fail. However, as reported
>>>> by Derek, drivers such as tboot are still free to return errors within
>>>> those sections. In that case, there's nothing the hotplug machinery can do,
>>>> so let's just proceed and log the failures.
>>>>
>>>
>>> I'm confused. Why isn't this a driver bug?
>>
>> It is a entirely a driver bug which has been reported already. but 453e41085183
>> (cpu/hotplug: Add cpuhp_invoke_callback_range()) changed the behaviour so I
>> thought it would be worth to revert to the original one which is to not break
>> the entire up/down for a single driver error.
>
> Ah I see. Fair enough I suppose.
>
Vincent,
On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
> +static int _cpuhp_invoke_callback_range(bool bringup,
> + unsigned int cpu,
> + struct cpuhp_cpu_state *st,
> + enum cpuhp_state target,
> + bool nofail)
> {
> enum cpuhp_state state;
> - int err = 0;
> + int ret = 0;
>
> while (cpuhp_next_state(bringup, &state, st, target)) {
> + int err;
> +
> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> - if (err)
> + if (!err)
> + continue;
> +
> + if (nofail) {
> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> + cpu, bringup ? "UP" : "DOWN",
> + cpuhp_get_step(st->state)->name,
> + st->state, err);
> + ret = -1;
I have a hard time to map this to the changelog:
> those sections. In that case, there's nothing the hotplug machinery can do,
> so let's just proceed and log the failures.
That's still returning an error code at the end. Confused.
Thanks,
tglx
On Mon, Jun 13, 2022 at 02:36:18PM +0200, Thomas Gleixner wrote:
> Vincent,
>
> On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
> > +static int _cpuhp_invoke_callback_range(bool bringup,
> > + unsigned int cpu,
> > + struct cpuhp_cpu_state *st,
> > + enum cpuhp_state target,
> > + bool nofail)
> > {
> > enum cpuhp_state state;
> > - int err = 0;
> > + int ret = 0;
> >
> > while (cpuhp_next_state(bringup, &state, st, target)) {
> > + int err;
> > +
> > err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
> > - if (err)
> > + if (!err)
> > + continue;
> > +
> > + if (nofail) {
> > + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
> > + cpu, bringup ? "UP" : "DOWN",
> > + cpuhp_get_step(st->state)->name,
> > + st->state, err);
> > + ret = -1;
>
> I have a hard time to map this to the changelog:
>
> > those sections. In that case, there's nothing the hotplug machinery can do,
> > so let's just proceed and log the failures.
>
> That's still returning an error code at the end. Confused.
It is, but after returning from this function, only a warning will be raised
(cpuhp_invoke_callback_range_nofail()) instead of stopping the HP machinery
(cpuhp_invoke_callback_range()). How about this changelog?
The DYING/STARTING callbacks are not expected to fail. However, as reported by
Derek, drivers such as tboot are still free to return errors within those
sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable
state.
No rollback being possible there, let's only log the failures and proceed
with the following steps. This restores the hotplug behaviour prior to
453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
>
> Thanks,
>
> tglx
On 13.06.22 15:37, Vincent Donnefort wrote:
> On Mon, Jun 13, 2022 at 02:36:18PM +0200, Thomas Gleixner wrote:
>> Vincent,
>>
>> On Mon, May 23 2022 at 17:05, Vincent Donnefort wrote:
>>> +static int _cpuhp_invoke_callback_range(bool bringup,
>>> + unsigned int cpu,
>>> + struct cpuhp_cpu_state *st,
>>> + enum cpuhp_state target,
>>> + bool nofail)
>>> {
>>> enum cpuhp_state state;
>>> - int err = 0;
>>> + int ret = 0;
>>>
>>> while (cpuhp_next_state(bringup, &state, st, target)) {
>>> + int err;
>>> +
>>> err = cpuhp_invoke_callback(cpu, state, bringup, NULL, NULL);
>>> - if (err)
>>> + if (!err)
>>> + continue;
>>> +
>>> + if (nofail) {
>>> + pr_warn("CPU %u %s state %s (%d) failed (%d)\n",
>>> + cpu, bringup ? "UP" : "DOWN",
>>> + cpuhp_get_step(st->state)->name,
>>> + st->state, err);
>>> + ret = -1;
>>
>> I have a hard time to map this to the changelog:
>>
>>> those sections. In that case, there's nothing the hotplug machinery can do,
>>> so let's just proceed and log the failures.
>>
>> That's still returning an error code at the end. Confused.
>
> It is, but after returning from this function, only a warning will be raised
> (cpuhp_invoke_callback_range_nofail()) instead of stopping the HP machinery
> (cpuhp_invoke_callback_range()). How about this changelog?
>
> The DYING/STARTING callbacks are not expected to fail. However, as reported by
> Derek, drivers such as tboot are still free to return errors within those
> sections, which halts the hot(un)plug and leaves the CPU in an unrecoverable
> state.
>
> No rollback being possible there, let's only log the failures and proceed
> with the following steps. This restores the hotplug behaviour prior to
> 453e41085183 (cpu/hotplug: Add cpuhp_invoke_callback_range())
Vincent, what's up here? Did that patch make it further? It looks to me
like things stalled here, but maybe I'm missing something. I'm asking
because that fix was supposed to fix a regression I'm tracking.
BTW, if you respin this patch, could you please add proper 'Link:' tags
pointing to all reports about this issue? e.g. like this:
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215867
These tags are important, as they allow others to look into the
backstory now and years from now. That is why they should be placed in
cases like this, as Documentation/process/submitting-patches.rst and
Documentation/process/5.Posting.rst explain in more detail.
Additionally, my regression tracking bot ‘regzbot’ relies on these tags
to automatically connect reports with patches that are posted or
committed to fix the reported issue.
Ciao, Thorsten