2018-09-04 06:35:56

by Neeraj Upadhyay

[permalink] [raw]
Subject: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

If takedown_cpu() fails during _cpu_down(), st->state is reset,
by calling cpuhp_reset_state(). This results in an additional
increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
state being skipped during rollback. Fix this by not calling
cpuhp_reset_state() and doing the state reset directly in
_cpu_down().

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Signed-off-by: Neeraj Upadhyay <[email protected]>
---
kernel/cpu.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..9f49edb 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
*/
ret = cpuhp_down_callbacks(cpu, st, target);
if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
- cpuhp_reset_state(st, prev_state);
+ /*
+ * As st->last is not set, cpuhp_reset_state() increments
+ * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
+ * skipped during rollback. So, don't use it here.
+ */
+ st->rollback = true;
+ st->target = prev_state;
+ st->bringup = !st->bringup;
__cpuhp_kick_ap(st);
}

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



2018-09-04 12:15:15

by Mukesh Ojha

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()



On 9/4/2018 12:03 PM, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
>
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <[email protected]>
> ---
> kernel/cpu.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> */
> ret = cpuhp_down_callbacks(cpu, st, target);
> if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> - cpuhp_reset_state(st, prev_state);
> + /*
> + * As st->last is not set, cpuhp_reset_state() increments
> + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> + * skipped during rollback. So, don't use it here.
> + */
> + st->rollback = true;
> + st->target = prev_state;
> + st->bringup = !st->bringup;

Acked-by: Mukesh Ojha <[email protected]>

looks valid for this case.

> __cpuhp_kick_ap(st);
> }
>


2018-09-05 11:34:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> If takedown_cpu() fails during _cpu_down(), st->state is reset,
> by calling cpuhp_reset_state(). This results in an additional
> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
> state being skipped during rollback. Fix this by not calling
> cpuhp_reset_state() and doing the state reset directly in
> _cpu_down().
>
> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
> Signed-off-by: Neeraj Upadhyay <[email protected]>
> ---
> kernel/cpu.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..9f49edb 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
> */
> ret = cpuhp_down_callbacks(cpu, st, target);
> if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> - cpuhp_reset_state(st, prev_state);
> + /*
> + * As st->last is not set, cpuhp_reset_state() increments
> + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> + * skipped during rollback. So, don't use it here.
> + */
> + st->rollback = true;
> + st->target = prev_state;
> + st->bringup = !st->bringup;

No, this is just papering over the actual problem.

The state inconsistency happens in take_cpu_down() when it returns with a
failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
and st->state is then incremented in undo_cpu_down().

That's the real issue and we need to analyze the whole cpu_down rollback
logic first.

Thanks,

tglx
























2018-09-05 12:12:43

by Mukesh Ojha

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()



On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>> If takedown_cpu() fails during _cpu_down(), st->state is reset,
>> by calling cpuhp_reset_state(). This results in an additional
>> increment of st->state, which results in CPUHP_AP_SMPBOOT_THREADS
>> state being skipped during rollback. Fix this by not calling
>> cpuhp_reset_state() and doing the state reset directly in
>> _cpu_down().
>>
>> Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
>> Signed-off-by: Neeraj Upadhyay <[email protected]>
>> ---
>> kernel/cpu.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index aa7fe85..9f49edb 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -970,7 +970,14 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
>> */
>> ret = cpuhp_down_callbacks(cpu, st, target);
>> if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>> - cpuhp_reset_state(st, prev_state);
>> + /*
>> + * As st->last is not set, cpuhp_reset_state() increments
>> + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>> + * skipped during rollback. So, don't use it here.
>> + */
>> + st->rollback = true;
>> + st->target = prev_state;
>> + st->bringup = !st->bringup;
> No, this is just papering over the actual problem.
>
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
>
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

Could this be done like below ?

diff --git a/kernel/cpu.c b/kernel/cpu.c
index aa7fe85..47bce90 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
        int err, cpu = smp_processor_id();
        int ret;

-       /* Ensure this CPU doesn't handle any more interrupts. */
-       err = __cpu_disable();
-       if (err < 0)
-               return err;
-
        /*
         * We get here while we are in CPUHP_TEARDOWN_CPU state and we
must not
         * do this step again.
         */
        WARN_ON(st->state != CPUHP_TEARDOWN_CPU);
        st->state--;
+
+       /* Ensure this CPU doesn't handle any more interrupts. */
+       err = __cpu_disable();
+       if (err < 0)
+               return err;
+
        /* Invoke the former CPU_DYING callbacks */

Thanks,
Mukesh
>
>
>
>
>
>


2018-09-05 12:25:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > ret = cpuhp_down_callbacks(cpu, st, target);
> > if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > - cpuhp_reset_state(st, prev_state);
> > + /*
> > + * As st->last is not set, cpuhp_reset_state() increments
> > + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > + * skipped during rollback. So, don't use it here.
> > + */
> > + st->rollback = true;
> > + st->target = prev_state;
> > + st->bringup = !st->bringup;
>
> No, this is just papering over the actual problem.
>
> The state inconsistency happens in take_cpu_down() when it returns with a
> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> and st->state is then incremented in undo_cpu_down().
>
> That's the real issue and we need to analyze the whole cpu_down rollback
> logic first.

And looking closer this is a general issue. Just that the TEARDOWN state
makes it simple to observe. It's universaly broken, when the first teardown
callback fails because, st->state is only decremented _AFTER_ the callback
returns success, but undo_cpu_down() increments unconditionally.

Patch below.

Thanks,

tglx
----
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
if (ret) {
st->target = prev_state;
- undo_cpu_down(cpu, st);
+ if (st->state < prev_state)
+ undo_cpu_down(cpu, st);
break;
}
}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
* to do the further cleanups.
*/
ret = cpuhp_down_callbacks(cpu, st, target);
- if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+ if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}

2018-09-05 12:53:18

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()



On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
>> On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
>>> ret = cpuhp_down_callbacks(cpu, st, target);
>>> if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
>>> - cpuhp_reset_state(st, prev_state);
>>> + /*
>>> + * As st->last is not set, cpuhp_reset_state() increments
>>> + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
>>> + * skipped during rollback. So, don't use it here.
>>> + */
>>> + st->rollback = true;
>>> + st->target = prev_state;
>>> + st->bringup = !st->bringup;
>> No, this is just papering over the actual problem.
>>
>> The state inconsistency happens in take_cpu_down() when it returns with a
>> failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
>> and st->state is then incremented in undo_cpu_down().
>>
>> That's the real issue and we need to analyze the whole cpu_down rollback
>> logic first.
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
>
> Patch below.
>
> Thanks,
>
> tglx
As per my understanding, there are 2 problems here; one is fixed with
your patch, and other is cpuhp_reset_state() is used during rollback
from non-AP to AP state, which seem to result in 2 increments of
st->state (one increment done by cpuhp_reset_state() and another by
cpu_thread_fun()) .
> ----
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned
> ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
> if (ret) {
> st->target = prev_state;
> - undo_cpu_down(cpu, st);
> + if (st->state < prev_state)
> + undo_cpu_down(cpu, st);
> break;
> }
> }
> @@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int
> * to do the further cleanups.
> */
> ret = cpuhp_down_callbacks(cpu, st, target);
> - if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> + if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> cpuhp_reset_state(st, prev_state);
> __cpuhp_kick_ap(st);
> }

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


2018-09-05 13:16:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Wed, 5 Sep 2018, Mukesh Ojha wrote:
> On 9/5/2018 5:03 PM, Thomas Gleixner wrote:
> > > + st->rollback = true;
> > > + st->target = prev_state;
> > > + st->bringup = !st->bringup;
> > No, this is just papering over the actual problem.
> >
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> >
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
>
> Could this be done like below ?

IOW, more papering over the real root cause.

> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index aa7fe85..47bce90 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -802,17 +802,18 @@ static int take_cpu_down(void *_param)
>         int err, cpu = smp_processor_id();
>         int ret;

^^^^^^^

Please fix your mailer or your editor. That patch is whitespace damaged.

Thanks,

tglx

2018-09-05 13:19:14

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
>
> As per my understanding, there are 2 problems here; one is fixed with your
> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
> AP state, which seem to result in 2 increments of st->state (one increment
> done by cpuhp_reset_state() and another by cpu_thread_fun()) .

And how did your hack fix that up magically? I'll have a look later today.

Thanks,

tglx

2018-09-05 14:56:06

by Sudeep Holla

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > > ret = cpuhp_down_callbacks(cpu, st, target);
> > > if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > - cpuhp_reset_state(st, prev_state);
> > > + /*
> > > + * As st->last is not set, cpuhp_reset_state() increments
> > > + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > + * skipped during rollback. So, don't use it here.
> > > + */
> > > + st->rollback = true;
> > > + st->target = prev_state;
> > > + st->bringup = !st->bringup;
> >
> > No, this is just papering over the actual problem.
> >
> > The state inconsistency happens in take_cpu_down() when it returns with a
> > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > and st->state is then incremented in undo_cpu_down().
> >
> > That's the real issue and we need to analyze the whole cpu_down rollback
> > logic first.
>
> And looking closer this is a general issue. Just that the TEARDOWN state
> makes it simple to observe. It's universaly broken, when the first teardown
> callback fails because, st->state is only decremented _AFTER_ the callback
> returns success, but undo_cpu_down() increments unconditionally.
>
> Patch below.

This patch fixes the issue reported @[1]. Lorenzo did some debugging and
I wanted to have a look at it at some point but this discussion drew my
attention and sounded very similar[2]. So I did a quick test with this
patch and it fixes the issue.

--
Regards,
Sudeep

[1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
[2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/

2018-09-05 15:21:15

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

Hi Sudeep,

Thanks for the CC!

On Wed, Sep 5, 2018 at 4:54 PM Sudeep Holla <[email protected]> wrote:
> On Wed, Sep 05, 2018 at 02:23:46PM +0200, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 4 Sep 2018, Neeraj Upadhyay wrote:
> > > > ret = cpuhp_down_callbacks(cpu, st, target);
> > > > if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
> > > > - cpuhp_reset_state(st, prev_state);
> > > > + /*
> > > > + * As st->last is not set, cpuhp_reset_state() increments
> > > > + * st->state, which results in CPUHP_AP_SMPBOOT_THREADS being
> > > > + * skipped during rollback. So, don't use it here.
> > > > + */
> > > > + st->rollback = true;
> > > > + st->target = prev_state;
> > > > + st->bringup = !st->bringup;
> > >
> > > No, this is just papering over the actual problem.
> > >
> > > The state inconsistency happens in take_cpu_down() when it returns with a
> > > failure from __cpu_disable() because that returns with state = TEARDOWN_CPU
> > > and st->state is then incremented in undo_cpu_down().
> > >
> > > That's the real issue and we need to analyze the whole cpu_down rollback
> > > logic first.
> >
> > And looking closer this is a general issue. Just that the TEARDOWN state
> > makes it simple to observe. It's universaly broken, when the first teardown
> > callback fails because, st->state is only decremented _AFTER_ the callback
> > returns success, but undo_cpu_down() increments unconditionally.
> >
> > Patch below.
>
> This patch fixes the issue reported @[1]. Lorenzo did some debugging and
> I wanted to have a look at it at some point but this discussion drew my
> attention and sounded very similar[2]. So I did a quick test with this
> patch and it fixes the issue.

> [1] https://lore.kernel.org/lkml/CAMuHMdVg868LgL5xTg5Dp5rReKxoo+8fRy+ETJiMxGWZCp+hWw@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/20180823131505.GA31558@red-moon/

Thomas' patch fixes the issue for me:
Tested-by: Geert Uytterhoeven <[email protected]>


Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2018-09-06 02:57:54

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()



On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>> makes it simple to observe. It's universaly broken, when the first teardown
>>> callback fails because, st->state is only decremented _AFTER_ the callback
>>> returns success, but undo_cpu_down() increments unconditionally.
>>>
>> As per my understanding, there are 2 problems here; one is fixed with your
>> patch, and other is cpuhp_reset_state() is used during rollback from non-AP to
>> AP state, which seem to result in 2 increments of st->state (one increment
>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> And how did your hack fix that up magically? I'll have a look later today.
>
> Thanks,
>
> tglx

The hack fixes it by not calling cpuhp_reset_state() and doing rollback
state reset inline in  _cpu_down().




--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


2018-09-06 08:23:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()

On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
> > On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
> > > On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
> > > > And looking closer this is a general issue. Just that the TEARDOWN state
> > > > makes it simple to observe. It's universaly broken, when the first
> > > > teardown
> > > > callback fails because, st->state is only decremented _AFTER_ the
> > > > callback
> > > > returns success, but undo_cpu_down() increments unconditionally.
> > > >
> > > As per my understanding, there are 2 problems here; one is fixed with your
> > > patch, and other is cpuhp_reset_state() is used during rollback from
> > > non-AP to
> > > AP state, which seem to result in 2 increments of st->state (one increment
> > > done by cpuhp_reset_state() and another by cpu_thread_fun()) .
> > And how did your hack fix that up magically? I'll have a look later today.
> >
> > Thanks,
> >
> > tglx
>
> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
> reset inline in  _cpu_down().

And how is that any different from the proper patch I sent? It ends up in
the same state. So I have a hard time to understand your blurb above where
you claim that my patch just solves one of two problems?

Thanks,

tglx

2018-09-06 09:02:35

by Neeraj Upadhyay

[permalink] [raw]
Subject: Re: [PATCH] cpu/hotplug: Fix rollback during error-out in takedown_cpu()



On 09/06/2018 01:48 PM, Thomas Gleixner wrote:
> On Thu, 6 Sep 2018, Neeraj Upadhyay wrote:
>> On 09/05/2018 06:47 PM, Thomas Gleixner wrote:
>>> On Wed, 5 Sep 2018, Neeraj Upadhyay wrote:
>>>> On 09/05/2018 05:53 PM, Thomas Gleixner wrote:
>>>>> And looking closer this is a general issue. Just that the TEARDOWN state
>>>>> makes it simple to observe. It's universaly broken, when the first
>>>>> teardown
>>>>> callback fails because, st->state is only decremented _AFTER_ the
>>>>> callback
>>>>> returns success, but undo_cpu_down() increments unconditionally.
>>>>>
>>>> As per my understanding, there are 2 problems here; one is fixed with your
>>>> patch, and other is cpuhp_reset_state() is used during rollback from
>>>> non-AP to
>>>> AP state, which seem to result in 2 increments of st->state (one increment
>>>> done by cpuhp_reset_state() and another by cpu_thread_fun()) .
>>> And how did your hack fix that up magically? I'll have a look later today.
>>>
>>> Thanks,
>>>
>>> tglx
>>
>> The hack fixes it by not calling cpuhp_reset_state() and doing rollback state
>> reset inline in  _cpu_down().
>
> And how is that any different from the proper patch I sent? It ends up in
> the same state. So I have a hard time to understand your blurb above where
> you claim that my patch just solves one of two problems?
>
> Thanks,
>
> tglx
>

Yes, your patch solves the problem related to smpboot unparking being
skipped during rollback and with the hack we end up in same state. The
second thing, which I am referring to, is that there is one additional
state increment. I missed the part that, it could be required, so that
we reach CPUHP_AP_ONLINE_IDLE before calling __cpuhp_kick_ap(). So, it's
not a problem.

* cpuhp_reset_state() does one state increment and we reach
CPUHP_AP_ONLINE_IDLE.

if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}

CPUHP_TEARDOWN_CPU,
CPUHP_AP_ONLINE_IDLE,
CPUHP_AP_SMPBOOT_THREADS,


* cpuhp_thread_fun() does one more increment before invoking state
callback (so, we skip CPUHP_AP_ONLINE_IDLE) and we reach
CPUHP_AP_SMPBOOT_THREADS:

static void cpuhp_thread_fun(unsigned int cpu)
<snip>
if (bringup) {
st->state++;
state = st->state;
<snip>


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation

Subject: [tip:smp/urgent] cpu/hotplug: Prevent state corruption on error rollback

Commit-ID: 69fa6eb7d6a64801ea261025cce9723d9442d773
Gitweb: https://git.kernel.org/tip/69fa6eb7d6a64801ea261025cce9723d9442d773
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 6 Sep 2018 15:21:38 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 6 Sep 2018 15:21:38 +0200

cpu/hotplug: Prevent state corruption on error rollback

When a teardown callback fails, the CPU hotplug code brings the CPU back to
the previous state. The previous state becomes the new target state. The
rollback happens in undo_cpu_down() which increments the state
unconditionally even if the state is already the same as the target.

As a consequence the next CPU hotplug operation will start at the wrong
state. This is easily to observe when __cpu_disable() fails.

Prevent the unconditional undo by checking the state vs. target before
incrementing state and fix up the consequently wrong conditional in the
unplug code which handles the failure of the final CPU take down on the
control CPU side.

Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
Reported-by: Neeraj Upadhyay <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Geert Uytterhoeven <[email protected]>
Tested-by: Sudeep Holla <[email protected]>
Tested-by: Neeraj Upadhyay <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

----
---
kernel/cpu.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index eb4041f78073..0097acec1c71 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -916,7 +916,8 @@ static int cpuhp_down_callbacks(unsigned int cpu, struct cpuhp_cpu_state *st,
ret = cpuhp_invoke_callback(cpu, st->state, false, NULL, NULL);
if (ret) {
st->target = prev_state;
- undo_cpu_down(cpu, st);
+ if (st->state < prev_state)
+ undo_cpu_down(cpu, st);
break;
}
}
@@ -969,7 +970,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
* to do the further cleanups.
*/
ret = cpuhp_down_callbacks(cpu, st, target);
- if (ret && st->state > CPUHP_TEARDOWN_CPU && st->state < prev_state) {
+ if (ret && st->state == CPUHP_TEARDOWN_CPU && st->state < prev_state) {
cpuhp_reset_state(st, prev_state);
__cpuhp_kick_ap(st);
}