2023-10-27 18:37:52

by David Woodhouse

[permalink] [raw]
Subject: [PATCH] acpi_idle: use raw_safe_halt() from acpi_idle_play_dead()

From: David Woodhouse <[email protected]>

Xen HVM guests were observed taking triple-faults when attempting to
online a previously offlined vCPU.

Investigation showed that the fault was coming from a failing call
to lockdep_assert_irqs_disabled(), in load_current_idt() which was
too early in the CPU bringup to actually catch the exception and
report the failure cleanly.

This was a false positive, caused by acpi_idle_play_dead() setting
the per-cpu hardirqs_enabled flag by calling safe_halt(). Switch it
to use raw_safe_halt() instead, which doesn't do so.

Signed-off-by: David Woodhouse <[email protected]>
---
We might {also,instead} explicitly set the hardirqs_enabled flag to
zero when bringing up an AP?

drivers/acpi/processor_idle.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 3a34a8c425fe..55437f5e0c3a 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -592,7 +592,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
while (1) {

if (cx->entry_method == ACPI_CSTATE_HALT)
- safe_halt();
+ raw_safe_halt();
else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
io_idle(cx->address);
} else
--
2.41.0



Attachments:
smime.p7s (5.83 kB)

2023-10-27 19:18:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] acpi_idle: use raw_safe_halt() from acpi_idle_play_dead()

On Fri, Oct 27, 2023 at 07:36:51PM +0100, David Woodhouse wrote:
> From: David Woodhouse <[email protected]>
>
> Xen HVM guests were observed taking triple-faults when attempting to
> online a previously offlined vCPU.
>
> Investigation showed that the fault was coming from a failing call
> to lockdep_assert_irqs_disabled(), in load_current_idt() which was
> too early in the CPU bringup to actually catch the exception and
> report the failure cleanly.
>
> This was a false positive, caused by acpi_idle_play_dead() setting
> the per-cpu hardirqs_enabled flag by calling safe_halt(). Switch it
> to use raw_safe_halt() instead, which doesn't do so.
>
> Signed-off-by: David Woodhouse <[email protected]>
> ---
> We might {also,instead} explicitly set the hardirqs_enabled flag to
> zero when bringing up an AP?

So I fixed up the idle paths the other day (see all that __cpuidle
stuff) but I've not yet gone through the whole hotplug thing :/

This seems right, at this point everything, including RCU is very much
gone, any instrumentation is undesired.

Acked-by: Peter Zijlstra (Intel) <[email protected]>

>
> drivers/acpi/processor_idle.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 3a34a8c425fe..55437f5e0c3a 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -592,7 +592,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
> while (1) {
>
> if (cx->entry_method == ACPI_CSTATE_HALT)
> - safe_halt();
> + raw_safe_halt();
> else if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
> io_idle(cx->address);
> } else
> --
> 2.41.0
>
>


2023-11-20 12:20:48

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] acpi_idle: use raw_safe_halt() from acpi_idle_play_dead()

On Fri, 2023-10-27 at 21:14 +0200, Peter Zijlstra wrote:
> On Fri, Oct 27, 2023 at 07:36:51PM +0100, David Woodhouse wrote:
> > From: David Woodhouse <[email protected]>
> >
> > Xen HVM guests were observed taking triple-faults when attempting to
> > online a previously offlined vCPU.
> >
> > Investigation showed that the fault was coming from a failing call
> > to lockdep_assert_irqs_disabled(), in load_current_idt() which was
> > too early in the CPU bringup to actually catch the exception and
> > report the failure cleanly.
> >
> > This was a false positive, caused by acpi_idle_play_dead() setting
> > the per-cpu hardirqs_enabled flag by calling safe_halt(). Switch it
> > to use raw_safe_halt() instead, which doesn't do so.
> >
> > Signed-off-by: David Woodhouse <[email protected]>
> > ---
> > We might {also,instead} explicitly set the hardirqs_enabled flag to
> > zero when bringing up an AP?
>
> So I fixed up the idle paths the other day (see all that __cpuidle
> stuff) but I've not yet gone through the whole hotplug thing :/
>
> This seems right, at this point everything, including RCU is very much
> gone, any instrumentation is undesired.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Ping? Who's taking this?

Needs a Cc:[email protected] now too, to fix 6.6.x.


Attachments:
smime.p7s (5.83 kB)

2023-11-20 12:58:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] acpi_idle: use raw_safe_halt() from acpi_idle_play_dead()

On Mon, Nov 20, 2023 at 1:20 PM David Woodhouse <[email protected]> wrote:
>
> On Fri, 2023-10-27 at 21:14 +0200, Peter Zijlstra wrote:
> > On Fri, Oct 27, 2023 at 07:36:51PM +0100, David Woodhouse wrote:
> > > From: David Woodhouse <[email protected]>
> > >
> > > Xen HVM guests were observed taking triple-faults when attempting to
> > > online a previously offlined vCPU.
> > >
> > > Investigation showed that the fault was coming from a failing call
> > > to lockdep_assert_irqs_disabled(), in load_current_idt() which was
> > > too early in the CPU bringup to actually catch the exception and
> > > report the failure cleanly.
> > >
> > > This was a false positive, caused by acpi_idle_play_dead() setting
> > > the per-cpu hardirqs_enabled flag by calling safe_halt(). Switch it
> > > to use raw_safe_halt() instead, which doesn't do so.
> > >
> > > Signed-off-by: David Woodhouse <[email protected]>
> > > ---
> > > We might {also,instead} explicitly set the hardirqs_enabled flag to
> > > zero when bringing up an AP?
> >
> > So I fixed up the idle paths the other day (see all that __cpuidle
> > stuff) but I've not yet gone through the whole hotplug thing :/
> >
> > This seems right, at this point everything, including RCU is very much
> > gone, any instrumentation is undesired.
> >
> > Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
> Ping? Who's taking this?

I'm going to apply it.

> Needs a Cc:[email protected] now too, to fix 6.6.x.

Sure.