2019-05-28 22:15:19

by Jiri Kosina

[permalink] [raw]
Subject: [PATCH] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Jiri Kosina <[email protected]>

As explained in

0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpudile after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
commandline, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on address
which is no longer valid.

That results in triple fault shortly after cr3 is switched, and machine
reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.

Cc: [email protected] # v4.19+
Debugged-by: Thomas Gleixner <[email protected]>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <[email protected]>
---
arch/x86/power/cpu.c | 11 +++++++++++
include/linux/cpu.h | 2 ++
kernel/cpu.c | 2 +-
3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..bde8ce1f6c6c 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,9 +299,20 @@ int hibernate_resume_nonboot_cpu_disable(void)
* address in its instruction pointer may not be possible to resolve
* any more at that point (the page tables used by it previously may
* have been overwritten by hibernate image data).
+ *
+ * First, make sure that we wake up all the potentially disabled SMT
+ * threads which have been initially brought up and then put into
+ * mwait/cpuidle sleep.
+ * Those will be put to proper (not interfering with hibernation
+ * resume) sleep afterwards, and the resumed kernel will decide itself
+ * what to do with them.
*/
smp_ops.play_dead = resume_play_dead;
+ ret = cpuhp_smt_enable();
+ if (ret)
+ goto out;
ret = disable_nonboot_cpus();
+out:
smp_ops.play_dead = play_dead;
return ret;
}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..b5523552a607 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,12 @@ enum cpuhp_smt_control {
extern enum cpuhp_smt_control cpu_smt_control;
extern void cpu_smt_disable(bool force);
extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
#else
# define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
static inline void cpu_smt_disable(bool force) { }
static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
#endif

/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..3ff5ce0e4132 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
return ret;
}

-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
{
int cpu, ret = 0;


--
Jiri Kosina
SUSE Labs


2019-05-29 08:08:58

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Tue, May 28, 2019 at 11:31 PM Jiri Kosina <[email protected]> wrote:
>
> From: Jiri Kosina <[email protected]>
>
> As explained in
>
> 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid first MCE bringing the system to its knees.
>
> That means that whenever 'nosmt' is supplied on the kernel command-line,
> all the HT siblings are as a result sitting in mwait or cpudile after
> going through the online-offline cycle at least once.
>
> This causes a serious issue though when a kernel, which saw 'nosmt' on its
> commandline, is going to perform resume from hibernation: if the resume
> from the hibernated image is successful, cr3 is flipped in order to point
> to the address space of the kernel that is being resumed, which in turn
> means that all the HT siblings are all of a sudden mwaiting on address
> which is no longer valid.
>
> That results in triple fault shortly after cr3 is switched, and machine
> reboots.
>
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.
>
> Cc: [email protected] # v4.19+
> Debugged-by: Thomas Gleixner <[email protected]>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <[email protected]>

I can take this or, in case it is better to route it through x86:

Acked-by: Rafael J. Wysocki <[email protected]>

> ---
> arch/x86/power/cpu.c | 11 +++++++++++
> include/linux/cpu.h | 2 ++
> kernel/cpu.c | 2 +-
> 3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..bde8ce1f6c6c 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,9 +299,20 @@ int hibernate_resume_nonboot_cpu_disable(void)
> * address in its instruction pointer may not be possible to resolve
> * any more at that point (the page tables used by it previously may
> * have been overwritten by hibernate image data).
> + *
> + * First, make sure that we wake up all the potentially disabled SMT
> + * threads which have been initially brought up and then put into
> + * mwait/cpuidle sleep.
> + * Those will be put to proper (not interfering with hibernation
> + * resume) sleep afterwards, and the resumed kernel will decide itself
> + * what to do with them.
> */
> smp_ops.play_dead = resume_play_dead;
> + ret = cpuhp_smt_enable();
> + if (ret)
> + goto out;
> ret = disable_nonboot_cpus();
> +out:
> smp_ops.play_dead = play_dead;
> return ret;
> }
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 3813fe45effd..b5523552a607 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -201,10 +201,12 @@ enum cpuhp_smt_control {
> extern enum cpuhp_smt_control cpu_smt_control;
> extern void cpu_smt_disable(bool force);
> extern void cpu_smt_check_topology(void);
> +extern int cpuhp_smt_enable(void);
> #else
> # define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
> static inline void cpu_smt_disable(bool force) { }
> static inline void cpu_smt_check_topology(void) { }
> +static inline int cpuhp_smt_enable(void) { return 0; }
> #endif
>
> /*
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index f2ef10460698..3ff5ce0e4132 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> return ret;
> }
>
> -static int cpuhp_smt_enable(void)
> +int cpuhp_smt_enable(void)
> {
> int cpu, ret = 0;
>
>
> --
> Jiri Kosina
> SUSE Labs

2019-05-29 09:05:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Tue, May 28, 2019 at 11:31:45PM +0200, Jiri Kosina wrote:

> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..bde8ce1f6c6c 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,9 +299,20 @@ int hibernate_resume_nonboot_cpu_disable(void)
> * address in its instruction pointer may not be possible to resolve
> * any more at that point (the page tables used by it previously may
> * have been overwritten by hibernate image data).
> + *
> + * First, make sure that we wake up all the potentially disabled SMT
> + * threads which have been initially brought up and then put into
> + * mwait/cpuidle sleep.
> + * Those will be put to proper (not interfering with hibernation
> + * resume) sleep afterwards, and the resumed kernel will decide itself
> + * what to do with them.
> */
> smp_ops.play_dead = resume_play_dead;

Oooh, teh yuck!, but this explains my confusion from the other thread.

> + ret = cpuhp_smt_enable();
> + if (ret)
> + goto out;
> ret = disable_nonboot_cpus();
> +out:
> smp_ops.play_dead = play_dead;
> return ret;
> }

I think you can avoid the goto like:

ret = cpuhp_smt_enable();
if (ret)
return ret;

smp_ops.play_dead = resume_play_dead;
ret = disable_nonboot_cpus();
smp_ops.play_dead = play_dead;
return ret;

We don't need the play dead change to online CPUs.

2019-05-29 10:33:31

by Jiri Kosina

[permalink] [raw]
Subject: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Jiri Kosina <[email protected]>

As explained in

0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpudile after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
commandline, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on address
which is no longer valid.

That results in triple fault shortly after cr3 is switched, and machine
reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.

Cc: [email protected] # v4.19+
Debugged-by: Thomas Gleixner <[email protected]>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Acked-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>
---

v1 -> v2:
restructure error handling as suggested by peterz
add Rafael's ack

arch/x86/power/cpu.c | 10 ++++++++++
include/linux/cpu.h | 2 ++
kernel/cpu.c | 2 +-
3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..513ce09e9950 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
* address in its instruction pointer may not be possible to resolve
* any more at that point (the page tables used by it previously may
* have been overwritten by hibernate image data).
+ *
+ * First, make sure that we wake up all the potentially disabled SMT
+ * threads which have been initially brought up and then put into
+ * mwait/cpuidle sleep.
+ * Those will be put to proper (not interfering with hibernation
+ * resume) sleep afterwards, and the resumed kernel will decide itself
+ * what to do with them.
*/
+ ret = cpuhp_smt_enable();
+ if (ret)
+ return ret;
smp_ops.play_dead = resume_play_dead;
ret = disable_nonboot_cpus();
smp_ops.play_dead = play_dead;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..b5523552a607 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,12 @@ enum cpuhp_smt_control {
extern enum cpuhp_smt_control cpu_smt_control;
extern void cpu_smt_disable(bool force);
extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
#else
# define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
static inline void cpu_smt_disable(bool force) { }
static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
#endif

/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..3ff5ce0e4132 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
return ret;
}

-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
{
int cpu, ret = 0;


--
Jiri Kosina
SUSE Labs

2019-05-29 12:05:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, May 29, 2019 at 12:32:02PM +0200, Jiri Kosina wrote:
> arch/x86/power/cpu.c | 10 ++++++++++
> include/linux/cpu.h | 2 ++
> kernel/cpu.c | 2 +-
> 3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..513ce09e9950 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
> * address in its instruction pointer may not be possible to resolve
> * any more at that point (the page tables used by it previously may
> * have been overwritten by hibernate image data).
> + *
> + * First, make sure that we wake up all the potentially disabled SMT
> + * threads which have been initially brought up and then put into
> + * mwait/cpuidle sleep.
> + * Those will be put to proper (not interfering with hibernation
> + * resume) sleep afterwards, and the resumed kernel will decide itself
> + * what to do with them.
> */
> + ret = cpuhp_smt_enable();
> + if (ret)
> + return ret;
> smp_ops.play_dead = resume_play_dead;
> ret = disable_nonboot_cpus();
> smp_ops.play_dead = play_dead;

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2019-05-29 16:12:25

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, May 29, 2019 at 12:32:02PM +0200, Jiri Kosina wrote:
> From: Jiri Kosina <[email protected]>
>
> As explained in
>
> 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid first MCE bringing the system to its knees.
>
> That means that whenever 'nosmt' is supplied on the kernel command-line,
> all the HT siblings are as a result sitting in mwait or cpudile after
> going through the online-offline cycle at least once.
>
> This causes a serious issue though when a kernel, which saw 'nosmt' on its
> commandline, is going to perform resume from hibernation: if the resume
> from the hibernated image is successful, cr3 is flipped in order to point
> to the address space of the kernel that is being resumed, which in turn
> means that all the HT siblings are all of a sudden mwaiting on address
> which is no longer valid.
>
> That results in triple fault shortly after cr3 is switched, and machine
> reboots.
>
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.

hibernation_restore() is called by user space at runtime, via ioctl or
sysfs. So I think this still doesn't fix the case where you've disabled
CPUs at runtime via sysfs, and then resumed from hibernation. Or are we
declaring that this is not a supported scenario?

Would it be possible for mwait_play_dead() to instead just monitor a
fixmap address which doesn't change for kaslr?

Is there are reason why maxcpus= doesn't do the CR4.MCE booted_once
dance?

--
Josh

2019-05-29 16:29:25

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, 29 May 2019, Josh Poimboeuf wrote:

> hibernation_restore() is called by user space at runtime, via ioctl or
> sysfs. So I think this still doesn't fix the case where you've disabled
> CPUs at runtime via sysfs, and then resumed from hibernation. Or are we
> declaring that this is not a supported scenario?

Yeah I personally find that scenario awkward :) Anyway, cpuhp_smt_enable()
is going to online even those potentially "manually" offlined CPUs, isn't
it?

Are you perhaps suggesting to call enable_nonboot_cpus() instead of
cpuhp_smt_enable() here to make it more explicit?

> Is there are reason why maxcpus= doesn't do the CR4.MCE booted_once
> dance?

I am not sure whether it's really needed. My understanding is that the MCE
issue happens only after primary sibling has been brought up; if that
never happened, MCE wouldn't be broadcasted to that core at all in the
first place.

But this needs to be confirmed by Intel.

--
Jiri Kosina
SUSE Labs

2019-05-29 17:02:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> On Wed, 29 May 2019, Josh Poimboeuf wrote:

> > Is there are reason why maxcpus= doesn't do the CR4.MCE booted_once
> > dance?
>
> I am not sure whether it's really needed. My understanding is that the MCE
> issue happens only after primary sibling has been brought up; if that
> never happened, MCE wouldn't be broadcasted to that core at all in the
> first place.
>
> But this needs to be confirmed by Intel.

(I'm not confirming anything, as I've no clue), but that code stems from
long before we found out about that brilliant MCE stuff (which was
fairly recent).

2019-05-29 17:17:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, 29 May 2019, Peter Zijlstra wrote:
> On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> > On Wed, 29 May 2019, Josh Poimboeuf wrote:
>
> > > Is there are reason why maxcpus= doesn't do the CR4.MCE booted_once
> > > dance?
> >
> > I am not sure whether it's really needed. My understanding is that the MCE
> > issue happens only after primary sibling has been brought up; if that
> > never happened, MCE wouldn't be broadcasted to that core at all in the
> > first place.
> >
> > But this needs to be confirmed by Intel.
>
> (I'm not confirming anything, as I've no clue), but that code stems from
> long before we found out about that brilliant MCE stuff (which was
> fairly recent).

Actually we knew about the brilliant MCE wreckage for a long time and
maxcpus was always considered to be a debug/testing bandaid and not to be
used for anything serious used in production.

Of course 'nosmt' changed that because that is aimed at production
scenarios so we were forced to deal with that 'feature'.

We could do the same thing with 'maxcpus' now that we have all the
mechanisms there at our fingertips already, but I'd rather not do it.

Thanks,

tglx

2019-05-29 17:19:11

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, May 29, 2019 at 06:26:59PM +0200, Jiri Kosina wrote:
> On Wed, 29 May 2019, Josh Poimboeuf wrote:
>
> > hibernation_restore() is called by user space at runtime, via ioctl or
> > sysfs. So I think this still doesn't fix the case where you've disabled
> > CPUs at runtime via sysfs, and then resumed from hibernation. Or are we
> > declaring that this is not a supported scenario?
>
> Yeah I personally find that scenario awkward :) Anyway, cpuhp_smt_enable()
> is going to online even those potentially "manually" offlined CPUs, isn't
> it?
>
> Are you perhaps suggesting to call enable_nonboot_cpus() instead of
> cpuhp_smt_enable() here to make it more explicit?

Maybe, but I guess that wouldn't work as-is because it relies on
the frozen_cpus mask.

But maybe this is just a scenario we don't care about anyway?

I still have the question about whether we could make mwait_play_dead()
monitor a fixed address. If we could get that to work, that seems more
robust to me.

Another question. With your patch, if booted with nosmt, is SMT still
disabled after you resume from hibernation? I don't see how SMT would
get disabled again.

> > Is there are reason why maxcpus= doesn't do the CR4.MCE booted_once
> > dance?
>
> I am not sure whether it's really needed. My understanding is that the MCE
> issue happens only after primary sibling has been brought up; if that
> never happened, MCE wouldn't be broadcasted to that core at all in the
> first place.
>
> But this needs to be confirmed by Intel.

Right, but can't maxcpus= create scenarios where only the primary
sibling has been brought up?

Anyway, Thomas indicated on IRC that maxcpus= may be deprecated and
should probably be documented as such. So maybe it's another scenario
we don't care about.

--
Josh

2019-05-29 17:31:27

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, 29 May 2019, Josh Poimboeuf wrote:

> I still have the question about whether we could make mwait_play_dead()
> monitor a fixed address. If we could get that to work, that seems more
> robust to me.

Hmm, does it really?

That'd mean the resumer and resumee must have the same fixmap. How are you
going to guarantee that? Currently the resuming kernel doesn't really have
to be the same as the one that is being resumed.

> Another question. With your patch, if booted with nosmt, is SMT still
> disabled after you resume from hibernation?

Yup, it is.

> I don't see how SMT would get disabled again.

The target kernel only onlines the CPUs which were online at the time of
hibernation (and are therefore in frozen_cpus mask).

--
Jiri Kosina
SUSE Labs

2019-05-29 18:05:20

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v2] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, 29 May 2019, Jiri Kosina wrote:

> The target kernel only onlines the CPUs which were online at the time of
> hibernation (and are therefore in frozen_cpus mask).

Hm, there is a catch though. After resume, the SMT siblings are now in hlt
instead of mwait.

Which means that the resumed kernel has to do one more online/offline
cycle for them, to push them to mwait again.

Bah.

I'll send v3 shortly, so please don't apply v2 just yet.

--
Jiri Kosina
SUSE Labs

2019-05-29 20:29:42

by Jiri Kosina

[permalink] [raw]
Subject: [PATCH v3] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Jiri Kosina <[email protected]>

As explained in

0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpudile after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
commandline, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on address
which is no longer valid.

That results in triple fault shortly after cr3 is switched, and machine
reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.
Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
again in case it has SMT disabled; this means it has to online all
the siblings when resuming (so that they come out of hlt) and offline
them again to let them reach mwait.

Cc: [email protected] # v4.19+
Debugged-by: Thomas Gleixner <[email protected]>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <[email protected]>
---

v1 -> v2:
- restructure error handling as suggested by peterz
- add Rafael's ack

v2 -> v3:
- added extra online/offline dance for nosmt case during
resume, as we want the siblings to be in mwait, not hlt
- dropped peterz's and Rafael's acks for now due to the above

arch/x86/power/cpu.c | 10 ++++++++++
arch/x86/power/hibernate.c | 33 +++++++++++++++++++++++++++++++++
include/linux/cpu.h | 4 ++++
kernel/cpu.c | 4 ++--
kernel/power/hibernate.c | 8 ++++++++
5 files changed, 57 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..513ce09e9950 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
* address in its instruction pointer may not be possible to resolve
* any more at that point (the page tables used by it previously may
* have been overwritten by hibernate image data).
+ *
+ * First, make sure that we wake up all the potentially disabled SMT
+ * threads which have been initially brought up and then put into
+ * mwait/cpuidle sleep.
+ * Those will be put to proper (not interfering with hibernation
+ * resume) sleep afterwards, and the resumed kernel will decide itself
+ * what to do with them.
*/
+ ret = cpuhp_smt_enable();
+ if (ret)
+ return ret;
smp_ops.play_dead = resume_play_dead;
ret = disable_nonboot_cpus();
smp_ops.play_dead = play_dead;
diff --git a/arch/x86/power/hibernate.c b/arch/x86/power/hibernate.c
index 4845b8c7be7f..fc413717a45f 100644
--- a/arch/x86/power/hibernate.c
+++ b/arch/x86/power/hibernate.c
@@ -11,6 +11,7 @@
#include <linux/suspend.h>
#include <linux/scatterlist.h>
#include <linux/kdebug.h>
+#include <linux/cpu.h>

#include <crypto/hash.h>

@@ -245,3 +246,35 @@ int relocate_restore_code(void)
__flush_tlb_all();
return 0;
}
+
+int arch_resume_nosmt(void)
+{
+ int ret = 0;
+ /*
+ * We reached this while coming out of hibernation. This means
+ * that SMT siblings are sleeping in hlt, as mwait is not safe
+ * against control transition during resume (see comment in
+ * hibernate_resume_nonboot_cpu_disable()).
+ *
+ * If the resumed kernel has SMT disabled, we have to take all the
+ * SMT siblings out of hlt, and offline them again so that they
+ * end up in mwait proper.
+ *
+ * Called with hotplug disabled.
+ */
+ cpu_hotplug_enable();
+ if (cpu_smt_control == CPU_SMT_DISABLED ||
+ cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
+ enum cpuhp_smt_control old = cpu_smt_control;
+
+ ret = cpuhp_smt_enable();
+ if (ret)
+ goto out;
+ ret = cpuhp_smt_disable(old);
+ if (ret)
+ goto out;
+ }
+out:
+ cpu_hotplug_disable();
+ return ret;
+}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..fcb1386bb0d4 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,14 @@ enum cpuhp_smt_control {
extern enum cpuhp_smt_control cpu_smt_control;
extern void cpu_smt_disable(bool force);
extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
+extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
#else
# define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
static inline void cpu_smt_disable(bool force) { }
static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
+static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
#endif

/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..077fde6fb953 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2061,7 +2061,7 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
kobject_uevent(&dev->kobj, KOBJ_ONLINE);
}

-static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
+int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
{
int cpu, ret = 0;

@@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
return ret;
}

-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
{
int cpu, ret = 0;

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index c8c272df7154..40dadbee7aa5 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -257,6 +257,10 @@ void swsusp_show_speed(ktime_t start, ktime_t stop,
(kps % 1000) / 10);
}

+__weak int arch_resume_nosmt(void)
+{
+}
+
/**
* create_image - Create a hibernation image.
* @platform_mode: Whether or not to use the platform driver.
@@ -324,6 +328,10 @@ static int create_image(int platform_mode)
Enable_cpus:
suspend_enable_secondary_cpus();

+ /* Allow architectures to do nosmt-specific post-resume dances */
+ if (!in_suspend)
+ error = arch_resume_nosmt();
+
Platform_finish:
platform_finish(platform_mode);

--
Jiri Kosina
SUSE Labs

2019-05-29 21:27:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v3] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed 2019-05-29 22:26:48, Jiri Kosina wrote:
> From: Jiri Kosina <[email protected]>
>
> As explained in
>
> 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid first MCE bringing the system to its
> knees.


> Cc: [email protected] # v4.19+
> Debugged-by: Thomas Gleixner <[email protected]>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <[email protected]>

Acked-by: Pavel Machek <[email protected]>

But I'm less sure if this is -stable material. Is reverting
0cc3cd21657be04cb0559fe8063f2130493f92cf in -stable an option?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (878.00 B)
signature.asc (188.00 B)
Digital signature
Download all attachments

2019-05-29 21:29:08

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v3] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed, 29 May 2019, Pavel Machek wrote:

> > As explained in
> >
> > 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> >
> > we always, no matter what, have to bring up x86 HT siblings during boot at
> > least once in order to avoid first MCE bringing the system to its
> > knees.
>
>
> > Cc: [email protected] # v4.19+
> > Debugged-by: Thomas Gleixner <[email protected]>
> > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > Signed-off-by: Jiri Kosina <[email protected]>
>
> Acked-by: Pavel Machek <[email protected]>
>
> But I'm less sure if this is -stable material. Is reverting
> 0cc3cd21657be04cb0559fe8063f2130493f92cf in -stable an option?

Well, without that commit, first MCE to come kills nosmt system.

--
Jiri Kosina
SUSE Labs

2019-05-29 21:54:57

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v3] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Wed 2019-05-29 23:27:34, Jiri Kosina wrote:
> On Wed, 29 May 2019, Pavel Machek wrote:
>
> > > As explained in
> > >
> > > 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > >
> > > we always, no matter what, have to bring up x86 HT siblings during boot at
> > > least once in order to avoid first MCE bringing the system to its
> > > knees.
> >
> >
> > > Cc: [email protected] # v4.19+
> > > Debugged-by: Thomas Gleixner <[email protected]>
> > > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > > Signed-off-by: Jiri Kosina <[email protected]>
> >
> > Acked-by: Pavel Machek <[email protected]>
> >
> > But I'm less sure if this is -stable material. Is reverting
> > 0cc3cd21657be04cb0559fe8063f2130493f92cf in -stable an option?
>
> Well, without that commit, first MCE to come kills nosmt system.

...which is same situation 4.9-stable and 4.14-stable lives with for a
long long time, right?

But you are right, reverting 0cc3cd21657b does not look like good
option :-(.


Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.18 kB)
signature.asc (188.00 B)
Digital signature
Download all attachments

2019-05-29 22:12:02

by Jiri Kosina

[permalink] [raw]
Subject: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

From: Jiri Kosina <[email protected]>

As explained in

0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")

we always, no matter what, have to bring up x86 HT siblings during boot at
least once in order to avoid first MCE bringing the system to its knees.

That means that whenever 'nosmt' is supplied on the kernel command-line,
all the HT siblings are as a result sitting in mwait or cpudile after
going through the online-offline cycle at least once.

This causes a serious issue though when a kernel, which saw 'nosmt' on its
commandline, is going to perform resume from hibernation: if the resume
from the hibernated image is successful, cr3 is flipped in order to point
to the address space of the kernel that is being resumed, which in turn
means that all the HT siblings are all of a sudden mwaiting on address
which is no longer valid.

That results in triple fault shortly after cr3 is switched, and machine
reboots.

Fix this by always waking up all the SMT siblings before initiating the
'restore from hibernation' process; this guarantees that all the HT
siblings will be properly carried over to the resumed kernel waiting in
resume_play_dead(), and acted upon accordingly afterwards, based on the
target kernel configuration.
Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
again in case it has SMT disabled; this means it has to online all
the siblings when resuming (so that they come out of hlt) and offline
them again to let them reach mwait.

Cc: [email protected] # v4.19+
Debugged-by: Thomas Gleixner <[email protected]>
Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
Signed-off-by: Jiri Kosina <[email protected]>
---

v1 -> v2:
- restructure error handling as suggested by peterz
- add Rafael's ack

v2 -> v3:
- added extra online/offline dance for nosmt case during
resume, as we want the siblings to be in mwait, not hlt
- dropped peterz's and Rafael's acks for now due to the above

v3 -> v4:
- fix undefined return value from arch_resume_nosmt() in case
it's not overriden by arch

arch/x86/power/cpu.c | 10 ++++++++++
arch/x86/power/hibernate.c | 33 +++++++++++++++++++++++++++++++++
include/linux/cpu.h | 4 ++++
kernel/cpu.c | 4 ++--
kernel/power/hibernate.c | 9 +++++++++
5 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index a7d966964c6f..513ce09e9950 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
* address in its instruction pointer may not be possible to resolve
* any more at that point (the page tables used by it previously may
* have been overwritten by hibernate image data).
+ *
+ * First, make sure that we wake up all the potentially disabled SMT
+ * threads which have been initially brought up and then put into
+ * mwait/cpuidle sleep.
+ * Those will be put to proper (not interfering with hibernation
+ * resume) sleep afterwards, and the resumed kernel will decide itself
+ * what to do with them.
*/
+ ret = cpuhp_smt_enable();
+ if (ret)
+ return ret;
smp_ops.play_dead = resume_play_dead;
ret = disable_nonboot_cpus();
smp_ops.play_dead = play_dead;
diff --git a/arch/x86/power/hibernate.c b/arch/x86/power/hibernate.c
index 4845b8c7be7f..fc413717a45f 100644
--- a/arch/x86/power/hibernate.c
+++ b/arch/x86/power/hibernate.c
@@ -11,6 +11,7 @@
#include <linux/suspend.h>
#include <linux/scatterlist.h>
#include <linux/kdebug.h>
+#include <linux/cpu.h>

#include <crypto/hash.h>

@@ -245,3 +246,35 @@ int relocate_restore_code(void)
__flush_tlb_all();
return 0;
}
+
+int arch_resume_nosmt(void)
+{
+ int ret = 0;
+ /*
+ * We reached this while coming out of hibernation. This means
+ * that SMT siblings are sleeping in hlt, as mwait is not safe
+ * against control transition during resume (see comment in
+ * hibernate_resume_nonboot_cpu_disable()).
+ *
+ * If the resumed kernel has SMT disabled, we have to take all the
+ * SMT siblings out of hlt, and offline them again so that they
+ * end up in mwait proper.
+ *
+ * Called with hotplug disabled.
+ */
+ cpu_hotplug_enable();
+ if (cpu_smt_control == CPU_SMT_DISABLED ||
+ cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
+ enum cpuhp_smt_control old = cpu_smt_control;
+
+ ret = cpuhp_smt_enable();
+ if (ret)
+ goto out;
+ ret = cpuhp_smt_disable(old);
+ if (ret)
+ goto out;
+ }
+out:
+ cpu_hotplug_disable();
+ return ret;
+}
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 3813fe45effd..fcb1386bb0d4 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -201,10 +201,14 @@ enum cpuhp_smt_control {
extern enum cpuhp_smt_control cpu_smt_control;
extern void cpu_smt_disable(bool force);
extern void cpu_smt_check_topology(void);
+extern int cpuhp_smt_enable(void);
+extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
#else
# define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
static inline void cpu_smt_disable(bool force) { }
static inline void cpu_smt_check_topology(void) { }
+static inline int cpuhp_smt_enable(void) { return 0; }
+static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
#endif

/*
diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef10460698..077fde6fb953 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2061,7 +2061,7 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
kobject_uevent(&dev->kobj, KOBJ_ONLINE);
}

-static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
+int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
{
int cpu, ret = 0;

@@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
return ret;
}

-static int cpuhp_smt_enable(void)
+int cpuhp_smt_enable(void)
{
int cpu, ret = 0;

diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index c8c272df7154..b65635753e8e 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -257,6 +257,11 @@ void swsusp_show_speed(ktime_t start, ktime_t stop,
(kps % 1000) / 10);
}

+__weak int arch_resume_nosmt(void)
+{
+ return 0;
+}
+
/**
* create_image - Create a hibernation image.
* @platform_mode: Whether or not to use the platform driver.
@@ -324,6 +329,10 @@ static int create_image(int platform_mode)
Enable_cpus:
suspend_enable_secondary_cpus();

+ /* Allow architectures to do nosmt-specific post-resume dances */
+ if (!in_suspend)
+ error = arch_resume_nosmt();
+
Platform_finish:
platform_finish(platform_mode);


--
Jiri Kosina
SUSE Labs

2019-05-30 08:48:43

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu, May 30, 2019 at 12:09 AM Jiri Kosina <[email protected]> wrote:
>
> From: Jiri Kosina <[email protected]>
>
> As explained in
>
> 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid first MCE bringing the system to its knees.
>
> That means that whenever 'nosmt' is supplied on the kernel command-line,
> all the HT siblings are as a result sitting in mwait or cpudile after
> going through the online-offline cycle at least once.
>
> This causes a serious issue though when a kernel, which saw 'nosmt' on its
> commandline, is going to perform resume from hibernation: if the resume
> from the hibernated image is successful, cr3 is flipped in order to point
> to the address space of the kernel that is being resumed, which in turn
> means that all the HT siblings are all of a sudden mwaiting on address
> which is no longer valid.
>
> That results in triple fault shortly after cr3 is switched, and machine
> reboots.
>
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.
> Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
> again in case it has SMT disabled; this means it has to online all
> the siblings when resuming (so that they come out of hlt) and offline
> them again to let them reach mwait.
>
> Cc: [email protected] # v4.19+
> Debugged-by: Thomas Gleixner <[email protected]>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <[email protected]>

LGTM

And I would prefer this one to go in through the PM tree due to the
hibernate core changes,
so can I get an ACK from the x86 arch side here, please?

> ---
>
> v1 -> v2:
> - restructure error handling as suggested by peterz
> - add Rafael's ack
>
> v2 -> v3:
> - added extra online/offline dance for nosmt case during
> resume, as we want the siblings to be in mwait, not hlt
> - dropped peterz's and Rafael's acks for now due to the above
>
> v3 -> v4:
> - fix undefined return value from arch_resume_nosmt() in case
> it's not overriden by arch
>
> arch/x86/power/cpu.c | 10 ++++++++++
> arch/x86/power/hibernate.c | 33 +++++++++++++++++++++++++++++++++
> include/linux/cpu.h | 4 ++++
> kernel/cpu.c | 4 ++--
> kernel/power/hibernate.c | 9 +++++++++
> 5 files changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
> index a7d966964c6f..513ce09e9950 100644
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -299,7 +299,17 @@ int hibernate_resume_nonboot_cpu_disable(void)
> * address in its instruction pointer may not be possible to resolve
> * any more at that point (the page tables used by it previously may
> * have been overwritten by hibernate image data).
> + *
> + * First, make sure that we wake up all the potentially disabled SMT
> + * threads which have been initially brought up and then put into
> + * mwait/cpuidle sleep.
> + * Those will be put to proper (not interfering with hibernation
> + * resume) sleep afterwards, and the resumed kernel will decide itself
> + * what to do with them.
> */
> + ret = cpuhp_smt_enable();
> + if (ret)
> + return ret;
> smp_ops.play_dead = resume_play_dead;
> ret = disable_nonboot_cpus();
> smp_ops.play_dead = play_dead;
> diff --git a/arch/x86/power/hibernate.c b/arch/x86/power/hibernate.c
> index 4845b8c7be7f..fc413717a45f 100644
> --- a/arch/x86/power/hibernate.c
> +++ b/arch/x86/power/hibernate.c
> @@ -11,6 +11,7 @@
> #include <linux/suspend.h>
> #include <linux/scatterlist.h>
> #include <linux/kdebug.h>
> +#include <linux/cpu.h>
>
> #include <crypto/hash.h>
>
> @@ -245,3 +246,35 @@ int relocate_restore_code(void)
> __flush_tlb_all();
> return 0;
> }
> +
> +int arch_resume_nosmt(void)
> +{
> + int ret = 0;
> + /*
> + * We reached this while coming out of hibernation. This means
> + * that SMT siblings are sleeping in hlt, as mwait is not safe
> + * against control transition during resume (see comment in
> + * hibernate_resume_nonboot_cpu_disable()).
> + *
> + * If the resumed kernel has SMT disabled, we have to take all the
> + * SMT siblings out of hlt, and offline them again so that they
> + * end up in mwait proper.
> + *
> + * Called with hotplug disabled.
> + */
> + cpu_hotplug_enable();
> + if (cpu_smt_control == CPU_SMT_DISABLED ||
> + cpu_smt_control == CPU_SMT_FORCE_DISABLED) {
> + enum cpuhp_smt_control old = cpu_smt_control;
> +
> + ret = cpuhp_smt_enable();
> + if (ret)
> + goto out;
> + ret = cpuhp_smt_disable(old);
> + if (ret)
> + goto out;
> + }
> +out:
> + cpu_hotplug_disable();
> + return ret;
> +}
> diff --git a/include/linux/cpu.h b/include/linux/cpu.h
> index 3813fe45effd..fcb1386bb0d4 100644
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -201,10 +201,14 @@ enum cpuhp_smt_control {
> extern enum cpuhp_smt_control cpu_smt_control;
> extern void cpu_smt_disable(bool force);
> extern void cpu_smt_check_topology(void);
> +extern int cpuhp_smt_enable(void);
> +extern int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval);
> #else
> # define cpu_smt_control (CPU_SMT_NOT_IMPLEMENTED)
> static inline void cpu_smt_disable(bool force) { }
> static inline void cpu_smt_check_topology(void) { }
> +static inline int cpuhp_smt_enable(void) { return 0; }
> +static inline int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval) { return 0; }
> #endif
>
> /*
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index f2ef10460698..077fde6fb953 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -2061,7 +2061,7 @@ static void cpuhp_online_cpu_device(unsigned int cpu)
> kobject_uevent(&dev->kobj, KOBJ_ONLINE);
> }
>
> -static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> +int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> {
> int cpu, ret = 0;
>
> @@ -2093,7 +2093,7 @@ static int cpuhp_smt_disable(enum cpuhp_smt_control ctrlval)
> return ret;
> }
>
> -static int cpuhp_smt_enable(void)
> +int cpuhp_smt_enable(void)
> {
> int cpu, ret = 0;
>
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index c8c272df7154..b65635753e8e 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -257,6 +257,11 @@ void swsusp_show_speed(ktime_t start, ktime_t stop,
> (kps % 1000) / 10);
> }
>
> +__weak int arch_resume_nosmt(void)
> +{
> + return 0;
> +}
> +
> /**
> * create_image - Create a hibernation image.
> * @platform_mode: Whether or not to use the platform driver.
> @@ -324,6 +329,10 @@ static int create_image(int platform_mode)
> Enable_cpus:
> suspend_enable_secondary_cpus();
>
> + /* Allow architectures to do nosmt-specific post-resume dances */
> + if (!in_suspend)
> + error = arch_resume_nosmt();
> +
> Platform_finish:
> platform_finish(platform_mode);
>
>
> --
> Jiri Kosina
> SUSE Labs
>

2019-05-30 10:50:20

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu 2019-05-30 00:09:39, Jiri Kosina wrote:
> From: Jiri Kosina <[email protected]>
>
> As explained in
>
> 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
>
> we always, no matter what, have to bring up x86 HT siblings during boot at
> least once in order to avoid first MCE bringing the system to its knees.
>
> That means that whenever 'nosmt' is supplied on the kernel command-line,
> all the HT siblings are as a result sitting in mwait or cpudile after
> going through the online-offline cycle at least once.
>
> This causes a serious issue though when a kernel, which saw 'nosmt' on its
> commandline, is going to perform resume from hibernation: if the resume
> from the hibernated image is successful, cr3 is flipped in order to point
> to the address space of the kernel that is being resumed, which in turn
> means that all the HT siblings are all of a sudden mwaiting on address
> which is no longer valid.
>
> That results in triple fault shortly after cr3 is switched, and machine
> reboots.
>
> Fix this by always waking up all the SMT siblings before initiating the
> 'restore from hibernation' process; this guarantees that all the HT
> siblings will be properly carried over to the resumed kernel waiting in
> resume_play_dead(), and acted upon accordingly afterwards, based on the
> target kernel configuration.
> Symmetricaly, the resumed kernel has to push the SMT siblings to mwait
> again in case it has SMT disabled; this means it has to online all
> the siblings when resuming (so that they come out of hlt) and offline
> them again to let them reach mwait.
>
> Cc: [email protected] # v4.19+
> Debugged-by: Thomas Gleixner <[email protected]>
> Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> Signed-off-by: Jiri Kosina <[email protected]>

Acked-by: Pavel Machek <[email protected]>

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.99 kB)
signature.asc (188.00 B)
Digital signature
Download all attachments

2019-05-30 22:04:54

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> >
> > Cc: [email protected] # v4.19+
> > Debugged-by: Thomas Gleixner <[email protected]>
> > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > Signed-off-by: Jiri Kosina <[email protected]>
>
> LGTM
>
> And I would prefer this one to go in through the PM tree due to the
> hibernate core changes,

Ok.

> so can I get an ACK from the x86 arch side here, please?

No. Is the following good enough?

Reviewed-by: Thomas Gleixner <[email protected]>

Thanks,

tglx

2019-05-30 22:07:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu, May 30, 2019 at 11:27 PM Thomas Gleixner <[email protected]> wrote:
>
> On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> > >
> > > Cc: [email protected] # v4.19+
> > > Debugged-by: Thomas Gleixner <[email protected]>
> > > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > > Signed-off-by: Jiri Kosina <[email protected]>
> >
> > LGTM
> >
> > And I would prefer this one to go in through the PM tree due to the
> > hibernate core changes,
>
> Ok.
>
> > so can I get an ACK from the x86 arch side here, please?
>
> No. Is the following good enough?
>
> Reviewed-by: Thomas Gleixner <[email protected]>

Yes, it is, thanks!

2019-05-30 23:39:33

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu, May 30, 2019 at 11:38:51PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 30, 2019 at 11:27 PM Thomas Gleixner <[email protected]> wrote:
> >
> > On Thu, 30 May 2019, Rafael J. Wysocki wrote:
> > > >
> > > > Cc: [email protected] # v4.19+
> > > > Debugged-by: Thomas Gleixner <[email protected]>
> > > > Fixes: 0cc3cd21657b ("cpu/hotplug: Boot HT siblings at least once")
> > > > Signed-off-by: Jiri Kosina <[email protected]>
> > >
> > > LGTM
> > >
> > > And I would prefer this one to go in through the PM tree due to the
> > > hibernate core changes,
> >
> > Ok.
> >
> > > so can I get an ACK from the x86 arch side here, please?
> >
> > No. Is the following good enough?
> >
> > Reviewed-by: Thomas Gleixner <[email protected]>
>
> Yes, it is, thanks!

I still think changing monitor/mwait to use a fixmap address would be a
much cleaner way to fix this. I can try to work up a patch tomorrow.

--
Josh

2019-05-30 23:43:28

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Thu, 30 May 2019, Josh Poimboeuf wrote:

> > > Reviewed-by: Thomas Gleixner <[email protected]>
> >
> > Yes, it is, thanks!
>
> I still think changing monitor/mwait to use a fixmap address would be a
> much cleaner way to fix this. I can try to work up a patch tomorrow.

I disagree with that from the backwards compatibility point of view.

I personally am quite frequently using differnet combinations of
resumer/resumee kernels, and I've never been biten by it so far. I'd guess
I am not the only one.
Fixmap sort of breaks that invariant.

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 05:17:22

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 01:42:02AM +0200, Jiri Kosina wrote:
> On Thu, 30 May 2019, Josh Poimboeuf wrote:
>
> > > > Reviewed-by: Thomas Gleixner <[email protected]>
> > >
> > > Yes, it is, thanks!
> >
> > I still think changing monitor/mwait to use a fixmap address would be a
> > much cleaner way to fix this. I can try to work up a patch tomorrow.
>
> I disagree with that from the backwards compatibility point of view.
>
> I personally am quite frequently using differnet combinations of
> resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> I am not the only one.
> Fixmap sort of breaks that invariant.

Right now there is no backwards compatibility because nosmt resume is
already broken.

For "future" backwards compatibility we could just define a hard-coded
reserved fixmap page address, adjacent to the vsyscall reserved address.

Something like this (not yet tested)? Maybe we could also remove the
resume_play_dead() hack?

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 9da8cccdf3fb..1c328624162c 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -80,6 +80,7 @@ enum fixed_addresses {
#ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
#endif
+ FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
#endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 73e69aaaa117..9804fbe25d03 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
/* Flag to indicate if a complete sched domain rebuild is required */
bool x86_topology_update;

+static char __mwait_page[PAGE_SIZE];
+
int arch_update_cpu_topology(void)
{
int retval = x86_topology_update;
@@ -1319,6 +1321,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
smp_quirk_init_udelay();

speculative_store_bypass_ht_init();
+
+ set_fixmap(FIX_MWAIT, __pa_symbol(&__mwait_page));
}

void arch_enable_nonboot_cpus_begin(void)
@@ -1631,11 +1635,12 @@ static inline void mwait_play_dead(void)
}

/*
- * This should be a memory location in a cache line which is
- * unlikely to be touched by other processors. The actual
- * content is immaterial as it is not actually modified in any way.
+ * This memory location is never actually written to. It's mapped at a
+ * reserved fixmap address to ensure the monitored address remains
+ * valid across a hibernation resume operation. Otherwise a triple
+ * fault can occur.
*/
- mwait_ptr = &current_thread_info()->flags;
+ mwait_ptr = (void *)fix_to_virt(FIX_MWAIT);

wbinvd();

2019-05-31 08:28:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Friday, May 31, 2019 7:14:56 AM CEST Josh Poimboeuf wrote:
> On Fri, May 31, 2019 at 01:42:02AM +0200, Jiri Kosina wrote:
> > On Thu, 30 May 2019, Josh Poimboeuf wrote:
> >
> > > > > Reviewed-by: Thomas Gleixner <[email protected]>
> > > >
> > > > Yes, it is, thanks!
> > >
> > > I still think changing monitor/mwait to use a fixmap address would be a
> > > much cleaner way to fix this. I can try to work up a patch tomorrow.
> >
> > I disagree with that from the backwards compatibility point of view.
> >
> > I personally am quite frequently using differnet combinations of
> > resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
>
> Right now there is no backwards compatibility because nosmt resume is
> already broken.
>
> For "future" backwards compatibility we could just define a hard-coded
> reserved fixmap page address, adjacent to the vsyscall reserved address.
>
> Something like this (not yet tested)? Maybe we could also remove the
> resume_play_dead() hack?

Yes, we can IMO, but in a separate patch, please.

> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
> #ifdef CONFIG_X86_VSYSCALL_EMULATION
> VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
> #endif
> + FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
> #endif
> FIX_DBGP_BASE,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 73e69aaaa117..9804fbe25d03 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
> /* Flag to indicate if a complete sched domain rebuild is required */
> bool x86_topology_update;
>
> +static char __mwait_page[PAGE_SIZE];
> +
> int arch_update_cpu_topology(void)
> {
> int retval = x86_topology_update;
> @@ -1319,6 +1321,8 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
> smp_quirk_init_udelay();
>
> speculative_store_bypass_ht_init();
> +
> + set_fixmap(FIX_MWAIT, __pa_symbol(&__mwait_page));
> }
>
> void arch_enable_nonboot_cpus_begin(void)
> @@ -1631,11 +1635,12 @@ static inline void mwait_play_dead(void)
> }
>
> /*
> - * This should be a memory location in a cache line which is
> - * unlikely to be touched by other processors. The actual
> - * content is immaterial as it is not actually modified in any way.
> + * This memory location is never actually written to. It's mapped at a
> + * reserved fixmap address to ensure the monitored address remains
> + * valid across a hibernation resume operation. Otherwise a triple
> + * fault can occur.
> */
> - mwait_ptr = &current_thread_info()->flags;
> + mwait_ptr = (void *)fix_to_virt(FIX_MWAIT);
>
> wbinvd();
>
>

Jiri, any chance to test this?



2019-05-31 08:48:59

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Josh Poimboeuf wrote:

> > I disagree with that from the backwards compatibility point of view.
> >
> > I personally am quite frequently using differnet combinations of
> > resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
>
> Right now there is no backwards compatibility because nosmt resume is
> already broken.

Yeah, well, but that's "only" for nosmt kernels at least.

> For "future" backwards compatibility we could just define a hard-coded
> reserved fixmap page address, adjacent to the vsyscall reserved address.
>
> Something like this (not yet tested)? Maybe we could also remove the
> resume_play_dead() hack?

Does it also solve cpuidle case? I have no overview what all the cpuidle
drivers might be potentially doing in their ->enter_dead() callbacks.
Rafael?

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 08:59:08

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Friday, May 31, 2019 10:47:21 AM CEST Jiri Kosina wrote:
> On Fri, 31 May 2019, Josh Poimboeuf wrote:
>
> > > I disagree with that from the backwards compatibility point of view.
> > >
> > > I personally am quite frequently using differnet combinations of
> > > resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> > > I am not the only one.
> > > Fixmap sort of breaks that invariant.
> >
> > Right now there is no backwards compatibility because nosmt resume is
> > already broken.
>
> Yeah, well, but that's "only" for nosmt kernels at least.
>
> > For "future" backwards compatibility we could just define a hard-coded
> > reserved fixmap page address, adjacent to the vsyscall reserved address.
> >
> > Something like this (not yet tested)? Maybe we could also remove the
> > resume_play_dead() hack?
>
> Does it also solve cpuidle case? I have no overview what all the cpuidle
> drivers might be potentially doing in their ->enter_dead() callbacks.
> Rafael?

There are just two of them, ACPI cpuidle and intel_idle, and they both should
be covered.

In any case, I think that this is the way to go here even though it may be somewhat
problematic to start with.

Cheers,
Rafael



2019-05-31 12:14:20

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Josh Poimboeuf wrote:

> Something like this (not yet tested)? Maybe we could also remove the
> resume_play_dead() hack?

I tried to test this, but the resumed kernel doesn't seem to be healthy
for reason I don't understand yet.
Symptoms I've seen so far -- 'dazed and confused NMI', spontaneous reboot,
userspace segfault.

> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
> #ifdef CONFIG_X86_VSYSCALL_EMULATION
> VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
> #endif
> + FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,

Two things to this:

- you don't seem to fix x86_32
- shouldn't it rather be FIXADDR_TOP - VSYSCALL_ADDR + 1 instead?

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 12:20:21

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri 2019-05-31 01:42:02, Jiri Kosina wrote:
> On Thu, 30 May 2019, Josh Poimboeuf wrote:
>
> > > > Reviewed-by: Thomas Gleixner <[email protected]>
> > >
> > > Yes, it is, thanks!
> >
> > I still think changing monitor/mwait to use a fixmap address would be a
> > much cleaner way to fix this. I can try to work up a patch tomorrow.
>
> I disagree with that from the backwards compatibility point of view.
>
> I personally am quite frequently using differnet combinations of
> resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> I am not the only one.
> Fixmap sort of breaks that invariant.

If we get less tricky code, it may be worth it...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (857.00 B)
signature.asc (188.00 B)
Digital signature
Download all attachments

2019-05-31 14:28:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 1:57 AM Rafael J. Wysocki <[email protected]> wrote:
>
> On Friday, May 31, 2019 10:47:21 AM CEST Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > > I disagree with that from the backwards compatibility point of view.
> > > >
> > > > I personally am quite frequently using differnet combinations of
> > > > resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> > > > I am not the only one.
> > > > Fixmap sort of breaks that invariant.
> > >
> > > Right now there is no backwards compatibility because nosmt resume is
> > > already broken.
> >
> > Yeah, well, but that's "only" for nosmt kernels at least.
> >
> > > For "future" backwards compatibility we could just define a hard-coded
> > > reserved fixmap page address, adjacent to the vsyscall reserved address.
> > >
> > > Something like this (not yet tested)? Maybe we could also remove the
> > > resume_play_dead() hack?
> >
> > Does it also solve cpuidle case? I have no overview what all the cpuidle
> > drivers might be potentially doing in their ->enter_dead() callbacks.
> > Rafael?
>
> There are just two of them, ACPI cpuidle and intel_idle, and they both should
> be covered.
>
> In any case, I think that this is the way to go here even though it may be somewhat
> problematic to start with.
>

Given that there seems to be a genuine compatibility issue right now,
can we design an actual sane way to hand off control of all CPUs
rather than adding duct tape to an extremely fragile mechanism? I can
think of at least two sensible solutions:

1. Have a self-contained "play dead for kexec/resume" function that
touches only few well-defined physical pages: a set of page tables and
a page of code. Load CR3 to point to those page tables, fill in the
code with some form of infinite loop, and run it. Or just turn off
paging entirely and run the infinite loop. Have the kernel doing the
resuming inform the kernel being resumed of which pages these are, and
have the kernel being resumed take over all CPUs before reusing the
pages.

2. Put the CPU all the way to sleep by sending it an INIT IPI.

Version 2 seems very simple and robust. Is there a reason we can't do
it? We obviously don't want to do it for normal offline because it
might be a high-power state, but a cpu in the wait-for-SIPI state is
not going to exit that state all by itself.

The patch to implement #2 should be short and sweet as long as we are
careful to only put genuine APs to sleep like this. The only downside
I can see is that an new kernel resuming and old kernel that was
booted with nosmt is going to waste power, but I don't think that's a
showstopper.

2019-05-31 14:33:40

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Andy Lutomirski wrote:

> 2. Put the CPU all the way to sleep by sending it an INIT IPI.
>
> Version 2 seems very simple and robust. Is there a reason we can't do
> it? We obviously don't want to do it for normal offline because it
> might be a high-power state, but a cpu in the wait-for-SIPI state is
> not going to exit that state all by itself.
>
> The patch to implement #2 should be short and sweet as long as we are
> careful to only put genuine APs to sleep like this. The only downside
> I can see is that an new kernel resuming and old kernel that was
> booted with nosmt is going to waste power, but I don't think that's a
> showstopper.

Well, if *that* is not an issue, than the original 3-liner that just
forces them to 'hlt' [1] would be good enough as well.

[1] https://lore.kernel.org/lkml/[email protected]/

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 14:35:46

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Jiri Kosina wrote:

> On Fri, 31 May 2019, Andy Lutomirski wrote:
>
> > 2. Put the CPU all the way to sleep by sending it an INIT IPI.
> >
> > Version 2 seems very simple and robust. Is there a reason we can't do
> > it? We obviously don't want to do it for normal offline because it
> > might be a high-power state, but a cpu in the wait-for-SIPI state is
> > not going to exit that state all by itself.
> >
> > The patch to implement #2 should be short and sweet as long as we are
> > careful to only put genuine APs to sleep like this. The only downside
> > I can see is that an new kernel resuming and old kernel that was
> > booted with nosmt is going to waste power, but I don't think that's a
> > showstopper.
>
> Well, if *that* is not an issue, than the original 3-liner that just
> forces them to 'hlt' [1] would be good enough as well.

Actually no, scratch that, I misunderstood your proposal, sorry.

--
Jiri Kosina
SUSE Labs

2019-05-31 14:49:56

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume



> On May 31, 2019, at 7:31 AM, Jiri Kosina <[email protected]> wrote:
>
>> On Fri, 31 May 2019, Andy Lutomirski wrote:
>>
>> 2. Put the CPU all the way to sleep by sending it an INIT IPI.
>>
>> Version 2 seems very simple and robust. Is there a reason we can't do
>> it? We obviously don't want to do it for normal offline because it
>> might be a high-power state, but a cpu in the wait-for-SIPI state is
>> not going to exit that state all by itself.
>>
>> The patch to implement #2 should be short and sweet as long as we are
>> careful to only put genuine APs to sleep like this. The only downside
>> I can see is that an new kernel resuming and old kernel that was
>> booted with nosmt is going to waste power, but I don't think that's a
>> showstopper.
>
> Well, if *that* is not an issue, than the original 3-liner that just
> forces them to 'hlt' [1] would be good enough as well.
>
>

Seems okay to me as long as we’re confident we won’t get a spurious interrupt.

In general, I don’t think we’re ever suppose to rely on mwait *staying* asleep. As I understand it, mwait can wake up whenever it wants, and the only real guarantee we have is that the CPU makes some effort to stay asleep until an interrupt is received or the monitor address is poked.

As a trivial example, if we are in a VM and we get scheduled out at any point between MONITOR and the eventual intentional wakeup, we’re toast. Same if we get an SMI due to bad luck or due to a thermal event happening shortly after pushing the power button to resume from hibernate.

For that matter, what actually happens if we get an SMI while halted? Does RSM go directly to sleep or does it re-fetch the HLT?

It seems to me that we should just avoid the scenario where we have IP pointed to a bogus address and we just cross our fingers and hope the CPU doesn’t do anything.

I think that, as a short term fix, we should use HLT and, as a long term fix, we should either keep the CPU state fully valid or we should hard-offline the CPU using documented mechanisms, e.g. the WAIT-for-SIPI state.

2019-05-31 14:54:02

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Josh Poimboeuf wrote:

> > I personally am quite frequently using differnet combinations of
> > resumer/resumee kernels, and I've never been biten by it so far. I'd guess
> > I am not the only one.
> > Fixmap sort of breaks that invariant.
>
> Right now there is no backwards compatibility because nosmt resume is
> already broken.
>
> For "future" backwards compatibility we could just define a hard-coded
> reserved fixmap page address, adjacent to the vsyscall reserved address.
>
> Something like this (not yet tested)? Maybe we could also remove the
> resume_play_dead() hack?

Looking into SDM:

=====
A store to the address range armed by the MONITOR instruction, an
interrupt, an NMI or SMI, a debug exception, a machine check exception,
the BINIT# signal, the INIT# signal, or the RESET# signal will exit the
implementation-dependent-optimized state.
=====

And mwait doesn't have the 'auto-restart on SMM exit' like hlt does. So I
guess that's why I am seeing the triple faults even with your (fixed, see
below) patch as well.

So I don't think we can safely use this aproach.

>
> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
> index 9da8cccdf3fb..1c328624162c 100644
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -80,6 +80,7 @@ enum fixed_addresses {
> #ifdef CONFIG_X86_VSYSCALL_EMULATION
> VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
> #endif
> + FIX_MWAIT = (FIXADDR_TOP - VSYSCALL_ADDR - 1) >> PAGE_SHIFT,
> #endif
> FIX_DBGP_BASE,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 73e69aaaa117..9804fbe25d03 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -108,6 +108,8 @@ int __read_mostly __max_smt_threads = 1;
> /* Flag to indicate if a complete sched domain rebuild is required */
> bool x86_topology_update;
>
> +static char __mwait_page[PAGE_SIZE];

This needs to be __align(PAGE_SIZE) in order for the fixmap to work
properly.

--
Jiri Kosina
SUSE Labs

2019-05-31 14:55:50

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Andy Lutomirski wrote:

> For that matter, what actually happens if we get an SMI while halted?
> Does RSM go directly to sleep or does it re-fetch the HLT?

Our mails just crossed, I replied to Josh's mwait() proposal patch a
minute ago.

HLT is guaranteed to be re-entered if SMM interrupted it, while MWAIT is
not.

So as a short-term fix for 5.2, I still believe in v4 of my patch that
does the mwait->hlt->mwait transition across hibernate/resume, and for 5.3
I can look into forcing it to wait-for-SIPI proper.

--
Jiri Kosina
SUSE Labs

2019-05-31 15:28:26

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 04:54:20PM +0200, Jiri Kosina wrote:
> On Fri, 31 May 2019, Andy Lutomirski wrote:
>
> > For that matter, what actually happens if we get an SMI while halted?
> > Does RSM go directly to sleep or does it re-fetch the HLT?
>
> Our mails just crossed, I replied to Josh's mwait() proposal patch a
> minute ago.

Good catch. I agree that mwait seems unsafe across resume and my patch
is bogus.

Andy, in the short term it sounds like you're proposing to make
native_play_dead() use hlt_play_dead() unconditionally. Right?

That would simplify things and also would fix Jiri's bug I think. The
only question I'd have is if we have data on the power savings
difference between hlt and mwait. mwait seems to wake up on a lot of
different conditions which might negate its deeper sleep state.

Andy, for your long term idea to use INIT IPI, I wonder if that would
work with SMT siblings? Specifically I wonder about the Intel issue
that requires siblings to have CR4.MCE set.

--
Josh

2019-05-31 15:42:51

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Josh Poimboeuf wrote:

> The only question I'd have is if we have data on the power savings
> difference between hlt and mwait. mwait seems to wake up on a lot of
> different conditions which might negate its deeper sleep state.

hlt wakes up on basically the same set of events, but has the
auto-restarting semantics on some of them (especially SMM). So the wakeup
frequency itself shouldn't really contribute to power consumption
difference; it's the C-state that mwait allows CPU to enter.

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 16:21:33

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> On Fri, 31 May 2019, Josh Poimboeuf wrote:
>
> > The only question I'd have is if we have data on the power savings
> > difference between hlt and mwait. mwait seems to wake up on a lot of
> > different conditions which might negate its deeper sleep state.
>
> hlt wakes up on basically the same set of events, but has the
> auto-restarting semantics on some of them (especially SMM). So the wakeup
> frequency itself shouldn't really contribute to power consumption
> difference; it's the C-state that mwait allows CPU to enter.

Ok. I reluctantly surrender :-) For your v4:

Reviewed-by: Josh Poimboeuf <[email protected]>

It works as a short term fix, but it's fragile, and it does feel like
we're just adding more duct tape, as Andy said.

--
Josh

2019-05-31 16:26:00

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 7:54 AM Jiri Kosina <[email protected]> wrote:
>
> On Fri, 31 May 2019, Andy Lutomirski wrote:
>
> > For that matter, what actually happens if we get an SMI while halted?
> > Does RSM go directly to sleep or does it re-fetch the HLT?
>
> Our mails just crossed, I replied to Josh's mwait() proposal patch a
> minute ago.
>
> HLT is guaranteed to be re-entered if SMM interrupted it, while MWAIT is
> not.
>
> So as a short-term fix for 5.2, I still believe in v4 of my patch that
> does the mwait->hlt->mwait transition across hibernate/resume, and for 5.3
> I can look into forcing it to wait-for-SIPI proper.
>

How sure are you? From http://www.rcollins.org/ddj/Mar97/Mar97.html I see:

In general, the AutoHALT field directs the microprocessor whether or
not to restart the HLT instruction upon exit of SMM. This is
accomplished by decrementing EIP and executing whatever instruction
resides at that position. AutoHALT restart behavior is consistent,
regardless of whether or not EIP-1 contains a HLT instruction. If the
SMM handler set Auto HALT[bit0]=1 when the interrupted instruction was
not a HLT instruction(AutoHALT[bit0]= 0 upon entrance), they would run
the risk of resuming execution at an undesired location. The RSM
microcode doesn’t know the length of the interrupted instruction.
Therefore when AutoHALT[bit0]=1 upon exit, the RSM microcode blindly
decrements the EIP register by 1 and resumes execution. This explains
why Intel warns that unpredictable behavior may result from setting
this field to restart a HLT instruction when the microprocessor wasn’t
in a HALT state upon entrance. Listing One presents an algorithm that
describes the AutoHALT Restart feature.

The AMD APM says:

If the return from SMM takes the processor back to the halt state, the
HLT instruction is not re-
executed. However, the halt special bus-cycle is driven on the
processor bus after the RSM instruction
executes.

The Intel SDM Vol 3 34.10 says:

If the HLT instruction is restarted, the processor will generate a
memory access to fetch the HLT instruction (if it is
not in the internal cache), and execute a HLT bus transaction. This
behavior results in multiple HLT bus transactions
for the same HLT instruction.

And the SDM is very clear that one should *not* do RSM with auto-halt
set if the instruction that was interrupted wasn't HLT.

It sounds to me like Intel does not architecturally guarantee that
it's okay to do HLT from one CPU and then remove the HLT instruction
out from under it on another CPU.

2019-05-31 16:52:46

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 9:19 AM Josh Poimboeuf <[email protected]> wrote:
>
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > The only question I'd have is if we have data on the power savings
> > > difference between hlt and mwait. mwait seems to wake up on a lot of
> > > different conditions which might negate its deeper sleep state.
> >
> > hlt wakes up on basically the same set of events, but has the
> > auto-restarting semantics on some of them (especially SMM). So the wakeup
> > frequency itself shouldn't really contribute to power consumption
> > difference; it's the C-state that mwait allows CPU to enter.
>
> Ok. I reluctantly surrender :-) For your v4:
>
> Reviewed-by: Josh Poimboeuf <[email protected]>
>
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.
>

Just to clarify what I was thinking, it seems like soft-offlining a
CPU and resuming a kernel have fundamentally different requirements.
To soft-offline a CPU, we want to get power consumption as low as
possible and make sure that MCE won't kill the system. It's okay for
the CPU to occasionally execute some code. For resume, what we're
really doing is trying to hand control of all CPUs from kernel A to
kernel B. There are two basic ways to hand off control of a given
CPU: we can jump (with JMP, RET, horrible self-modifying code, etc)
from one kernel to the other, or we can attempt to make a given CPU
stop executing code from either kernel at all and then forcibly wrench
control of it in kernel B. Either approach seems okay, but the latter
approach depends on getting the CPU to reliably stop executing code.
We don't care about power consumption for resume, and I'm not even
convinced that we need to be able to survive an MCE that happens while
we're resuming, although surviving MCE would be nice.

So if we don't want to depend on nasty system details at all, we could
have the first kernel explicitly wake up all CPUs and hand them all
off to the new kernel, more or less the same way that we hand over
control of the BSP right now. Or we can look for a way to tell all
the APs to stop executing kernel code, and the only architectural way
I know of to do that is to sent an INIT IPI (and then presumably
deassert INIT -- the SDM is a bit vague).

Or we could allocate a page, stick a GDT, a TSS, and a 1: hlt; jmp 1b
in it, turn off paging, and run that code. And then somehow convince
the kernel we load not to touch that page until it finishes waking up
all CPUs. This seems conceptually simple and very robust, but I'm not
sure it fits in with the way hibernation works right now at all.

2019-05-31 18:12:57

by Josh Poimboeuf

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 09:51:09AM -0700, Andy Lutomirski wrote:
> Just to clarify what I was thinking, it seems like soft-offlining a
> CPU and resuming a kernel have fundamentally different requirements.
> To soft-offline a CPU, we want to get power consumption as low as
> possible and make sure that MCE won't kill the system. It's okay for
> the CPU to occasionally execute some code. For resume, what we're
> really doing is trying to hand control of all CPUs from kernel A to
> kernel B. There are two basic ways to hand off control of a given
> CPU: we can jump (with JMP, RET, horrible self-modifying code, etc)
> from one kernel to the other, or we can attempt to make a given CPU
> stop executing code from either kernel at all and then forcibly wrench
> control of it in kernel B. Either approach seems okay, but the latter
> approach depends on getting the CPU to reliably stop executing code.
> We don't care about power consumption for resume, and I'm not even
> convinced that we need to be able to survive an MCE that happens while
> we're resuming, although surviving MCE would be nice.

I'd thought you were proposing a global improvement: we get rid of
mwait_play_dead() everywhere, i.e. all the time, not just for the resume
path.

Instead it sounds like you were proposing a local improvement to the
resume path, to continue doing what
hibernate_resume_nonboot_cpu_disable() is already doing, but use an INIT
IPI instead of HLT to make sure the CPU is completely dead.

That may be a theoretical improvement but we'd still need to do the
whole "wake and play dead" dance which Jiri's patch is doing for offline
CPUs. So Jiri's patch looks ok to me.

--
Josh

2019-05-31 21:07:42

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, 31 May 2019, Andy Lutomirski wrote:

> The Intel SDM Vol 3 34.10 says:
>
> If the HLT instruction is restarted, the processor will generate a
> memory access to fetch the HLT instruction (if it is
> not in the internal cache), and execute a HLT bus transaction. This
> behavior results in multiple HLT bus transactions
> for the same HLT instruction.

Which basically means that both hibernation and kexec have been broken in
this respect for gazillions of years, and seems like noone noticed. Makes
one wonder what the reason for that might be.

Either SDM is not precise and the refetch actually never happens for real
(or is always in these cases satisfied from I$ perhaps?), or ... ?

So my patch basically puts things back where they have been for ages
(while mwait is obviously much worse, as that gets woken up by the write
to the monitored address, which inevitably does happen during resume), but
seems like SDM is suggesting that we've been in a grey zone wrt RSM at
least for all those ages.

So perhaps we really should ditch resume_play_dead() altogether
eventually, and replace it with sending INIT IPI around instead (and then
waking the CPUs properly via INIT INIT START). I'd still like to do that
for 5.3 though, as that'd be slightly bigger surgery, and conservatively
put things basically back to state they have been up to now for 5.2.

Thanks,

--
Jiri Kosina
SUSE Labs

2019-05-31 21:23:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume


> On May 31, 2019, at 2:05 PM, Jiri Kosina <[email protected]> wrote:
>
>> On Fri, 31 May 2019, Andy Lutomirski wrote:
>>
>> The Intel SDM Vol 3 34.10 says:
>>
>> If the HLT instruction is restarted, the processor will generate a
>> memory access to fetch the HLT instruction (if it is
>> not in the internal cache), and execute a HLT bus transaction. This
>> behavior results in multiple HLT bus transactions
>> for the same HLT instruction.
>
> Which basically means that both hibernation and kexec have been broken in
> this respect for gazillions of years, and seems like noone noticed. Makes
> one wonder what the reason for that might be.
>
> Either SDM is not precise and the refetch actually never happens for real
> (or is always in these cases satisfied from I$ perhaps?), or ... ?
>
> So my patch basically puts things back where they have been for ages
> (while mwait is obviously much worse, as that gets woken up by the write
> to the monitored address, which inevitably does happen during resume), but
> seems like SDM is suggesting that we've been in a grey zone wrt RSM at
> least for all those ages.
>
> So perhaps we really should ditch resume_play_dead() altogether
> eventually, and replace it with sending INIT IPI around instead (and then
> waking the CPUs properly via INIT INIT START). I'd still like to do that
> for 5.3 though, as that'd be slightly bigger surgery, and conservatively
> put things basically back to state they have been up to now for 5.2.
>


Seems reasonable to me. I would guess that it mostly works because SMI isn’t all that common and the window where it matters is short. Or maybe the SDM is misleading.

2019-06-03 11:53:02

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Friday, May 31, 2019 6:19:52 PM CEST Josh Poimboeuf wrote:
> On Fri, May 31, 2019 at 05:41:18PM +0200, Jiri Kosina wrote:
> > On Fri, 31 May 2019, Josh Poimboeuf wrote:
> >
> > > The only question I'd have is if we have data on the power savings
> > > difference between hlt and mwait. mwait seems to wake up on a lot of
> > > different conditions which might negate its deeper sleep state.
> >
> > hlt wakes up on basically the same set of events, but has the
> > auto-restarting semantics on some of them (especially SMM). So the wakeup
> > frequency itself shouldn't really contribute to power consumption
> > difference; it's the C-state that mwait allows CPU to enter.
>
> Ok. I reluctantly surrender :-) For your v4:
>
> Reviewed-by: Josh Poimboeuf <[email protected]>
>
> It works as a short term fix, but it's fragile, and it does feel like
> we're just adding more duct tape, as Andy said.

OK, the v4 queued up then, thanks!



2019-06-03 14:26:12

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Fri, May 31, 2019 at 02:22:27PM -0700, Andy Lutomirski wrote:
>
> > On May 31, 2019, at 2:05 PM, Jiri Kosina <[email protected]> wrote:
> >
> >> On Fri, 31 May 2019, Andy Lutomirski wrote:
> >>
> >> The Intel SDM Vol 3 34.10 says:
> >>
> >> If the HLT instruction is restarted, the processor will generate a
> >> memory access to fetch the HLT instruction (if it is
> >> not in the internal cache), and execute a HLT bus transaction. This
> >> behavior results in multiple HLT bus transactions
> >> for the same HLT instruction.
> >
> > Which basically means that both hibernation and kexec have been broken in
> > this respect for gazillions of years, and seems like noone noticed. Makes
> > one wonder what the reason for that might be.
> >
> > Either SDM is not precise and the refetch actually never happens for real
> > (or is always in these cases satisfied from I$ perhaps?), or ... ?
> >
> > So my patch basically puts things back where they have been for ages
> > (while mwait is obviously much worse, as that gets woken up by the write
> > to the monitored address, which inevitably does happen during resume), but
> > seems like SDM is suggesting that we've been in a grey zone wrt RSM at
> > least for all those ages.
> >
> > So perhaps we really should ditch resume_play_dead() altogether
> > eventually, and replace it with sending INIT IPI around instead (and then
> > waking the CPUs properly via INIT INIT START). I'd still like to do that
> > for 5.3 though, as that'd be slightly bigger surgery, and conservatively
> > put things basically back to state they have been up to now for 5.2.
> >
>
>
> Seems reasonable to me. I would guess that it mostly works because SMI isn’t
> all that common and the window where it matters is short. Or maybe the SDM
> is misleading.

For P6 and later, i.e. all modern CPUs, Intel processors go straight to
halted state and don't fetch/decode the HLT instruction.

P5 actually did a fetch, but from what I can tell that behavior wasn't
carried forward to KNC, unlike other legacy interrupt crud from P5:

[1] https://lkml.kernel.org/r/[email protected]

2019-06-03 15:27:45

by Jiri Kosina

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Mon, 3 Jun 2019, Sean Christopherson wrote:

> For P6 and later, i.e. all modern CPUs, Intel processors go straight to
> halted state and don't fetch/decode the HLT instruction.

That'd be a rather relieving fact actually. Do you happen to know if this
is stated in some Intel documentation and we've just overlooked it, or
whether it's rather an information that's being carried over from
generation to generation by whispering through grapevine?

Thanks,

--
Jiri Kosina
SUSE Labs

2019-06-03 17:48:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v4] x86/power: Fix 'nosmt' vs. hibernation triple fault during resume

On Mon, Jun 03, 2019 at 05:24:26PM +0200, Jiri Kosina wrote:
> On Mon, 3 Jun 2019, Sean Christopherson wrote:
>
> > For P6 and later, i.e. all modern CPUs, Intel processors go straight to
> > halted state and don't fetch/decode the HLT instruction.
>
> That'd be a rather relieving fact actually. Do you happen to know if this
> is stated in some Intel documentation and we've just overlooked it, or
> whether it's rather an information that's being carried over from
> generation to generation by whispering through grapevine?

I highly doubt it's officially stated anywhere. Intel's approach to this
type of micro-architecture specific behavior is (usually) to word the SDM
in such a way that both approaches are legal. E.g. a 1993 version of the
SDM says "Returns to interrupted HLT instruction", whereas in 1995, which
just so happens to coincide with the introduction of the P6 architecture,
the SDM started saying "Returns to HALT state" and added the blurb about
"will generate a memory access to fetch the HLT instruction (if it is not
in the internal cache)" so that the old behavior is still legal.

All that being said, the "straight to HALT" behavior is now the de facto
standard since lots of people will yell loudly if it changes.