Kexec enabling inside TDX guest requires addressing several issues:
- Avoid clearing CR4.MCE during kexec;
- Convert all shared memory back to private on kexec, so the next
kernel could start normally;
- Add support for offlining secondary CPUs, so the new kernel could
bring them up again.
The first two items are relatively straight-forward. The initial attempt
to implement them can be found here - [1].
The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
it has woken up.
It is clearly missing functionality from the ACPI mechanism and it has
to be changed to allow offlining. The work in this direction has
started, but it takes time.
For now, disable kexec for TDX guests. It will fail kexec instead of
crashing on attempt.
[1] https://lore.kernel.org/all/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/include/asm/kexec.h | 3 +++
arch/x86/kernel/machine_kexec_64.c | 10 ++++++++++
include/linux/kexec.h | 7 +++++++
kernel/kexec.c | 4 ++++
kernel/kexec_file.c | 4 ++++
5 files changed, 28 insertions(+)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index a3760ca796aa..89693684a7d1 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -189,6 +189,9 @@ extern void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages);
void arch_kexec_protect_crashkres(void);
#define arch_kexec_protect_crashkres arch_kexec_protect_crashkres
+int arch_kexec_load(void);
+#define arch_kexec_load arch_kexec_load
+
void arch_kexec_unprotect_crashkres(void);
#define arch_kexec_unprotect_crashkres arch_kexec_unprotect_crashkres
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 0611fd83858e..9a0ded12df70 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -600,3 +600,13 @@ void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages)
*/
set_memory_encrypted((unsigned long)vaddr, pages);
}
+
+int arch_kexec_load(void)
+{
+ if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
+ pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
+ return -EOPNOTSUPP;
+ }
+
+ return 0;
+}
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 6883c5922701..9fa88c191605 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -442,6 +442,13 @@ static inline void arch_kexec_protect_crashkres(void) { }
static inline void arch_kexec_unprotect_crashkres(void) { }
#endif
+#ifndef arch_kexec_load
+static inline int arch_kexec_load(void)
+{
+ return 0;
+}
+#endif
+
#ifndef page_to_boot_pfn
static inline unsigned long page_to_boot_pfn(struct page *page)
{
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 92d301f98776..70e4923c135d 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -194,6 +194,10 @@ static inline int kexec_load_check(unsigned long nr_segments,
KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT;
int result;
+ result = arch_kexec_load();
+ if (result)
+ return result;
+
/* We only trust the superuser with rebooting the system. */
if (!kexec_load_permitted(image_type))
return -EPERM;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index f1a0e4e3fb5c..d02b7eb0f6e6 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -331,6 +331,10 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
struct kimage **dest_image, *image;
int ret = 0, i;
+ ret = arch_kexec_load();
+ if (ret)
+ return ret;
+
/* We only trust the superuser with rebooting the system. */
if (!kexec_load_permitted(image_type))
return -EPERM;
--
2.39.2
On 3/25/23 09:01, Kirill A. Shutemov wrote:
> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> it has woken up.
...
> +int arch_kexec_load(void)
> +{
> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
> + return -EOPNOTSUPP;
> + }
> +
> + return 0;
> +}
So, let's put all this together:
1. TDX implementations use MADT for wakeup exclusively right now (but
are not necessarily _required_ to do so forever)
2. MADT doesn't support CPU offlining
3. kexec() requires offlining
Thus, current TDX implementations can't support TDX guests. This
*doesn't* say that TDX will always use the MADT for wakeups.
Yet, the check you have here is for TDX and *not* for the MADT.
That seems wrong.
Let's say SEV or arm64 comes along and uses the MADT for their guests.
They'll add another arch_kexec_load(), with a check for *their* feature.
This all seems like you should be disabling kexec() the moment the MADT
CPU wakeup is used instead of making it based on TDX.
On Sat, Mar 25, 2023 at 09:25:36AM -0700, Dave Hansen wrote:
> On 3/25/23 09:01, Kirill A. Shutemov wrote:
> > The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> > secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> > it has woken up.
> ...
> > +int arch_kexec_load(void)
> > +{
> > + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> > + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + return 0;
> > +}
>
> So, let's put all this together:
>
> 1. TDX implementations use MADT for wakeup exclusively right now (but
> are not necessarily _required_ to do so forever)
> 2. MADT doesn't support CPU offlining
> 3. kexec() requires offlining
>
> Thus, current TDX implementations can't support TDX guests. This
> *doesn't* say that TDX will always use the MADT for wakeups.
>
> Yet, the check you have here is for TDX and *not* for the MADT.
As I described in the commit message there are more than MADT that is
required to get kexec in TDX guest.
> That seems wrong.
>
> Let's say SEV or arm64 comes along and uses the MADT for their guests.
> They'll add another arch_kexec_load(), with a check for *their* feature.
>
> This all seems like you should be disabling kexec() the moment the MADT
> CPU wakeup is used instead of making it based on TDX.
I guess we can go this path if you are fine with taking CR4.MCE and shared
memory reverting patches (they require some rework, but I can get them
into shape quickly). After that we can forbid kexec on machines with MADT
if nr_cpus > 1.
Sounds good?
--
Kiryl Shutsemau / Kirill A. Shutemov
On 3/25/23 12:25, Kirill A. Shutemov wrote:
> On Sat, Mar 25, 2023 at 09:25:36AM -0700, Dave Hansen wrote:
>> On 3/25/23 09:01, Kirill A. Shutemov wrote:
>>> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
>>> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
>>> it has woken up.
>> ...
>>> +int arch_kexec_load(void)
>>> +{
>>> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
>>> + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
>>> + return -EOPNOTSUPP;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>
>> So, let's put all this together:
>>
>> 1. TDX implementations use MADT for wakeup exclusively right now (but
>> are not necessarily _required_ to do so forever)
>> 2. MADT doesn't support CPU offlining
>> 3. kexec() requires offlining
>>
>> Thus, current TDX implementations can't support TDX guests. This
>> *doesn't* say that TDX will always use the MADT for wakeups.
>>
>> Yet, the check you have here is for TDX and *not* for the MADT.
>
> As I described in the commit message there are more than MADT that is
> required to get kexec in TDX guest.
I kinda think we should do both.
Let's make sure that all systems that depend on MADT wakeups can't
kexec() until the ACPI folks work out what to do there.
Separately, let's either fix or *mark* the kexec()-incompatible pieces
that *ARE* specific to TDX.
>> That seems wrong.
>>
>> Let's say SEV or arm64 comes along and uses the MADT for their guests.
>> They'll add another arch_kexec_load(), with a check for *their* feature.
>>
>> This all seems like you should be disabling kexec() the moment the MADT
>> CPU wakeup is used instead of making it based on TDX.
>
> I guess we can go this path if you are fine with taking CR4.MCE and shared
> memory reverting patches (they require some rework, but I can get them
> into shape quickly). After that we can forbid kexec on machines with MADT
> if nr_cpus > 1.
This goes back to what I asked before: is anyone actually going to *use*
a single-processor system that wants to kexec()? If not, let's not
waste the time to introduce code that is just going to bitrot. Just
mark it broken and move on with life.
I'm also a _bit_ curious what the implications of the CR4.MCE
preservation are. IIRC, systems are quite a bit less stable when
CR4.MCE==0. So, maybe there are some benefits to leaving it set during
kexec() for everyone. But, that probably also involves having a #MC
handler in the mix and that's obviously trouble while we're switching
kernels and messing around in the guts of things like paging configuration.
The CR4.MCE change is simple enough, but it also _completely_ glossed
over the implications to non-TDX users.
I guess the overall message here is: please try to think of the big
picture a _little_ outside of the TDX world. This patch's approach is a
bit myopic.
On 03/26/23 at 10:01am, Dave Hansen wrote:
> On 3/25/23 12:25, Kirill A. Shutemov wrote:
> > On Sat, Mar 25, 2023 at 09:25:36AM -0700, Dave Hansen wrote:
> >> On 3/25/23 09:01, Kirill A. Shutemov wrote:
> >>> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> >>> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> >>> it has woken up.
> >> ...
> >>> +int arch_kexec_load(void)
> >>> +{
> >>> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> >>> + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
> >>> + return -EOPNOTSUPP;
> >>> + }
> >>> +
> >>> + return 0;
> >>> +}
> >>
> >> So, let's put all this together:
> >>
> >> 1. TDX implementations use MADT for wakeup exclusively right now (but
> >> are not necessarily _required_ to do so forever)
> >> 2. MADT doesn't support CPU offlining
> >> 3. kexec() requires offlining
> >>
> >> Thus, current TDX implementations can't support TDX guests. This
> >> *doesn't* say that TDX will always use the MADT for wakeups.
> >>
> >> Yet, the check you have here is for TDX and *not* for the MADT.
> >
> > As I described in the commit message there are more than MADT that is
> > required to get kexec in TDX guest.
>
> I kinda think we should do both.
>
> Let's make sure that all systems that depend on MADT wakeups can't
> kexec() until the ACPI folks work out what to do there.
>
> Separately, let's either fix or *mark* the kexec()-incompatible pieces
> that *ARE* specific to TDX.
>
> >> That seems wrong.
> >>
> >> Let's say SEV or arm64 comes along and uses the MADT for their guests.
> >> They'll add another arch_kexec_load(), with a check for *their* feature.
> >>
> >> This all seems like you should be disabling kexec() the moment the MADT
> >> CPU wakeup is used instead of making it based on TDX.
> >
> > I guess we can go this path if you are fine with taking CR4.MCE and shared
> > memory reverting patches (they require some rework, but I can get them
> > into shape quickly). After that we can forbid kexec on machines with MADT
> > if nr_cpus > 1.
>
> This goes back to what I asked before: is anyone actually going to *use*
> a single-processor system that wants to kexec()? If not, let's not
> waste the time to introduce code that is just going to bitrot. Just
> mark it broken and move on with life.
Now we have two API for kexec: kexec_load and kexec_file_load. They can
be used to do kexec reboot, or crash dumping. For crash dumping, we
usually only use one cpu to do the vmcore dumping. At least on our
Fedora/centos-stream/RHEL, we do like this with kernel parameter
'nr_cpus=1' added by default. Unless people explicitly remove the
'nr_cpus=1' restriction or set nr_cpus= to other number to persue
multithread dumping in kdump kernel.
So I think Kirill's idea looks good. Means on TDX guest kexec reboot
will be forbid, while crash dumping will function normally.
Thanks
Baoquan
On Sun, Mar 26, 2023 at 10:01:23AM -0700, Dave Hansen wrote:
> > I guess we can go this path if you are fine with taking CR4.MCE and shared
> > memory reverting patches (they require some rework, but I can get them
> > into shape quickly). After that we can forbid kexec on machines with MADT
> > if nr_cpus > 1.
>
> This goes back to what I asked before: is anyone actually going to *use*
> a single-processor system that wants to kexec()? If not, let's not
> waste the time to introduce code that is just going to bitrot. Just
> mark it broken and move on with life.
>
> I'm also a _bit_ curious what the implications of the CR4.MCE
> preservation are. IIRC, systems are quite a bit less stable when
> CR4.MCE==0. So, maybe there are some benefits to leaving it set during
> kexec() for everyone.
Hm. I thought the opposite: keeping MCE set brings more risks.
Andrew had feedback on the patch:
Async events, including NMIs, cannot be taken between this point and the
target having set itself up into it's intended operating mode. During
this period you get all kinds of fun with type confusion in the IDT/TSS
and/or not having a safe stack to service the event.
I tend to agree with him, but maybe I miss bigger picture.
Based on that I adjusted the patch to only affect TDX guests:
From edbef5f1e6c31929ae1249c58b29c38f86e676c0 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Fri, 10 Feb 2023 12:53:11 +0300
Subject: [PATCH] x86/kexec: Keep CR4.MCE during kexec for TDX guest
TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.
Use alternatives to keep the flag during kexec for TDX guests.
The change doesn't affect non-TDX environments.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 4a73351f87f8..9e83a638a2b8 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,8 +145,11 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* Set cr4 to a known state:
* - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Machine check exception on TDX guest. Clearing MCE is not allowed
+ * in TDX guests.
*/
- movl $X86_CR4_PAE, %eax
+ ALTERNATIVE "movl "$X86_CR4_PAE", %eax", \
+ "movl "$(X86_CR4_PAE | X86_CR4_MCE)", %eax", X86_FEATURE_TDX_GUEST
testq $X86_CR4_LA57, %r13
jz 1f
orl $X86_CR4_LA57, %eax
--
Kiryl Shutsemau / Kirill A. Shutemov
On Mon, Mar 27, 2023 at 09:35:54AM +0800, Baoquan He wrote:
> On 03/26/23 at 10:01am, Dave Hansen wrote:
> > On 3/25/23 12:25, Kirill A. Shutemov wrote:
> > > On Sat, Mar 25, 2023 at 09:25:36AM -0700, Dave Hansen wrote:
> > >> On 3/25/23 09:01, Kirill A. Shutemov wrote:
> > >>> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> > >>> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> > >>> it has woken up.
> > >> ...
> > >>> +int arch_kexec_load(void)
> > >>> +{
> > >>> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> > >>> + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
> > >>> + return -EOPNOTSUPP;
> > >>> + }
> > >>> +
> > >>> + return 0;
> > >>> +}
> > >>
> > >> So, let's put all this together:
> > >>
> > >> 1. TDX implementations use MADT for wakeup exclusively right now (but
> > >> are not necessarily _required_ to do so forever)
> > >> 2. MADT doesn't support CPU offlining
> > >> 3. kexec() requires offlining
> > >>
> > >> Thus, current TDX implementations can't support TDX guests. This
> > >> *doesn't* say that TDX will always use the MADT for wakeups.
> > >>
> > >> Yet, the check you have here is for TDX and *not* for the MADT.
> > >
> > > As I described in the commit message there are more than MADT that is
> > > required to get kexec in TDX guest.
> >
> > I kinda think we should do both.
> >
> > Let's make sure that all systems that depend on MADT wakeups can't
> > kexec() until the ACPI folks work out what to do there.
> >
> > Separately, let's either fix or *mark* the kexec()-incompatible pieces
> > that *ARE* specific to TDX.
> >
> > >> That seems wrong.
> > >>
> > >> Let's say SEV or arm64 comes along and uses the MADT for their guests.
> > >> They'll add another arch_kexec_load(), with a check for *their* feature.
> > >>
> > >> This all seems like you should be disabling kexec() the moment the MADT
> > >> CPU wakeup is used instead of making it based on TDX.
> > >
> > > I guess we can go this path if you are fine with taking CR4.MCE and shared
> > > memory reverting patches (they require some rework, but I can get them
> > > into shape quickly). After that we can forbid kexec on machines with MADT
> > > if nr_cpus > 1.
> >
> > This goes back to what I asked before: is anyone actually going to *use*
> > a single-processor system that wants to kexec()? If not, let's not
> > waste the time to introduce code that is just going to bitrot. Just
> > mark it broken and move on with life.
>
> Now we have two API for kexec: kexec_load and kexec_file_load. They can
> be used to do kexec reboot, or crash dumping. For crash dumping, we
> usually only use one cpu to do the vmcore dumping. At least on our
> Fedora/centos-stream/RHEL, we do like this with kernel parameter
> 'nr_cpus=1' added by default. Unless people explicitly remove the
> 'nr_cpus=1' restriction or set nr_cpus= to other number to persue
> multithread dumping in kdump kernel.
Hm. I'm not sure how to determine if the target kernel wants to use >1
CPU. Scanning cmdline looks fragile. And who said the target kernel is
Linux.
I guess we can park all CPUs, but CPU0 and target kernel will just fail to
bring them up which is non-fatal issue (at least for Linux).
I admit that all looks hackish.
--
Kiryl Shutsemau / Kirill A. Shutemov
On 03/27/23 at 02:09pm, Kirill A. Shutemov wrote:
> On Mon, Mar 27, 2023 at 09:35:54AM +0800, Baoquan He wrote:
> > On 03/26/23 at 10:01am, Dave Hansen wrote:
> > > On 3/25/23 12:25, Kirill A. Shutemov wrote:
> > > > On Sat, Mar 25, 2023 at 09:25:36AM -0700, Dave Hansen wrote:
> > > >> On 3/25/23 09:01, Kirill A. Shutemov wrote:
> > > >>> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> > > >>> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> > > >>> it has woken up.
> > > >> ...
> > > >>> +int arch_kexec_load(void)
> > > >>> +{
> > > >>> + if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST)) {
> > > >>> + pr_warn_once("Disable kexec: not yet supported in TDX guest\n");
> > > >>> + return -EOPNOTSUPP;
> > > >>> + }
> > > >>> +
> > > >>> + return 0;
> > > >>> +}
> > > >>
> > > >> So, let's put all this together:
> > > >>
> > > >> 1. TDX implementations use MADT for wakeup exclusively right now (but
> > > >> are not necessarily _required_ to do so forever)
> > > >> 2. MADT doesn't support CPU offlining
> > > >> 3. kexec() requires offlining
> > > >>
> > > >> Thus, current TDX implementations can't support TDX guests. This
> > > >> *doesn't* say that TDX will always use the MADT for wakeups.
> > > >>
> > > >> Yet, the check you have here is for TDX and *not* for the MADT.
> > > >
> > > > As I described in the commit message there are more than MADT that is
> > > > required to get kexec in TDX guest.
> > >
> > > I kinda think we should do both.
> > >
> > > Let's make sure that all systems that depend on MADT wakeups can't
> > > kexec() until the ACPI folks work out what to do there.
> > >
> > > Separately, let's either fix or *mark* the kexec()-incompatible pieces
> > > that *ARE* specific to TDX.
> > >
> > > >> That seems wrong.
> > > >>
> > > >> Let's say SEV or arm64 comes along and uses the MADT for their guests.
> > > >> They'll add another arch_kexec_load(), with a check for *their* feature.
> > > >>
> > > >> This all seems like you should be disabling kexec() the moment the MADT
> > > >> CPU wakeup is used instead of making it based on TDX.
> > > >
> > > > I guess we can go this path if you are fine with taking CR4.MCE and shared
> > > > memory reverting patches (they require some rework, but I can get them
> > > > into shape quickly). After that we can forbid kexec on machines with MADT
> > > > if nr_cpus > 1.
> > >
> > > This goes back to what I asked before: is anyone actually going to *use*
> > > a single-processor system that wants to kexec()? If not, let's not
> > > waste the time to introduce code that is just going to bitrot. Just
> > > mark it broken and move on with life.
> >
> > Now we have two API for kexec: kexec_load and kexec_file_load. They can
> > be used to do kexec reboot, or crash dumping. For crash dumping, we
> > usually only use one cpu to do the vmcore dumping. At least on our
> > Fedora/centos-stream/RHEL, we do like this with kernel parameter
> > 'nr_cpus=1' added by default. Unless people explicitly remove the
> > 'nr_cpus=1' restriction or set nr_cpus= to other number to persue
> > multithread dumping in kdump kernel.
>
> Hm. I'm not sure how to determine if the target kernel wants to use >1
> CPU. Scanning cmdline looks fragile. And who said the target kernel is
> Linux.
Ah, I forgot the checking and disabling is done in 1st kernel, it's truly
not convinent to get 2nd kernel's cmdline. Then disabling kexec on TDX
guest for the time being looks resonable to me.
>
> I guess we can park all CPUs, but CPU0 and target kernel will just fail to
> bring them up which is non-fatal issue (at least for Linux).
>
> I admit that all looks hackish.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
>
On 3/25/23 09:01, Kirill A. Shutemov wrote:
> The last item is tricky. TDX guests use ACPI MADT MPWK to bring up
> secondary CPUs. The mechanism doesn't allow to put a CPU back offline if
> it has woken up.
I'm not sure I like this approach.
TDX uses the MADT exclusively.
MADT-based systems can't offline CPUs.
kexec() requires