2023-02-13 23:48:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 0/2] Kexec enabling in TDX guest

The patch brings basic enabling of kexec in TDX guests.

By "basic enabling" I mean, kexec in the guests with a single CPU.
TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
doesn't allow to put a CPU back offline if it has woken up.

We are looking into this, but it might take time.

Kirill A. Shutemov (2):
x86/kexec: Preserve CR4.MCE during kexec
x86/tdx: Convert shared memory back to private on kexec

arch/x86/coco/tdx/Makefile | 1 +
arch/x86/coco/tdx/kexec.c | 82 ++++++++++++++++++++++++++++
arch/x86/include/asm/tdx.h | 4 ++
arch/x86/kernel/machine_kexec_64.c | 2 +
arch/x86/kernel/relocate_kernel_64.S | 6 +-
5 files changed, 94 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdx/kexec.c

--
2.39.1



2023-02-13 23:49:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 2/2] x86/tdx: Convert shared memory back to private on kexec

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The target kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

Walk direct mapping and convert all shared memory back to private.
It makes all RAM private again and target kernel may use it normally.

Skip the conversion on kexec of crashkernel. It uses own pool of memory
and will not accidentally allocate from the memory the first kernel made
shared.

For crash investigation, it might be useful to access data in the shared
buffers.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/coco/tdx/Makefile | 1 +
arch/x86/coco/tdx/kexec.c | 82 ++++++++++++++++++++++++++++++
arch/x86/include/asm/tdx.h | 4 ++
arch/x86/kernel/machine_kexec_64.c | 2 +
4 files changed, 89 insertions(+)
create mode 100644 arch/x86/coco/tdx/kexec.c

diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..a5daa2a33531 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,4 @@
# SPDX-License-Identifier: GPL-2.0

obj-y += tdx.o tdcall.o
+obj-$(CONFIG_KEXEC_CORE) += kexec.o
diff --git a/arch/x86/coco/tdx/kexec.c b/arch/x86/coco/tdx/kexec.c
new file mode 100644
index 000000000000..f1f31515f372
--- /dev/null
+++ b/arch/x86/coco/tdx/kexec.c
@@ -0,0 +1,82 @@
+#define pr_fmt(fmt) "tdx: " fmt
+
+#include <linux/pagewalk.h>
+#include <asm/tdx.h>
+#include <asm/x86_init.h>
+
+static inline bool pud_decrypted(pud_t pud)
+{
+ return cc_mkdec(pud_val(pud)) == pud_val(pud);
+}
+
+static inline bool pmd_decrypted(pmd_t pmd)
+{
+ return cc_mkdec(pmd_val(pmd)) == pmd_val(pmd);
+}
+
+static inline bool pte_decrypted(pte_t pte)
+{
+ return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
+static inline void unshare_range(unsigned long start, unsigned long end)
+{
+ int pages = (end - start) / PAGE_SIZE;
+
+ if (!x86_platform.guest.enc_status_change_finish(start, pages, true))
+ pr_err("Failed to unshare range %#lx-%#lx\n", start, end);
+}
+
+static int unshare_pud(pud_t *pud, unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ if (pud_decrypted(*pud))
+ unshare_range(addr, next);
+
+ return 0;
+}
+
+static int unshare_pmd(pmd_t *pmd, unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ if (pmd_decrypted(*pmd))
+ unshare_range(addr, next);
+
+ return 0;
+}
+
+static int unshare_pte(pte_t *pte, unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ if (pte_decrypted(*pte))
+ unshare_range(addr, next);
+
+ return 0;
+}
+
+static const struct mm_walk_ops unshare_ops = {
+ .pud_entry = unshare_pud,
+ .pmd_entry = unshare_pmd,
+ .pte_entry = unshare_pte,
+};
+
+void tdx_kexec_prepare(bool crash)
+{
+ /*
+ * Crash kernel may want to see data in the shared buffers.
+ * Do not revert them to private on kexec of crash kernel.
+ */
+ if (crash)
+ return;
+
+ /*
+ * Walk direct mapping and convert all shared memory back to private,
+ * so the target kernel will be able use it normally.
+ */
+ mmap_write_lock(&init_mm);
+ walk_page_range_novma(&init_mm,
+ PAGE_OFFSET,
+ PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT),
+ &unshare_ops, init_mm.pgd, NULL);
+ mmap_write_unlock(&init_mm);
+}
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 28d889c9aa16..7cdbf10e9f7d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -69,6 +69,8 @@ bool tdx_early_handle_ve(struct pt_regs *regs);

int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);

+void tdx_kexec_prepare(bool crash);
+
#else

static inline void tdx_early_init(void) { };
@@ -76,6 +78,8 @@ static inline void tdx_safe_halt(void) { };

static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }

+static inline void tdx_kexec_prepare(bool crash) {}
+
#endif /* CONFIG_INTEL_TDX_GUEST */

#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 0611fd83858e..adbb3e5347da 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -28,6 +28,7 @@
#include <asm/setup.h>
#include <asm/set_memory.h>
#include <asm/cpu.h>
+#include <asm/tdx.h>

#ifdef CONFIG_ACPI
/*
@@ -312,6 +313,7 @@ void machine_kexec(struct kimage *image)
local_irq_disable();
hw_breakpoint_disable();
cet_disable();
+ tdx_kexec_prepare(image->type == KEXEC_TYPE_CRASH);

if (image->preserve_context) {
#ifdef CONFIG_X86_IO_APIC
--
2.39.1


2023-02-13 23:49:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 1/2] x86/kexec: Preserve CR4.MCE during kexec

TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
to #VE.

Preserve the flag during kexec.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/relocate_kernel_64.S | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
index 4a73351f87f8..18f19dcc40e9 100644
--- a/arch/x86/kernel/relocate_kernel_64.S
+++ b/arch/x86/kernel/relocate_kernel_64.S
@@ -145,8 +145,12 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
* Set cr4 to a known state:
* - physical address extension enabled
* - 5-level paging, if it was enabled before
+ * - Preserve MCE, if it was set. Clearing MCE may fault in some
+ * environments.
*/
- movl $X86_CR4_PAE, %eax
+ movq %cr4, %rax
+ andl $X86_CR4_MCE, %eax
+ orl $X86_CR4_PAE, %eax
testq $X86_CR4_LA57, %r13
jz 1f
orl $X86_CR4_LA57, %eax
--
2.39.1


2023-02-15 01:53:53

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/tdx: Convert shared memory back to private on kexec


> +void tdx_kexec_prepare(bool crash)
> +{
> + /*
> + * Crash kernel may want to see data in the shared buffers.
> + * Do not revert them to private on kexec of crash kernel.
> + */
> + if (crash)
> + return;
> +
> + /*
> + * Walk direct mapping and convert all shared memory back to private,
> + * so the target kernel will be able use it normally.
> + */
> + mmap_write_lock(&init_mm);
> + walk_page_range_novma(&init_mm,
> + PAGE_OFFSET,
> + PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT),
> + &unshare_ops, init_mm.pgd, NULL);
> + mmap_write_unlock(&init_mm);
> +}

Looks the page table walk is done unconditionally when !crash.

I think it's better to check whether this is TDX guest (either this function, or
below in machine_kexec()) and just return early if it's not a TDX guest?

[..]


> /*
> @@ -312,6 +313,7 @@ void machine_kexec(struct kimage *image)
> local_irq_disable();
> hw_breakpoint_disable();
> cet_disable();
> + tdx_kexec_prepare(image->type == KEXEC_TYPE_CRASH);
>
> if (image->preserve_context) {
> #ifdef CONFIG_X86_IO_APIC

2023-02-16 01:49:55

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/kexec: Preserve CR4.MCE during kexec

On Tue, 2023-02-14 at 02:48 +0300, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it
> leads
> to #VE.
>
> Preserve the flag during kexec.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

I wonder whats going on with the pre-existing switching between eax and
rax in this code for the cr0 and cr4 manipulations. Do you know what
the reason is?

Also, for a simple non-tdx kexec regression test:
Tested-by: Rick Edgecombe <[email protected]>

2023-02-16 09:44:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/kexec: Preserve CR4.MCE during kexec

On Thu, Feb 16, 2023 at 01:49:39AM +0000, Edgecombe, Rick P wrote:
> On Tue, 2023-02-14 at 02:48 +0300, Kirill A. Shutemov wrote:
> > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it
> > leads
> > to #VE.
> >
> > Preserve the flag during kexec.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>
> I wonder whats going on with the pre-existing switching between eax and
> rax in this code for the cr0 and cr4 manipulations. Do you know what
> the reason is?

32-bit ORs and ANDs save one byte per instruction. And there's no 32-bit
MOV to/from control registers in 64-bit mode.

>
> Also, for a simple non-tdx kexec regression test:
> Tested-by: Rick Edgecombe <[email protected]>

Thanks!

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-02-16 17:34:00

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/kexec: Preserve CR4.MCE during kexec

On Thu, 2023-02-16 at 12:43 +0300, Kirill A. Shutemov wrote:
> On Thu, Feb 16, 2023 at 01:49:39AM +0000, Edgecombe, Rick P wrote:
> > On Tue, 2023-02-14 at 02:48 +0300, Kirill A. Shutemov wrote:
> > > TDX guests are not allowed to clear CR4.MCE. Attempt to clear it
> > > leads
> > > to #VE.
> > >
> > > Preserve the flag during kexec.
> > >
> > > Signed-off-by: Kirill A. Shutemov <
> > > [email protected]>
> >
> > I wonder whats going on with the pre-existing switching between eax
> > and
> > rax in this code for the cr0 and cr4 manipulations. Do you know
> > what
> > the reason is?
>
> 32-bit ORs and ANDs save one byte per instruction. And there's no 32-
> bit
> MOV to/from control registers in 64-bit mode.

Oh right, I think I recall now. There is a 64 bit AND in the CR0 piece
here too, which of course is outside of these changes.

But otherwise, it's not clear from the patch what the implications are
of leaving CR4.MCE set for the non-TDX environment. I see in head_64.S
it will clear it during boot if the kernel doesn't support machine
check. So it leaves a little window where CR4.MCE is set where it
wasn't before.

The piece in head_64.S talks about how an #MC will crash the system if
it happens before the machine check stuff is fully setup anyway, so it
doesn't hurt to leave it on. Is that the reasoning for this change as
well? If so it might help to add a little more about the reasoning in
the commit log.

2023-02-16 17:50:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On 2/13/23 15:48, Kirill A. Shutemov wrote:
> The patch brings basic enabling of kexec in TDX guests.
>
> By "basic enabling" I mean, kexec in the guests with a single CPU.
> TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
> doesn't allow to put a CPU back offline if it has woken up.
>
> We are looking into this, but it might take time.

This is simple enough. But, nobody will _actually_ use this code as-is,
right? What's the point of applying it now?

2023-02-16 18:12:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On Thu, Feb 16, 2023 at 09:50:32AM -0800, Dave Hansen wrote:
> On 2/13/23 15:48, Kirill A. Shutemov wrote:
> > The patch brings basic enabling of kexec in TDX guests.
> >
> > By "basic enabling" I mean, kexec in the guests with a single CPU.
> > TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
> > doesn't allow to put a CPU back offline if it has woken up.
> >
> > We are looking into this, but it might take time.
>
> This is simple enough. But, nobody will _actually_ use this code as-is,
> right? What's the point of applying it now?

Why nobody? Single CPU VMs are not that uncommon.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-02-16 18:33:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On 2/16/23 10:12, Kirill A. Shutemov wrote:
> On Thu, Feb 16, 2023 at 09:50:32AM -0800, Dave Hansen wrote:
>> On 2/13/23 15:48, Kirill A. Shutemov wrote:
>>> The patch brings basic enabling of kexec in TDX guests.
>>>
>>> By "basic enabling" I mean, kexec in the guests with a single CPU.
>>> TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
>>> doesn't allow to put a CPU back offline if it has woken up.
>>>
>>> We are looking into this, but it might take time.
>> This is simple enough. But, nobody will _actually_ use this code as-is,
>> right? What's the point of applying it now?
> Why nobody? Single CPU VMs are not that uncommon.

Here's one data point: the only "General Purpose" ones I see AWS
offering are Haswell era:

https://aws.amazon.com/ec2/instance-types/

That _might_ be because of concerns about SMT side-channel exposure on
anything newer.

So, we can argue about what "uncommon" means. But, a minority of folks
care about 1-cpu VMs. Also, a separate minority of folks care about
kexec(). I'm worried that the overlap between the two will be an
*OVERWHELMING* minority of folks. In other words, so few people will
use this code that it'll just bitrot.

I'm looking for compelling arguments why mainline should carry this.

2023-02-16 19:47:06

by Andrew Cooper

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/kexec: Preserve CR4.MCE during kexec

On 13/02/2023 11:48 pm, Kirill A. Shutemov wrote:
> TDX guests are not allowed to clear CR4.MCE. Attempt to clear it leads
> to #VE.
>
> Preserve the flag during kexec.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

That's unfortunate for TDX, but in a non-TDX system you must never
maintain CR4.MCE into kexec.

Async events, including NMIs, cannot be taken between this point and the
target having set itself up into it's intended operating mode.  During
this period you get all kinds of fun with type confusion in the IDT/TSS
and/or not having a safe stack to service the event.

A clean shutdown from not having machine checks enabled is far
preferable to trying to deliver an #MC in purgatory.

That, or you're welcome to debug the next bug report I get where
(amazingly) an NMI managed to hit with a good stack in the new context,
but most of the old context (IDT, TSS and .text) still intact enough to
start emitting a very confused oops onto serial...

~Andrew

2023-02-22 10:26:43

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On Tue, 2023-02-14 at 02:48 +0300, Kirill A. Shutemov wrote:
> The patch brings basic enabling of kexec in TDX guests.
>
> By "basic enabling" I mean, kexec in the guests with a single CPU.
> TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
> doesn't allow to put a CPU back offline if it has woken up.
>
> We are looking into this, but it might take time.

Can't we park the secondary CPUs in a purgatory-like thing of their own
and wake them from there when we want them?

Patches for that were floating around once, although the primary reason
then was latency, and we decided to address that differently by doing
the bringup in parallel instead.


Attachments:
smime.p7s (5.83 kB)

2023-02-24 14:30:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On Wed, Feb 22, 2023 at 10:26:22AM +0000, David Woodhouse wrote:
> On Tue, 2023-02-14 at 02:48 +0300, Kirill A. Shutemov wrote:
> > The patch brings basic enabling of kexec in TDX guests.
> >
> > By "basic enabling" I mean, kexec in the guests with a single CPU.
> > TDX guests use ACPI MADT MPWK to bring up secondary CPUs. The mechanism
> > doesn't allow to put a CPU back offline if it has woken up.
> >
> > We are looking into this, but it might take time.
>
> Can't we park the secondary CPUs in a purgatory-like thing of their own
> and wake them from there when we want them?
>
> Patches for that were floating around once, although the primary reason
> then was latency, and we decided to address that differently by doing
> the bringup in parallel instead.

That's plan B. It is suboptimal. kexec() can happen into something that is
not Linux which will not be able to wake up CPUs.

Ideally, it has to be addressed on BIOS level: it has to provide a way to
offline CPUs, putting it back to pre-wakeup state.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-02-24 15:22:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On 2/24/23 06:30, Kirill A. Shutemov wrote:
> Ideally, it has to be addressed on BIOS level: it has to provide a way to
> offline CPUs, putting it back to pre-wakeup state.

Is there anything stopping us from just parking the CPUs in a loop
looking at 'acpi_mp_wake_mailbox_paddr'? Basically park them in a way
which is indistinguishable from what the BIOS did.

2023-02-24 16:12:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 0/2] Kexec enabling in TDX guest

On Fri, Feb 24, 2023 at 07:22:18AM -0800, Dave Hansen wrote:
> On 2/24/23 06:30, Kirill A. Shutemov wrote:
> > Ideally, it has to be addressed on BIOS level: it has to provide a way to
> > offline CPUs, putting it back to pre-wakeup state.
>
> Is there anything stopping us from just parking the CPUs in a loop
> looking at 'acpi_mp_wake_mailbox_paddr'? Basically park them in a way
> which is indistinguishable from what the BIOS did.

+Rafael.

- Forward compatibility can be an issue. Version 0 of mailbox supports
only single Wakeup command. Future specs may define a new command that
kernel implementation doesn't support.

- BIOS owns the mailbox page and can re-use for something else after the
last CPU has woken up. (I know it is very theoretical, but still.)

- We can patch ACPI table to point to mailbox page in kernel allocated
memory, but it brings other problem. If the first kernel didn't wake up
all CPUs for some reason (CONFIG_SMP=n or nr_cpus= or something) the
second kernel would not be able to wake up them too since they looping
around the old address.

But ultimately, I think it is clearly missing BIOS functionality and has
to be addressed there. Hacking around it in kernel will lead to more
problems down the road.

--
Kiryl Shutsemau / Kirill A. Shutemov