2024-05-28 09:57:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second, kexec-ed kernel has no idea what memory is converted this
way. It only sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On kexec walk direct mapping and convert all shared memory back to
private. It makes all RAM private again and second kernel may use it
normally.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 69 +++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable.h | 5 +++
arch/x86/include/asm/set_memory.h | 3 ++
arch/x86/mm/pat/set_memory.c | 41 ++++++++++++++++--
4 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..c0a651fa8963 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/kexec.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -14,6 +15,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/set_memory.h>

/* MMIO direction */
#define EPT_READ 0
@@ -831,6 +833,70 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return 0;
}

+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+ /*
+ * Crash kernel reaches here with interrupts disabled: can't wait for
+ * conversions to finish.
+ *
+ * If race happened, just report and proceed.
+ */
+ if (!set_memory_enc_stop_conversion(!crash))
+ pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+ unsigned long addr, end;
+ long found = 0, shared;
+
+ lockdep_assert_irqs_disabled();
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ if (pte && pte_decrypted(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ /*
+ * Touching memory with shared bit set triggers implicit
+ * conversion to shared.
+ *
+ * Make sure nobody touches the shared range from
+ * now on.
+ */
+ set_pte(pte, __pte(0));
+
+ if (!tdx_enc_status_changed(addr, pages, true)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ found += pages;
+ }
+
+ addr += size;
+ }
+
+ __flush_tlb_all();
+
+ shared = atomic_long_read(&nr_shared);
+ if (shared != found) {
+ pr_err("shared page accounting is off\n");
+ pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+ }
+}
+
void __init tdx_early_init(void)
{
struct tdx_module_args args = {
@@ -890,6 +956,9 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;

+ x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
+ x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
+
/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
* bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

+static inline bool pte_decrypted(pte_t pte)
+{
+ return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
#define pmd_dirty pmd_dirty
static inline bool pmd_dirty(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages);
int set_memory_np(unsigned long addr, int numpages);
int set_memory_p(unsigned long addr, int numpages);
int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
int set_memory_encrypted(unsigned long addr, int numpages);
int set_memory_decrypted(unsigned long addr, int numpages);
+
int set_memory_np_noalias(unsigned long addr, int numpages);
int set_memory_nonglobal(unsigned long addr, int numpages);
int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return ret;
}

+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+ if (!wait)
+ return down_write_trylock(&mem_enc_lock);
+
+ down_write(&mem_enc_lock);
+
+ return true;
+}
+
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
+ int ret = 0;

- return 0;
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ if (!down_read_trylock(&mem_enc_lock))
+ return -EBUSY;
+
+ ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+ up_read(&mem_enc_lock);
+ }
+
+ return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
--
2.43.0



2024-05-31 15:16:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On Tue, May 28, 2024 at 12:55:14PM +0300, Kirill A. Shutemov wrote:
> +static void tdx_kexec_finish(void)
> +{
> + unsigned long addr, end;
> + long found = 0, shared;
> +
> + lockdep_assert_irqs_disabled();
> +
> + addr = PAGE_OFFSET;
> + end = PAGE_OFFSET + get_max_mapped();
> +
> + while (addr < end) {
> + unsigned long size;
> + unsigned int level;
> + pte_t *pte;
> +
> + pte = lookup_address(addr, &level);
> + size = page_level_size(level);
> +
> + if (pte && pte_decrypted(*pte)) {
> + int pages = size / PAGE_SIZE;
> +
> + /*
> + * Touching memory with shared bit set triggers implicit
> + * conversion to shared.
> + *
> + * Make sure nobody touches the shared range from
> + * now on.
> + */
> + set_pte(pte, __pte(0));
> +

Format the below into a comment here:

/*

The only thing one can do at this point on failure is panic. It is
reasonable to proceed, especially for the crash case because the
kexec-ed kernel is using a different page table so there won't be
a mismatch between shared/private marking of the page so it doesn't
matter.

Also, even if the failure is real and the page cannot be touched as
private, the kdump kernel will boot fine as it uses pre-reserved memory.
What happens next depends on what the dumping process does and there's
a reasonable chance to produce useful dump on crash.

Regardless, the print leaves a trace in the log to give a clue for
debug.

One possible reason for the failure is if kdump raced with memory
conversion. In this case shared bit in page table got set (or not
cleared form shared->private conversion), but the page is actually
private. So this failure is not going to affect the kexec'ed kernel.

*/

<---

> + if (!tdx_enc_status_changed(addr, pages, true)) {
> + pr_err("Failed to unshare range %#lx-%#lx\n",
> + addr, addr + size);
> + }
> +
> + found += pages;
> + }
> +
> + addr += size;
> + }
> +
> + __flush_tlb_all();
> +
> + shared = atomic_long_read(&nr_shared);
> + if (shared != found) {
> + pr_err("shared page accounting is off\n");
> + pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> + }
> +}

..

> static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
> {
> - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> - return __set_memory_enc_pgtable(addr, numpages, enc);
> + int ret = 0;
>
> - return 0;
> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> + if (!down_read_trylock(&mem_enc_lock))
> + return -EBUSY;
> +
> + ret = __set_memory_enc_pgtable(addr, numpages, enc);
> +
> + up_read(&mem_enc_lock);
> + }

So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
that code here to take the lock only on TDX, where you want it, not on
the others.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-31 17:35:07

by Kalra, Ashish

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

Hello Boris,

On 5/31/2024 10:14 AM, Borislav Petkov wrote:
>> static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>> {
>> - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> - return __set_memory_enc_pgtable(addr, numpages, enc);
>> + int ret = 0;
>>
>> - return 0;
>> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> + if (!down_read_trylock(&mem_enc_lock))
>> + return -EBUSY;
>> +
>> + ret = __set_memory_enc_pgtable(addr, numpages, enc);
>> +
>> + up_read(&mem_enc_lock);
>> + }
> So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
> that code here to take the lock only on TDX, where you want it, not on
> the others.

SNP guest kexec patches are based on top of this patch-series and SNP
guests also need this exclusive mem_enc_lock protection, so
CC_ATTR_MEM_ENCRYPT makes sense to be used here.

Thanks, Ashish


2024-05-31 18:25:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On Fri, May 31, 2024 at 12:34:49PM -0500, Kalra, Ashish wrote:
> SNP guest kexec patches are based on top of this patch-series and SNP guests
> also need this exclusive mem_enc_lock protection, so CC_ATTR_MEM_ENCRYPT
> makes sense to be used here.

Well, for the future, I'd encourage you to always send an Acked-by: you
or Reviewed-by: you as a reply to such patches so that it is clear that
such a change is desired.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-02 14:21:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On Fri, May 31, 2024 at 05:14:42PM +0200, Borislav Petkov wrote:
> On Tue, May 28, 2024 at 12:55:14PM +0300, Kirill A. Shutemov wrote:
> > +static void tdx_kexec_finish(void)
> > +{
> > + unsigned long addr, end;
> > + long found = 0, shared;
> > +
> > + lockdep_assert_irqs_disabled();
> > +
> > + addr = PAGE_OFFSET;
> > + end = PAGE_OFFSET + get_max_mapped();
> > +
> > + while (addr < end) {
> > + unsigned long size;
> > + unsigned int level;
> > + pte_t *pte;
> > +
> > + pte = lookup_address(addr, &level);
> > + size = page_level_size(level);
> > +
> > + if (pte && pte_decrypted(*pte)) {
> > + int pages = size / PAGE_SIZE;
> > +
> > + /*
> > + * Touching memory with shared bit set triggers implicit
> > + * conversion to shared.
> > + *
> > + * Make sure nobody touches the shared range from
> > + * now on.
> > + */
> > + set_pte(pte, __pte(0));
> > +
>
> Format the below into a comment here:
>
> /*
>
> The only thing one can do at this point on failure is panic. It is
> reasonable to proceed, especially for the crash case because the
> kexec-ed kernel is using a different page table so there won't be
> a mismatch between shared/private marking of the page so it doesn't
> matter.

Page tables would not make a difference here. We will switch to identity
mappings soon. And kexec-ed kernel will build new page tables from
scratch.

I will drop the part after "It is reasonable to proceed".


--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-02 14:23:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second, kexec-ed kernel has no idea what memory is converted this
way. It only sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On kexec walk direct mapping and convert all shared memory back to
private. It makes all RAM private again and second kernel may use it
normally.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Rick Edgecombe <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Tested-by: Tao Liu <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 90 +++++++++++++++++++++++++++++++
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/set_memory.h | 3 ++
arch/x86/mm/pat/set_memory.c | 41 ++++++++++++--
4 files changed, 136 insertions(+), 3 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..afd71bc6eb02 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/kexec.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -14,6 +15,7 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/set_memory.h>

/* MMIO direction */
#define EPT_READ 0
@@ -831,6 +833,91 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
return 0;
}

+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+ /*
+ * Crash kernel reaches here with interrupts disabled: can't wait for
+ * conversions to finish.
+ *
+ * If race happened, just report and proceed.
+ */
+ if (!set_memory_enc_stop_conversion(!crash))
+ pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+ unsigned long addr, end;
+ long found = 0, shared;
+
+ lockdep_assert_irqs_disabled();
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ if (pte && pte_decrypted(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ /*
+ * Touching memory with shared bit set triggers implicit
+ * conversion to shared.
+ *
+ * Make sure nobody touches the shared range from
+ * now on.
+ */
+ set_pte(pte, __pte(0));
+
+ /*
+ * The only thing one can do at this point on failure
+ * is panic. It is reasonable to proceed.
+ *
+ * Also, even if the failure is real and the page cannot
+ * be touched as private, the kdump kernel will boot
+ * fine as it uses pre-reserved memory. What happens
+ * next depends on what the dumping process does and
+ * there's a reasonable chance to produce useful dump
+ * on crash.
+ *
+ * Regardless, the print leaves a trace in the log to
+ * give a clue for debug.
+ *
+ * One possible reason for the failure is if kdump raced
+ * with memory conversion. In this case shared bit in
+ * page table got set (or not cleared) during
+ * shared<->private conversion, but the page is actually
+ * private. So this failure is not going to affect the
+ * kexec'ed kernel.
+ */
+ if (!tdx_enc_status_changed(addr, pages, true)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ found += pages;
+ }
+
+ addr += size;
+ }
+
+ __flush_tlb_all();
+
+ shared = atomic_long_read(&nr_shared);
+ if (shared != found) {
+ pr_err("shared page accounting is off\n");
+ pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+ }
+}
+
void __init tdx_early_init(void)
{
struct tdx_module_args args = {
@@ -890,6 +977,9 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;

+ x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
+ x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
+
/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
* bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte)
return pte_flags(pte) & _PAGE_ACCESSED;
}

+static inline bool pte_decrypted(pte_t pte)
+{
+ return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
#define pmd_dirty pmd_dirty
static inline bool pmd_dirty(pmd_t pmd)
{
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages);
int set_memory_np(unsigned long addr, int numpages);
int set_memory_p(unsigned long addr, int numpages);
int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
int set_memory_encrypted(unsigned long addr, int numpages);
int set_memory_decrypted(unsigned long addr, int numpages);
+
int set_memory_np_noalias(unsigned long addr, int numpages);
int set_memory_nonglobal(unsigned long addr, int numpages);
int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return ret;
}

+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+ if (!wait)
+ return down_write_trylock(&mem_enc_lock);
+
+ down_write(&mem_enc_lock);
+
+ return true;
+}
+
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
+ int ret = 0;

- return 0;
+ if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+ if (!down_read_trylock(&mem_enc_lock))
+ return -EBUSY;
+
+ ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+ up_read(&mem_enc_lock);
+ }
+
+ return ret;
}

int set_memory_encrypted(unsigned long addr, int numpages)
--
2.43.0


2024-06-03 08:38:57

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> + /*
> + * The only thing one can do at this point on failure
> + * is panic. It is reasonable to proceed.

It makes even less sense now: panic() means "all stops and we die" and
you say it is reasonable to proceed.

I'm confused.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-04 15:34:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Mon, Jun 03, 2024 at 10:37:54AM +0200, Borislav Petkov wrote:
> On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> > + /*
> > + * The only thing one can do at this point on failure
> > + * is panic. It is reasonable to proceed.
>
> It makes even less sense now: panic() means "all stops and we die" and
> you say it is reasonable to proceed.
>
> I'm confused.

Right.

What about the comment below?

/*
* One possible reason for the failure is if kexec raced
* with memory conversion. In this case shared bit in
* page table got set (or not cleared) during
* shared<->private conversion, but the page is actually
* private. So this failure is not going to affect the
* kexec'ed kernel.
*
* The only thing one can do at this point on failure
* at this point is panic. In absence of better options,
* it is reasonable to proceed, hoping the failure is a
* benign shared bit mismatch due to the race.
*
* Also, even if the failure is real and the page cannot
* be touched as private, the kdump kernel will boot
* fine as it uses pre-reserved memory. What happens
* next depends on what the dumping process does and
* there's a reasonable chance to produce useful dump
* on crash.
*
* Regardless, the print leaves a trace in the log to
* give a clue for debug.
*/

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-04 15:50:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On 6/4/24 08:32, Kirill A. Shutemov wrote:
> What about the comment below?
>
> /*
> * One possible reason for the failure is if kexec raced
> * with memory conversion. In this case shared bit in
> * page table got set (or not cleared) during
> * shared<->private conversion, but the page is actually
> * private. So this failure is not going to affect the
> * kexec'ed kernel.
> *
> * The only thing one can do at this point on failure
> * at this point is panic. In absence of better options,
> * it is reasonable to proceed, hoping the failure is a
> * benign shared bit mismatch due to the race.
> *
> * Also, even if the failure is real and the page cannot
> * be touched as private, the kdump kernel will boot
> * fine as it uses pre-reserved memory. What happens
> * next depends on what the dumping process does and
> * there's a reasonable chance to produce useful dump
> * on crash.
> *
> * Regardless, the print leaves a trace in the log to
> * give a clue for debug.
> */

It's rambling too much for my taste.

Let's boil this down to what matters:

1. Failures to change encryption status here can lead a future kernel
to touch shared memory with a private mapping
2. That causes an immediate unrecoverable guest shutdown (right?)
3. kdump kernels should not be affected since they have their own
memory ranges and its encryption status is not being tweawked here
4. The pr_err() may help make some sense out of #2 when it happens

I'm not sure the reason behind the failed conversion is important here.

I wouldn't mention panic().

We don't need to opine about what the next kernel might or might not do.

2024-06-04 16:15:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Tue, Jun 04, 2024 at 08:47:22AM -0700, Dave Hansen wrote:
> On 6/4/24 08:32, Kirill A. Shutemov wrote:
> > What about the comment below?
> >
> > /*
> > * One possible reason for the failure is if kexec raced
> > * with memory conversion. In this case shared bit in
> > * page table got set (or not cleared) during
> > * shared<->private conversion, but the page is actually
> > * private. So this failure is not going to affect the
> > * kexec'ed kernel.
> > *
> > * The only thing one can do at this point on failure
> > * at this point is panic. In absence of better options,
> > * it is reasonable to proceed, hoping the failure is a
> > * benign shared bit mismatch due to the race.
> > *
> > * Also, even if the failure is real and the page cannot
> > * be touched as private, the kdump kernel will boot
> > * fine as it uses pre-reserved memory. What happens
> > * next depends on what the dumping process does and
> > * there's a reasonable chance to produce useful dump
> > * on crash.
> > *
> > * Regardless, the print leaves a trace in the log to
> > * give a clue for debug.
> > */
>
> It's rambling too much for my taste.
>
> Let's boil this down to what matters:
>
> 1. Failures to change encryption status here can lead a future kernel
> to touch shared memory with a private mapping
> 2. That causes an immediate unrecoverable guest shutdown (right?)

Right.

> 3. kdump kernels should not be affected since they have their own
> memory ranges and its encryption status is not being tweawked here
> 4. The pr_err() may help make some sense out of #2 when it happens
>
> I'm not sure the reason behind the failed conversion is important here.

The important part is that failure can be benign. It explains "can" in #1.
But okay.

> I wouldn't mention panic().
>
> We don't need to opine about what the next kernel might or might not do.

Is this any better?

/*
* If tdx_enc_status_changed() fails, it leaves memory
* in an unknown state. If the memory remains shared,
* it can result in an unrecoverable guest shutdown on
* the first accessed through a private mapping.
*
* The kdump kernel boot is not impacted as it uses
* a pre-reserved memory range that is always private.
* However, gathering crash information could lead to
* a crash if it accesses unconverted memory through
* a private mapping.
*
* pr_err() may assist in understanding such crashes.
*/
--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-04 16:29:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On 5/28/24 02:55, Kirill A. Shutemov wrote:
> +/* Stop new private<->shared conversions */
> +static void tdx_kexec_begin(bool crash)
> +{
> + /*
> + * Crash kernel reaches here with interrupts disabled: can't wait for
> + * conversions to finish.
> + *
> + * If race happened, just report and proceed.
> + */
> + if (!set_memory_enc_stop_conversion(!crash))
> + pr_warn("Failed to stop shared<->private conversions\n");
> +}

I don't like having to pass 'crash' in here.

If interrupts are the problem we have ways of testing for those directly.

If it's being in an oops that's a problem, we have 'oops_in_progress'
for that.

In other words, I'd much rather this function (or better yet
set_memory_enc_stop_conversion() itself) use some existing API to change
its behavior in a crash rather than have the context be passed down and
twiddled through several levels of function calls.

There are a ton of these in the console code:

if (oops_in_progress)
foo_trylock();
else
foo_lock();

To me, that's a billion times more clear than a 'wait' argument that
gets derives from who-knows-what that I have to trace through ten levels
of function calls.



2024-06-04 18:31:54

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> /*
> * If tdx_enc_status_changed() fails, it leaves memory
> * in an unknown state. If the memory remains shared,
> * it can result in an unrecoverable guest shutdown on
> * the first accessed through a private mapping.

"access"

So this sentence above can go too, right?

Because that comment is in tdx_kexec_finish() and we're basically going
off to kexec. So can a guest even access it through a private mapping?
We're shutting down so nothing is running anymore...

> * The kdump kernel boot is not impacted as it uses
> * a pre-reserved memory range that is always private.
> * However, gathering crash information could lead to
> * a crash if it accesses unconverted memory through
> * a private mapping.

When does the kexec kernel even get such a private mapping? It is not
even up yet...

> * pr_err() may assist in understanding such crashes.

"Print error info in order to leave bread crumbs for debugging." is what
I'd say.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 16:05:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On 6/5/24 05:43, Kirill A. Shutemov wrote:
> Okay fair enough. Check out the fixup below. Is it what you mean?

Yes. Much better.

> One other thing I realized is that these callback are dead code if kernel
> compiled without kexec support. Do we want them to be wrapped with
> #ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.
>
> Any better ideas?

The other callbacks don't have #ifdefs either and they're dependent on
memory encryption as far as I can tell.

I think a simple:

if (IS_ENABLED(COFNIG_KEXEC_CORE))
return;

in the top of the callbacks will result in a tiny little stub function
when kexec is disabled. So the bloat will be limited to kernels that
have TDX compiled in but kexec compiled out (probably never). The bloat
will be two callback pointer, one tiny stub function, and a quick
call/return in a slow path.

I think that probably ends up being a few dozen bytes of bloat in kernel
text for a "probably never" config.

2024-06-05 16:25:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> If a page can be accessed via private mapping is determined by the
> presence in Secure EPT. This state persist across kexec.

I just love it how I tickle out details each time I touch this comment
because we three can't write a single concise and self-contained
explanation. :-(

Ok, next version:

"Private mappings persist across kexec. If tdx_enc_status_changed() fails
in the first kernel, it leaves memory in an unknown state.

If that memory remains shared, accessing it in the *next* kernel through
a private mapping will result in an unrecoverable guest shutdown.

The kdump kernel boot is not impacted as it uses a pre-reserved memory
range that is always private. However, gathering crash information
could lead to a crash if it accesses unconverted memory through
a private mapping which is possible when accessing that memory through
/proc/vmcore, for example.

In all cases, print error info in order to leave enough bread crumbs for
debugging."

I think this is getting in the right direction as it actually makes
sense now.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-06-05 16:53:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Tue, Jun 04, 2024 at 08:05:54PM +0200, Borislav Petkov wrote:
> On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> > /*
> > * If tdx_enc_status_changed() fails, it leaves memory
> > * in an unknown state. If the memory remains shared,
> > * it can result in an unrecoverable guest shutdown on
> > * the first accessed through a private mapping.
>
> "access"

Okay.

> So this sentence above can go too, right?

I don't think so.

> Because that comment is in tdx_kexec_finish() and we're basically going
> off to kexec. So can a guest even access it through a private mapping?
> We're shutting down so nothing is running anymore...

This kernel can't. But the next kernel can.

If a page can be accessed via private mapping is determined by the
presence in Secure EPT. This state persist across kexec.

> > * The kdump kernel boot is not impacted as it uses
> > * a pre-reserved memory range that is always private.
> > * However, gathering crash information could lead to
> > * a crash if it accesses unconverted memory through
> > * a private mapping.
>
> When does the kexec kernel even get such a private mapping? It is not
> even up yet...

Crash kernel provides access to this memory via /proc/vmcore. Crash kernel
will assume all memory there is private.

> > * pr_err() may assist in understanding such crashes.
>
> "Print error info in order to leave bread crumbs for debugging." is what
> I'd say.

Okay.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-05 17:03:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11 11/19] x86/tdx: Convert shared memory back to private on kexec

On Tue, Jun 04, 2024 at 09:27:59AM -0700, Dave Hansen wrote:
> On 5/28/24 02:55, Kirill A. Shutemov wrote:
> > +/* Stop new private<->shared conversions */
> > +static void tdx_kexec_begin(bool crash)
> > +{
> > + /*
> > + * Crash kernel reaches here with interrupts disabled: can't wait for
> > + * conversions to finish.
> > + *
> > + * If race happened, just report and proceed.
> > + */
> > + if (!set_memory_enc_stop_conversion(!crash))
> > + pr_warn("Failed to stop shared<->private conversions\n");
> > +}
>
> I don't like having to pass 'crash' in here.
>
> If interrupts are the problem we have ways of testing for those directly.
>
> If it's being in an oops that's a problem, we have 'oops_in_progress'
> for that.
>
> In other words, I'd much rather this function (or better yet
> set_memory_enc_stop_conversion() itself) use some existing API to change
> its behavior in a crash rather than have the context be passed down and
> twiddled through several levels of function calls.
>
> There are a ton of these in the console code:
>
> if (oops_in_progress)
> foo_trylock();
> else
> foo_lock();
>
> To me, that's a billion times more clear than a 'wait' argument that
> gets derives from who-knows-what that I have to trace through ten levels
> of function calls.

Okay fair enough. Check out the fixup below. Is it what you mean?

One other thing I realized is that these callback are dead code if kernel
compiled without kexec support. Do we want them to be wrapped with
#ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.

Any better ideas?

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3d23ea0f5d45..1c5aa036b76b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -834,7 +834,7 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
}

/* Stop new private<->shared conversions */
-static void tdx_kexec_begin(bool crash)
+static void tdx_kexec_begin(void)
{
/*
* Crash kernel reaches here with interrupts disabled: can't wait for
@@ -842,7 +842,7 @@ static void tdx_kexec_begin(bool crash)
*
* If race happened, just report and proceed.
*/
- if (!set_memory_enc_stop_conversion(!crash))
+ if (!set_memory_enc_stop_conversion())
pr_warn("Failed to stop shared<->private conversions\n");
}

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index d490db38db9e..4b2abce2e3e7 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -50,7 +50,7 @@ int set_memory_np(unsigned long addr, int numpages);
int set_memory_p(unsigned long addr, int numpages);
int set_memory_4k(unsigned long addr, int numpages);

-bool set_memory_enc_stop_conversion(bool wait);
+bool set_memory_enc_stop_conversion(void);
int set_memory_encrypted(unsigned long addr, int numpages);
int set_memory_decrypted(unsigned long addr, int numpages);

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index b0f313278967..213cf5379a5a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -152,8 +152,6 @@ struct x86_init_acpi {
* @enc_kexec_begin Begin the two-step process of converting shared memory back
* to private. It stops the new conversions from being started
* and waits in-flight conversions to finish, if possible.
- * The @crash parameter denotes whether the function is being
- * called in the crash shutdown path.
* @enc_kexec_finish Finish the two-step process of converting shared memory to
* private. All memory is private after the call when
* the function returns.
@@ -165,7 +163,7 @@ struct x86_guest {
int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
- void (*enc_kexec_begin)(bool crash);
+ void (*enc_kexec_begin)(void);
void (*enc_kexec_finish)(void);
};

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index fc52ea80cdc8..340af8155658 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -137,7 +137,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
* down and interrupts have been disabled. This allows the callback to
* detect a race with the conversion and report it.
*/
- x86_platform.guest.enc_kexec_begin(true);
+ x86_platform.guest.enc_kexec_begin();
x86_platform.guest.enc_kexec_finish();

crash_save_cpu(regs, safe_smp_processor_id());
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 513809b5b27c..0e0a4cf6b5eb 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -723,7 +723,7 @@ void native_machine_shutdown(void)
* conversions to finish cleanly.
*/
if (kexec_in_progress)
- x86_platform.guest.enc_kexec_begin(false);
+ x86_platform.guest.enc_kexec_begin();

/* Stop the cpus and apics */
#ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 8a79fb505303..82b128d3f309 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -138,7 +138,7 @@ static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool
static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
-static void enc_kexec_begin_noop(bool crash) {}
+static void enc_kexec_begin_noop(void) {}
static void enc_kexec_finish_noop(void) {}
static bool is_private_mmio_noop(u64 addr) {return false; }

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 2a548b65ef5f..443a97e515c0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2240,13 +2240,14 @@ static DECLARE_RWSEM(mem_enc_lock);
*
* Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
* The lock is not released to prevent new conversions from being started.
- *
- * If sleep is not allowed, as in a crash scenario, try to take the lock.
- * Failure indicates that there is a race with the conversion.
*/
-bool set_memory_enc_stop_conversion(bool wait)
+bool set_memory_enc_stop_conversion(void)
{
- if (!wait)
+ /*
+ * In a crash scenario, sleep is not allowed. Try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+ if (oops_in_progress)
return down_write_trylock(&mem_enc_lock);

down_write(&mem_enc_lock);
--
Kiryl Shutsemau / Kirill A. Shutemov

2024-06-06 12:40:02

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv11.1 11/19] x86/tdx: Convert shared memory back to private on kexec

On Wed, Jun 05, 2024 at 06:24:19PM +0200, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> > If a page can be accessed via private mapping is determined by the
> > presence in Secure EPT. This state persist across kexec.
>
> I just love it how I tickle out details each time I touch this comment
> because we three can't write a single concise and self-contained
> explanation. :-(
>
> Ok, next version:
>
> "Private mappings persist across kexec. If tdx_enc_status_changed() fails

s/Private mappings persist /Memory encryption state persists /

> in the first kernel, it leaves memory in an unknown state.
>
> If that memory remains shared, accessing it in the *next* kernel through
> a private mapping will result in an unrecoverable guest shutdown.
>
> The kdump kernel boot is not impacted as it uses a pre-reserved memory
> range that is always private. However, gathering crash information
> could lead to a crash if it accesses unconverted memory through
> a private mapping which is possible when accessing that memory through
> /proc/vmcore, for example.
>
> In all cases, print error info in order to leave enough bread crumbs for
> debugging."
>
> I think this is getting in the right direction as it actually makes
> sense now.

Otherwise looks good to me.

--
Kiryl Shutsemau / Kirill A. Shutemov