2023-06-04 14:57:13

by Kai Huang

[permalink] [raw]
Subject: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

The first few generations of TDX hardware have an erratum. A partial
write to a TDX private memory cacheline will silently "poison" the
line. Subsequent reads will consume the poison and generate a machine
check. According to the TDX hardware spec, neither of these things
should have happened.

== Background ==

Virtually all kernel memory accesses happen in full cachelines. In
practice, writing a "byte" of memory usually reads a 64-byte cacheline
of memory, modifies it, then writes the whole line back. Those
operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller. The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings. The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
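
For illustration only (not part of this patch; the helper name is made
up), one way the CPU can generate such a sub-cacheline write is a
non-temporal store.  The 8-byte store below bypasses the cache and
reaches the memory controller as a partial write:

	static inline void movnti_store_u64(u64 *dst, u64 val)
	{
		/* Non-temporal 8-byte store: a "partial" cacheline write */
		asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
	}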

== Problem ==

A fast warm reset doesn't reset TDX private memory. Kexec() can also
boot into the new kernel directly without resetting memory. Thus if the
old kernel has enabled TDX on a platform with this erratum, the new
kernel may get an unexpected machine check.

Note that without this erratum, a kernel read/write on TDX private
memory never causes a machine check, thus it's OK for the old kernel to
leave TDX private pages as is.

== Solution ==

In short, with this erratum, the kernel needs to explicitly convert all
TDX private pages back to normal to give the new kernel a clean slate
after either a fast warm reset or kexec().

There's no existing infrastructure to track TDX private pages, which
could be PAMT pages, TDX guest private pages, or SEPT (secure EPT)
pages. The latter two are yet to be implemented, so it's not yet clear
how to track them.

It's not feasible to query the TDX module either because VMX has already
been stopped when KVM receives the reboot notifier.

Another option is to blindly convert all memory pages. But this may
bring non-trivial latency to machine reboot and kexec() on large memory
systems (especially when the number of TDX private pages is small); the
eventual solution should be to track TDX private pages and convert only
them. Converting all memory pages is also problematic because not all
pages are mapped as writable in the direct mapping, so doing so would
require switching to a page table which maps all pages as writable.
Such a page table could either be a newly built one, or the identity
mapping table built during kexec(). Using either seems too dramatic,
especially considering the kernel should eventually be able to track
all TDX private pages, in which case the direct mapping can be used
directly.

So for now just convert PAMT pages. Converting TDX guest private pages
and SEPT pages can be added when TDX guest support is added to the
kernel.
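
For illustration (a sketch only, not the exact code added below; the
helper name is made up), "converting a private page back to normal" on
these platforms amounts to overwriting every cacheline of the page
through the direct mapping (KeyID 0) with MOVDIR64B and then fencing
the weakly-ordered stores:

	static void reset_tdx_page(unsigned long phys)
	{
		const void *zero = page_address(ZERO_PAGE(0));
		unsigned long offset;

		/* Overwrite each 64-byte cacheline via the KeyID-0 mapping */
		for (offset = 0; offset < PAGE_SIZE; offset += 64)
			movdir64b(__va(phys + offset), zero);

		/* MOVDIR64B stores are weakly ordered; fence them */
		mb();
	}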

Introduce a new "x86_platform_ops::memory_shutdown()" callback as a
placeholder to convert all TDX private memory, and call it at the end of
machine_shutdown() after all remote cpus have been stopped (thus no more
TDX activities) and all dirty cachelines of TDX private memory have been
flushed (thus no further cacheline writeback).

Implement the default callback as a noop function. In TDX early
boot-time initialization, replace the callback with TDX's own
implementation when the platform has this erratum. In this way only the
platforms with this erratum carry this additional memory conversion
burden.

Signed-off-by: Kai Huang <[email protected]>
---

v10 -> v11:
- New patch

---
 arch/x86/include/asm/x86_init.h |  1 +
 arch/x86/kernel/reboot.c        |  1 +
 arch/x86/kernel/x86_init.c      |  2 ++
 arch/x86/virt/vmx/tdx/tdx.c     | 57 +++++++++++++++++++++++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 88085f369ff6..d2c6742b185a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -299,6 +299,7 @@ struct x86_platform_ops {
void (*get_wallclock)(struct timespec64 *ts);
int (*set_wallclock)(const struct timespec64 *ts);
void (*iommu_shutdown)(void);
+ void (*memory_shutdown)(void);
bool (*is_untracked_pat_range)(u64 start, u64 end);
void (*nmi_init)(void);
unsigned char (*get_nmi_reason)(void);
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index b3d0e015dae2..6aadfec8df7a 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -720,6 +720,7 @@ void native_machine_shutdown(void)

#ifdef CONFIG_X86_64
x86_platform.iommu_shutdown();
+ x86_platform.memory_shutdown();
#endif
}

diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index d82f4fa2f1bf..344250b35a5d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -31,6 +31,7 @@ void x86_init_noop(void) { }
void __init x86_init_uint_noop(unsigned int unused) { }
static int __init iommu_init_noop(void) { return 0; }
static void iommu_shutdown_noop(void) { }
+static void memory_shutdown_noop(void) { }
bool __init bool_x86_init_noop(void) { return false; }
void x86_op_int_noop(int cpu) { }
int set_rtc_noop(const struct timespec64 *now) { return -EINVAL; }
@@ -142,6 +143,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
.get_wallclock = mach_get_cmos_time,
.set_wallclock = mach_set_cmos_time,
.iommu_shutdown = iommu_shutdown_noop,
+ .memory_shutdown = memory_shutdown_noop,
.is_untracked_pat_range = is_ISA_range,
.nmi_init = default_nmi_init,
.get_nmi_reason = default_get_nmi_reason,
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index 8ff07256a515..0aa413b712e8 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
tdmr_pamt_base += pamt_size[pgsz];
}

+ /*
+ * tdx_memory_shutdown() also reads TDMR's PAMT during
+ * kexec() or reboot, which could happen at anytime, even
+ * during this particular code. Make sure pamt_4k_base
+ * is firstly set otherwise tdx_memory_shutdown() may
+ * get an invalid PAMT base when it sees a valid number
+ * of PAMT pages.
+ */
tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
@@ -1318,6 +1326,46 @@ static struct notifier_block tdx_memory_nb = {
.notifier_call = tdx_memory_notifier,
};

+static void tdx_memory_shutdown(void)
+{
+ /*
+ * Convert all TDX private pages back to normal if the platform
+ * has "partial write machine check" erratum.
+ *
+ * For now there's no existing infrastructure to tell whether
+ * a page is TDX private memory. Using SEAMCALL to query TDX
+ * module isn't feasible either because: 1) VMX has been turned
+ * off by reaching here so SEAMCALL cannot be made; 2) Even if
+ * SEAMCALL could be made, the result from the TDX module may not
+ * be accurate (e.g., a remote CPU can be stopped while the kernel
+ * is in the middle of reclaiming one TDX private page and doing
+ * MOVDIR64B).
+ *
+ * One solution could be just converting all memory pages, but
+ * this may bring non-trivial latency on large memory systems
+ * (especially when the number of TDX private pages is small).
+ * Eventually the kernel should track TDX private pages and
+ * only convert those.
+ *
+ * Also, not all pages are mapped as writable in the direct mapping,
+ * thus it's problematic to do so. It could be done by switching
+ * to the identity mapping page table built for kexec(), which
+ * maps all pages as writable, but the complexity looks like overkill.
+ *
+ * Thus instead of doing something dramatic to convert all pages,
+ * only convert PAMTs, as for now TDX private pages can only be
+ * PAMT pages. Converting TDX guest private pages and Secure
+ * EPT pages can be added later when the kernel has a proper way
+ * to track these pages.
+ *
+ * All other cpus are already dead, thus it's safe to read TDMRs
+ * to find PAMTs w/o holding any kind of locking here.
+ */
+ WARN_ON_ONCE(num_online_cpus() != 1);
+
+ tdmrs_reset_pamt_all(&tdx_tdmr_list);
+}
+
static int __init tdx_init(void)
{
u32 tdx_keyid_start, nr_tdx_keyids;
@@ -1356,6 +1404,15 @@ static int __init tdx_init(void)
tdx_guest_keyid_start = ++tdx_keyid_start;
tdx_nr_guest_keyids = --nr_tdx_keyids;

+ /*
+ * On platforms with this erratum, all TDX private pages need to
+ * be converted back to normal before rebooting (warm reset) or
+ * before kexec() boots into the new kernel, otherwise the (new)
+ * kernel may get an unexpected SRAR machine check exception.
+ */
+ if (boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
+ x86_platform.memory_shutdown = tdx_memory_shutdown;
+
return 0;
no_tdx:
return -ENODEV;
--
2.40.1



2023-06-09 13:40:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 8ff07256a515..0aa413b712e8 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> tdmr_pamt_base += pamt_size[pgsz];
> }
>
> + /*
> + * tdx_memory_shutdown() also reads TDMR's PAMT during
> + * kexec() or reboot, which could happen at anytime, even
> + * during this particular code. Make sure pamt_4k_base
> + * is firstly set otherwise tdx_memory_shutdown() may
> + * get an invalid PAMT base when it sees a valid number
> + * of PAMT pages.
> + */

Hmm? What prevents compiler from messing this up. It can reorder as it
wishes, no?

Maybe add proper locking? Anything that prevents preemption would do,
right?

> tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-12 03:15:52

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Fri, 2023-06-09 at 16:23 +0300, [email protected] wrote:
> On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 8ff07256a515..0aa413b712e8 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > tdmr_pamt_base += pamt_size[pgsz];
> > }
> >
> > + /*
> > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > + * kexec() or reboot, which could happen at anytime, even
> > + * during this particular code. Make sure pamt_4k_base
> > + * is firstly set otherwise tdx_memory_shutdown() may
> > + * get an invalid PAMT base when it sees a valid number
> > + * of PAMT pages.
> > + */
>
> Hmm? What prevents compiler from messing this up. It can reorder as it
> wishes, no?

Hmm.. Right. Sorry I missed.

>
> Maybe add a proper locking? Anything that prevent preemption would do,
> right?
>
> > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
>

I think a simple memory barrier will do. How does below look?

--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
* tdx_memory_shutdown() also reads TDMR's PAMT during
* kexec() or reboot, which could happen at anytime, even
* during this particular code. Make sure pamt_4k_base
- * is firstly set otherwise tdx_memory_shutdown() may
- * get an invalid PAMT base when it sees a valid number
- * of PAMT pages.
+ * is firstly set and place a __mb() after it otherwise
+ * tdx_memory_shutdown() may get an invalid PAMT base
+ * when it sees a valid number of PAMT pages.
*/
tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+ __mb();

2023-06-12 08:31:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> On Fri, 2023-06-09 at 16:23 +0300, [email protected] wrote:
> > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > index 8ff07256a515..0aa413b712e8 100644
> > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > > tdmr_pamt_base += pamt_size[pgsz];
> > > }
> > >
> > > + /*
> > > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > > + * kexec() or reboot, which could happen at anytime, even
> > > + * during this particular code. Make sure pamt_4k_base
> > > + * is firstly set otherwise tdx_memory_shutdown() may
> > > + * get an invalid PAMT base when it sees a valid number
> > > + * of PAMT pages.
> > > + */
> >
> > Hmm? What prevents compiler from messing this up. It can reorder as it
> > wishes, no?
>
> Hmm.. Right. Sorry I missed.
>
> >
> > Maybe add a proper locking? Anything that prevent preemption would do,
> > right?
> >
> > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> >
>
> I think a simple memory barrier will do. How does below look?
>
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> * tdx_memory_shutdown() also reads TDMR's PAMT during
> * kexec() or reboot, which could happen at anytime, even
> * during this particular code. Make sure pamt_4k_base
> - * is firstly set otherwise tdx_memory_shutdown() may
> - * get an invalid PAMT base when it sees a valid number
> - * of PAMT pages.
> + * is firstly set and place a __mb() after it otherwise
> + * tdx_memory_shutdown() may get an invalid PAMT base
> + * when it sees a valid number of PAMT pages.
> */
> tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> + __mb();

If you want to play with barriers, assign pamt_4k_base the last with
smp_store_release() and read it first in tdmr_get_pamt() with
smp_load_acquire(). If it is non-zero, all pamt_* fields are valid.
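
Something like this (untested sketch, only the release/acquire pairing
is the point):

	/* in tdmr_set_up_pamt(), publish pamt_4k_base last: */
	tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
	tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
	...
	smp_store_release(&tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);

	/* in tdmr_get_pamt(), pairs with the store-release above: */
	u64 pamt_base = smp_load_acquire(&tdmr->pamt_4k_base);

	if (!pamt_base)
		return;		/* PAMT not published yet */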

Or just drop this non-sense and use a spin lock for serialization.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-12 10:51:03

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-12 at 10:58 +0300, [email protected] wrote:
> On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> > On Fri, 2023-06-09 at 16:23 +0300, [email protected] wrote:
> > > On Mon, Jun 05, 2023 at 02:27:31AM +1200, Kai Huang wrote:
> > > > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > > > index 8ff07256a515..0aa413b712e8 100644
> > > > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > > > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > > > @@ -587,6 +587,14 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > > > tdmr_pamt_base += pamt_size[pgsz];
> > > > }
> > > >
> > > > + /*
> > > > + * tdx_memory_shutdown() also reads TDMR's PAMT during
> > > > + * kexec() or reboot, which could happen at anytime, even
> > > > + * during this particular code. Make sure pamt_4k_base
> > > > + * is firstly set otherwise tdx_memory_shutdown() may
> > > > + * get an invalid PAMT base when it sees a valid number
> > > > + * of PAMT pages.
> > > > + */
> > >
> > > Hmm? What prevents compiler from messing this up. It can reorder as it
> > > wishes, no?
> >
> > Hmm.. Right. Sorry I missed.
> >
> > >
> > > Maybe add a proper locking? Anything that prevent preemption would do,
> > > right?
> > >
> > > > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > > > tdmr->pamt_4k_size = pamt_size[TDX_PS_4K];
> > > > tdmr->pamt_2m_base = pamt_base[TDX_PS_2M];
> > >
> >
> > I think a simple memory barrier will do. How does below look?
> >
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -591,11 +591,12 @@ static int tdmr_set_up_pamt(struct tdmr_info *tdmr,
> > * tdx_memory_shutdown() also reads TDMR's PAMT during
> > * kexec() or reboot, which could happen at anytime, even
> > * during this particular code. Make sure pamt_4k_base
> > - * is firstly set otherwise tdx_memory_shutdown() may
> > - * get an invalid PAMT base when it sees a valid number
> > - * of PAMT pages.
> > + * is firstly set and place a __mb() after it otherwise
> > + * tdx_memory_shutdown() may get an invalid PAMT base
> > + * when it sees a valid number of PAMT pages.
> > */
> > tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
> > + __mb();
>
> If you want to play with barriers, assign pamt_4k_base the last with
> smp_store_release() and read it first in tdmr_get_pamt() with
> smp_load_acquire(). If it is non-zero, all pamt_* fields are valid.
>
> Or just drop this non-sense and use a spin lock for serialization.
>

We don't need to guarantee that when pamt_4k_base is valid, all other
pamt_* are valid. Instead, we need to guarantee that when (at least)
_one_ of pamt_*_size is valid, pamt_4k_base is valid.

For example,

pamt_4k_base -> valid
pamt_4k_size -> invalid (0)
pamt_2m_size -> invalid
pamt_1g_size -> invalid

and
pamt_4k_base -> valid
pamt_4k_size -> valid
pamt_2m_size -> invalid
pamt_1g_size -> invalid

are both OK.

The reason is the PAMTs are only written by the TDX module in init_tdmrs(). So
if tdx_memory_shutdown() sees a part of PAMT (the second case above), those PAMT
pages are not yet TDX private pages, thus converting part of PAMT is fine.

The invalid case is when any pamt_*_size is valid but pamt_4k_base is
invalid, e.g.:

pamt_4k_base -> invalid
pamt_4k_size -> valid
pamt_2m_size -> invalid
pamt_1g_size -> invalid

as in this case tdx_memory_shutdown() will convert an incorrect (not just
partial) PAMT area.

So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
will be seen by other cpus.

Does it make sense?

2023-06-12 12:27:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> Does it make sense?

I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
would be better), but it is not justified from complexity PoV. This
lockless exercise gave me a pause to understand.

Lockless doesn't buy you anything here, only increases complexity.
Just take a lock.

Kernel is big. I'm sure you'll find a better opportunity to be clever
about serialization :P

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-12 13:39:47

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

From: [email protected]
> Sent: 12 June 2023 12:49
>
> On Mon, Jun 12, 2023 at 10:27:44AM +0000, Huang, Kai wrote:
> > Does it make sense?
>
> I understand your logic. AFAICS, it is correct (smp_mb() instead of __mb()
> would be better), but it is not justified from complexity PoV.

Given that x86 performs writes pretty much in code order, do you need
anything more than a compiler barrier?

> This lockless exercise gave me a pause to understand.
>
> Lockless doesn't buy you anything here, only increases complexity.
> Just take a lock.

Indeed...

David



2023-06-12 14:26:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/12/23 03:27, Huang, Kai wrote:
> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> will be seen by other cpus.
>
> Does it make sense?

Just use a normal old atomic_t or set_bit()/test_bit(). They have
built-in memory barriers and are less likely to get botched.
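
E.g. (just a sketch; the flag is made up for illustration):

	static unsigned long tdx_pamt_ready;

	/* after the tdmr->pamt_* fields have been written: */
	set_bit(0, &tdx_pamt_ready);

	/* before reading them in tdx_memory_shutdown(): */
	if (!test_bit(0, &tdx_pamt_ready))
		return;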

2023-06-13 01:14:05

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> On 6/12/23 03:27, Huang, Kai wrote:
> > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > will be seen by other cpus.
> >
> > Does it make sense?
>
> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers are are less likely to get botched.

Thanks for the suggestion.

Hi Dave, Kirill,

I'd like to check with you whether we should introduce a mechanism to
track TDX private pages for both this patch and the next.

As you can see this patch only deals with PAMT pages due to a couple of
reasons mentioned in the changelog. The next MCE patch handles all TDX
private pages, but it uses SEAMCALL in the #MC handler. Using SEAMCALL
has two cons: 1) it is slow (probably doesn't matter, though); 2) it
brings additional risk of triggering further #MC inside the TDX module,
although such risk should be a theoretical thing.

If we introduce a helper to mark a page as TDX private page, then both above
patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
this patch (we will need a way to loop all TDX-usable memory pages, but this
needs to be done anyway with TDX guests). I believe eventually we can end up
with less code.

In terms of how to do it: for PAMT pages, we can set page->private to a
TDX magic number because they come out of the page allocator directly.
Secure-EPT pages are like PAMT pages too. For TDX guest private pages,
Sean is moving to implement KVM's own pseudo filesystem so they will
have a unique mapping to identify them.

https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c

And my thinking is in this TDX host series, we can just handle PAMT pages. Both
secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
I think eventually we can have a function like below to tell whether a page is
TDX private page:

bool page_is_tdx_private(struct page *page)
{
	if (page->private == TDX_PRIVATE_MAGIC)
		return true;

	if (!page_mapping(page))
		return false;

	return page_mapping(page)->a_ops == &kvm_gmem_ops;
}
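
(Marking itself would happen right after the PAMT pages are allocated,
roughly like below -- the loop variables are made up for illustration:)

	/* right after allocating the PAMT pages from the page allocator: */
	for (i = 0; i < pamt_npages; i++)
		set_page_private(pamt_page + i, TDX_PRIVATE_MAGIC);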

How does this sound? Or any other comments? Thanks!

2023-06-13 11:12:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote:
> On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > On 6/12/23 03:27, Huang, Kai wrote:
> > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > will be seen by other cpus.
> > >
> > > Does it make sense?
> >
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers are are less likely to get botched.
>
> Thanks for the suggestion.
>
> Hi Dave, Kirill,
>
> I'd like to check with you that whether we should introduce a mechanism to track
> TDX private pages for both this patch and the next.
>
> As you can see this patch only deals PAMT pages due to couple of reasons that
> mnentioned in the changelog. The next MCE patch handles all TDX private pages,
> but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is
> slow (probably doesn't matter, though); 2) it brings additional risk of
> triggering further #MC inside TDX module, although such risk should be a
> theoretical thing.
>
> If we introduce a helper to mark a page as TDX private page, then both above
> patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
> this patch (we will need a way to loop all TDX-usable memory pages, but this
> needs to be done anyway with TDX guests). I believe eventually we can end up
> with less code.
>
> In terms of how to do, for PAMT pages, we can set page->private to a TDX magic
> number because they come out of page allocator directly. Secure-EPT pages are
> like PAMT pages too. For TDX guest private pages, Sean is moving to implement
> KVM's own pseudo filesystem so they will have a unique mapping to identify.
>
> https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c
>
> And my thinking is in this TDX host series, we can just handle PAMT pages. Both
> secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
> I think eventually we can have a function like below to tell whether a page is
> TDX private page:
>
> bool page_is_tdx_private(struct page *page)
> {
> if (page->private == TDX_PRIVATE_MAGIC)
> return true;
>
> if (!page_mapping(page))
> return false;
>
> return page_mapping(page)->a_ops == &kvm_gmem_ops;
> }
>
> How does this sound? Or any other comments? Thanks!

If you are going to take this path it has to be supported natively by
kvm_gmem_: it has to provide an API for that. You should not assume that
page->private is free to use. It is owned by kvm_gmem.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-13 14:56:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/12/23 17:51, Huang, Kai wrote:
> If we introduce a helper to mark a page as TDX private page,

Let me get this right: you have working, functional code for a
highly-unlikely scenario (kernel bugs or even more rare hardware
errors). But, you want to optimize this super-rare case? It's not fast
enough?

Is there any other motivation here that I'm missing?


2023-06-13 23:40:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> On 6/12/23 17:51, Huang, Kai wrote:
> > If we introduce a helper to mark a page as TDX private page,
>
> Let me get this right: you have working, functional code for a
> highly-unlikely scenario (kernel bugs or even more rare hardware
> errors). But, you want to optimize this super-rare case? It's not fast
> enough?
>
> Is there any other motivation here that I'm missing?
>

No it's not about speed. The motivation is to have common code that
yields fewer lines of code, though I don't have a clear number of how
many LoC can be reduced.

2023-06-14 00:50:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote:
> On 6/13/23 16:18, Huang, Kai wrote:
> > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> > > On 6/12/23 17:51, Huang, Kai wrote:
> > > > If we introduce a helper to mark a page as TDX private page,
> > > Let me get this right: you have working, functional code for a
> > > highly-unlikely scenario (kernel bugs or even more rare hardware
> > > errors). But, you want to optimize this super-rare case? It's not fast
> > > enough?
> > >
> > > Is there any other motivation here that I'm missing?
> > >
> > No it's not about speed. The motivation is to have a common code to yield less
> > line of code, though I don't have clear number of how many LoC can be reduced.
>
> OK, so ... ballpark. How many lines of code are we going to _save_ for
> this super-rare case? 10? 100? 1000?

~50 LoC I guess, certainly < 100.

>
> The upside is saving X lines of code ... somewhere. The downside is
> adding Y lines of code ... somewhere else and maybe breaking things in
> the process.
>
> You've evidently done _some_ kind of calculus in your head to make this
> tradeoff worthwhile. I'd love to hear what your calculus is, even if
> it's just a gut feel.
>
> Could you share your logic here, please?

The logic is that the whole tdx_is_private_mem() function in the next
patch (the #MC handling one) can be significantly reduced from ~100
lines to ~10, and we roughly need some more code (<50 LoC) to mark PAMT
as private.

2023-06-14 00:55:38

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Wed, 2023-06-14 at 00:38 +0000, Huang, Kai wrote:
> On Tue, 2023-06-13 at 17:24 -0700, Dave Hansen wrote:
> > On 6/13/23 16:18, Huang, Kai wrote:
> > > On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
> > > > On 6/12/23 17:51, Huang, Kai wrote:
> > > > > If we introduce a helper to mark a page as TDX private page,
> > > > Let me get this right: you have working, functional code for a
> > > > highly-unlikely scenario (kernel bugs or even more rare hardware
> > > > errors). But, you want to optimize this super-rare case? It's not fast
> > > > enough?
> > > >
> > > > Is there any other motivation here that I'm missing?
> > > >
> > > No it's not about speed. The motivation is to have a common code to yield less
> > > line of code, though I don't have clear number of how many LoC can be reduced.
> >
> > OK, so ... ballpark. How many lines of code are we going to _save_ for
> > this super-rare case? 10? 100? 1000?
>
> ~50 LoC I guess, certainly < 100.
>
> >
> > The upside is saving X lines of code ... somewhere. The downside is
> > adding Y lines of code ... somewhere else and maybe breaking things in
> > the process.
> >
> > You've evidently done _some_ kind of calculus in your head to make this
> > tradeoff worthwhile. I'd love to hear what your calculus is, even if
> > it's just a gut feel.
> >
> > Could you share your logic here, please?
>
> The logic is the whole tdx_is_private_mem() function in the next patch (#MC
> handling one) can be significantly reduced from 100 -> ~10, and we roughly needs
> some more code (<50 LoC) to mark PAMT as private.
>

Apologies, that should be "we roughly need some more code (<50 LoC) to
mark PAMT, Secure-EPT and TDX guest private pages as TDX private
pages". But for now we only have PAMT.

2023-06-14 01:29:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/13/23 16:18, Huang, Kai wrote:
> On Tue, 2023-06-13 at 07:25 -0700, Hansen, Dave wrote:
>> On 6/12/23 17:51, Huang, Kai wrote:
>>> If we introduce a helper to mark a page as TDX private page,
>> Let me get this right: you have working, functional code for a
>> highly-unlikely scenario (kernel bugs or even more rare hardware
>> errors). But, you want to optimize this super-rare case? It's not fast
>> enough?
>>
>> Is there any other motivation here that I'm missing?
>>
> No it's not about speed. The motivation is to have a common code to yield less
> line of code, though I don't have clear number of how many LoC can be reduced.

OK, so ... ballpark. How many lines of code are we going to _save_ for
this super-rare case? 10? 100? 1000?

The upside is saving X lines of code ... somewhere. The downside is
adding Y lines of code ... somewhere else and maybe breaking things in
the process.

You've evidently done _some_ kind of calculus in your head to make this
tradeoff worthwhile. I'd love to hear what your calculus is, even if
it's just a gut feel.

Could you share your logic here, please?

2023-06-14 01:30:55

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, 2023-06-13 at 14:05 +0300, [email protected] wrote:
> On Tue, Jun 13, 2023 at 12:51:23AM +0000, Huang, Kai wrote:
> > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > > On 6/12/23 03:27, Huang, Kai wrote:
> > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > > will be seen by other cpus.
> > > >
> > > > Does it make sense?
> > >
> > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > built-in memory barriers are are less likely to get botched.
> >
> > Thanks for the suggestion.
> >
> > Hi Dave, Kirill,
> >
> > I'd like to check with you that whether we should introduce a mechanism to track
> > TDX private pages for both this patch and the next.
> >
> > As you can see this patch only deals PAMT pages due to couple of reasons that
> > mnentioned in the changelog. The next MCE patch handles all TDX private pages,
> > but it uses SEAMCALL in the #MC handler. Using SEAMCALL has two cons: 1) it is
> > slow (probably doesn't matter, though); 2) it brings additional risk of
> > triggering further #MC inside TDX module, although such risk should be a
> > theoretical thing.
> >
> > If we introduce a helper to mark a page as TDX private page, then both above
> > patches can utilize it. We don't need to consult TDMRs to get PAMT anymore in
> > this patch (we will need a way to loop all TDX-usable memory pages, but this
> > needs to be done anyway with TDX guests). I believe eventually we can end up
> > with less code.
> >
> > In terms of how to do, for PAMT pages, we can set page->private to a TDX magic
> > number because they come out of page allocator directly. Secure-EPT pages are
> > like PAMT pages too. For TDX guest private pages, Sean is moving to implement
> > KVM's own pseudo filesystem so they will have a unique mapping to identify.
> >
> > https://github.com/sean-jc/linux/commit/40d338c8629287dda60a9f7c800ede8549295a7c
> >
> > And my thinking is in this TDX host series, we can just handle PAMT pages. Both
> > secure-EPT and TDX guest private pages can be handled later in KVM TDX series.
> > I think eventually we can have a function like below to tell whether a page is
> > TDX private page:
> >
> > bool page_is_tdx_private(struct page *page)
> > {
> > if (page->private == TDX_PRIVATE_MAGIC)
> > return true;
> >
> > if (!page_mapping(page))
> > return false;
> >
> > return page_mapping(page)->a_ops == &kvm_gmem_ops;
> > }
> >
> > How does this sound? Or any other comments? Thanks!
>
> If you going to take this path it has to be supported natively by
> kvm_gmem_: it has to provide API for that. 
>

Yes.

> You should not assume that
> page->private is free to use. It is owned by kvm_gmmem.
>

page->private is only for PAMT and SEPT pages. kvm_gmem has its own
mapping which can be used to identify the pages owned by it.

Hmm.. I think we should just leave them out for now, as they
theoretically are owned by KVM and thus can be handled by KVM, e.g., in
its reboot or shutdown or module unload code path.

If we are fine to use SEAMCALL in the #MC handler code path, I think
perhaps we can just keep using TDMRs to locate PAMTs.

2023-06-14 09:48:44

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> --- a/arch/x86/kernel/reboot.c
> +++ b/arch/x86/kernel/reboot.c
> @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
>  
>  #ifdef CONFIG_X86_64
>   x86_platform.iommu_shutdown();
> + x86_platform.memory_shutdown();
>  #endif
>  }

Hi Kirill/Dave,

I missed that this solution doesn't reset TDX private memory for
emergency restart or when reboot_force is set, because
machine_shutdown() isn't called for them.

Is it acceptable? Or should we handle them too?

2023-06-14 10:33:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote:
> On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> > --- a/arch/x86/kernel/reboot.c
> > +++ b/arch/x86/kernel/reboot.c
> > @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
> >
> >  #ifdef CONFIG_X86_64
> >  x86_platform.iommu_shutdown();
> > + x86_platform.memory_shutdown();
> >  #endif
> >  }
>
> Hi Kirill/Dave,
>
> I missed that this solution doesn't reset TDX private for emergency restart or
> when reboot_force is set, because machine_shutdown() isn't called for them.
>
> Is it acceptable? Or should we handle them too?

Force reboot is not used in kexec path, right? And the platform has to
handle erratum in BIOS to reset memory status on reboot anyway.

I think we should be fine. But it's worth mentioning in the commit
message.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-14 11:11:27

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Wed, 2023-06-14 at 13:02 +0300, [email protected] wrote:
> On Wed, Jun 14, 2023 at 09:33:45AM +0000, Huang, Kai wrote:
> > On Mon, 2023-06-05 at 02:27 +1200, Kai Huang wrote:
> > > --- a/arch/x86/kernel/reboot.c
> > > +++ b/arch/x86/kernel/reboot.c
> > > @@ -720,6 +720,7 @@ void native_machine_shutdown(void)
> > >  
> > >  #ifdef CONFIG_X86_64
> > >   x86_platform.iommu_shutdown();
> > > + x86_platform.memory_shutdown();
> > >  #endif
> > >  }
> >
> > Hi Kirill/Dave,
> >
> > I missed that this solution doesn't reset TDX private for emergency restart or
> > when reboot_force is set, because machine_shutdown() isn't called for them.
> >
> > Is it acceptable? Or should we handle them too?
>
> Force reboot is not used in kexec path, right? 
>

Correct.

> And the platform has to
> handle erratum in BIOS to reset memory status on reboot anyway.

So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX
private pages, and the BIOS needs to disable "warm reset".

IIUC this means the kernel needs to depend on a specific BIOS setting to
work normally, and IIUC the kernel cannot even be aware of this setting?

Should the kernel just reset all TDX private pages during reboot when the
erratum is present, so the kernel doesn't depend on the BIOS?

>
> I think we should be fine. But it worth mentioning it in the commit
> message.
>

Agreed.

2023-06-14 11:17:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote:
> > And the platform has to
> > handle erratum in BIOS to reset memory status on reboot anyway.
>
> So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX
> private pages, and the BIOS needs to disable "warm reset".
>
> IIUC this means the kernel needs to depend on specific BIOS setting to work
> normally, and IIUC the kernel even cannot be aware of this setting?
>
> Should the kernel just reset all TDX private pages when erratum is present
> during reboot so the kernel doesn't depend on BIOS?

Kernel cannot really function if we don't trust BIOS to do its job. Kernel
depends on BIOS services anyway. We cannot try to handle everything in
kernel just in case BIOS drops the ball.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-14 11:30:57

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Wed, 2023-06-14 at 14:08 +0300, [email protected] wrote:
> On Wed, Jun 14, 2023 at 10:58:13AM +0000, Huang, Kai wrote:
> > > And the platform has to
> > > handle erratum in BIOS to reset memory status on reboot anyway.
> >
> > So "handle erratum in BIOS" I think you mean "warm reset" doesn't reset TDX
> > private pages, and the BIOS needs to disable "warm reset".
> >
> > IIUC this means the kernel needs to depend on specific BIOS setting to work
> > normally, and IIUC the kernel even cannot be aware of this setting?
> >
> > Should the kernel just reset all TDX private pages when erratum is present
> > during reboot so the kernel doesn't depend on BIOS?
>
> Kernel cannot really function if we don't trust BIOS to do its job. Kernel
> depends on BIOS services anyway. We cannot try to handle everything in
> kernel just in case BIOS drops the ball.
>

In other words, I assume we just need to take care of kexec().

The current patch tries to handle reboot too, so I'll change to only cover
kexec(), assuming the BIOS will always disable warm reset reboot for platforms
with this erratum.

Thanks.

2023-06-19 12:03:03

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> On 6/12/23 03:27, Huang, Kai wrote:
> > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > will be seen by other cpus.
> >
> > Does it make sense?
>
> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers are are less likely to get botched.

Hi Dave,

Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
little bit silly or overkill IMHO. Looking at the code, it seems
arch_atomic_set() simply uses __WRITE_ONCE():

static __always_inline void arch_atomic_set(atomic_t *v, int i)
{
	__WRITE_ONCE(v->counter, i);
}

Is it better to just use __WRITE_ONCE() or WRITE_ONCE() here?

- tdmr->pamt_4k_base = pamt_base[TDX_PS_4K];
+ WRITE_ONCE(tdmr->pamt_4k_base, pamt_base[TDX_PS_4K]);



2023-06-19 14:59:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/19/23 04:43, Huang, Kai wrote:
> On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
>> On 6/12/23 03:27, Huang, Kai wrote:
>>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
>>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
>>> will be seen by other cpus.
>>>
>>> Does it make sense?
>> Just use a normal old atomic_t or set_bit()/test_bit(). They have
>> built-in memory barriers are are less likely to get botched.
> Hi Dave,
>
> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> little bit silly or overkill IMHO. Looking at the code, it seems
> arch_atomic_set() simply uses __WRITE_ONCE():

How about _adding_ a variable that protects tdmr->pamt_4k_base?
Wouldn't that be more straightforward than mucking around with existing
types?

2023-06-19 15:01:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote:
> On 6/19/23 04:43, Huang, Kai wrote:
> > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> >> On 6/12/23 03:27, Huang, Kai wrote:
> >>> So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> >>> it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> >>> will be seen by other cpus.
> >>>
> >>> Does it make sense?
> >> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> >> built-in memory barriers are are less likely to get botched.
> > Hi Dave,
> >
> > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > little bit silly or overkill IMHO. Looking at the code, it seems
> > arch_atomic_set() simply uses __WRITE_ONCE():
>
> How about _adding_ a variable that protects tdmr->pamt_4k_base?
> Wouldn't that be more straightforward than mucking around with existing
> types?

What's wrong with a simple global spinlock that protects all tdmr->pamt_*?
It is much easier to follow than a custom serialization scheme.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-06-19 23:43:51

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-19 at 17:46 +0300, [email protected] wrote:
> On Mon, Jun 19, 2023 at 07:31:21AM -0700, Dave Hansen wrote:
> > On 6/19/23 04:43, Huang, Kai wrote:
> > > On Mon, 2023-06-12 at 06:47 -0700, Dave Hansen wrote:
> > > > On 6/12/23 03:27, Huang, Kai wrote:
> > > > > So I think a __mb() after setting tdmr->pamt_4k_base should be good enough, as
> > > > > it guarantees when setting to any pamt_*_size happens, the valid pamt_4k_base
> > > > > will be seen by other cpus.
> > > > >
> > > > > Does it make sense?
> > > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > > built-in memory barriers are are less likely to get botched.
> > > Hi Dave,
> > >
> > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > > little bit silly or overkill IMHO. Looking at the code, it seems
> > > arch_atomic_set() simply uses __WRITE_ONCE():
> >
> > How about _adding_ a variable that protects tdmr->pamt_4k_base?
> > Wouldn't that be more straightforward than mucking around with existing
> > types?
>
> What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> It is much easier to follow than a custom serialization scheme.
>

For this patch I think it's overkill to use a spinlock because when the
rebooting cpu is reading this, all other cpus have been stopped already,
so there's no concurrency here.

However I just recalled that the next #MC handler patch can also take
advantage of this, because the #MC handler can truly run concurrently
with module initialization. Currently that one reads tdx_module_status
first, but again we may have the same memory ordering issue. So having a
spinlock makes sense from the #MC handler patch's point of view.

I'll change to use spinlock if Dave is fine?

Thanks for feedback!

2023-06-19 23:50:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/19/23 07:46, [email protected] wrote:
>>>
>>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
>>> little bit silly or overkill IMHO. Looking at the code, it seems
>>> arch_atomic_set() simply uses __WRITE_ONCE():
>> How about _adding_ a variable that protects tdmr->pamt_4k_base?
>> Wouldn't that be more straightforward than mucking around with existing
>> types?
> What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> It is much easier to follow than a custom serialization scheme.

Quick, what prevents a:

spin_lock() => #MC => spin_lock()

deadlock?

Plain old test/sets don't deadlock ever.

2023-06-20 01:18:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On 6/19/23 17:56, Huang, Kai wrote:
> Any comments to below?

Nothing that I haven't already said in this thread:

> Just use a normal old atomic_t or set_bit()/test_bit(). They have
> built-in memory barriers are are less likely to get botched.

I kinda made a point of literally suggesting "atomic_t or
set_bit()/test_bit()". I even told you why: "built-in memory barriers".

Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.




2023-06-20 01:18:43

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-19 at 16:41 -0700, Dave Hansen wrote:
> On 6/19/23 07:46, [email protected] wrote:
> > > >
> > > > Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> > > > little bit silly or overkill IMHO. Looking at the code, it seems
> > > > arch_atomic_set() simply uses __WRITE_ONCE():
> > > How about _adding_ a variable that protects tdmr->pamt_4k_base?
> > > Wouldn't that be more straightforward than mucking around with existing
> > > types?
> > What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> > It is much easier to follow than a custom serialization scheme.
>
> Quick, what prevents a:
>
> spin_lock() => #MC => spin_lock()
>
> deadlock?
>
> Plain old test/sets don't deadlock ever.

Agreed. So I think having any locking in the #MC handler is kinda dangerous.

Adding "a" variable has another advantage: we can get a more precise
answer to whether we need to reset PAMT pages, even when the PAMTs are
already allocated and set in the TDMRs, because the TDX module doesn't
start to write PAMTs using the global KeyID until certain SEAMCALLs.

Any comments to below?

+static bool tdx_private_mem_begin;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1141,6 +1143,8 @@ static int init_tdx_module(void)
*/
wbinvd_on_all_cpus();

+ WRITE_ONCE(tdx_private_mem_begin, true);
+
/* Config the key of global KeyID on all packages */
ret = config_global_keyid();
if (ret)
@@ -1463,6 +1467,14 @@ static void tdx_memory_shutdown(void)
*/
WARN_ON_ONCE(num_online_cpus() != 1);

+ /*
+ * It's not possible to have any TDX private pages if the TDX
+ * module hasn't started to write any memory using the global
+ * KeyID.
+ */
+ if (!READ_ONCE(tdx_private_mem_begin))
+ return;
+
tdmrs_reset_pamt_all(&tdx_tdmr_list);



2023-06-20 08:09:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 19, 2023 at 04:41:13PM -0700, Dave Hansen wrote:
> On 6/19/23 07:46, [email protected] wrote:
> >>>
> >>> Using atomic_set() requires changing tdmr->pamt_4k_base to atomic_t, which is a
> >>> little bit silly or overkill IMHO. Looking at the code, it seems
> >>> arch_atomic_set() simply uses __WRITE_ONCE():
> >> How about _adding_ a variable that protects tdmr->pamt_4k_base?
> >> Wouldn't that be more straightforward than mucking around with existing
> >> types?
> > What's wrong with simple global spinlock that protects all tdmr->pamt_*?
> > It is much easier to follow than a custom serialization scheme.
>
> Quick, what prevents a:
>
> spin_lock() => #MC => spin_lock()
>
> deadlock?
>
> Plain old test/sets don't deadlock ever.

Depends on what you mean; anything that spin-waits will deadlock,
doesn't matter if it's a test-and-set or not.

The thing with these non-maskable exceptions/interrupts is that they
must be wait-free. If serialization is required it needs to be try based
and accept failure without waiting.
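
E.g. roughly (a sketch only; the lock name is made up):

	/* in an #MC-reachable path: never spin, bail out if busy */
	if (!spin_trylock(&tdx_pamt_lock))
		return;
	/* ... touch the PAMT state ... */
	spin_unlock(&tdx_pamt_lock);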

2023-06-20 08:34:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:

> + __mb();

__mb() is not a valid interface to use.

2023-06-20 08:52:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, Jun 19, 2023 at 06:06:30PM -0700, Dave Hansen wrote:
> On 6/19/23 17:56, Huang, Kai wrote:
> > Any comments to below?
>
> Nothing that I haven't already said in this thread:
>
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers are are less likely to get botched.
>
> I kinda made a point of literally suggesting "atomic_t or
> set_bit()/test_bit()". I even told you why: "built-in memory barriers".
>
> Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.

x86 has built-in memory barriers for being TSO :-) Specifically all
barriers provided by spinlock (acquire/release) are no-ops on x86.

(strictly speaking locks imply stronger order than they have to because
TSO atomic ops imply stronger ordering than required)

There is one (and only the one) re-ordering possible on TSO and that is
the store-buffer, later loads can fail to observe prior stores.

If that is a concern, you need explicit barriers.
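
(For illustration, the classic store-buffer litmus test: without the
smp_mb()s below, TSO allows both loads to miss the other CPU's store,
i.e. r0 == 0 && r1 == 0 is a permitted outcome. X, Y, r0, r1 are just
placeholders.)

	/* initially X = Y = 0 */

	/* CPU 0 */
	WRITE_ONCE(X, 1);
	smp_mb();
	r0 = READ_ONCE(Y);

	/* CPU 1 */
	WRITE_ONCE(Y, 1);
	smp_mb();
	r1 = READ_ONCE(X);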

This is #MC, much care and explicit open-coded crap is expected. Also,
this is #MC, much breakage is also expected :-( As in, the current #MC
handler is a known pile of shit.

Basically the whole of #MC should be noinstr -- it isn't and that's a
significant problem.

Also we still very much suffer the NMI <- #MC problem and the #MC latch
is known broken garbage.

Whatever you do, do it very carefully, double check and be more careful.

2023-06-20 11:07:24

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote:
> On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
>
> > + __mb();
>
> __mb() is not a valid interface to use.

Thanks for feedback!

May I ask why, for education purpose? :)

2023-06-20 11:31:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Tue, Jun 20, 2023 at 10:42:32AM +0000, Huang, Kai wrote:
> On Tue, 2023-06-20 at 10:11 +0200, Peter Zijlstra wrote:
> > On Mon, Jun 12, 2023 at 03:06:48AM +0000, Huang, Kai wrote:
> >
> > > + __mb();
> >
> > __mb() is not a valid interface to use.
>
> Thanks for feedback!
>
> May I ask why, for education purpose? :)

it's the raw MFENCE wrapper, not one of the *many* documented barriers.

Also, typically you do *not* want MFENCE, MFENCE bad.

2023-06-25 16:02:50

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote:
> On 6/19/23 17:56, Huang, Kai wrote:
> > Any comments to below?
>
> Nothing that I haven't already said in this thread:
>
> > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > built-in memory barriers are are less likely to get botched.
>
> I kinda made a point of literally suggesting "atomic_t or
> set_bit()/test_bit()". I even told you why: "built-in memory barriers".
>
> Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
>

Hi Dave,

Sorry to bring this up again. I thought more about this topic, and I
think using atomic_t is only necessary if we add it right after setting
up tdmr->pamt_* in tdmr_set_up_pamt(), because there we need both a
compiler barrier and a CPU memory barrier to ensure memory ordering (as
Kirill commented in the first reply).

However, if we add a new variable like below ...

+static bool tdx_private_mem_begin;
+
/*
* Wrapper of __seamcall() to convert SEAMCALL leaf function error code
* to kernel error code. @seamcall_ret and @out contain the SEAMCALL
@@ -1123,6 +1125,8 @@ static int init_tdx_module(void)
*/
wbinvd_on_all_cpus();

+ tdx_private_mem_begin = true;


... then we don't need any more explicit barriers, because: 1) it's not
possible for the compiler to reorder setting tdmr->pamt_* and
tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a
serializing instruction, so the wbinvd_on_all_cpus() above already
implies a memory barrier.

Does this make sense?

2023-06-25 23:53:28

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v11 18/20] x86: Handle TDX erratum to reset TDX private memory during kexec() and reboot

On Sun, 2023-06-25 at 15:30 +0000, Huang, Kai wrote:
> On Mon, 2023-06-19 at 18:06 -0700, Dave Hansen wrote:
> > On 6/19/23 17:56, Huang, Kai wrote:
> > > Any comments to below?
> >
> > Nothing that I haven't already said in this thread:
> >
> > > Just use a normal old atomic_t or set_bit()/test_bit(). They have
> > > built-in memory barriers are are less likely to get botched.
> >
> > I kinda made a point of literally suggesting "atomic_t or
> > set_bit()/test_bit()". I even told you why: "built-in memory barriers".
> >
> > Guess what READ/WRITE_ONCE() *don't* have. Memory barriers.
> >
>
> Hi Dave,
>
> Sorry to bring this up again. I thought more on this topic, and I think using
> atotmic_t is only necessary if we add it right after setting up tdmr->pamt_* in
> tdmr_set_up_pamt(), because there we need both compiler barrier and CPU memory
> barrier to make sure memory order (as Kirill commented in the first reply).
>
> However, if we add a new variable like below ...
>
> +static bool tdx_private_mem_begin;
> +
> /*
> * Wrapper of __seamcall() to convert SEAMCALL leaf function error code
> * to kernel error code. @seamcall_ret and @out contain the SEAMCALL
> @@ -1123,6 +1125,8 @@ static int init_tdx_module(void)
> */
> wbinvd_on_all_cpus();
>
> + tdx_private_mem_begin = true;
>
>
> ... then we don't need any more explicit barrier, because: 1) it's not possible
> for compiler to optimize the order between setting tdmr->pamt_* and
> tdx_private_mem_begin; 2) no CPU memory barrier is needed as WBINVD is a
> serializing instruction so the wbinvd_on_all_cpus() above has already implied
> memory barrier.
>
> Does this make sense?

Sorry, please ignore this. I missed a corner case: kexec() can happen
when something goes wrong during module initialization, while
PAMTs/TDMRs are being freed. We still need an explicit memory barrier
for this case. I will use atomic_t as suggested. Thanks!