2023-09-29 21:41:39

by Michael Kelley (LINUX)

Subject: [PATCH 0/5] x86/coco: Mark CoCo VM pages not present when changing encrypted state

In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, load_unaligned_zeropad() can read arbitrary memory pages at
arbitrary times, which could read a transitioning page during the
window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
The exceptions would need to be forwarded from the paravisor to the
Linux guest, but there's no architectural spec for how to do that.
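
For readers not familiar with it, load_unaligned_zeropad() is used by
word-at-a-time string code and may deliberately read past the end of a
buffer into the adjacent page. A hypothetical caller is sketched below;
everything here except load_unaligned_zeropad() itself is made up for
illustration.

#include <linux/types.h>
#include <asm/word-at-a-time.h>

/*
 * Illustrative caller only -- not from this series.  The final
 * load_unaligned_zeropad() may read bytes beyond 'len' and spill into
 * the page that follows the buffer.  If that page faults, the fixup
 * path zero-fills the bytes from the faulting page instead of treating
 * the access as fatal.
 */
static unsigned long hash_buf_words(const char *buf, size_t len)
{
        unsigned long hash = 0;
        size_t i;

        for (i = 0; i < len; i += sizeof(unsigned long))
                hash = hash * 31 + load_unaligned_zeropad(buf + i);

        return hash;
}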

Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:

1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Set/clear the encryption attribute as appropriate
5. Flush the TLB so the changed encryption attribute isn't visible
6. Notify the hypervisor of the encryption status change
7. Add back the PRESENT bit, making the changed attribute visible

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.
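
For reference, the reworked __set_memory_enc_pgtable() ends up looking
roughly like the condensed sketch below (taken from the Patch 3 diff,
with error handling trimmed and the numbered steps marked in comments):

static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
{
        pgprot_t empty = __pgprot(0);
        struct cpa_data cpa;
        int ret;

        memset(&cpa, 0, sizeof(cpa));
        cpa.vaddr = &addr;
        cpa.numpages = numpages;
        cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty);
        cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
                             pgprot_encrypted(__pgprot(_PAGE_PRESENT));
        cpa.pgd = init_mm.pgd;

        /* Step 1: remove aliasing mappings */
        kmap_flush_unused();
        vm_unmap_aliases();

        /* Step 2: flush the data cache if needed */
        if (x86_platform.guest.enc_cache_flush_required())
                cpa_flush(&cpa, 1);

        /* Steps 3 and 4: clear _PAGE_PRESENT and set/clear the encryption
         * attribute in a single pass over the PTEs */
        ret = __change_page_attr_set_clr(&cpa, 1);
        if (ret)
                return ret;

        /* Step 5: flush the TLB */
        cpa_flush(&cpa, 0);

        /* Step 6: notify the hypervisor of the status change */
        if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
                return -EIO;

        /* Step 7: add back _PAGE_PRESENT (set_memory_p() doesn't flush) */
        return set_memory_p(&addr, numpages);
}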

This patch set is a follow-up to an RFC patch and discussion.[1]
Compared with the RFC patch, the steps listed above are optimized for
better performance and particularly for fewer TLB flushes.

Patch 1 handles the implications of the hypervisor callbacks in Step 6
needing to do virt-to-phys translations on pages that are temporarily
marked not present.
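
As a rough illustration of the Patch 1 direction (a sketch, not the
patch itself): the callbacks derive physical addresses with
slow_virt_to_phys(), which walks the page tables and reads the PFN out
of the PTE, so the translation still works while _PAGE_PRESENT is
temporarily clear.

#include <linux/mm.h>
#include <asm/pgtable_types.h>

/*
 * Sketch only, not the Patch 1 code: translate a guest virtual address
 * to a PFN for the hypervisor notification by walking the page tables.
 * The PFN bits remain in the PTE while _PAGE_PRESENT is temporarily
 * cleared, so slow_virt_to_phys() still resolves the address.
 */
static unsigned long enc_transition_vaddr_to_pfn(void *vaddr)
{
        return (unsigned long)(slow_virt_to_phys(vaddr) >> PAGE_SHIFT);
}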

Patch 2 is a performance optimization so that Step 7 doesn't generate
a TLB flush.

Patch 3 is the core change that implements Steps 1 through 7. It also
simplifies the associated TDX, SEV-SNP, and Hyper-V vTOM callbacks.

Patch 4 is a somewhat tangential cleanup that removes an unnecessary
wrapper function in the path for doing a transition.

Patch 5 adds comments describing the implications of errors when
doing a transition. These implications are discussed in the email
thread for the RFC patch.

With this change, the #VE and #VC exception handlers should no longer
be triggered for load_unaligned_zeropad() accesses, and the existing
code in those handlers to do the "fixup" shouldn't be needed. But I
have not removed that code in this patch set. Kirill Shutemov wants
to keep the code for TDX #VE, so the code for #VC on the SEV-SNP
side has also been kept.

This patch set is based on the linux-next20230921 code tree.

[1] https://lore.kernel.org/lkml/[email protected]/

Michael Kelley (5):
x86/coco: Use slow_virt_to_phys() in page transition hypervisor
callbacks
x86/mm: Don't do a TLB flush if changing a PTE that isn't marked
present
x86/mm: Mark CoCo VM pages not present while changing encrypted state
x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()
x86/mm: Add comments about errors in
set_memory_decrypted()/encrypted()

arch/x86/coco/tdx/tdx.c | 66 +-----------------------
arch/x86/hyperv/ivm.c | 15 +++---
arch/x86/kernel/sev.c | 8 ++-
arch/x86/kernel/x86_init.c | 4 --
arch/x86/mm/mem_encrypt_amd.c | 27 +++-------
arch/x86/mm/pat/set_memory.c | 114 +++++++++++++++++++++++++++++-------------
6 files changed, 102 insertions(+), 132 deletions(-)

--
1.8.3.1


2023-09-29 23:41:25

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: [PATCH 4/5] x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()

__set_memory_enc_pgtable() is only called from __set_memory_enc_dec()
after doing a simple validation check. Prior to commit 812b0597fb40,
__set_memory_enc_dec() did more complex checking, but now the code
can be simplified by collapsing the two functions.

No functional change.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d062e01..1578077 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2131,15 +2131,18 @@ int set_memory_global(unsigned long addr, int numpages)
}

/*
- * __set_memory_enc_pgtable() is used for the hypervisors that get
+ * __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
*/
-static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
pgprot_t empty = __pgprot(0);
struct cpa_data cpa;
int ret;

+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return 0;
+
/* Should not be working on unaligned addresses */
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;
@@ -2200,14 +2203,6 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}

-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
-{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
-
- return 0;
-}
-
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
1.8.3.1

2023-09-29 23:50:09

by Michael Kelley (LINUX)

Subject: [PATCH 2/5] x86/mm: Don't do a TLB flush if changing a PTE that isn't marked present

The core function __change_page_attr() currently sets up a TLB flush if
a PTE is changed. But if the old value of the PTE doesn't include the
PRESENT flag, the PTE won't be in the TLB, so a flush isn't needed.

Avoid an unnecessary TLB flush by conditioning the flush on the old
PTE value including PRESENT. This change improves the performance of
functions like set_memory_p() by avoiding the flush if the memory range
was previously all not present.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8e19796..d7ef8d3 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
*/
if (pte_val(old_pte) != pte_val(new_pte)) {
set_pte_atomic(kpte, new_pte);
- cpa->flags |= CPA_FLUSHTLB;
+
+ /* If old_pte isn't present, it's not in the TLB */
+ if (pte_present(old_pte))
+ cpa->flags |= CPA_FLUSHTLB;
}
cpa->numpages = 1;
return 0;
--
1.8.3.1

2023-09-30 01:27:09

by Michael Kelley (LINUX)

Subject: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could access a transitioning page during
the window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
The exceptions need to be forwarded from the paravisor to the Linux
guest, but there are no architectural specs for how to do that.

Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:

1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Set/clear the encryption attribute as appropriate
5. Flush the TLB so the changed encryption attribute isn't visible
6. Notify the hypervisor of the encryption status change
7. Add back the PRESENT bit, making the changed attribute visible

With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.

With this approach, the order of updating the guest PTEs and
notifying the hypervisor doesn't matter. As such, only a single
hypervisor callback is needed, rather than one before and one after
the PTE update. Simplify the code by eliminating the extra
hypervisor callback along with the TDX and SEV-SNP code that
handles the before and after cases. The TLB flush callback is
also no longer required and is removed.

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 66 +------------------------------------------
arch/x86/hyperv/ivm.c | 6 ----
arch/x86/kernel/x86_init.c | 4 ---
arch/x86/mm/mem_encrypt_amd.c | 27 ++++--------------
arch/x86/mm/pat/set_memory.c | 55 +++++++++++++++++++++++-------------
5 files changed, 43 insertions(+), 115 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3e6dbd2..1bb2fff 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -676,24 +676,6 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
return true;
}

-static bool tdx_tlb_flush_required(bool private)
-{
- /*
- * TDX guest is responsible for flushing TLB on private->shared
- * transition. VMM is responsible for flushing on shared->private.
- *
- * The VMM _can't_ flush private addresses as it can't generate PAs
- * with the guest's HKID. Shared memory isn't subject to integrity
- * checking, i.e. the VMM doesn't need to flush for its own protection.
- *
- * There's no need to flush when converting from shared to private,
- * as flushing is the VMM's responsibility in this case, e.g. it must
- * flush to avoid integrity failures in the face of a buggy or
- * malicious guest.
- */
- return !private;
-}
-
static bool tdx_cache_flush_required(void)
{
/*
@@ -776,30 +758,6 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
return true;
}

-static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
- bool enc)
-{
- /*
- * Only handle shared->private conversion here.
- * See the comment in tdx_early_init().
- */
- if (enc)
- return tdx_enc_status_changed(vaddr, numpages, enc);
- return true;
-}
-
-static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
- bool enc)
-{
- /*
- * Only handle private->shared conversion here.
- * See the comment in tdx_early_init().
- */
- if (!enc)
- return tdx_enc_status_changed(vaddr, numpages, enc);
- return true;
-}
-
void __init tdx_early_init(void)
{
struct tdx_module_args args = {
@@ -831,30 +789,8 @@ void __init tdx_early_init(void)
*/
physical_mask &= cc_mask - 1;

- /*
- * The kernel mapping should match the TDX metadata for the page.
- * load_unaligned_zeropad() can touch memory *adjacent* to that which is
- * owned by the caller and can catch even _momentary_ mismatches. Bad
- * things happen on mismatch:
- *
- * - Private mapping => Shared Page == Guest shutdown
- * - Shared mapping => Private Page == Recoverable #VE
- *
- * guest.enc_status_change_prepare() converts the page from
- * shared=>private before the mapping becomes private.
- *
- * guest.enc_status_change_finish() converts the page from
- * private=>shared after the mapping becomes private.
- *
- * In both cases there is a temporary shared mapping to a private page,
- * which can result in a #VE. But, there is never a private mapping to
- * a shared page.
- */
- x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
- x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;
-
+ x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;

/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 084fab6..fbe2585 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
return result;
}

-static bool hv_vtom_tlb_flush_required(bool private)
-{
- return true;
-}
-
static bool hv_vtom_cache_flush_required(void)
{
return false;
@@ -614,7 +609,6 @@ void __init hv_vtom_init(void)

x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;

/* Set WB as the default cache mode. */
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3..cf5179b 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -131,9 +131,7 @@ struct x86_cpuinit_ops x86_cpuinit = {

static void default_nmi_init(void) { };

-static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
static bool is_private_mmio_noop(u64 addr) {return false; }

@@ -154,9 +152,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
.hyper.is_private_mmio = is_private_mmio_noop,

.guest = {
- .enc_status_change_prepare = enc_status_change_prepare_noop,
.enc_status_change_finish = enc_status_change_finish_noop,
- .enc_tlb_flush_required = enc_tlb_flush_required_noop,
.enc_cache_flush_required = enc_cache_flush_required_noop,
},
};
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index 6faea41..06960ba 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -278,11 +278,6 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
return pfn;
}

-static bool amd_enc_tlb_flush_required(bool enc)
-{
- return true;
-}
-
static bool amd_enc_cache_flush_required(void)
{
return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
@@ -318,18 +313,6 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
#endif
}

-static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
-{
- /*
- * To maintain the security guarantees of SEV-SNP guests, make sure
- * to invalidate the memory before encryption attribute is cleared.
- */
- if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
- snp_set_memory_shared(vaddr, npages);
-
- return true;
-}
-
/* Return true unconditionally: return value doesn't matter for the SEV side */
static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
{
@@ -337,8 +320,12 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
* After memory is mapped encrypted in the page table, validate it
* so that it is consistent with the page table updates.
*/
- if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && enc)
- snp_set_memory_private(vaddr, npages);
+ if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
+ if (enc)
+ snp_set_memory_private(vaddr, npages);
+ else
+ snp_set_memory_shared(vaddr, npages);
+ }

if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
@@ -498,9 +485,7 @@ void __init sme_early_init(void)
/* Update the protection map with memory encryption mask */
add_encrypt_protection_map();

- x86_platform.guest.enc_status_change_prepare = amd_enc_status_change_prepare;
x86_platform.guest.enc_status_change_finish = amd_enc_status_change_finish;
- x86_platform.guest.enc_tlb_flush_required = amd_enc_tlb_flush_required;
x86_platform.guest.enc_cache_flush_required = amd_enc_cache_flush_required;

/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d7ef8d3..d062e01 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2147,40 +2147,57 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
+
+ /*
+ * The caller must ensure that the memory being transitioned between
+ * encrypted and decrypted is not being accessed. But if
+ * load_unaligned_zeropad() touches the "next" page, it may generate a
+ * read access the caller has no control over. To ensure such accesses
+ * cause a normal page fault for the load_unaligned_zeropad() handler,
+ * mark the pages not present until the transition is complete. We
+ * don't want a #VE or #VC fault due to a mismatch in the memory
+ * encryption status, since paravisor configurations can't cleanly do
+ * the load_unaligned_zeropad() handling in the paravisor.
+ *
+ * There's no requirement to do so, but for efficiency we can clear
+ * _PAGE_PRESENT and set/clr encryption attr as a single operation.
+ */
cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty);
- cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
+ cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
+ pgprot_encrypted(__pgprot(_PAGE_PRESENT));
cpa.pgd = init_mm.pgd;

/* Must avoid aliasing mappings in the highmem code */
kmap_flush_unused();
vm_unmap_aliases();

- /* Flush the caches as needed before changing the encryption attribute. */
- if (x86_platform.guest.enc_tlb_flush_required(enc))
- cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
-
- /* Notify hypervisor that we are about to set/clr encryption attribute. */
- if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
- return -EIO;
+ /* Flush the caches as needed before changing the encryption attr. */
+ if (x86_platform.guest.enc_cache_flush_required())
+ cpa_flush(&cpa, 1);

ret = __change_page_attr_set_clr(&cpa, 1);
+ if (ret)
+ return ret;

/*
- * After changing the encryption attribute, we need to flush TLBs again
- * in case any speculative TLB caching occurred (but no need to flush
- * caches again). We could just use cpa_flush_all(), but in case TLB
- * flushing gets optimized in the cpa_flush() path use the same logic
- * as above.
+ * After clearing _PAGE_PRESENT and changing the encryption attribute,
+ * we need to flush TLBs to ensure no further accesses to the memory can
+ * be made with the old encryption attribute (but no need to flush caches
+ * again). We could just use cpa_flush_all(), but in case TLB flushing
+ * gets optimized in the cpa_flush() path use the same logic as above.
*/
cpa_flush(&cpa, 0);

- /* Notify hypervisor that we have successfully set/clr encryption attribute. */
- if (!ret) {
- if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
- ret = -EIO;
- }
+ /* Notify hypervisor that we have successfully set/clr encryption attr. */
+ if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+ return -EIO;

- return ret;
+ /*
+ * Now that the hypervisor is sync'ed with the page table changes
+ * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
+ * the TLB.
+ */
+ return set_memory_p(&addr, numpages);
}

static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
--
1.8.3.1

2023-10-02 18:02:26

by Tom Lendacky

Subject: Re: [PATCH 0/5] x86/coco: Mark CoCo VM pages not present when changing encrypted state

On 9/29/23 13:19, Michael Kelley wrote:
> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent. Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, load_unaligned_zeropad() can read arbitrary memory pages at
> arbitrary times, which could read a transitioning page during the
> window. In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use). Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad(). Unfortunately, this exception handling can't
> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
> The exceptions would need to be forwarded from the paravisor to the
> Linux guest, but there's no architectural spec for how to do that.
>
> Fortunately, there's a simpler way to solve the problem by changing
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
>
> 1. Remove aliasing mappings
> 2. Flush the data cache if needed
> 3. Remove the PRESENT bit from the PTEs of all transitioning pages
> 4. Set/clear the encryption attribute as appropriate
> 5. Flush the TLB so the changed encryption attribute isn't visible
> 6. Notify the hypervisor of the encryption status change
> 7. Add back the PRESENT bit, making the changed attribute visible
>
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled. CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
>
> This patch set is a follow-up to an RFC patch and discussion.[1]
> Compared with the RFC patch, the steps listed above are optimized for
> better performance and particularly for fewer TLB flushes.
>
> Patch 1 handles implications of the hypervisor callbacks in Step 6
> needing to do virt-to-phys translations on pages that are temporarily
> marked not present.
>
> Patch 2 is a performance optimization so that Step 7 doesn't generate
> a TLB flush.
>
> Patch 3 is the core change that implements Steps 1 through 7. It also
> simplifies the associated TDX, SEV-SNP, and Hyper-V vTOM callbacks.
>
> Patch 4 is a somewhat tangential cleanup that removes an unnecessary
> wrapper function in the path for doing a transition.
>
> Patch 5 adds comments describing the implications of errors when
> doing a transition. These implications are discussed in the email
> thread for the RFC patch.
>
> With this change, the #VE and #VC exception handlers should no longer
> be triggered for load_unaligned_zeropad() accesses, and the existing
> code in those handlers to do the "fixup" shouldn't be needed. But I
> have not removed that code in this patch set. Kirill Shutemov wants
> to keep the code for TDX #VE, so the code for #VC on the SEV-SNP
> side has also been kept.

There isn't any code for the SEV-SNP side at the moment.

Thanks,
Tom

>
> This patch set is based on the linux-next20230921 code tree.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Michael Kelley (5):
> x86/coco: Use slow_virt_to_phys() in page transition hypervisor
> callbacks
> x86/mm: Don't do a TLB flush if changing a PTE that isn't marked
> present
> x86/mm: Mark CoCo VM pages not present while changing encrypted state
> x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()
> x86/mm: Add comments about errors in
> set_memory_decrypted()/encrypted()
>
> arch/x86/coco/tdx/tdx.c | 66 +-----------------------
> arch/x86/hyperv/ivm.c | 15 +++---
> arch/x86/kernel/sev.c | 8 ++-
> arch/x86/kernel/x86_init.c | 4 --
> arch/x86/mm/mem_encrypt_amd.c | 27 +++-------
> arch/x86/mm/pat/set_memory.c | 114 +++++++++++++++++++++++++++++-------------
> 6 files changed, 102 insertions(+), 132 deletions(-)
>

2023-10-02 20:23:44

by Tom Lendacky

Subject: Re: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

On 10/2/23 11:35, Tom Lendacky wrote:
> On 9/29/23 13:19, Michael Kelley wrote:
>> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
>> versa, attributes in the PTE must be updated *and* the hypervisor must
>> be notified of the change. Because there are two separate steps, there's
>> a window where the settings are inconsistent.  Normally the code that
>> initiates the transition (via set_memory_decrypted() or
>> set_memory_encrypted()) ensures that the memory is not being accessed
>> during a transition, so the window of inconsistency is not a problem.
>> However, the load_unaligned_zeropad() function can read arbitrary memory
>> pages at arbitrary times, which could access a transitioning page during
>> the window.  In such a case, CoCo VM specific exceptions are taken
>> (depending on the CoCo architecture in use).  Current code in those
>> exception handlers recovers and does "fixup" on the result returned by
>> load_unaligned_zeropad().  Unfortunately, this exception handling can't
>> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
>> The exceptions need to be forwarded from the paravisor to the Linux
>> guest, but there are no architectural specs for how to do that.
>>
>> Fortunately, there's a simpler way to solve the problem by changing
>> the core transition code in __set_memory_enc_pgtable() to do the
>> following:
>>
>> 1.  Remove aliasing mappings
>> 2.  Flush the data cache if needed
>> 3.  Remove the PRESENT bit from the PTEs of all transitioning pages
>> 4.  Set/clear the encryption attribute as appropriate
>> 5.  Flush the TLB so the changed encryption attribute isn't visible
>> 6.  Notify the hypervisor of the encryption status change
>
> Not sure why I didn't notice this before, but I will need to test this to
> be certain. As part of this notification, the SNP support will issue a
> PVALIDATE instruction (to either validate or rescind validation to the
> page). PVALIDATE takes a virtual address. If the PRESENT bit has been
> removed, the PVALIDATE instruction will take a #PF (see comments below).

Yes, this series results in a #PF booting an SNP guest:

[ 0.807735] BUG: unable to handle page fault for address: ffff88803bef7000
[ 0.807829] #PF: supervisor read access in kernel mode
[ 0.807829] #PF: error_code(0x0000) - not-present page
[ 0.807829] PGD 8000004c01067 P4D 8000004c01067 PUD 8000004c02067 PMD 80001001f8063 PTE 8007ffffc4108062
[ 0.807829] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 0.807829] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-rc3-sos-testing #1
[ 0.807829] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
[ 0.807829] RIP: 0010:pvalidate_pages+0x99/0x140
[ 0.807829] Code: 48 09 ca 0f b6 4c c7 0f 48 09 d6 0f b6 54 c7 0e 48 c1 e6 0c 4c 21 de 83 e2 f0 80 fa 10 4a 8d 2c 16 0f 94 c2 48 89 e8 83 e1 01 <f2> 0f 01 ff 41 89 c4 72 6
1 83 f8 06 75 45 84 c9 74 41 48 01 de 48
[ 0.807829] RSP: 0000:ffffffff82803bd0 EFLAGS: 00010246
[ 0.807829] RAX: ffff88803bef7000 RBX: ffff888000200000 RCX: 0000000000000000
[ 0.807829] RDX: 0000000000000000 RSI: 000000003bef7000 RDI: ffffffff82803c40
[ 0.807829] RBP: ffff88803bef7000 R08: 0000000000000000 R09: 0000000000000000
[ 0.807829] R10: ffff888000000000 R11: 000000ffffffffff R12: 0000000000000040
[ 0.807829] R13: 0000000000000020 R14: ffffffff82803e48 R15: ffff88803bf37000
[ 0.807829] FS: 0000000000000000(0000) GS:ffff88846fc00000(0000) knlGS:0000000000000000
[ 0.807829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.807829] CR2: ffff88803bef7000 CR3: 000800000382e000 CR4: 00000000003506f0
[ 0.807829] Call Trace:
[ 0.807829] <TASK>
[ 0.807829] ? __die+0x1f/0x70
[ 0.807829] ? page_fault_oops+0x81/0x150
[ 0.807829] ? srso_return_thunk+0x5/0x5f
[ 0.807829] ? kernelmode_fixup_or_oops+0x84/0x110
[ 0.807829] ? exc_page_fault+0xa8/0x150
[ 0.807829] ? asm_exc_page_fault+0x22/0x30
[ 0.807829] ? pvalidate_pages+0x99/0x140
[ 0.807829] __set_pages_state+0x280/0x2b0
[ 0.807829] set_pages_state+0x4e/0xa0
[ 0.807829] amd_enc_status_change_finish+0x4a/0x80
[ 0.807829] __set_memory_enc_dec+0xe1/0x190
[ 0.807829] mem_encrypt_init+0x15/0xc0
[ 0.807829] start_kernel+0x31b/0x5e0
[ 0.807829] x86_64_start_reservations+0x14/0x30
[ 0.807829] x86_64_start_kernel+0x79/0x80
[ 0.807829] secondary_startup_64_no_verify+0x16b/0x16b
[ 0.807829] </TASK>
[ 0.807829] Modules linked in:
[ 0.807829] CR2: ffff88803bef7000
[ 0.807829] ---[ end trace 0000000000000000 ]---
[ 0.807829] RIP: 0010:pvalidate_pages+0x99/0x140
[ 0.807829] Code: 48 09 ca 0f b6 4c c7 0f 48 09 d6 0f b6 54 c7 0e 48 c1 e6 0c 4c 21 de 83 e2 f0 80 fa 10 4a 8d 2c 16 0f 94 c2 48 89 e8 83 e1 01 <f2> 0f 01 ff 41 89 c4 72 6
1 83 f8 06 75 45 84 c9 74 41 48 01 de 48
[ 0.807829] RSP: 0000:ffffffff82803bd0 EFLAGS: 00010246
[ 0.807829] RAX: ffff88803bef7000 RBX: ffff888000200000 RCX: 0000000000000000
[ 0.807829] RDX: 0000000000000000 RSI: 000000003bef7000 RDI: ffffffff82803c40
[ 0.807829] RBP: ffff88803bef7000 R08: 0000000000000000 R09: 0000000000000000
[ 0.807829] R10: ffff888000000000 R11: 000000ffffffffff R12: 0000000000000040
[ 0.807829] R13: 0000000000000020 R14: ffffffff82803e48 R15: ffff88803bf37000
[ 0.807829] FS: 0000000000000000(0000) GS:ffff88846fc00000(0000) knlGS:0000000000000000
[ 0.807829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.807829] CR2: ffff88803bef7000 CR3: 000800000382e000 CR4: 00000000003506f0
[ 0.807829] Kernel panic - not syncing: Fatal exception
[ 0.807829] ---[ end Kernel panic - not syncing: Fatal exception ]---


Thanks,
Tom

>
>> 7.  Add back the PRESENT bit, making the changed attribute visible
>>
>> With this approach, load_unaligned_zeropad() just takes its normal
>> page-fault-based fixup path if it touches a page that is transitioning.
>> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
>> are completely decoupled.  CoCo VM page transitions can proceed
>> without needing to handle architecture-specific exceptions and fix
>> things up. This decoupling reduces the complexity due to separate
>> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
>> introduce new capabilities in future versions of the TDX and SEV-SNP
>> architectures. Paravisor scenarios work properly without needing
>> to forward exceptions.
>>
>> With this approach, the order of updating the guest PTEs and
>> notifying the hypervisor doesn't matter. As such, only a single
>> hypervisor callback is needed, rather than one before and one after
>> the PTE update. Simplify the code by eliminating the extra
>> hypervisor callback along with the TDX and SEV-SNP code that
>> handles the before and after cases. The TLB flush callback is
>> also no longer required and is removed.
>>
>> Signed-off-by: Michael Kelley <[email protected]>
>> ---
>>   arch/x86/coco/tdx/tdx.c       | 66
>> +------------------------------------------
>>   arch/x86/hyperv/ivm.c         |  6 ----
>>   arch/x86/kernel/x86_init.c    |  4 ---
>>   arch/x86/mm/mem_encrypt_amd.c | 27 ++++--------------
>>   arch/x86/mm/pat/set_memory.c  | 55 +++++++++++++++++++++++-------------
>>   5 files changed, 43 insertions(+), 115 deletions(-)
>>
>> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
>> index 3e6dbd2..1bb2fff 100644
>> --- a/arch/x86/coco/tdx/tdx.c
>> +++ b/arch/x86/coco/tdx/tdx.c
>> @@ -676,24 +676,6 @@ bool tdx_handle_virt_exception(struct pt_regs
>> *regs, struct ve_info *ve)
>>       return true;
>>   }
>> -static bool tdx_tlb_flush_required(bool private)
>> -{
>> -    /*
>> -     * TDX guest is responsible for flushing TLB on private->shared
>> -     * transition. VMM is responsible for flushing on shared->private.
>> -     *
>> -     * The VMM _can't_ flush private addresses as it can't generate PAs
>> -     * with the guest's HKID.  Shared memory isn't subject to integrity
>> -     * checking, i.e. the VMM doesn't need to flush for its own
>> protection.
>> -     *
>> -     * There's no need to flush when converting from shared to private,
>> -     * as flushing is the VMM's responsibility in this case, e.g. it must
>> -     * flush to avoid integrity failures in the face of a buggy or
>> -     * malicious guest.
>> -     */
>> -    return !private;
>> -}
>> -
>>   static bool tdx_cache_flush_required(void)
>>   {
>>       /*
>> @@ -776,30 +758,6 @@ static bool tdx_enc_status_changed(unsigned long
>> vaddr, int numpages, bool enc)
>>       return true;
>>   }
>> -static bool tdx_enc_status_change_prepare(unsigned long vaddr, int
>> numpages,
>> -                      bool enc)
>> -{
>> -    /*
>> -     * Only handle shared->private conversion here.
>> -     * See the comment in tdx_early_init().
>> -     */
>> -    if (enc)
>> -        return tdx_enc_status_changed(vaddr, numpages, enc);
>> -    return true;
>> -}
>> -
>> -static bool tdx_enc_status_change_finish(unsigned long vaddr, int
>> numpages,
>> -                     bool enc)
>> -{
>> -    /*
>> -     * Only handle private->shared conversion here.
>> -     * See the comment in tdx_early_init().
>> -     */
>> -    if (!enc)
>> -        return tdx_enc_status_changed(vaddr, numpages, enc);
>> -    return true;
>> -}
>> -
>>   void __init tdx_early_init(void)
>>   {
>>       struct tdx_module_args args = {
>> @@ -831,30 +789,8 @@ void __init tdx_early_init(void)
>>        */
>>       physical_mask &= cc_mask - 1;
>> -    /*
>> -     * The kernel mapping should match the TDX metadata for the page.
>> -     * load_unaligned_zeropad() can touch memory *adjacent* to that
>> which is
>> -     * owned by the caller and can catch even _momentary_ mismatches.  Bad
>> -     * things happen on mismatch:
>> -     *
>> -     *   - Private mapping => Shared Page  == Guest shutdown
>> -         *   - Shared mapping  => Private Page == Recoverable #VE
>> -     *
>> -     * guest.enc_status_change_prepare() converts the page from
>> -     * shared=>private before the mapping becomes private.
>> -     *
>> -     * guest.enc_status_change_finish() converts the page from
>> -     * private=>shared after the mapping becomes private.
>> -     *
>> -     * In both cases there is a temporary shared mapping to a private
>> page,
>> -     * which can result in a #VE.  But, there is never a private
>> mapping to
>> -     * a shared page.
>> -     */
>> -    x86_platform.guest.enc_status_change_prepare =
>> tdx_enc_status_change_prepare;
>> -    x86_platform.guest.enc_status_change_finish  =
>> tdx_enc_status_change_finish;
>> -
>> +    x86_platform.guest.enc_status_change_finish  = tdx_enc_status_changed;
>>       x86_platform.guest.enc_cache_flush_required  =
>> tdx_cache_flush_required;
>> -    x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
>>       /*
>>        * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
>> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
>> index 084fab6..fbe2585 100644
>> --- a/arch/x86/hyperv/ivm.c
>> +++ b/arch/x86/hyperv/ivm.c
>> @@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned
>> long kbuffer, int pagecount, bo
>>       return result;
>>   }
>> -static bool hv_vtom_tlb_flush_required(bool private)
>> -{
>> -    return true;
>> -}
>> -
>>   static bool hv_vtom_cache_flush_required(void)
>>   {
>>       return false;
>> @@ -614,7 +609,6 @@ void __init hv_vtom_init(void)
>>       x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
>>       x86_platform.guest.enc_cache_flush_required =
>> hv_vtom_cache_flush_required;
>> -    x86_platform.guest.enc_tlb_flush_required =
>> hv_vtom_tlb_flush_required;
>>       x86_platform.guest.enc_status_change_finish =
>> hv_vtom_set_host_visibility;
>>       /* Set WB as the default cache mode. */
>> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
>> index a37ebd3..cf5179b 100644
>> --- a/arch/x86/kernel/x86_init.c
>> +++ b/arch/x86/kernel/x86_init.c
>> @@ -131,9 +131,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
>>   static void default_nmi_init(void) { };
>> -static bool enc_status_change_prepare_noop(unsigned long vaddr, int
>> npages, bool enc) { return true; }
>>   static bool enc_status_change_finish_noop(unsigned long vaddr, int
>> npages, bool enc) { return true; }
>> -static bool enc_tlb_flush_required_noop(bool enc) { return false; }
>>   static bool enc_cache_flush_required_noop(void) { return false; }
>>   static bool is_private_mmio_noop(u64 addr) {return false; }
>> @@ -154,9 +152,7 @@ struct x86_platform_ops x86_platform __ro_after_init
>> = {
>>       .hyper.is_private_mmio        = is_private_mmio_noop,
>>       .guest = {
>> -        .enc_status_change_prepare = enc_status_change_prepare_noop,
>>           .enc_status_change_finish  = enc_status_change_finish_noop,
>> -        .enc_tlb_flush_required       = enc_tlb_flush_required_noop,
>>           .enc_cache_flush_required  = enc_cache_flush_required_noop,
>>       },
>>   };
>> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
>> index 6faea41..06960ba 100644
>> --- a/arch/x86/mm/mem_encrypt_amd.c
>> +++ b/arch/x86/mm/mem_encrypt_amd.c
>> @@ -278,11 +278,6 @@ static unsigned long pg_level_to_pfn(int level,
>> pte_t *kpte, pgprot_t *ret_prot)
>>       return pfn;
>>   }
>> -static bool amd_enc_tlb_flush_required(bool enc)
>> -{
>> -    return true;
>> -}
>> -
>>   static bool amd_enc_cache_flush_required(void)
>>   {
>>       return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
>> @@ -318,18 +313,6 @@ static void enc_dec_hypercall(unsigned long vaddr,
>> unsigned long size, bool enc)
>>   #endif
>>   }
>> -static bool amd_enc_status_change_prepare(unsigned long vaddr, int
>> npages, bool enc)
>> -{
>> -    /*
>> -     * To maintain the security guarantees of SEV-SNP guests, make sure
>> -     * to invalidate the memory before encryption attribute is cleared.
>> -     */
>> -    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
>> -        snp_set_memory_shared(vaddr, npages);
>> -
>> -    return true;
>> -}
>> -
>>   /* Return true unconditionally: return value doesn't matter for the
>> SEV side */
>>   static bool amd_enc_status_change_finish(unsigned long vaddr, int
>> npages, bool enc)
>>   {
>> @@ -337,8 +320,12 @@ static bool amd_enc_status_change_finish(unsigned
>> long vaddr, int npages, bool e
>>        * After memory is mapped encrypted in the page table, validate it
>>        * so that it is consistent with the page table updates.
>>        */
>> -    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && enc)
>> -        snp_set_memory_private(vaddr, npages);
>> +    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
>> +        if (enc)
>> +            snp_set_memory_private(vaddr, npages);
>> +        else
>> +            snp_set_memory_shared(vaddr, npages);
>> +    }
>
> These calls will both result in a PVALIDATE being issued (either before or
> after the page state change to the hypervisor) using the virtual address,
> which will trigger a #PF if the present bit isn't set.
>
>>       if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
>>           enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
>> @@ -498,9 +485,7 @@ void __init sme_early_init(void)
>>       /* Update the protection map with memory encryption mask */
>>       add_encrypt_protection_map();
>> -    x86_platform.guest.enc_status_change_prepare =
>> amd_enc_status_change_prepare;
>>       x86_platform.guest.enc_status_change_finish  =
>> amd_enc_status_change_finish;
>> -    x86_platform.guest.enc_tlb_flush_required    =
>> amd_enc_tlb_flush_required;
>>       x86_platform.guest.enc_cache_flush_required  =
>> amd_enc_cache_flush_required;
>>       /*
>> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
>> index d7ef8d3..d062e01 100644
>> --- a/arch/x86/mm/pat/set_memory.c
>> +++ b/arch/x86/mm/pat/set_memory.c
>> @@ -2147,40 +2147,57 @@ static int __set_memory_enc_pgtable(unsigned
>> long addr, int numpages, bool enc)
>>       memset(&cpa, 0, sizeof(cpa));
>>       cpa.vaddr = &addr;
>>       cpa.numpages = numpages;
>> +
>> +    /*
>> +     * The caller must ensure that the memory being transitioned between
>> +     * encrypted and decrypted is not being accessed.  But if
>> +     * load_unaligned_zeropad() touches the "next" page, it may generate a
>> +     * read access the caller has no control over. To ensure such accesses
>> +     * cause a normal page fault for the load_unaligned_zeropad() handler,
>> +     * mark the pages not present until the transition is complete.  We
>> +     * don't want a #VE or #VC fault due to a mismatch in the memory
>> +     * encryption status, since paravisor configurations can't cleanly do
>> +     * the load_unaligned_zeropad() handling in the paravisor.
>> +     *
>> +     * There's no requirement to do so, but for efficiency we can clear
>> +     * _PAGE_PRESENT and set/clr encryption attr as a single operation.
>> +     */
>>       cpa.mask_set = enc ? pgprot_encrypted(empty) :
>> pgprot_decrypted(empty);
>> -    cpa.mask_clr = enc ? pgprot_decrypted(empty) :
>> pgprot_encrypted(empty);
>> +    cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
>> +                pgprot_encrypted(__pgprot(_PAGE_PRESENT));
>
> This should be lined up with the pgprot_decrypted above, e.g.:
>
> cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
>              pgprot_encrypted(__pgprot(_PAGE_PRESENT));
>
> or
>
> cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT))
>            : pgprot_encrypted(__pgprot(_PAGE_PRESENT));
>
>>       cpa.pgd = init_mm.pgd;
>>       /* Must avoid aliasing mappings in the highmem code */
>>       kmap_flush_unused();
>>       vm_unmap_aliases();
>> -    /* Flush the caches as needed before changing the encryption
>> attribute. */
>> -    if (x86_platform.guest.enc_tlb_flush_required(enc))
>> -        cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
>> -
>> -    /* Notify hypervisor that we are about to set/clr encryption
>> attribute. */
>> -    if (!x86_platform.guest.enc_status_change_prepare(addr, numpages,
>> enc))
>> -        return -EIO;
>> +    /* Flush the caches as needed before changing the encryption attr. */
>> +    if (x86_platform.guest.enc_cache_flush_required())
>> +        cpa_flush(&cpa, 1);
>>       ret = __change_page_attr_set_clr(&cpa, 1);
>> +    if (ret)
>> +        return ret;
>>       /*
>> -     * After changing the encryption attribute, we need to flush TLBs
>> again
>> -     * in case any speculative TLB caching occurred (but no need to flush
>> -     * caches again).  We could just use cpa_flush_all(), but in case TLB
>> -     * flushing gets optimized in the cpa_flush() path use the same logic
>> -     * as above.
>> +     * After clearing _PAGE_PRESENT and changing the encryption attribute,
>> +     * we need to flush TLBs to ensure no further accesses to the
>> memory can
>> +     * be made with the old encryption attribute (but no need to flush
>> caches
>> +     * again).  We could just use cpa_flush_all(), but in case TLB
>> flushing
>> +     * gets optimized in the cpa_flush() path use the same logic as above.
>>        */
>>       cpa_flush(&cpa, 0);
>> -    /* Notify hypervisor that we have successfully set/clr encryption
>> attribute. */
>> -    if (!ret) {
>> -        if (!x86_platform.guest.enc_status_change_finish(addr,
>> numpages, enc))
>> -            ret = -EIO;
>> -    }
>> +    /* Notify hypervisor that we have successfully set/clr encryption
>> attr. */
>> +    if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
>> +        return -EIO;
>
> Here's where the #PF is likely to be triggered.
>
> Thanks,
> Tom
>
>> -    return ret;
>> +    /*
>> +     * Now that the hypervisor is sync'ed with the page table changes
>> +     * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
>> +     * the TLB.
>> +     */
>> +    return set_memory_p(&addr, numpages);
>>   }
>>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool
>> enc)

2023-10-02 21:58:40

by Michael Kelley (LINUX)

Subject: RE: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

From: Tom Lendacky <[email protected]> Sent: Monday, October 2, 2023 11:59 AM
>
> On 10/2/23 11:35, Tom Lendacky wrote:
> > On 9/29/23 13:19, Michael Kelley wrote:
> >> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> >> versa, attributes in the PTE must be updated *and* the hypervisor must
> >> be notified of the change. Because there are two separate steps, there's
> >> a window where the settings are inconsistent.  Normally the code that
> >> initiates the transition (via set_memory_decrypted() or
> >> set_memory_encrypted()) ensures that the memory is not being accessed
> >> during a transition, so the window of inconsistency is not a problem.
> >> However, the load_unaligned_zeropad() function can read arbitrary memory
> >> pages at arbitrary times, which could access a transitioning page during
> >> the window.  In such a case, CoCo VM specific exceptions are taken
> >> (depending on the CoCo architecture in use).  Current code in those
> >> exception handlers recovers and does "fixup" on the result returned by
> >> load_unaligned_zeropad().  Unfortunately, this exception handling can't
> >> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
> >> The exceptions need to be forwarded from the paravisor to the Linux
> >> guest, but there are no architectural specs for how to do that.
> >>
> >> Fortunately, there's a simpler way to solve the problem by changing
> >> the core transition code in __set_memory_enc_pgtable() to do the
> >> following:
> >>
> >> 1.  Remove aliasing mappings
> >> 2.  Flush the data cache if needed
> >> 3.  Remove the PRESENT bit from the PTEs of all transitioning pages
> >> 4.  Set/clear the encryption attribute as appropriate
> >> 5.  Flush the TLB so the changed encryption attribute isn't visible
> >> 6.  Notify the hypervisor of the encryption status change
> >
> > Not sure why I didn't notice this before, but I will need to test this to
> > be certain. As part of this notification, the SNP support will issue a
> > PVALIDATE instruction (to either validate or rescind validation to the
> > page). PVALIDATE takes a virtual address. If the PRESENT bit has been
> > removed, the PVALIDATE instruction will take a #PF (see comments below).
>
> Yes, this series results in a #PF booting an SNP guest:

Bummer :-( Interestingly, an SNP guest on Hyper-V with a paravisor
works, presumably because the paravisor is doing the PVALIDATE with
a different guest virtual address for the physical page. TDX operates
on physical addresses, and I've tested that it works.

The spec for PVALIDATE says it performs the same segmentation
and paging checks as a 1-byte read, so indeed, the #PF is expected.
But in the spec, the pseudo-code for PVALIDATE uses the GUEST_VA
only to derive the GUEST_PA and the SYSTEM_PA. The GUEST_VA isn't
otherwise relevant, so any GUEST_VA that validly maps to the GUEST_PA
would work, as long as we can be assured that load_unaligned_zeropad()
won't touch that GUEST_VA.

Let me think about whether there's a not-too-hacky way to make this
work with some temporary GVA.
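
Something along these lines, perhaps (completely untested sketch, and
assuming the pvalidate() helper and RMP_PG_SIZE_4K from <asm/sev.h>
can be used this way):

#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <asm/sev.h>

/*
 * Untested sketch of the "temporary GVA" idea, not part of this series:
 * map the page at a throwaway kernel virtual address, run PVALIDATE
 * against that address, then tear the mapping down.  The primary
 * mapping can stay not-present, and load_unaligned_zeropad() never
 * touches the transient one.  2M pages and error details are ignored.
 */
static int snp_pvalidate_via_temp_gva(struct page *page, bool validate)
{
        void *vaddr;
        int rc;

        vaddr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
        if (!vaddr)
                return -ENOMEM;

        rc = pvalidate((unsigned long)vaddr, RMP_PG_SIZE_4K, validate);

        vunmap(vaddr);
        return rc;
}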

Michael

>
> [ 0.807735] BUG: unable to handle page fault for address: ffff88803bef7000
> [ 0.807829] #PF: supervisor read access in kernel mode
> [ 0.807829] #PF: error_code(0x0000) - not-present page
> [ 0.807829] PGD 8000004c01067 P4D 8000004c01067 PUD 8000004c02067 PMD
> 80001001f8063 PTE 8007ffffc4108062
> [ 0.807829] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 0.807829] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-rc3-sos-testing #1
> [ 0.807829] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown
> 2/2/2022
> [ 0.807829] RIP: 0010:pvalidate_pages+0x99/0x140
> [ 0.807829] Code: 48 09 ca 0f b6 4c c7 0f 48 09 d6 0f b6 54 c7 0e 48 c1 e6 0c 4c 21
> de 83 e2 f0 80 fa 10 4a 8d 2c 16 0f 94 c2 48 89 e8 83 e1 01 <f2> 0f 01 ff 41 89 c4 72 6
> 1 83 f8 06 75 45 84 c9 74 41 48 01 de 48
> [ 0.807829] RSP: 0000:ffffffff82803bd0 EFLAGS: 00010246
> [ 0.807829] RAX: ffff88803bef7000 RBX: ffff888000200000 RCX:
> 0000000000000000
> [ 0.807829] RDX: 0000000000000000 RSI: 000000003bef7000 RDI: ffffffff82803c40
> [ 0.807829] RBP: ffff88803bef7000 R08: 0000000000000000 R09:
> 0000000000000000
> [ 0.807829] R10: ffff888000000000 R11: 000000ffffffffff R12: 0000000000000040
> [ 0.807829] R13: 0000000000000020 R14: ffffffff82803e48 R15: ffff88803bf37000
> [ 0.807829] FS: 0000000000000000(0000) GS:ffff88846fc00000(0000)
> knlGS:0000000000000000
> [ 0.807829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.807829] CR2: ffff88803bef7000 CR3: 000800000382e000 CR4:
> 00000000003506f0
> [ 0.807829] Call Trace:
> [ 0.807829] <TASK>
> [ 0.807829] ? __die+0x1f/0x70
> [ 0.807829] ? page_fault_oops+0x81/0x150
> [ 0.807829] ? srso_return_thunk+0x5/0x5f
> [ 0.807829] ? kernelmode_fixup_or_oops+0x84/0x110
> [ 0.807829] ? exc_page_fault+0xa8/0x150
> [ 0.807829] ? asm_exc_page_fault+0x22/0x30
> [ 0.807829] ? pvalidate_pages+0x99/0x140
> [ 0.807829] __set_pages_state+0x280/0x2b0
> [ 0.807829] set_pages_state+0x4e/0xa0
> [ 0.807829] amd_enc_status_change_finish+0x4a/0x80
> [ 0.807829] __set_memory_enc_dec+0xe1/0x190
> [ 0.807829] mem_encrypt_init+0x15/0xc0
> [ 0.807829] start_kernel+0x31b/0x5e0
> [ 0.807829] x86_64_start_reservations+0x14/0x30
> [ 0.807829] x86_64_start_kernel+0x79/0x80
> [ 0.807829] secondary_startup_64_no_verify+0x16b/0x16b
> [ 0.807829] </TASK>
> [ 0.807829] Modules linked in:
> [ 0.807829] CR2: ffff88803bef7000
> [ 0.807829] ---[ end trace 0000000000000000 ]---
> [ 0.807829] RIP: 0010:pvalidate_pages+0x99/0x140
> [ 0.807829] Code: 48 09 ca 0f b6 4c c7 0f 48 09 d6 0f b6 54 c7 0e 48 c1 e6 0c 4c 21
> de 83 e2 f0 80 fa 10 4a 8d 2c 16 0f 94 c2 48 89 e8 83 e1 01 <f2> 0f 01 ff 41 89 c4 72 6
> 1 83 f8 06 75 45 84 c9 74 41 48 01 de 48
> [ 0.807829] RSP: 0000:ffffffff82803bd0 EFLAGS: 00010246
> [ 0.807829] RAX: ffff88803bef7000 RBX: ffff888000200000 RCX:
> 0000000000000000
> [ 0.807829] RDX: 0000000000000000 RSI: 000000003bef7000 RDI: ffffffff82803c40
> [ 0.807829] RBP: ffff88803bef7000 R08: 0000000000000000 R09:
> 0000000000000000
> [ 0.807829] R10: ffff888000000000 R11: 000000ffffffffff R12: 0000000000000040
> [ 0.807829] R13: 0000000000000020 R14: ffffffff82803e48 R15: ffff88803bf37000
> [ 0.807829] FS: 0000000000000000(0000) GS:ffff88846fc00000(0000)
> knlGS:0000000000000000
> [ 0.807829] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.807829] CR2: ffff88803bef7000 CR3: 000800000382e000 CR4:
> 00000000003506f0
> [ 0.807829] Kernel panic - not syncing: Fatal exception
> [ 0.807829] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
>
> Thanks,
> Tom
>
> >
> >> 7.  Add back the PRESENT bit, making the changed attribute visible
> >>
> >> With this approach, load_unaligned_zeropad() just takes its normal
> >> page-fault-based fixup path if it touches a page that is transitioning.
> >> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> >> are completely decoupled.  CoCo VM page transitions can proceed
> >> without needing to handle architecture-specific exceptions and fix
> >> things up. This decoupling reduces the complexity due to separate
> >> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> >> introduce new capabilities in future versions of the TDX and SEV-SNP
> >> architectures. Paravisor scenarios work properly without needing
> >> to forward exceptions.
> >>
> >> With this approach, the order of updating the guest PTEs and
> >> notifying the hypervisor doesn't matter. As such, only a single
> >> hypervisor callback is needed, rather than one before and one after
> >> the PTE update. Simplify the code by eliminating the extra
> >> hypervisor callback along with the TDX and SEV-SNP code that
> >> handles the before and after cases. The TLB flush callback is
> >> also no longer required and is removed.
> >>
> >> Signed-off-by: Michael Kelley <[email protected]>
> >> ---
> >>   arch/x86/coco/tdx/tdx.c       | 66
> >> +------------------------------------------
> >>   arch/x86/hyperv/ivm.c         |  6 ----
> >>   arch/x86/kernel/x86_init.c    |  4 ---
> >>   arch/x86/mm/mem_encrypt_amd.c | 27 ++++--------------
> >>   arch/x86/mm/pat/set_memory.c  | 55 +++++++++++++++++++++++-------------
> >>   5 files changed, 43 insertions(+), 115 deletions(-)
> >>
> >> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> >> index 3e6dbd2..1bb2fff 100644
> >> --- a/arch/x86/coco/tdx/tdx.c
> >> +++ b/arch/x86/coco/tdx/tdx.c
> >> @@ -676,24 +676,6 @@ bool tdx_handle_virt_exception(struct pt_regs
> >> *regs, struct ve_info *ve)
> >>       return true;
> >>   }
> >> -static bool tdx_tlb_flush_required(bool private)
> >> -{
> >> -    /*
> >> -     * TDX guest is responsible for flushing TLB on private->shared
> >> -     * transition. VMM is responsible for flushing on shared->private.
> >> -     *
> >> -     * The VMM _can't_ flush private addresses as it can't generate PAs
> >> -     * with the guest's HKID.  Shared memory isn't subject to integrity
> >> -     * checking, i.e. the VMM doesn't need to flush for its own
> >> protection.
> >> -     *
> >> -     * There's no need to flush when converting from shared to private,
> >> -     * as flushing is the VMM's responsibility in this case, e.g. it must
> >> -     * flush to avoid integrity failures in the face of a buggy or
> >> -     * malicious guest.
> >> -     */
> >> -    return !private;
> >> -}
> >> -
> >>   static bool tdx_cache_flush_required(void)
> >>   {
> >>       /*
> >> @@ -776,30 +758,6 @@ static bool tdx_enc_status_changed(unsigned long
> >> vaddr, int numpages, bool enc)
> >>       return true;
> >>   }
> >> -static bool tdx_enc_status_change_prepare(unsigned long vaddr, int
> >> numpages,
> >> -                      bool enc)
> >> -{
> >> -    /*
> >> -     * Only handle shared->private conversion here.
> >> -     * See the comment in tdx_early_init().
> >> -     */
> >> -    if (enc)
> >> -        return tdx_enc_status_changed(vaddr, numpages, enc);
> >> -    return true;
> >> -}
> >> -
> >> -static bool tdx_enc_status_change_finish(unsigned long vaddr, int
> >> numpages,
> >> -                     bool enc)
> >> -{
> >> -    /*
> >> -     * Only handle private->shared conversion here.
> >> -     * See the comment in tdx_early_init().
> >> -     */
> >> -    if (!enc)
> >> -        return tdx_enc_status_changed(vaddr, numpages, enc);
> >> -    return true;
> >> -}
> >> -
> >>   void __init tdx_early_init(void)
> >>   {
> >>       struct tdx_module_args args = {
> >> @@ -831,30 +789,8 @@ void __init tdx_early_init(void)
> >>        */
> >>       physical_mask &= cc_mask - 1;
> >> -    /*
> >> -     * The kernel mapping should match the TDX metadata for the page.
> >> -     * load_unaligned_zeropad() can touch memory *adjacent* to that which is
> >> -     * owned by the caller and can catch even _momentary_ mismatches.  Bad
> >> -     * things happen on mismatch:
> >> -     *
> >> -     *   - Private mapping => Shared Page  == Guest shutdown
> >> -     *   - Shared mapping  => Private Page == Recoverable #VE
> >> -     *
> >> -     * guest.enc_status_change_prepare() converts the page from
> >> -     * shared=>private before the mapping becomes private.
> >> -     *
> >> -     * guest.enc_status_change_finish() converts the page from
> >> -     * private=>shared after the mapping becomes private.
> >> -     *
> >> -     * In both cases there is a temporary shared mapping to a private page,
> >> -     * which can result in a #VE.  But, there is never a private mapping to
> >> -     * a shared page.
> >> -     */
> >> -    x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
> >> -    x86_platform.guest.enc_status_change_finish  = tdx_enc_status_change_finish;
> >> -
> >> +    x86_platform.guest.enc_status_change_finish  = tdx_enc_status_changed;
> >>       x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
> >> -    x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
> >>       /*
> >>        * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
> >> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> >> index 084fab6..fbe2585 100644
> >> --- a/arch/x86/hyperv/ivm.c
> >> +++ b/arch/x86/hyperv/ivm.c
> >> @@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> >>       return result;
> >>   }
> >> -static bool hv_vtom_tlb_flush_required(bool private)
> >> -{
> >> -    return true;
> >> -}
> >> -
> >>   static bool hv_vtom_cache_flush_required(void)
> >>   {
> >>       return false;
> >> @@ -614,7 +609,6 @@ void __init hv_vtom_init(void)
> >>       x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
> >>       x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
> >> -    x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
> >>       x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
> >>       /* Set WB as the default cache mode. */
> >> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> >> index a37ebd3..cf5179b 100644
> >> --- a/arch/x86/kernel/x86_init.c
> >> +++ b/arch/x86/kernel/x86_init.c
> >> @@ -131,9 +131,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
> >>   static void default_nmi_init(void) { };
> >> -static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
> >>   static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
> >> -static bool enc_tlb_flush_required_noop(bool enc) { return false; }
> >>   static bool enc_cache_flush_required_noop(void) { return false; }
> >>   static bool is_private_mmio_noop(u64 addr) {return false; }
> >> @@ -154,9 +152,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
> >>       .hyper.is_private_mmio        = is_private_mmio_noop,
> >>       .guest = {
> >> -        .enc_status_change_prepare = enc_status_change_prepare_noop,
> >>           .enc_status_change_finish  = enc_status_change_finish_noop,
> >> -        .enc_tlb_flush_required       = enc_tlb_flush_required_noop,
> >>           .enc_cache_flush_required  = enc_cache_flush_required_noop,
> >>       },
> >>   };
> >> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
> >> index 6faea41..06960ba 100644
> >> --- a/arch/x86/mm/mem_encrypt_amd.c
> >> +++ b/arch/x86/mm/mem_encrypt_amd.c
> >> @@ -278,11 +278,6 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
> >>       return pfn;
> >>   }
> >> -static bool amd_enc_tlb_flush_required(bool enc)
> >> -{
> >> -    return true;
> >> -}
> >> -
> >>   static bool amd_enc_cache_flush_required(void)
> >>   {
> >>       return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
> >> @@ -318,18 +313,6 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
> >>   #endif
> >>   }
> >> -static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
> >> -{
> >> -    /*
> >> -     * To maintain the security guarantees of SEV-SNP guests, make sure
> >> -     * to invalidate the memory before encryption attribute is cleared.
> >> -     */
> >> -    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
> >> -        snp_set_memory_shared(vaddr, npages);
> >> -
> >> -    return true;
> >> -}
> >> -
> >>   /* Return true unconditionally: return value doesn't matter for the SEV side */
> >>   static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
> >>   {
> >> @@ -337,8 +320,12 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
> >>        * After memory is mapped encrypted in the page table, validate it
> >>        * so that it is consistent with the page table updates.
> >>        */
> >> -    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && enc)
> >> -        snp_set_memory_private(vaddr, npages);
> >> +    if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
> >> +        if (enc)
> >> +            snp_set_memory_private(vaddr, npages);
> >> +        else
> >> +            snp_set_memory_shared(vaddr, npages);
> >> +    }
> >
> > These calls will both result in a PVALIDATE being issued (either before or
> > after the page state change to the hypervisor) using the virtual address,
> > which will trigger a #PF if the present bit isn't set.
> >
> >>       if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> >>           enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
> >> @@ -498,9 +485,7 @@ void __init sme_early_init(void)
> >>       /* Update the protection map with memory encryption mask */
> >>       add_encrypt_protection_map();
> >> -    x86_platform.guest.enc_status_change_prepare = amd_enc_status_change_prepare;
> >>       x86_platform.guest.enc_status_change_finish  = amd_enc_status_change_finish;
> >> -    x86_platform.guest.enc_tlb_flush_required    = amd_enc_tlb_flush_required;
> >>       x86_platform.guest.enc_cache_flush_required  = amd_enc_cache_flush_required;
> >>       /*
> >> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> >> index d7ef8d3..d062e01 100644
> >> --- a/arch/x86/mm/pat/set_memory.c
> >> +++ b/arch/x86/mm/pat/set_memory.c
> >> @@ -2147,40 +2147,57 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> >>       memset(&cpa, 0, sizeof(cpa));
> >>       cpa.vaddr = &addr;
> >>       cpa.numpages = numpages;
> >> +
> >> +    /*
> >> +     * The caller must ensure that the memory being transitioned between
> >> +     * encrypted and decrypted is not being accessed.  But if
> >> +     * load_unaligned_zeropad() touches the "next" page, it may generate a
> >> +     * read access the caller has no control over. To ensure such accesses
> >> +     * cause a normal page fault for the load_unaligned_zeropad() handler,
> >> +     * mark the pages not present until the transition is complete.  We
> >> +     * don't want a #VE or #VC fault due to a mismatch in the memory
> >> +     * encryption status, since paravisor configurations can't cleanly do
> >> +     * the load_unaligned_zeropad() handling in the paravisor.
> >> +     *
> >> +     * There's no requirement to do so, but for efficiency we can clear
> >> +     * _PAGE_PRESENT and set/clr encryption attr as a single operation.
> >> +     */
> >>       cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty);
> >> -    cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> >> +    cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
> >> +                pgprot_encrypted(__pgprot(_PAGE_PRESENT));
> >
> > This should be lined up with the pgprot_decrypted above, e.g.:
> >
> > cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
> >              pgprot_encrypted(__pgprot(_PAGE_PRESENT));
> >
> > or
> >
> > cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT))
> >            : pgprot_encrypted(__pgprot(_PAGE_PRESENT));
> >
> >>       cpa.pgd = init_mm.pgd;
> >>       /* Must avoid aliasing mappings in the highmem code */
> >>       kmap_flush_unused();
> >>       vm_unmap_aliases();
> >> -    /* Flush the caches as needed before changing the encryption attribute. */
> >> -    if (x86_platform.guest.enc_tlb_flush_required(enc))
> >> -        cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> >> -
> >> -    /* Notify hypervisor that we are about to set/clr encryption attribute. */
> >> -    if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> >> -        return -EIO;
> >> +    /* Flush the caches as needed before changing the encryption attr. */
> >> +    if (x86_platform.guest.enc_cache_flush_required())
> >> +        cpa_flush(&cpa, 1);
> >>       ret = __change_page_attr_set_clr(&cpa, 1);
> >> +    if (ret)
> >> +        return ret;
> >>       /*
> >> -     * After changing the encryption attribute, we need to flush TLBs again
> >> -     * in case any speculative TLB caching occurred (but no need to flush
> >> -     * caches again).  We could just use cpa_flush_all(), but in case TLB
> >> -     * flushing gets optimized in the cpa_flush() path use the same logic
> >> -     * as above.
> >> +     * After clearing _PAGE_PRESENT and changing the encryption attribute,
> >> +     * we need to flush TLBs to ensure no further accesses to the memory can
> >> +     * be made with the old encryption attribute (but no need to flush caches
> >> +     * again).  We could just use cpa_flush_all(), but in case TLB flushing
> >> +     * gets optimized in the cpa_flush() path use the same logic as above.
> >>        */
> >>       cpa_flush(&cpa, 0);
> >> -    /* Notify hypervisor that we have successfully set/clr encryption attribute. */
> >> -    if (!ret) {
> >> -        if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> >> -            ret = -EIO;
> >> -    }
> >> +    /* Notify hypervisor that we have successfully set/clr encryption attr. */
> >> +    if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> >> +        return -EIO;
> >
> > Here's where the #PF is likely to be triggered.
> >
> > Thanks,
> > Tom
> >
> >> -    return ret;
> >> +    /*
> >> +     * Now that the hypervisor is sync'ed with the page table changes
> >> +     * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
> >> +     * the TLB.
> >> +     */
> >> +    return set_memory_p(&addr, numpages);
> >>   }
> >>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

2023-10-02 22:25:59

by Tom Lendacky

[permalink] [raw]
Subject: Re: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

On 9/29/23 13:19, Michael Kelley wrote:
> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change. Because there are two separate steps, there's
> a window where the settings are inconsistent. Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could access a transitioning page during
> the window. In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use). Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad(). Unfortunately, this exception handling can't
> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
> The exceptions need to be forwarded from the paravisor to the Linux
> guest, but there are no architectural specs for how to do that.
>
> Fortunately, there's a simpler way to solve the problem by changing
> the core transition code in __set_memory_enc_pgtable() to do the
> following:
>
> 1. Remove aliasing mappings
> 2. Flush the data cache if needed
> 3. Remove the PRESENT bit from the PTEs of all transitioning pages
> 4. Set/clear the encryption attribute as appropriate
> 5. Flush the TLB so the changed encryption attribute isn't visible
> 6. Notify the hypervisor of the encryption status change

Not sure why I didn't notice this before, but I will need to test this to
be certain. As part of this notification, the SNP support will issue a
PVALIDATE instruction (to either validate or rescind validation to the
page). PVALIDATE takes a virtual address. If the PRESENT bit has been
removed, the PVALIDATE instruction will take a #PF (see comments below).
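
For reference, here is a rough sketch of a PVALIDATE wrapper (modeled on the
kernel's pvalidate() helper in arch/x86/include/asm/sev.h, simplified and from
memory, so treat the details as approximate). The guest virtual address is
passed in rAX, so the instruction is subject to the same paging checks as a
1-byte read and will #PF if the leaf PTE is not present:

/* Sketch only: the real helper also propagates the carry flag, which
 * indicates that no RMP entry was updated.
 */
static inline int pvalidate_sketch(unsigned long vaddr, bool rmp_psize, bool validate)
{
	int rc;

	/* Raw opcode bytes for PVALIDATE: rAX = vaddr, rCX = page size, rDX = validate */
	asm volatile(".byte 0xF2, 0x0F, 0x01, 0xFF"
		     : "=a" (rc)
		     : "a" (vaddr), "c" (rmp_psize), "d" (validate)
		     : "memory", "cc");

	return rc;
}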

> 7. Add back the PRESENT bit, making the changed attribute visible
>
> With this approach, load_unaligned_zeropad() just takes its normal
> page-fault-based fixup path if it touches a page that is transitioning.
> As a result, load_unaligned_zeropad() and CoCo VM page transitioning
> are completely decoupled. CoCo VM page transitions can proceed
> without needing to handle architecture-specific exceptions and fix
> things up. This decoupling reduces the complexity due to separate
> TDX and SEV-SNP fixup paths, and gives more freedom to revise and
> introduce new capabilities in future versions of the TDX and SEV-SNP
> architectures. Paravisor scenarios work properly without needing
> to forward exceptions.
>
> With this approach, the order of updating the guest PTEs and
> notifying the hypervisor doesn't matter. As such, only a single
> hypervisor callback is needed, rather than one before and one after
> the PTE update. Simplify the code by eliminating the extra
> hypervisor callback along with the TDX and SEV-SNP code that
> handles the before and after cases. The TLB flush callback is
> also no longer required and is removed.
>
> Signed-off-by: Michael Kelley <[email protected]>
> ---
> arch/x86/coco/tdx/tdx.c | 66 +------------------------------------------
> arch/x86/hyperv/ivm.c | 6 ----
> arch/x86/kernel/x86_init.c | 4 ---
> arch/x86/mm/mem_encrypt_amd.c | 27 ++++--------------
> arch/x86/mm/pat/set_memory.c | 55 +++++++++++++++++++++++-------------
> 5 files changed, 43 insertions(+), 115 deletions(-)
>
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 3e6dbd2..1bb2fff 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -676,24 +676,6 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
> return true;
> }
>
> -static bool tdx_tlb_flush_required(bool private)
> -{
> - /*
> - * TDX guest is responsible for flushing TLB on private->shared
> - * transition. VMM is responsible for flushing on shared->private.
> - *
> - * The VMM _can't_ flush private addresses as it can't generate PAs
> - * with the guest's HKID. Shared memory isn't subject to integrity
> - * checking, i.e. the VMM doesn't need to flush for its own protection.
> - *
> - * There's no need to flush when converting from shared to private,
> - * as flushing is the VMM's responsibility in this case, e.g. it must
> - * flush to avoid integrity failures in the face of a buggy or
> - * malicious guest.
> - */
> - return !private;
> -}
> -
> static bool tdx_cache_flush_required(void)
> {
> /*
> @@ -776,30 +758,6 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> return true;
> }
>
> -static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
> - bool enc)
> -{
> - /*
> - * Only handle shared->private conversion here.
> - * See the comment in tdx_early_init().
> - */
> - if (enc)
> - return tdx_enc_status_changed(vaddr, numpages, enc);
> - return true;
> -}
> -
> -static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
> - bool enc)
> -{
> - /*
> - * Only handle private->shared conversion here.
> - * See the comment in tdx_early_init().
> - */
> - if (!enc)
> - return tdx_enc_status_changed(vaddr, numpages, enc);
> - return true;
> -}
> -
> void __init tdx_early_init(void)
> {
> struct tdx_module_args args = {
> @@ -831,30 +789,8 @@ void __init tdx_early_init(void)
> */
> physical_mask &= cc_mask - 1;
>
> - /*
> - * The kernel mapping should match the TDX metadata for the page.
> - * load_unaligned_zeropad() can touch memory *adjacent* to that which is
> - * owned by the caller and can catch even _momentary_ mismatches. Bad
> - * things happen on mismatch:
> - *
> - * - Private mapping => Shared Page == Guest shutdown
> - * - Shared mapping => Private Page == Recoverable #VE
> - *
> - * guest.enc_status_change_prepare() converts the page from
> - * shared=>private before the mapping becomes private.
> - *
> - * guest.enc_status_change_finish() converts the page from
> - * private=>shared after the mapping becomes private.
> - *
> - * In both cases there is a temporary shared mapping to a private page,
> - * which can result in a #VE. But, there is never a private mapping to
> - * a shared page.
> - */
> - x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
> - x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;
> -
> + x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
> x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
> - x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
>
> /*
> * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 084fab6..fbe2585 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
> return result;
> }
>
> -static bool hv_vtom_tlb_flush_required(bool private)
> -{
> - return true;
> -}
> -
> static bool hv_vtom_cache_flush_required(void)
> {
> return false;
> @@ -614,7 +609,6 @@ void __init hv_vtom_init(void)
>
> x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
> x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
> - x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
> x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
>
> /* Set WB as the default cache mode. */
> diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
> index a37ebd3..cf5179b 100644
> --- a/arch/x86/kernel/x86_init.c
> +++ b/arch/x86/kernel/x86_init.c
> @@ -131,9 +131,7 @@ struct x86_cpuinit_ops x86_cpuinit = {
>
> static void default_nmi_init(void) { };
>
> -static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
> static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
> -static bool enc_tlb_flush_required_noop(bool enc) { return false; }
> static bool enc_cache_flush_required_noop(void) { return false; }
> static bool is_private_mmio_noop(u64 addr) {return false; }
>
> @@ -154,9 +152,7 @@ struct x86_platform_ops x86_platform __ro_after_init = {
> .hyper.is_private_mmio = is_private_mmio_noop,
>
> .guest = {
> - .enc_status_change_prepare = enc_status_change_prepare_noop,
> .enc_status_change_finish = enc_status_change_finish_noop,
> - .enc_tlb_flush_required = enc_tlb_flush_required_noop,
> .enc_cache_flush_required = enc_cache_flush_required_noop,
> },
> };
> diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
> index 6faea41..06960ba 100644
> --- a/arch/x86/mm/mem_encrypt_amd.c
> +++ b/arch/x86/mm/mem_encrypt_amd.c
> @@ -278,11 +278,6 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
> return pfn;
> }
>
> -static bool amd_enc_tlb_flush_required(bool enc)
> -{
> - return true;
> -}
> -
> static bool amd_enc_cache_flush_required(void)
> {
> return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
> @@ -318,18 +313,6 @@ static void enc_dec_hypercall(unsigned long vaddr, unsigned long size, bool enc)
> #endif
> }
>
> -static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
> -{
> - /*
> - * To maintain the security guarantees of SEV-SNP guests, make sure
> - * to invalidate the memory before encryption attribute is cleared.
> - */
> - if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
> - snp_set_memory_shared(vaddr, npages);
> -
> - return true;
> -}
> -
> /* Return true unconditionally: return value doesn't matter for the SEV side */
> static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool enc)
> {
> @@ -337,8 +320,12 @@ static bool amd_enc_status_change_finish(unsigned long vaddr, int npages, bool e
> * After memory is mapped encrypted in the page table, validate it
> * so that it is consistent with the page table updates.
> */
> - if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && enc)
> - snp_set_memory_private(vaddr, npages);
> + if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) {
> + if (enc)
> + snp_set_memory_private(vaddr, npages);
> + else
> + snp_set_memory_shared(vaddr, npages);
> + }

These calls will both result in a PVALIDATE being issued (either before or
after the page state change to the hypervisor) using the virtual address,
which will trigger a #PF if the present bit isn't set.

>
> if (!cc_platform_has(CC_ATTR_HOST_MEM_ENCRYPT))
> enc_dec_hypercall(vaddr, npages << PAGE_SHIFT, enc);
> @@ -498,9 +485,7 @@ void __init sme_early_init(void)
> /* Update the protection map with memory encryption mask */
> add_encrypt_protection_map();
>
> - x86_platform.guest.enc_status_change_prepare = amd_enc_status_change_prepare;
> x86_platform.guest.enc_status_change_finish = amd_enc_status_change_finish;
> - x86_platform.guest.enc_tlb_flush_required = amd_enc_tlb_flush_required;
> x86_platform.guest.enc_cache_flush_required = amd_enc_cache_flush_required;
>
> /*
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index d7ef8d3..d062e01 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2147,40 +2147,57 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
> memset(&cpa, 0, sizeof(cpa));
> cpa.vaddr = &addr;
> cpa.numpages = numpages;
> +
> + /*
> + * The caller must ensure that the memory being transitioned between
> + * encrypted and decrypted is not being accessed. But if
> + * load_unaligned_zeropad() touches the "next" page, it may generate a
> + * read access the caller has no control over. To ensure such accesses
> + * cause a normal page fault for the load_unaligned_zeropad() handler,
> + * mark the pages not present until the transition is complete. We
> + * don't want a #VE or #VC fault due to a mismatch in the memory
> + * encryption status, since paravisor configurations can't cleanly do
> + * the load_unaligned_zeropad() handling in the paravisor.
> + *
> + * There's no requirement to do so, but for efficiency we can clear
> + * _PAGE_PRESENT and set/clr encryption attr as a single operation.
> + */
> cpa.mask_set = enc ? pgprot_encrypted(empty) : pgprot_decrypted(empty);
> - cpa.mask_clr = enc ? pgprot_decrypted(empty) : pgprot_encrypted(empty);
> + cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
> + pgprot_encrypted(__pgprot(_PAGE_PRESENT));

This should be lined up with the pgprot_decrypted above, e.g.:

cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT)) :
             pgprot_encrypted(__pgprot(_PAGE_PRESENT));

or

cpa.mask_clr = enc ? pgprot_decrypted(__pgprot(_PAGE_PRESENT))
           : pgprot_encrypted(__pgprot(_PAGE_PRESENT));

> cpa.pgd = init_mm.pgd;
>
> /* Must avoid aliasing mappings in the highmem code */
> kmap_flush_unused();
> vm_unmap_aliases();
>
> - /* Flush the caches as needed before changing the encryption attribute. */
> - if (x86_platform.guest.enc_tlb_flush_required(enc))
> - cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
> -
> - /* Notify hypervisor that we are about to set/clr encryption attribute. */
> - if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
> - return -EIO;
> + /* Flush the caches as needed before changing the encryption attr. */
> + if (x86_platform.guest.enc_cache_flush_required())
> + cpa_flush(&cpa, 1);
>
> ret = __change_page_attr_set_clr(&cpa, 1);
> + if (ret)
> + return ret;
>
> /*
> - * After changing the encryption attribute, we need to flush TLBs again
> - * in case any speculative TLB caching occurred (but no need to flush
> - * caches again). We could just use cpa_flush_all(), but in case TLB
> - * flushing gets optimized in the cpa_flush() path use the same logic
> - * as above.
> + * After clearing _PAGE_PRESENT and changing the encryption attribute,
> + * we need to flush TLBs to ensure no further accesses to the memory can
> + * be made with the old encryption attribute (but no need to flush caches
> + * again). We could just use cpa_flush_all(), but in case TLB flushing
> + * gets optimized in the cpa_flush() path use the same logic as above.
> */
> cpa_flush(&cpa, 0);
>
> - /* Notify hypervisor that we have successfully set/clr encryption attribute. */
> - if (!ret) {
> - if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> - ret = -EIO;
> - }
> + /* Notify hypervisor that we have successfully set/clr encryption attr. */
> + if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
> + return -EIO;

Here's where the #PF is likely to be triggered.

Thanks,
Tom

>
> - return ret;
> + /*
> + * Now that the hypervisor is sync'ed with the page table changes
> + * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
> + * the TLB.
> + */
> + return set_memory_p(&addr, numpages);
> }
>
> static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)

2023-10-05 17:10:51

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: [PATCH 1/5] x86/coco: Use slow_virt_to_phys() in page transition hypervisor callbacks

In preparation for temporarily marking pages not present during a
transition between encrypted and decrypted, use slow_virt_to_phys()
in the hypervisor callbacks. As long as the PFN is correct,
slow_virt_to_phys() works even if the leaf PTE is not present.
The existing functions that depend on vmalloc_to_page() all
require that the leaf PTE be marked present, so they don't work.

Update the comments for slow_virt_to_phys() to note this broader usage
and the requirement to work even if the PTE is not marked present.
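
For illustration, here is a conceptual sketch (not the kernel's exact code) of
why slow_virt_to_phys() tolerates a cleared PRESENT bit: it reads the leaf
page table entry directly and combines the PFN stored there with the offset
within the page, and the PFN in the entry remains valid during the transition:

/*
 * Conceptual sketch only -- not the kernel's implementation. Assumes a
 * valid kernel mapping exists at the leaf level even if _PAGE_PRESENT
 * has been temporarily cleared.
 */
static phys_addr_t slow_virt_to_phys_sketch(void *addr)
{
	unsigned long virt = (unsigned long)addr;
	unsigned int level;
	pte_t *pte = lookup_address(virt, &level);

	switch (level) {
	case PG_LEVEL_1G:
		return ((phys_addr_t)pud_pfn(*(pud_t *)pte) << PAGE_SHIFT) |
		       (virt & ~PUD_MASK);
	case PG_LEVEL_2M:
		return ((phys_addr_t)pmd_pfn(*(pmd_t *)pte) << PAGE_SHIFT) |
		       (virt & ~PMD_MASK);
	default:
		return ((phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT) |
		       (virt & ~PAGE_MASK);
	}
}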

Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/hyperv/ivm.c | 9 ++++++++-
arch/x86/kernel/sev.c | 8 +++++++-
arch/x86/mm/pat/set_memory.c | 13 +++++++++----
3 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index c1088d3..084fab6 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -524,7 +524,14 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
return false;

for (i = 0, pfn = 0; i < pagecount; i++) {
- pfn_array[pfn] = virt_to_hvpfn((void *)kbuffer + i * HV_HYP_PAGE_SIZE);
+ /*
+ * Use slow_virt_to_phys() because the PRESENT bit has been
+ * temporarily cleared in the PTEs. slow_virt_to_phys() works
+ * without the PRESENT bit while virt_to_hvpfn() or similar
+ * does not.
+ */
+ pfn_array[pfn] = slow_virt_to_phys((void *)kbuffer +
+ i * HV_HYP_PAGE_SIZE) >> HV_HYP_PAGE_SHIFT;
pfn++;

if (pfn == HV_MAX_MODIFY_GPA_REP_COUNT || i == pagecount - 1) {
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 2787826..f5d6cec 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -784,7 +784,13 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
hdr->end_entry = i;

if (is_vmalloc_addr((void *)vaddr)) {
- pfn = vmalloc_to_pfn((void *)vaddr);
+ /*
+ * Use slow_virt_to_phys() because the PRESENT bit has been
+ * temporarily cleared in the PTEs. slow_virt_to_phys() works
+ * without the PRESENT bit while vmalloc_to_pfn() or similar
+ * does not.
+ */
+ pfn = slow_virt_to_phys((void *)vaddr) >> PAGE_SHIFT;
use_large_entry = false;
} else {
pfn = __pa(vaddr) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bda9f12..8e19796 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -755,10 +755,15 @@ pmd_t *lookup_pmd_address(unsigned long address)
* areas on 32-bit NUMA systems. The percpu areas can
* end up in this kind of memory, for instance.
*
- * This could be optimized, but it is only intended to be
- * used at initialization time, and keeping it
- * unoptimized should increase the testing coverage for
- * the more obscure platforms.
+ * It is also used in callbacks for CoCo VM page transitions between private
+ * and shared because it works when the PRESENT bit is not set in the leaf
+ * PTE. In such cases, the state of the PTEs, including the PFN, is otherwise
+ * known to be valid, so the returned physical address is correct. The similar
+ * function vmalloc_to_pfn() can't be used because it requires the PRESENT bit.
+ *
+ * This could be optimized, but it is only used in paths that are not perf
+ * sensitive, and keeping it unoptimized should increase the testing coverage
+ * for the more obscure platforms.
*/
phys_addr_t slow_virt_to_phys(void *__virt_addr)
{
--
1.8.3.1

2023-10-05 17:21:31

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: [PATCH 5/5] x86/mm: Add comments about errors in set_memory_decrypted()/encrypted()

The functions set_memory_decrypted()/encrypted() may leave the input
memory range in an inconsistent state if an error occurs. Add comments
describing the situation and what callers must be aware of. Also add
comments in __set_memory_enc_dec() with more details on the issues and
why further investment in error handling is not likely to be useful.

No functional change.

Suggested-by: Rick Edgecombe <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1578077..9d07d4e 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2133,6 +2133,24 @@ int set_memory_global(unsigned long addr, int numpages)
/*
* __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
+ *
+ * If an error occurs in making the transition between encrypted and
+ * decrypted, the transitioned memory is left in an indeterminate state.
+ * The encryption status in the guest page tables may not match the
+ * hypervisor's view of the encryption status, making the memory unusable.
+ * If the memory consists of multiple pages, different pages may be in
+ * different indeterminate states.
+ *
+ * It is difficult to recover from errors such that we can ensure
+ * consistency between the page tables and hypervisor view of the encryption
+ * state. It may not be possible to back out of changes, particularly if the
+ * failure occurs in communicating with the hypervisor. Given this limitation,
+ * further work on the error handling is not likely to meaningfully improve
+ * the reliability or usability of the system.
+ *
+ * Any errors are likely to soon render the VM inoperable, but we return
+ * an error rather than panic'ing so that the caller can decide how best
+ * to shutdown cleanly.
*/
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
@@ -2203,6 +2221,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}

+/*
+ * If set_memory_encrypted()/decrypted() returns an error, the input memory
+ * range is left in an indeterminate state. The encryption status of pages
+ * may be inconsistent, so the memory is unusable. The caller should not try
+ * to do further operations on the memory, or return it to the free list.
+ * The memory must be leaked, and the caller should take steps to shutdown
+ * the system as cleanly as possible as something is seriously wrong.
+ */
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
1.8.3.1
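
To make the caller-side contract concrete, here is an illustrative sketch of
how an error must be handled. The allocation helper and message text are
hypothetical; only set_memory_decrypted() and the "leak, don't free" rule
come from the comments above:

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/printk.h>

/* Hypothetical caller: on error the pages are leaked on purpose because
 * their encryption state is indeterminate and they must not be reused.
 */
static void *example_alloc_shared_buf(size_t size)
{
	int order = get_order(size);
	unsigned long addr = __get_free_pages(GFP_KERNEL, order);

	if (!addr)
		return NULL;

	if (set_memory_decrypted(addr, 1 << order)) {
		/* Do NOT call free_pages() here: leak the range instead */
		pr_err("example: failed to decrypt buffer; leaking pages\n");
		return NULL;
	}

	return (void *)addr;
}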

2023-10-17 00:35:49

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH 3/5] x86/mm: Mark CoCo VM pages not present while changing encrypted state

From: Michael Kelley (LINUX) <[email protected]> Sent: Monday, October 2, 2023 1:43 PM
>
> From: Tom Lendacky <[email protected]> Sent: Monday, October 2, 2023
> 11:59 AM
> >
> > On 10/2/23 11:35, Tom Lendacky wrote:
> > > On 9/29/23 13:19, Michael Kelley wrote:
> > >> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > >> versa, attributes in the PTE must be updated *and* the hypervisor must
> > >> be notified of the change. Because there are two separate steps, there's
> > >> a window where the settings are inconsistent.  Normally the code that
> > >> initiates the transition (via set_memory_decrypted() or
> > >> set_memory_encrypted()) ensures that the memory is not being accessed
> > >> during a transition, so the window of inconsistency is not a problem.
> > >> However, the load_unaligned_zeropad() function can read arbitrary memory
> > >> pages at arbitrary times, which could access a transitioning page during
> > >> the window.  In such a case, CoCo VM specific exceptions are taken
> > >> (depending on the CoCo architecture in use).  Current code in those
> > >> exception handlers recovers and does "fixup" on the result returned by
> > >> load_unaligned_zeropad().  Unfortunately, this exception handling can't
> > >> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode).
> > >> The exceptions need to be forwarded from the paravisor to the Linux
> > >> guest, but there are no architectural specs for how to do that.
> > >>
> > >> Fortunately, there's a simpler way to solve the problem by changing
> > >> the core transition code in __set_memory_enc_pgtable() to do the
> > >> following:
> > >>
> > >> 1.  Remove aliasing mappings
> > >> 2.  Flush the data cache if needed
> > >> 3.  Remove the PRESENT bit from the PTEs of all transitioning pages
> > >> 4.  Set/clear the encryption attribute as appropriate
> > >> 5.  Flush the TLB so the changed encryption attribute isn't visible
> > >> 6.  Notify the hypervisor of the encryption status change
> > >
> > > Not sure why I didn't notice this before, but I will need to test this to
> > > be certain. As part of this notification, the SNP support will issue a
> > > PVALIDATE instruction (to either validate or rescind validation to the
> > > page). PVALIDATE takes a virtual address. If the PRESENT bit has been
> > > removed, the PVALIDATE instruction will take a #PF (see comments below).
> >
> > Yes, this series results in a #PF booting an SNP guest:
>
> Bummer :-( Interestingly, an SNP guest on Hyper-V with a paravisor
> works, presumably because the paravisor is doing the PVALIDATE with
> a different guest virtual address for the physical page. TDX operates
> on physical addresses, and I've tested that it works.
>
> The spec for PVALIDATE says it performs the same segmentation
> and paging checks as a 1-byte read, so indeed, the #PF is expected.
> But in the spec, the pseudo-code for PVALIDATE uses the GUEST_VA
> only to derive the GUEST_PA and the SYSTEM_PA. The GUEST_VA isn't
> otherwise relevant, so any GUEST_VA that validly maps to the GUEST_PA
> would work, as long as we can be assured that load_unaligned_zeropad()
> won't touch that GUEST_VA.
>
> Let me think about if there's a not-too-hacky way to make this work
> with some temporary GVA.
>

It seems like there are two possible approaches to make this work:

1. Create a temporary virtual mapping in vmalloc space and pass
that virtual address to PVALIDATE. (But only do this when PVALIDATE
is being used for private <-> shared transitions, and not for memory
acceptance.) The temporary mapping is updated with each invocation
of PVALIDATE; a rough sketch is included below. To make this work, the temp virtual addr must be aligned
on a 2 Meg boundary and must have a guard page preceding it so that
load_unaligned_zeropad() can't stumble into the temporary mapping.
I've wrestled with a few approaches to coding this over the past two
weeks, and have something that's not too bad. This approach
certainly takes some additional CPU cycles. I've tested doing the
temp mapping in the context of a Hyper-V vTOM VM, but don't see
any measurable impact on boot time, even when converting a 1 Gbyte
swiotlb space from private to shared. I'm setting up now to test
in a regular SNP VM where snp_enc_status_change_finish() is used,
to have end-to-end confirmation that it really does work.

2. A completely different approach is for __set_memory_enc_pgtable()
to clear and restore the PRESENT bit only when REFLECT_VC is set in
the MSR_AMD64_SEV, and the equivalent on TDX. This is the case that's
problematic for doing the load_unaligned_zeropad() fix up in the SNP
#VC or TDX #VE exception handler. I think you said some additional work
was needed for the fixup to be done properly in the SNP #VC case, so that
would have to be done. The cleared PRESENT bit is handled by the
paravisor because the paravisor already has a virtual mapping to pass
to PVALIDATE.

Any thoughts? I'll try to get the code for #1 posted in the next few
days, so you can judge the level of additional complexity to manage
the temp virtual mapping.
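
For reference, the rough shape of #1 is sketched below. This is illustrative
only: error handling, the 2 Meg alignment of the temporary area, and the
preceding guard page are glossed over, and the helper name is made up.
pvalidate() and RMP_PG_SIZE_4K are the existing definitions from
arch/x86/include/asm/sev*.h.

#include <linux/mm.h>
#include <linux/vmalloc.h>	/* vmap(), vunmap() */
#include <asm/sev.h>		/* pvalidate(), RMP_PG_SIZE_4K */

/* Sketch of approach #1: PVALIDATE through a throwaway kernel VA so the
 * normal mapping can stay not-present during the transition.
 */
static int pvalidate_one_page_via_temp_va(unsigned long pfn, bool validate)
{
	struct page *page = pfn_to_page(pfn);
	void *tmp;
	int rc;

	/* Temporary kernel VA for the target PFN */
	tmp = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
	if (!tmp)
		return -ENOMEM;

	/* PVALIDATE uses the VA only to derive the GPA and SPA */
	rc = pvalidate((unsigned long)tmp, RMP_PG_SIZE_4K, validate);

	vunmap(tmp);
	return rc;
}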

Michael