From: Michael Kelley <[email protected]>
In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could read a transitioning page during
the window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
if the exceptions are routed to the paravisor. The paravisor can't
do load_unaligned_zeropad() fixup, so the exceptions would need to
be forwarded from the paravisor to the Linux guest, but there are
no architectural specs for how to do that.
Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:
1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Set/clear the encryption attribute as appropriate
5. Flush the TLB so the changed encryption attribute isn't visible
6. Notify the hypervisor of the encryption status change
7. Add back the PRESENT bit, making the changed attribute visible
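As a rough sketch only (the exact ordering and error handling in Patch 5
differ slightly, and the hypervisor "prepare" callback is omitted here),
the transition sequence in __set_memory_enc_pgtable() becomes approximately:

	vm_unmap_aliases();			/* 1: remove aliasing mappings */
	if (x86_platform.guest.enc_cache_flush_required())
		cpa_flush(&cpa, 1);		/* 2: flush data cache if needed */
	set_memory_np(addr, numpages);		/* 3: clear PRESENT; flushes TLB */
	__change_page_attr_set_clr(&cpa, 1);	/* 4: set/clear encryption attribute */
	cpa_flush(&cpa, 0);			/* 5: flush TLB */
	if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
		return -EIO;			/* 6: notify the hypervisor */
	return set_memory_p(&addr, numpages);	/* 7: add back PRESENT */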
With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.
Patch 1 handles implications of the hypervisor callbacks in Step 6
needing to do virt-to-phys translations on pages that are temporarily
marked not present.
Patch 2 ensures that Step 7 doesn't generate a TLB flush. It is a
performance optimization only and is not necessary for correctness.
Patches 3 and 4 handle the case where SEV-SNP does PVALIDATE in
Step 6, which requires a valid virtual address. But because the
PRESENT bit has been cleared in the direct map PTEs, the
PVALIDATE fails. These patches construct a temporary virtual
address to be used by PVALIDATE. This code is SEV-SNP only because
the TDX and Hyper-V paravisor flavors operate on physical addresses.
Patches 5 and 6 are the core change that marks the transitioning
pages as not present. Patch 6 is optional since retaining both
the "prepare" and "finish" callbacks doesn't hurt anything and
there might be an argument for retaining both for future
flexibility. However, Patch 6 *does* eliminate about 75 lines of
code and comments.
Patch 7 is a somewhat tangential cleanup that removes an unnecessary
wrapper function in the path for doing a transition.
Patch 8 adds comments describing the implications of errors when
doing a transition. These implications are discussed in the email
thread for the RFC patch [1] and in a patch proposed by Rick
Edgecombe [2][3].
With this change, the #VE and #VC exception handlers should no longer
be triggered for load_unaligned_zeropad() accesses, and the existing
code in those handlers to do the "fixup" shouldn't be needed. But I
have not removed that code in this patch set. Kirill Shutemov wants
to keep the code for TDX #VE, so the code for #VC on the SEV-SNP
side has also been kept.
This patch set is based on the linux-next20231117 code tree.
Changes in v2:
* Added Patches 3 and 4 to deal with the failure on SEV-SNP
[Tom Lendacky]
* Split the main change into two separate patches (Patch 5 and
Patch 6) to improve reviewability and to offer the option of
retaining both hypervisor callbacks.
* Patch 5 moves set_memory_p() out of an #ifdef CONFIG_X86_64
so that the code builds correctly for 32-bit, even though it
is never executed for 32-bit [reported by kernel test robot]
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3] https://lore.kernel.org/lkml/[email protected]/
Michael Kelley (8):
x86/coco: Use slow_virt_to_phys() in page transition hypervisor
callbacks
x86/mm: Don't do a TLB flush if changing a PTE that isn't marked
present
x86/mm: Remove "static" from vmap_pages_range()
x86/sev: Enable PVALIDATE for PFNs without a valid virtual address
x86/mm: Mark CoCo VM pages not present while changing encrypted state
x86/mm: Merge CoCo prepare and finish hypervisor callbacks
x86/mm: Remove unnecessary call layer for __set_memory_enc_pgtable()
x86/mm: Add comments about errors in
set_memory_decrypted()/encrypted()
arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/coco/tdx/tdx.c | 66 +----------------
arch/x86/hyperv/ivm.c | 15 ++--
arch/x86/include/asm/sev.h | 6 +-
arch/x86/include/asm/x86_init.h | 4 --
arch/x86/kernel/sev-shared.c | 57 ++++++++++++---
arch/x86/kernel/sev.c | 43 ++++++-----
arch/x86/kernel/x86_init.c | 4 --
arch/x86/mm/mem_encrypt_amd.c | 23 +-----
arch/x86/mm/pat/set_memory.c | 122 ++++++++++++++++++++++----------
include/linux/vmalloc.h | 2 +
mm/vmalloc.c | 2 +-
12 files changed, 171 insertions(+), 175 deletions(-)
--
2.25.1
From: Michael Kelley <[email protected]>
The core function __change_page_attr() currently sets up a TLB flush if
a PTE is changed. But if the old value of the PTE doesn't include the
PRESENT flag, the PTE won't be in the TLB, so a flush isn't needed.
Avoid an unnecessary TLB flush by conditioning the flush on the old
PTE value including PRESENT. This change improves the performance of
functions like set_memory_p() by avoiding the flush if the memory range
was previously all not present.
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8e19796e7ce5..d7ef8d312a47 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
*/
if (pte_val(old_pte) != pte_val(new_pte)) {
set_pte_atomic(kpte, new_pte);
- cpa->flags |= CPA_FLUSHTLB;
+
+ /* If old_pte isn't present, it's not in the TLB */
+ if (pte_present(old_pte))
+ cpa->flags |= CPA_FLUSHTLB;
}
cpa->numpages = 1;
return 0;
--
2.25.1
From: Michael Kelley <[email protected]>
__set_memory_enc_pgtable() is only called from __set_memory_enc_dec()
after doing a simple validation check. Prior to commit 812b0597fb40,
__set_memory_enc_dec() did more complex checking, but now the code
can be simplified by collapsing the two functions.
No functional change.
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index c13178f37b13..7365c86a7ff0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2131,15 +2131,18 @@ int set_memory_global(unsigned long addr, int numpages)
}
/*
- * __set_memory_enc_pgtable() is used for the hypervisors that get
+ * __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
*/
-static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
+static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
pgprot_t empty = __pgprot(0);
struct cpa_data cpa;
int ret;
+ if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT))
+ return 0;
+
/* Should not be working on unaligned addresses */
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;
@@ -2200,14 +2203,6 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}
-static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
-{
- if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
- return __set_memory_enc_pgtable(addr, numpages, enc);
-
- return 0;
-}
-
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
2.25.1
From: Michael Kelley <[email protected]>
The mm subsystem currently provides no mechanism to map memory pages
to a specified virtual address range. A virtual address range can be
allocated using get_vm_area(), but the only function available for
mapping memory pages to a caller-specified address in that range is
ioremap_page_range(), which is inappropriate for system memory.
Fix this by allowing vmap_pages_range() to be used by callers outside
of vmalloc.c.
Signed-off-by: Michael Kelley <[email protected]>
---
include/linux/vmalloc.h | 2 ++
mm/vmalloc.c | 2 +-
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..ee12f5226a45 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -233,6 +233,8 @@ static inline bool is_vm_area_hugepages(const void *addr)
#ifdef CONFIG_MMU
void vunmap_range(unsigned long addr, unsigned long end);
+int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
+ struct page **pages, unsigned int page_shift);
static inline void set_vm_flush_reset_perms(void *addr)
{
struct vm_struct *vm = find_vm_area(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..b2a72bd317c6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -625,7 +625,7 @@ int vmap_pages_range_noflush(unsigned long addr, unsigned long end,
* RETURNS:
* 0 on success, -errno on failure.
*/
-static int vmap_pages_range(unsigned long addr, unsigned long end,
+int vmap_pages_range(unsigned long addr, unsigned long end,
pgprot_t prot, struct page **pages, unsigned int page_shift)
{
int err;
--
2.25.1
From: Michael Kelley <[email protected]>
For SEV-SNP, the PVALIDATE instruction requires a valid virtual
address that it translates to the PFN that it operates on. Per
the spec, it translates the virtual address as if it were doing a
single-byte read.
In transitioning a page between encrypted and decrypted, the direct
map virtual address of the page may be temporarily marked invalid
(i.e., PRESENT is cleared in the PTE) to prevent interference from
load_unaligned_zeropad(). In such a case, the PVALIDATE that is
required for the encrypted<->decrypted transition fails due to
an invalid virtual address.
Fix this by providing a temporary virtual address that is mapped to the
target PFN just before executing PVALIDATE. Have PVALIDATE use this
temp virtual address instead of the direct map virtual address. Unmap
the temp virtual address after PVALIDATE completes.
The temp virtual address must be aligned on a 2 Mbyte boundary
to meet PVALIDATE requirements for operating on 2 Meg large pages,
though the temp mapping need only be a 4K mapping. Also, the temp
virtual address must be preceded by a 4K invalid page so it can't
be accessed by load_unaligned_zeropad().
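As a minimal sketch of how one reservation can satisfy both constraints
(this mirrors the get_vm_area()/ALIGN() arithmetic in the patch below;
error handling omitted):

	/*
	 * Reserve 2 Mbytes + 4 Kbytes of vmalloc address space: enough to
	 * contain a 2 Mbyte aligned address that is preceded by at least
	 * one unmapped 4K page within the reservation.
	 */
	area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
	temp_vaddr = ALIGN((unsigned long)area->addr + PAGE_SIZE, PMD_SIZE);

	/*
	 * The page at temp_vaddr - PAGE_SIZE is never mapped, so the temp
	 * mapping can't be reached by a load_unaligned_zeropad() that runs
	 * off the end of the preceding page.
	 */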
This mechanism is used only for pages transitioning between encrypted
and decrypted. When PVALIDATE is done for initial page acceptance,
a temp virtual address is not provided, and PVALIDATE uses the
direct map virtual address.
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/boot/compressed/sev.c | 2 +-
arch/x86/kernel/sev-shared.c | 57 +++++++++++++++++++++++++++-------
arch/x86/kernel/sev.c | 32 ++++++++++++-------
3 files changed, 67 insertions(+), 24 deletions(-)
diff --git a/arch/x86/boot/compressed/sev.c b/arch/x86/boot/compressed/sev.c
index 454acd7a2daf..4d4a3fc0b725 100644
--- a/arch/x86/boot/compressed/sev.c
+++ b/arch/x86/boot/compressed/sev.c
@@ -224,7 +224,7 @@ static phys_addr_t __snp_accept_memory(struct snp_psc_desc *desc,
if (vmgexit_psc(boot_ghcb, desc))
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PSC);
- pvalidate_pages(desc);
+ pvalidate_pages(desc, 0);
return pa;
}
diff --git a/arch/x86/kernel/sev-shared.c b/arch/x86/kernel/sev-shared.c
index ccb0915e84e1..fc45fdcf3892 100644
--- a/arch/x86/kernel/sev-shared.c
+++ b/arch/x86/kernel/sev-shared.c
@@ -1071,35 +1071,70 @@ static void __init setup_cpuid_table(const struct cc_blob_sev_info *cc_info)
}
}
-static void pvalidate_pages(struct snp_psc_desc *desc)
+#ifdef __BOOT_COMPRESSED
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+ unsigned long pfn, bool validate, int *rc2)
+{
+ return 0;
+}
+#else
+static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
+ unsigned long pfn, bool validate, int *rc2)
+{
+ int rc;
+ struct page *page = pfn_to_page(pfn);
+
+ *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
+ PAGE_KERNEL, &page, PAGE_SHIFT);
+ rc = pvalidate(vaddr, size, validate);
+ vunmap_range(vaddr, vaddr + PAGE_SIZE);
+
+ return rc;
+}
+#endif
+
+static void pvalidate_pages(struct snp_psc_desc *desc, unsigned long vaddr)
{
struct psc_entry *e;
- unsigned long vaddr;
+ unsigned long pfn;
unsigned int size;
unsigned int i;
bool validate;
- int rc;
+ int rc, rc2 = 0;
for (i = 0; i <= desc->hdr.end_entry; i++) {
e = &desc->entries[i];
- vaddr = (unsigned long)pfn_to_kaddr(e->gfn);
- size = e->pagesize ? RMP_PG_SIZE_2M : RMP_PG_SIZE_4K;
+ size = e->pagesize;
validate = e->operation == SNP_PAGE_STATE_PRIVATE;
+ pfn = e->gfn;
- rc = pvalidate(vaddr, size, validate);
- if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
- unsigned long vaddr_end = vaddr + PMD_SIZE;
+ if (vaddr) {
+ rc = pvalidate_pfn(vaddr, size, pfn, validate, &rc2);
+ } else {
+ vaddr = (unsigned long)pfn_to_kaddr(pfn);
+ rc = pvalidate(vaddr, size, validate);
+ }
- for (; vaddr < vaddr_end; vaddr += PAGE_SIZE) {
- rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ if (rc == PVALIDATE_FAIL_SIZEMISMATCH && size == RMP_PG_SIZE_2M) {
+ unsigned long last_pfn = pfn + PTRS_PER_PMD - 1;
+
+ for (; pfn <= last_pfn; pfn++) {
+ if (vaddr) {
+ rc = pvalidate_pfn(vaddr, RMP_PG_SIZE_4K,
+ pfn, validate, &rc2);
+ } else {
+ vaddr = (unsigned long)pfn_to_kaddr(pfn);
+ rc = pvalidate(vaddr, RMP_PG_SIZE_4K, validate);
+ }
if (rc)
break;
}
}
if (rc) {
- WARN(1, "Failed to validate address 0x%lx ret %d", vaddr, rc);
+ WARN(1, "Failed to validate address 0x%lx ret %d ret2 %d",
+ vaddr, rc, rc2);
sev_es_terminate(SEV_TERM_SET_LINUX, GHCB_TERM_PVALIDATE);
}
}
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 7eac92c07a58..08b2e2a0d67d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -790,7 +790,7 @@ void __init snp_prep_memory(unsigned long paddr, unsigned int sz, enum psc_op op
}
static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long vaddr,
- unsigned long vaddr_end, int op)
+ unsigned long vaddr_end, int op, unsigned long temp_vaddr)
{
struct ghcb_state state;
bool use_large_entry;
@@ -842,7 +842,7 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
/* Page validation must be rescinded before changing to shared */
if (op == SNP_PAGE_STATE_SHARED)
- pvalidate_pages(data);
+ pvalidate_pages(data, temp_vaddr);
local_irq_save(flags);
@@ -862,12 +862,13 @@ static unsigned long __set_pages_state(struct snp_psc_desc *data, unsigned long
/* Page validation must be performed after changing to private */
if (op == SNP_PAGE_STATE_PRIVATE)
- pvalidate_pages(data);
+ pvalidate_pages(data, temp_vaddr);
return vaddr;
}
-static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
+static void set_pages_state(unsigned long vaddr, unsigned long npages,
+ int op, unsigned long temp_vaddr)
{
struct snp_psc_desc desc;
unsigned long vaddr_end;
@@ -880,23 +881,30 @@ static void set_pages_state(unsigned long vaddr, unsigned long npages, int op)
vaddr_end = vaddr + (npages << PAGE_SHIFT);
while (vaddr < vaddr_end)
- vaddr = __set_pages_state(&desc, vaddr, vaddr_end, op);
+ vaddr = __set_pages_state(&desc, vaddr, vaddr_end,
+ op, temp_vaddr);
}
void snp_set_memory_shared(unsigned long vaddr, unsigned long npages)
{
- if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
- return;
+ struct vm_struct *area;
+ unsigned long temp_vaddr;
- set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED);
+ area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+ temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_SHARED, temp_vaddr);
+ free_vm_area(area);
}
void snp_set_memory_private(unsigned long vaddr, unsigned long npages)
{
- if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP))
- return;
+ struct vm_struct *area;
+ unsigned long temp_vaddr;
- set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+ area = get_vm_area(PAGE_SIZE * (PTRS_PER_PMD + 1), 0);
+ temp_vaddr = ALIGN((unsigned long)(area->addr + PAGE_SIZE), PMD_SIZE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, temp_vaddr);
+ free_vm_area(area);
}
void snp_accept_memory(phys_addr_t start, phys_addr_t end)
@@ -909,7 +917,7 @@ void snp_accept_memory(phys_addr_t start, phys_addr_t end)
vaddr = (unsigned long)__va(start);
npages = (end - start) >> PAGE_SHIFT;
- set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE);
+ set_pages_state(vaddr, npages, SNP_PAGE_STATE_PRIVATE, 0);
}
static int snp_set_vmsa(void *va, bool vmsa)
--
2.25.1
From: Michael Kelley <[email protected]>
In a CoCo VM when a page transitions from encrypted to decrypted, or vice
versa, attributes in the PTE must be updated *and* the hypervisor must
be notified of the change. Because there are two separate steps, there's
a window where the settings are inconsistent. Normally the code that
initiates the transition (via set_memory_decrypted() or
set_memory_encrypted()) ensures that the memory is not being accessed
during a transition, so the window of inconsistency is not a problem.
However, the load_unaligned_zeropad() function can read arbitrary memory
pages at arbitrary times, which could access a transitioning page during
the window. In such a case, CoCo VM specific exceptions are taken
(depending on the CoCo architecture in use). Current code in those
exception handlers recovers and does "fixup" on the result returned by
load_unaligned_zeropad(). Unfortunately, this exception handling can't
work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
if the exceptions are routed to the paravisor. The paravisor
can't do the load_unaligned_zeropad() fixup, so the exceptions would
need to be forwarded from the paravisor to the Linux guest, but there
are no architectural specs for how to do that.
Fortunately, there's a simpler way to solve the problem by changing
the core transition code in __set_memory_enc_pgtable() to do the
following:
1. Remove aliasing mappings
2. Flush the data cache if needed
3. Remove the PRESENT bit from the PTEs of all transitioning pages
4. Notify the hypervisor of the impending encryption status change
5. Set/clear the encryption attribute as appropriate
6. Flush the TLB so the changed encryption attribute isn't visible
7. Notify the hypervisor after the encryption status change
8. Add back the PRESENT bit, making the changed attribute visible
With this approach, load_unaligned_zeropad() just takes its normal
page-fault-based fixup path if it touches a page that is transitioning.
As a result, load_unaligned_zeropad() and CoCo VM page transitioning
are completely decoupled. CoCo VM page transitions can proceed
without needing to handle architecture-specific exceptions and fix
things up. This decoupling reduces the complexity due to separate
TDX and SEV-SNP fixup paths, and gives more freedom to revise and
introduce new capabilities in future versions of the TDX and SEV-SNP
architectures. Paravisor scenarios work properly without needing
to forward exceptions.
Because Step 3 always does a TLB flush, the separate TLB flush
callback is no longer required and is removed.
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 20 --------------
arch/x86/hyperv/ivm.c | 6 -----
arch/x86/include/asm/x86_init.h | 2 --
arch/x86/kernel/x86_init.c | 2 --
arch/x86/mm/mem_encrypt_amd.c | 6 -----
arch/x86/mm/pat/set_memory.c | 48 ++++++++++++++++++++++++---------
6 files changed, 35 insertions(+), 49 deletions(-)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1b5d17a9f70d..39ead21bcba6 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -697,24 +697,6 @@ bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve)
return true;
}
-static bool tdx_tlb_flush_required(bool private)
-{
- /*
- * TDX guest is responsible for flushing TLB on private->shared
- * transition. VMM is responsible for flushing on shared->private.
- *
- * The VMM _can't_ flush private addresses as it can't generate PAs
- * with the guest's HKID. Shared memory isn't subject to integrity
- * checking, i.e. the VMM doesn't need to flush for its own protection.
- *
- * There's no need to flush when converting from shared to private,
- * as flushing is the VMM's responsibility in this case, e.g. it must
- * flush to avoid integrity failures in the face of a buggy or
- * malicious guest.
- */
- return !private;
-}
-
static bool tdx_cache_flush_required(void)
{
/*
@@ -876,9 +858,7 @@ void __init tdx_early_init(void)
*/
x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;
-
x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
/*
* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 8ba18635e338..4005c573e00c 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -550,11 +550,6 @@ static bool hv_vtom_set_host_visibility(unsigned long kbuffer, int pagecount, bo
return result;
}
-static bool hv_vtom_tlb_flush_required(bool private)
-{
- return true;
-}
-
static bool hv_vtom_cache_flush_required(void)
{
return false;
@@ -614,7 +609,6 @@ void __init hv_vtom_init(void)
x86_platform.hyper.is_private_mmio = hv_is_private_mmio;
x86_platform.guest.enc_cache_flush_required = hv_vtom_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = hv_vtom_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = hv_vtom_set_host_visibility;
/* Set WB as the default cache mode. */
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index c878616a18b8..5b3a9a214815 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -146,13 +146,11 @@ struct x86_init_acpi {
*
* @enc_status_change_prepare Notify HV before the encryption status of a range is changed
* @enc_status_change_finish Notify HV after the encryption status of a range is changed
- * @enc_tlb_flush_required Returns true if a TLB flush is needed before changing page encryption status
* @enc_cache_flush_required Returns true if a cache flush is needed before changing page encryption status
*/
struct x86_guest {
bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
- bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
};
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a37ebd3b4773..1c0d23a2b6cf 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -133,7 +133,6 @@ static void default_nmi_init(void) { };
static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
-static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
static bool is_private_mmio_noop(u64 addr) {return false; }
@@ -156,7 +155,6 @@ struct x86_platform_ops x86_platform __ro_after_init = {
.guest = {
.enc_status_change_prepare = enc_status_change_prepare_noop,
.enc_status_change_finish = enc_status_change_finish_noop,
- .enc_tlb_flush_required = enc_tlb_flush_required_noop,
.enc_cache_flush_required = enc_cache_flush_required_noop,
},
};
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index a68f2dda0948..652cc61b89b6 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -242,11 +242,6 @@ static unsigned long pg_level_to_pfn(int level, pte_t *kpte, pgprot_t *ret_prot)
return pfn;
}
-static bool amd_enc_tlb_flush_required(bool enc)
-{
- return true;
-}
-
static bool amd_enc_cache_flush_required(void)
{
return !cpu_feature_enabled(X86_FEATURE_SME_COHERENT);
@@ -464,7 +459,6 @@ void __init sme_early_init(void)
x86_platform.guest.enc_status_change_prepare = amd_enc_status_change_prepare;
x86_platform.guest.enc_status_change_finish = amd_enc_status_change_finish;
- x86_platform.guest.enc_tlb_flush_required = amd_enc_tlb_flush_required;
x86_platform.guest.enc_cache_flush_required = amd_enc_cache_flush_required;
/*
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index d7ef8d312a47..b125035608d5 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2019,6 +2019,11 @@ int set_memory_wb(unsigned long addr, int numpages)
}
EXPORT_SYMBOL(set_memory_wb);
+static int set_memory_p(unsigned long *addr, int numpages)
+{
+ return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
+}
+
/* Prevent speculative access to a page by marking it not-present */
#ifdef CONFIG_X86_64
int set_mce_nospec(unsigned long pfn)
@@ -2049,11 +2054,6 @@ int set_mce_nospec(unsigned long pfn)
return rc;
}
-static int set_memory_p(unsigned long *addr, int numpages)
-{
- return change_page_attr_set(addr, numpages, __pgprot(_PAGE_PRESENT), 0);
-}
-
/* Restore full speculative operation to the pfn. */
int clear_mce_nospec(unsigned long pfn)
{
@@ -2144,6 +2144,23 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
if (WARN_ONCE(addr & ~PAGE_MASK, "misaligned address: %#lx\n", addr))
addr &= PAGE_MASK;
+ /*
+ * The caller must ensure that the memory being transitioned between
+ * encrypted and decrypted is not being accessed. But if
+ * load_unaligned_zeropad() touches the "next" page, it may generate a
+ * read access the caller has no control over. To ensure such accesses
+ * cause a normal page fault for the load_unaligned_zeropad() handler,
+ * mark the pages not present until the transition is complete. We
+ * don't want a #VE or #VC fault due to a mismatch in the memory
+ * encryption status, since paravisor configurations can't cleanly do
+ * the load_unaligned_zeropad() handling in the paravisor.
+ *
+ * set_memory_np() flushes the TLB.
+ */
+ ret = set_memory_np(addr, numpages);
+ if (ret)
+ return ret;
+
memset(&cpa, 0, sizeof(cpa));
cpa.vaddr = &addr;
cpa.numpages = numpages;
@@ -2156,14 +2173,16 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
vm_unmap_aliases();
/* Flush the caches as needed before changing the encryption attribute. */
- if (x86_platform.guest.enc_tlb_flush_required(enc))
- cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());
+ if (x86_platform.guest.enc_cache_flush_required())
+ cpa_flush(&cpa, 1);
/* Notify hypervisor that we are about to set/clr encryption attribute. */
if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
return -EIO;
ret = __change_page_attr_set_clr(&cpa, 1);
+ if (ret)
+ return ret;
/*
* After changing the encryption attribute, we need to flush TLBs again
@@ -2174,13 +2193,16 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
*/
cpa_flush(&cpa, 0);
- /* Notify hypervisor that we have successfully set/clr encryption attribute. */
- if (!ret) {
- if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
- ret = -EIO;
- }
+ /* Notify hypervisor that we have successfully set/clr encryption attr. */
+ if (!x86_platform.guest.enc_status_change_finish(addr, numpages, enc))
+ return -EIO;
- return ret;
+ /*
+ * Now that the hypervisor is sync'ed with the page table changes
+ * made here, add back _PAGE_PRESENT. set_memory_p() does not flush
+ * the TLB.
+ */
+ return set_memory_p(&addr, numpages);
}
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
--
2.25.1
From: Michael Kelley <[email protected]>
The functions set_memory_decrypted()/encrypted() may leave the input
memory range in an inconsistent state if an error occurs. Add comments
describing the situation and what callers must be aware of. Also add
comments in __set_memory_enc_dec() with more details on the issues and
why further investment in error handling is not likely to be useful.
No functional change.
Suggested-by: Rick Edgecombe <[email protected]>
Signed-off-by: Michael Kelley <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7365c86a7ff0..f519e5ca543b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2133,6 +2133,24 @@ int set_memory_global(unsigned long addr, int numpages)
/*
* __set_memory_enc_dec() is used for the hypervisors that get
* informed about "encryption" status via page tables.
+ *
+ * If an error occurs in making the transition between encrypted and
+ * decrypted, the transitioned memory is left in an indeterminate state.
+ * The encryption status in the guest page tables may not match the
+ * hypervisor's view of the encryption status, making the memory unusable.
+ * If the memory consists of multiple pages, different pages may be in
+ * different indeterminate states.
+ *
+ * It is difficult to recover from errors such that we can ensure
+ * consistency between the page tables and hypervisor view of the encryption
+ * state. It may not be possible to back out of changes, particularly if the
+ * failure occurs in communicating with the hypervisor. Given this limitation,
+ * further work on the error handling is not likely to meaningfully improve
+ * the reliability or usability of the system.
+ *
+ * Any errors are likely to soon render the VM inoperable, but we return
+ * an error rather than panic'ing so that the caller can decide how best
+ * to shut down cleanly.
*/
static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
{
@@ -2203,6 +2221,14 @@ static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
return set_memory_p(&addr, numpages);
}
+/*
+ * If set_memory_encrypted()/decrypted() returns an error, the input memory
+ * range is left in an indeterminate state. The encryption status of pages
+ * may be inconsistent, so the memory is unusable. The caller should not try
+ * to do further operations on the memory, or return it to the free list.
+ * The memory must be leaked, and the caller should take steps to shut down
+ * the system as cleanly as possible, as something is seriously wrong.
+ */
int set_memory_encrypted(unsigned long addr, int numpages)
{
return __set_memory_enc_dec(addr, numpages, true);
--
2.25.1
On Tue, Nov 21, 2023 at 01:20:11PM -0800, [email protected] wrote:
> From: Michael Kelley <[email protected]>
>
> The mm subsystem currently provides no mechanism to map memory pages
> to a specified virtual address range. A virtual address range can be
> allocated using get_vm_area(), but the only function available for
> mapping memory pages to a caller-specified address in that range is
> ioremap_page_range(), which is inappropriate for system memory.
>
> Fix this by allowing vmap_pages_range() to be used by callers outside
> of vmalloc.c.
I really do not want to expose vmap_pages_range. Please try to come up
with a good way to encapsulate your map at a certain address primitive
and implement it in vmalloc.c.
From: Christoph Hellwig <[email protected]> Sent: Tuesday, November 21, 2023 10:26 PM
>
> On Tue, Nov 21, 2023 at 01:20:11PM -0800, [email protected] wrote:
> > From: Michael Kelley <[email protected]>
> >
> > The mm subsystem currently provides no mechanism to map memory pages
> > to a specified virtual address range. A virtual address range can be
> > allocated using get_vm_area(), but the only function available for
> > mapping memory pages to a caller-specified address in that range is
> > ioremap_page_range(), which is inappropriate for system memory.
> >
> > Fix this by allowing vmap_pages_range() to be used by callers outside
> > of vmalloc.c.
>
> I really do not want to expose vmap_pages_range. Please try to come up
> with a good way to encapsulate your map at a certain address primitive
> and implement it in vmalloc.c.
To clarify, is your concern narrowly about vmap_pages_range()
specifically? Or is your concern more generally about having two separate
steps as applied to system memory: 1) allocate the virtual address
space and 2) do the mapping? The two separate steps are already
available for MMIO space. Doing the equivalent for system memory
should be straightforward.
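For reference, the existing two-step pattern for MMIO looks roughly like
this (a sketch only; "size" and "phys_addr" are placeholders, not code
from this series):

	/* Step 1: reserve a chunk of vmalloc address space */
	struct vm_struct *area = get_vm_area(size, VM_IOREMAP);

	/* Step 2: map the MMIO range at that caller-known address */
	ioremap_page_range((unsigned long)area->addr,
			   (unsigned long)area->addr + size,
			   phys_addr, pgprot_noncached(PAGE_KERNEL));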
Conversely, combining the two steps into a single new vmap() variant
would be a little messy, but can probably be made to work. This
combined approach will be less efficient since my use case does a single
allocation of virtual address space and repeatedly maps/unmaps the
same page in that space. I would need to take some measurements
to see if the inefficiency actually matters.
Michael
On Thu, Nov 23, 2023 at 12:24:49AM +0000, Michael Kelley wrote:
> > I really do not want to expose vmap_pages_range. Please try to come up
> > with a good way to encapsulate your map at a certain address primitive
> > and implement it in vmalloc.c.
>
> To clarify, is your concern narrowly about vmap_pages_range()
> specifically?
The prime concern is that it took a lot of effort to make
vmap_pages_range static and remove all the abuses. I absolutely
object to undoing that.
On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> From: Michael Kelley <[email protected]>
>
> In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> versa, attributes in the PTE must be updated *and* the hypervisor must
> be notified of the change.
Strictly speaking it is not true for TDX. Conversion to shared can be
implicit: set shared bit and touch the page will do the conversion. MapGPA
is optional.
> Because there are two separate steps, there's
> a window where the settings are inconsistent. Normally the code that
> initiates the transition (via set_memory_decrypted() or
> set_memory_encrypted()) ensures that the memory is not being accessed
> during a transition, so the window of inconsistency is not a problem.
> However, the load_unaligned_zeropad() function can read arbitrary memory
> pages at arbitrary times, which could read a transitioning page during
> the window. In such a case, CoCo VM specific exceptions are taken
> (depending on the CoCo architecture in use). Current code in those
> exception handlers recovers and does "fixup" on the result returned by
> load_unaligned_zeropad(). Unfortunately, this exception handling can't
> work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
> if the exceptions are routed to the paravisor. The paravisor can't
> do load_unaligned_zeropad() fixup, so the exceptions would need to
> be forwarded from the paravisor to the Linux guest, but there are
> no architectural specs for how to do that.
Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
cross-page access to shared memory while no fixup entry for the page in
L1. It would give L2 chance to handle the situation in a transparent way.
Maybe I miss something, I donno.
--
Kiryl Shutsemau / Kirill A. Shutemov
From: Christoph Hellwig <[email protected]> Sent: Wednesday, November 22, 2023 11:32 PM
>
> On Thu, Nov 23, 2023 at 12:24:49AM +0000, Michael Kelley wrote:
> > > I really do not want to expose vmap_pages_range. Please try to come up
> > > with a good way to encapsulate your map at a certain address primitive
> > > and implement it in vmalloc.c.
> >
> > To clarify, is your concern narrowly about vmap_pages_range()
> > specifically?
>
> The prime concern is that it took a lot of effort to make
> vmap_pages_range static and remove all the abuses. I absolutely
> object to undoing that.
OK, so I assume that means a new variant of vmap_pages_range(),
such as one that always sets the page_shift parameter to PAGE_SHIFT,
is also disallowed because of the same potential for abuse.
So the only way to map a system memory page to a vmalloc
vaddr is via vmap() or some vmap() variant, which always
creates a new vmalloc area via get_vm_area(). I've done the
perf measurements, and that approach won't work for this use
case. Independent of the alignment requirements, the churn in
creating and removing a lot of vmalloc areas has too much perf
impact. The use case needs to create a single vmalloc area, and
then repeatedly map/unmap a page in that existing area.
I'll have to handle the top-level problem in this patch set in
a completely different way.
Michael
On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> + unsigned long pfn, bool validate, int *rc2)
> +{
> + int rc;
> + struct page *page = pfn_to_page(pfn);
> +
> + *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> + PAGE_KERNEL, &page, PAGE_SHIFT);
Can't this fail and then the pvalidate below would encounter trouble?
Sort of separately, if those vmalloc objections can't be worked
through, did you consider doing something like text_poke() does (create
the temporary mapping in a temporary MM) for pvalidate purposes? I
don't know enough about what kind of special exceptions might popup
during that operation though, might be playing with fire...
> + rc = pvalidate(vaddr, size, validate);
> + vunmap_range(vaddr, vaddr + PAGE_SIZE);
> +
> + return rc;
> +}
On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data
> *cpa, int primary)
> */
> if (pte_val(old_pte) != pte_val(new_pte)) {
> set_pte_atomic(kpte, new_pte);
> - cpa->flags |= CPA_FLUSHTLB;
> +
> + /* If old_pte isn't present, it's not in the
> TLB */
> + if (pte_present(old_pte))
> + cpa->flags |= CPA_FLUSHTLB;
> }
> cpa->numpages = 1;
> return 0;
>
Makes sense to me. The PMD case can be handled similarly in
__should_split_large_page().
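A hypothetical sketch of that PMD-level guard (assuming the spot in
__should_split_large_page() that sets CPA_FLUSHTLB, with old_pte being
the existing large-page entry) would be along the lines of:

	/* Only request a TLB flush if the old large-page PTE was present */
	if (pte_present(old_pte))
		cpa->flags |= CPA_FLUSHTLB;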
I also think it should be more robust with regard to the cache flushing
changes.
If callers did:
set_memory_np()
set_memory_uc()
set_memory_p()
Then the cache flush would be missed. I don't think anyone is, but we
shouldn't introduce hidden things like that. Maybe fix it like this:
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index f519e5ca543b..28ff53a4447a 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1856,11 +1856,6 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
ret = __change_page_attr_set_clr(&cpa, 1);
- /*
- * Check whether we really changed something:
- */
- if (!(cpa.flags & CPA_FLUSHTLB))
- goto out;
/*
* No need to flush, when we did not set any of the caching
@@ -1868,6 +1863,12 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
*/
cache = !!pgprot2cachemode(mask_set);
+ /*
+ * Check whether we really changed something:
+ */
+ if (!(cpa.flags & CPA_FLUSHTLB) && !cache)
+ goto out;
+
/*
* On error; flush everything to be sure.
*/
Hmm, might want to maintain the "On error; flush everything to be sure"
logic in the NP->P case as well.
From: Edgecombe, Rick P <[email protected]> Sent: Monday, November 27, 2023 2:21 PM
>
> On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -1636,7 +1636,10 @@ static int __change_page_attr(struct cpa_data
> > *cpa, int primary)
> > */
> > if (pte_val(old_pte) != pte_val(new_pte)) {
> > set_pte_atomic(kpte, new_pte);
> > - cpa->flags |= CPA_FLUSHTLB;
> > +
> > + /* If old_pte isn't present, it's not in the TLB */
> > + if (pte_present(old_pte))
> > + cpa->flags |= CPA_FLUSHTLB;
> > }
> > cpa->numpages = 1;
> > return 0;
> >
>
> Makes sense to me. The PMD case can be handled similarly in
> __should_split_large_page().
OK, I'll look at that case.
>
> I also think it should be more robust in regards to the cache flushing
> changes.
>
> If callers did:
> set_memory_np()
> set_memory_uc()
> set_memory_p()
>
> Then the cache flush would be missed. I don't think anyone is, but we
> shouldn't introduce hidden things like that. Maybe fix it like this:
>
> diff --git a/arch/x86/mm/pat/set_memory.c
> b/arch/x86/mm/pat/set_memory.c
> index f519e5ca543b..28ff53a4447a 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1856,11 +1856,6 @@ static int change_page_attr_set_clr(unsigned
> long *addr, int numpages,
>
> ret = __change_page_attr_set_clr(&cpa, 1);
>
> - /*
> - * Check whether we really changed something:
> - */
> - if (!(cpa.flags & CPA_FLUSHTLB))
> - goto out;
>
> /*
> * No need to flush, when we did not set any of the caching
> @@ -1868,6 +1863,12 @@ static int change_page_attr_set_clr(unsigned
> long *addr, int numpages,
> */
> cache = !!pgprot2cachemode(mask_set);
>
> + /*
> + * Check whether we really changed something:
> + */
> + if (!(cpa.flags & CPA_FLUSHTLB) && !cache)
> + goto out;
> +
> /*
> * On error; flush everything to be sure.
> */
>
> Hmm, might want to maintain the "On error; flush everything to be sure"
> logic in the NP->P case as well.
OK, I see your point. I had not realized that CPA_FLUSHTLB really
has a meaning beyond just indicating that the TLB needs to be
flushed. It really means "something has changed" in a PTE. I'll
incorporate your suggestion.
Michael
From: Edgecombe, Rick P <[email protected]> Sent: Monday, November 27, 2023 1:39 PM
>
> On Tue, 2023-11-21 at 13:20 -0800, [email protected] wrote:
> > +static int pvalidate_pfn(unsigned long vaddr, unsigned int size,
> > + unsigned long pfn, bool validate, int *rc2)
> > +{
> > + int rc;
> > + struct page *page = pfn_to_page(pfn);
> > +
> > + *rc2 = vmap_pages_range(vaddr, vaddr + PAGE_SIZE,
> > + PAGE_KERNEL, &page, PAGE_SHIFT);
>
> Can't this fail and then the pvalidate below would encounter trouble?
Yes. My intent was to just let the PVALIDATE fail because of operating
on a vaddr that's invalid. But that would be worth a comment.
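For example, something along these lines in pvalidate_pfn() (the wording
is just a suggestion):

	/*
	 * If vmap_pages_range() fails, deliberately continue: the PVALIDATE
	 * below then fails on the unmapped vaddr, and rc2 preserves the
	 * mapping error for the WARN in pvalidate_pages().
	 */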
>
> Sort of separately, if those vmalloc objections can't be worked
> through, did you consider doing something like text_poke() does (create
> the temporary mapping in a temporary MM) for pvalidate purposes? I
> don't know enough about what kind of special exceptions might popup
> during that operation though, might be playing with fire...
Interesting idea. But from a quick glance at the text_poke() code,
such an approach seems somewhat complex, and I suspect it will have
the same perf issues (or worse) as creating a new vmalloc area for
each PVALIDATE invocation.
At this point, the complexity of creating the temp mapping for
PVALIDATE is seeming excessive. On balance it seems simpler to
revert to an approach where the use of set_memory_np() and
set_memory_p() is conditional. It would be necessary when #VC
and #VE exceptions are directed to a paravisor. (This assumes the
paravisor interface in the hypervisor callbacks does the natural thing
of working with physical addresses, so there's no need for a temp
mapping.)
Optionally, the set_memory_np()/set_memory_p() approach could
be used in other cases where the hypervisor callbacks work with
physical addresses. But it can't be used with cases where the hypervisor
callbacks need valid virtual addresses.
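As a purely hypothetical illustration of that conditional approach (the
flag name below is invented; the eventual patches may express the
condition differently):

	/*
	 * Hypothetical: only hide the pages when the hypervisor callbacks
	 * work on physical addresses (e.g., the paravisor configurations),
	 * so no valid virtual address is needed during the transition.
	 */
	if (x86_platform.guest.enc_status_uses_physaddr) {
		ret = set_memory_np(addr, numpages);
		if (ret)
			return ret;
	}

	/* ... existing attribute change and hypervisor notification ... */

	if (x86_platform.guest.enc_status_uses_physaddr)
		ret = set_memory_p(&addr, numpages);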
So on net, set_memory_np()/set_memory_p() would be used in
the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could
optionally be used with TDX with no paravisor, but my sense is
that Kirill wants to keep TDX "as is" and let the exception handlers
do the load_unaligned_zeropad() fixup.
It could not be used with SEV-SNP with no paravisor. Additional fixes
may be needed on the SEV-SNP side to properly fixup
load_unaligned_zeropad() accesses to a page that's in transition
between encrypted and decrypted.
I'll work on a patch series that takes the conditional approach.
Michael
>
> > + rc = pvalidate(vaddr, size, validate);
> > + vunmap_range(vaddr, vaddr + PAGE_SIZE);
> > +
> > + return rc;
> > +}
On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> >
> > Sort of separately, if those vmalloc objections can't be worked
> > through, did you consider doing something like text_poke() does
> > (create
> > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > don't know enough about what kind of special exceptions might popup
> > during that operation though, might be playing with fire...
>
> Interesting idea. But from a quick glance at the text_poke() code,
> such an approach seems somewhat complex, and I suspect it will have
> the same perf issues (or worse) as creating a new vmalloc area for
> each PVALIDATE invocation.
Using new vmalloc areas will eventually result in a kernel-wide TLB
shootdown, but most calls involve no flushes. text_poke() will always
result in a local-only flush. So at least whatever slowdown there is
would only affect the calling thread.
As for complexity, I think it might be simple to implement actually.
What kind of special exceptions could come out of pvalidate, I'm not so
sure. But the kernel terminates the VM on failure anyway, so maybe it's
not an issue?
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..a13293564eeb 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode, size_t len)
return __text_poke(text_poke_memcpy, addr, opcode, len);
}
+static void text_poke_pvalidate(void *dst, const void *src, size_t len)
+{
+ pvalidate(dst, len, true); // if fail, terminate
+}
+
+void *pvalidated_poke(void *addr)
+{
+ return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE);
+}
+
/**
* text_poke_kgdb - Update instructions on a live kernel by kgdb
* @addr: address to modify
>
> At this point, the complexity of creating the temp mapping for
> PVALIDATE is seeming excessive. On balance it seems simpler to
> revert to an approach where the use of set_memory_np() and
> set_memory_p() is conditional. It would be necessary when #VC
> and #VE exceptions are directed to a paravisor. (This assumes the
> paravisor interface in the hypervisor callbacks does the natural
> thing
> of working with physical addresses, so there's no need for a temp
> mapping.)
>
> Optionally, the set_memory_np()/set_memory_p() approach could
> be used in other cases where the hypervisor callbacks work with
> physical addresses. But it can't be used with cases where the
> hypervisor
> callbacks need valid virtual addresses.
>
> So on net, set_memory_np()/set_memory_p() would be used in
> the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could
> optionally be used with TDX with no paravisor, but my sense is
> that Kirill wants to keep TDX "as is" and let the exception handlers
> do the load_unaligned_zeropad() fixup.
>
> It could not be used with SEV-SNP with no paravisor. Additional
> fixes
> may be needed on the SEV-SNP side to properly fixup
> load_unaligned_zeropad() accesses to a page that's in transition
> between encrypted and decrypted.
>
Yea, I don't know about this paravisor/exception stuff.
From: [email protected] <[email protected]> Sent: Friday, November 24, 2023 2:06 AM
>
> On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> > From: Michael Kelley <[email protected]>
> >
> > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > versa, attributes in the PTE must be updated *and* the hypervisor must
> > be notified of the change.
>
> Strictly speaking it is not true for TDX. Conversion to shared can be
> implicit: set shared bit and touch the page will do the conversion. MapGPA
> is optional.
Interesting. Given that, is there a reason to use the explicit
hypervisor callbacks for private->shared transitions in
__set_memory_enc_pgtable()? It probably doesn't have direct relevance
to this patch series, but I'm just trying to understand the tradeoffs of
the implicit vs. explicit approach. And am I correct that
shared->private transitions must use the explicit approach?
>
> > Because there are two separate steps, there's
> > a window where the settings are inconsistent. Normally the code that
> > initiates the transition (via set_memory_decrypted() or
> > set_memory_encrypted()) ensures that the memory is not being accessed
> > during a transition, so the window of inconsistency is not a problem.
> > However, the load_unaligned_zeropad() function can read arbitrary memory
> > pages at arbitrary times, which could read a transitioning page during
> > the window. In such a case, CoCo VM specific exceptions are taken
> > (depending on the CoCo architecture in use). Current code in those
> > exception handlers recovers and does "fixup" on the result returned by
> > load_unaligned_zeropad(). Unfortunately, this exception handling can't
> > work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
> > if the exceptions are routed to the paravisor. The paravisor can't
> > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > be forwarded from the paravisor to the Linux guest, but there are
> > no architectural specs for how to do that.
>
> Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
> cross-page access to shared memory while no fixup entry for the page in
> L1. It would give L2 chance to handle the situation in a transparent way.
>
> Maybe I miss something, I donno.
I'm recounting what the Hyper-V paravisor folks say without knowing all the
details. :-( But it seems like any kind of forwarding scheme needs to be a
well-defined contract that would work for both TDX and SEV-SNP. The
paravisor in L1 might or might not be Linux-based, so the contract must be OS
independent. And the L2 guest might or might not be Linux, so there's
potential for some other kind of error to be confused with a Linux
load_unaligned_zeropad() reference.
Maybe all that could be sorted out, but I come back to believing that it's
better now and in the long run to just avoid all this complexity by decoupling
private-shared page transitions and Linux load_unaligned_zeropad().
Unfortunately that decoupling hasn't been as simple as I first envisioned
because of SEV-SNP PVALIDATE needing a virtual address. But doing the
decoupling only in the paravisor case still seems like the simpler approach.
Michael
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov
On Tue, Nov 28, 2023 at 07:12:33PM +0000, Michael Kelley wrote:
> From: [email protected] <[email protected]> Sent: Friday, November 24, 2023 2:06 AM
> >
> > On Tue, Nov 21, 2023 at 01:20:08PM -0800, [email protected] wrote:
> > > From: Michael Kelley <[email protected]>
> > >
> > > In a CoCo VM when a page transitions from encrypted to decrypted, or vice
> > > versa, attributes in the PTE must be updated *and* the hypervisor must
> > > be notified of the change.
> >
> > Strictly speaking it is not true for TDX. Conversion to shared can be
> > implicit: set shared bit and touch the page will do the conversion. MapGPA
> > is optional.
>
> Interesting. Given that, is there a reason to use the explicit
> hypervisor callbacks for private->shared transitions in
> __set_memory_enc_pgtable()? It probably doesn't have direct relevance
> to this patch series, but I'm just trying to understand the tradeoffs of
> the implicit vs. explicit approach. And am I correct that
> shared->private transitions must use the explicit approach?
It must be explicit in the sense that the memory has to be accepted before
use. MapGPA() is still optional.
I don't like these implicit tricks. I spent a lot of time debugging an
issue that was obscured by these semantics.
But I think it is going to stay :/
> > > Because there are two separate steps, there's
> > > a window where the settings are inconsistent. Normally the code that
> > > initiates the transition (via set_memory_decrypted() or
> > > set_memory_encrypted()) ensures that the memory is not being accessed
> > > during a transition, so the window of inconsistency is not a problem.
> > > However, the load_unaligned_zeropad() function can read arbitrary memory
> > > pages at arbitrary times, which could read a transitioning page during
> > > the window. In such a case, CoCo VM specific exceptions are taken
> > > (depending on the CoCo architecture in use). Current code in those
> > > exception handlers recovers and does "fixup" on the result returned by
> > > load_unaligned_zeropad(). Unfortunately, this exception handling can't
> > > work in paravisor scenarios (TDX Partitioning and SEV-SNP in vTOM mode)
> > > if the exceptions are routed to the paravisor. The paravisor can't
> > > do load_unaligned_zeropad() fixup, so the exceptions would need to
> > > be forwarded from the paravisor to the Linux guest, but there are
> > > no architectural specs for how to do that.
> >
> > Hm. Can't we inject #PF (or #GP) into L2 if #VE/#VC handler in L1 sees
> > cross-page access to shared memory while no fixup entry for the page in
> > L1. It would give L2 chance to handle the situation in a transparent way.
> >
> > Maybe I miss something, I donno.
>
> I'm recounting what the Hyper-V paravisor folks say without knowing all the
> details. :-( But it seems like any kind of forwarding scheme needs to be a
> well-defined contract that would work for both TDX and SEV-SNP. The
> paravisor in L1 might or might not be Linux-based, so the contract must be OS
> independent. And the L2 guest might or might not be Linux, so there's
> potential for some other kind of error to be confused with a Linux
> load_unaligned_zeropad() reference.
Okay, fair enough. I have a hard time reasoning about whether it is okay
for an L2 which is not Linux.
--
Kiryl Shutsemau / Kirill A. Shutemov
From: Edgecombe, Rick P <[email protected]> Sent: Tuesday, November 28, 2023 10:59 AM
>
> On Tue, 2023-11-28 at 18:08 +0000, Michael Kelley wrote:
> > >
> > > Sort of separately, if those vmalloc objections can't be worked
> > > through, did you consider doing something like text_poke() does
> > > (create
> > > the temporary mapping in a temporary MM) for pvalidate purposes? I
> > > don't know enough about what kind of special exceptions might popup
> > > during that operation though, might be playing with fire...
> >
> > Interesting idea. But from a quick glance at the text_poke() code,
> > such an approach seems somewhat complex, and I suspect it will have
> > the same perf issues (or worse) as creating a new vmalloc area for
> > each PVALIDATE invocation.
>
> Using new vmalloc area's will eventually result in a kernel shootdown,
> but usually have no flushes. text_poke will always result in a local-
> only flush. So at least whatever slowdown there is would only affect
> the calling thread.
>
> As for complexity, I think it might be simple to implement actually.
> What kind of special exceptions could come out of pvalidate, I'm not so
> sure. But the kernel terminates the VM on failure anyway, so maybe it's
> not an issue?
Sorry for the delay in getting back to this topic.
OK, I see now what you are suggesting. For each page that needs to
be PVALIDATE'd, use __text_poke() to create the temp mapping and
run PVALIDATE. However, there are some problems. __text_poke()
runs vmalloc_to_page() for addresses that aren't core kernel text,
and vmalloc_to_page() will fail if the PTE "present" bit has been
cleared. That could be easily addressed by changing it to use
slow_virt_to_phys(). But PVALIDATE also needs to be able to return
the PVALIDATE_FAIL_SIZEMISMATCH error code, which is tested for
in pvalidate_pages() and does not terminate the VM. __text_poke()
doesn’t have the machinery to return such an error code, and
that's harder to fix.
There's also the conceptual issue. The PVALIDATE use case isn't
working with a text area, so that case would be abusing "text_poke"
a bit. I could imagine __text_poke() having code to verify that it's
working on a text area, even if that code isn't there now.
To get a sense of performance, I hacked the equivalent of text_poke()
to work with PVALIDATE. The biggest case of transitioning pages from
encrypted to decrypted is the swiotlb area, which is 1 Gbyte for a
VM with 16 Gbytes or more of memory. In a Hyper-V CoCo VM, current
code takes about 270 milliseconds to transition that 1 Gbyte swiotlb area.
With my initial approach using vmap_pages_range(), that 270 ms went to
319 ms, which is fairly negligible in the overall VM boot time. Using the
text_poke() approach increased the time to 368 ms, which is bigger but
still probably not a show-stopper. It's definitely faster than creating a
new vmalloc area for each page that needs to be PVALIDATE'd, which
adds about 6 seconds to the boot time.
All-in-all, I'm back to my Plan B, which is to mark the pages "not
present" only in configurations where the hypervisor callbacks operate
on physical addresses instead of virtual addresses. Since SEV-SNP
needs virtual addresses, it will need to handle exceptions generated
by load_unaligned_zeropad() and do the appropriate fixup.
I'll try to get my "Plan B" patch set posted in the next few days.
Michael
>
> diff --git a/arch/x86/kernel/alternative.c
> b/arch/x86/kernel/alternative.c
> index 73be3931e4f0..a13293564eeb 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -1905,6 +1905,16 @@ void *text_poke(void *addr, const void *opcode,
> size_t len)
> return __text_poke(text_poke_memcpy, addr, opcode, len);
> }
>
> +static void text_poke_pvalidate(void *dst, const void *src, size_t
> len)
> +{
> + pvalidate(dst, len, true); // if fail, terminate
> +}
> +
> +void *pvalidated_poke(void *addr)
> +{
> + return __text_poke(text_poke_pvalidate, addr, NULL, PAGE_SIZE);
> +}
> +
> /**
> * text_poke_kgdb - Update instructions on a live kernel by kgdb
> * @addr: address to modify
>
>
>
> >
> > At this point, the complexity of creating the temp mapping for
> > PVALIDATE is seeming excessive. On balance it seems simpler to
> > revert to an approach where the use of set_memory_np() and
> > set_memory_p() is conditional. It would be necessary when #VC
> > and #VE exceptions are directed to a paravisor. (This assumes the
> > paravisor interface in the hypervisor callbacks does the natural thing
> > of working with physical addresses, so there's no need for a temp
> > mapping.)
> >
> > Optionally, the set_memory_np()/set_memory_p() approach could
> > be used in other cases where the hypervisor callbacks work with
> > physical addresses. But it can't be used with cases where the
> > hypervisor callbacks need valid virtual addresses.
> >
> > So on net, set_memory_np()/set_memory_p() would be used in
> > the Hyper-V cases of TDX and SEV-SNP with a paravisor. It could
> > optionally be used with TDX with no paravisor, but my sense is
> > that Kirill wants to keep TDX "as is" and let the exception handlers
> > do the load_unaligned_zeropad() fixup.
> >
> > It could not be used with SEV-SNP with no paravisor. Additional fixes
> > may be needed on the SEV-SNP side to properly fixup
> > load_unaligned_zeropad() accesses to a page that's in transition
> > between encrypted and decrypted.
> >
>
> Yea, I don't know about this paravisor/exception stuff.