2023-05-04 22:57:20

by Dexuan Cui

Subject: [PATCH v6 0/6] Support TDX guests on Hyper-V

The patchset adds the Hyper-V specific code so that a TDX guest can run
on Hyper-V. Please review.

The v6 patchset is based on today's mainline (a1fd058b07d5).

The v6 patchset addressed Michael's comments on patch 5:
Removed 2 unnecessary lines of messages from the commit log.
Fixed the error handling path for hv_synic_alloc()/free().
Printed the 'ret' in hv_synic_alloc()/free().

@Michael Kelley: Can you please review patch 5?

@x86 maintainers:
If the patches look good to you, can you please take patch 1 and 2
into the tip tree?

@Wei Liu: I think patch 3, 4, 5, 6 should go through the Hyper-V tree
since they change the Hyper-V code.

If you want to view the patches on github, it is here:
https://github.com/dcui/tdx/commits/decui/mainline/v6

FYI, v1-v5 are here:
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/
https://lwn.net/ml/linux-kernel/[email protected]/

Thanks,
Dexuan

Dexuan Cui (6):
x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
x86/tdx: Support vmalloc() for tdx_enc_status_changed()
x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
x86/hyperv: Support hypercalls for TDX guests
Drivers: hv: vmbus: Support TDX guests
x86/hyperv: Fix serial console interrupts for TDX guests

arch/x86/coco/tdx/tdx.c | 122 ++++++++++++++++++++++-------
arch/x86/hyperv/hv_apic.c | 6 +-
arch/x86/hyperv/hv_init.c | 27 ++++++-
arch/x86/hyperv/ivm.c | 20 +++++
arch/x86/include/asm/hyperv-tlfs.h | 3 +-
arch/x86/include/asm/mshyperv.h | 20 +++++
arch/x86/kernel/cpu/mshyperv.c | 43 ++++++++++
drivers/hv/hv.c | 65 ++++++++++++++-
drivers/hv/hv_common.c | 30 +++++++
include/asm-generic/mshyperv.h | 1 +
10 files changed, 300 insertions(+), 37 deletions(-)

--
2.25.1


2023-05-04 22:59:52

by Dexuan Cui

Subject: [PATCH v6 6/6] x86/hyperv: Fix serial console interrupts for TDX guests

When a TDX guest runs on Hyper-V, the UEFI firmware sets the HW_REDUCED
flag, and consequently ttyS0 interrupts can't work. Fix the issue by
overriding x86_init.acpi.reduced_hw_early_init().
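The override pattern the patch uses can be sketched with plain function pointers. This is an illustrative userspace reduction, not kernel code: `struct init_hooks` and the `default_*` functions stand in for the kernel's `x86_init` table, and `noop()` stands in for `x86_init_noop()`.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's x86_init hook table. */
static int timer_inited, vector_inited;

static void default_timer_init(void)      { timer_inited = 1; }
static void default_pre_vector_init(void) { vector_inited = 1; }
static void noop(void)                    { /* like x86_init_noop() */ }

struct init_hooks {
	void (*timer_init)(void);
	void (*pre_vector_init)(void);
};

static struct init_hooks hooks = { default_timer_init, default_pre_vector_init };

/*
 * Like the patch's reduced_hw_init(): replace only the timer and
 * pre-vector hooks, leaving the legacy PIC setup (and hence the legacy
 * IRQ descriptors that ttyS0 needs) untouched.
 */
static void reduced_hw_init(void)
{
	hooks.timer_init      = noop;
	hooks.pre_vector_init = noop;
}
```

The point of the design choice is what reduced_hw_init() does *not* touch: unlike acpi_generic_reduced_hw_init(), it leaves 'legacy_pic' alone, so nr_legacy_irqs() stays non-zero.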

Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---

Changes since v1:
None.

Changes in v5:
Improved the comment [Michael Kelley]
Added Michael's Reviewed-by.

Changes in v6: None.

arch/x86/kernel/cpu/mshyperv.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index b95b689efa07..d642e3624286 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -313,6 +313,26 @@ static void __init hv_smp_prepare_cpus(unsigned int max_cpus)
}
#endif

+/*
+ * When a TDX guest runs on Hyper-V, the firmware sets the HW_REDUCED flag: see
+ * acpi_tb_create_local_fadt(). Consequently ttyS0 interrupts can't work because
+ * request_irq() -> ... -> irq_to_desc() returns NULL for ttyS0. This happens
+ * because mp_config_acpi_legacy_irqs() sees a nr_legacy_irqs() of 0, so it
+ * doesn't initialize the array 'mp_irqs[]', and later setup_IO_APIC_irqs() ->
+ * find_irq_entry() fails to find the legacy irqs from the array, and hence
+ * doesn't create the necessary irq description info.
+ *
+ * Clone arch/x86/kernel/acpi/boot.c: acpi_generic_reduced_hw_init() here,
+ * except don't change 'legacy_pic'. It keeps its default value
+ * 'default_legacy_pic'. mp_config_acpi_legacy_irqs() sees a non-zero
+ * nr_legacy_irqs(), and eventually serial console interrupts work properly.
+ */
+static void __init reduced_hw_init(void)
+{
+ x86_init.timers.timer_init = x86_init_noop;
+ x86_init.irqs.pre_vector_init = x86_init_noop;
+}
+
static void __init ms_hyperv_init_platform(void)
{
int hv_max_functions_eax;
@@ -424,6 +444,8 @@ static void __init ms_hyperv_init_platform(void)

/* A TDX VM must use x2APIC and doesn't use lazy EOI */
ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
+
+ x86_init.acpi.reduced_hw_early_init = reduced_hw_init;
}
}

--
2.25.1

2023-05-04 23:02:26

by Dexuan Cui

Subject: [PATCH v6 3/6] x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests

No logic change to SNP/VBS guests.

hv_isolation_type_tdx() will be used to instruct a TDX guest on Hyper-V to
do some TDX-specific operations, e.g. hv_do_hypercall() should use
__tdx_hypercall(), and a TDX guest on Hyper-V should handle the Hyper-V
Event/Message/Monitor pages specially.

Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---

Changes in v2:
Added "#ifdef CONFIG_INTEL_TDX_GUEST and #endif" for
hv_isolation_type_tdx() in arch/x86/hyperv/ivm.c.

Simplified the changes in ms_hyperv_init_platform().

Changes in v3:
Added Kuppuswamy's Reviewed-by.

Changes in v4:
A minor rebase to Michael's v7 DDA patchset.

Changes in v5:
Added Michael's Reviewed-by.

Changes in v6: None.

arch/x86/hyperv/ivm.c | 6 ++++++
arch/x86/include/asm/hyperv-tlfs.h | 3 ++-
arch/x86/include/asm/mshyperv.h | 3 +++
arch/x86/kernel/cpu/mshyperv.c | 2 ++
drivers/hv/hv_common.c | 6 ++++++
include/asm-generic/mshyperv.h | 1 +
6 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index cc92388b7a99..952117ce2d80 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -400,6 +400,7 @@ bool hv_is_isolation_supported(void)
}

DEFINE_STATIC_KEY_FALSE(isolation_type_snp);
+DEFINE_STATIC_KEY_FALSE(isolation_type_tdx);

/*
* hv_isolation_type_snp - Check system runs in the AMD SEV-SNP based
@@ -409,3 +410,8 @@ bool hv_isolation_type_snp(void)
{
return static_branch_unlikely(&isolation_type_snp);
}
+
+bool hv_isolation_type_tdx(void)
+{
+ return static_branch_unlikely(&isolation_type_tdx);
+}
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index cea95dcd27c2..ebb6d24fe352 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -169,7 +169,8 @@
enum hv_isolation_type {
HV_ISOLATION_TYPE_NONE = 0,
HV_ISOLATION_TYPE_VBS = 1,
- HV_ISOLATION_TYPE_SNP = 2
+ HV_ISOLATION_TYPE_SNP = 2,
+ HV_ISOLATION_TYPE_TDX = 3
};

/* Hyper-V specific model specific registers (MSRs) */
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 49bb4f2bd300..231e56631295 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -26,6 +26,7 @@
union hv_ghcb;

DECLARE_STATIC_KEY_FALSE(isolation_type_snp);
+DECLARE_STATIC_KEY_FALSE(isolation_type_tdx);

typedef int (*hyperv_fill_flush_list_func)(
struct hv_guest_mapping_flush_list *flush,
@@ -47,6 +48,8 @@ extern u64 hv_current_partition_id;

extern union hv_ghcb * __percpu *hv_ghcb_pg;

+extern bool hv_isolation_type_tdx(void);
+
int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index c7969e806c64..2fd687a80033 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -404,6 +404,8 @@ static void __init ms_hyperv_init_platform(void)

if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
static_branch_enable(&isolation_type_snp);
+ else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX)
+ static_branch_enable(&isolation_type_tdx);
}

if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 64f9ceca887b..6156114cd9c5 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -502,6 +502,12 @@ bool __weak hv_isolation_type_snp(void)
}
EXPORT_SYMBOL_GPL(hv_isolation_type_snp);

+bool __weak hv_isolation_type_tdx(void)
+{
+ return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
+
void __weak hv_setup_vmbus_handler(void (*handler)(void))
{
}
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index 402a8c1c202d..3449a8a27c03 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -58,6 +58,7 @@ extern void * __percpu *hyperv_pcpu_output_arg;
extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
extern bool hv_isolation_type_snp(void);
+extern bool hv_isolation_type_tdx(void);

/* Helper functions that provide a consistent pattern for checking Hyper-V hypercall status. */
static inline int hv_result(u64 status)
--
2.25.1

2023-05-04 23:02:34

by Dexuan Cui

Subject: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which doesn't yet work for
vmalloc(). Add the support by handling the pages one by one.
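The shape of the page-by-page walk can be sketched in plain C. All names here are illustrative stand-ins, not kernel API: virt_to_phys_page() fakes slow_virt_to_phys() with an arbitrary scattered mapping, and convert_range() stands in for tdx_enc_status_changed_phys().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/*
 * Hypothetical stand-in for slow_virt_to_phys(): a vmalloc range is
 * virtually contiguous but physically scattered, so each page must be
 * resolved individually. Here we just fake a scattered mapping.
 */
static uint64_t virt_to_phys_page(unsigned long vaddr)
{
	return vaddr ^ 0x40000000UL;	/* keeps page alignment */
}

static int pages_converted;

/* Hypothetical stand-in for tdx_enc_status_changed_phys(). */
static bool convert_range(uint64_t pa_start, uint64_t pa_end, bool enc)
{
	(void)enc;
	pages_converted += (int)((pa_end - pa_start) / PAGE_SIZE);
	return true;
}

/*
 * The loop structure of the patch: walk a page-aligned vmalloc range
 * one PAGE_SIZE chunk at a time, converting each page's own physical
 * frame, because adjacent virtual pages need not be physically adjacent.
 */
static bool convert_vmalloc_range(unsigned long start, int numpages, bool enc)
{
	unsigned long end = start + numpages * PAGE_SIZE;

	if (start % PAGE_SIZE)		/* like offset_in_page() != 0 */
		return false;

	while (start < end) {
		uint64_t pa = virt_to_phys_page(start);

		if (!convert_range(pa, pa + PAGE_SIZE, enc))
			return false;
		start += PAGE_SIZE;
	}
	return true;
}
```

A directly-mapped (non-vmalloc) range stays on the old fast path in the real patch, since there __pa() of the whole range is valid.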

Co-developed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---

Changes in v2:
Changed tdx_enc_status_changed() in place.

Hi, Dave, I checked the huge vmalloc mapping code, but still don't know
how to get the underlying huge page info (if huge page is in use) and
try to use PG_LEVEL_2M/1G in try_accept_page() for vmalloc: I checked
is_vm_area_hugepages() and __vfree() -> __vunmap(), and I think the
underlying page allocation info is internal to the mm code, and there
is no mm API for me to get the info in tdx_enc_status_changed().

Changes in v3:
No change since v2.

Changes in v4:
Added Kirill's Co-developed-by since Kirill helped to improve the
code by adding tdx_enc_status_changed_phys().

Thanks Kirill for the clarification on load_unaligned_zeropad()!

The vzalloc() usage in drivers/net/hyperv/netvsc.c: netvsc_init_buf()
remains the same. It may not be worth it to "allocate a vmalloc region,
allocate pages manually", because we have to consider the worst case
where the system is suffering from severe memory fragmentation and
we can only allocate multiple single pages. We may not want to
complicate the code in netvsc_init_buf(). We'll support NIC SR-IOV
for TDX VMs on Hyper-V, so the netvsc send/recv buffers won't be
used when the VF NIC is up.

Changes in v5:
Added Kirill's Signed-off-by.
Added Michael's Reviewed-by.

Changes in v6: None.

arch/x86/coco/tdx/tdx.c | 76 ++++++++++++++++++++++++++++-------------
1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 5574c91541a2..731be50b3d09 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/mm.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -789,6 +790,34 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
return true;
}

+static bool try_accept_page(phys_addr_t start, phys_addr_t end)
+{
+ /*
+ * For shared->private conversion, accept the page using
+ * TDX_ACCEPT_PAGE TDX module call.
+ */
+ while (start < end) {
+ unsigned long len = end - start;
+
+ /*
+ * Try larger accepts first. It gives chance to VMM to keep
+ * 1G/2M SEPT entries where possible and speeds up process by
+ * cutting number of hypercalls (if successful).
+ */
+
+ if (try_accept_one(&start, len, PG_LEVEL_1G))
+ continue;
+
+ if (try_accept_one(&start, len, PG_LEVEL_2M))
+ continue;
+
+ if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ return false;
+ }
+
+ return true;
+}
+
/*
* Notify the VMM about page mapping conversion. More info about ABI
* can be found in TDX Guest-Host-Communication Interface (GHCI),
@@ -838,6 +867,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
return !ret;
}

+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+ bool enc)
+{
+ if (!tdx_map_gpa(start, end, enc))
+ return false;
+
+ /* private->shared conversion requires only MapGPA call */
+ if (!enc)
+ return true;
+
+ return try_accept_page(start, end);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
@@ -845,37 +887,23 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ unsigned long start = vaddr;
+ unsigned long end = start + numpages * PAGE_SIZE;

- if (!tdx_map_gpa(start, end, enc))
+ if (offset_in_page(start) != 0)
return false;

- /* private->shared conversion requires only MapGPA call */
- if (!enc)
- return true;
+ if (!is_vmalloc_addr((void *)start))
+ return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);

- /*
- * For shared->private conversion, accept the page using
- * TDX_ACCEPT_PAGE TDX module call.
- */
while (start < end) {
- unsigned long len = end - start;
+ phys_addr_t start_pa = slow_virt_to_phys((void *)start);
+ phys_addr_t end_pa = start_pa + PAGE_SIZE;

- /*
- * Try larger accepts first. It gives chance to VMM to keep
- * 1G/2M SEPT entries where possible and speeds up process by
- * cutting number of hypercalls (if successful).
- */
-
- if (try_accept_one(&start, len, PG_LEVEL_1G))
- continue;
-
- if (try_accept_one(&start, len, PG_LEVEL_2M))
- continue;
-
- if (!try_accept_one(&start, len, PG_LEVEL_4K))
+ if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
return false;
+
+ start += PAGE_SIZE;
}

return true;
--
2.25.1

2023-05-04 23:03:24

by Dexuan Cui

Subject: [PATCH v6 4/6] x86/hyperv: Support hypercalls for TDX guests

A TDX guest uses the GHCI call rather than hv_hypercall_pg.

In hv_do_hypercall(), Hyper-V requires that the input/output addresses
have the cc_mask set.
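The cc_mask handling can be illustrated in isolation. TDX_SHARED_BIT below is purely illustrative: the real shared-GPA bit position is enumerated from the TDX module at boot and applied via cc_mkdec()/cc_mkenc(); bit 47 is just an example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: in a TDX guest one high GPA bit marks a page as
 * shared (decrypted) with the host. The real position comes from TDX
 * module enumeration; 47 is an arbitrary example value.
 */
#define TDX_SHARED_BIT (1ULL << 47)

/* Like cc_mkdec(): mark an address as shared with the host. */
static inline uint64_t mkdec(uint64_t paddr)
{
	return paddr | TDX_SHARED_BIT;
}

/* Like cc_mkenc(): mark an address as private to the guest. */
static inline uint64_t mkenc(uint64_t paddr)
{
	return paddr & ~TDX_SHARED_BIT;
}
```

This is why the patch wraps the hypercall input/output GPAs in cc_mkdec() before handing them to hv_tdx_hypercall(): the argument pages live in memory the host must be able to read.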

Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---

Changes in v2:
Implemented hv_tdx_hypercall() in C rather than in assembly code.
Renamed the parameter names of hv_tdx_hypercall().
Used cc_mkdec() directly in hv_do_hypercall().

Changes in v3:
Decrypted/encrypted hyperv_pcpu_input_arg in
hv_common_cpu_init() and hv_common_cpu_die().

Changes in v4:
__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
hv_common_cpu_die(): explicitly ignore the error from set_memory_encrypted() [Michael Kelley]
Added Sathyanarayanan's Reviewed-by.

Changes in v5:
Added Michael's Reviewed-by.

Changes in v6: None.

arch/x86/hyperv/hv_init.c | 8 ++++++++
arch/x86/hyperv/ivm.c | 14 ++++++++++++++
arch/x86/include/asm/mshyperv.h | 17 +++++++++++++++++
drivers/hv/hv_common.c | 24 ++++++++++++++++++++++++
4 files changed, 63 insertions(+)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index a5f9474f08e1..f175e0de821c 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -432,6 +432,10 @@ void __init hyperv_init(void)
/* Hyper-V requires to write guest os id via ghcb in SNP IVM. */
hv_ghcb_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);

+ /* A TDX guest uses the GHCI call rather than hv_hypercall_pg. */
+ if (hv_isolation_type_tdx())
+ goto skip_hypercall_pg_init;
+
hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX,
VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
@@ -471,6 +475,7 @@ void __init hyperv_init(void)
wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
}

+skip_hypercall_pg_init:
/*
* hyperv_init() is called before LAPIC is initialized: see
* apic_intr_mode_init() -> x86_platform.apic_post_init() and
@@ -594,6 +599,9 @@ bool hv_is_hyperv_initialized(void)
if (x86_hyper_type != X86_HYPER_MS_HYPERV)
return false;

+ /* A TDX guest uses the GHCI call rather than hv_hypercall_pg. */
+ if (hv_isolation_type_tdx())
+ return true;
/*
* Verify that earlier initialization succeeded by checking
* that the hypercall page is setup
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 952117ce2d80..61ff7060b39b 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -415,3 +415,17 @@ bool hv_isolation_type_tdx(void)
{
return static_branch_unlikely(&isolation_type_tdx);
}
+
+u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2)
+{
+ struct tdx_hypercall_args args = { };
+
+ args.r10 = control;
+ args.rdx = param1;
+ args.r8 = param2;
+
+ (void)__tdx_hypercall_ret(&args);
+
+ return args.r11;
+}
+EXPORT_SYMBOL_GPL(hv_tdx_hypercall);
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 231e56631295..945e5afaba69 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -10,6 +10,7 @@
#include <asm/nospec-branch.h>
#include <asm/paravirt.h>
#include <asm/mshyperv.h>
+#include <asm/coco.h>

/*
* Hyper-V always provides a single IO-APIC at this MMIO address.
@@ -54,6 +55,12 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);

+u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2);
+
+/*
+ * If the hypercall involves no input or output parameters, the hypervisor
+ * ignores the corresponding GPA pointer.
+ */
static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
{
u64 input_address = input ? virt_to_phys(input) : 0;
@@ -61,6 +68,10 @@ static inline u64 hv_do_hypercall(u64 control, void *input, void *output)
u64 hv_status;

#ifdef CONFIG_X86_64
+ if (hv_isolation_type_tdx())
+ return hv_tdx_hypercall(control,
+ cc_mkdec(input_address),
+ cc_mkdec(output_address));
if (!hv_hypercall_pg)
return U64_MAX;

@@ -104,6 +115,9 @@ static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1)
u64 hv_status;

#ifdef CONFIG_X86_64
+ if (hv_isolation_type_tdx())
+ return hv_tdx_hypercall(control, input1, 0);
+
{
__asm__ __volatile__(CALL_NOSPEC
: "=a" (hv_status), ASM_CALL_CONSTRAINT,
@@ -149,6 +163,9 @@ static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2)
u64 hv_status;

#ifdef CONFIG_X86_64
+ if (hv_isolation_type_tdx())
+ return hv_tdx_hypercall(control, input1, input2);
+
{
__asm__ __volatile__("mov %4, %%r8\n"
CALL_NOSPEC
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6156114cd9c5..5b32adff0e61 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -24,6 +24,7 @@
#include <linux/kmsg_dump.h>
#include <linux/slab.h>
#include <linux/dma-map-ops.h>
+#include <linux/set_memory.h>
#include <asm/hyperv-tlfs.h>
#include <asm/mshyperv.h>

@@ -359,6 +360,7 @@ int hv_common_cpu_init(unsigned int cpu)
u64 msr_vp_index;
gfp_t flags;
int pgcount = hv_root_partition ? 2 : 1;
+ int ret;

/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
@@ -368,6 +370,17 @@ int hv_common_cpu_init(unsigned int cpu)
if (!(*inputarg))
return -ENOMEM;

+ if (hv_isolation_type_tdx()) {
+ ret = set_memory_decrypted((unsigned long)*inputarg, pgcount);
+ if (ret) {
+ /* It may be unsafe to free *inputarg */
+ *inputarg = NULL;
+ return ret;
+ }
+
+ memset(*inputarg, 0x00, pgcount * HV_HYP_PAGE_SIZE);
+ }
+
if (hv_root_partition) {
outputarg = (void **)this_cpu_ptr(hyperv_pcpu_output_arg);
*outputarg = (char *)(*inputarg) + HV_HYP_PAGE_SIZE;
@@ -388,6 +401,8 @@ int hv_common_cpu_die(unsigned int cpu)
unsigned long flags;
void **inputarg, **outputarg;
void *mem;
+ int pgcount = hv_root_partition ? 2 : 1;
+ int ret;

local_irq_save(flags);

@@ -402,6 +417,15 @@ int hv_common_cpu_die(unsigned int cpu)

local_irq_restore(flags);

+ if (hv_isolation_type_tdx()) {
+ ret = set_memory_encrypted((unsigned long)mem, pgcount);
+ if (ret)
+ pr_warn("Hyper-V: Failed to encrypt input arg on cpu%d: %d\n",
+ cpu, ret);
+ /* It's unsafe to free 'mem'. */
+ return 0;
+ }
+
kfree(mem);

return 0;
--
2.25.1

2023-05-04 23:05:11

by Dexuan Cui

Subject: [PATCH v6 5/6] Drivers: hv: vmbus: Support TDX guests

Add Hyper-V specific code so that a TDX guest can run on Hyper-V:
No need to use hv_vp_assist_page.
Don't use the unsafe Hyper-V TSC page.
Don't try to use HV_REGISTER_CRASH_CTL.
Don't trust Hyper-V's TLB-flushing hypercalls.
Don't use lazy EOI.

Signed-off-by: Dexuan Cui <[email protected]>
---

Changes in v2:
Used a new function hv_set_memory_enc_dec_needed() in
__set_memory_enc_pgtable().
Added the missing set_memory_encrypted() in hv_synic_free().

Changes in v3:
Used pgprot_decrypted(PAGE_KERNEL) in hv_ringbuffer_init().
(Do not use PAGE_KERNEL_NOENC, which doesn't exist for ARM64).

Used cc_mkdec() in hv_synic_enable_regs().

ms_hyperv_init_platform():
Explicitly do not use HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED.
Explicitly do not use HV_X64_APIC_ACCESS_RECOMMENDED.

Enabled __send_ipi_mask() and __send_ipi_one() for TDX guests.

Changes in v4:
A minor rebase to Michael's v7 DDA patchset. I'm very happy that
I can drop my v3 change to arch/x86/mm/pat/set_memory.c due to
Michael's work.

Changes in v5:
Added memset() to clear synic_message_page and synic_event_page()
after set_memory_decrypted().
Rebased the patch since "post_msg_page" has been removed in
hyperv-next.
Improved the error handling in hv_synic_alloc()/free() [Michael
Kelley]

Changes in v6:
Addressed Michael Kelley's comments on patch 5:
Removed 2 unnecessary lines of messages from the commit log.
Fixed the error handling path for hv_synic_alloc()/free().
Printed the 'ret' in hv_synic_alloc()/free().

arch/x86/hyperv/hv_apic.c | 6 ++--
arch/x86/hyperv/hv_init.c | 19 +++++++---
arch/x86/kernel/cpu/mshyperv.c | 21 ++++++++++-
drivers/hv/hv.c | 65 ++++++++++++++++++++++++++++++++--
4 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 1fbda2f94184..b28da8b41b45 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -177,7 +177,8 @@ static bool __send_ipi_mask(const struct cpumask *mask, int vector,
(exclude_self && weight == 1 && cpumask_test_cpu(this_cpu, mask)))
return true;

- if (!hv_hypercall_pg)
+ /* A TDX guest doesn't use hv_hypercall_pg. */
+ if (!hv_isolation_type_tdx() && !hv_hypercall_pg)
return false;

if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
@@ -231,7 +232,8 @@ static bool __send_ipi_one(int cpu, int vector)

trace_hyperv_send_ipi_one(cpu, vector);

- if (!hv_hypercall_pg || (vp == VP_INVAL))
+ /* A TDX guest doesn't use hv_hypercall_pg. */
+ if ((!hv_isolation_type_tdx() && !hv_hypercall_pg) || (vp == VP_INVAL))
return false;

if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index f175e0de821c..f28357ecad7d 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -79,7 +79,7 @@ static int hyperv_init_ghcb(void)
static int hv_cpu_init(unsigned int cpu)
{
union hv_vp_assist_msr_contents msr = { 0 };
- struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
+ struct hv_vp_assist_page **hvp;
int ret;

ret = hv_common_cpu_init(cpu);
@@ -89,6 +89,7 @@ static int hv_cpu_init(unsigned int cpu)
if (!hv_vp_assist_page)
return 0;

+ hvp = &hv_vp_assist_page[cpu];
if (hv_root_partition) {
/*
* For root partition we get the hypervisor provided VP assist
@@ -398,11 +399,21 @@ void __init hyperv_init(void)
if (hv_common_init())
return;

- hv_vp_assist_page = kcalloc(num_possible_cpus(),
- sizeof(*hv_vp_assist_page), GFP_KERNEL);
+ /*
+ * The VP assist page is useless to a TDX guest: the only use we
+ * would have for it is lazy EOI, which can not be used with TDX.
+ */
+ if (hv_isolation_type_tdx())
+ hv_vp_assist_page = NULL;
+ else
+ hv_vp_assist_page = kcalloc(num_possible_cpus(),
+ sizeof(*hv_vp_assist_page),
+ GFP_KERNEL);
if (!hv_vp_assist_page) {
ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
- goto common_free;
+
+ if (!hv_isolation_type_tdx())
+ goto common_free;
}

if (hv_isolation_type_snp()) {
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 2fd687a80033..b95b689efa07 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -404,8 +404,27 @@ static void __init ms_hyperv_init_platform(void)

if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
static_branch_enable(&isolation_type_snp);
- else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX)
+ else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX) {
static_branch_enable(&isolation_type_tdx);
+
+ /*
+ * The GPAs of SynIC Event/Message pages and VMBus
+ * Monitor pages need to be added by this offset.
+ */
+ ms_hyperv.shared_gpa_boundary = cc_mkdec(0);
+
+ /* Don't use the unsafe Hyper-V TSC page */
+ ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
+
+ /* HV_REGISTER_CRASH_CTL is unsupported */
+ ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
+
+ /* Don't trust Hyper-V's TLB-flushing hypercalls */
+ ms_hyperv.hints &= ~HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
+
+ /* A TDX VM must use x2APIC and doesn't use lazy EOI */
+ ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
+ }
}

if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
index de6708dbe0df..af959e87b6e7 100644
--- a/drivers/hv/hv.c
+++ b/drivers/hv/hv.c
@@ -18,6 +18,7 @@
#include <linux/clockchips.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
+#include <linux/set_memory.h>
#include <clocksource/hyperv_timer.h>
#include <asm/mshyperv.h>
#include "hyperv_vmbus.h"
@@ -80,6 +81,7 @@ int hv_synic_alloc(void)
{
int cpu;
struct hv_per_cpu_context *hv_cpu;
+ int ret = -ENOMEM;

/*
* First, zero all per-cpu memory areas so hv_synic_free() can
@@ -120,9 +122,42 @@ int hv_synic_alloc(void)
(void *)get_zeroed_page(GFP_ATOMIC);
if (hv_cpu->synic_event_page == NULL) {
pr_err("Unable to allocate SYNIC event page\n");
+
+ free_page((unsigned long)hv_cpu->synic_message_page);
+ hv_cpu->synic_message_page = NULL;
+
goto err;
}
}
+
+ /* It's better to leak the page if the decryption fails. */
+ if (hv_isolation_type_tdx()) {
+ ret = set_memory_decrypted(
+ (unsigned long)hv_cpu->synic_message_page, 1);
+ if (ret) {
+ pr_err("Failed to decrypt SYNIC msg page: %d\n", ret);
+ hv_cpu->synic_message_page = NULL;
+
+ /*
+ * Free the event page so that a TDX VM won't
+ * try to encrypt the page in hv_synic_free().
+ */
+ free_page((unsigned long)hv_cpu->synic_event_page);
+ hv_cpu->synic_event_page = NULL;
+ goto err;
+ }
+
+ ret = set_memory_decrypted(
+ (unsigned long)hv_cpu->synic_event_page, 1);
+ if (ret) {
+ pr_err("Failed to decrypt SYNIC event page: %d\n", ret);
+ hv_cpu->synic_event_page = NULL;
+ goto err;
+ }
+
+ memset(hv_cpu->synic_message_page, 0, PAGE_SIZE);
+ memset(hv_cpu->synic_event_page, 0, PAGE_SIZE);
+ }
}

return 0;
@@ -131,18 +166,40 @@ int hv_synic_alloc(void)
* Any memory allocations that succeeded will be freed when
* the caller cleans up by calling hv_synic_free()
*/
- return -ENOMEM;
+ return ret;
}


void hv_synic_free(void)
{
int cpu;
+ int ret;

for_each_present_cpu(cpu) {
struct hv_per_cpu_context *hv_cpu
= per_cpu_ptr(hv_context.cpu_context, cpu);

+ /* It's better to leak the page if the encryption fails. */
+ if (hv_isolation_type_tdx()) {
+ if (hv_cpu->synic_message_page) {
+ ret = set_memory_encrypted((unsigned long)
+ hv_cpu->synic_message_page, 1);
+ if (ret) {
+ pr_err("Failed to encrypt SYNIC msg page: %d\n", ret);
+ hv_cpu->synic_message_page = NULL;
+ }
+ }
+
+ if (hv_cpu->synic_event_page) {
+ ret = set_memory_encrypted((unsigned long)
+ hv_cpu->synic_event_page, 1);
+ if (ret) {
+ pr_err("Failed to encrypt SYNIC event page: %d\n", ret);
+ hv_cpu->synic_event_page = NULL;
+ }
+ }
+ }
+
free_page((unsigned long)hv_cpu->synic_event_page);
free_page((unsigned long)hv_cpu->synic_message_page);
}
@@ -179,7 +236,8 @@ void hv_synic_enable_regs(unsigned int cpu)
if (!hv_cpu->synic_message_page)
pr_err("Fail to map synic message page.\n");
} else {
- simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
+ simp.base_simp_gpa =
+ cc_mkdec(virt_to_phys(hv_cpu->synic_message_page))
>> HV_HYP_PAGE_SHIFT;
}

@@ -198,7 +256,8 @@ void hv_synic_enable_regs(unsigned int cpu)
if (!hv_cpu->synic_event_page)
pr_err("Fail to map synic event page.\n");
} else {
- siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
+ siefp.base_siefp_gpa =
+ cc_mkdec(virt_to_phys(hv_cpu->synic_event_page))
>> HV_HYP_PAGE_SHIFT;
}

--
2.25.1

2023-05-04 23:05:23

by Dexuan Cui

Subject: [PATCH v6 1/6] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

The GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
operation for the pages in the region starting at the GPA specified
in R11.

When a TDX guest runs on Hyper-V, Hyper-V returns the retry error
when hyperv_init() -> swiotlb_update_mem_attributes() ->
set_memory_decrypted() decrypts up to 1GB of swiotlb bounce buffers.
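The retry policy the patch implements can be sketched in plain C. The hypercall is abstracted behind a function pointer here; STATUS_RETRY, map_gpa_with_retry() and the callback signature are illustrative stand-ins for __tdx_hypercall_ret() and its register-based return values, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_OK    0ULL
#define STATUS_RETRY 1ULL	/* like TDG.VP.VMCALL_RETRY */

/*
 * Stand-in for one TDVMCALL_MAP_GPA attempt: the real call returns the
 * GPA to retry from in R11; here it comes back via *retry_gpa.
 */
typedef uint64_t (*map_once_fn)(uint64_t start, uint64_t end,
				uint64_t *retry_gpa);

/*
 * The patch's policy: retry the *same* GPA at most max_retry_cnt times,
 * but reset the counter whenever the hypervisor reports progress (a
 * retry GPA further into the region).
 */
static int map_gpa_with_retry(map_once_fn once, uint64_t start, uint64_t end)
{
	const int max_retry_cnt = 3;
	int retry_cnt = 0;
	uint64_t ret, retry_gpa = 0;

	while (1) {
		ret = once(start, end, &retry_gpa);
		if (ret != STATUS_RETRY)
			break;

		/* The GPA to retry from must lie inside the region. */
		if (retry_gpa < start || retry_gpa >= end)
			return 0;

		if (retry_gpa == start) {
			if (++retry_cnt > max_retry_cnt)
				return 0;
		} else {
			retry_cnt = 0;
			start = retry_gpa;
		}
	}
	return ret == STATUS_OK;
}
```

Resetting the counter on forward progress matters for the Hyper-V case above: decrypting up to 1GB of swiotlb buffers can legitimately need many retries overall, as long as each one advances.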

Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---

Changes in v2:
Used __tdx_hypercall() directly in tdx_map_gpa().
Added a max_retry_cnt of 1000.
Renamed a few variables, e.g., r11 -> map_fail_paddr.

Changes in v3:
Changed max_retry_cnt from 1000 to 3.

Changes in v4:
__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
Added Kirill's Acked-by.

Changes in v5:
Added Michael's Reviewed-by.

Changes in v6: None.

arch/x86/coco/tdx/tdx.c | 64 +++++++++++++++++++++++++++++++++--------
1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 4c4c6db39eca..5574c91541a2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -28,6 +28,8 @@
#define TDVMCALL_MAP_GPA 0x10001
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003

+#define TDVMCALL_STATUS_RETRY 1
+
/* MMIO direction */
#define EPT_READ 0
#define EPT_WRITE 1
@@ -788,14 +790,15 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
}

/*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest. The VMM is expected to change its mapping
- * of the page in response.
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>".
*/
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ int max_retry_cnt = 3, retry_cnt = 0;
+ struct tdx_hypercall_args args;
+ u64 map_fail_paddr, ret;

if (!enc) {
/* Set the shared (decrypted) bits: */
@@ -803,12 +806,49 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
end |= cc_mkdec(0);
}

- /*
- * Notify the VMM about page mapping conversion. More info about ABI
- * can be found in TDX Guest-Host-Communication Interface (GHCI),
- * section "TDG.VP.VMCALL<MapGPA>"
- */
- if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ while (1) {
+ memset(&args, 0, sizeof(args));
+ args.r10 = TDX_HYPERCALL_STANDARD;
+ args.r11 = TDVMCALL_MAP_GPA;
+ args.r12 = start;
+ args.r13 = end - start;
+
+ ret = __tdx_hypercall_ret(&args);
+ if (ret != TDVMCALL_STATUS_RETRY)
+ break;
+ /*
+ * The guest must retry the operation for the pages in the
+ * region starting at the GPA specified in R11. Make sure R11
+ * contains a sane value.
+ */
+ map_fail_paddr = args.r11;
+ if (map_fail_paddr < start || map_fail_paddr >= end)
+ return false;
+
+ if (map_fail_paddr == start) {
+ retry_cnt++;
+ if (retry_cnt > max_retry_cnt)
+ return false;
+ } else {
+ retry_cnt = 0;
+ start = map_fail_paddr;
+ }
+ }
+
+ return !ret;
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!tdx_map_gpa(start, end, enc))
return false;

/* private->shared conversion requires only MapGPA call */
--
2.25.1

2023-05-05 16:46:36

by Michael Kelley (LINUX)

[permalink] [raw]
Subject: RE: [PATCH v6 5/6] Drivers: hv: vmbus: Support TDX guests

From: Dexuan Cui <[email protected]>
>
> Add Hyper-V specific code so that a TDX guest can run on Hyper-V:
> No need to use hv_vp_assist_page.
> Don't use the unsafe Hyper-V TSC page.
> Don't try to use HV_REGISTER_CRASH_CTL.
> Don't trust Hyper-V's TLB-flushing hypercalls.
> Don't use lazy EOI.

Nit: Actually, you overdid the cleanup. :-( The line in v5 about
"Share SynIC Event/Message pages" was correct. It was only the
part about VMBus Monitor pages that no longer applied.

>
> Signed-off-by: Dexuan Cui <[email protected]>
> ---
>
> Changes in v2:
> Used a new function hv_set_memory_enc_dec_needed() in
> __set_memory_enc_pgtable().
> Added the missing set_memory_encrypted() in hv_synic_free().
>
> Changes in v3:
> Use pgprot_decrypted(PAGE_KERNEL) in hv_ringbuffer_init().
> (Do not use PAGE_KERNEL_NOENC, which doesn't exist for ARM64).
>
> Used cc_mkdec() in hv_synic_enable_regs().
>
> ms_hyperv_init_platform():
> Explicitly do not use HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED.
> Explicitly do not use HV_X64_APIC_ACCESS_RECOMMENDED.
>
> Enabled __send_ipi_mask() and __send_ipi_one() for TDX guests.
>
> Changes in v4:
> A minor rebase to Michael's v7 DDA patchset. I'm very happy that
> I can drop my v3 change to arch/x86/mm/pat/set_memory.c due to
> Michael's work.
>
> Changes in v5:
> Added memset() to clear synic_message_page and synic_event_page
> after set_memory_decrypted().
> Rebased the patch since "post_msg_page" has been removed in
> hyperv-next.
> Improved the error handling in hv_synic_alloc()/free() [Michael
> Kelley]
>
> Changes in v6:
> Addressed Michael Kelley's comments on patch 5:
> Removed 2 unnecessary lines of messages from the commit log.
> Fixed the error handling path for hv_synic_alloc()/free().
> Printed the 'ret' in hv_synic_alloc()/free().
>
> arch/x86/hyperv/hv_apic.c | 6 ++--
> arch/x86/hyperv/hv_init.c | 19 +++++++---
> arch/x86/kernel/cpu/mshyperv.c | 21 ++++++++++-
> drivers/hv/hv.c | 65 ++++++++++++++++++++++++++++++++--
> 4 files changed, 101 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> index 1fbda2f94184..b28da8b41b45 100644
> --- a/arch/x86/hyperv/hv_apic.c
> +++ b/arch/x86/hyperv/hv_apic.c
> @@ -177,7 +177,8 @@ static bool __send_ipi_mask(const struct cpumask *mask, int
> vector,
> (exclude_self && weight == 1 && cpumask_test_cpu(this_cpu, mask)))
> return true;
>
> - if (!hv_hypercall_pg)
> + /* A TDX guest doesn't use hv_hypercall_pg. */
> + if (!hv_isolation_type_tdx() && !hv_hypercall_pg)
> return false;
>
> if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
> @@ -231,7 +232,8 @@ static bool __send_ipi_one(int cpu, int vector)
>
> trace_hyperv_send_ipi_one(cpu, vector);
>
> - if (!hv_hypercall_pg || (vp == VP_INVAL))
> + /* A TDX guest doesn't use hv_hypercall_pg. */
> + if ((!hv_isolation_type_tdx() && !hv_hypercall_pg) || (vp == VP_INVAL))
> return false;
>
> if ((vector < HV_IPI_LOW_VECTOR) || (vector > HV_IPI_HIGH_VECTOR))
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index f175e0de821c..f28357ecad7d 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -79,7 +79,7 @@ static int hyperv_init_ghcb(void)
> static int hv_cpu_init(unsigned int cpu)
> {
> union hv_vp_assist_msr_contents msr = { 0 };
> - struct hv_vp_assist_page **hvp = &hv_vp_assist_page[cpu];
> + struct hv_vp_assist_page **hvp;
> int ret;
>
> ret = hv_common_cpu_init(cpu);
> @@ -89,6 +89,7 @@ static int hv_cpu_init(unsigned int cpu)
> if (!hv_vp_assist_page)
> return 0;
>
> + hvp = &hv_vp_assist_page[cpu];
> if (hv_root_partition) {
> /*
> * For root partition we get the hypervisor provided VP assist
> @@ -398,11 +399,21 @@ void __init hyperv_init(void)
> if (hv_common_init())
> return;
>
> - hv_vp_assist_page = kcalloc(num_possible_cpus(),
> - sizeof(*hv_vp_assist_page), GFP_KERNEL);
> + /*
> + * The VP assist page is useless to a TDX guest: the only use we
> + * would have for it is lazy EOI, which can not be used with TDX.
> + */
> + if (hv_isolation_type_tdx())
> + hv_vp_assist_page = NULL;
> + else
> + hv_vp_assist_page = kcalloc(num_possible_cpus(),
> + sizeof(*hv_vp_assist_page),
> + GFP_KERNEL);
> if (!hv_vp_assist_page) {
> ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> - goto common_free;
> +
> + if (!hv_isolation_type_tdx())
> + goto common_free;
> }
>
> if (hv_isolation_type_snp()) {
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 2fd687a80033..b95b689efa07 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -404,8 +404,27 @@ static void __init ms_hyperv_init_platform(void)
>
> if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP)
> static_branch_enable(&isolation_type_snp);
> - else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX)
> + else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX) {
> static_branch_enable(&isolation_type_tdx);
> +
> + /*
> + * The GPAs of SynIC Event/Message pages and VMBus
> + * Monitor pages need to be added by this offset.
> + */
> + ms_hyperv.shared_gpa_boundary = cc_mkdec(0);
> +
> + /* Don't use the unsafe Hyper-V TSC page */
> + ms_hyperv.features &= ~HV_MSR_REFERENCE_TSC_AVAILABLE;
> +
> + /* HV_REGISTER_CRASH_CTL is unsupported */
> + ms_hyperv.misc_features &= ~HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
> +
> + /* Don't trust Hyper-V's TLB-flushing hypercalls */
> + ms_hyperv.hints &= ~HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED;
> +
> + /* A TDX VM must use x2APIC and doesn't use lazy EOI */
> + ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED;
> + }
> }
>
> if (hv_max_functions_eax >= HYPERV_CPUID_NESTED_FEATURES) {
> diff --git a/drivers/hv/hv.c b/drivers/hv/hv.c
> index de6708dbe0df..af959e87b6e7 100644
> --- a/drivers/hv/hv.c
> +++ b/drivers/hv/hv.c
> @@ -18,6 +18,7 @@
> #include <linux/clockchips.h>
> #include <linux/delay.h>
> #include <linux/interrupt.h>
> +#include <linux/set_memory.h>
> #include <clocksource/hyperv_timer.h>
> #include <asm/mshyperv.h>
> #include "hyperv_vmbus.h"
> @@ -80,6 +81,7 @@ int hv_synic_alloc(void)
> {
> int cpu;
> struct hv_per_cpu_context *hv_cpu;
> + int ret = -ENOMEM;
>
> /*
> * First, zero all per-cpu memory areas so hv_synic_free() can
> @@ -120,9 +122,42 @@ int hv_synic_alloc(void)
> (void *)get_zeroed_page(GFP_ATOMIC);
> if (hv_cpu->synic_event_page == NULL) {
> pr_err("Unable to allocate SYNIC event page\n");
> +
> + free_page((unsigned long)hv_cpu->synic_message_page);
> + hv_cpu->synic_message_page = NULL;
> +
> goto err;
> }
> }
> +
> + /* It's better to leak the page if the decryption fails. */
> + if (hv_isolation_type_tdx()) {
> + ret = set_memory_decrypted(
> + (unsigned long)hv_cpu->synic_message_page, 1);
> + if (ret) {
> + pr_err("Failed to decrypt SYNIC msg page: %d\n", ret);
> + hv_cpu->synic_message_page = NULL;
> +
> + /*
> + * Free the event page so that a TDX VM won't
> + * try to encrypt the page in hv_synic_free().
> + */
> + free_page((unsigned long)hv_cpu->synic_event_page);
> + hv_cpu->synic_event_page = NULL;
> + goto err;
> + }
> +
> + ret = set_memory_decrypted(
> + (unsigned long)hv_cpu->synic_event_page, 1);
> + if (ret) {
> + pr_err("Failed to decrypt SYNIC event page: %d\n", ret);
> + hv_cpu->synic_event_page = NULL;
> + goto err;
> + }
> +
> + memset(hv_cpu->synic_message_page, 0, PAGE_SIZE);
> + memset(hv_cpu->synic_event_page, 0, PAGE_SIZE);
> + }
> }
>
> return 0;
> @@ -131,18 +166,40 @@ int hv_synic_alloc(void)
> * Any memory allocations that succeeded will be freed when
> * the caller cleans up by calling hv_synic_free()
> */
> - return -ENOMEM;
> + return ret;
> }
>
>
> void hv_synic_free(void)
> {
> int cpu;
> + int ret;
>
> for_each_present_cpu(cpu) {
> struct hv_per_cpu_context *hv_cpu
> = per_cpu_ptr(hv_context.cpu_context, cpu);
>
> + /* It's better to leak the page if the encryption fails. */
> + if (hv_isolation_type_tdx()) {
> + if (hv_cpu->synic_message_page) {
> + ret = set_memory_encrypted((unsigned long)
> + hv_cpu->synic_message_page, 1);
> + if (ret) {
> + pr_err("Failed to encrypt SYNIC msg page: %d\n", ret);
> + hv_cpu->synic_message_page = NULL;
> + }
> + }
> +
> + if (hv_cpu->synic_event_page) {
> + ret = set_memory_encrypted((unsigned long)
> + hv_cpu->synic_event_page, 1);
> + if (ret) {
> + pr_err("Failed to encrypt SYNIC event page: %d\n", ret);
> + hv_cpu->synic_event_page = NULL;
> + }
> + }
> + }
> +
> free_page((unsigned long)hv_cpu->synic_event_page);
> free_page((unsigned long)hv_cpu->synic_message_page);
> }
> @@ -179,7 +236,8 @@ void hv_synic_enable_regs(unsigned int cpu)
> if (!hv_cpu->synic_message_page)
> pr_err("Fail to map synic message page.\n");
> } else {
> - simp.base_simp_gpa = virt_to_phys(hv_cpu->synic_message_page)
> + simp.base_simp_gpa =
> + cc_mkdec(virt_to_phys(hv_cpu->synic_message_page))
> >> HV_HYP_PAGE_SHIFT;
> }
>
> @@ -198,7 +256,8 @@ void hv_synic_enable_regs(unsigned int cpu)
> if (!hv_cpu->synic_event_page)
> pr_err("Fail to map synic event page.\n");
> } else {
> - siefp.base_siefp_gpa = virt_to_phys(hv_cpu->synic_event_page)
> + siefp.base_siefp_gpa =
> + cc_mkdec(virt_to_phys(hv_cpu->synic_event_page))
> >> HV_HYP_PAGE_SHIFT;
> }
>
> --
> 2.25.1

Commit message nit notwithstanding --

Reviewed-by: Michael Kelley <[email protected]>

2023-05-05 16:53:32

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v6 5/6] Drivers: hv: vmbus: Support TDX guests

> From: Michael Kelley (LINUX) <[email protected]>
> Sent: Friday, May 5, 2023 9:23 AM
> ...
> From: Dexuan Cui <[email protected]>
> >
> > Add Hyper-V specific code so that a TDX guest can run on Hyper-V:
> > No need to use hv_vp_assist_page.
> > Don't use the unsafe Hyper-V TSC page.
> > Don't try to use HV_REGISTER_CRASH_CTL.
> > Don't trust Hyper-V's TLB-flushing hypercalls.
> > Don't use lazy EOI.
>
> Nit: Actually, you overdid the cleanup. :-( The line in v5 about
> "Share SynIC Event/Message pages" was correct. It was only the
> part about VMBus Monitor pages that no longer applied.

Sorry, so here we need to add a line after the line "Don't use lazy EOI":
Share SynIC Event/Message pages with the host.

I suppose the maintainer(s) can help add this line, if possible.

> >
> > Signed-off-by: Dexuan Cui <[email protected]>
> > ---
> > ...
> ...
> Commit message nit notwithstanding --
>
> Reviewed-by: Michael Kelley <[email protected]>

Thanks!

2023-05-23 19:36:44

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v6 0/6] Support TDX guests on Hyper-V

> From: Dexuan Cui <[email protected]>
> Sent: Thursday, May 4, 2023 3:54 PM
> To: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; Haiyang Zhang <[email protected]>;
> [email protected]; [email protected]; [email protected]; KY
> Srinivasan <[email protected]>; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; [email protected];
> Michael Kelley (LINUX) <[email protected]>
> Cc: [email protected]; Tianyu Lan <[email protected]>;
> Dexuan Cui <[email protected]>
> Subject: [PATCH v6 0/6] Support TDX guests on Hyper-V
>
> The patchset adds the Hyper-V specific code so that a TDX guest can run
> on Hyper-V. Please review.
>
> The v6 patchset is based on today's mainline (a1fd058b07d5).
>
> The v6 patchset addressed Michael's comments on patch 5:
> Removed 2 unnecessary lines of messages from the commit log.
> Fixed the error handling path for hv_synic_alloc()/free().
> Printed the 'ret' in hv_synic_alloc()/free().
>
> @Michael Kelley: Can you please review patch 5?

Thanks Michael for your Reviewed-by on patch 5.

> @x86 maintainers:
> If the patches look good to you, can you please take patch 1 and 2
> into the tip tree?

Hi Dave and all, could you please take a look at the patchset?

The patchset can also be viewed here:
https://lwn.net/ml/linux-kernel/20230504225351.10765-1-decui%40microsoft.com/

> @Wei Liu: I think patch 3, 4, 5, 6 should go through the Hyper-V tree
> since they change the Hyper-V code.
>
> If you want to view the patches on github, it is here:
> https://github.com/dcui/tdx/commits/decui/mainline/v6
>
> FYI, v1-v5 are here:
> <snipped> on 5/23/2023
>
> Thanks,
> Dexuan
>
> Dexuan Cui (6):
> x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
> x86/tdx: Support vmalloc() for tdx_enc_status_changed()
> x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
> x86/hyperv: Support hypercalls for TDX guests
> Drivers: hv: vmbus: Support TDX guests
> x86/hyperv: Fix serial console interrupts for TDX guests
>
> arch/x86/coco/tdx/tdx.c | 122
> ++++++++++++++++++++++-------
> arch/x86/hyperv/hv_apic.c | 6 +-
> arch/x86/hyperv/hv_init.c | 27 ++++++-
> arch/x86/hyperv/ivm.c | 20 +++++
> arch/x86/include/asm/hyperv-tlfs.h | 3 +-
> arch/x86/include/asm/mshyperv.h | 20 +++++
> arch/x86/kernel/cpu/mshyperv.c | 43 ++++++++++
> drivers/hv/hv.c | 65 ++++++++++++++-
> drivers/hv/hv_common.c | 30 +++++++
> include/asm-generic/mshyperv.h | 1 +
> 10 files changed, 300 insertions(+), 37 deletions(-)

Thanks,
-- Dexuan

2023-05-23 20:55:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On 5/4/23 15:53, Dexuan Cui wrote:
> When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> allocates buffers using vzalloc(), and needs to share the buffers with the
> host OS by calling set_memory_decrypted(), which is not working for
> vmalloc() yet. Add the support by handling the pages one by one.

I think this sets a bad precedent.

There are consequences for converting pages between shared and private.
Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
EPT/SEPT mappings.

How does this work with load_unaligned_zeropad()? Couldn't it be
running around poking at one of these vmalloc()'d pages via the direct
map during a shared->private conversion before the page has been accepted?

2023-05-23 21:39:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v6 1/6] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

On 5/4/23 15:53, Dexuan Cui wrote:
> - if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> + while (1) {
> + memset(&args, 0, sizeof(args));
> + args.r10 = TDX_HYPERCALL_STANDARD;
> + args.r11 = TDVMCALL_MAP_GPA;
> + args.r12 = start;
> + args.r13 = end - start;
> +
> + ret = __tdx_hypercall_ret(&args);
> + if (ret != TDVMCALL_STATUS_RETRY)
> + break;
> + /*
> + * The guest must retry the operation for the pages in the
> + * region starting at the GPA specified in R11. Make sure R11
> + * contains a sane value.
> + */
> + map_fail_paddr = args.r11;
> + if (map_fail_paddr < start || map_fail_paddr >= end)
> + return false;

This should probably also say: "r11" comes from the untrusted VMM.
Sanity check it.

Should this *really* be "map_fail_paddr >= end"? Or is "map_fail_paddr >
end" sufficient? In other words, is it really worth failing this if a
VMM said to retry a 0-byte region at the end?

> + if (map_fail_paddr == start) {
> + retry_cnt++;
> + if (retry_cnt > max_retry_cnt)

I think we can spare two bytes in a few spots to make these 'count'
instead of 'cnt'.

> + return false;
> + } else {
> + retry_cnt = 0;
> + start = map_fail_paddr;
> + }
> + }

This fails the "normal operation should be at the lowest indentation"
rule. How about this:

while (retry_count < max_retries) {
...

/* "Consume" a retry without forward progress: */
if (map_fail_paddr == start) {
retry_count++;
continue;
}

start = map_fail_paddr;
retry_count = 0;
}

// plus maybe a wee bit different 'ret' processing


'max_retries' also ends up being a misnomer. You can have as many
retries as there are pages plus 'max_retries'. It's really "maximum
allowed consecutive failures". Maybe it should be "max_retries_per_page".

2023-05-23 21:41:25

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On 5/23/23 14:25, Sean Christopherson wrote:
>> There are consequences for converting pages between shared and private.
>> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
>> EPT/SEPT mappings.
>>
>> How does this work with load_unaligned_zeropad()? Couldn't it be
>> running around poking at one of these vmalloc()'d pages via the direct
>> map during a shared->private conversion before the page has been accepted?
> Would it be feasible and sensible to add a GFP_SHARED or whatever, to communicate
> to the core allocators that the page is destined to be converted to a shared page?
> I assume that would provide a common place (or two) for initiating conversions,
> and would hopefully allow for future optimizations, e.g. to keep shared allocation
> in the same pool or whatever. Sharing memory without any intelligence as to what
> memory is converted is going to make both the guest and host sad.

I don't think we want a GFP flag. This is still way too specialized to
warrant one of those.

It sounds like a similar problem to what folks want for modules or BPF.
There are a bunch of allocations that are related and can have some of
their setup/teardown costs amortized if they can be clumped together.

For BPF, the costs are from doing RW=>RO in the kernel direct map, and
fracturing it in the process.

Here, the costs are from the private->shared conversions and fracturing
both the direct map and the EPT/SEPT.

I just don't know if there's anything that we can reuse from the BPF effort.

2023-05-23 21:56:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On Tue, May 23, 2023, Dave Hansen wrote:
> On 5/4/23 15:53, Dexuan Cui wrote:
> > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > allocates buffers using vzalloc(), and needs to share the buffers with the
> > host OS by calling set_memory_decrypted(), which is not working for
> > vmalloc() yet. Add the support by handling the pages one by one.
>
> I think this sets a bad precedent.

+1

> There are consequences for converting pages between shared and private.
> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
> EPT/SEPT mappings.
>
> How does this work with load_unaligned_zeropad()? Couldn't it be
> running around poking at one of these vmalloc()'d pages via the direct
> map during a shared->private conversion before the page has been accepted?

Would it be feasible and sensible to add a GFP_SHARED or whatever, to communicate
to the core allocators that the page is destined to be converted to a shared page?
I assume that would provide a common place (or two) for initiating conversions,
and would hopefully allow for future optimizations, e.g. to keep shared allocation
in the same pool or whatever. Sharing memory without any intelligence as to what
memory is converted is going to make both the guest and host sad.

2023-05-23 22:46:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On Tue, May 23, 2023 at 01:39:11PM -0700, Dave Hansen wrote:
> On 5/4/23 15:53, Dexuan Cui wrote:
> > When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
> > allocates buffers using vzalloc(), and needs to share the buffers with the
> > host OS by calling set_memory_decrypted(), which is not working for
> > vmalloc() yet. Add the support by handling the pages one by one.
>
> I think this sets a bad precedent.
>
> There are consequences for converting pages between shared and private.
> Doing it on a vmalloc() mapping is guaranteed to fracture the underlying
> EPT/SEPT mappings.
>
> How does this work with load_unaligned_zeropad()? Couldn't it be
> running around poking at one of these vmalloc()'d pages via the direct
> map during a shared->private conversion before the page has been accepted?

Alias processing in __change_page_attr_set_clr() will change direct
mapping if you call it on vmalloc()ed memory. I think we are safe wrt
load_unaligned_zeropad() here.
--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-23 22:54:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On 5/23/23 15:37, [email protected] wrote:
>> How does this work with load_unaligned_zeropad()? Couldn't it be
>> running around poking at one of these vmalloc()'d pages via the direct
>> map during a shared->private conversion before the page has been accepted?
> Alias processing in __change_page_attr_set_clr() will change direct
> mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> load_unaligned_zeropad() here.

We're *eventually* OK:

> /* Notify hypervisor that we are about to set/clr encryption attribute. */
> x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
>
> ret = __change_page_attr_set_clr(&cpa, 1);

But what about in the middle between enc_status_change_prepare() and
__change_page_attr_set_clr()? Don't the direct map and the
shared/private status of the page diverge in there?


2023-05-23 23:16:16

by Edgecombe, Rick P

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On Tue, 2023-05-23 at 14:33 -0700, Dave Hansen wrote:
> On 5/23/23 14:25, Sean Christopherson wrote:
> > > There are consequences for converting pages between shared and
> > > private.
> > > Doing it on a vmalloc() mapping is guaranteed to fracture the
> > > underlying
> > > EPT/SEPT mappings.
> > >
> > > How does this work with load_unaligned_zeropad()?  Couldn't it be
> > > running around poking at one of these vmalloc()'d pages via the
> > > direct
> > > map during a shared->private conversion before the page has been
> > > accepted?
> > Would it be feasible and sensible to add a GFP_SHARED or whatever,
> > to communicate
> > to the core allocators that the page is destined to be converted to
> > a shared page?
> > I assume that would provide a common place (or two) for initiating
> > conversions,
> > and would hopefully allow for future optimizations, e.g. to keep
> > shared allocation
> > in the same pool or whatever.  Sharing memory without any
> > intelligence as to what
> > memory is converted is going to make both the guest and host sad.
>
> I don't think we want a GFP flag.  This is still way too specialized
> to
> warrant one of those.
>
> It sounds like a similar problem to what folks want for modules or
> BPF.
> There are a bunch of allocations that are related and can have some
> of
> their setup/teardown costs amortized if they can be clumped together.
>
> For BPF, the costs are from doing RW=>RO in the kernel direct map,
> and
> fracturing it in the process.
>
> Here, the costs are from the private->shared conversions and
> fracturing
> both the direct map and the EPT/SEPT.
>
> I just don't know if there's anything that we can reuse from the BPF
> effort.

Last I saw the BPF allocator was focused on module space memory and
packing multiple allocations into pages. So that would probably have to
be generalized to support this.

But also, a lot of the efforts around improving direct map modification
efficiency have sort of assumed all of the types of special permissions
could be grouped together on the direct map and just mapped elsewhere
in whatever permission they needed. This does seem like a special case
where it really needs to be grouped together only with similar
permissions.

If a GFP flag is too precious, what about something like what PKS
tables[0] tried? It kind of had a similar problem with respect to preferring
physically contiguous special permissioned memory on the direct map. It
did this with a cache of converted pages outside the page allocator and
a separate alloc() function to get at it. This one could be
vmalloc_shared() or something. Don't know, just tossing it out there.

[0]
https://lore.kernel.org/lkml/[email protected]/

2023-05-23 23:53:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On Tue, May 23, 2023 at 03:43:15PM -0700, Dave Hansen wrote:
> On 5/23/23 15:37, [email protected] wrote:
> >> How does this work with load_unaligned_zeropad()? Couldn't it be
> >> running around poking at one of these vmalloc()'d pages via the direct
> >> map during a shared->private conversion before the page has been accepted?
> > Alias processing in __change_page_attr_set_clr() will change direct
> > mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> > load_unaligned_zeropad() here.
>
> We're *eventually* OK:
>
> > /* Notify hypervisor that we are about to set/clr encryption attribute. */
> > x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
> >
> > ret = __change_page_attr_set_clr(&cpa, 1);
>
> But what about in the middle between enc_status_change_prepare() and
> __change_page_attr_set_clr()? Don't the direct map and the
> shared/private status of the page diverge in there?

Hmm. Maybe we would need to go through making the range in direct mapping
non-present before notifying VMM about the change.

I need to look at this again in the morning.

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-25 02:28:16

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH v6 1/6] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

> From: Dave Hansen <[email protected]>
> Sent: Tuesday, May 23, 2023 2:13 PM
> ...
> On 5/4/23 15:53, Dexuan Cui wrote:
> > - if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
> > + while (1) {
> > + memset(&args, 0, sizeof(args));
> > + args.r10 = TDX_HYPERCALL_STANDARD;
> > + args.r11 = TDVMCALL_MAP_GPA;
> > + args.r12 = start;
> > + args.r13 = end - start;
> > +
> > + ret = __tdx_hypercall_ret(&args);
> > + if (ret != TDVMCALL_STATUS_RETRY)
> > + break;
> > + /*
> > + * The guest must retry the operation for the pages in the
> > + * region starting at the GPA specified in R11. Make sure R11
> > + * contains a sane value.
> > + */
> > + map_fail_paddr = args.r11;
> > + if (map_fail_paddr < start || map_fail_paddr >= end)
> > + return false;
>
> This should probably also say: "r11" comes from the untrusted VMM.
> Sanity check it.

Thanks! I'll use the wording you recommended in the next version.

> Should this *really* be "map_fail_paddr >= end"? Or is
> "map_fail_paddr > end" sufficient? In other words, is it really
> worth failing this if a VMM said to retry a 0-byte region at the end?

According to the GHCI spec, R13 "must be a multiple of 4KB". My
understanding is that R13 should not be 0, and a hypervisor is not
supposed to tell the guest to retry a 0-byte region at the end.

IMHO it should be a hypervisor bug if a hypervisor returns
TDVMCALL_STATUS_RETRY and the returned 'map_fail_paddr' equals
'end' (Note: the valid page range is [start, end - 1]).

Hyper-V returns "invalid parameter" if the length (i.e. args.r13) is 0,
so "retry a 0-byte region at the end" would fail anyway. I guess
other hypervisors may also return an error if the length is 0.

So I'd like to keep the comparison as-is.

> > + if (map_fail_paddr == start) {
> > + retry_cnt++;
> > + if (retry_cnt > max_retry_cnt)
>
> I think we can spare two bytes in a few spots to make these 'count'
> instead of 'cnt'.

Ok, I'll rename the variable 'max_retry_cnt' to 'max_retries_per_page',
and 'retry_cnt' to 'retry_count'.

> > + return false;
> > + } else {
> > + retry_cnt = 0;
> > + start = map_fail_paddr;
> > + }
> > + }
>
> this fails the "normal operation should be at the lowest indentation"
> rule. How about this:
>
> while (retry_count < max_retries) {
> ...
>
> /* "Consume" a retry without forward progress: */
> if (map_fail_paddr == start) {
> retry_count++;
> continue;
> }
>
> start = map_fail_paddr;
> retry_count = 0;
> }
>
> // plus maybe a wee bit different 'ret' processing
>
>
> 'max_retries' also ends up being a misnomer. You can have as many
> retries as there are pages plus 'max_retries'. It's really "maximum
> allowed consecutive failures". Maybe it should be "max_retries_per_page".

Thanks, I'll rename 'max_retries' to 'max_retries_per_page'.

I'll use the following in the next version.
I added "const" to "int max_retries_per_page".
Please let me know if I missed anything.

+static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
+{
+ const int max_retries_per_page = 3;
+ struct tdx_hypercall_args args;
+ u64 map_fail_paddr, ret;
+ int retry_count = 0;
+
+ if (!enc) {
+ /* Set the shared (decrypted) bits: */
+ start |= cc_mkdec(0);
+ end |= cc_mkdec(0);
+ }
+
+ while (retry_count < max_retries_per_page) {
+ memset(&args, 0, sizeof(args));
+ args.r10 = TDX_HYPERCALL_STANDARD;
+ args.r11 = TDVMCALL_MAP_GPA;
+ args.r12 = start;
+ args.r13 = end - start;
+
+ ret = __tdx_hypercall_ret(&args);
+ if (ret != TDVMCALL_STATUS_RETRY)
+ return !ret;
+ /*
+ * The guest must retry the operation for the pages in the
+ * region starting at the GPA specified in R11. R11 comes
+ * from the untrusted VMM. Sanity check it.
+ */
+ map_fail_paddr = args.r11;
+ if (map_fail_paddr < start || map_fail_paddr >= end)
+ return false;
+
+ /* "Consume" a retry without forward progress */
+ if (map_fail_paddr == start) {
+ retry_count++;
+ continue;
+ }
+
+ start = map_fail_paddr;
+ retry_count = 0;
+ }
+
+ return false;
+}

2023-05-25 20:10:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On Wed, May 24, 2023 at 02:28:51AM +0300, [email protected] wrote:
> On Tue, May 23, 2023 at 03:43:15PM -0700, Dave Hansen wrote:
> > On 5/23/23 15:37, [email protected] wrote:
> > >> How does this work with load_unaligned_zeropad()? Couldn't it be
> > >> running around poking at one of these vmalloc()'d pages via the direct
> > >> map during a shared->private conversion before the page has been accepted?
> > > Alias processing in __change_page_attr_set_clr() will change direct
> > > mapping if you call it on vmalloc()ed memory. I think we are safe wrt
> > > load_unaligned_zeropad() here.
> >
> > We're *eventually* OK:
> >
> > > /* Notify hypervisor that we are about to set/clr encryption attribute. */
> > > x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
> > >
> > > ret = __change_page_attr_set_clr(&cpa, 1);
> >
> > But what about in the middle between enc_status_change_prepare() and
> > __change_page_attr_set_clr()? Don't the direct map and the
> > shared/private status of the page diverge in there?
>
> Hmm. Maybe we would need to go through making the range in direct mapping
> non-present before notifying VMM about the change.
>
> I need to look at this again in the morning.

Okay, I've got around to it finally.

Private->Shared conversion is safe: we first set the shared bit in the direct
mapping and all aliases, and then call MapGPA in enc_status_change_finish().
So we never have privately mapped pages that we have already converted to
shared with MapGPA.

Shared->Private is not safe. As with Private->Shared, we adjust the direct
mapping before notifying the VMM and accepting the memory, so there's a short
window when memory is privately mapped but neither mapped into the SEPT nor
accepted. That is a problem, as it can race with load_unaligned_zeropad().

Shared->Private conversion is rare. I see only one call in total during
boot in my setup. Worth fixing anyway.


The patch below fixes the issue by hooking up enc_status_change_prepare()
and doing the Shared to Private conversion there.

enc_status_change_finish() only covers Private to Shared conversion.

The patch is on top of the unaccepted memory patchset, which is a more
convenient base. I will rebase onto Linus' tree if the approach looks sane to you.

Any comments?

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 32501277ef84..b73ec2449c64 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -713,16 +713,32 @@ static bool tdx_cache_flush_required(void)
return true;
}

+static bool tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
+ bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!enc)
+ return true;
+
+ return tdx_enc_status_changed_phys(start, end, enc);
+}
+
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
* of the page in response.
*/
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
+ bool enc)
{
phys_addr_t start = __pa(vaddr);
phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);

+ if (enc)
+ return true;
+
return tdx_enc_status_changed_phys(start, end, enc);
}

@@ -753,9 +769,10 @@ void __init tdx_early_init(void)
*/
physical_mask &= cc_mask - 1;

- x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
- x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
- x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
+ x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
+ x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
+ x86_platform.guest.enc_status_change_prepare = tdx_enc_status_change_prepare;
+ x86_platform.guest.enc_status_change_finish = tdx_enc_status_change_finish;

pr_info("Guest detected\n");
}
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 88085f369ff6..1ca9701917c5 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -150,7 +150,7 @@ struct x86_init_acpi {
* @enc_cache_flush_required Returns true if a cache flush is needed before changing page encryption status
*/
struct x86_guest {
- void (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
+ bool (*enc_status_change_prepare)(unsigned long vaddr, int npages, bool enc);
bool (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
bool (*enc_tlb_flush_required)(bool enc);
bool (*enc_cache_flush_required)(void);
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index d82f4fa2f1bf..64664311ac2b 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -130,8 +130,8 @@ struct x86_cpuinit_ops x86_cpuinit = {

static void default_nmi_init(void) { };

-static void enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { }
-static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return false; }
+static bool enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool enc) { return true; }
+static bool enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return true; }
static bool enc_tlb_flush_required_noop(bool enc) { return false; }
static bool enc_cache_flush_required_noop(void) { return false; }
static bool is_private_mmio_noop(u64 addr) {return false; }
diff --git a/arch/x86/mm/mem_encrypt_amd.c b/arch/x86/mm/mem_encrypt_amd.c
index e0b51c09109f..4f95c449a406 100644
--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -319,7 +319,7 @@ static void enc_dec_hypercall(unsigned long vaddr, int npages, bool enc)
#endif
}

-static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
+static bool amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool enc)
{
/*
* To maintain the security guarantees of SEV-SNP guests, make sure
@@ -327,6 +327,8 @@ static void amd_enc_status_change_prepare(unsigned long vaddr, int npages, bool
*/
if (cc_platform_has(CC_ATTR_GUEST_SEV_SNP) && !enc)
snp_set_memory_shared(vaddr, npages);
+
+ return true;
}

/* Return true unconditionally: return value doesn't matter for the SEV side */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 7159cf787613..b8f48ebe753c 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2151,7 +2151,8 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
cpa_flush(&cpa, x86_platform.guest.enc_cache_flush_required());

/* Notify hypervisor that we are about to set/clr encryption attribute. */
- x86_platform.guest.enc_status_change_prepare(addr, numpages, enc);
+ if (!x86_platform.guest.enc_status_change_prepare(addr, numpages, enc))
+ return -EIO;

ret = __change_page_attr_set_clr(&cpa, 1);

--
Kiryl Shutsemau / Kirill A. Shutemov

2023-05-25 20:12:51

by Dave Hansen

Subject: Re: [PATCH v6 2/6] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On 5/25/23 12:08, Kirill A. Shutemov wrote:
> Shared->Private conversion is rare. I only see one call total during the
> boot in my setup. Worth fixing anyway.
...
> Any comments?

So the rules are:

* Shared mapping of a private page: BAD
* Private mapping of a shared page: OK

?

The patch seems OK, other than having zero comments in it.