2023-08-11 03:04:02

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH RESEND v9 0/2] Support TDX guests on Hyper-V (the x86/tdx part)

(Resending the pachset as the v9 sent on 6/23/2023 didn't draw much attention)

The two patches are based on today's tip.git's master branch.

Note: the two patches don't apply to the current x86/tdx branch, which
doesn't have commit 75d090fd167a ("x86/tdx: Add unaccepted memory support").

As Dave suggested, I moved some local variables of tdx_map_gpa() to
inside the loop. I added Sathyanarayanan's Reviewed-by.

Please review.

FWIW, the old versons are here:
v9: https://lwn.net/ml/linux-kernel/[email protected]/
v8: https://lwn.net/ml/linux-kernel/[email protected]/
v7: https://lwn.net/ml/linux-kernel/20230616044701.15888-1-decui%40microsoft.com/
v6: https://lwn.net/ml/linux-kernel/[email protected]/

Dexuan Cui (2):
x86/tdx: Retry TDVMCALL_MAP_GPA() when needed
x86/tdx: Support vmalloc() for tdx_enc_status_changed()

arch/x86/coco/tdx/tdx.c | 87 ++++++++++++++++++++++++++-----
arch/x86/include/asm/shared/tdx.h | 2 +
2 files changed, 77 insertions(+), 12 deletions(-)

--
2.25.1



2023-08-11 03:43:28

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH RESEND v9 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
operation for the pages in the region starting at the GPA specified
in R11.

When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
the retry error when set_memory_decrypted() is called to decrypt up to
1GB of swiotlb bounce buffers.

Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 64 +++++++++++++++++++++++++------
arch/x86/include/asm/shared/tdx.h | 2 +
2 files changed, 54 insertions(+), 12 deletions(-)

Changes in v2:
Used __tdx_hypercall() directly in tdx_map_gpa().
Added a max_retry_cnt of 1000.
Renamed a few variables, e.g., r11 -> map_fail_paddr.

Changes in v3:
Changed max_retry_cnt from 1000 to 3.

Changes in v4:
__tdx_hypercall(&args, TDX_HCALL_HAS_OUTPUT) -> __tdx_hypercall_ret()
Added Kirill's Acked-by.

Changes in v5:
Added Michael's Reviewed-by.

Changes in v6: None.

Changes in v7:
Addressed Dave's comments:
see https://lwn.net/ml/linux-kernel/SA1PR21MB1335736123C2BCBBFD7460C3BF46A@SA1PR21MB1335.namprd21.prod.outlook.com

Changes in v8:
Rebased to tip.git's master branch.

Changes in v9:
Added a comment before 'max_retries_per_page'.
Moved 'args', 'map_fail_paddr' and 'ret' into the loop.
Added Kuppuswamy Sathyanarayanan's Reviewed-by.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 1d6b863c42b0..746075d20cd2 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -703,14 +703,15 @@ static bool tdx_cache_flush_required(void)
}

/*
- * Inform the VMM of the guest's intent for this physical page: shared with
- * the VMM or private to the guest. The VMM is expected to change its mapping
- * of the page in response.
+ * Notify the VMM about page mapping conversion. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface (GHCI),
+ * section "TDG.VP.VMCALL<MapGPA>".
*/
-static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ /* Retrying the hypercall a second time should succeed; use 3 just in case */
+ const int max_retries_per_page = 3;
+ int retry_count = 0;

if (!enc) {
/* Set the shared (decrypted) bits: */
@@ -718,12 +719,51 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
end |= cc_mkdec(0);
}

- /*
- * Notify the VMM about page mapping conversion. More info about ABI
- * can be found in TDX Guest-Host-Communication Interface (GHCI),
- * section "TDG.VP.VMCALL<MapGPA>"
- */
- if (_tdx_hypercall(TDVMCALL_MAP_GPA, start, end - start, 0, 0))
+ while (retry_count < max_retries_per_page) {
+ struct tdx_hypercall_args args = {
+ .r10 = TDX_HYPERCALL_STANDARD,
+ .r11 = TDVMCALL_MAP_GPA,
+ .r12 = start,
+ .r13 = end - start };
+
+ u64 map_fail_paddr;
+ u64 ret = __tdx_hypercall_ret(&args);
+
+ if (ret != TDVMCALL_STATUS_RETRY)
+ return !ret;
+ /*
+ * The guest must retry the operation for the pages in the
+ * region starting at the GPA specified in R11. R11 comes
+ * from the untrusted VMM. Sanity check it.
+ */
+ map_fail_paddr = args.r11;
+ if (map_fail_paddr < start || map_fail_paddr >= end)
+ return false;
+
+ /* "Consume" a retry without forward progress */
+ if (map_fail_paddr == start) {
+ retry_count++;
+ continue;
+ }
+
+ start = map_fail_paddr;
+ retry_count = 0;
+ }
+
+ return false;
+}
+
+/*
+ * Inform the VMM of the guest's intent for this physical page: shared with
+ * the VMM or private to the guest. The VMM is expected to change its mapping
+ * of the page in response.
+ */
+static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
+{
+ phys_addr_t start = __pa(vaddr);
+ phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+
+ if (!tdx_map_gpa(start, end, enc))
return false;

/* shared->private conversion requires memory to be accepted before use */
diff --git a/arch/x86/include/asm/shared/tdx.h b/arch/x86/include/asm/shared/tdx.h
index 90ea813c4b99..9db89a99ae5b 100644
--- a/arch/x86/include/asm/shared/tdx.h
+++ b/arch/x86/include/asm/shared/tdx.h
@@ -24,6 +24,8 @@
#define TDVMCALL_MAP_GPA 0x10001
#define TDVMCALL_REPORT_FATAL_ERROR 0x10003

+#define TDVMCALL_STATUS_RETRY 1
+
#ifndef __ASSEMBLY__

/*
--
2.25.1


2023-08-11 03:53:21

by Dexuan Cui

[permalink] [raw]
Subject: [PATCH RESEND v9 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

When a TDX guest runs on Hyper-V, the hv_netvsc driver's netvsc_init_buf()
allocates buffers using vzalloc(), and needs to share the buffers with the
host OS by calling set_memory_decrypted(), which is not working for
vmalloc() yet. Add the support by handling the pages one by one.

Co-developed-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Michael Kelley <[email protected]>
Reviewed-by: Kuppuswamy Sathyanarayanan <[email protected]>
Signed-off-by: Dexuan Cui <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)

Changes in v2:
Changed tdx_enc_status_changed() in place.

Changes in v3:
No change since v2.

Changes in v4:
Added Kirill's Co-developed-by since Kirill helped to improve the
code by adding tdx_enc_status_changed_phys().

Thanks Kirill for the clarification on load_unaligned_zeropad()!

Changes in v5:
Added Kirill's Signed-off-by.
Added Michael's Reviewed-by.

Changes in v6: None.

Changes in v7: None.
Note: there was a race between set_memory_encrypted() and
load_unaligned_zeropad(), which has been fixed by the 3 patches of
Kirill in the x86/tdx branch of the tip tree.

Changes in v8:
Rebased to tip.git's master branch.

Changes in v9:
Added Kuppuswamy Sathyanarayanan's Reviewed-by.

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 746075d20cd2..c1a2423a8159 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
#include <linux/cpufeature.h>
#include <linux/export.h>
#include <linux/io.h>
+#include <linux/mm.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
@@ -753,6 +754,19 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
return false;
}

+static bool tdx_enc_status_changed_phys(phys_addr_t start, phys_addr_t end,
+ bool enc)
+{
+ if (!tdx_map_gpa(start, end, enc))
+ return false;
+
+ /* shared->private conversion requires memory to be accepted before use */
+ if (enc)
+ return tdx_accept_memory(start, end);
+
+ return true;
+}
+
/*
* Inform the VMM of the guest's intent for this physical page: shared with
* the VMM or private to the guest. The VMM is expected to change its mapping
@@ -760,15 +774,24 @@ static bool tdx_map_gpa(phys_addr_t start, phys_addr_t end, bool enc)
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ unsigned long start = vaddr;
+ unsigned long end = start + numpages * PAGE_SIZE;

- if (!tdx_map_gpa(start, end, enc))
+ if (offset_in_page(start) != 0)
return false;

- /* shared->private conversion requires memory to be accepted before use */
- if (enc)
- return tdx_accept_memory(start, end);
+ if (!is_vmalloc_addr((void *)start))
+ return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
+
+ while (start < end) {
+ phys_addr_t start_pa = slow_virt_to_phys((void *)start);
+ phys_addr_t end_pa = start_pa + PAGE_SIZE;
+
+ if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
+ return false;
+
+ start += PAGE_SIZE;
+ }

return true;
}
--
2.25.1


2023-08-11 15:08:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RESEND v9 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

On 8/10/23 19:12, Dexuan Cui wrote:
> GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
> error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
> operation for the pages in the region starting at the GPA specified
> in R11.
>
> When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
> the retry error when set_memory_decrypted() is called to decrypt up to
> 1GB of swiotlb bounce buffers.

This changelog is not great. It gives zero background and wastes bytes
on telling me which register the error code is in (I don't care in a
changelog) and then using marketing fluff words like "fully enlightened".

Let's stick to the facts, give some background, and also avoid
regurgitating the GHCI, eh?

How's this?

x86/tdx: Retry partially-completed page conversion hypercalls

TDX guest memory is private by default and the VMM may not access it.
However, in cases where the guest needs to share data with the VMM,
the guest and the VMM can coordinate to make memory shared between
them.

The guest side of this protocol includes the "MapGPA" hypercall. This
call takes a guest physical address range. The hypercall spec (aka.
the GHCI) says that the MapGPA call is allowed to return partial
progress in mapping this range and indicate that fact with a special
error code. A guest that sees such partial progress is expected to
retry the operation for the portion of the address range that was not
completed.

Hyper-V does this partial completion dance when set_memory_decrypted()
is called to "decrypt" swiotlb bounce buffers that can be up to 1GB
in size. It is evidently the only VMM that does this, which is why
nobody noticed this until now.



2023-08-11 15:45:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH RESEND v9 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

On 8/10/23 19:12, Dexuan Cui wrote:
> + if (!is_vmalloc_addr((void *)start))
> + return tdx_enc_status_changed_phys(__pa(start), __pa(end), enc);
> +
> + while (start < end) {
> + phys_addr_t start_pa = slow_virt_to_phys((void *)start);
> + phys_addr_t end_pa = start_pa + PAGE_SIZE;
> +
> + if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
> + return false;
> +
> + start += PAGE_SIZE;
> + }

This creates two different paths for vmalloc() and the direct map.
There are two different ways to do va=>pa conversion, for instance.
Here's a single loop that works for both cases:

unsigned long step = end - start;
unsigned long addr;

/* Step through page-by-page for vmalloc() mappings: */
if (is_vmalloc_addr((void *)vaddr))
step = PAGE_SIZE;

for (addr = start; addr < end; addr += step) {
phys_addr_t start_pa = slow_virt_to_phys(addr);
phys_addr_t end_pa = start_pa + step;

if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
return false;
}

Note that this also doesn't abuse 'start' by making it a loop variable.
It also, uh, uses a for() loop.

The only downside is that it costs a page table walk for direct map
virt=>phys conversion. I can live with that.

2023-08-11 20:20:27

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH RESEND v9 1/2] x86/tdx: Retry TDVMCALL_MAP_GPA() when needed

> From: Dave Hansen <[email protected]>
> Sent: Friday, August 11, 2023 7:27 AM
> [...]
> On 8/10/23 19:12, Dexuan Cui wrote:
> > GHCI spec for TDX 1.0 says that the MapGPA call may fail with the R10
> > error code = TDG.VP.VMCALL_RETRY (1), and the guest must retry this
> > operation for the pages in the region starting at the GPA specified
> > in R11.
> >
> > When a fully enlightened TDX guest runs on Hyper-V, Hyper-V can return
> > the retry error when set_memory_decrypted() is called to decrypt up to
> > 1GB of swiotlb bounce buffers.
>
> This changelog is not great. It gives zero background and wastes bytes
> on telling me which register the error code is in (I don't care in a
> changelog) and then using marketing fluff words like "fully enlightened".
>
> Let's stick to the facts, give some background, and also avoid
> regurgitating the GHCI, eh?
>
> How's this?

I appreciate the great changelog! Will use this in v10.

> x86/tdx: Retry partially-completed page conversion hypercalls
>
> TDX guest memory is private by default and the VMM may not access it.
> However, in cases where the guest needs to share data with the VMM,
> the guest and the VMM can coordinate to make memory shared between
> them.
>
> The guest side of this protocol includes the "MapGPA" hypercall. This
> call takes a guest physical address range. The hypercall spec (aka.
> the GHCI) says that the MapGPA call is allowed to return partial
> progress in mapping this range and indicate that fact with a special
> error code. A guest that sees such partial progress is expected to
> retry the operation for the portion of the address range that was not
> completed.
>
> Hyper-V does this partial completion dance when set_memory_decrypted()
> is called to "decrypt" swiotlb bounce buffers that can be up to 1GB
> in size. It is evidently the only VMM that does this, which is why
> nobody noticed this until now.

This clarifies the obscure "when needed" in my version by emphasizing
the "partial progress". This also gives necessary background info. Thanks!

2023-08-11 21:00:01

by Dexuan Cui

[permalink] [raw]
Subject: RE: [PATCH RESEND v9 2/2] x86/tdx: Support vmalloc() for tdx_enc_status_changed()

> From: Dave Hansen <[email protected]>
> Sent: Friday, August 11, 2023 7:10 AM
> [...]
> This creates two different paths for vmalloc() and the direct map.
> There are two different ways to do va=>pa conversion, for instance.
> Here's a single loop that works for both cases:
>
> unsigned long step = end - start;
> unsigned long addr;
>
> /* Step through page-by-page for vmalloc() mappings: */
> if (is_vmalloc_addr((void *)vaddr))
> step = PAGE_SIZE;
>
> for (addr = start; addr < end; addr += step) {
> phys_addr_t start_pa = slow_virt_to_phys(addr);
> phys_addr_t end_pa = start_pa + step;
>
> if (!tdx_enc_status_changed_phys(start_pa, end_pa, enc))
> return false;
> }
>
> Note that this also doesn't abuse 'start' by making it a loop variable.
> It also, uh, uses a for() loop.
>
> The only downside is that it costs a page table walk for direct map
> virt=>phys conversion. I can live with that.

Your version is concise and more readable.
Thanks for improving the patch! I'll use this in v10 shortly.