Subject: [PATCH v5 0/3] Add TDX Guest Attestation support

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. A guest VM with TDX support is called
a TD guest.

In a TD guest, the attestation process is used to verify the
trustworthiness of the TD guest to 3rd party servers. Such servers
require attestation before sending sensitive information to TD guests.
One usage example is getting encryption keys from a key server for
mounting an encrypted rootfs or secondary drive.

The following patches add attestation support to the TDX guest,
including the attestation user interface driver and related hypercall
support.

Any distribution enabling TDX is also expected to need attestation. So
enable it by default with TDX guest support. The compiled size is
quite small (500 bytes).

Dependencies:
--------------

This feature depends on the TDX guest core patch series:

https://lore.kernel.org/all/[email protected]/T/

Changes since v4:
* Removed platform driver model in attestation driver and used
miscdevice and initcall approach.
* Since dma_alloc*() APIs require a valid device reference,
replaced it with __get_free_pages() and set_memory_decrypted()
for quote memory allocation.
* Removed tdx_mcall_tdreport() and moved TDG.MR.REPORT TDCALL code
to tdx_get_report().
* Used kmalloc() for TDREPORT memory allocation instead of
get_zeroed_page().
* Returned -EINVAL in default case of tdx_attest_ioctl().
* Added struct tdx_report_req to explicitly mention the
TDX_CMD_GET_REPORT IOCTL argument.
* Removed tdx_get_quote_hypercall() and moved hypercall code to
attestation driver itself.
* Removed GetQuote timeout support (since it is not defined in the
spec).
* Added support to check for spurious callback interrupt in GetQuote
request.
* Fixed commit log and comments as per review suggestions.


Changes since v3:
* Moved the attestation driver from platform/x86 to arch/x86/coco/tdx/ and
renamed intel_tdx_attest.c to attest.c.
* Dropped CONFIG_INTEL_TDX_ATTESTATION and added support to compile
attestation changes with CONFIG_INTEL_TDX_GUEST option.
* Merged patch titled "x86/tdx: Add tdx_mcall_tdreport() API support" and
"platform/x86: intel_tdx_attest: Add TDX Guest attestation interface" into
a single patch.
* Moved GetQuote IOCTL support changes from patch titled "platform/x86:
intel_tdx_attest: Add TDX Guest attestation interface driver" to a
separate patch.
* Removed 8K size restriction when requesting quote, and added support
to let userspace decide the quote size.
* Added support to allow the attestation agent to configure the quote
generation timeout value.
* Fixed commit log and comments as per review comments.

Changes since v2:
* As per Hans' suggestion, modified the attestation driver to use the
platform device driver model.
* Modified tdx_hcall_get_quote() and tdx_mcall_tdreport() APIs to
return TDCALL error code instead of generic error info (like -EIO).
* Removed attestation test app patch from this series to simplify
the patchset and review process. Test app patches will be submitted
once attestation support patches are merged.
* Since patches titled "x86/tdx: Add SetupEventNotifyInterrupt TDX
hypercall support" and "x86/tdx: Add TDX Guest event notify
interrupt vector support" are related, combining them into a
single patch.

Changes since v1:
* Moved the test driver from "tools/tdx/attest/tdx-attest-test.c" to
"tools/arch/x86/tdx/attest/tdx-attest-test.c" as per Hans' review
suggestion.
* Minor commit log and comment fixes in patches titled
"x86/tdx: Add tdx_mcall_tdreport() API support" and "x86/tdx:
Add tdx_hcall_get_quote() API support"
* Extended the tdx_hcall_get_quote() API to accept the GPA length as an
argument to accommodate the latest TDQUOTE TDVMCALL specification
update.
* Added support for tdx_setup_ev_notify_handler() and
tdx_remove_ev_notify_handler() in patch titled "x86/tdx: Add TDX
Guest event notify interrupt vector support"

Kuppuswamy Sathyanarayanan (3):
x86/tdx: Add TDX Guest attestation interface driver
x86/tdx: Add TDX Guest event notify interrupt support
x86/tdx: Add Quote generation support

arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/attest.c | 273 +++++++++++++++++++++++++++++
arch/x86/coco/tdx/tdx.c | 73 ++++++++
arch/x86/include/asm/hardirq.h | 3 +
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/irq_vectors.h | 7 +-
arch/x86/include/asm/tdx.h | 4 +
arch/x86/include/uapi/asm/tdx.h | 76 ++++++++
arch/x86/kernel/irq.c | 7 +
9 files changed, 447 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/coco/tdx/attest.c
create mode 100644 arch/x86/include/uapi/asm/tdx.h

--
2.25.1


Subject: [PATCH v5 2/3] x86/tdx: Add TDX Guest event notify interrupt support

Host-guest event notification via a configured interrupt vector is
useful when a guest makes an asynchronous request and needs a callback
from the host to indicate completion, or when the host needs to notify
the guest about events like device removal. One example is the
callback requirement of the asynchronous GetQuote hypercall.

In a TDX guest, the SetupEventNotifyInterrupt hypercall can be used by
the guest to tell the VMM which interrupt vector to use as the
event-notify vector. Details about the SetupEventNotifyInterrupt
hypercall can be found in the TDX Guest-Host Communication Interface
(GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
Add a tdx_hcall_set_notify_intr() helper function to implement the
SetupEventNotifyInterrupt hypercall.

Reserve IRQ vector 0xec for the TDX guest to receive the event
completion notification from the VMM, and add the related IDT handler
to process the notification event.

Add support to track the notification event count via
/proc/interrupts.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 73 ++++++++++++++++++++++++++++++
arch/x86/include/asm/hardirq.h | 3 ++
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/irq_vectors.h | 7 ++-
arch/x86/include/asm/tdx.h | 4 ++
arch/x86/kernel/irq.c | 7 +++
6 files changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..b49211994864 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -11,6 +11,10 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/apic.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+#include <asm/desc.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
@@ -19,6 +23,7 @@

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004

/* MMIO direction */
#define EPT_READ 0
@@ -34,6 +39,28 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+ tdx_event_notify_handler = handler;
+}
+EXPORT_SYMBOL_GPL(tdx_setup_ev_notify_handler);
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+ tdx_event_notify_handler = NULL;
+}
+EXPORT_SYMBOL_GPL(tdx_remove_ev_notify_handler);
+
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
@@ -98,6 +125,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}

+/* TDX guest event notification handler */
+DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ inc_irq_stat(irq_tdx_event_notify_count);
+
+ if (tdx_event_notify_handler)
+ tdx_event_notify_handler();
+
+ ack_APIC_irq();
+
+ set_irq_regs(old_regs);
+}
+
+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * Return 0 on success, or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+ /* Minimum vector value allowed is 32 */
+ if (vector < 32)
+ return -EINVAL;
+
+ /*
+ * Register callback vector address with VMM. More details
+ * about the ABI can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec titled
+ * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+ */
+ if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+ return -EIO;
+
+ return 0;
+}
+
static u64 get_cc_mask(void)
{
struct tdx_module_output out;
@@ -688,5 +755,11 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;

+ alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
+ asm_sysvec_tdx_event_notify);
+
+ if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
+ pr_warn("Setting event notification interrupt failed\n");
+
pr_info("Guest detected\n");
}
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..582deff56210 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
unsigned int irq_hv_reenlightenment_count;
unsigned int hyperv_stimer0_count;
#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+ unsigned int irq_tdx_event_notify_count;
+#endif
} ____cacheline_aligned irq_cpustat_t;

DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..655086dd940e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_xen_hvm_callback);
DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_kvm_asyncpf_interrupt);
#endif

+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR, sysvec_tdx_event_notify);
+#endif
+
#undef X86_TRAP_OTHER

#endif
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..82ac0c0a34b1 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,12 @@
#define HYPERV_STIMER0_VECTOR 0xed
#endif

-#define LOCAL_TIMER_VECTOR 0xec
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+/* Vector on which TDX Guest event notification is delivered */
+#define TDX_GUEST_EVENT_NOTIFY_VECTOR 0xec
+#endif
+
+#define LOCAL_TIMER_VECTOR 0xeb

#define NR_VECTORS 256

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);

bool tdx_early_handle_ve(struct pt_regs *regs);

+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..a96ecd866723 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
seq_printf(p, "%10u ",
irq_stats(j)->kvm_posted_intr_wakeup_ipis);
seq_puts(p, " Posted-interrupt wakeup event\n");
+#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+ seq_printf(p, "%*s: ", prec, "TGN");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->irq_tdx_event_notify_count);
+ seq_puts(p, " TDX Guest event notification\n");
#endif
return 0;
}
--
2.25.1

Subject: [PATCH v5 3/3] x86/tdx: Add Quote generation support

In a TDX guest, the second stage of the attestation process is to send
the TDREPORT to the Quoting Enclave (QE) or Quote Generation Service
(QGS) to generate the TD Quote. For platforms that do not support
communication channels like vsock or TCP/IP, implement support to get
the TD Quote using a TDVMCALL. The GetQuote hypercall can be used by
the TD guest to request that the VMM facilitate Quote generation via
the QE/QGS. More details about the GetQuote hypercall can be found in
the TDX Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
section titled "TDG.VP.VMCALL<GetQuote>".

Since GetQuote is an asynchronous hypercall, it does not block until
the TD Quote is generated. Instead, the VMM uses the callback interrupt
vector configured by the SetupEventNotifyInterrupt hypercall to notify
the guest about Quote generation completion or failure. Upon receiving
the completion notification, the status can be found in the TD Quote
request header.

Although the GHCI specification does not restrict parallel GetQuote
requests, Quote generation is not in a performance-critical path and
the frequency of attestation requests is expected to be low, so only
serialized quote generation requests are supported. Parallel quote
request support can be added if demand arises.

Currently, allocation of the buffer used to get the TD Quote data and
its conversion to a shared mapping (using set_memory_decrypted()) are
handled within the IOCTL handler. Although this increases the
TDX_CMD_GET_QUOTE IOCTL response time, the overhead is negligible
compared to the time required for quote generation to complete, so
IOCTL performance optimization is not considered at this time.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/attest.c | 135 ++++++++++++++++++++++++++++++++
arch/x86/include/uapi/asm/tdx.h | 40 ++++++++++
2 files changed, 175 insertions(+)

diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
index 4543a0264ce7..6aba85297708 100644
--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -22,6 +22,7 @@
#include <linux/io.h>
#include <asm/apic.h>
#include <asm/tdx.h>
+#include <asm/coco.h>
#include <asm/irq_vectors.h>
#include <uapi/asm/tdx.h>

@@ -29,12 +30,20 @@

/* TDREPORT module call leaf ID */
#define TDX_GET_REPORT 4
+/* GetQuote hypercall leaf ID */
+#define TDVMCALL_GET_QUOTE 0x10002

/* Upper 32 bits have the status code, so mask it */
#define TDCALL_STATUS_CODE_MASK 0xffffffff00000000
#define TDCALL_STATUS_CODE(a) ((a) & TDCALL_STATUS_CODE_MASK)

static struct miscdevice miscdev;
+/* Completion object to track GetQuote completion status */
+static DECLARE_COMPLETION(req_compl);
+/* Mutex to serialize GetQuote requests */
+static DEFINE_MUTEX(quote_lock);
+/* Pointer to track current quote request */
+static void *tdquote;

static long tdx_get_report(void __user *argp)
{
@@ -88,6 +97,126 @@ static long tdx_get_report(void __user *argp)
return ret;
}

+/* tdx_get_quote_hypercall() - Request to get TD Quote using TDREPORT.
+ *
+ * @quote_buf : Address of 4KB aligned GPA memory which contains
+ * TDREPORT. It is also used as output buffer to copy
+ * the GetQuote output upon successful execution.
+ * @quote_len : Length of the GPA in bytes.
+ *
+ * Return hypercall status code.
+ */
+static long tdx_get_quote_hypercall(void *quote_buf, u64 quote_len)
+{
+ struct tdx_hypercall_args args = {0};
+
+ args.r10 = TDX_HYPERCALL_STANDARD;
+ args.r11 = TDVMCALL_GET_QUOTE;
+ args.r12 = cc_mkdec(virt_to_phys(quote_buf));
+ args.r13 = quote_len;
+
+ /*
+ * Pass the physical address of TDREPORT to the VMM and
+ * trigger the Quote generation. It is not a blocking
+ * call, hence completion of this request will be notified to
+ * the TD guest via a callback interrupt. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec titled "TDG.VP.VMCALL<GetQuote>".
+ */
+ return __tdx_hypercall(&args, 0);
+}
+
+static long tdx_get_quote(void __user *argp)
+{
+ struct tdx_quote_req quote_req;
+ long ret = 0;
+ int order;
+
+ /* Hold lock to serialize GetQuote requests */
+ mutex_lock(&quote_lock);
+
+ reinit_completion(&req_compl);
+
+ /* Copy GetQuote request struct from user buffer */
+ if (copy_from_user(&quote_req, argp, sizeof(struct tdx_quote_req))) {
+ ret = -EFAULT;
+ goto quote_failed;
+ }
+
+ /* Make sure the length & timeout is valid */
+ if (!quote_req.len || !quote_req.timeout) {
+ ret = -EINVAL;
+ goto quote_failed;
+ }
+
+ /* Get order for Quote buffer page allocation */
+ order = get_order(quote_req.len);
+
+ /*
+ * Allocate buffer to get TD Quote from the VMM.
+ * Size needs to be 4KB aligned (which is already
+ * met in page allocation).
+ */
+ tdquote = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!tdquote) {
+ ret = -ENOMEM;
+ goto quote_failed;
+ }
+
+ /*
+ * Since this buffer will be shared with the VMM via GetQuote
+ * hypercall, decrypt it.
+ */
+ ret = set_memory_decrypted((unsigned long)tdquote, 1UL << order);
+ if (ret)
+ goto quote_failed;
+
+ /* Copy TDREPORT from user buffer to kernel Quote buffer */
+ if (copy_from_user(tdquote, (void __user *)quote_req.buf, quote_req.len)) {
+ ret = -EFAULT;
+ goto quote_failed;
+ }
+
+ /* Submit GetQuote Request */
+ ret = tdx_get_quote_hypercall(tdquote, (1ULL << order) * PAGE_SIZE);
+ if (ret) {
+ pr_err("GetQuote hypercall failed, status:%lx\n", ret);
+ ret = -EIO;
+ goto quote_failed;
+ }
+
+ /* Wait for attestation completion */
+ ret = wait_for_completion_interruptible(&req_compl);
+ if (ret < 0) {
+ ret = -EIO;
+ goto quote_failed;
+ }
+
+ /* Copy output data back to user buffer */
+ if (copy_to_user((void __user *)quote_req.buf, tdquote, quote_req.len))
+ ret = -EFAULT;
+
+quote_failed:
+ if (tdquote)
+ free_pages((unsigned long)tdquote, order);
+ tdquote = NULL;
+ mutex_unlock(&quote_lock);
+ return ret;
+}
+
+static void attestation_callback_handler(void)
+{
+ struct tdx_quote_hdr *quote_hdr;
+
+ quote_hdr = (struct tdx_quote_hdr *) tdquote;
+
+ /* Check for spurious callback IRQ case */
+ if (!tdquote || quote_hdr->status == GET_QUOTE_IN_FLIGHT)
+ return;
+
+ complete(&req_compl);
+}
+
static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -98,6 +227,9 @@ static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
case TDX_CMD_GET_REPORT:
ret = tdx_get_report(argp);
break;
+ case TDX_CMD_GET_QUOTE:
+ ret = tdx_get_quote(argp);
+ break;
default:
pr_debug("cmd %d not supported\n", cmd);
ret = -EINVAL;
@@ -121,6 +253,9 @@ static int __init tdx_attestation_init(void)
if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return -EIO;

+ /* Register attestation event notify handler */
+ tdx_setup_ev_notify_handler(attestation_callback_handler);
+
miscdev.name = DRIVER_NAME;
miscdev.minor = MISC_DYNAMIC_MINOR;
miscdev.fops = &tdx_attest_fops;
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
index 9a7377723667..3eb60253432c 100644
--- a/arch/x86/include/uapi/asm/tdx.h
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -33,4 +33,44 @@ struct tdx_report_req {
/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT] */
#define TDX_CMD_GET_REPORT _IOWR('T', 0x01, struct tdx_report_req)

+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf : Pass user data that includes TDREPORT as input. Upon
+ * successful completion of IOCTL, output is copied
+ * back to the same buffer.
+ * @len : Length of the buffer.
+ */
+struct tdx_quote_req {
+ __u64 buf;
+ __u64 len;
+ __u32 timeout;
+};
+
+/* Get TD Quote from QE/QGS using TDREPORT passed by the user */
+#define TDX_CMD_GET_QUOTE _IOR('T', 0x02, struct tdx_quote_req)
+
+/* TD Quote status codes */
+#define GET_QUOTE_SUCCESS 0
+#define GET_QUOTE_IN_FLIGHT 0xffffffffffffffff
+#define GET_QUOTE_ERROR 0x8000000000000000
+#define GET_QUOTE_SERVICE_UNAVAILABLE 0x8000000000000001
+
+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+ /* Quote version, filled by TD */
+ __u64 version;
+ /* Status code of Quote request, filled by VMM */
+ __u64 status;
+ /* Length of TDREPORT, filled by TD */
+ __u32 in_len;
+ /* Length of Quote, filled by VMM */
+ __u32 out_len;
+ /* Actual Quote data */
+ __u64 data[0];
+};
+
#endif /* _UAPI_ASM_X86_TDX_H */
--
2.25.1

Subject: [PATCH v5 1/3] x86/tdx: Add TDX Guest attestation interface driver

In TDX guest, attestation is used to verify the trustworthiness of a TD
to other entities before provisioning secrets to the TD.
   
One usage example: when a TD guest uses an encrypted drive whose
decryption keys are stored in a secure 3rd party key server, the key
server can use attestation to verify the TD's trustworthiness before
releasing the decryption keys to the TD.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
the TDX module which contains TD-specific information (such as TD
measurements), platform security version, and the MAC to protect the
integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
get the TDREPORT from the TDX module. A user-provided 64-Byte
REPORTDATA is used as input and included in the TDREPORT. Typically it
can be some nonce provided by attestation service so the TDREPORT can
be verified uniquely. More details about TDREPORT can be found in
Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) or Quote Generation Service
(QGS) to generate the Quote. However, the method of sending the
TDREPORT to the QE/QGS, the communication channel used, and the data
format used are specific to the QE/QGS implementation.

In a typical implementation, the TD userspace attestation software
gets the TDREPORT from the TD kernel and sends it to the QE/QGS, and
the QE/QGS returns the Quote. The TD attestation software can use any
available communication channel to talk to the QE/QGS, such as vsock
or TCP/IP.

To support cases where those communication channels are not directly
available to the TD, TDX also defines TDVMCALLs to allow the TD to ask
the VMM to help with sending the TDREPORT and receiving the Quote.
This support is documented in the GHCI spec, section titled "5.4 TD
attestation".

Implement a basic attestation driver to allow TD userspace to get the
TDREPORT, which is sent to QE by the attestation software to generate
a Quote for remote verification.

Also note that explicit access permissions are not enforced in this
driver because the quote and measurements are not secrets. However,
the access permissions of the device node can be used to set any
desired access policy. The udev default is usually root-only access.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/attest.c | 138 ++++++++++++++++++++++++++++++++
arch/x86/include/uapi/asm/tdx.h | 36 +++++++++
3 files changed, 175 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdx/attest.c
create mode 100644 arch/x86/include/uapi/asm/tdx.h

diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..d2db3e6770e5 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0

-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdcall.o attest.o
diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
new file mode 100644
index 000000000000..4543a0264ce7
--- /dev/null
+++ b/arch/x86/coco/tdx/attest.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * attest.c - TDX guest attestation interface driver.
+ *
+ * Implements user interface to trigger attestation process.
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *
+ */
+
+#define pr_fmt(fmt) "x86/tdx: attest: " fmt
+
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/set_memory.h>
+#include <linux/dma-mapping.h>
+#include <linux/jiffies.h>
+#include <linux/io.h>
+#include <asm/apic.h>
+#include <asm/tdx.h>
+#include <asm/irq_vectors.h>
+#include <uapi/asm/tdx.h>
+
+#define DRIVER_NAME "tdx-attest"
+
+/* TDREPORT module call leaf ID */
+#define TDX_GET_REPORT 4
+
+/* Upper 32 bits have the status code, so mask it */
+#define TDCALL_STATUS_CODE_MASK 0xffffffff00000000
+#define TDCALL_STATUS_CODE(a) ((a) & TDCALL_STATUS_CODE_MASK)
+
+static struct miscdevice miscdev;
+
+static long tdx_get_report(void __user *argp)
+{
+ void *reportdata = NULL, *tdreport = NULL;
+ long ret = 0;
+
+ /* Allocate buffer space for REPORTDATA */
+ reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
+ if (!reportdata)
+ return -ENOMEM;
+
+ /* Allocate buffer space for TDREPORT */
+ tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
+ if (!tdreport) {
+ ret = -ENOMEM;
+ goto failed;
+ }
+
+ /* Copy REPORTDATA from the user buffer */
+ if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
+ ret = -EFAULT;
+ goto failed;
+ }
+
+ /*
+ * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
+ *
+ * Pass the physical address of user generated REPORTDATA
+ * and the physical address of the output buffer to the TDX
+ * module to generate the TDREPORT. Generated data contains
+ * measurements/configuration data of the TD guest. More info
+ * about ABI can be found in TDX 1.0 Module specification, sec
+ * titled "TDG.MR.REPORT".
+ */
+ ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
+ virt_to_phys(reportdata), 0, 0, NULL);
+ if (ret) {
+ pr_debug("TDREPORT TDCALL failed, status:%lx\n",
+ TDCALL_STATUS_CODE(ret));
+ ret = -EIO;
+ goto failed;
+ }
+
+ /* Copy TDREPORT back to the user buffer */
+ if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
+ ret = -EFAULT;
+
+failed:
+ kfree(reportdata);
+ kfree(tdreport);
+ return ret;
+}
+
+static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ long ret = 0;
+
+ switch (cmd) {
+ case TDX_CMD_GET_REPORT:
+ ret = tdx_get_report(argp);
+ break;
+ default:
+ pr_debug("cmd %d not supported\n", cmd);
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+static const struct file_operations tdx_attest_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = tdx_attest_ioctl,
+ .llseek = no_llseek,
+};
+
+static int __init tdx_attestation_init(void)
+{
+ long ret;
+
+ /* Make sure we are in a valid TDX platform */
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return -EIO;
+
+ miscdev.name = DRIVER_NAME;
+ miscdev.minor = MISC_DYNAMIC_MINOR;
+ miscdev.fops = &tdx_attest_fops;
+
+ ret = misc_register(&miscdev);
+ if (ret) {
+ pr_err("misc device registration failed\n");
+ return ret;
+ }
+
+ pr_debug("module initialization success\n");
+
+ return 0;
+}
+device_initcall(tdx_attestation_init);
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
new file mode 100644
index 000000000000..9a7377723667
--- /dev/null
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_TDX_H
+#define _UAPI_ASM_X86_TDX_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORTDATA_LEN 64
+
+/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORT_LEN 1024
+
+/**
+ * struct tdx_report_req: Get TDREPORT from the TDX module.
+ *
+ * @reportdata : User-defined 64-Byte REPORTDATA to be included into
+ * TDREPORT. Typically it can be some nonce provided by
+ * attestation software so the generated TDREPORT can be
+ * uniquely verified.
+ * @tdreport : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
+ * TDX_REPORT_LEN.
+ *
+ * Used in TDX_CMD_GET_REPORT IOCTL request.
+ */
+struct tdx_report_req {
+ union {
+ __u8 reportdata[TDX_REPORTDATA_LEN];
+ __u8 tdreport[TDX_REPORT_LEN];
+ };
+};
+
+/* Get TDREPORT from the TDX module using TDCALL[TDG.MR.REPORT] */
+#define TDX_CMD_GET_REPORT _IOWR('T', 0x01, struct tdx_report_req)
+
+#endif /* _UAPI_ASM_X86_TDX_H */
--
2.25.1

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support



On 5/3/22 3:28 PM, Sathyanarayanan Kuppuswamy wrote:
>
>
> On 5/3/22 3:24 PM, Dave Hansen wrote:
>> On 5/2/22 18:27, Kirill A. Shutemov wrote:
>>>> Again, Dave and Andi already commented you should use vmap() to
>>>> avoid breaking
>>>> up the direct-mapping.  Please use vmap() instead.
>>>>
>>>> https://lore.kernel.org/all/[email protected]/
>>>>
>>>>
>>>> Will review the rest later.
>>> I would rather convert it to use DMA API for memory allocation. It will
>>> tap into swiotlb buffer that already converted and there's no need to
>>> touch direct mapping. Both allocation and freeing such memory is cheaper
>>> because of that.
>>
>> Sathya, I don't quite understand why you are so forcefully declining to
>> incorporate review feedback on this point.  I gave very specific
>> feedback about the kind of mapping you need and that you should avoid
>> fragmenting the direct map if at all possible.
>>
>> Why is this code still fragmenting the direct map?
>
> I have already implemented it and testing it now.

I mean, I have already implemented the vmap based solution.

>
> In this discussion, we are comparing the use of DMA API for memory
> allocation vs vmap/sharing it in driver itself.
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-05-06 17:55:26

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
> On 5/5/22 15:15, Kai Huang wrote:
> > set_memory_xx() is supposedly only for direct-mapping. Please use my
> > suggestion above.
>
> Kai, please take a look at some of the other users, especially
> set_memory_x(). See how long the "supposed" requirement holds up.
>
> That said, I've forgotten by now if this _could_ have used vmalloc() or
> vmap() or vmap_pfn(). None of the logic about why or how the allocator
> and mapping design decisions were made. Could that be rectified for
> the next post?

Hi Dave,

(Sorry previous reply was too fast..)

I spent some time looking into how __change_page_attr_set_clr() is implemented,
which is called by all set_memory_xxx() variants. If my understanding is
correct, __change_page_attr_set_clr() will work for vmap() variants, because it
internally uses lookup_address(), which walks the page table directly, to find
the actual PTE (and PFN). So set_memory_decrypted() can be fixed to support
vmap() mapping for TDX.

However, looking at the code, set_memory_decrypted() calls
__change_page_attr_set_clr(&cpa, 1). The second argument is 'checkalias', which
means that even if we call set_memory_decrypted() against a vmap()
address, the aliasing mappings will be changed too. And if I
understand correctly, the aliasing mappings include the direct mapping
too:

static int cpa_process_alias(struct cpa_data *cpa)
{
	struct cpa_data alias_cpa;
	unsigned long laddr = (unsigned long)__va(cpa->pfn << PAGE_SHIFT);
	unsigned long vaddr;
	int ret;

	if (!pfn_range_is_mapped(cpa->pfn, cpa->pfn + 1))
		return 0;

	/*
	 * No need to redo, when the primary call touched the direct
	 * mapping already:
	 */
	vaddr = __cpa_addr(cpa, cpa->curpage);
	if (!(within(vaddr, PAGE_OFFSET,
		     PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)))) {

		alias_cpa = *cpa;
		alias_cpa.vaddr = &laddr;
		alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);
		alias_cpa.curpage = 0;

		cpa->force_flush_all = 1;

		ret = __change_page_attr_set_clr(&alias_cpa, 0);
		if (ret)
			return ret;
	}

#ifdef CONFIG_X86_64
	/*
	 * If the primary call didn't touch the high mapping already
	 * and the physical address is inside the kernel map, we need
	 * to touch the high mapped kernel as well:
	 */
	if (!within(vaddr, (unsigned long)_text, _brk_end) &&
	    __cpa_pfn_in_highmap(cpa->pfn)) {
		unsigned long temp_cpa_vaddr = (cpa->pfn << PAGE_SHIFT) +
					       __START_KERNEL_map - phys_base;
		alias_cpa = *cpa;
		alias_cpa.vaddr = &temp_cpa_vaddr;
		alias_cpa.flags &= ~(CPA_PAGES_ARRAY | CPA_ARRAY);
		alias_cpa.curpage = 0;

		cpa->force_flush_all = 1;
		/*
		 * The high mapping range is imprecise, so ignore the
		 * return value.
		 */
		__change_page_attr_set_clr(&alias_cpa, 0);
	}
#endif

	return 0;
}

As you can see, the first chunk checks whether the virtual address is in the
direct mapping, and if it is not, the direct mapping is changed too.

The second chunk even changes the high kernel mapping.

So, if we use set_memory_decrypted(), it makes no difference whether the
address is a vmap() address or a direct-mapping address: the direct mapping
will be changed anyway.

(However, it seems that if we call set_memory_decrypted() against a
direct-mapping address, the vmap() mapping won't be affected, because
cpa_process_alias() doesn't check the vmap() area.)

However, I may have missed something. Kirill, please help confirm whether
you see the same.

--
Thanks,
-Kai



Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

Hi Kai,

On 5/5/22 3:50 AM, Kai Huang wrote:
>
>> +	/* Submit GetQuote Request */
>> +	ret = tdx_get_quote_hypercall(buf);
>> +	if (ret) {
>> +		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
>> +		ret = -EIO;
>> +		goto free_entry;
>> +	}
>> +
>> +	/* Add current quote entry to quote_list */
>> +	add_quote_entry(entry);
>> +
>> +	/* Wait for attestation completion */
>> +	ret = wait_for_completion_interruptible(&entry->compl);
>> +	if (ret < 0) {
>> +		ret = -EIO;
>> +		goto del_entry;
>> +	}
>
> This is misuse of wait_for_completion_interruptible().
>
> xxx_interruptible() essentially means the operation can be interrupted by a
> signal. Using xxx_interruptible() in a driver IOCTL essentially means that
> when it returns due to a signal, the IOCTL should return -EINTR to let
> userspace know that the application received a signal that needs handling
> and that the IOCTL isn't finished, so it should retry. So here we should
> return -EINTR (and clean up all the stuff that has been done) when
> wait_for_completion_interruptible() returns -ERESTARTSYS (in fact, it
> returns only -ERESTARTSYS or 0).


But in this case, I was expecting the user agent to check the Quote
buffer status code to understand the IN_FLIGHT, SUCCESS or FAILURE
status and handle it appropriately. So IMO, it should not matter what
we return for the failure case. For the IN_FLIGHT case, they can retry
if they want after checking the status code.

But I agree that EINTR is the appropriate return value for an
interrupted case. So, I will change it.

>
> Since normally userspace application just ignore signals, and in this particular
> case, asking userspace to retry just makes things more complicated to handle, I

I am not sure how the user agent is going to be implemented, so I don't
want to make any assumptions. In this case, we are not asking user space
to implement retry support using signals; we are just giving them the
option to do it. It is up to them whether they want to use it.

> think you can just use wait_for_completion_killable(), which only returns when
> the application receives signal that it is going to be killed.

If you agree with the above point, we can just leave it as
*_interruptible(). But if you still see other issues, please let me
know.

>
>> +
>> +	/* Copy output data back to user buffer */
>> +	if (copy_to_user((void __user *)quote_req.buf, buf->vmaddr,
>> +			 quote_req.len))
>> +		ret = -EFAULT;
>> +
>> +del_entry:
>> +	del_quote_entry(entry);
>> +free_entry:
>> +	free_quote_entry(entry);
>
> As I (and Isaku) mentioned before, when wait_for_completion_killable() returns
> with error, you cannot just convert the buffer to private and free it. The VMM
> is still owning it (IN_FLIGHT).

Do you know what happens when the VMM writes to a page which is already
marked private? Will MapGPA fail during the shared-to-private conversion?

>
> One way to handle is you can put those buffers that are still owned by VMM to a
> new list, and have some kernel thread to periodically check buffer's status and
> free those are already released by VMM. I haven't thought thoroughly, so maybe
> there's better way to handle, though.

Instead of adding a new thread just to handle the cleanup, maybe I
can move the entire callback interrupt logic (the contents of
attestation_callback_handler()) to a work queue and wake up this
work queue whenever we get the callback interrupt.

We can let the same work queue handle the cleanup for interrupted
requests. As for how to identify the interrupted requests, we
can add a bit in queue_entry for it and set it when we exit
wait_for_completion*() due to signals.
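
Roughly, I am thinking of something like the following (all names are
hypothetical, locking is omitted; this is only a sketch of the idea):

	struct quote_entry {
		...
		/* set when wait_for_completion*() is interrupted */
		bool interrupted;
		...
	};

	static void quote_callback_work(struct work_struct *work)
	{
		struct quote_entry *entry, *next;

		list_for_each_entry_safe(entry, next, &quote_list, list) {
			/* Buffer still owned by the VMM, check later */
			if (entry->buf->status == GET_QUOTE_IN_FLIGHT)
				continue;
			if (entry->interrupted) {
				/* Abandoned request; the VMM has released
				 * the buffer, so it is safe to reclaim */
				del_quote_entry(entry);
				free_quote_entry(entry);
			} else {
				complete(&entry->compl);
			}
		}
	}

The callback interrupt handler would then just schedule_work() this.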

I will put together some sample logic and get back to you.



>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support



On 5/5/22 5:11 PM, Kai Huang wrote:
> On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
>> On 5/5/22 15:15, Kai Huang wrote:
>>> set_memory_xx() is supposedly only for direct-mapping. Please use my
>>> suggestion above.
>>
>> Kai, please take a look at some of the other users, especially
>> set_memory_x(). See how long the "supposed" requirement holds up.
>
> Right, I should not have used "supposed". My bad. I got that impression by
> roughly looking at set_memory_{uc|wc..}(). It looks like they can only work
> on the direct mapping, as they internally use __pa():
>
> int set_memory_wc(unsigned long addr, int numpages)
> {
> 	int ret;
>
> 	ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
> 			      _PAGE_CACHE_MODE_WC, NULL);
> 	if (ret)
> 		return ret;
>
> 	ret = _set_memory_wc(addr, numpages);
> 	if (ret)
> 		memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);
>
> 	return ret;
> }
>
> Don't all set_memory_xxx() functions have the same scheme?
>
>>
>> That said, I've forgotten by now if this _could_ have used vmalloc() or
>> vmap() or vmap_pfn(). None of the logic about why or how the allocator
> and mapping design decisions were made. Could that be rectified for
>> the next post?

In addition to Kai's reply, here are a few more points about where this
memory is used, along with your comment history:

This memory is used by the VMM to copy the TD Quote data into after
completing the Quote generation request from the TD guest. It requires a
physically contiguous shared buffer of a given length, which is passed to
the VMM using the GetQuote hypercall.

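For context, the shared buffer starts with a GetQuote header which, per my
reading of the GHCI specification, looks roughly like this (the exact
layout should be double-checked against the spec revision in use):

	struct tdx_quote_hdr {
		u64 version;	/* Quote format version */
		u64 status;	/* GET_QUOTE_SUCCESS/IN_FLIGHT/error */
		u32 in_len;	/* Length of the TDREPORT passed in */
		u32 out_len;	/* Length of the Quote written by the VMM */
		u64 data[];	/* TDREPORT on input, Quote on output */
	};

The VMM keeps status at GET_QUOTE_IN_FLIGHT while it owns the buffer and
updates it when the Quote is ready.
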
Initially, we allocated this memory using alloc_pages* (or some variant of
it) and directly called set_memory_{de/en}crypted() to share/unshare these
pages. For that approach, you suggested using vmap() so as not to affect
the direct mapping.

Regarding vmalloc(): we cannot use it because we need physically
contiguous memory.

The history of the DMA API vs. vmap() approach is already explained by Kai
below.

I will add the relevant details to the commit log in the next patch
submission.

>
> Looking at set_memory_{encrypted|decrypted}() again, it seems they currently
> only work on the direct mapping for TDX (as Sathya's code has shown). For AMD
> it appears they can work on any virtual address, as AMD uses lookup_address()
> to find the PFN.
>
> So if the two are supposed to work on any virtual address, then it makes sense
> to fix this on the TDX side.
>
> Btw, regarding my suggestion of using vmap() with pgprot_decrypted() +
> MapGPA(): after thinking again, I think there is also a problem -- the TLB for
> the private mapping and the cache are not flushed. So it looks like we should
> fix set_memory_decrypted() to work on any virtual address and use it for
> vmap().
>
> Back to the "why and how the allocator and mapping design decisions were
> made", let me summarize the options and my preference below:
>
> 1) Using the DMA API. This guarantees for TDX 1.0 that the allocated buffer is
> shared (set_memory_decrypted() is called for swiotlb). But this may not
> guarantee the buffer is shared in future generations of TDX. This of course
> depends on how we are going to change those DMA API implementations for future
> TDX, but conceptually using the DMA API is more for convenience. Also, in
> order to use the DMA API, we need more code to handle an additional 'platform
> device', which is not mandatory for the attestation driver.
>
> 2) Using vmap() + set_memory_decrypted(). This requires changing the latter to
> support any virtual address for TDX. But now I guess it's the right way, since
> it's better to have some infrastructure to convert private pages to shared
> besides the DMA API anyway.
>
> 3) Using vmap() + MapGPA(). This requires additional work on TLB and cache
> flushing. Now I think we should not use this.
>
> Given the above, I personally think 2) is better.
>
> Kirill, what's your opinion?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-05-09 02:03:32

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Thu, 2022-05-05 at 13:53 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai,
>
> On 5/4/22 4:28 PM, Kai Huang wrote:
> > On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
> > > --- a/arch/x86/coco/tdx/tdx.c
> > > +++ b/arch/x86/coco/tdx/tdx.c
> > > @@ -15,6 +15,7 @@
> > >   #include <asm/idtentry.h>
> > >   #include <asm/irq_regs.h>
> > >   #include <asm/desc.h>
> > > +#include <asm/io.h>
> > >
> > >   /* TDX module Call Leaf IDs */
> > >   #define TDX_GET_INFO                   1
> > > @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start,
> > > unsigned long len,
> > >    */
> > >   static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
> > > bool enc)
> > >   {
> > > -       phys_addr_t start = __pa(vaddr);
> > > -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
> > > +       phys_addr_t start;
> > > +       phys_addr_t end;
> > > +
> > > +       if (is_vmalloc_addr((void *)vaddr))
> > > +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
> > > +       else
> > > +               start = __pa(vaddr);
> > > +
> > > +       end = start + numpages * PAGE_SIZE;
> > >
> > >          if (!enc) {
> > >                  /* Set the shared (decrypted) bits: */
> >
> > Looks set_memory_decrypted() only works for direct-mapping, so you should not
> > use this. Instead, you can pass shared bit in 'prot' argument (using
> > pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().
>
> Is it because of the above change, or you see other direct-mapping
> dependencies in set_memory_*() functions?
>
>
>
set_memory_xx() is supposedly only for direct-mapping. Please use my
suggestion above.

--
Thanks,
-Kai



2022-05-09 02:12:10

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Thu, 2022-05-05 at 16:06 -0700, Dave Hansen wrote:
> On 5/5/22 15:15, Kai Huang wrote:
> > set_memory_xx() is supposedly only for direct-mapping. Please use my
> > suggestion above.
>
> Kai, please take a look at some of the other users, especially
> set_memory_x(). See how long the "supposed" requirement holds up.

Right, I should not have used "supposed". My bad. I got that impression by
roughly looking at set_memory_{uc|wc..}(). It looks like they can only work
on the direct mapping, as they internally use __pa():

int set_memory_wc(unsigned long addr, int numpages)
{
	int ret;

	ret = memtype_reserve(__pa(addr), __pa(addr) + numpages * PAGE_SIZE,
			      _PAGE_CACHE_MODE_WC, NULL);
	if (ret)
		return ret;

	ret = _set_memory_wc(addr, numpages);
	if (ret)
		memtype_free(__pa(addr), __pa(addr) + numpages * PAGE_SIZE);

	return ret;
}

Don't all set_memory_xxx() functions have the same scheme?

>
> That said, I've forgotten by now if this _could_ have used vmalloc() or
> vmap() or vmap_pfn(). None of the logic about why or how the allocator
> and mapping design decisions were made. Could that be rectified for
> the next post?

Looking at set_memory_{encrypted|decrypted}() again, it seems they currently
only work on the direct mapping for TDX (as Sathya's code has shown). For AMD
it appears they can work on any virtual address, as AMD uses lookup_address()
to find the PFN.

So if the two are supposed to work on any virtual address, then it makes sense
to fix this on the TDX side.

Btw, regarding my suggestion of using vmap() with pgprot_decrypted() +
MapGPA(): after thinking again, I think there is also a problem -- the TLB for
the private mapping and the cache are not flushed. So it looks like we should
fix set_memory_decrypted() to work on any virtual address and use it for
vmap().

Back to the "why and how the allocator and mapping design decisions were
made", let me summarize the options and my preference below:

1) Using the DMA API. This guarantees for TDX 1.0 that the allocated buffer is
shared (set_memory_decrypted() is called for swiotlb). But this may not
guarantee the buffer is shared in future generations of TDX. This of course
depends on how we are going to change those DMA API implementations for future
TDX, but conceptually using the DMA API is more for convenience. Also, in
order to use the DMA API, we need more code to handle an additional 'platform
device', which is not mandatory for the attestation driver.

2) Using vmap() + set_memory_decrypted(). This requires changing the latter to
support any virtual address for TDX. But now I guess it's the right way, since
it's better to have some infrastructure to convert private pages to shared
besides the DMA API anyway.

3) Using vmap() + MapGPA(). This requires additional work on TLB and cache
flushing. Now I think we should not use this.

Given the above, I personally think 2) is better.

Kirill, what's your opinion?

2022-05-09 02:33:14

by Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> Kirill, what's your opinion?

I said before that I think the DMA API is the right tool here.

Speculation about the future of DMA in TDX is irrelevant here. If the
semantics change, we will need to re-evaluate all users. VirtIO uses the DMA
API and it is conceptually the same use case: communicating with the host.

But vmap() + set_memory_decrypted() also works and Sathya already has code
for it. I'm fine with this.

Going a step below to manual MapGPA() is just wrong. We introduced
abstractions for a reason. The protocol for changing GPA status is not
trivial. We should not spread it across the whole kernel codebase.

--
Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support



On 5/5/22 3:15 PM, Kai Huang wrote:
> On Thu, 2022-05-05 at 13:53 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi Kai,
>>
>> On 5/4/22 4:28 PM, Kai Huang wrote:
>>> On Wed, 2022-05-04 at 15:49 -0700, Sathyanarayanan Kuppuswamy wrote:
>>>> --- a/arch/x86/coco/tdx/tdx.c
>>>> +++ b/arch/x86/coco/tdx/tdx.c
>>>> @@ -15,6 +15,7 @@
>>>>   #include <asm/idtentry.h>
>>>>   #include <asm/irq_regs.h>
>>>>   #include <asm/desc.h>
>>>> +#include <asm/io.h>
>>>>
>>>>   /* TDX module Call Leaf IDs */
>>>>   #define TDX_GET_INFO                   1
>>>> @@ -680,8 +681,15 @@ static bool try_accept_one(phys_addr_t *start,
>>>> unsigned long len,
>>>>    */
>>>>   static bool tdx_enc_status_changed(unsigned long vaddr, int numpages,
>>>> bool enc)
>>>>   {
>>>> -       phys_addr_t start = __pa(vaddr);
>>>> -       phys_addr_t end   = __pa(vaddr + numpages * PAGE_SIZE);
>>>> +       phys_addr_t start;
>>>> +       phys_addr_t end;
>>>> +
>>>> +       if (is_vmalloc_addr((void *)vaddr))
>>>> +               start =  page_to_phys(vmalloc_to_page((void*)vaddr));
>>>> +       else
>>>> +               start = __pa(vaddr);
>>>> +
>>>> +       end = start + numpages * PAGE_SIZE;
>>>>
>>>>          if (!enc) {
>>>>                  /* Set the shared (decrypted) bits: */
>>>
>>> Looks set_memory_decrypted() only works for direct-mapping, so you should not
>>> use this. Instead, you can pass shared bit in 'prot' argument (using
>>> pgprot_decrypted()) when you call vmap(), and explicitly call MapGPA().
>>
>> Is it because of the above change, or you see other direct-mapping
>> dependencies in set_memory_*() functions?
>>
>>
>>
> set_memory_xx() is supposedly only for direct-mapping. Please use my
> suggestion above.

I did not find any other direct-mapping dependency in the set_memory_*()
functions other than what I have fixed. If I missed anything, please let
me know.

Also, even if the set_memory_*() functions do not support vmalloc'ed
memory, IMO it is better to add this support to them.

I want to avoid a custom solution if it is possible to use a generic
function.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-05-09 03:04:42

by Dave Hansen

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On 5/5/22 15:15, Kai Huang wrote:
> set_memory_xx() is supposedly only for direct-mapping. Please use my
> suggestion above.

Kai, please take a look at some of the other users, especially
set_memory_x(). See how long the "supposed" requirement holds up.

That said, I've forgotten by now if this _could_ have used vmalloc() or
vmap() or vmap_pfn(). None of the logic about why or how the allocator
and mapping design decisions were made. Could that be rectified for
the next post?

2022-05-09 04:59:45

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Thu, 2022-05-05 at 12:03 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai,
>
> On 5/5/22 3:50 AM, Kai Huang wrote:
> >
> > > +	/* Submit GetQuote Request */
> > > +	ret = tdx_get_quote_hypercall(buf);
> > > +	if (ret) {
> > > +		pr_err("GetQuote hypercall failed, status:%lx\n", ret);
> > > +		ret = -EIO;
> > > +		goto free_entry;
> > > +	}
> > > +
> > > +	/* Add current quote entry to quote_list */
> > > +	add_quote_entry(entry);
> > > +
> > > +	/* Wait for attestation completion */
> > > +	ret = wait_for_completion_interruptible(&entry->compl);
> > > +	if (ret < 0) {
> > > +		ret = -EIO;
> > > +		goto del_entry;
> > > +	}
> >
> > This is misuse of wait_for_completion_interruptible().
> >
> > xxx_interruptible() essentially means the operation can be interrupted by a
> > signal. Using xxx_interruptible() in a driver IOCTL essentially means that
> > when it returns due to a signal, the IOCTL should return -EINTR to let
> > userspace know that the application received a signal that needs handling
> > and that the IOCTL isn't finished, so it should retry. So here we should
> > return -EINTR (and clean up all the stuff that has been done) when
> > wait_for_completion_interruptible() returns -ERESTARTSYS (in fact, it
> > returns only -ERESTARTSYS or 0).
>
>
> But in this case, I was expecting the user agent to check the Quote
> buffer status code to understand the IN_FLIGHT, SUCCESS or FAILURE
> status and handle it appropriately. So IMO, it should not matter what
> we return for the failure case. For the IN_FLIGHT case, they can retry
> if they want after checking the status code.

A couple of issues with your statement:

1) When wait_for_completion_interruptible() returns an error, you never
copied the data back to userspace. Therefore userspace cannot check the
buffer status. Your assumption is wrong.
2) Even if you copy the data back to userspace, there's no guarantee that
*after* you do the copy, the VMM won't update the buffer status. So this
doesn't work.
3) You only provide one IOCTL. Even if userspace retries, you will create
a new buffer and ask the VMM to do it again. Then what happens to the old
buffer? The VMM still owns it, and can update it, but you have already
converted it back to private and freed it.

I really don't see how your statement is correct.

>
> But I agree that EINTR is the appropriate return value for an
> interrupted case. So, I will change it.
>
> >
> > Since normally userspace application just ignore signals, and in this particular
> > case, asking userspace to retry just makes things more complicated to handle, I
>
> I am not sure how the user agent is going to be implemented. So I don't
> want to make any assumptions. In this case, we are not asking user space
> to implement the retry support using signals. But we are just giving
> them option to do it. It is up to them if they want to use it.

As far as I can see, you are not providing any functionality to allow
userspace to retry.



--
Thanks,
-Kai



2022-05-09 05:17:29

by Dave Hansen

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On 5/6/22 04:00, Kai Huang wrote:
> However I may have missed something. Kirill please help to confirm if you see
> this.

Kai, thanks for taking a more thorough look through this.

No matter what mechanism is used here, I'd like to see some commit
message material about the testing that was performed, including
experimental confirmation that it doesn't affect the direct map.

2022-05-09 07:21:28

by Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Fri, May 06, 2022 at 11:00:29PM +1200, Kai Huang wrote:
> However I may have missed something. Kirill please help to confirm if you see
> this.

That's correct. set_memory_decrypted() will change the direct mapping as
well even if called on a vmap() address. I should have caught it before.
Sorry.

--
Kirill A. Shutemov

2022-05-09 10:03:14

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > Kirill, what's your opinion?
>
> I said before that I think DMA API is the right tool here.
>
> Speculation about future of DMA in TDX is irrelevant here. If semantics
> change we will need to re-evaluate all users. VirtIO uses DMA API and it
> is conceptually the same use-case: communicate with the host.

Virtio is designed for device drivers to use, so it's fine for it to use
the DMA API, and real DMA can happen to the virtio DMA buffers.
Attestation has no such assumption.

The DMA API has its limitations. I don't see the protocol to convert a GPA
from private to shared as that complicated (see below), and I think that,
regardless of this attestation use case, it's perhaps worth providing an
additional simple infrastructure for such cases so it can be used when
needed.

>
> But vmap() + set_memory_decrypted() also works and Sathya already has code
> for it. I'm fine with this.

Dave said (again) he wanted to avoid breaking up the direct mapping.

https://lore.kernel.org/lkml/[email protected]/T/#m37778b8af5d72c3db79e3cfa4b46ee37836f771c

So we either use the DMA API (which already breaks up the direct mapping
for swiotlb), or we use vmap() + MapGPA() as I mentioned below.

>
> Going a step below to manual MapGPA() is just wrong. We introduced
> abstractions for a reason. The protocol for changing GPA status is not
> trivial. We should not spread it across the whole kernel codebase.
>

I kinda disagree with this. In fact, the protocol for changing GPA status
isn't that complicated. A TD guest can have both private and shared
mappings to the same GPA simultaneously. We don't need to change all the
mappings when converting a private page to shared.

For instance, we can use vmap() to create a shared mapping to a page while
the page is still mapped as private in the direct mapping. The TD uses
MapGPA() to tell the VMM which mapping it wants to use, and it just needs
to guarantee that the private (direct) mapping won't be used. A
speculative fetch through the direct mapping is fine, as long as the core
won't consume the data. The only thing we need to guarantee is that we
flush any dirty cachelines before MapGPA(). My understanding is we don't
even need to flush the TLB of the private mapping.

And reading the GHCI MapGPA() section again, to me MapGPA() itself requires
the VMM to guarantee the TLB and cache flush:

"
If the GPA (range) was already mapped as an active, private page, the host VMM
may remove the private page from the TD by following the “Removing TD Private
Pages” sequence in the Intel TDX-module specification [3] to safely block the
mapping(s), flush the TLB and cache, and remove the mapping(s). The VMM is
designed to be able to then map the specified GPA (range) in the shared-EPT
structure and allow the TD to access the page(s) as a shared GPA (range).
"

You can see that the cache flush is guaranteed by the VMM.

Btw, the use of the word "may" in "host VMM may remove..." in the above
paragraph is horrible. It should use "must", just like in the "converting
shared to private" case:

"
If the Start GPA specified is a private GPA (GPA.S bit is clear), this MapGPA
TDG.VP.VMCALL can be used to help request the host VMM map the specific, private
page(s) (which mapping may involve converting the backing-physical page from a
shared page to a private page). As intended in this case, the VMM must unmap the
GPA from the shared-EPT region and invalidate the TLB and caches for the TD
vcpus to help ensure no stale mappings and cache contents exist.
"

As you can see "must" is used in "the VMM must unmap the GPA from the shared-EPT
...".

So I don't see why the TD guest kernel cannot have a simple protocol to
vmap() a page (or a couple of pages) as shared on-demand, like below:

page = alloc_page(GFP_KERNEL);

addr = vmap(&page, 1, VM_MAP, pgprot_decrypted(PAGE_KERNEL));

clflush_cache_range(page_address(page), PAGE_SIZE);

MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);

And we can even avoid the above clflush_cache_range() if I understand
correctly.

Or I missed something?


--
Thanks,
-Kai



2022-05-09 12:33:43

by Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > Kirill, what's your opinion?
> >
> > I said before that I think DMA API is the right tool here.
> >
> > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > is conceptually the same use-case: communicate with the host.
>
> Virtio is designed for device driver to use, so it's fine to use DMA API. And
> real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
> assumption.

Whether the attestation driver uses a struct device is an implementation
detail. I don't see what your point is.

> The DMA API has its limitations.

Could you elaborate here?


> So I don't see why the TD guest kernel cannot have a simple protocol to
> vmap() a page (or a couple of pages) as shared on-demand, like below:
>
> page = alloc_page(GFP_KERNEL);
>
> addr = vmap(&page, 1, VM_MAP, pgprot_decrypted(PAGE_KERNEL));
>
> clflush_cache_range(page_address(page), PAGE_SIZE);
>
> MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
>
> And we can even avoid the above clflush_cache_range() if I understand
> correctly.
>
> Or I missed something?

For completeness, cover the free path too. Are you going to open-code page
accept too?

Private->Shared conversion is destructive. You have to split the SEPT and
flush the TLB. The backward conversion is even more costly.

Rule of thumb is avoid conversion where possible. DMA API is there for
you.
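
(For reference, the DMA API route would look roughly like the below --
device setup is omitted, and the buffer size name is illustrative:

	buf = dma_alloc_coherent(dev, GET_QUOTE_BUF_SIZE, &handle, GFP_KERNEL);
	...
	dma_free_coherent(dev, GET_QUOTE_BUF_SIZE, buf, handle);

With force_dma_unencrypted() returning true in a TD, the allocation comes
back already shared with the host, and the free path converts it back
without the driver open-coding any of the conversion protocol.)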

--
Kirill A. Shutemov

2022-05-09 14:19:14

by Dave Hansen

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On 5/9/22 05:09, Kirill A. Shutemov wrote:
> Private->Shared conversion is destructive. You have to split SEPT, flush
> TLB. Backward conversion even more costly.
>
> Rule of thumb is avoid conversion where possible. DMA API is there for
> you.

Kirill, I understand that the DMA API is a quick fix today. But is it
_really_ the right long-term interface?

There will surely come a time when TDX I/O devices won't be using fixed
bounce buffers. What will the quote generation code do then? How will
we know to come back around and fix this up?

Does SEV or the s390 ultravisor need anything like this?

2022-05-09 15:43:57

by Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Mon, May 09, 2022 at 07:14:20AM -0700, Dave Hansen wrote:
> On 5/9/22 05:09, Kirill A. Shutemov wrote:
> > Private->Shared conversion is destructive. You have to split SEPT, flush
> > TLB. Backward conversion even more costly.
> >
> > Rule of thumb is avoid conversion where possible. DMA API is there for
> > you.
>
> Kirill, I understand that the DMA API is a quick fix today. But is it
> _really_ the right long-term interface?

Yes, I think so.

> There will surely come a time when TDX I/O devices won't be using fixed
> bounce buffers. What will the quote generation code do then? How will
> we know to come back around and fix this up?

VirtIO will not go away with TDX I/O in the picture. TDX I/O will be an
addition to the existing stuff, not a replacement.

And we have hooks in place to accommodate this: force_dma_unencrypted()
will return false for devices capable of TDX I/O, while the rest of the
devices, including VirtIO and attestation, keep using the existing paths
with swiotlb.

> Does SEV or the s390 ultravisor need anything like this?

At a quick glance, sev-guest.c uses set_memory_decrypted()/encrypted() for
allocating and freeing shared memory. I consider that inferior to using
the DMA API.

--
Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support



On 5/9/22 8:35 AM, Kirill A. Shutemov wrote:
> On Mon, May 09, 2022 at 07:14:20AM -0700, Dave Hansen wrote:
>> On 5/9/22 05:09, Kirill A. Shutemov wrote:
>>> Private->Shared conversion is destructive. You have to split SEPT, flush
>>> TLB. Backward conversion even more costly.
>>>
>>> Rule of thumb is avoid conversion where possible. DMA API is there for
>>> you.
>>
>> Kirill, I understand that the DMA API is a quick fix today. But is it
>> _really_ the right long-term interface?
>
> Yes, I think so.
>
>> There will surely come a time when TDX I/O devices won't be using fixed
>> bounce buffers. What will the quote generation code do then? How will
>> we know to come back around and fix this up?
>
> VirtIO will not go away with TDX I/O in picture. TDX I/O will be addition
> to existing stuff, not replacement.
>
> And we have hooks in place to accommodate this: force_dma_unencrypted()
> will return false for devices capable of TDX I/O. While the rest of
> devices, including VirtIO and attestation, keep using existing paths with
> swiotlb.
>
>> Does SEV or the s390 ultravisor need anything like this?
>
> At quick glance sev-guest.c uses set_memory_decrypted()/encrypted() for
> allocation and freeing shared memory. I consider it inferior to using DMA
> API.

Following is the link for the SEV attestation driver. It does seem to
use alloc_pages() and set_memory_*() calls.

https://lore.kernel.org/lkml/[email protected]/

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support



On 5/9/22 4:54 PM, Kai Huang wrote:
> And as I suggested before, we can allocate a default-size buffer (e.g. 4
> pages), which is large enough to cover all requests for now, during driver
> initialization. This avoids the IOCTL-time conversion. We should still have
> code in the IOCTL to check the request buffer size, and when it is larger
> than the default, the old buffer should be freed and a larger one allocated.
> But for now this code path will never happen.
>
> Btw above is based on assumption that we don't support concurrent IOCTLs. This
> version Sathya somehow changed to support concurrent IOCTLs but this was a
> surprise as I thought we somehow agreed we don't need to support this.

Yes. Initially, I did not want to add this support, to keep the code
simple. But the recent changes (to handle cases like cleaning up the
VMM-owned page after the user prematurely ends the current request, and
to support the vmap() model) have already made the code more complicated.
Since the current framework makes it easy to extend the code to support
concurrent GetQuote requests, I have just enabled this support.


>
> Anyway, now I don't have strong opinion here. To me alloc_pages() +
> set_memory_decrypted() is also fine (seems AMD is using this anyway). Will let
> Dave to decide.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-05-10 02:58:47

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Tue, 2022-05-10 at 04:30 +0300, Kirill A. Shutemov wrote:
> On Tue, May 10, 2022 at 11:54:12AM +1200, Kai Huang wrote:
> > On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > > Kirill, what's your opinion?
> > > > >
> > > > > I said before that I think DMA API is the right tool here.
> > > > >
> > > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > > is conceptually the same use-case: communicate with the host.
> > > >
> > > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > > real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
> > > > assumption.
> > >
> > > Whether attestation driver uses struct device is implementation detail.
> > > I don't see what is you point.
> >
> > No real DMA is involved in attestation.
>
> As with VirtIO. So what?
>

Real DMA doesn't necessarily happen to virtio buffers either. Consider
DPDK, which uses virtio + vhost-user. But I don't want to argue about
this topic anymore. :)

--
Thanks,
-Kai



2022-05-10 10:05:26

by Kirill A. Shutemov

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Tue, May 10, 2022 at 11:54:12AM +1200, Kai Huang wrote:
> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > Kirill, what's your opinion?
> > > >
> > > > I said before that I think DMA API is the right tool here.
> > > >
> > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > is conceptually the same use-case: communicate with the host.
> > >
> > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
> > > assumption.
> >
> > Whether attestation driver uses struct device is implementation detail.
> > I don't see what is you point.
>
> No real DMA is involved in attestation.

As with VirtIO. So what?

--
Kirill A. Shutemov

2022-05-10 11:19:09

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > Kirill, what's your opinion?
> > >
> > > I said before that I think DMA API is the right tool here.
> > >
> > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > is conceptually the same use-case: communicate with the host.
> >
> > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
> > assumption.
>
> Whether attestation driver uses struct device is implementation detail.
> I don't see what is you point.

No real DMA is involved in attestation.

>
> > So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
> > page (or couple of pages) as shared on-demand, like below:
> >
> > page = alloc_page();
> >
> > addr = vmap(page, pgprot_decrypted(PAGE_KERNEL));
> >
> > clflush_cache_range(page_address(page), PAGE_SIZE);
> >
> > MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
> >
> > And we can even avoid above clflush_cache_range() if I understand correctly.
> >
> > Or I missed something?
>
> For completeness, cover free path too. Are you going to opencode page
> accept too?

Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
the page back to private. I don't see any problem with that.

>
> Private->Shared conversion is destructive. You have to split SEPT, flush
> TLB. Backward conversion even more costly.

I wouldn't call it destructive.

And as I suggested before, we can allocate a default-size buffer (i.e. 4
pages), which is large enough to cover all requests for now, during driver
initialization. This avoids IOCTL-time conversion. We should still have code
in the IOCTL to check the request buffer size; when it is larger than the
default, the old buffer should be freed and a larger one allocated. But for
now this code path will never be taken.

Btw, the above is based on the assumption that we don't support concurrent
IOCTLs. In this version Sathya changed the code to support concurrent
IOCTLs, which was a surprise, as I thought we had agreed we don't need to
support this.

Anyway, I don't have a strong opinion here anymore. To me alloc_pages() +
set_memory_decrypted() is also fine (it seems AMD is using this anyway). I
will let Dave decide.

--
Thanks,
-Kai

2022-05-10 13:09:18

by Kai Huang

Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

On Tue, 2022-05-10 at 11:54 +1200, Kai Huang wrote:
> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
> > On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
> > > On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
> > > > > Kirill, what's your opinion?
> > > >
> > > > I said before that I think DMA API is the right tool here.
> > > >
> > > > Speculation about future of DMA in TDX is irrelevant here. If semantics
> > > > change we will need to re-evaluate all users. VirtIO uses DMA API and it
> > > > is conceptually the same use-case: communicate with the host.
> > >
> > > Virtio is designed for device driver to use, so it's fine to use DMA API. And
> > > real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
> > > assumption.
> >
> > Whether attestation driver uses struct device is implementation detail.
> > I don't see what is you point.
>
> No real DMA is involved in attestation.
>
> >
> > > So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
> > > page (or couple of pages) as shared on-demand, like below:
> > >
> > > page = alloc_page();
> > >
> > > addr = vmap(page, pgprot_decrypted(PAGE_KERNEL));
> > >
> > > clflush_cache_range(page_address(page), PAGE_SIZE);
> > >
> > > MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
> > >
> > > And we can even avoid above clflush_cache_range() if I understand correctly.
> > >
> > > Or I missed something?
> >
> > For completeness, cover free path too. Are you going to opencode page
> > accept too?
>
> Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
> back to private. I don't think there is any problem?
>
> >
> > Private->Shared conversion is destructive. You have to split SEPT, flush
> > TLB. Backward conversion even more costly.
>
> I think I won't call it destructive.
>
> And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
> which is large enough to cover all requests for now, during driver
> initialization. This avoids IOCTL time conversion. We should still have code
> in the IOCTL to check the request buffer size and when it is larger than the
> default, the old should be freed a larger one should be allocated. But for now
> this code path will never happen.
>
> Btw above is based on assumption that we don't support concurrent IOCTLs. This
> version Sathya somehow changed to support concurrent IOCTLs but this was a
> surprise as I thought we somehow agreed we don't need to support this.

Hi Dave,

Sorry, I forgot to mention that GHCI 1.5 defines a generic TDVMCALL<Service>
for a TD to communicate with the VMM, another TD, or some service in the
host. This TDVMCALL can support many sub-commands. For now only sub-commands
for TD migration are defined, but we can have more.

For this, we cannot assume the size of the command buffer, and I don't see
why we wouldn't want to support concurrent TDVMCALLs. So it looks like, in
the long term, we will very likely need IOCTL-time buffer private/shared
conversion.


--
Thanks,
-Kai



Subject: Re: [PATCH v5 3/3] x86/tdx: Add Quote generation support

Hi Dave,

On 5/10/22 3:42 AM, Kai Huang wrote:
> On Tue, 2022-05-10 at 11:54 +1200, Kai Huang wrote:
>> On Mon, 2022-05-09 at 15:09 +0300, Kirill A. Shutemov wrote:
>>> On Mon, May 09, 2022 at 03:37:22PM +1200, Kai Huang wrote:
>>>> On Sat, 2022-05-07 at 03:42 +0300, Kirill A. Shutemov wrote:
>>>>> On Fri, May 06, 2022 at 12:11:03PM +1200, Kai Huang wrote:
>>>>>> Kirill, what's your opinion?
>>>>>
>>>>> I said before that I think DMA API is the right tool here.
>>>>>
>>>>> Speculation about future of DMA in TDX is irrelevant here. If semantics
>>>>> change we will need to re-evaluate all users. VirtIO uses DMA API and it
>>>>> is conceptually the same use-case: communicate with the host.
>>>>
>>>> Virtio is designed for device driver to use, so it's fine to use DMA API. And
>>>> real DMA can happen to the virtio DMA buffers. Attestation doesn't have such
>>>> assumption.
>>>
>>> Whether attestation driver uses struct device is implementation detail.
>>> I don't see what is you point.
>>
>> No real DMA is involved in attestation.
>>
>>>
>>>> So I don't see why TD guest kernel cannot have a simple protocol to vmap() a
>>>> page (or couple of pages) as shared on-demand, like below:
>>>>
>>>> page = alloc_page();
>>>>
>>>> addr = vmap(page, pgprot_decrypted(PAGE_KERNEL));
>>>>
>>>> clflush_cache_range(page_address(page), PAGE_SIZE);
>>>>
>>>> MapGPA(page_to_phys(page) | cc_mkdec(0), PAGE_SIZE);
>>>>
>>>> And we can even avoid above clflush_cache_range() if I understand correctly.
>>>>
>>>> Or I missed something?
>>>
>>> For completeness, cover free path too. Are you going to opencode page
>>> accept too?
>>
>> Call __tdx_module_call(TDX_ACCEPT_PAGE, ...) right after MapGPA() to convert
>> back to private. I don't think there is any problem?
>>
>>>
>>> Private->Shared conversion is destructive. You have to split SEPT, flush
>>> TLB. Backward conversion even more costly.
>>
>> I think I won't call it destructive.
>>
>> And I suggested before, we can allocate a default size buffer (i.e. 4 pages),
>> which is large enough to cover all requests for now, during driver
>> initialization. This avoids IOCTL time conversion. We should still have code
>> in the IOCTL to check the request buffer size and when it is larger than the
>> default, the old should be freed a larger one should be allocated. But for now
>> this code path will never happen.
>>
>> Btw above is based on assumption that we don't support concurrent IOCTLs. This
>> version Sathya somehow changed to support concurrent IOCTLs but this was a
>> surprise as I thought we somehow agreed we don't need to support this.
>
> Hi Dave,
>
> Sorry I forgot to mention that GHCI 1.5 defines a generic TDVMCALL<Service> for
> a TD to communicate with VMM or another TD or some service in the host. This
> TDVMCALL can support many sub-commands. For now only sub-commands for TD
> migration is defined, but we can have more.
>
> For this, we cannot assume the size of the command buffer, and I don't see why
> we don't want to support concurrent TDVMCALLs. So looks from long term, we will
> very likely need IOCTL time buffer private-shared conversion.
>
>


Let me summarize the discussion so far.

Problem: Allocate shared buffer without breaking the direct map.

Solution 1: Use alloc_pages*()/vmap()/set_memory_*crypted() APIs

Pros/Cons:

1. Uses the virtually mapped address for the shared/private conversion
   and hence does not touch the direct mapping.

2. The current version of the set_memory_*crypted() APIs modifies the
   aliased mappings, which include the direct mapping. So if we want
   to use the set_memory_*() APIs, we need a new variant that does not
   touch the direct mapping. An alternative is to open code the page
   attribute conversion, cache flushing and MapGPA/page acceptance
   logic in the attestation driver itself. But, IMO, this is not
   preferable, because sprinkling the mapping conversion code in
   multiple places in the kernel is undesirable. It is better to use a
   single API if possible.

3. This solution can possibly break the SEPT entries on private->shared
   conversion. The backward conversion is also costly. IMO, since
   attestation requests are not very frequent, we don't need to be
   overly concerned about the cost of the conversion.

Solution 2: Use DMA alloc APIs.

Pros/Cons:

1. Simpler to use. It taps into the SWIOTLB buffers and does not
   affect the direct map. Since we would be using already converted
   memory, allocation/freeing would be cheaper compared to Solution 1.

2. There is a concern that it is not a long-term solution, since with
   the advent of TDX IO, not all DMA allocations will need to use the
   SWIOTLB model. But per Kirill's comments, this is not a concern,
   and the force_dma_unencrypted() hook can be used to differentiate
   which devices use the TDX IO vs. SWIOTLB model.

3. Using the DMA APIs requires a valid bus device as an argument, and
   hence requires converting this driver into a platform device
   driver. But since this driver does not do real DMA, making such
   changes just to use the DMA API is not recommended.

Since both solutions fix the problem (and each has its pros/cons), and
the conclusion from Kai's and Kirill's comments is that there is no
hard preference, the decision is left to you.

Since you have already commented that, irrespective of which model is
chosen, the commit log needs to describe the solution and how it avoids
touching the direct map, I have posted the v6 version adopting
Solution 1.

Please let me know if you agree with this direction or have comments
about the solution.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer