Subject: [PATCH v8 0/5] Add TDX Guest Attestation support

Hi All,

Intel's Trust Domain Extensions (TDX) protect guest VMs from malicious
hosts and some physical attacks. VM guest with TDX support is called
as TD Guest.

In TD Guest, the attestation process is used to verify the
trustworthiness of TD guest to the 3rd party servers. Such attestation
process is required by 3rd party servers before sending sensitive
information to TD guests. One usage example is to get encryption keys
from the key server for mounting the encrypted rootfs or secondary drive.

Following patches add the attestation support to TDX guest which
includes attestation user interface driver and related hypercall support.

Any distribution enabling TDX is also expected to need attestation. So
enable it by default with TDX guest support. The compiled size is
quite small (~500 bytes).

Changes since v7:
* Removed exports symbols for tdx_setup_ev_notify_handler() and
tdx_remove_ev_notify_handler().
* Changed "struct quote_buf *buf" member in "struct quote_entry"
from pointer to embed object.
* Rebased on top of v5.19-rc1.

Changes since v6:
* Fixed race between wait_for_completion_*() and
quote_callback_handler() in tdx_get_quote() when user terminates the
request.
* Fixed commit log and comments.

Changes since v5:
* Added support for parallel GetQuote requests.
* Add noalias variants of set_memory_*crypted() functions to
changes page attribute without touching direct map.
* Made set_memory_*crypted() functions vmalloc address compatible.
* Use vmap()/set_memory_*crypted() functions to share/unshare
memory without touching the direct map.
* Add support to let driver handle the memory cleanup for the
early termination of user requests.
* Removed unused headers in attest.c
* Fixed commit log and comments as per review comments.

Changes since v4:
* Removed platform driver model in attestation driver and used
miscdevice and initcall approach.
* Since dma_alloc*() APIs require a valid device reference,
replaced it with __get_free_pages() and set_memory_decrypted()
for quote memory allocation.
* Removed tdx_mcall_tdreport() and moved TDG.MR.REPORT TDCALL code
to tdx_get_report().
* Used kmalloc() for TDREPORT memory allocation instead of
get_zeroed_page().
* Returned -EINVAL in default case of tdx_attest_ioctl().
* Added struct tdx_report_req to explicitly mention the
TDX_CMD_GET_REPORT IOCTL argument.
* Removed tdx_get_quote_hypercall() and moved hypercall code to
attestation driver itself.
* Removed GetQuote timeout support (since it is not defined in
spec)
* Added support to check for spurious callback interrupt in GetQuote
request.
* Fixed commit log and comments as per review suggestions.


Changes since v3:
* Moved the attestation driver from platform/x86 to arch/x86/coco/tdx/ and
renamed intel_tdx_attest.c to attest.c.
* Dropped CONFIG_INTEL_TDX_ATTESTATION and added support to compile
attestation changes with CONFIG_INTEL_TDX_GUEST option.
* Merged patch titled "x86/tdx: Add tdx_mcall_tdreport() API support" and
"platform/x86: intel_tdx_attest: Add TDX Guest attestation interface" into
a single patch.
* Moved GetQuote IOCTL support changes from patch titled "platform/x86:
intel_tdx_attest: Add TDX Guest attestation interface driver" to a
separate patch.
* Removed 8K size restriction when requesting quote, and added support
to let userspace decide the quote size.
* Added support to allow attestation agent configure quote generation
timeout value.
* Fixed commit log and comments as per review comments.

Changes since v2:
* As per Han's suggestion, modified the attestation driver to use
platform device driver model.
* Modified tdx_hcall_get_quote() and tdx_mcall_tdreport() APIs to
return TDCALL error code instead of generic error info (like -EIO).
* Removed attestation test app patch from this series to simplify
the patchset and review process. Test app patches will be submitted
once attestation support patches are merged.
* Since patches titled "x86/tdx: Add SetupEventNotifyInterrupt TDX
hypercall support" and "x86/tdx: Add TDX Guest event notify
interrupt vector support" are related, combining them into a
single patch.

Changes since v1:
* Moved test driver from "tools/tdx/attest/tdx-attest-test.c" to
"tools/arch/x86/tdx/attest/tdx-attest-test.c" as per Hans review
suggestion.
* Minor commit log and comment fixes in patches titled
"x86/tdx: Add tdx_mcall_tdreport() API support" and "x86/tdx:
Add tdx_hcall_get_quote() API support"
* Extended tdx_hcall_get_quote() API to accept GPA length as argument
to accomodate latest TDQUOTE TDVMCALL related specification update.
* Added support for tdx_setup_ev_notify_handler() and
tdx_remove_ev_notify_handler() in patch titled "x86/tdx: Add TDX
Guest event notify interrupt vector support"


Kuppuswamy Sathyanarayanan (5):
x86/tdx: Add TDX Guest attestation interface driver
x86/tdx: Add TDX Guest event notify interrupt support
x86/mm: Make tdx_enc_status_changed() vmalloc address compatible
x86/mm: Add noalias variants of set_memory_*crypted() functions
x86/tdx: Add Quote generation support

arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/attest.c | 422 +++++++++++++++++++++++++++++
arch/x86/coco/tdx/tdx.c | 82 +++++-
arch/x86/include/asm/hardirq.h | 3 +
arch/x86/include/asm/idtentry.h | 4 +
arch/x86/include/asm/irq_vectors.h | 7 +-
arch/x86/include/asm/set_memory.h | 2 +
arch/x86/include/asm/tdx.h | 4 +
arch/x86/include/uapi/asm/tdx.h | 87 ++++++
arch/x86/kernel/irq.c | 7 +
arch/x86/mm/pat/set_memory.c | 26 +-
11 files changed, 636 insertions(+), 10 deletions(-)
create mode 100644 arch/x86/coco/tdx/attest.c
create mode 100644 arch/x86/include/uapi/asm/tdx.h

--
2.25.1


Subject: [PATCH v8 3/5] x86/mm: Make tdx_enc_status_changed() vmalloc address compatible

set_memory_*crypted() APIs are used to change encryption or decryption
page attributes for the given address. It also by default support the
conversion for the vmalloc'ed memory address.

In TDX Guest, tdx_enc_status_changed() function is triggered by
set_memory_*crypted() APIs when converting memory from/to shared or
private. Internally this function uses __pa() for physical address
conversion, which breaks the vmalloc address compatibility of the
set_memory_*crypted() APIs.

So add support to fix the vmalloc'ed address compatibility issue.

Acked-by: Wander Lairson Costa <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 94542876f26a..445f58547776 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -15,6 +15,7 @@
#include <asm/idtentry.h>
#include <asm/irq_regs.h>
#include <asm/desc.h>
+#include <asm/io.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
@@ -678,8 +679,14 @@ static bool try_accept_one(phys_addr_t *start, unsigned long len,
*/
static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
{
- phys_addr_t start = __pa(vaddr);
- phys_addr_t end = __pa(vaddr + numpages * PAGE_SIZE);
+ phys_addr_t start, end;
+
+ if (is_vmalloc_addr((void *)vaddr))
+ start = vmalloc_to_pfn((void *) vaddr) << PAGE_SHIFT;
+ else
+ start = __pa(vaddr);
+
+ end = start + numpages * PAGE_SIZE;

if (!enc) {
/* Set the shared (decrypted) bits: */
--
2.25.1

Subject: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

In TDX guest, attestation is used to verify the trustworthiness of a TD
to other entities before provisioning secrets to the TD.

One usage example is, when a TD guest uses encrypted drive and if the
decryption keys required to access the drive are stored in a secure 3rd
party keyserver, the key server can use attestation to verify TD's
trustworthiness and release the decryption keys to the TD.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

TDREPORT (TDREPORT_STRUCT) is a fixed-size data structure generated by
the TDX module which contains TD-specific information (such as TD
measurements), platform security version, and the MAC to protect the
integrity of the TDREPORT. The TD kernel uses TDCALL[TDG.MR.REPORT] to
get the TDREPORT from the TDX module. A user-provided 64-Byte
REPORTDATA is used as input and included in the TDREPORT. Typically it
can be some nonce provided by attestation service so the TDREPORT can
be verified uniquely. More details about TDREPORT can be found in
Intel TDX Module specification, section titled "TDG.MR.REPORT Leaf".

TDREPORT can only be verified on local platform as the MAC key is bound
to the platform. To support remote verification of the TDREPORT, TDX
leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT locally
and convert it to a remote verifiable Quote.

After getting the TDREPORT, the second step of the attestation process
is to send it to the QE to generate the Quote. TDX doesn't support SGX
inside the TD, so the QE can be deployed in the host, or in another
legacy VM with SGX support. How to send the TDREPORT to QE and receive
the Quote is implementation and deployment specific.

Implement a basic attestation driver to allow TD userspace to get the
TDREPORT. The TD userspace attestation software can get the TDREPORT
and then choose whatever communication channel available (i.e. vsock)
to send the TDREPORT to QE and receive the Quote.

Also note that explicit access permissions are not enforced in this
driver because the quote and measurements are not a secret. However
the access permissions of the device node can be used to set any
desired access policy. The udev default is usually root access
only.

Operations like getting TDREPORT or Quote generation involves sending
a blob of data as input and getting another blob of data as output. It
was considered to use a sysfs interface for this, but it doesn't fit
well into the standard sysfs model for configuring values. It would be
possible to do read/write on files, but it would need multiple file
descriptors, which would be somewhat messy. IOCTLs seems to be the best
fitting and simplest model for this use case. Also, the REPORTDATA used
in TDREPORT generation can possibly come from attestation service to
uniquely verify the Quote (like per instance verification). In such
case, since REPORTDATA is a secret, using sysfs to share it is insecure
compared to sending it via IOCTL.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Kai Huang <[email protected]>
Acked-by: Wander Lairson Costa <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/Makefile | 2 +-
arch/x86/coco/tdx/attest.c | 118 ++++++++++++++++++++++++++++++++
arch/x86/include/uapi/asm/tdx.h | 42 ++++++++++++
3 files changed, 161 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/coco/tdx/attest.c
create mode 100644 arch/x86/include/uapi/asm/tdx.h

diff --git a/arch/x86/coco/tdx/Makefile b/arch/x86/coco/tdx/Makefile
index 46c55998557d..d2db3e6770e5 100644
--- a/arch/x86/coco/tdx/Makefile
+++ b/arch/x86/coco/tdx/Makefile
@@ -1,3 +1,3 @@
# SPDX-License-Identifier: GPL-2.0

-obj-y += tdx.o tdcall.o
+obj-y += tdx.o tdcall.o attest.o
diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
new file mode 100644
index 000000000000..24db0bad4923
--- /dev/null
+++ b/arch/x86/coco/tdx/attest.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * attest.c - TDX guest attestation interface driver.
+ *
+ * Implements user interface to trigger attestation process.
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *
+ */
+
+#define pr_fmt(fmt) "x86/tdx: attest: " fmt
+
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <asm/tdx.h>
+#include <uapi/asm/tdx.h>
+
+#define DRIVER_NAME "tdx-attest"
+
+/* TDREPORT module call leaf ID */
+#define TDX_GET_REPORT 4
+
+static struct miscdevice miscdev;
+
+static long tdx_get_report(void __user *argp)
+{
+ void *reportdata = NULL, *tdreport = NULL;
+ long ret;
+
+ /* Allocate buffer space for REPORTDATA */
+ reportdata = kmalloc(TDX_REPORTDATA_LEN, GFP_KERNEL);
+ if (!reportdata)
+ return -ENOMEM;
+
+ /* Allocate buffer space for TDREPORT */
+ tdreport = kmalloc(TDX_REPORT_LEN, GFP_KERNEL);
+ if (!tdreport) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Copy REPORTDATA from the user buffer */
+ if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ /*
+ * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
+ *
+ * Get the TDREPORT using REPORTDATA as input. Refer to
+ * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
+ * Specification for detailed information.
+ */
+ ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
+ virt_to_phys(reportdata), 0, 0, NULL);
+ if (ret) {
+ pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
+ ret = -EIO;
+ goto out;
+ }
+
+ /* Copy TDREPORT back to the user buffer */
+ if (copy_to_user(argp, tdreport, TDX_REPORT_LEN))
+ ret = -EFAULT;
+
+out:
+ kfree(reportdata);
+ kfree(tdreport);
+ return ret;
+}
+
+static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ void __user *argp = (void __user *)arg;
+ long ret = -EINVAL;
+
+ switch (cmd) {
+ case TDX_CMD_GET_REPORT:
+ ret = tdx_get_report(argp);
+ break;
+ default:
+ pr_debug("cmd %d not supported\n", cmd);
+ break;
+ }
+
+ return ret;
+}
+
+static const struct file_operations tdx_attest_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = tdx_attest_ioctl,
+ .llseek = no_llseek,
+};
+
+static int __init tdx_attestation_init(void)
+{
+ int ret;
+
+ /* Make sure we are in a valid TDX platform */
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return -EIO;
+
+ miscdev.name = DRIVER_NAME;
+ miscdev.minor = MISC_DYNAMIC_MINOR;
+ miscdev.fops = &tdx_attest_fops;
+
+ ret = misc_register(&miscdev);
+ if (ret) {
+ pr_err("misc device registration failed\n");
+ return ret;
+ }
+
+ return 0;
+}
+device_initcall(tdx_attestation_init)
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
new file mode 100644
index 000000000000..8b57dea67eab
--- /dev/null
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_ASM_X86_TDX_H
+#define _UAPI_ASM_X86_TDX_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/* Length of the REPORTDATA used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORTDATA_LEN 64
+
+/* Length of TDREPORT used in TDG.MR.REPORT TDCALL */
+#define TDX_REPORT_LEN 1024
+
+/**
+ * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
+ *
+ * @reportdata : User-defined 64-Byte REPORTDATA to be included into
+ * TDREPORT. Typically it can be some nonce provided by
+ * attestation service, so the generated TDREPORT can be
+ * uniquely verified.
+ * @tdreport : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
+ * TDX_REPORT_LEN.
+ *
+ * Used in TDX_CMD_GET_REPORT IOCTL request.
+ */
+struct tdx_report_req {
+ union {
+ __u8 reportdata[TDX_REPORTDATA_LEN];
+ __u8 tdreport[TDX_REPORT_LEN];
+ };
+};
+
+/*
+ * TDX_CMD_GET_REPORT - Get TDREPORT using TDCALL[TDG.MR.REPORT]
+ *
+ * Return 0 on success, -EIO on TDCALL execution failure, and
+ * standard errno on other general error cases.
+ *
+ */
+#define TDX_CMD_GET_REPORT _IOWR('T', 0x01, struct tdx_report_req)
+
+#endif /* _UAPI_ASM_X86_TDX_H */
--
2.25.1

Subject: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

Host-guest event notification via configured interrupt vector is useful
in cases where a guest makes an asynchronous request and needs a
callback from the host to indicate the completion or to let the host
notify the guest about events like device removal. One usage example is,
callback requirement of GetQuote asynchronous hypercall.

In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
guest to specify which interrupt vector to use as an event-notify
vector to the VMM. Details about the SetupEventNotifyInterrupt
hypercall can be found in TDX Guest-Host Communication Interface
(GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
Add a tdx_hcall_set_notify_intr() helper function to implement the
SetupEventNotifyInterrupt hypercall.

Reserve 0xec IRQ vector address for TDX guest to receive the event
completion notification from VMM. Also add related IDT handler to
process the notification event.

Add support to track the notification event status via
/proc/interrupts.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Wander Lairson Costa <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/tdx.c | 71 ++++++++++++++++++++++++++++++
arch/x86/include/asm/hardirq.h | 3 ++
arch/x86/include/asm/idtentry.h | 4 ++
arch/x86/include/asm/irq_vectors.h | 7 ++-
arch/x86/include/asm/tdx.h | 4 ++
arch/x86/kernel/irq.c | 7 +++
6 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 03deb4d6920d..94542876f26a 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -11,6 +11,10 @@
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/apic.h>
+#include <asm/idtentry.h>
+#include <asm/irq_regs.h>
+#include <asm/desc.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
@@ -19,6 +23,7 @@

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004

/* MMIO direction */
#define EPT_READ 0
@@ -34,6 +39,26 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+ tdx_event_notify_handler = handler;
+}
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+ tdx_event_notify_handler = NULL;
+}
+
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
@@ -98,6 +123,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}

+/* TDX guest event notification handler */
+DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ inc_irq_stat(irq_tdx_event_notify_count);
+
+ if (tdx_event_notify_handler)
+ tdx_event_notify_handler();
+
+ ack_APIC_irq();
+
+ set_irq_regs(old_regs);
+}
+
+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * return 0 on success or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+ /* Minimum vector value allowed is 32 */
+ if (vector < 32)
+ return -EINVAL;
+
+ /*
+ * Register callback vector address with VMM. More details
+ * about the ABI can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec titled
+ * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+ */
+ if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+ return -EIO;
+
+ return 0;
+}
+
static u64 get_cc_mask(void)
{
struct tdx_module_output out;
@@ -688,5 +753,11 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;

+ alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
+ asm_sysvec_tdx_event_notify);
+
+ if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
+ pr_warn("Setting event notification interrupt failed\n");
+
pr_info("Guest detected\n");
}
diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..582deff56210 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -44,6 +44,9 @@ typedef struct {
unsigned int irq_hv_reenlightenment_count;
unsigned int hyperv_stimer0_count;
#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+ unsigned int irq_tdx_event_notify_count;
+#endif
} ____cacheline_aligned irq_cpustat_t;

DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..655086dd940e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_xen_hvm_callback);
DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_kvm_asyncpf_interrupt);
#endif

+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR, sysvec_tdx_event_notify);
+#endif
+
#undef X86_TRAP_OTHER

#endif
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..82ac0c0a34b1 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,12 @@
#define HYPERV_STIMER0_VECTOR 0xed
#endif

-#define LOCAL_TIMER_VECTOR 0xec
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+/* Vector on which TDX Guest event notification is delivered */
+#define TDX_GUEST_EVENT_NOTIFY_VECTOR 0xec
+#endif
+
+#define LOCAL_TIMER_VECTOR 0xeb

#define NR_VECTORS 256

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);

bool tdx_early_handle_ve(struct pt_regs *regs);

+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
#else

static inline void tdx_early_init(void) { };
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 766ffe3ba313..a96ecd866723 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
seq_printf(p, "%10u ",
irq_stats(j)->kvm_posted_intr_wakeup_ipis);
seq_puts(p, " Posted-interrupt wakeup event\n");
+#endif
+#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
+ seq_printf(p, "%*s: ", prec, "TGN");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ",
+ irq_stats(j)->irq_tdx_event_notify_count);
+ seq_puts(p, " TDX Guest event notification\n");
#endif
return 0;
}
--
2.25.1

Subject: [PATCH v8 5/5] x86/tdx: Add Quote generation support

In TDX guest, the second stage in attestation process is to send the
TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
not support communication channels like vsock or TCP/IP, implement
support to get TD Quote using hypercall. GetQuote hypercall can be used
by the TD guest to request VMM facilitate the Quote generation via
QE/QGS. More details about GetQuote hypercall can be found in TDX
Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
titled "TDG.VP.VMCALL<GetQuote>.

Since GetQuote is an asynchronous request hypercall, it will not block
till the TD Quote is generated. So VMM uses callback interrupt vector
configured by SetupEventNotifyInterrupt hypercall to notify the guest
about Quote generation completion or failure.

GetQuote TDVMCALL requires TD guest pass a 4K aligned shared buffer
with TDREPORT data as input, which is further used by the VMM to copy
the TD Quote result after successful Quote generation. To create the
shared buffer without breaking the direct map, allocate physically
contiguous kernel memory and create a virtual mapping for it using
vmap(). set_memory_*crypted_noalias() functions can be used to share or
unshare the vmapped page without affecting the direct map.

Also note that, shared buffer allocation is currently handled in IOCTL
handler, although it will increase the TDX_CMD_GET_QUOTE IOCTL response
time, it is negligible compared to the time required for the quote
generation completion. So IOCTL performance optimization is not
considered at this time.

For shared buffer allocation, alternatives like using the DMA API is
also considered. Although it simpler to use, it is not preferred because
dma_alloc_*() APIs require a valid bus device as argument, which would
need converting the attestation driver into a platform device driver.
This is unnecessary, and since the attestation driver does not do real
DMA, there is no need to use real DMA APIs.

Add support for TDX_CMD_GET_QUOTE IOCTL to allow attestation agent
submit GetQuote requests from the user space. Since Quote generation
is an asynchronous request, IOCTL will block indefinitely for the VMM
response in wait_for_completion_interruptible() call. Using this call
will also add an option for the user to end the current request
prematurely by raising any signals. This can be used by attestation
agent to implement Quote generation timeout feature. If attestation
agent is aware of time it can validly wait for QE/QGS response, then
a possible timeout support can be implemented in the user application
using signals. Quote generation timeout feature is currently not
implemented in the driver because the current TDX specification does
not have any recommendation for it.

After submitting the GetQuote request using hypercall, the shared buffer
allocated for the current request is owned by the VMM. So, during this
wait window, if the user terminates the request by raising a signal or
by terminating the application, add a logic to do the memory cleanup
after receiving the VMM response at a later time. Such memory cleanup
support requires accepting the page again using TDX_ACCEPT_PAGE TDX
Module call. So to not overload the callback IRQ handler, move the
callback handler logic to a separate work queue.

To support parallel GetQuote requests, use linked list to track the
active GetQuote requests and upon receiving the callback IRQ, loop
through the active requests and mark the processed requests complete.
Users can open multiple instances of the attestation device and send
GetQuote requests in parallel.

Reviewed-by: Tony Luck <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
---
arch/x86/coco/tdx/attest.c | 304 ++++++++++++++++++++++++++++++++
arch/x86/include/uapi/asm/tdx.h | 45 +++++
2 files changed, 349 insertions(+)

diff --git a/arch/x86/coco/tdx/attest.c b/arch/x86/coco/tdx/attest.c
index 24db0bad4923..a2a94622922c 100644
--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -13,16 +13,56 @@
#include <linux/miscdevice.h>
#include <linux/mm.h>
#include <linux/io.h>
+#include <linux/set_memory.h>
+#include <linux/mutex.h>
#include <asm/tdx.h>
+#include <asm/coco.h>
#include <uapi/asm/tdx.h>

#define DRIVER_NAME "tdx-attest"

/* TDREPORT module call leaf ID */
#define TDX_GET_REPORT 4
+/* GetQuote hypercall leaf ID */
+#define TDVMCALL_GET_QUOTE 0x10002
+
+/* Used for buffer allocation in GetQuote request */
+struct quote_buf {
+ /* vmapped address of kernel buffer (size is page aligned) */
+ void *vmaddr;
+ /* Number of pages */
+ int count;
+};
+
+/* List entry of quote_list */
+struct quote_entry {
+ /* Flag to check validity of the GetQuote request */
+ bool valid;
+ /* Kernel buffer to share data with VMM */
+ struct quote_buf buf;
+ /* Completion object to track completion of GetQuote request */
+ struct completion compl;
+ struct list_head list;
+};

static struct miscdevice miscdev;

+/*
+ * To support parallel GetQuote requests, use the list
+ * to track active GetQuote requests.
+ */
+static LIST_HEAD(quote_list);
+
+/* Lock to protect quote_list */
+static DEFINE_MUTEX(quote_lock);
+
+/*
+ * Workqueue to handle Quote data after Quote generation
+ * notification from VMM.
+ */
+struct workqueue_struct *quote_wq;
+struct work_struct quote_work;
+
static long tdx_get_report(void __user *argp)
{
void *reportdata = NULL, *tdreport = NULL;
@@ -71,6 +111,260 @@ static long tdx_get_report(void __user *argp)
return ret;
}

+/* tdx_get_quote_hypercall() - Request to get TD Quote using TDREPORT */
+static long tdx_get_quote_hypercall(struct quote_buf *buf)
+{
+ struct tdx_hypercall_args args = {0};
+
+ args.r10 = TDX_HYPERCALL_STANDARD;
+ args.r11 = TDVMCALL_GET_QUOTE;
+ args.r12 = cc_mkdec(page_to_phys(vmalloc_to_page(buf->vmaddr)));
+ args.r13 = buf->count * PAGE_SIZE;
+
+ /*
+ * Pass the physical address of TDREPORT to the VMM and
+ * trigger the Quote generation. It is not a blocking
+ * call, hence completion of this request will be notified to
+ * the TD guest via a callback interrupt. More info about ABI
+ * can be found in TDX Guest-Host-Communication Interface
+ * (GHCI), sec titled "TDG.VP.VMCALL<GetQuote>".
+ */
+ return __tdx_hypercall(&args, 0);
+}
+
+/*
+ * init_quote_buf() - Initialize the quote buffer by allocating
+ * a shared buffer of given size.
+ *
+ * Size is page aligned and the allocated memory is decrypted
+ * to allow VMM to access it. Uses VMAP to create a virtual
+ * mapping, which is further used to create a shared mapping
+ * for the buffer without affecting the direct map.
+ */
+static int init_quote_buf(struct quote_buf *buf, u64 req_size)
+{
+ int size = PAGE_ALIGN(req_size);
+ void *addr = NULL, *vmaddr = NULL;
+ int count = size >> PAGE_SHIFT;
+ struct page **pages = NULL;
+ int i;
+
+ addr = alloc_pages_exact(size, GFP_KERNEL);
+ if (!addr)
+ return -ENOMEM;
+
+ /* Allocate mem for array of page ptrs */
+ pages = kcalloc(count, sizeof(*pages), GFP_KERNEL);
+ if (!pages) {
+ free_pages_exact(addr, size);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < count; i++)
+ pages[i] = virt_to_page(addr + i * PAGE_SIZE);
+
+ /*
+ * Use VMAP to create a virtual mapping, which is used
+ * to create shared mapping without affecting the
+ * direct map. Use VM_MAP_PUT_PAGES to allow vmap()
+ * responsible for freeing the pages when using vfree().
+ */
+ vmaddr = vmap(pages, count, VM_MAP_PUT_PAGES, PAGE_KERNEL);
+ if (!vmaddr) {
+ kfree(pages);
+ free_pages_exact(addr, size);
+ return -EIO;
+ }
+
+ /* Use noalias variant to not affect the direct mapping */
+ if (set_memory_decrypted_noalias((unsigned long)vmaddr, count)) {
+ vfree(vmaddr);
+ return -EIO;
+ }
+
+ buf->vmaddr = vmaddr;
+ buf->count = count;
+
+ return 0;
+}
+
+/* Remove the shared mapping and free the memory */
+static void deinit_quote_buf(struct quote_buf *buf)
+{
+ if (!buf)
+ return;
+
+ /* Mark pages private */
+ if (set_memory_encrypted_noalias((unsigned long)buf->vmaddr,
+ buf->count)) {
+ pr_warn("Failed to encrypt %d pages at %p", buf->count,
+ buf->vmaddr);
+ return;
+ }
+
+ vfree(buf->vmaddr);
+}
+
+static struct quote_entry *alloc_quote_entry(u64 buf_len)
+{
+ struct quote_entry *entry = NULL;
+
+ entry = kmalloc(sizeof(*entry), GFP_KERNEL);
+ if (!entry)
+ return NULL;
+
+ /* Init buffer for quote request */
+ if (init_quote_buf(&entry->buf, buf_len)) {
+ kfree(entry);
+ return NULL;
+ }
+
+ init_completion(&entry->compl);
+ entry->valid = true;
+
+ return entry;
+}
+
+static void free_quote_entry(struct quote_entry *entry)
+{
+ deinit_quote_buf(&entry->buf);
+ kfree(entry);
+}
+
+/* Must be called with quote_lock held */
+static void _del_quote_entry(struct quote_entry *entry)
+{
+ list_del(&entry->list);
+ free_quote_entry(entry);
+}
+
+static void del_quote_entry(struct quote_entry *entry)
+{
+ mutex_lock(&quote_lock);
+ _del_quote_entry(entry);
+ mutex_unlock(&quote_lock);
+}
+
+/* Handles early termination of GetQuote requests */
+void terminate_quote_request(struct quote_entry *entry)
+{
+ struct tdx_quote_hdr *quote_hdr;
+
+ /*
+ * For early termination, if the request is not yet
+ * processed by VMM (GET_QUOTE_IN_FLIGHT), the VMM
+ * still owns the shared buffer, so mark the request
+ * invalid to let quote_callback_handler() handle the
+ * memory cleanup function. If the request is already
+ * processed, then do the cleanup and return.
+ */
+
+ mutex_lock(&quote_lock);
+ quote_hdr = (struct tdx_quote_hdr *)entry->buf.vmaddr;
+ if (quote_hdr->status == GET_QUOTE_IN_FLIGHT) {
+ entry->valid = false;
+ mutex_unlock(&quote_lock);
+ return;
+ }
+ _del_quote_entry(entry);
+ mutex_unlock(&quote_lock);
+}
+
+static long tdx_get_quote(void __user *argp)
+{
+ struct quote_entry *entry;
+ struct tdx_quote_req req;
+ struct quote_buf *buf;
+ long ret;
+
+ /* Copy GetQuote request struct from user buffer */
+ if (copy_from_user(&req, argp, sizeof(struct tdx_quote_req)))
+ return -EFAULT;
+
+ /* Make sure the length is valid */
+ if (!req.len)
+ return -EINVAL;
+
+ entry = alloc_quote_entry(req.len);
+ if (!entry)
+ return -ENOMEM;
+
+ buf = &entry->buf;
+
+ /* Copy TDREPORT from user buffer to kernel Quote buffer */
+ if (copy_from_user(buf->vmaddr, (void __user *)req.buf, req.len)) {
+ free_quote_entry(entry);
+ return -EFAULT;
+ }
+
+ mutex_lock(&quote_lock);
+
+ /* Submit GetQuote Request */
+ ret = tdx_get_quote_hypercall(buf);
+ if (ret) {
+ mutex_unlock(&quote_lock);
+ pr_err("GetQuote hypercall failed, status:%lx\n", ret);
+ free_quote_entry(entry);
+ return -EIO;
+ }
+
+ /* Add current quote entry to quote_list to track active requests */
+ list_add_tail(&entry->list, &quote_list);
+
+ mutex_unlock(&quote_lock);
+
+ /* Wait for attestation completion */
+ ret = wait_for_completion_interruptible(&entry->compl);
+ if (ret < 0) {
+ terminate_quote_request(entry);
+ return -EINTR;
+ }
+
+ /*
+ * If GetQuote request completed successfully, copy the result
+ * back to the user and do the cleanup.
+ */
+ if (copy_to_user((void __user *)req.buf, buf->vmaddr, req.len))
+ ret = -EFAULT;
+
+ /*
+ * Reaching here means GetQuote request is processed
+ * successfully. So do the cleanup and return 0.
+ */
+ del_quote_entry(entry);
+
+ return 0;
+}
+
+static void attestation_callback_handler(void)
+{
+ queue_work(quote_wq, &quote_work);
+}
+
+static void quote_callback_handler(struct work_struct *work)
+{
+ struct tdx_quote_hdr *quote_hdr;
+ struct quote_entry *entry, *next;
+
+ /* Find processed quote request and mark it complete */
+ mutex_lock(&quote_lock);
+ list_for_each_entry_safe(entry, next, &quote_list, list) {
+ quote_hdr = (struct tdx_quote_hdr *)entry->buf.vmaddr;
+ if (quote_hdr->status == GET_QUOTE_IN_FLIGHT)
+ continue;
+ /*
+ * If user invalidated the current request, remove the
+ * entry from the quote list and free it. If the request
+ * is still valid, mark it complete.
+ */
+ if (entry->valid)
+ complete(&entry->compl);
+ else
+ _del_quote_entry(entry);
+ }
+ mutex_unlock(&quote_lock);
+}
+
static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -81,6 +375,9 @@ static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
case TDX_CMD_GET_REPORT:
ret = tdx_get_report(argp);
break;
+ case TDX_CMD_GET_QUOTE:
+ ret = tdx_get_quote(argp);
+ break;
default:
pr_debug("cmd %d not supported\n", cmd);
break;
@@ -103,6 +400,13 @@ static int __init tdx_attestation_init(void)
if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return -EIO;

+ quote_wq = create_singlethread_workqueue("tdx_quote_handler");
+
+ INIT_WORK(&quote_work, quote_callback_handler);
+
+ /* Register attestation event notify handler */
+ tdx_setup_ev_notify_handler(attestation_callback_handler);
+
miscdev.name = DRIVER_NAME;
miscdev.minor = MISC_DYNAMIC_MINOR;
miscdev.fops = &tdx_attest_fops;
diff --git a/arch/x86/include/uapi/asm/tdx.h b/arch/x86/include/uapi/asm/tdx.h
index 8b57dea67eab..51944dd0462f 100644
--- a/arch/x86/include/uapi/asm/tdx.h
+++ b/arch/x86/include/uapi/asm/tdx.h
@@ -39,4 +39,49 @@ struct tdx_report_req {
*/
#define TDX_CMD_GET_REPORT _IOWR('T', 0x01, struct tdx_report_req)

+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf : Pass user data that includes TDREPORT as input. Upon
+ * successful completion of IOCTL, output is copied
+ * back to the same buffer.
+ * @len : Length of the buffer.
+ */
+struct tdx_quote_req {
+ __u64 buf;
+ __u64 len;
+};
+
+/*
+ * TDX_CMD_GET_QUOTE - Get TD Quote from QE/QGS using GetQuote
+ * TDVMCALL.
+ *
+ * Returns 0 on success, -EINTR for interrupted request, and
+ * standard errono on other failures.
+ */
+#define TDX_CMD_GET_QUOTE _IOR('T', 0x02, struct tdx_quote_req)
+
+/* TD Quote status codes */
+#define GET_QUOTE_SUCCESS 0
+#define GET_QUOTE_IN_FLIGHT 0xffffffffffffffff
+#define GET_QUOTE_ERROR 0x8000000000000000
+#define GET_QUOTE_SERVICE_UNAVAILABLE 0x8000000000000001
+
+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+ /* Quote version, filled by TD */
+ __u64 version;
+ /* Status code of Quote request, filled by VMM */
+ __u64 status;
+ /* Length of TDREPORT, filled by TD */
+ __u32 in_len;
+ /* Length of Quote, filled by VMM */
+ __u32 out_len;
+ /* Actual Quote data */
+ __u64 data[0];
+};
+
#endif /* _UAPI_ASM_X86_TDX_H */
--
2.25.1

2022-06-14 13:30:11

by Wander Lairson Costa

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Wed, Jun 08, 2022 at 07:52:20PM -0700, Kuppuswamy Sathyanarayanan wrote:

[snip]

> +}
> +
> +/* Remove the shared mapping and free the memory */
> +static void deinit_quote_buf(struct quote_buf *buf)
> +{
> + if (!buf)
> + return;

nit: the null check isn't necessary anymore, is it?

[snip]

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support



On 6/14/22 5:30 AM, Wander Lairson Costa wrote:
> On Wed, Jun 08, 2022 at 07:52:20PM -0700, Kuppuswamy Sathyanarayanan wrote:
>
> [snip]
>
>> +}
>> +
>> +/* Remove the shared mapping and free the memory */
>> +static void deinit_quote_buf(struct quote_buf *buf)
>> +{
>> + if (!buf)
>> + return;
>
> nit: the null check isn't necessary anymore, is it?

Yes. I will remove it.

>
> [snip]
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-06-20 14:30:10

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
> Host-guest event notification via configured interrupt vector is useful
> in cases where a guest makes an asynchronous request and needs a
> callback from the host to indicate the completion or to let the host
> notify the guest about events like device removal. One usage example is,
> callback requirement of GetQuote asynchronous hypercall.

Although this paragraph is from GHCI spec, IMHO it is not very helpful. In
fact, I think this paragraph is not that right and should be removed from GHCI.
The reason is such event notification from VMM in cases like "device removal" is
too vague. There's no _specification_ in GHCI around which "device removal"
should VMM inject such event. For instance, I _think_ the Qemu enumerated ACPI-
based hotplug should continue to work in TD.

That being said, if a TD has multiple devices, it cannot know whether the VMM
will inject the removal event via the vector set by SetupEventNotifyInterrupt.
And for the same device in the same TD, different VMMs may use different way to
notify its removal.

It seems GetQuote is the only user of SetupEventNotifyInterrupt. Maybe we
should just declare it is for GetQuote.

Isaku, what do you think? Does this make sense?

>
> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> guest to specify which interrupt vector to use as an event-notify
> vector to the VMM. Details about the SetupEventNotifyInterrupt
> hypercall can be found in TDX Guest-Host Communication Interface
> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
> Add a tdx_hcall_set_notify_intr() helper function to implement the
> SetupEventNotifyInterrupt hypercall.

As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>).
This means theoretically TD should implement pooling mode in case VMM doesn't
support injecting event via vector done by SetupEventNotifyInterrupt?

Perhaps we should update the GHCI spec to use must..

>
> Reserve 0xec IRQ vector address for TDX guest to receive the event
> completion notification from VMM. Also add related IDT handler to
> process the notification event.

Here lacks why we need to choose to reserve a system vector. For instance, why
we cannot choose to use device IRQ way which only requires one vector on one
cpu. As you can see reserving a system vector isn't ideal especially for
attestation as it is not a frequent operation. It is wasteful of using IRQ
resource especially on server systems with a lot of CPUs.

The reason is SetupEventNotifyInterrupt TDVMCALL only has one argument, which is
vector, but cannot specify which CPU that the VMM should inject the event to.
The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
VMM can inject to any CPU.

Btw, x86 maintainers,

I'd like to check with you to see whether we should improve the existing
SetupEventNotifyInterrupt so we can choose to use request_irq() style for
attestation. Using request_irq() means we don't need to reserve a system
vector, but can allocate a vector dynamically when needed.

Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
can use below way (idea only) to dynamically allocate a vector on one CPU when
attestation is needed:


int cpu, vector;
int irq;

// request an IRQ, and prevent it from being migrated
irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
request_irq(irq, ...);

// get vector, cpu from irq

TDVMCALL<SetupEventNotifyInterrupt>(vector, 
apic->cpu_present_to_apidid(cpu));

Is this reasonable? If yes, is it worth to do?

--
Thanks,
-Kai


>
> Add support to track the notification event status via
> /proc/interrupts.
>
> Reviewed-by: Tony Luck <[email protected]>
> Reviewed-by: Andi Kleen <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Wander Lairson Costa <[email protected]>
> Signed-off-by: Kuppuswamy Sathyanarayanan <[email protected]>
> ---
> arch/x86/coco/tdx/tdx.c | 71 ++++++++++++++++++++++++++++++
> arch/x86/include/asm/hardirq.h | 3 ++
> arch/x86/include/asm/idtentry.h | 4 ++
> arch/x86/include/asm/irq_vectors.h | 7 ++-
> arch/x86/include/asm/tdx.h | 4 ++
> arch/x86/kernel/irq.c | 7 +++
> 6 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 03deb4d6920d..94542876f26a 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -11,6 +11,10 @@
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/pgtable.h>
> +#include <asm/apic.h>
> +#include <asm/idtentry.h>
> +#include <asm/irq_regs.h>
> +#include <asm/desc.h>
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> @@ -19,6 +23,7 @@
>
> /* TDX hypercall Leaf IDs */
> #define TDVMCALL_MAP_GPA 0x10001
> +#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004
>
> /* MMIO direction */
> #define EPT_READ 0
> @@ -34,6 +39,26 @@
> #define VE_GET_PORT_NUM(e) ((e) >> 16)
> #define VE_IS_IO_STRING(e) ((e) & BIT(4))
>
> +/*
> + * Handler used to report notifications about
> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> + * used only by the attestation driver. So, race condition
> + * with read/write operation is not considered.
> + */
> +static void (*tdx_event_notify_handler)(void);
> +
> +/* Helper function to register tdx_event_notify_handler */
> +void tdx_setup_ev_notify_handler(void (*handler)(void))
> +{
> + tdx_event_notify_handler = handler;
> +}
> +
> +/* Helper function to unregister tdx_event_notify_handler */
> +void tdx_remove_ev_notify_handler(void)
> +{
> + tdx_event_notify_handler = NULL;
> +}
> +
> /*
> * Wrapper for standard use of __tdx_hypercall with no output aside from
> * return code.
> @@ -98,6 +123,46 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> }
>
> +/* TDX guest event notification handler */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_tdx_event_notify)
> +{
> + struct pt_regs *old_regs = set_irq_regs(regs);
> +
> + inc_irq_stat(irq_tdx_event_notify_count);
> +
> + if (tdx_event_notify_handler)
> + tdx_event_notify_handler();
> +
> + ack_APIC_irq();
> +
> + set_irq_regs(old_regs);
> +}
> +
> +/*
> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
> + *
> + * @vector: Vector address to be used for notification.
> + *
> + * return 0 on success or failure error number.
> + */
> +static long tdx_hcall_set_notify_intr(u8 vector)
> +{
> + /* Minimum vector value allowed is 32 */
> + if (vector < 32)
> + return -EINVAL;
> +
> + /*
> + * Register callback vector address with VMM. More details
> + * about the ABI can be found in TDX Guest-Host-Communication
> + * Interface (GHCI), sec titled
> + * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> + */
> + if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
> + return -EIO;
> +
> + return 0;
> +}
> +
> static u64 get_cc_mask(void)
> {
> struct tdx_module_output out;
> @@ -688,5 +753,11 @@ void __init tdx_early_init(void)
> x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
> x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
>
> + alloc_intr_gate(TDX_GUEST_EVENT_NOTIFY_VECTOR,
> + asm_sysvec_tdx_event_notify);
> +
> + if (tdx_hcall_set_notify_intr(TDX_GUEST_EVENT_NOTIFY_VECTOR))
> + pr_warn("Setting event notification interrupt failed\n");
> +
> pr_info("Guest detected\n");
> }
> diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
> index 275e7fd20310..582deff56210 100644
> --- a/arch/x86/include/asm/hardirq.h
> +++ b/arch/x86/include/asm/hardirq.h
> @@ -44,6 +44,9 @@ typedef struct {
> unsigned int irq_hv_reenlightenment_count;
> unsigned int hyperv_stimer0_count;
> #endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> + unsigned int irq_tdx_event_notify_count;
> +#endif
> } ____cacheline_aligned irq_cpustat_t;
>
> DECLARE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
> diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
> index 72184b0b2219..655086dd940e 100644
> --- a/arch/x86/include/asm/idtentry.h
> +++ b/arch/x86/include/asm/idtentry.h
> @@ -700,6 +700,10 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_xen_hvm_callback);
> DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR, sysvec_kvm_asyncpf_interrupt);
> #endif
>
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +DECLARE_IDTENTRY_SYSVEC(TDX_GUEST_EVENT_NOTIFY_VECTOR, sysvec_tdx_event_notify);
> +#endif
> +
> #undef X86_TRAP_OTHER
>
> #endif
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 43dcb9284208..82ac0c0a34b1 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -104,7 +104,12 @@
> #define HYPERV_STIMER0_VECTOR 0xed
> #endif
>
> -#define LOCAL_TIMER_VECTOR 0xec
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> +/* Vector on which TDX Guest event notification is delivered */
> +#define TDX_GUEST_EVENT_NOTIFY_VECTOR 0xec
> +#endif
> +
> +#define LOCAL_TIMER_VECTOR 0xeb
>
> #define NR_VECTORS 256
>
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 020c81a7c729..eb4db837cc44 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -67,6 +67,10 @@ void tdx_safe_halt(void);
>
> bool tdx_early_handle_ve(struct pt_regs *regs);
>
> +void tdx_setup_ev_notify_handler(void (*handler)(void));
> +
> +void tdx_remove_ev_notify_handler(void);
> +
> #else
>
> static inline void tdx_early_init(void) { };
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 766ffe3ba313..a96ecd866723 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -181,6 +181,13 @@ int arch_show_interrupts(struct seq_file *p, int prec)
> seq_printf(p, "%10u ",
> irq_stats(j)->kvm_posted_intr_wakeup_ipis);
> seq_puts(p, " Posted-interrupt wakeup event\n");
> +#endif
> +#if IS_ENABLED(CONFIG_INTEL_TDX_GUEST)
> + seq_printf(p, "%*s: ", prec, "TGN");
> + for_each_online_cpu(j)
> + seq_printf(p, "%10u ",
> + irq_stats(j)->irq_tdx_event_notify_count);
> + seq_puts(p, " TDX Guest event notification\n");
> #endif
> return 0;
> }


Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

Hi,

+ Jun

On 6/20/22 5:33 AM, Kai Huang wrote:
> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>> Host-guest event notification via configured interrupt vector is useful
>> in cases where a guest makes an asynchronous request and needs a
>> callback from the host to indicate the completion or to let the host
>> notify the guest about events like device removal. One usage example is,
>> callback requirement of GetQuote asynchronous hypercall.
>
> Although this paragraph is from GHCI spec, IMHO it is not very helpful. In
> fact, I think this paragraph is not that right and should be removed from GHCI.
> The reason is such event notification from VMM in cases like "device removal" is
> too vague. There's no _specification_ in GHCI around which "device removal"
> should VMM inject such event. For instance, I _think_ the Qemu enumerated ACPI-
> based hotplug should continue to work in TD.

Yes. It just says that it *can* be used to signal a device removal. It is just
an example for where it can be used. But I agree that such a use case is vague.
If it makes it better, I am fine with removing it.

Copied from sec 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>:

"Example of an operation that can use the event notify is the host
VMM signaling a device removal to the TD, in response to which a TD may
unload a device driver."

>
> That being said, if a TD has multiple devices, it cannot know whether the VMM
> will inject the removal event via the vector set by SetupEventNotifyInterrupt.
> And for the same device in the same TD, different VMMs may use different way to
> notify its removal.

As per current design, If it is used for device removal, I think all registered
device drivers will get the notification and the individual device driver has
to check whether it is applicable for them.

If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
the exact device or use case detail, then it can optimize the implementation.

>
> It seems GetQuote is the only user of SetupEventNotifyInterrupt. Maybe we
> should just declare it is for GetQuote.

Ok.

>
> Isaku, what do you think? Does this make sense?
>
>>
>> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
>> guest to specify which interrupt vector to use as an event-notify
>> vector to the VMM. Details about the SetupEventNotifyInterrupt
>> hypercall can be found in TDX Guest-Host Communication Interface
>> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
>> Add a tdx_hcall_set_notify_intr() helper function to implement the
>> SetupEventNotifyInterrupt hypercall.
>
> As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
> set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>).
> This means theoretically TD should implement pooling mode in case VMM doesn't
> support injecting event via vector done by SetupEventNotifyInterrupt?

Yes. But GetQuote specification does not talk about the pooling mode
use case as well. So I think it is just a wording confusion.

>
> Perhaps we should update the GHCI spec to use must..

Ok.

>
>>
>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>> completion notification from VMM. Also add related IDT handler to
>> process the notification event.
>
> Here lacks why we need to choose to reserve a system vector. For instance, why
> we cannot choose to use device IRQ way which only requires one vector on one

As you have explained below, as per current spec, it just expects a system
vector.

> cpu. As you can see reserving a system vector isn't ideal especially for
> attestation as it is not a frequent operation. It is wasteful of using IRQ

I agree that event notification is currently only used for attestation. But I
think in future there could be other use cases for it. If the intention is just
to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
the vector address, and there is no need for new TDVMCALL. I think the intention
here is to have generic method for VMM to notify TD about some events. I am not
clear about the possible future use cases, so I cannot comment on frequency of
its use.

Jun, any comments?



> resource especially on server systems with a lot of CPUs.

FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
reserved only for TDX use case.


>
> The reason is SetupEventNotifyInterrupt TDVMCALL only has one argument, which is
> vector, but cannot specify which CPU that the VMM should inject the event to.
> The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
> to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
> VMM can inject to any CPU.
>
> Btw, x86 maintainers,
>
> I'd like to check with you to see whether we should improve the existing
> SetupEventNotifyInterrupt so we can choose to use request_irq() style for
> attestation. Using request_irq() means we don't need to reserve a system
> vector, but can allocate a vector dynamically when needed.
>
> Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
> CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
> can use below way (idea only) to dynamically allocate a vector on one CPU when
> attestation is needed:
>
>
> int cpu, vector;
> int irq;
>
> // request an IRQ, and prevent it from being migrated
> irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
> request_irq(irq, ...);
>
> // get vector, cpu from irq
>
> TDVMCALL<SetupEventNotifyInterrupt>(vector, 
> apic->cpu_present_to_apidid(cpu));
>
> Is this reasonable? If yes, is it worth to do?
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-06-23 09:50:23

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

>
> >
> > That being said, if a TD has multiple devices, it cannot know whether the VMM
> > will inject the removal event via the vector set by SetupEventNotifyInterrupt.
> > And for the same device in the same TD, different VMMs may use different way to
> > notify its removal.
>
> As per current design, If it is used for device removal, I think all registered
> device drivers will get the notification and the individual device driver has
> to check whether it is applicable for them.

The problem is there's no _specification_ around this. As I said above, I don't
see why TDX architecture cannot support Qemu enumerated ACPI-based hotplug,
which uses SCI. Maybe I am missing something here, but IMHO we should have some
_specification_ around SetupEventNotifyInterrupt in GHCI.

>
> If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
> the exact device or use case detail, then it can optimize the implementation.
>
> >
> > It seems GetQuote is the only user of SetupEventNotifyInterrupt. Maybe we
> > should just declare it is for GetQuote.
>
> Ok.

If you believe this is reasonable, perhaps you can drive the spec change.

>
> >
> > Isaku, what do you think? Does this make sense?
> >
> > >
> > > In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
> > > guest to specify which interrupt vector to use as an event-notify
> > > vector to the VMM. Details about the SetupEventNotifyInterrupt
> > > hypercall can be found in TDX Guest-Host Communication Interface
> > > (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
> > > Add a tdx_hcall_set_notify_intr() helper function to implement the
> > > SetupEventNotifyInterrupt hypercall.
> >
> > As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
> > set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>).
> > This means theoretically TD should implement pooling mode in case VMM doesn't
> > support injecting event via vector done by SetupEventNotifyInterrupt?
>
> Yes. But GetQuote specification does not talk about the pooling mode
> use case as well. So I think it is just a wording confusion.

It doesn't need to mention I think. "can" means VMM can choose to inject or
not, but not must, which basically implies GetQuote should support pooling.

>
> >
> > Perhaps we should update the GHCI spec to use must..
>
> Ok.

If you don't want to support pooling, I guess you'd better to improve the GHCI.

[...]

>
>
> > resource especially on server systems with a lot of CPUs.
>
> FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
> reserved only for TDX use case.

This reason doesn't stand. I think distributions basically tends to enable all
Kconfig options so one binary works on all machines, so it can be (maybe likely)
turned on even on bare-metal machines.

--
Thanks,
-Kai


2022-06-23 10:37:25

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

On Mon, 2022-06-20 at 08:44 -0700, Sathyanarayanan Kuppuswamy wrote:
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. 
>

We have TDVMCALL<Service> in GHCI 1.5. It allows specifying the vector. I
don't know whether there's any future case that still needs
SetupEventNotifyInterrupt.

> If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL.

The TDVMCALL<Service> in GHCI 1.5 actually supports specifying the vector. It
currently neither clearly says "the VMM should inject event to the vCPU on which
the TDVMCALL<Service> is called", nor it allows specifying which vCPU to inject.
Assuming we can lobby (I am already lobbying, actually) to make either way, in
fact I think it can be used as IRQ way (like I described in my first reply).

I once lobbied to completely remove GetQuote (and SetupEventNotifyInterrupt)
from GHCI 1.0 and move TDVMCALL<Service> from GHCI 1.5 to 1.0, and define a new
sub-service for attestation -- GetQuote will be one command then, and we can
define more attestation related commands (which is likely) when needed. I
believe using TDVMCALL<Service> for GetQuote can actually allow kernel to use
IRQ way, but doesn't need a system vector. But this wasn't successful.


--
Thanks,
-Kai


2022-06-24 17:19:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> TDREPORT can only be verified on local platform as the MAC key is bound
> to the platform. To support remote verification of the TDREPORT, TDX
> leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT locally
> and convert it to a remote verifiable Quote.
>
> After getting the TDREPORT, the second step of the attestation process
> is to send it to the QE to generate the Quote. TDX doesn't support SGX
> inside the TD, so the QE can be deployed in the host, or in another
> legacy VM with SGX support. How to send the TDREPORT to QE and receive
> the Quote is implementation and deployment specific.

On high-level comment: this whole series is quite agnostic about what is
actually attested. On some level, I guess it doesn't matter. But, it
makes me a bit uneasy. There needs to be at least *some* kind of claim
about that somewhere, preferably a sentence in the cover letter and
sentence or two here.

Second, how can someone test this code? It appears that they need to
assemble a veritable Rube Goldberg machine. The least we could do is
have a selftest that just calls the ioctl() and makes sure that
something halfway sane comes out of it.

> In such
> case, since REPORTDATA is a secret, using sysfs to share it is insecure
> compared to sending it via IOCTL.

Huh? How is sysfs "insecure"?

> + /*
> + * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> + *
> + * Get the TDREPORT using REPORTDATA as input. Refer to
> + * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
> + * Specification for detailed information.
> + */
> + ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> + virt_to_phys(reportdata), 0, 0, NULL);

One of those 0's is a "Report sub type". Those bits in the module call
are presumably because it won't *always* be zero. But, if there's ever
a new report type, we'll need another user/kernel ABI. That doesn't
seem great.

The TDX module spec doesn't even give a name to "sub report type 0".
That is not super helpful for deciding what it *means*.

Right now, the entire ABI is basically:

ret = ioctl(TDX_GET_REPORT, &buffer);

That _implies_ a ton of stuff:

1. report sub type 0
2. a specific input length REPORTDATA
3. a specific output length

I don't want to over-engineer this thing, but it would make a lot of
sense to me to feed the ioctl() with something like this:

struct tdx_report
{
u8 report_sub_type;
u64 report_input_data;
u64 report_input_data_len;

u64 report_output_data;
u64 report_output_data_len;
}

That makes *everything* explicit. It makes it utterly clear what goes
where, *AND* it makes userspace declare it.

But, you have:

> +/**
> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
> + *
> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
> + * TDREPORT. Typically it can be some nonce provided by
> + * attestation service, so the generated TDREPORT can be
> + * uniquely verified.
> + * @tdreport : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
> + * TDX_REPORT_LEN.
> + *
> + * Used in TDX_CMD_GET_REPORT IOCTL request.
> + */
> +struct tdx_report_req {
> + union {
> + __u8 reportdata[TDX_REPORTDATA_LEN];
> + __u8 tdreport[TDX_REPORT_LEN];
> + };
> +};
> +

and a bunch of code copying in and out of this structure:

> +static long tdx_get_report(void __user *argp)
> +{
> + void *reportdata = NULL, *tdreport = NULL;
...
> + /* Copy REPORTDATA from the user buffer */
> + if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
> + ret = -EFAULT;
> + goto out;
> + }

But none of that code even bothers to *use* the structure!

What's with the union? Are you really just trying to save 64 bytes of
space? Is that worth it?

How many of these "drivers" are we going to need which are thinly veiled
ioctl()s that are only TDX module call wrappers?

2022-06-24 18:38:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add TDX Guest Attestation support

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> Following patches add the attestation support to TDX guest which
> includes attestation user interface driver and related hypercall support.

This is also the place where you lay out the roadmap:

1. Get a report
2. Get a quote
2a. Interrupt support because quotes take a long time
2b. Actual quote module calls and ABI

Right? That seems worth a few sentences in the cover letter.

Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

+ Jiewen

Jiewen, Can you please comment on the specification related queries?

On 6/20/22 8:44 AM, Sathyanarayanan Kuppuswamy wrote:
> Hi,
>
> + Jun
>
> On 6/20/22 5:33 AM, Kai Huang wrote:
>> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>>> Host-guest event notification via configured interrupt vector is useful
>>> in cases where a guest makes an asynchronous request and needs a
>>> callback from the host to indicate the completion or to let the host
>>> notify the guest about events like device removal. One usage example is,
>>> callback requirement of GetQuote asynchronous hypercall.
>>
>> Although this paragraph is from GHCI spec, IMHO it is not very helpful. In
>> fact, I think this paragraph is not that right and should be removed from GHCI.
>> The reason is such event notification from VMM in cases like "device removal" is
>> too vague. There's no _specification_ in GHCI around which "device removal"
>> should VMM inject such event. For instance, I _think_ the Qemu enumerated ACPI-
>> based hotplug should continue to work in TD.
>
> Yes. It just says that it *can* be used to signal a device removal. It is just
> an example for where it can be used. But I agree that such a use case is vague.
> If it makes it better, I am fine with removing it.
>
> Copied from sec 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>:
>
> "Example of an operation that can use the event notify is the host
> VMM signaling a device removal to the TD, in response to which a TD may
> unload a device driver."
>
>>
>> That being said, if a TD has multiple devices, it cannot know whether the VMM
>> will inject the removal event via the vector set by SetupEventNotifyInterrupt.
>> And for the same device in the same TD, different VMMs may use different way to
>> notify its removal.
>
> As per current design, If it is used for device removal, I think all registered
> device drivers will get the notification and the individual device driver has
> to check whether it is applicable for them.
>
> If the SetupEventNotifyInterrupt TDVMCALL specification is extended to specify
> the exact device or use case detail, then it can optimize the implementation.
>
>>
>> It seems GetQuote is the only user of SetupEventNotifyInterrupt. Maybe we
>> should just declare it is for GetQuote.
>
> Ok.
>
>>
>> Isaku, what do you think? Does this make sense?
>>
>>>
>>> In TDX guest, SetupEventNotifyInterrupt hypercall can be used by the
>>> guest to specify which interrupt vector to use as an event-notify
>>> vector to the VMM. Details about the SetupEventNotifyInterrupt
>>> hypercall can be found in TDX Guest-Host Communication Interface
>>> (GHCI) Specification, sec 3.5 "VP.VMCALL<SetupEventNotifyInterrupt>".
>>> Add a tdx_hcall_set_notify_intr() helper function to implement the
>>> SetupEventNotifyInterrupt hypercall.
>>
>> As you also used "can" above, the GHCI only says the VMM _CAN_ inject the vector
>> set by SetupEventNotifyInterrupt, but not must (3.3 TDG.VP.VMCALL<GetQuote>).
>> This means theoretically TD should implement pooling mode in case VMM doesn't
>> support injecting event via vector done by SetupEventNotifyInterrupt?
>
> Yes. But GetQuote specification does not talk about the pooling mode
> use case as well. So I think it is just a wording confusion.
>
>>
>> Perhaps we should update the GHCI spec to use must..
>
> Ok.
>
>>
>>>
>>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>>> completion notification from VMM. Also add related IDT handler to
>>> process the notification event.
>>
>> Here lacks why we need to choose to reserve a system vector. For instance, why
>> we cannot choose to use device IRQ way which only requires one vector on one
>
> As you have explained below, as per current spec, it just expects a system
> vector.
>
>> cpu. As you can see reserving a system vector isn't ideal especially for
>> attestation as it is not a frequent operation. It is wasteful of using IRQ
>
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL. I think the intention
> here is to have generic method for VMM to notify TD about some events. I am not
> clear about the possible future use cases, so I cannot comment on frequency of
> its use.
>
> Jun, any comments?
>
>
>
>> resource especially on server systems with a lot of CPUs.
>
> FWIW, this reservation is protected with CONFIG_INTEL_TDX_GUEST. So it will be
> reserved only for TDX use case.
>
>
>>
>> The reason is SetupEventNotifyInterrupt TDVMCALL only has one argument, which is
>> vector, but cannot specify which CPU that the VMM should inject the event to.
>> The GHCI spec doesn't say which CPU the VMM should inject to (i.e. must inject
>> to the CPU on which SetupEventNotifyInterrupt is called), so we can only assume
>> VMM can inject to any CPU.
>>
>> Btw, x86 maintainers,
>>
>> I'd like to check with you to see whether we should improve the existing
>> SetupEventNotifyInterrupt so we can choose to use request_irq() style for
>> attestation. Using request_irq() means we don't need to reserve a system
>> vector, but can allocate a vector dynamically when needed.
>>
>> Assuming we update SetupEventNotifyInterrupt to also allow TD to specify which
>> CPU (i.e. via APICID) to inject (along with the vector), my understanding is we
>> can use below way (idea only) to dynamically allocate a vector on one CPU when
>> attestation is needed:
>>
>>
>> int cpu, vector;
>> int irq;
>>
>> // request an IRQ, and prevent it from being migrated
>> irq = __irq_domain_alloc_irqs(x86_vector_domain, 0, 1, ...);
>> request_irq(irq, ...);
>>
>> // get vector, cpu from irq
>>
>> TDVMCALL<SetupEventNotifyInterrupt>(vector, 
>> apic->cpu_present_to_apidid(cpu));
>>
>> Is this reasonable? If yes, is it worth to do?
>>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-06-25 00:04:37

by Nakajima, Jun

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

Replying to this (not the latest one) to reduce the quoting levels, adding Jiewen.


> On Jun 20, 2022, at 8:44 AM, Sathyanarayanan Kuppuswamy <[email protected]> wrote:
>
> Hi,
>
> + Jun
>
> On 6/20/22 5:33 AM, Kai Huang wrote:
>> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
>>> Host-guest event notification via configured interrupt vector is useful
>>> in cases where a guest makes an asynchronous request and needs a
>>> callback from the host to indicate the completion or to let the host
>>> notify the guest about events like device removal. One usage example is,
>>> callback requirement of GetQuote asynchronous hypercall.
>>
>> Although this paragraph is from GHCI spec, IMHO it is not very helpful. In
>> fact, I think this paragraph is not that right and should be removed from GHCI.
>> The reason is such event notification from VMM in cases like "device removal" is
>> too vague. There's no _specification_ in GHCI around which "device removal"
>> should VMM inject such event. For instance, I _think_ the Qemu enumerated ACPI-
>> based hotplug should continue to work in TD.
>
> Yes. It just says that it *can* be used to signal a device removal. It is just
> an example for where it can be used. But I agree that such a use case is vague.
> If it makes it better, I am fine with removing it.

Yes, the “device removal” is just an example, especially, "the TD OS should be designed to not use the event notification for trusted operations”, based on the context of the spec.

>>
>>>
>>> Reserve 0xec IRQ vector address for TDX guest to receive the event
>>> completion notification from VMM. Also add related IDT handler to
>>> process the notification event.
>>
>> Here lacks why we need to choose to reserve a system vector. For instance, why
>> we cannot choose to use device IRQ way which only requires one vector on one
>
> As you have explained below, as per current spec, it just expects a system
> vector.
>
>> cpu. As you can see reserving a system vector isn't ideal especially for
>> attestation as it is not a frequent operation. It is wasteful of using IRQ
>
> I agree that event notification is currently only used for attestation. But I
> think in future there could be other use cases for it. If the intention is just
> to use it for attestation, then we can just modify the GetQuote TDVMCALL to pass
> the vector address, and there is no need for new TDVMCALL. I think the intention
> here is to have generic method for VMM to notify TD about some events. I am not
> clear about the possible future use cases, so I cannot comment on frequency of
> its use.
>
> Jun, any comments?
>

The GHCI spec was not just clear, and we’ll update the spec, for example:

3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
...
From:

“The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. “

To:

“The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt>, via the posted-interrupt descriptor. “

---
Jun

2022-06-25 04:05:39

by Yao, Jiewen

[permalink] [raw]
Subject: RE: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

Thank you, Jun.

Yes. I confirmed that we will include below change to GHCI.next spec.

================
3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>

From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "

To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "

3.13 TDG.VP.VMCALL<Service>

Table 3-39: TDG.VP.VMCALL< Service >-Input Operands

From: R14:
"Event notification interrupt vector - (valid values 32~255) selected by TD
0: blocking action. VMM need get response then return.
1~31: Reserved. Should not be used.
32~255: Non-block action. VMM can return immediately and signal the interrupt vector when the response is ready. "

To: R14:
"Event notification interrupt vector - (valid values 32~255) selected by TD
0: blocking action. VMM need get response then return.
1~31: Reserved. Should not be used.
32~255: Non-block action. VMM can return immediately and signal the interrupt vector when the response is ready. VMM should inject interrupt vector into the TD VCPU that executed TDG.VP.VMCALL<Service>."

================

Thank you
Yao Jiewen


> -----Original Message-----
> From: Nakajima, Jun <[email protected]>
> Sent: Saturday, June 25, 2022 7:42 AM
> To: Sathyanarayanan Kuppuswamy
> <[email protected]>
> Cc: Huang, Kai <[email protected]>; Thomas Gleixner <[email protected]>;
> Ingo Molnar <[email protected]>; Borislav Petkov <[email protected]>; Dave
> Hansen <[email protected]>; [email protected]; H . Peter Anvin
> <[email protected]>; Kirill A . Shutemov <[email protected]>; Luck,
> Tony <[email protected]>; Andi Kleen <[email protected]>; Wander Lairson
> Costa <[email protected]>; Isaku Yamahata <[email protected]>;
> [email protected]; [email protected];
> [email protected]; Cox, Philip <[email protected]>; LKML
> <[email protected]>; Yao, Jiewen <[email protected]>
> Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt
> support
>
> Replying to this (not the latest one) to reduce the quoting levels, adding Jiewen.
>
>
> > On Jun 20, 2022, at 8:44 AM, Sathyanarayanan Kuppuswamy
> <[email protected]> wrote:
> >
> > Hi,
> >
> > + Jun
> >
> > On 6/20/22 5:33 AM, Kai Huang wrote:
> >> On Wed, 2022-06-08 at 19:52 -0700, Kuppuswamy Sathyanarayanan wrote:
> >>> Host-guest event notification via configured interrupt vector is useful
> >>> in cases where a guest makes an asynchronous request and needs a
> >>> callback from the host to indicate the completion or to let the host
> >>> notify the guest about events like device removal. One usage example is,
> >>> callback requirement of GetQuote asynchronous hypercall.
> >>
> >> Although this paragraph is from GHCI spec, IMHO it is not very helpful. In
> >> fact, I think this paragraph is not that right and should be removed from GHCI.
> >> The reason is such event notification from VMM in cases like "device
> removal" is
> >> too vague. There's no _specification_ in GHCI around which "device
> removal"
> >> should VMM inject such event. For instance, I _think_ the Qemu enumerated
> ACPI-
> >> based hotplug should continue to work in TD.
> >
> > Yes. It just says that it *can* be used to signal a device removal. It is just
> > an example for where it can be used. But I agree that such a use case is vague.
> > If it makes it better, I am fine with removing it.
>
> Yes, the “device removal” is just an example, especially, "the TD OS should be
> designed to not use the event notification for trusted operations”, based on the
> context of the spec.
>
> >>
> >>>
> >>> Reserve 0xec IRQ vector address for TDX guest to receive the event
> >>> completion notification from VMM. Also add related IDT handler to
> >>> process the notification event.
> >>
> >> Here lacks why we need to choose to reserve a system vector. For instance,
> why
> >> we cannot choose to use device IRQ way which only requires one vector on
> one
> >
> > As you have explained below, as per current spec, it just expects a system
> > vector.
> >
> >> cpu. As you can see reserving a system vector isn't ideal especially for
> >> attestation as it is not a frequent operation. It is wasteful of using IRQ
> >
> > I agree that event notification is currently only used for attestation. But I
> > think in future there could be other use cases for it. If the intention is just
> > to use it for attestation, then we can just modify the GetQuote TDVMCALL to
> pass
> > the vector address, and there is no need for new TDVMCALL. I think the
> intention
> > here is to have generic method for VMM to notify TD about some events. I am
> not
> > clear about the possible future use cases, so I cannot comment on frequency
> of
> > its use.
> >
> > Jun, any comments?
> >
>
> The GHCI spec was not just clear, and we’ll update the spec, for example:
>
> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
> ...
> From:
>
> “The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at
> the requested-interrupt vector into the TD via the posted-interrupt descriptor. “
>
> To:
>
> “The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at
> the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL
> <SetupEventNotifyInterrupt>, via the posted-interrupt descriptor. “
>
> ---
> Jun

2022-06-27 11:24:55

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
> Thank you, Jun.
>
> Yes. I confirmed that we will include below change to GHCI.next spec.
>
> ================
> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>
> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>
> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>

Hi Sathy,

With this change, I don't think we should use system vector anymore. Instead,
we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.

--
Thanks,
-Kai


Subject: Re: [PATCH v8 0/5] Add TDX Guest Attestation support

Hi,

On 6/24/22 11:24 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> Following patches add the attestation support to TDX guest which
>> includes attestation user interface driver and related hypercall support.
>
> This is also the place where you lay out the roadmap:
>
> 1. Get a report
> 2. Get a quote
> 2a. Interrupt support because quotes take a long time
> 2b. Actual quote module calls and ABI
>
> Right? That seems worth a few sentences in the cover letter.

Ok. I will update the cover letter with brief introduction to
changes involved.

How about following?

In TDX guest, attestation process generally involves the following steps:

1. Get the TDREPORT using user specified REPORTDATA. This is implemented
   using TDG.MR.TDREPORT Module call. An IOCTL interface is added to let
   userspace get the TDREPORT data  (implemented in patch #1).
   
2. Using the TDREPORT data, generate a remotely verifiable signed Quote.
   Quote can be generated either using GetQuote hypercall or by communicating
   with VMM/Quoting Enclave(QE) using VSOCK. In this patch set, only the
GetQuote hypercall model is supported. Since Quote generation is an
asynchronous request, and takes more time, we let VMM notify the TDX Guest
using the callback interrupt. Patch # 2-5 implements Quote generation support,
in which Patch # 2 implements the callback interrupt support.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

Hi Dave,

Thanks for the review.

On 6/24/22 9:51 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> TDREPORT can only be verified on local platform as the MAC key is bound
>> to the platform. To support remote verification of the TDREPORT, TDX
>> leverages Intel SGX Quote Enclave (QE) to verify the TDREPORT locally
>> and convert it to a remote verifiable Quote.
>>
>> After getting the TDREPORT, the second step of the attestation process
>> is to send it to the QE to generate the Quote. TDX doesn't support SGX
>> inside the TD, so the QE can be deployed in the host, or in another
>> legacy VM with SGX support. How to send the TDREPORT to QE and receive
>> the Quote is implementation and deployment specific.
>
> On high-level comment: this whole series is quite agnostic about what is
> actually attested. On some level, I guess it doesn't matter. But, it
> makes me a bit uneasy. There needs to be at least *some* kind of claim
> about that somewhere, preferably a sentence in the cover letter and
> sentence or two here.

We are attesting that the given TD Guest is secure and trustworthy, and
the remote server requesting attestation can safely transfer the secrets.

Are you looking for something like the above?

Or you want to talk about the details involved (like TDREPORT hash
details)?

If it is not what you are looking for, then maybe I misunderstood your
question. Can you please elaborate?

>
> Second, how can someone test this code? It appears that they need to
> assemble a veritable Rube Goldberg machine. The least we could do is
> have a selftest that just calls the ioctl() and makes sure that
> something halfway sane comes out of it.

My initial submission included a test app. But I removed it later to
reduce the review load. I thought to submit the test app once feature
support patches are merged.

https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/

If you prefer, I can add it back to the next submission with the latest changes.

>
>> In such
>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>> compared to sending it via IOCTL.
>
> Huh? How is sysfs "insecure"?

REPORTDATA (input) we pass to the Module call might come from attestation
service as a per session unique ID. If it is shared via sysfs, there is
a possibility for untrusted software to read it and trigger some form of
reply attack. So in this context, sysfs is insecure compared to IOCTL
approach. You can find the related discussion in,

https://lore.kernel.org/lkml/[email protected]/

>
>> + /*
>> + * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>> + *
>> + * Get the TDREPORT using REPORTDATA as input. Refer to
>> + * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>> + * Specification for detailed information.
>> + */
>> + ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>> + virt_to_phys(reportdata), 0, 0, NULL);
>
> One of those 0's is a "Report sub type". Those bits in the module call
> are presumably because it won't *always* be zero. But, if there's ever
> a new report type, we'll need another user/kernel ABI. That doesn't
> seem great.
>
> The TDX module spec doesn't even give a name to "sub report type 0".
> That is not super helpful for deciding what it *means*.
>
> Right now, the entire ABI is basically:
>
> ret = ioctl(TDX_GET_REPORT, &buffer);
>
> That _implies_ a ton of stuff:
>
> 1. report sub type 0
> 2. a specific input length REPORTDATA
> 3. a specific output length
>
> I don't want to over-engineer this thing, but it would make a lot of
> sense to me to feed the ioctl() with something like this:
>
> struct tdx_report
> {
> u8 report_sub_type;
> u64 report_input_data;
> u64 report_input_data_len;
>
> u64 report_output_data;
> u64 report_output_data_len;
> }
>
> That makes *everything* explicit. It makes it utterly clear what goes
> where, *AND* it makes userspace declare it.

Ok. Parameter to handle subtype makes sense. But regarding the input and
output length, currently there is no indication in specification that it
will change and there is no way to pass it in the module call. So do you
still prefer to add parameters for it?

>
> But, you have:
>
>> +/**
>> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
>> + *
>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>> + * TDREPORT. Typically it can be some nonce provided by
>> + * attestation service, so the generated TDREPORT can be
>> + * uniquely verified.
>> + * @tdreport : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>> + * TDX_REPORT_LEN.
>> + *
>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>> + */
>> +struct tdx_report_req {
>> + union {
>> + __u8 reportdata[TDX_REPORTDATA_LEN];
>> + __u8 tdreport[TDX_REPORT_LEN];
>> + };
>> +};
>> +
>
> and a bunch of code copying in and out of this structure:
>
>> +static long tdx_get_report(void __user *argp)
>> +{
>> + void *reportdata = NULL, *tdreport = NULL;
> ...
>> + /* Copy REPORTDATA from the user buffer */
>> + if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
>> + ret = -EFAULT;
>> + goto out;
>> + }
>
> But none of that code even bothers to *use* the structure!

I have created that struct mainly to document the ABI details for the
userspace clarity (like input and output). Since the length of the input
and output is defined as macro, it worked fine for driver implementation.
I am fine with modifying the code to use it. Please let me know if you
prefer it.

>
> What's with the union? Are you really just trying to save 64 bytes of
> space? Is that worth it?

Makes sense. I will change it to separate variables.

>
> How many of these "drivers" are we going to need which are thinly veiled
> ioctl()s that are only TDX module call wrappers?

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support



On 6/27/22 4:21 AM, Kai Huang wrote:
> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>> Thank you, Jun.
>>
>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>
>> ================
>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>
>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>
>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>
>
> Hi Sathy,
>
> With this change, I don't think we should use system vector anymore. Instead,
> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
>

Yes. I will try your suggestion and make necessary changes in next version.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-06-27 17:47:32

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
>> On high-level comment: this whole series is quite agnostic about what is
>> actually attested. On some level, I guess it doesn't matter. But, it
>> makes me a bit uneasy. There needs to be at least *some* kind of claim
>> about that somewhere, preferably a sentence in the cover letter and
>> sentence or two here.
>
> We are attesting that the given TD Guest is secure and trustworthy, and
> the remote server requesting attestation can safely transfer the secrets.

What is *actually* attested?

> Are you looking for something like the above?

No, I'd like some actual facts, please. I want key examples of things
visible to an end user that might change that might determine if a guest
is "secure and trustworthy".

Does a kernel command-line parameter change attestation? How about the
disk image? The kernel image? initrd? Memory *size*? Devices?

Compare this mechanism to something that exists outside the TDX world.

>> Second, how can someone test this code? It appears that they need to
>> assemble a veritable Rube Goldberg machine. The least we could do is
>> have a selftest that just calls the ioctl() and makes sure that
>> something halfway sane comes out of it.
>
> My initial submission included a test app. But I removed it later to
> reduce the review load. I thought to submit the test app once feature
> support patches are merged.
>
> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
>
> If you prefer, I can add it back to the next submission with the latest changes.

I doubt anyone will ever run a "test app". Why not just make this a
selftest?

>>> In such
>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>> compared to sending it via IOCTL.
>>
>> Huh? How is sysfs "insecure"?
>
> REPORTDATA (input) we pass to the Module call might come from attestation
> service as a per session unique ID. If it is shared via sysfs, there is
> a possibility for untrusted software to read it and trigger some form of
> reply attack. So in this context, sysfs is insecure compared to IOCTL
> approach. You can find the related discussion in,
>
> https://lore.kernel.org/lkml/[email protected]/

I still don't get it.

How is a 400 sysfs file "insecure"? This sounds "if the filesystem
permissions are broken" paranoia. In other words, you're protecting
against a malicious root user.

In other words, I don't think the ABI here has been well thought out.

Also, this is basically a:

Inputs:

* nonce
* report type

Outputs:

* report

I have to wonder how hard it would be to have this be:

fd = open("/dev/foo");
ioctl(fd, REPORT, type, flags??);
write(fd, &inputs, inputs_len);
read(fd, &output, outputs_len);



>>> + /*
>>> + * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
>>> + *
>>> + * Get the TDREPORT using REPORTDATA as input. Refer to
>>> + * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
>>> + * Specification for detailed information.
>>> + */
>>> + ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
>>> + virt_to_phys(reportdata), 0, 0, NULL);
>>
>> One of those 0's is a "Report sub type". Those bits in the module call
>> are presumably because it won't *always* be zero. But, if there's ever
>> a new report type, we'll need another user/kernel ABI. That doesn't
>> seem great.
>>
>> The TDX module spec doesn't even give a name to "sub report type 0".
>> That is not super helpful for deciding what it *means*.
>>
>> Right now, the entire ABI is basically:
>>
>> ret = ioctl(TDX_GET_REPORT, &buffer);
>>
>> That _implies_ a ton of stuff:
>>
>> 1. report sub type 0
>> 2. a specific input length REPORTDATA
>> 3. a specific output length
>>
>> I don't want to over-engineer this thing, but it would make a lot of
>> sense to me to feed the ioctl() with something like this:
>>
>> struct tdx_report
>> {
>> u8 report_sub_type;
>> u64 report_input_data;
>> u64 report_input_data_len;
>>
>> u64 report_output_data;
>> u64 report_output_data_len;
>> }
>>
>> That makes *everything* explicit. It makes it utterly clear what goes
>> where, *AND* it makes userspace declare it.
>
> Ok. Parameter to handle subtype makes sense. But regarding the input and
> output length, currently there is no indication in specification that it
> will change and there is no way to pass it in the module call. So do you
> still prefer to add parameters for it?

It seems like a more flexible ABI to me. What you have here seems to be
a *PURE* passthrough of the module ABI.

>> But, you have:
>>
>>> +/**
>>> + * struct tdx_report_req: Get TDREPORT using REPORTDATA as input.
>>> + *
>>> + * @reportdata : User-defined 64-Byte REPORTDATA to be included into
>>> + * TDREPORT. Typically it can be some nonce provided by
>>> + * attestation service, so the generated TDREPORT can be
>>> + * uniquely verified.
>>> + * @tdreport : TDREPORT output from TDCALL[TDG.MR.REPORT] of size
>>> + * TDX_REPORT_LEN.
>>> + *
>>> + * Used in TDX_CMD_GET_REPORT IOCTL request.
>>> + */
>>> +struct tdx_report_req {
>>> + union {
>>> + __u8 reportdata[TDX_REPORTDATA_LEN];
>>> + __u8 tdreport[TDX_REPORT_LEN];
>>> + };
>>> +};
>>> +
>>
>> and a bunch of code copying in and out of this structure:
>>
>>> +static long tdx_get_report(void __user *argp)
>>> +{
>>> + void *reportdata = NULL, *tdreport = NULL;
>> ...
>>> + /* Copy REPORTDATA from the user buffer */
>>> + if (copy_from_user(reportdata, argp, TDX_REPORTDATA_LEN)) {
>>> + ret = -EFAULT;
>>> + goto out;
>>> + }
>>
>> But none of that code even bothers to *use* the structure!
>
> I have created that struct mainly to document the ABI details for the
> userspace clarity (like input and output). Since the length of the input
> and output is defined as macro, it worked fine for driver implementation.
> I am fine with modifying the code to use it. Please let me know if you
> prefer it.

Let's say I'm reviewing kernel code. I want to know what the ABI of
'reportdata' is. How do I find it?

Now, imagine that the code is:


struct tdx_report_req __user *user_rd = argp;
struct tdx_report_req *kern_rd;
...

if (copy_from_user(&kern_rd->reportdata,
&user_rd->reportdata,
sizeof(kern_rd->reportdata)) {


This tells me a *LOT*. It says that the user and kernel buffers are the
same ABI, and I'm also much more that the copy matches the data
structure sizes.

We don't put structures in kernel headers just for our health. We put
them so the kernel can use them.

>> What's with the union? Are you really just trying to save 64 bytes of
>> space? Is that worth it?
>
> Makes sense. I will change it to separate variables.
>
>> How many of these "drivers" are we going to need which are thinly veiled
>> ioctl()s that are only TDX module call wrappers?

This is actually a really big question. This code is obviously just
trying to do one thing very narrowly and not thinking about the future
at all. Can we please consider what the future will be like and try to
*architect* this instead of having the kernel just play a glorified game
of telephone?

2022-06-27 19:05:35

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 0/5] Add TDX Guest Attestation support

On 6/27/22 07:51, Sathyanarayanan Kuppuswamy wrote:
> In TDX guest, attestation process generally involves the following steps:
>
> 1. Get the TDREPORT using user specified REPORTDATA. This is implemented
>    using TDG.MR.TDREPORT Module call. An IOCTL interface is added to let
>    userspace get the TDREPORT data  (implemented in patch #1).
>    
> 2. Using the TDREPORT data, generate a remotely verifiable signed Quote.
>    Quote can be generated either using GetQuote hypercall or by communicating
>    with VMM/Quoting Enclave(QE) using VSOCK. In this patch set, only the
> GetQuote hypercall model is supported. Since Quote generation is an
> asynchronous request, and takes more time, we let VMM notify the TDX Guest
> using the callback interrupt. Patch # 2-5 implements Quote generation support,
> in which Patch # 2 implements the callback interrupt support.

IMNHO, too much gibberish, not enough English, too much superfluous
information.

For instance, why do we need a quote and a report? Why does this have
an interrupt?

Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

Hi,

On 6/27/22 10:24 AM, Dave Hansen wrote:
> On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:

>>> Second, how can someone test this code? It appears that they need to
>>> assemble a veritable Rube Goldberg machine. The least we could do is
>>> have a selftest that just calls the ioctl() and makes sure that
>>> something halfway sane comes out of it.
>>
>> My initial submission included a test app. But I removed it later to
>> reduce the review load. I thought to submit the test app once feature
>> support patches are merged.
>>
>> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
>>
>> If you prefer, I can add it back to the next submission with the latest changes.
>
> I doubt anyone will ever run a "test app". Why not just make this a
> selftest?

Fine with me. I can change it into selftest.

>
>>>> In such
>>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>>> compared to sending it via IOCTL.
>>>
>>> Huh? How is sysfs "insecure"?
>>
>> REPORTDATA (input) we pass to the Module call might come from attestation
>> service as a per session unique ID. If it is shared via sysfs, there is
>> a possibility for untrusted software to read it and trigger some form of
>> reply attack. So in this context, sysfs is insecure compared to IOCTL
>> approach. You can find the related discussion in,
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> I still don't get it.
>
> How is a 400 sysfs file "insecure"? This sounds "if the filesystem
> permissions are broken" paranoia. In other words, you're protecting
> against a malicious root user.

AFAIK, root is not the only user of the attestation interface. General users can
also use it (For example, in a use case like pytorch, attestation may be requested
by server before passing the training data). So if the permission is not "root
only", then other users or application in TD can access the sysfs file to read
its contents.

Also, the security concern mentioned is just an additional point. Following are
the main reasons for choosing IOCTL over sysfs.

1. Sysfs is generally used for config purposes. Not for request/response kind of
use case (Attestation falls under this category).
2. If we only use sysfs model for GetReport, the design might look like below:

/sys/kernel/../report/input
/sys/kernel/../report/subtype
/sys/kernel/../report/input_len
/sys/kernel/../report/output
/sys/kernel/../report/output_len
/sys/kernel/../report/status
/sys/kernel/../report/trigger

We might need similar files for GetQuote use case as well. IMO, such a design is
more complicated than using IOCTL.

>
> In other words, I don't think the ABI here has been well thought out.
>
> Also, this is basically a:
>
> Inputs:
>
> * nonce
> * report type
>
> Outputs:
>
> * report
>
> I have to wonder how hard it would be to have this be:
>
> fd = open("/dev/foo");
> ioctl(fd, REPORT, type, flags??);
> write(fd, &inputs, inputs_len);
> read(fd, &output, outputs_len);
>

It is not hard to implement it this way. But I don't see it being
very different from just using IOCTL. config/{read/write} model is
usually preferred for applications in which you have a requirement to do
multiple read/writes after one time config (use cases like serial
port read, printer write or reading temperature, etc). But in our case
we will mostly use it once after every config.

Also, splitting input over IOCTL and write system call will require us
to add additional code to store the state.

I have attempted a sample implementation (untested) for reference. But I
feel our current code is simpler. It handles allocation and input/output
validation in one place compared to spreading it across multiple handlers.

struct report_req {
int subtype;
void *reportdata;
int reportdata_len;
};

struct tdx_attest_req {
unsigned int cmd;
void *data;
};


static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
void __user *argp = (void __user *)arg;
struct tdx_attest_req *areq = file->private_data;
struct report_req *rep_data = NULL;
struct tdx_report_req user_req;
long ret = -EINVAL;

switch (cmd) {
case TDX_CMD_GET_REPORT:
/* Allocate space for TDREPORT request */
rep_data = kmalloc(sizeof(struct report_req), GFP_KERNEL);
if (!rep_data)
return -ENOMEM;

/* Copy user request data */
if (copy_from_user(&user_req, argp, sizeof(user_req))) {
kfree(rep_data);
return -EFAULT;
}

/* Copy user request details to areq->data */
rep_data->subtype = user_req.subtype;
areq->cmd = cmd;
areq->data = rep_data;

ret = 0;
break;
default:
pr_debug("cmd %d not supported\n", cmd);
break;
}

return ret;
}

static ssize_t tdx_attest_read(struct file *filp, char __user *buffer,
size_t length, loff_t *offset)
{
struct tdx_attest_req *areq = filp->private_data;
struct report_req *rep_data;
void *tdreport;
long ret;

if (!areq)
return -EINVAL;

switch (areq->cmd) {
case TDX_CMD_GET_REPORT:

/* Check for valid length and offset */
if (length != TDX_REPORT_LEN || offset != 0)
return -EINVAL;

rep_data = areq->data;

/* Make sure we have valid reportdata */
if (!rep_data->reportdata)
return -EINVAL;

/* Allocate space for output data */
tdreport = kzalloc(length, GFP_KERNEL);
if (!tdreport)
return -ENOMEM;
/*
* Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
*
* Get the TDREPORT using REPORTDATA as input. Refer to
* section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
* Specification for detailed information.
*/
ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
virt_to_phys(rep_data->reportdata), 0, 0, NULL);
if (ret) {
pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
kfree(tdreport);
return -EIO;
}

/* Copy REPORTDATA from the user buffer */
if (copy_to_user(buffer, tdreport, length)) {
kfree(tdreport);
return -EFAULT;
}

return length;
default:
pr_debug("cmd %d not supported\n", areq->cmd);
break;
}

return 0;
}

static ssize_t tdx_attest_write(struct file *filp, const char __user *buffer,
size_t length, loff_t *offset)
{
struct tdx_attest_req *areq = filp->private_data;
struct report_req *rep_data;

if (!areq)
return -EINVAL;

switch (areq->cmd) {
case TDX_CMD_GET_REPORT:

/* Check for valid length and offset */
if (length != TDX_REPORTDATA_LEN || offset != 0)
return -EINVAL;

rep_data = areq->data;

/* Allocate space to store REPORTDATA */
rep_data->reportdata = kzalloc(length, GFP_KERNEL);
if (!rep_data->reportdata)
return -ENOMEM;

/* Copy REPORTDATA from the user buffer */
if (copy_from_user(rep_data->reportdata, buffer, length)) {
kfree(rep_data->reportdata);
rep_data->reportdata = NULL;
return -EFAULT;
}

rep_data->reportdata_len = length;

return length;
default:
pr_debug("cmd %d not supported\n", areq->cmd);
break;
}

return 0;
}


static int tdx_attest_open(struct inode *inode, struct file *filp)
{
struct tdx_attest_req *areq;

/* Allocate space to track attestation request */
areq = kmalloc(sizeof(*areq), GFP_KERNEL);
if (!areq)
return -ENOMEM;

filp->private_data = areq;

return 0;
}

static int tdx_attest_release(struct inode *inode, struct file *filp)
{
kfree(filp->private_data);
filp->private_data = NULL;

return 0;
}

static const struct file_operations tdx_attest_fops = {
.owner = THIS_MODULE,
.open = tdx_attest_open,
.read = tdx_attest_read,
.write = tdx_attest_write,
.unlocked_ioctl = tdx_attest_ioctl,
.release = tdx_attest_release,
.llseek = no_llseek,
};

>>> How many of these "drivers" are we going to need which are thinly veiled
>>> ioctl()s that are only TDX module call wrappers?
>
> This is actually a really big question. This code is obviously just
> trying to do one thing very narrowly and not thinking about the future
> at all. Can we please consider what the future will be like and try to
> *architect* this instead of having the kernel just play a glorified game
> of telephone?

I am not very clear about other possible use cases.

Kirill/Kai/Isaku, Do you think we might need similar infrastructure for any
other TDX Module calls or TDVMCALLs?


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-05 13:06:32

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On Thu, 2022-06-30 at 16:50 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi,
>
> On 6/27/22 10:24 AM, Dave Hansen wrote:
> > On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
>
> > > > Second, how can someone test this code? It appears that they need to
> > > > assemble a veritable Rube Goldberg machine. The least we could do is
> > > > have a selftest that just calls the ioctl() and makes sure that
> > > > something halfway sane comes out of it.
> > >
> > > My initial submission included a test app. But I removed it later to
> > > reduce the review load. I thought to submit the test app once feature
> > > support patches are merged.
> > >
> > > https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
> > >
> > > If you prefer, I can add it back to the next submission with the latest changes.
> >
> > I doubt anyone will ever run a "test app". Why not just make this a
> > selftest?
>
> Fine with me. I can change it into selftest.
>
> >
> > > > > In such
> > > > > case, since REPORTDATA is a secret, using sysfs to share it is insecure
> > > > > compared to sending it via IOCTL.
> > > >
> > > > Huh? How is sysfs "insecure"?
> > >
> > > REPORTDATA (input) we pass to the Module call might come from attestation
> > > service as a per session unique ID. If it is shared via sysfs, there is
> > > a possibility for untrusted software to read it and trigger some form of
> > > reply attack. So in this context, sysfs is insecure compared to IOCTL
> > > approach. You can find the related discussion in,
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > I still don't get it.
> >
> > How is a 400 sysfs file "insecure"? This sounds "if the filesystem
> > permissions are broken" paranoia. In other words, you're protecting
> > against a malicious root user.
>
> AFAIK, root is not the only user of the attestation interface. General users can
> also use it (For example, in a use case like pytorch, attestation may be requested
> by server before passing the training data). So if the permission is not "root
> only", then other users or application in TD can access the sysfs file to read
> its contents.
>
> Also, the security concern mentioned is just an additional point. Following are
> the main reasons for choosing IOCTL over sysfs.
>
> 1. Sysfs is generally used for config purposes. Not for request/response kind of
> use case (Attestation falls under this category).
> 2. If we only use sysfs model for GetReport, the design might look like below:
>
> /sys/kernel/../report/input
> /sys/kernel/../report/subtype
> /sys/kernel/../report/input_len
> /sys/kernel/../report/output
> /sys/kernel/../report/output_len
> /sys/kernel/../report/status
> /sys/kernel/../report/trigger

I don't think you need all those if using /sysfs approach. You only need
'reportdata' and 'tdreport' to start with (see below my old reply).  

echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
cat /sys/kernel/coco/tdx/attest/tdreport

Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver
to call TDCALL to get the TDREPORT, which is available at 
'/sys/.../tdreport'.

You can add more (such as subtype) in the future if needed (and I doubt it will
ever happen) but this doesn't break any existing ABI. 'output/output_len' also
not needed, kernel can return the report with right size.

Btw, although the /sysfs may not be as secure as IOCTL -- as you mentioned
above, other programs with the same permission can get the TD report by reading
/sysfs and use it as a "reply attack" -- but I am not sure whether such
"potential reply attack" is a true issue or not. For instance, typically the
attestation service should already have established a secure TLS connection with
TD attestation agent before it provides the 'nonce' (reportdata), and the
attestation should reject the TD report from other connection, etc.

>
> We might need similar files for GetQuote use case as well. IMO, such a design is
> more complicated than using IOCTL.

Using /sysfs for TD report doesn't necessarily mean you must use /sysfs for
Quote. I don't think we should mixing them up. For instance, even if we use
/dev/xxx for getting TD report, we can use a separate device node for getting
the Quote:

/dev/tdreport
/dev/tdquote

I believe there should be pros/cons comparing to using single /dev/attest, but I
haven't thought this very carefully.


> >
> > In other words, I don't think the ABI here has been well thought out.
> >
> > Also, this is basically a:
> >
> > Inputs:
> >
> > * nonce
> > * report type
> >
> > Outputs:
> >
> > * report
> >
> > I have to wonder how hard it would be to have this be:
> >
> > fd = open("/dev/foo");
> > ioctl(fd, REPORT, type, flags??);
> > write(fd, &inputs, inputs_len);
> > read(fd, &output, outputs_len);

It looks like the kernel and userspace still need data structures to agree on
the input/output data format. I guess with this approach, we can start with
what we need now, and if we need more in the future, we define new data
structures for input and output?

> >
>
> It is not hard to implement it this way. But I don't see it being
> very different from just using IOCTL. config/{read/write} model is
> usually preferred for applications in which you have a requirement to do
> multiple read/writes after one time config (use cases like serial
> port read, printer write or reading temperature, etc). But in our case
> we will mostly use it once after every config.
>
> Also, splitting input over IOCTL and write system call will require us
> to add additional code to store the state.
>
> I have attempted a sample implementation (untested) for reference. But I
> feel our current code is simpler. It handles allocation and input/output
> validation in one place compared to spreading it across multiple handlers.
>
> struct report_req {
> int subtype;
> void *reportdata;
> int reportdata_len;
> };
>
> struct tdx_attest_req {
> unsigned int cmd;
> void *data;
> };

Is it supposed to be used for Quote too?

I dislike the idea of mixing up getting TD report and getting Quote (make TD
report and Quote both as a sub-commands, etc).

As we have adequately put, the new IOCTLs to support getting Quote isn't even
mandatory -- we just need some communication channel between TD attestation
agent and the QE, such as vsock.

>
>
> static long tdx_attest_ioctl(struct file *file, unsigned int cmd,
> unsigned long arg)
> {
> void __user *argp = (void __user *)arg;
> struct tdx_attest_req *areq = file->private_data;
> struct report_req *rep_data = NULL;
> struct tdx_report_req user_req;
> long ret = -EINVAL;
>
> switch (cmd) {
> case TDX_CMD_GET_REPORT:
> /* Allocate space for TDREPORT request */
> rep_data = kmalloc(sizeof(struct report_req), GFP_KERNEL);
> if (!rep_data)
> return -ENOMEM;
>
> /* Copy user request data */
> if (copy_from_user(&user_req, argp, sizeof(user_req))) {
> kfree(rep_data);
> return -EFAULT;
> }
>
> /* Copy user request details to areq->data */
> rep_data->subtype = user_req.subtype;
> areq->cmd = cmd;
> areq->data = rep_data;
>
> ret = 0;
> break;
> default:
> pr_debug("cmd %d not supported\n", cmd);
> break;
> }
>
> return ret;
> }
>
> static ssize_t tdx_attest_read(struct file *filp, char __user *buffer,
> size_t length, loff_t *offset)
> {
> struct tdx_attest_req *areq = filp->private_data;
> struct report_req *rep_data;
> void *tdreport;
> long ret;
>
> if (!areq)
> return -EINVAL;
>
> switch (areq->cmd) {
> case TDX_CMD_GET_REPORT:
>
> /* Check for valid length and offset */
> if (length != TDX_REPORT_LEN || offset != 0)
> return -EINVAL;
>
> rep_data = areq->data;
>
> /* Make sure we have valid reportdata */
> if (!rep_data->reportdata)
> return -EINVAL;
>
> /* Allocate space for output data */
> tdreport = kzalloc(length, GFP_KERNEL);
> if (!tdreport)
> return -ENOMEM;
> /*
> * Generate TDREPORT using "TDG.MR.REPORT" TDCALL.
> *
> * Get the TDREPORT using REPORTDATA as input. Refer to
> * section 22.3.3 TDG.MR.REPORT leaf in the TDX Module 1.0
> * Specification for detailed information.
> */
> ret = __tdx_module_call(TDX_GET_REPORT, virt_to_phys(tdreport),
> virt_to_phys(rep_data->reportdata), 0, 0, NULL);
> if (ret) {
> pr_debug("TDREPORT TDCALL failed, status:%lx\n", ret);
> kfree(tdreport);
> return -EIO;
> }
>
> /* Copy REPORTDATA from the user buffer */
> if (copy_to_user(buffer, tdreport, length)) {
> kfree(tdreport);
> return -EFAULT;
> }
>
> return length;
> default:
> pr_debug("cmd %d not supported\n", areq->cmd);
> break;
> }
>
> return 0;
> }
>
> static ssize_t tdx_attest_write(struct file *filp, const char __user *buffer,
> size_t length, loff_t *offset)
> {
> struct tdx_attest_req *areq = filp->private_data;
> struct report_req *rep_data;
>
> if (!areq)
> return -EINVAL;
>
> switch (areq->cmd) {
> case TDX_CMD_GET_REPORT:
>
> /* Check for valid length and offset */
> if (length != TDX_REPORTDATA_LEN || offset != 0)
> return -EINVAL;
>
> rep_data = areq->data;
>
> /* Allocate space to store REPORTDATA */
> rep_data->reportdata = kzalloc(length, GFP_KERNEL);
> if (!rep_data->reportdata)
> return -ENOMEM;
>
> /* Copy REPORTDATA from the user buffer */
> if (copy_from_user(rep_data->reportdata, buffer, length)) {
> kfree(rep_data->reportdata);
> rep_data->reportdata = NULL;
> return -EFAULT;
> }
>
> rep_data->reportdata_len = length;
>
> return length;
> default:
> pr_debug("cmd %d not supported\n", areq->cmd);
> break;
> }
>
> return 0;
> }
>
>
> static int tdx_attest_open(struct inode *inode, struct file *filp)
> {
> struct tdx_attest_req *areq;
>
> /* Allocate space to track attestation request */
> areq = kmalloc(sizeof(*areq), GFP_KERNEL);
> if (!areq)
> return -ENOMEM;
>
> filp->private_data = areq;
>
> return 0;
> }
>
> static int tdx_attest_release(struct inode *inode, struct file *filp)
> {
> kfree(filp->private_data);
> filp->private_data = NULL;
>
> return 0;
> }
>
> static const struct file_operations tdx_attest_fops = {
> .owner = THIS_MODULE,
> .open = tdx_attest_open,
> .read = tdx_attest_read,
> .write = tdx_attest_write,
> .unlocked_ioctl = tdx_attest_ioctl,
> .release = tdx_attest_release,
> .llseek = no_llseek,
> };
>
> > > > How many of these "drivers" are we going to need which are thinly veiled
> > > > ioctl()s that are only TDX module call wrappers?
> >
> > This is actually a really big question. This code is obviously just
> > trying to do one thing very narrowly and not thinking about the future
> > at all. Can we please consider what the future will be like and try to
> > *architect* this instead of having the kernel just play a glorified game
> > of telephone?
>
> I am not very clear about other possible use cases.
>
> Kirill/Kai/Isaku, Do you think we might need similar infrastructure for any
> other TDX Module calls or TDVMCALLs?
>
>

So far only attestation related TDCALL and TDVMCALL requires interaction to
userspace.

For attestation, conceptually, we need two "sets" of ABIs: 1) around getting the
TD report; 2) around getting the Quote.

For 1) we are discussing above.

For 2), currently we have only GetQuote TDVMCALL. It's very possible we will
need more sub-commands around Quote (not only get the Quote) -- logically, Quote
generation service defines Quote related commands anyway.  

Theoretically, we only need one TDVMCALL (or a fixed set of TDVMCALLs) for
sending/receiving data as a communication channel (as an alternative to vsock,
etc) to support any Quote related sub-commands, but it seems we are not heading
that way.

--
Thanks,
-Kai


Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

Hi,

On 7/5/22 5:07 AM, Kai Huang wrote:
> On Thu, 2022-06-30 at 16:50 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi,
>>
>> On 6/27/22 10:24 AM, Dave Hansen wrote:
>>> On 6/27/22 07:50, Sathyanarayanan Kuppuswamy wrote:
>>
>>>>> Second, how can someone test this code? It appears that they need to
>>>>> assemble a veritable Rube Goldberg machine. The least we could do is
>>>>> have a selftest that just calls the ioctl() and makes sure that
>>>>> something halfway sane comes out of it.
>>>>
>>>> My initial submission included a test app. But I removed it later to
>>>> reduce the review load. I thought to submit the test app once feature
>>>> support patches are merged.
>>>>
>>>> https://lore.kernel.org/all/9247fade9db5ae6eb183b2f92fdedb898282376a.1648664666.git.sathyanarayanan.kuppuswamy@intel.com/
>>>>
>>>> If you prefer, I can add it back to the next submission with the latest changes.
>>>
>>> I doubt anyone will ever run a "test app". Why not just make this a
>>> selftest?
>>
>> Fine with me. I can change it into selftest.
>>
>>>
>>>>>> In such
>>>>>> case, since REPORTDATA is a secret, using sysfs to share it is insecure
>>>>>> compared to sending it via IOCTL.
>>>>>
>>>>> Huh? How is sysfs "insecure"?
>>>>
>>>> REPORTDATA (input) we pass to the Module call might come from attestation
>>>> service as a per session unique ID. If it is shared via sysfs, there is
>>>> a possibility for untrusted software to read it and trigger some form of
>>>> reply attack. So in this context, sysfs is insecure compared to IOCTL
>>>> approach. You can find the related discussion in,
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> I still don't get it.
>>>
>>> How is a 400 sysfs file "insecure"? This sounds "if the filesystem
>>> permissions are broken" paranoia. In other words, you're protecting
>>> against a malicious root user.
>>
>> AFAIK, root is not the only user of the attestation interface. General users can
>> also use it (For example, in a use case like pytorch, attestation may be requested
>> by server before passing the training data). So if the permission is not "root
>> only", then other users or application in TD can access the sysfs file to read
>> its contents.
>>
>> Also, the security concern mentioned is just an additional point. Following are
>> the main reasons for choosing IOCTL over sysfs.
>>
>> 1. Sysfs is generally used for config purposes. Not for request/response kind of
>> use case (Attestation falls under this category).
>> 2. If we only use sysfs model for GetReport, the design might look like below:
>>
>> /sys/kernel/../report/input
>> /sys/kernel/../report/subtype
>> /sys/kernel/../report/input_len
>> /sys/kernel/../report/output
>> /sys/kernel/../report/output_len
>> /sys/kernel/../report/status
>> /sys/kernel/../report/trigger
>
> I don't think you need all those if using /sysfs approach. You only need
> 'reportdata' and 'tdreport' to start with (see below my old reply).  
>
> echo <REPORTDATA> > /sys/kernel/coco/tdx/attest/reportdata
> cat /sys/kernel/coco/tdx/attest/tdreport
>
> Each "echo <REPORTDATA>" to '/sys/.../reportdata' triggers the driver
> to call TDCALL to get the TDREPORT, which is available at 
> '/sys/.../tdreport'.
>
> You can add more (such as subtype) in the future if needed (and I doubt it will
> ever happen) but this doesn't break any existing ABI. 'output/output_len' also
> not needed, kernel can return the report with right size.

I have included *_len files for readability. You can get away with it. But I think
you still need a status file to understand whether TDCALL is successful or not.
Another way is, you can try to write -1 or something to tdreport file if failed.

Yes, there are many ways to design this. But I still think IOCTL is a better fit
for the request/response kind of use case.

I will let the maintainer decide whether you want to take this route.

>
> Btw, although the /sysfs may not be as secure as IOCTL -- as you mentioned
> above, other programs with the same permission can get the TD report by reading
> /sysfs and use it as a "reply attack" -- but I am not sure whether such
> "potential reply attack" is a true issue or not. For instance, typically the
> attestation service should already have established a secure TLS connection with
> TD attestation agent before it provides the 'nonce' (reportdata), and the
> attestation should reject the TD report from other connection, etc.

As I have mentioned, security issue is just an additional point. If this is not
very valid or real, I can remove it.

>
>>
>> We might need similar files for GetQuote use case as well. IMO, such a design is
>> more complicated than using IOCTL.
>
> Using /sysfs for TD report doesn't necessarily mean you must use /sysfs for
> Quote. I don't think we should mixing them up. For instance, even if we use
> /dev/xxx for getting TD report, we can use a separate device node for getting
> the Quote:
>
> /dev/tdreport
> /dev/tdquote
>
> I believe there should be pros/cons comparing to using single /dev/attest, but I
> haven't thought this very carefully.

Yes. There are pros and cons for different approaches. But I personally think
using one device and IOCTL ABI model is a simpler design for this use case. AMD also
uses a similar model for their attestation use case.

>
>
>>>
>>> In other words, I don't think the ABI here has been well thought out.
>>>
>>> Also, this is basically a:
>>>
>>> Inputs:
>>>
>>> * nonce
>>> * report type
>>>
>>> Outputs:
>>>
>>> * report
>>>
>>> I have to wonder how hard it would be to have this be:
>>>
>>> fd = open("/dev/foo");
>>> ioctl(fd, REPORT, type, flags??);
>>> write(fd, &inputs, inputs_len);
>>> read(fd, &output, outputs_len);
>
> It looks like the kernel and userspace still need data structures to agree on
> the input/output data format. I guess with this approach, we can start with
> what we need now, and if we need more in the future, we define new data
> structures for input and output?
>
>>>
>>
>> It is not hard to implement it this way. But I don't see it being
>> very different from just using IOCTL. config/{read/write} model is
>> usually preferred for applications in which you have a requirement to do
>> multiple read/writes after one time config (use cases like serial
>> port read, printer write or reading temperature, etc). But in our case
>> we will mostly use it once after every config.
>>
>> Also, splitting input over IOCTL and write system call will require us
>> to add additional code to store the state.
>>
>> I have attempted a sample implementation (untested) for reference. But I
>> feel our current code is simpler. It handles allocation and input/output
>> validation in one place compared to spreading it across multiple handlers.
>>
>> struct report_req {
>> int subtype;
>> void *reportdata;
>> int reportdata_len;
>> };
>>
>> struct tdx_attest_req {
>> unsigned int cmd;
>> void *data;
>> };
>
> Is it supposed to be used for Quote too?

Yes. It is to save the command request in kernel during IOCTL request.

>
> I dislike the idea of mixing up getting TD report and getting Quote (make TD
> report and Quote both as a sub-commands, etc).
>
> As we have adequately put, the new IOCTLs to support getting Quote isn't even
> mandatory -- we just need some communication channel between TD attestation
> agent and the QE, such as vsock.

Yes. There are other ways to get the Quote. But we have a requirement to support
TDVMCALL based method.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-05 19:30:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On 7/5/22 11:45, Sathyanarayanan Kuppuswamy wrote:
> Yes, there are many ways to design this. But I still think IOCTL is a better fit
> for the request/response kind of use case.

Are there any other similar ABIs in the kernel? What kind of mechanism
do they use?

Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver



On 7/5/22 11:52 AM, Dave Hansen wrote:
> On 7/5/22 11:45, Sathyanarayanan Kuppuswamy wrote:
>> Yes, there are many ways to design this. But I still think IOCTL is a better fit
>> for the request/response kind of use case.
>
> Are there any other similar ABIs in the kernel? What kind of mechanism
> do they use?

AMD also plans to add attestation support. It also uses IOCTL approach.

https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

SGX is a related feature. It also uses IOCTL approach for enclave provisioning.

arch/x86/kernel/cpu/sgx/ioctl.c

Other examples (not very related) are,

drivers/platform/chrome/cros_ec_chardev.c - It is an embedded controller driver which
has IOCTL support to read memory region from the device.
drivers/s390/crypto/pkey_api.c - It has IOCTL interfaces to read/write/generate crypto
keys.
drivers/crypto/ccp/sev-dev.c - sev_ioctl() has some IOCTL to set/get keys.
drivers/platform/x86/intel_scu_ipcutil.c - Uses IOCTL to read contents of registers.




--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-05 22:54:58

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
>
> arch/x86/kernel/cpu/sgx/ioctl.c

SGX isn't a good example here. The IOCTLs are used to create enclaves, but not
for attestation. SGX attestation relies on enclave itself to get/verify the
report, etc, so has no interaction with the kernel.

Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver



On 7/5/22 3:31 PM, Kai Huang wrote:
> On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
>> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
>>
>> arch/x86/kernel/cpu/sgx/ioctl.c
>
> SGX isn't a good example here. The IOCTLs are used to create enclaves, but not
> for attestation. SGX attestation relies on enclave itself to get/verify the
> report, etc, so has no interaction with the kernel.

If you are looking for an attestation specific example, you can only check the AMD
code.

https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c also implements attestation
IOCTL specific to KVM.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-06 23:16:06

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

On Wed, 2022-07-06 at 15:27 -0700, Sathyanarayanan Kuppuswamy wrote:
>
> On 7/5/22 3:31 PM, Kai Huang wrote:
> > On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
> > > SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
> > >
> > > arch/x86/kernel/cpu/sgx/ioctl.c
> >
> > SGX isn't a good example here. The IOCTLs are used to create enclaves, but not
> > for attestation. SGX attestation relies on enclave itself to get/verify the
> > report, etc, so has no interaction with the kernel.
>
> If you are looking for an attestation specific example, you can only check the AMD
> code.
>
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
> Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c also implements attestation
> IOCTL specific to KVM.
>

I think we only need to look at how attestation is implemented in kernel for
other vendors, so yes it would be helpful to look at AMD's implementation.
I'll probably also look at it when I have some time.

--
Thanks,
-Kai


Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

Hi Kai/Dave,

On 6/27/22 4:21 AM, Kai Huang wrote:
> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>> Thank you, Jun.
>>
>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>
>> ================
>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>
>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>
>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>
>
> Hi Sathy,
>
> With this change, I don't think we should use system vector anymore. Instead,
> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
>

Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
using irq_domain_alloc_irqs() call. Vector is allocated from
"x86_vector_domain" as Kai suggested.

Since I am not well versed in this area, I would like expert comments on it.
Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
QEMU and it works fine.


diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 928dcf7a20d9..dcc878546574 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -5,12 +5,16 @@
#define pr_fmt(fmt) "tdx: " fmt

#include <linux/cpufeature.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
+#include <linux/numa.h>
#include <asm/coco.h>
#include <asm/tdx.h>
#include <asm/vmx.h>
#include <asm/insn.h>
#include <asm/insn-eval.h>
#include <asm/pgtable.h>
+#include <asm/irqdomain.h>

/* TDX module Call Leaf IDs */
#define TDX_GET_INFO 1
@@ -19,6 +23,7 @@

/* TDX hypercall Leaf IDs */
#define TDVMCALL_MAP_GPA 0x10001
+#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004

/* MMIO direction */
#define EPT_READ 0
@@ -34,6 +39,26 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+/*
+ * Handler used to report notifications about
+ * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
+ * used only by the attestation driver. So, race condition
+ * with read/write operation is not considered.
+ */
+static void (*tdx_event_notify_handler)(void);
+
+/* Helper function to register tdx_event_notify_handler */
+void tdx_setup_ev_notify_handler(void (*handler)(void))
+{
+ tdx_event_notify_handler = handler;
+}
+
+/* Helper function to unregister tdx_event_notify_handler */
+void tdx_remove_ev_notify_handler(void)
+{
+ tdx_event_notify_handler = NULL;
+}
+
/*
* Wrapper for standard use of __tdx_hypercall with no output aside from
* return code.
@@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
}

+/*
+ * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
+ *
+ * @vector: Vector address to be used for notification.
+ *
+ * return 0 on success or failure error number.
+ */
+static long tdx_hcall_set_notify_intr(u8 vector)
+{
+ /* Minimum vector value allowed is 32 */
+ if (vector < 32)
+ return -EINVAL;
+
+ /*
+ * Register callback vector address with VMM. More details
+ * about the ABI can be found in TDX Guest-Host-Communication
+ * Interface (GHCI), sec titled
+ * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
+ */
+ if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
+ return -EIO;
+
+ return 0;
+}
+
static u64 get_cc_mask(void)
{
struct tdx_module_output out;
@@ -775,3 +825,52 @@ void __init tdx_early_init(void)

pr_info("Guest detected\n");
}
+
+static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
+{
+ tdx_event_notify_handler();
+ return IRQ_HANDLED;
+}
+
+static int __init tdx_late_init(void)
+{
+ struct irq_alloc_info info;
+ struct irq_cfg *cfg;
+ int evirq, cpu;
+
+ if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return 0;
+
+ if (!x86_vector_domain) {
+ pr_err("x86 vector domain is NULL\n");
+ return 0;
+ }
+
+ init_irq_alloc_info(&info, NULL);
+
+ evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
+
+ cpu = get_cpu();
+
+ irq_set_handler(evirq, handle_edge_irq);
+
+ /*
+ * Event notification vector will be delivered to the CPU
+ * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
+ * So set the IRQ affinity to the current CPU.
+ */
+ irq_set_affinity(evirq, cpumask_of(cpu));
+
+ if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
+ pr_err("Request event IRQ failed\n");
+
+ cfg = irq_cfg(evirq);
+
+ if (tdx_hcall_set_notify_intr(cfg->vector))
+ pr_err("Setting event notification interrupt failed\n");
+
+ put_cpu();
+
+ return 0;
+}
+late_initcall(tdx_late_init);
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 020c81a7c729..eb4db837cc44 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -67,6 +67,10 @@ void tdx_safe_halt(void);

bool tdx_early_handle_ve(struct pt_regs *regs);

+void tdx_setup_ev_notify_handler(void (*handler)(void));
+
+void tdx_remove_ev_notify_handler(void);
+
#else

static inline void tdx_early_init(void) { };


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-14 10:57:35

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support

On Wed, 2022-07-13 at 17:46 -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi Kai/Dave,
>
> On 6/27/22 4:21 AM, Kai Huang wrote:
> > On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
> > > Thank you, Jun.
> > >
> > > Yes. I confirmed that we will include below change to GHCI.next spec.
> > >
> > > ================
> > > 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
> > >
> > > From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
> > >
> > > To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
> > >
> >
> > Hi Sathy,
> >
> > With this change, I don't think we should use system vector anymore. Instead,
> > we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
> >
>
> Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
> using irq_domain_alloc_irqs() call. Vector is allocated from
> "x86_vector_domain" as Kai suggested.

I am not expert either. I said "idea only" in my first reply :)

>
> Since I am not well versed in this area, I would like expert comments on it.
> Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
> QEMU and it works fine.
>
>
> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> index 928dcf7a20d9..dcc878546574 100644
> --- a/arch/x86/coco/tdx/tdx.c
> +++ b/arch/x86/coco/tdx/tdx.c
> @@ -5,12 +5,16 @@
> #define pr_fmt(fmt) "tdx: " fmt
>
> #include <linux/cpufeature.h>
> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/numa.h>
> #include <asm/coco.h>
> #include <asm/tdx.h>
> #include <asm/vmx.h>
> #include <asm/insn.h>
> #include <asm/insn-eval.h>
> #include <asm/pgtable.h>
> +#include <asm/irqdomain.h>
>
> /* TDX module Call Leaf IDs */
> #define TDX_GET_INFO 1
> @@ -19,6 +23,7 @@
>
> /* TDX hypercall Leaf IDs */
> #define TDVMCALL_MAP_GPA 0x10001
> +#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004
>
> /* MMIO direction */
> #define EPT_READ 0
> @@ -34,6 +39,26 @@
> #define VE_GET_PORT_NUM(e) ((e) >> 16)
> #define VE_IS_IO_STRING(e) ((e) & BIT(4))
>
> +/*
> + * Handler used to report notifications about
> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> + * used only by the attestation driver. So, race condition
> + * with read/write operation is not considered.
> + */
> +static void (*tdx_event_notify_handler)(void);
> +
> +/* Helper function to register tdx_event_notify_handler */
> +void tdx_setup_ev_notify_handler(void (*handler)(void))
> +{
> + tdx_event_notify_handler = handler;
> +}
> +
> +/* Helper function to unregister tdx_event_notify_handler */
> +void tdx_remove_ev_notify_handler(void)
> +{
> + tdx_event_notify_handler = NULL;
> +}
> +

Looks it's weird that you still need it. Couldn't we pass the function to
handle GetQuote directly to request_irq()?

> /*
> * Wrapper for standard use of __tdx_hypercall with no output aside from
> * return code.
> @@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
> panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> }
>
> +/*
> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
> + *
> + * @vector: Vector address to be used for notification.
> + *
> + * return 0 on success or failure error number.
> + */
> +static long tdx_hcall_set_notify_intr(u8 vector)
> +{
> + /* Minimum vector value allowed is 32 */
> + if (vector < 32)
> + return -EINVAL;
> +
> + /*
> + * Register callback vector address with VMM. More details
> + * about the ABI can be found in TDX Guest-Host-Communication
> + * Interface (GHCI), sec titled
> + * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
> + */
> + if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
> + return -EIO;
> +
> + return 0;
> +}
> +
> static u64 get_cc_mask(void)
> {
> struct tdx_module_output out;
> @@ -775,3 +825,52 @@ void __init tdx_early_init(void)
>
> pr_info("Guest detected\n");
> }
> +
> +static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
> +{
> + tdx_event_notify_handler();
> + return IRQ_HANDLED;
> +}
> +
> +static int __init tdx_late_init(void)
> +{
> + struct irq_alloc_info info;
> + struct irq_cfg *cfg;
> + int evirq, cpu;
> +
> + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> + return 0;
> +
> + if (!x86_vector_domain) {
> + pr_err("x86 vector domain is NULL\n");
> + return 0;
> + }
> +
> + init_irq_alloc_info(&info, NULL);
> +
> + evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);

If I read correctly, if you set info->mask to cpumask_of(cpu), and pass it to
irq_domain_alloc_irqs(), the x86_vector_domain.alloc will immediately allocate a
vector on the given cpu, so you don't need to call irq_set_affinity() and wait
to allocate the vector on _this_ cpu until request_irq().

> +
> + cpu = get_cpu();
> +
> + irq_set_handler(evirq, handle_edge_irq);
> +
> + /*
> + * Event notification vector will be delivered to the CPU
> + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
> + * So set the IRQ affinity to the current CPU.
> + */
> + irq_set_affinity(evirq, cpumask_of(cpu));
> +
> + if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
> + pr_err("Request event IRQ failed\n");
> +
> + cfg = irq_cfg(evirq);
> +
> + if (tdx_hcall_set_notify_intr(cfg->vector))
> + pr_err("Setting event notification interrupt failed\n");
> +
> + put_cpu();

So the IRQ's affinity isn't kernel managed. Also looks it doesn't have anything
like IRQF_NOBALANCING to prevent it from being migrated. If I understand
correctly, userspace can still change the affinity which could cause the vector
being reallocated on another cpu?

Perhaps you can pass IRQF_NO_BALANCING to request_irq()?


--
Thanks,
-Kai


Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support



On 7/14/22 3:42 AM, Kai Huang wrote:
> On Wed, 2022-07-13 at 17:46 -0700, Sathyanarayanan Kuppuswamy wrote:
>> Hi Kai/Dave,
>>
>> On 6/27/22 4:21 AM, Kai Huang wrote:
>>> On Sat, 2022-06-25 at 15:35 +1200, Yao, Jiewen wrote:
>>>> Thank you, Jun.
>>>>
>>>> Yes. I confirmed that we will include below change to GHCI.next spec.
>>>>
>>>> ================
>>>> 3.5 TDG.VP.VMCALL<SetupEventNotifyInterrupt>
>>>>
>>>> From: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD via the posted-interrupt descriptor. "
>>>>
>>>> To: "The host VMM should use SEAMCALL [TDWRVPS] leaf to inject an interrupt at the requested-interrupt vector into the TD VCPU that executed TDG.VP.VMCALL <SetupEventNotifyInterrupt> via the posted-interrupt descriptor. "
>>>>
>>>
>>> Hi Sathy,
>>>
>>> With this change, I don't think we should use system vector anymore. Instead,
>>> we just need one non-migratable IRQ which has a fixed vector on a fixed cpu.
>>>
>>
>> Thanks. As suggested, I have attempted to allocate IRQ vector at runtime
>> using irq_domain_alloc_irqs() call. Vector is allocated from
>> "x86_vector_domain" as Kai suggested.
>
> I am not expert either. I said "idea only" in my first reply :)
>
>>
>> Since I am not well versed in this area, I would like expert comments on it.
>> Mainly for IRQ allocation logic in tdx_late_init(). I have tested this version using
>> QEMU and it works fine.
>>
>>
>> diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
>> index 928dcf7a20d9..dcc878546574 100644
>> --- a/arch/x86/coco/tdx/tdx.c
>> +++ b/arch/x86/coco/tdx/tdx.c
>> @@ -5,12 +5,16 @@
>> #define pr_fmt(fmt) "tdx: " fmt
>>
>> #include <linux/cpufeature.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/irq.h>
>> +#include <linux/numa.h>
>> #include <asm/coco.h>
>> #include <asm/tdx.h>
>> #include <asm/vmx.h>
>> #include <asm/insn.h>
>> #include <asm/insn-eval.h>
>> #include <asm/pgtable.h>
>> +#include <asm/irqdomain.h>
>>
>> /* TDX module Call Leaf IDs */
>> #define TDX_GET_INFO 1
>> @@ -19,6 +23,7 @@
>>
>> /* TDX hypercall Leaf IDs */
>> #define TDVMCALL_MAP_GPA 0x10001
>> +#define TDVMCALL_SETUP_NOTIFY_INTR 0x10004
>>
>> /* MMIO direction */
>> #define EPT_READ 0
>> @@ -34,6 +39,26 @@
>> #define VE_GET_PORT_NUM(e) ((e) >> 16)
>> #define VE_IS_IO_STRING(e) ((e) & BIT(4))
>>
>> +/*
>> + * Handler used to report notifications about
>> + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
>> + * used only by the attestation driver. So, race condition
>> + * with read/write operation is not considered.
>> + */
>> +static void (*tdx_event_notify_handler)(void);
>> +
>> +/* Helper function to register tdx_event_notify_handler */
>> +void tdx_setup_ev_notify_handler(void (*handler)(void))
>> +{
>> + tdx_event_notify_handler = handler;
>> +}
>> +
>> +/* Helper function to unregister tdx_event_notify_handler */
>> +void tdx_remove_ev_notify_handler(void)
>> +{
>> + tdx_event_notify_handler = NULL;
>> +}
>> +
>
> Looks it's weird that you still need it. Couldn't we pass the function to
> handle GetQuote directly to request_irq()?


Since event notification is not only specific to attestation, I did not want to
directly tie it to GetQuote handler.

Alternative approach is, to make this IRQ vector shared and let its users do
request_irq() and free_irq() directly.

Something like below?

--- a/arch/x86/coco/tdx/attest.c
+++ b/arch/x86/coco/tdx/attest.c
@@ -399,7 +399,7 @@ static const struct file_operations tdx_attest_fops = {

static int __init tdx_attestation_init(void)
{
- int ret;
+ int ret, irq;

/* Make sure we are in a valid TDX platform */
if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
@@ -422,6 +422,14 @@ static int __init tdx_attestation_init(void)
return ret;
}

+ irq = tdx_get_evirq();
+
+ if (irq < 0)
+ return -EIO;
+
+ if (request_irq(irq, quote_callback_handler, IRQF_NOBALANCING|IRQF_SHARED,
+ "tdx_quote", &miscdev))
+ pr_err("Register GetQuote IRQ handler failed\n");
return 0;
}
device_initcall(tdx_attestation_init)
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 2d09a7a4005e..e2e043fe488d 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -41,6 +41,13 @@
#define VE_GET_PORT_NUM(e) ((e) >> 16)
#define VE_IS_IO_STRING(e) ((e) & BIT(4))

+static int tdx_evirq = -1;
+
+int tdx_get_evirq(void)
+{
+ return tdx_evirq;
+}
+
/*
* Handler used to report notifications about
* TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
@@ -866,16 +873,16 @@ static int __init tdx_late_init(void)
*/
info.mask = cpumask_of(cpu);

- evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
+ tdx_evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);

- irq_set_handler(evirq, handle_edge_irq);
+ irq_set_handler(tdx_evirq, handle_edge_irq);

- cfg = irq_cfg(evirq);
+ cfg = irq_cfg(tdx_evirq);

if (tdx_hcall_set_notify_intr(cfg->vector))
pr_err("Setting event notification interrupt failed\n");

- if (request_irq(evirq, tdx_ev_handler, IRQF_NOBALANCING, "tdx_evirq", NULL))
- pr_err("Request event IRQ failed\n");


put_cpu();
diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index eb4db837cc44..79b5220f5f5d 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -71,12 +71,15 @@ void tdx_setup_ev_notify_handler(void (*handler)(void));

void tdx_remove_ev_notify_handler(void);

+int tdx_get_evirq(void);
+
#else

static inline void tdx_early_init(void) { };
static inline void tdx_safe_halt(void) { };

static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
+static inline int tdx_get_evirq(void) { return -1; };

#endif /* CONFIG_INTEL_TDX_GUEST */



>
>> /*
>> * Wrapper for standard use of __tdx_hypercall with no output aside from
>> * return code.
>> @@ -98,6 +123,31 @@ static inline void tdx_module_call(u64 fn, u64 rcx, u64 rdx, u64 r8, u64 r9,
>> panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>> }
>>
>> +/*
>> + * tdx_hcall_set_notify_intr() - Setup Event Notify Interrupt Vector.
>> + *
>> + * @vector: Vector address to be used for notification.
>> + *
>> + * return 0 on success or failure error number.
>> + */
>> +static long tdx_hcall_set_notify_intr(u8 vector)
>> +{
>> + /* Minimum vector value allowed is 32 */
>> + if (vector < 32)
>> + return -EINVAL;
>> +
>> + /*
>> + * Register callback vector address with VMM. More details
>> + * about the ABI can be found in TDX Guest-Host-Communication
>> + * Interface (GHCI), sec titled
>> + * "TDG.VP.VMCALL<SetupEventNotifyInterrupt>".
>> + */
>> + if (_tdx_hypercall(TDVMCALL_SETUP_NOTIFY_INTR, vector, 0, 0, 0))
>> + return -EIO;
>> +
>> + return 0;
>> +}
>> +
>> static u64 get_cc_mask(void)
>> {
>> struct tdx_module_output out;
>> @@ -775,3 +825,52 @@ void __init tdx_early_init(void)
>>
>> pr_info("Guest detected\n");
>> }
>> +
>> +static irqreturn_t tdx_ev_handler(int irq, void *dev_id)
>> +{
>> + tdx_event_notify_handler();
>> + return IRQ_HANDLED;
>> +}
>> +
>> +static int __init tdx_late_init(void)
>> +{
>> + struct irq_alloc_info info;
>> + struct irq_cfg *cfg;
>> + int evirq, cpu;
>> +
>> + if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>> + return 0;
>> +
>> + if (!x86_vector_domain) {
>> + pr_err("x86 vector domain is NULL\n");
>> + return 0;
>> + }
>> +
>> + init_irq_alloc_info(&info, NULL);
>> +
>> + evirq = irq_domain_alloc_irqs(x86_vector_domain, 1, NUMA_NO_NODE, &info);
>
> If I read correctly, if you set info->mask to cpumask_of(cpu), and pass it to
> irq_domain_alloc_irqs(), the x86_vector_domain.alloc will immediately allocate a
> vector on the given cpu, so you don't need to call irq_set_affinity() and wait
> to allocate the vector on _this_ cpu until request_irq().

Agreed. I moved it to info.mask.

>
>> +
>> + cpu = get_cpu();
>> +
>> + irq_set_handler(evirq, handle_edge_irq);
>> +
>> + /*
>> + * Event notification vector will be delivered to the CPU
>> + * in which TDVMCALL_SETUP_NOTIFY_INTR hypercall is requested.
>> + * So set the IRQ affinity to the current CPU.
>> + */
>> + irq_set_affinity(evirq, cpumask_of(cpu));
>> +
>> + if (request_irq(evirq, tdx_ev_handler, 0, "tdx_evirq", NULL))
>> + pr_err("Request event IRQ failed\n");
>> +
>> + cfg = irq_cfg(evirq);
>> +
>> + if (tdx_hcall_set_notify_intr(cfg->vector))
>> + pr_err("Setting event notification interrupt failed\n");
>> +
>> + put_cpu();
>
> So the IRQ's affinity isn't kernel managed. Also looks it doesn't have anything
> like IRQF_NOBALANCING to prevent it from being migrated. If I understand
> correctly, userspace can still change the affinity which could cause the vector
> being reallocated on another cpu?
>
> Perhaps you can pass IRQF_NO_BALANCING to request_irq()?

Makes sense. Tried it and it works. I will include this change.

>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-15 00:27:45

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 2/5] x86/tdx: Add TDX Guest event notify interrupt support


> > > +/*
> > > + * Handler used to report notifications about
> > > + * TDX_GUEST_EVENT_NOTIFY_VECTOR IRQ. Currently it will be
> > > + * used only by the attestation driver. So, race condition
> > > + * with read/write operation is not considered.
> > > + */
> > > +static void (*tdx_event_notify_handler)(void);
> > > +
> > > +/* Helper function to register tdx_event_notify_handler */
> > > +void tdx_setup_ev_notify_handler(void (*handler)(void))
> > > +{
> > > + tdx_event_notify_handler = handler;
> > > +}
> > > +
> > > +/* Helper function to unregister tdx_event_notify_handler */
> > > +void tdx_remove_ev_notify_handler(void)
> > > +{
> > > + tdx_event_notify_handler = NULL;
> > > +}
> > > +
> >
> > Looks it's weird that you still need it. Couldn't we pass the function to
> > handle GetQuote directly to request_irq()?
>
>
> Since event notification is not only specific to attestation, I did not want to
> directly tie it to GetQuote handler.
>
> Alternative approach is, to make this IRQ vector shared and let its users do
> request_irq() and free_irq() directly.

I think IRQF_SHARED is the logic. We don't need it now as for now we have only
one user, though.

The "multiple users" case may never happen, so doesn't justify the additional
'tdx_event_notify_handler' code at this moment.

>
> Something like below?
>
> --- a/arch/x86/coco/tdx/attest.c
> +++ b/arch/x86/coco/tdx/attest.c
> @@ -399,7 +399,7 @@ static const struct file_operations tdx_attest_fops = {
>
> static int __init tdx_attestation_init(void)
> {
> - int ret;
> + int ret, irq;
>
> /* Make sure we are in a valid TDX platform */
> if (!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> @@ -422,6 +422,14 @@ static int __init tdx_attestation_init(void)
> return ret;
> }
>
> + irq = tdx_get_evirq();

The function seems unnecessary. You can put 'extern int tdx_evtirq;' to
asm/tdx.h if needed and use it directly.

Subject: Re: [PATCH v8 1/5] x86/tdx: Add TDX Guest attestation interface driver

Hi Dave,

On 7/6/22 3:27 PM, Sathyanarayanan Kuppuswamy wrote:
>
>
> On 7/5/22 3:31 PM, Kai Huang wrote:
>> On Tue, 2022-07-05 at 14:21 -0700, Sathyanarayanan Kuppuswamy wrote:
>>> SGX is a related feature. It also uses IOCTL approach for enclave provisioning.
>>>
>>> arch/x86/kernel/cpu/sgx/ioctl.c
>>
>> SGX isn't a good example here. The IOCTLs are used to create enclaves, but not
>> for attestation. SGX attestation relies on enclave itself to get/verify the
>> report, etc, so has no interaction with the kernel.
>
> If you are looking for an attestation specific example, you can only check the AMD
> code.
>
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
> Also, sev_get_attestation_report() in arch/x86/kvm/svm/sev.c also implements attestation
> IOCTL specific to KVM.
>

As mentioned above, AMD also uses IOCTL model for attestation use case. So you agree
with us using similar approach here? Please let me know your comments.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-21 16:42:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> For shared buffer allocation, alternatives like using the DMA API is
> also considered. Although it simpler to use, it is not preferred because
> dma_alloc_*() APIs require a valid bus device as argument, which would
> need converting the attestation driver into a platform device driver.
> This is unnecessary, and since the attestation driver does not do real
> DMA, there is no need to use real DMA APIs.

Let's actually try to walk through the requirements for the memory
allocation here.

1. The guest kernel needs to allocate some guest physical memory
for the attestation data buffer
2. The guest physical memory must be mapped by the guest so that
it can be read/written.
3. The guest mapping must be a "TDX Shared" mapping. Since all
guest physical memory is "TDX Private" by default, something
must convert the memory from Private->Shared.
4. If there are alias mappings with "TDX Private" page table
permissions, those mappings must never be used while the page is
in its shared state.
4a. load_unaligned_zeropad() must be prevented from being used
on the page immediately preceding a Private alias to a Shared
page.
5. Actions that increasingly fracture the direct map must be avoided.
Attestation may happen many times and repeated allocations that
fracture the direct map have performance consequences.
6. A softer requirement: presuming that bounce buffers won't be used
for TDX devices *forever*, it would be nice to use a mechanism that
will continue to work on systems that don't have swiotlb on.

I think we've talked about three different solutions:

== vmalloc() ==

So, let's say we used a relatively plain vmalloc(). That's great for
#1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
properly on its PTEs. But, it falls over for *either* #4 or #5. If it
leaves the direct map alone, it's exposed to load_unaligned_zeropad().
If it unmaps the memory from the direct map, it runs afoul of #5.

== order-1 + vmap() ==

Let's now consider a vmalloc() variant: allocate a bunch of order-1
pages and vmap() page[1], leaving page[0] as a guard page against
load_unaligned_zeropad() on the direct map. That works, but it's an
annoying amount of code.

== swiotlb pages ==

Using the swiotlb bounce buffer pages is the other proposed option.
They already have a working kernel mapping and have already been
converted. They are mitigated against load_unaligned_zeropad(). They
do cause direct map fracturing, but only once since they're allocated
statically. They don't increasingly degrade things. It's a one-time
cost. Their interaction with #6 is not great.

Did I miss anything? Does that accurately capture where we are?

2022-07-21 17:03:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
> On 7/21/22 9:08 AM, Dave Hansen wrote:
>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>> For shared buffer allocation, alternatives like using the DMA API is
>>> also considered. Although it simpler to use, it is not preferred because
>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>> need converting the attestation driver into a platform device driver.
>>> This is unnecessary, and since the attestation driver does not do real
>>> DMA, there is no need to use real DMA APIs.
>> Let's actually try to walk through the requirements for the memory
>> allocation here.
>>
>> 1. The guest kernel needs to allocate some guest physical memory
>> for the attestation data buffer
> Physically contiguous memory.

Remind me how large the quote structures are.

2022-07-21 17:09:35

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/21/22 09:54, Sathyanarayanan Kuppuswamy wrote:
>
> On 7/21/22 9:49 AM, Dave Hansen wrote:
>> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>>> also considered. Although it simpler to use, it is not preferred because
>>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>>> need converting the attestation driver into a platform device driver.
>>>>> This is unnecessary, and since the attestation driver does not do real
>>>>> DMA, there is no need to use real DMA APIs.
>>>> Let's actually try to walk through the requirements for the memory
>>>> allocation here.
>>>>
>>>> 1. The guest kernel needs to allocate some guest physical memory
>>>> for the attestation data buffer
>>> Physically contiguous memory.
>> Remind me how large the quote structures are.
> It depends on the attestation service. In addition to TDREPORT (1K size),
> during quote generation, additional data can be included in the signed
> quote. So the spec allows variable length. User agent will communicate with
> attestation service to identify the appropriate buffer length. Our test
> uses 8K buffers

What is this "additional data"? Is that "REPORTDATA" from the TDX
module spec?

"Additional REPORTDATA, a 64-byte value, is provided by the
guest TD to be included in the TDG.MR.REPORT."


Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Dave,

On 7/21/22 9:08 AM, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>> For shared buffer allocation, alternatives like using the DMA API is
>> also considered. Although it simpler to use, it is not preferred because
>> dma_alloc_*() APIs require a valid bus device as argument, which would
>> need converting the attestation driver into a platform device driver.
>> This is unnecessary, and since the attestation driver does not do real
>> DMA, there is no need to use real DMA APIs.
>
> Let's actually try to walk through the requirements for the memory
> allocation here.
>
> 1. The guest kernel needs to allocate some guest physical memory
> for the attestation data buffer

Physically contiguous memory.

> 2. The guest physical memory must be mapped by the guest so that
> it can be read/written.
> 3. The guest mapping must be a "TDX Shared" mapping. Since all
> guest physical memory is "TDX Private" by default, something
> must convert the memory from Private->Shared.
> 4. If there are alias mappings with "TDX Private" page table
> permissions, those mappings must never be used while the page is
> in its shared state.
> 4a. load_unaligned_zeropad() must be prevented from being used
> on the page immediately preceding a Private alias to a Shared
> page.
> 5. Actions that increasingly fracture the direct map must be avoided.
> Attestation may happen many times and repeated allocations that
> fracture the direct map have performance consequences.
> 6. A softer requirement: presuming that bounce buffers won't be used
> for TDX devices *forever*, it would be nice to use a mechanism that
> will continue to work on systems that don't have swiotlb on.
>

Other than the above-mentioned correction, the rest of the requirements
are correct.

> I think we've talked about three different solutions:
>
> == vmalloc() ==
>
> So, let's say we used a relatively plain vmalloc(). That's great for
> #1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
> properly on its PTEs. But, it falls over for *either* #4 or #5. If it
> leaves the direct map alone, it's exposed to load_unaligned_zeropad().
> If it unmaps the memory from the direct map, it runs afoul of #5
Since we need physically contiguous memory, vmalloc is not preferred.

>
> == order-1 + vmap() ==
>
> Let's now consider a vmalloc() variant: allocate a bunch of order-1
> pages and vmap() page[1], leaving page[0] as a guard page against
> load_unaligned_zeropad() on the direct map. That works, but it's an
> annoying amount of code.
>
> == swiotlb pages ==
>
> Using the swiotlb bounce buffer pages is the other proposed option.
> They already have a working kernel mapping and have already been
> converted. They are mitigated against load_unaligned_zeropad(). They
> do cause direct map fracturing, but only once since they're allocated
> statically. They don't increasingly degrade things. It's a one-time
> cost. Their interaction with #6 is not great.
>
> Did I miss anything? Does that accurately capture where we are?



--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support



On 7/21/22 9:49 AM, Dave Hansen wrote:
> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>> also considered. Although it simpler to use, it is not preferred because
>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>> need converting the attestation driver into a platform device driver.
>>>> This is unnecessary, and since the attestation driver does not do real
>>>> DMA, there is no need to use real DMA APIs.
>>> Let's actually try to walk through the requirements for the memory
>>> allocation here.
>>>
>>> 1. The guest kernel needs to allocate some guest physical memory
>>> for the attestation data buffer
>> Physically contiguous memory.
>
> Remind me how large the quote structures are.

It depends on the attestation service. In addition to TDREPORT (1K size),
during quote generation, additional data can be included in the signed
quote. So the spec allows variable length. User agent will communicate with
attestation service to identify the appropriate buffer length. Our test
uses 8K buffers.

>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Dave,

On 7/21/22 10:02 AM, Dave Hansen wrote:
> On 7/21/22 09:54, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 7/21/22 9:49 AM, Dave Hansen wrote:
>>> On 7/21/22 09:42, Sathyanarayanan Kuppuswamy wrote:
>>>> On 7/21/22 9:08 AM, Dave Hansen wrote:
>>>>> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
>>>>>> For shared buffer allocation, alternatives like using the DMA API is
>>>>>> also considered. Although it simpler to use, it is not preferred because
>>>>>> dma_alloc_*() APIs require a valid bus device as argument, which would
>>>>>> need converting the attestation driver into a platform device driver.
>>>>>> This is unnecessary, and since the attestation driver does not do real
>>>>>> DMA, there is no need to use real DMA APIs.
>>>>> Let's actually try to walk through the requirements for the memory
>>>>> allocation here.
>>>>>
>>>>> 1. The guest kernel needs to allocate some guest physical memory
>>>>> for the attestation data buffer
>>>> Physically contiguous memory.
>>> Remind me how large the quote structures are.
>> It depends on the attestation service. In addition to TDREPORT (1K size),
>> during quote generation, additional data can be included in the signed
>> quote. So the spec allows variable length. User agent will communicate with
>> attestation service to identify the appropriate buffer length. Our test
>> uses 8K buffers
>
> What is this "additional data"? Is that "REPORTDATA" from the TDX
> module spec?
>
> "Additional REPORTDATA, a 64-byte value, is provided by the
> guest TD to be included in the TDG.MR.REPORT."


It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
REPORT generation process (using TDG.MR.REPORT).

Spec does not clearly define the details about the additional data. I think it is
related to certificates.

Isaku, do you have more details about it?

Current ABI allows attestation service and agent to decide the quote size. So
we can't make assumptions on what that size will be.

Following is copied from TDX Module specification, sec titled "Measurement
and Attestation"

An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
to sign a quote containing the guest TD’s measurements and the additional data being
quoted.

>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-21 17:56:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/21/22 10:16, Sathyanarayanan Kuppuswamy wrote:
>> What is this "additional data"? Is that "REPORTDATA" from the TDX
>> module spec?
>>
>> "Additional REPORTDATA, a 64-byte value, is provided by the
>> guest TD to be included in the TDG.MR.REPORT."
>
> It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
> REPORT generation process (using TDG.MR.REPORT).
>
> Spec does not clearly define the details about the additional data. I think it is
> related to certificates.
>
> Isaku, do you have more details about it?
>
> Current ABI allows attestation service and agent to decide the quote size. So
> we can't make assumptions on what that size will be.
>
> Following is copied from TDX Module specification, sec titled "Measurement
> and Attestation"
>
> An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
> uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
> TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
> to sign a quote containing the guest TD’s measurements and the additional data being
> quoted.

<sigh> We're off in the weeds again.

How many bytes does the TDX module read from and write to the guest
physical address space? What are the absolute limits? What is the
minimum and the maximum that the kernel needs to handle?

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Dave,

On 7/21/22 10:19 AM, Dave Hansen wrote:
> How many bytes does the TDX module read from and write to the guest
> physical address space? What are the absolute limits? What is the
> minimum and the maximum that the kernel needs to handle?

Minimum is 1K (equal to TDREPORT size, on input Quoting Enclave reads 1K TDREPORT
data from GPA)

Maximum size is not defined in the spec. I think for future compatibility, spec
does not limit the quote size. For most cases, I think 8K should be big enough.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-21 18:59:12

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Thu, Jul 21, 2022 at 10:19:30AM -0700,
Dave Hansen <[email protected]> wrote:

> On 7/21/22 10:16, Sathyanarayanan Kuppuswamy wrote:
> >> What is this "additional data"? Is that "REPORTDATA" from the TDX
> >> module spec?
> >>
> >> "Additional REPORTDATA, a 64-byte value, is provided by the
> >> guest TD to be included in the TDG.MR.REPORT."
> >
> > It is not the REPORTDATA. REPORTDATA is already included in the TDREPORT during
> > REPORT generation process (using TDG.MR.REPORT).
> >
> > Spec does not clearly define the details about the additional data. I think it is
> > related to certificates.
> >
> > Isaku, do you have more details about it?
> >
> > Current ABI allows attestation service and agent to decide the quote size. So
> > we can't make assumptions on what that size will be.
> >
> > Following is copied from TDX Module specification, sec titled "Measurement
> > and Attestation"
> >
> > An Intel SGX Quoting Enclave, written specifically to support quoting Intel TDX TDs,
> > uses a new ENCLU instruction leaf, EVERIFYREPORT2, to help check the integrity of the
> > TDG.MR.REPORT. If it passes, the Quoting Enclave can use a certified quote signing key
> > to sign a quote containing the guest TD’s measurements and the additional data being
> > quoted.
>
> <sigh> We're off in the weeds again.
>
> How many bytes does the TDX module read from and write to the guest
> physical address space? What are the absolute limits? What is the
> minimum and the maximum that the kernel needs to handle?

It's VMM (exactly, user space VMM like qemu in KVM case) that handles getquote
request and reads/writes the shared pages.
With KVM, there is no hard limit for user space VMM to access guest memory.

Regarding to the size of quote, there is no hard limit because it's for future
usage. But here is a rough idea about it.

- 4KB(1 page) can accommodate TDREPORT that is defined right now.
- 128KB is too small for foreseen use cases.
- 1MB is too big. big enough for future.

Thanks,
--
Isaku Yamahata <[email protected]>

2022-07-21 18:59:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/21/22 11:42, Isaku Yamahata wrote:
> Regarding to the size of quote, there is no hard limit because it's for future
> usage. But here is a rough idea about it.
>
> - 4KB(1 page) can accommodate TDREPORT that is defined right now.
> - 128KB is too small for foreseen use cases.
> - 1MB is too big. big enough for future.

Again, we're off in the weeds.

How does the VMM know how much to read/write? I have a theory: the spec
says that R12 is:

"Shared 4KB GPA as input – the memory contains a
TDREPORT_STRUCT."

That's *A* 4KB GPA. The maximum is one 4KB page. That's the only thing
that makes sense because there's no length in the ABI anywhere.

What am I missing?

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support



On 7/21/22 11:52 AM, Dave Hansen wrote:
> On 7/21/22 11:42, Isaku Yamahata wrote:
>> Regarding to the size of quote, there is no hard limit because it's for future
>> usage. But here is a rough idea about it.
>>
>> - 4KB(1 page) can accommodate TDREPORT that is defined right now.
>> - 128KB is too small for foreseen use cases.
>> - 1MB is too big. big enough for future.
>
> Again, we're off in the weeds.
>
> How does the VMM know how much to read/write? I have a theory: the spec
> says that R12 is:
>
> "Shared 4KB GPA as input – the memory contains a
> TDREPORT_STRUCT."
>
> That's *A* 4KB GPA. The maximum is one 4KB page. That's the only thing
> that makes sense because there's no length in the ABI anywhere.
>
> What am I missing?

I think you are looking into the old spec. Please check the version
"FEBRUARY 2022"

Following are the ABI details:

R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
same buffer is used as output – the memory contains a TD Quote.
R13 - Size of shared GPA. The size must be 4KB-aligned.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-21 19:33:02

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
>> How does the VMM know how much to read/write? I have a theory: the spec
>> says that R12 is:
>>
>> "Shared 4KB GPA as input – the memory contains a
>> TDREPORT_STRUCT."
>>
>> That's *A* 4KB GPA. The maximum is one 4KB page. That's the only thing
>> that makes sense because there's no length in the ABI anywhere.
>>
>> What am I missing?
> I think you are looking into the old spec. Please check the version
> "FEBRUARY 2022"
>
> Following are the ABI details:
>
> R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
> R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
> same buffer is used as output – the memory contains a TD Quote.
> R13 - Size of shared GPA. The size must be 4KB-aligned.

Yeah, silly me. I assumed the ABI was stable and wouldn't be, you know,
adding and removing parameters.

I still don't know how this all works. You just said:

> Current ABI allows attestation service and agent to decide the quote size. So
> we can't make assumptions on what that size will be.

But, the guest *HAS* to make assumptions, right? It's allocating the
buffer and handing a pointer and size over to the host. It's also guest
*userspace*. In fact, this implementation *ABSOLUTELY* makes
assumptions about the buffer size.

If host userspace some day decides it needs 5MB of space, then all the
guests will just stop working. This implementation is limited by the
max page allocator size.

This all just seems to work by chance.

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Dave,

On 7/21/22 12:23 PM, Dave Hansen wrote:
> On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
>>> How does the VMM know how much to read/write? I have a theory: the spec
>>> says that R12 is:
>>>
>>> "Shared 4KB GPA as input – the memory contains a
>>> TDREPORT_STRUCT."
>>>
>>> That's *A* 4KB GPA. The maximum is one 4KB page. That's the only thing
>>> that makes sense because there's no length in the ABI anywhere.
>>>
>>> What am I missing?
>> I think you are looking into the old spec. Please check the version
>> "FEBRUARY 2022"
>>
>> Following are the ABI details:
>>
>> R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
>> R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
>> same buffer is used as output – the memory contains a TD Quote.
>> R13 - Size of shared GPA. The size must be 4KB-aligned.
>
> Yeah, silly me. I assumed the ABI was stable and wouldn't be, you know,
> adding and removing parameters.
>
> I still don't know how this all works. You just said:

At a high level, the quote generation communication flow is as below:

Entities involved are,

Guest TD user app (send TDREPORT) <-> Quoting Enclave (QE) (Sign TDREPORT and send Quote)

Steps involved are,

1. Attestation agent (in TD userspace)  will get the TDREPORT data from
   the TDX module.
2. To generate a remotely verifiable Quote, send the TDREPORT data to the
   Quoting enclave (QE). QE will verify the integrity of TDREPORT and sign
   it with the attestation key.
   * QE can be hosted as simple app in the host, or it can be hosted in a special
     TD guest.
3. Method of sending TDREPORT to QE depends on the QE requirements. In most
   cases, it will support TCP/IP or vsock communication models. So the attestation
   agent can easily send the TDREPORT via TCP/IP and get the Quote data.

But for platforms that do not want to support TCP/IP or socket communication,
TDX ABI defines a method of getting Quote using TDVMCALL. It is a less common
case. Our current discussion is related to this approach.

Entities involved in this approach are,

TD Guest user app <-> TD Guest kernel <-> VMM  <-> QE

Communication flow looks like below:

1. Attestation agent (in TD userspace)  will get the TDREPORT data from
   the TDX module.
2. Check with QE about the required quote size (it can be some form of
   agreement between QE and attestation agent).
3. Allocate space for Quote buffer and update the header details like
mentioned below and send the quote buffer and length details via
IOCTL

+/*
+ * Format of Quote data header. More details can be found in TDX
+ * Guest-Host Communication Interface (GHCI) for Intel TDX 1.0,
+ * section titled "TDG.VP.VMCALL<GetQuote>"
+ */
+struct tdx_quote_hdr {
+       /* Quote version, filled by TD */
+       __u64 version;
+       /* Status code of Quote request, filled by VMM */
+       __u64 status;
+       /* Length of TDREPORT, filled by TD */
+       __u32 in_len;
+       /* Length of Quote, filled by VMM */
+       __u32 out_len;
+       /* Actual Quote data or TDREPORT on input */
+       __u64 data[0];
+};
+
+/* struct tdx_quote_req: Request to generate TD Quote using TDREPORT
+ *
+ * @buf         : Pass user data that includes TDREPORT as input. Upon
+ *                successful completion of IOCTL, output is copied
+ *                back to the same buffer.
+ * @len         : Length of the Quote buffer.
+ */
+struct tdx_quote_req {
+       __u64 buf;
+       __u64 len;
+}

4. Once an IOCTL request is received, the kernel will allocate a Quote buffer
of the requested size, and use GetQuote hypercall to send this request to VMM.

   GetQuote (TDVMCALL type, kernel buffer physical addr, length of the buffer).
   
5. Upon hypercall request, VMM will further send the details to the QE. Once
QE processes this request and generates the Quote, it will update the status
details in the Quote header and copy the Quote back to the guest GPA. After
completing the request, it will also send the IRQ notification to the TD
Guest kernel. Event notification IRQ requirement is due to this step.
   
6. Once TD Guest kernel receives event notification, it will copy the
   contents of the Quote back to the user buffer and complete the IOCTL
request.

>
>> Current ABI allows attestation service and agent to decide the quote size. So
>> we can't make assumptions on what that size will be.
>
> But, the guest *HAS* to make assumptions, right? It's allocating the
> buffer and handing a pointer and size over to the host. It's also guest
> *userspace*. In fact, this implementation *ABSOLUTELY* makes
> assumptions about the buffer size.
>
> If host userspace some day decides it needs 5MB of space, then all the
> guests will just stop working. This implementation is limited by the
> max page allocator size.

Agree. We are assuming that the user agent will not request buffer that
exceeds the limit of the page allocator. But as Isaku mentioned, it is
expected that the Quote size will only be in KBs. I think the main
reason for ABI not limiting the size is, for flexibility and to
support future requirements.

Also, as mentioned above, this approach is not a common case. TCP/IP or
vsock models are generally supported and can be used for Quote generation.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-21 23:35:41

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Thu, 2022-07-21 at 12:23 -0700, Dave Hansen wrote:
> On 7/21/22 11:57, Sathyanarayanan Kuppuswamy wrote:
> > > How does the VMM know how much to read/write? I have a theory: the spec
> > > says that R12 is:
> > >
> > > "Shared 4KB GPA as input – the memory contains a
> > > TDREPORT_STRUCT."
> > >
> > > That's *A* 4KB GPA. The maximum is one 4KB page. That's the only thing
> > > that makes sense because there's no length in the ABI anywhere.
> > >
> > > What am I missing?
> > I think you are looking into the old spec. Please check the version
> > "FEBRUARY 2022"
> >
> > Following are the ABI details:
> >
> > R11 - TDG.VP.VMCALL< GetQuote > sub-function per Table 2-3
> > R12 - Shared GPA as input – the memory contains a TDREPORT_STRUCT. The
> > same buffer is used as output – the memory contains a TD Quote.
> > R13 - Size of shared GPA. The size must be 4KB-aligned.
>
> Yeah, silly me. I assumed the ABI was stable and wouldn't be, you know,
> adding and removing parameters.
>
> I still don't know how this all works. You just said:
>
> > Current ABI allows attestation service and agent to decide the quote size. So
> > we can't make assumptions on what that size will be.
>
> But, the guest *HAS* to make assumptions, right? It's allocating the
> buffer and handing a pointer and size over to the host. It's also guest
> *userspace*. In fact, this implementation *ABSOLUTELY* makes
> assumptions about the buffer size.
>
> If host userspace some day decides it needs 5MB of space, then all the
> guests will just stop working. This implementation is limited by the
> max page allocator size.
>
> This all just seems to work by chance.

The Intel's Quote format is pretty much defined in some spec. I don't know
whether the spec is public or not, though. But Quote by definition has Intel's
certificates embedded so the size is indeed variable -- even though in reality
it can be treated as fixed as Intel's certificates don't change often.

Intel's QGS (Quote Generation Service) once had an API to query Quote size but
it got removed somehow, and instead, Intel used hard-coded (8K or 16K I forgot)
buffer, which is big enough for now.

Also, theoretically, 3rd party can add more staff (i.e. their certificates) when
they deploy their own attestation services, so the Quote can even be 3rd party
dependent.

So in short, yes, we Quote size is variable and it is determined by userspace.
But in reality, 16K is big enough for foreseeable future.

When using the vsock, userspace just uses standard socket API to read data from
QGS. It has all the flexibility, for instance, it can read the header first
(which is couple of bytes and fixed size) and decode the exact Quote size. Then
it can allocate a large enough buffer and read once for all. How vsock in
kernel uses whatever shared buffer size is implemented by vsock and transparent
to the userspace.

But with GetQuote TDVMCALL we don't have the luxury. Userspace needs to tell a
big enough buffer to the kernel since the GetQuote must receive the entire Quote
at once.

That being said, ideally, what we need is a TDVMCALL based communication
channel, instead of bunches of TDVMCALLs with each being associated one specific
operation (i.e. GetQuote). But obviously this isn't the direction we are
heading.


--
Thanks,
-Kai


2022-07-21 23:40:02

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Thu, 2022-07-21 at 09:08 -0700, Dave Hansen wrote:
> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> > For shared buffer allocation, alternatives like using the DMA API is
> > also considered. Although it simpler to use, it is not preferred because
> > dma_alloc_*() APIs require a valid bus device as argument, which would
> > need converting the attestation driver into a platform device driver.
> > This is unnecessary, and since the attestation driver does not do real
> > DMA, there is no need to use real DMA APIs.
>
> Let's actually try to walk through the requirements for the memory
> allocation here.
>
> 1. The guest kernel needs to allocate some guest physical memory
> for the attestation data buffer
> 2. The guest physical memory must be mapped by the guest so that
> it can be read/written.
> 3. The guest mapping must be a "TDX Shared" mapping. Since all
> guest physical memory is "TDX Private" by default, something
> must convert the memory from Private->Shared.
> 4. If there are alias mappings with "TDX Private" page table
> permissions, those mappings must never be used while the page is
> in its shared state.
> 4a. load_unaligned_zeropad() must be prevented from being used
> on the page immediately preceding a Private alias to a Shared
> page.
> 5. Actions that increasingly fracture the direct map must be avoided.
> Attestation may happen many times and repeated allocations that
> fracture the direct map have performance consequences.
> 6. A softer requirement: presuming that bounce buffers won't be used
> for TDX devices *forever*, it would be nice to use a mechanism that
> will continue to work on systems that don't have swiotlb on.
>
> I think we've talked about three different solutions:
>
> == vmalloc() ==
>
> So, let's say we used a relatively plain vmalloc(). That's great for
> #1->#3 as long as the vmalloc() mapping gets the "TDX Shared" bit set
> properly on its PTEs. But, it falls over for *either* #4 or #5. If it
> leaves the direct map alone, it's exposed to load_unaligned_zeropad().
> If it unmaps the memory from the direct map, it runs afoul of #5.
>
> == order-1 + vmap() ==
>
> Let's now consider a vmalloc() variant: allocate a bunch of order-1
> pages and vmap() page[1], leaving page[0] as a guard page against
> load_unaligned_zeropad() on the direct map. That works, but it's an
> annoying amount of code.
>
> == swiotlb pages ==
>
> Using the swiotlb bounce buffer pages is the other proposed option.
> They already have a working kernel mapping and have already been
> converted. They are mitigated against load_unaligned_zeropad(). They
> do cause direct map fracturing, but only once since they're allocated
> statically. They don't increasingly degrade things. It's a one-time
> cost. Their interaction with #6 is not great.
>
> Did I miss anything? Does that accurately capture where we are?

We can also reserve a dedicated CMA, but Kirill didn't like it.

--
Thanks,
-Kai


2022-07-22 00:52:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> In TDX guest, the second stage in attestation process is to send the
> TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
> not support communication channels like vsock or TCP/IP, implement
> support to get TD Quote using hypercall. GetQuote hypercall can be used
> by the TD guest to request VMM facilitate the Quote generation via
> QE/QGS. More details about GetQuote hypercall can be found in TDX
> Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
> titled "TDG.VP.VMCALL<GetQuote>.

So, the quote portion of this is basically a bidirectional blob sender.
It's to send a blob between guest userspace to host userspace.

Do we *REALLY* need specific driver functionality for this? For
instance, is there no existing virtio device that can send blobs back
and forth?

2022-07-22 19:26:05

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Thu, Jul 21, 2022 at 05:27:08PM -0700,
Dave Hansen <[email protected]> wrote:

> On 6/8/22 19:52, Kuppuswamy Sathyanarayanan wrote:
> > In TDX guest, the second stage in attestation process is to send the
> > TDREPORT to QE/QGS to generate the TD Quote. For platforms that does
> > not support communication channels like vsock or TCP/IP, implement
> > support to get TD Quote using hypercall. GetQuote hypercall can be used
> > by the TD guest to request VMM facilitate the Quote generation via
> > QE/QGS. More details about GetQuote hypercall can be found in TDX
> > Guest-Host Communication Interface (GHCI) for Intel TDX 1.0, section
> > titled "TDG.VP.VMCALL<GetQuote>.
>
> So, the quote portion of this is basically a bidirectional blob sender.
> It's to send a blob between guest userspace to host userspace.
>
> Do we *REALLY* need specific driver functionality for this? For
> instance, is there no existing virtio device that can send blobs back
> and forth?

It's virtio-vsock. If virtio-vsock is available, the communication works.
However, some users would like to disable virtio-vsock on their environment for
some reasons. Even virtio at all. Especially for confidential computing use
case. It's their choice. It can't be assumed that virtio is available.

The goal is VMM-agnostic (but TDX-specific) interface for that.
--
Isaku Yamahata <[email protected]>

2022-07-22 19:59:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/22/22 12:05, Isaku Yamahata wrote:
>> So, the quote portion of this is basically a bidirectional blob sender.
>> It's to send a blob between guest userspace to host userspace.
>>
>> Do we *REALLY* need specific driver functionality for this? For
>> instance, is there no existing virtio device that can send blobs back
>> and forth?
> It's virtio-vsock. If virtio-vsock is available, the communication works.
> However, some users would like to disable virtio-vsock on their environment for
> some reasons. Even virtio at all. Especially for confidential computing use
> case. It's their choice. It can't be assumed that virtio is available.
>
> The goal is VMM-agnostic (but TDX-specific) interface for that.

You're basically saying that every confidential computing technology
should have its own host user <-> guest kernel <-> guest user ABI.
That's insanity. If we do this, we need *one* interface that says "talk
to the hypervisor" that's common for all hypervisors and hardware
vendors, or at least more than *one*.

We don't need a way to talk to hypervisors for Intel systems and another
for AMD and yet another on whatever.

2022-07-22 21:33:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/22/22 14:18, Sathyanarayanan Kuppuswamy wrote:
> For cases where your platform does not want to support or enable the generic
> interface (like vsock), isn't it better to have a fallback approach? I am not
> saying we should have such an ABI for all cases. But attestation is a must-have
> feature for the TDX guest, and we want to support it in all TD guest platforms.
> I think the GHCI ABI is added to meet this requirement.

This logic is basically: it's in the spec so it must be useful. I don't
buy that, sorry.

Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

+ Jun

On 7/22/22 12:13 PM, Dave Hansen wrote:
> On 7/22/22 12:05, Isaku Yamahata wrote:
>>> So, the quote portion of this is basically a bidirectional blob sender.
>>> It's to send a blob between guest userspace to host userspace.
>>>
>>> Do we *REALLY* need specific driver functionality for this? For
>>> instance, is there no existing virtio device that can send blobs back
>>> and forth?
>> It's virtio-vsock. If virtio-vsock is available, the communication works.
>> However, some users would like to disable virtio-vsock on their environment for
>> some reasons. Even virtio at all. Especially for confidential computing use
>> case. It's their choice. It can't be assumed that virtio is available.
>>
>> The goal is VMM-agnostic (but TDX-specific) interface for that.
>
> You're basically saying that every confidential computing technology
> should have its own host user <-> guest kernel <-> guest user ABI.
> That's insanity. If we do this, we need *one* interface that says "talk
> to the hypervisor" that's common for all hypervisors and hardware
> vendors, or at least more than *one*.
>
> We don't need a way to talk to hypervisors for Intel systems and another
> for AMD and yet another on whatever.

For cases where your platform does not want to support or enable the generic
interface (like vsock), isn't it better to have a fallback approach? I am not
saying we should have such an ABI for all cases. But attestation is a must-have
feature for the TDX guest, and we want to support it in all TD guest platforms.
I think the GHCI ABI is added to meet this requirement.

Jun/Isaku, if you are aware of the exact requirement for this hypercall, please
share it. Also let us know your comments on this topic.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-25 12:05:40

by Kai Huang

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On Fri, 2022-07-22 at 12:13 -0700, Dave Hansen wrote:
> On 7/22/22 12:05, Isaku Yamahata wrote:
> > > So, the quote portion of this is basically a bidirectional blob sender.
> > > It's to send a blob between guest userspace to host userspace.
> > >
> > > Do we *REALLY* need specific driver functionality for this? For
> > > instance, is there no existing virtio device that can send blobs back
> > > and forth?
> > It's virtio-vsock. If virtio-vsock is available, the communication works.
> > However, some users would like to disable virtio-vsock on their environment for
> > some reasons. Even virtio at all. Especially for confidential computing use
> > case. It's their choice. It can't be assumed that virtio is available.
> >
> > The goal is VMM-agnostic (but TDX-specific) interface for that.
>
> You're basically saying that every confidential computing technology
> should have its own host user <-> guest kernel <-> guest user ABI.
> That's insanity. If we do this, we need *one* interface that says "talk
> to the hypervisor" that's common for all hypervisors and hardware
> vendors, or at least more than *one*.
>
> We don't need a way to talk to hypervisors for Intel systems and another
> for AMD and yet another on whatever.

Maybe the GetQuote support can be a "Intel driver" with a dedicated Kconfig
option? Not all customers want it anyway, so having a Kconfig option can also
reduce code size/attack window.

--
Thanks,
-Kai


2022-07-25 21:28:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 7/25/22 13:19, Nakajima, Jun wrote:
> 3. Need to be available in minimal/early runtime environments,
> including pre-boot, e.g. guest BIOS, no user-space yet.

Jun, are we talking about the same thing here? This patch is for a
guest userspace -> guest kernel ABI. This facility is *FOR* userspace.
It can't possibly be used before userspace is running.

I'm horribly confused.

2022-07-25 21:28:43

by Nakajima, Jun

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

> On Jul 22, 2022, at 2:18 PM, Sathyanarayanan Kuppuswamy <[email protected]> wrote:
>
> + Jun
>
> On 7/22/22 12:13 PM, Dave Hansen wrote:
>> On 7/22/22 12:05, Isaku Yamahata wrote:
>>>> So, the quote portion of this is basically a bidirectional blob sender.
>>>> It's to send a blob between guest userspace to host userspace.
>>>>
>>>> Do we *REALLY* need specific driver functionality for this? For
>>>> instance, is there no existing virtio device that can send blobs back
>>>> and forth?
>>> It's virtio-vsock. If virtio-vsock is available, the communication works.
>>> However, some users would like to disable virtio-vsock on their environment for
>>> some reasons. Even virtio at all. Especially for confidential computing use
>>> case. It's their choice. It can't be assumed that virtio is available.
>>>
>>> The goal is VMM-agnostic (but TDX-specific) interface for that.
>>
>> You're basically saying that every confidential computing technology
>> should have its own host user <-> guest kernel <-> guest user ABI.
>> That's insanity. If we do this, we need *one* interface that says "talk
>> to the hypervisor" that's common for all hypervisors and hardware
>> vendors, or at least more than *one*.
>>
>> We don't need a way to talk to hypervisors for Intel systems and another
>> for AMD and yet another on whatever.
>
> For cases where your platform does not want to support or enable the generic
> interface (like vsock), isn't it better to have a fallback approach? I am not
> saying we should have such an ABI for all cases. But attestation is a must-have
> feature for the TDX guest, and we want to support it in all TD guest platforms.
> I think the GHCI ABI is added to meet this requirement.
>
> Jun/Isaku, if you are aware of the exact requirement for this hypercall, please
> share it. Also let us know your comments on this topic.
>

Yes, a quote is a blob, and there are special things with that because of the nature (i.e. the essential data for verification).
1. It’s small (e.g. 4KB or something like that).
2. One-time. It shouldn't change even if you repeat the request (GetQuote).
3. Need to be available in minimal/early runtime environments, including pre-boot, e.g. guest BIOS, no user-space yet.

In my view, getting a quote using virtio-vsock is overkill both for the host and the guest. The host may not want the guests to talk directly to the host because of security concerns.

This particular patch allows the guest user-space to get a quote for an attestation service, and it can be used for the first attestation of the guest, i.e. when the guest kernel has not been verified yet. It’s useful when getting the key for the guest confidential data, such as user data volume, before mounting the volume. Can we use virtio-vsock there? Yes, but, again, overkill and limited availability.

Since we don’t want to allow the user-space to use a hypercall (to get a quote), I think some sort of driver should be needed there. So, I think this driver is useful at least as a fallback when virtio-vsock is not available.


---
Jun


Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Jun,

On 7/25/22 2:56 PM, Nakajima, Jun wrote:
>
>> On Jul 25, 2022, at 1:23 PM, Hansen, Dave <[email protected]> wrote:
>>
>> On 7/25/22 13:19, Nakajima, Jun wrote:
>>> 3. Need to be available in minimal/early runtime environments,
>>> including pre-boot, e.g. guest BIOS, no user-space yet.
>>
>> Jun, are we talking about the same thing here? This patch is for a
>> guest userspace -> guest kernel ABI. This facility is *FOR* userspace.
>> It can't possibly be used before userspace is running.
>>
>> I'm horribly confused.
>
> I responded to one of Sathya’s questions, especially why we have the GetQuote in GHCI.
> And the hypervisor needs to implement that anyway because it doesn’t matter (or doesn’t know) whether the TD guest is running in BIOS, the kernel, or userspace. Of course, the facility in this patch is for userspace, but we don’t want to suggest to implement two different GetQuote code paths for guests, depending on the guest state, e.g. in the OS (kernel or userspace) or guest BIOS.

Ok. Since both host and QE need to support GetQuote hypercall to
handle attestation request from the BIOS, QE/host may want to use
the same communication model for requests from the guest user space
as well.

>
> ---
> Jun
>
>
>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

2022-07-25 23:08:32

by Nakajima, Jun

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support


> On Jul 25, 2022, at 1:23 PM, Hansen, Dave <[email protected]> wrote:
>
> On 7/25/22 13:19, Nakajima, Jun wrote:
>> 3. Need to be available in minimal/early runtime environments,
>> including pre-boot, e.g. guest BIOS, no user-space yet.
>
> Jun, are we talking about the same thing here? This patch is for a
> guest userspace -> guest kernel ABI. This facility is *FOR* userspace.
> It can't possibly be used before userspace is running.
>
> I'm horribly confused.

I responded to one of Sathya’s questions, especially why we have the GetQuote in GHCI.
And the hypervisor needs to implement that anyway because it doesn’t matter (or doesn’t know) whether the TD guest is running in BIOS, the kernel, or userspace. Of course, the facility in this patch is for userspace, but we don’t want to suggest to implement two different GetQuote code paths for guests, depending on the guest state, e.g. in the OS (kernel or userspace) or guest BIOS.

---
Jun




2022-08-09 06:32:27

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

To my opinion, virtio-vsock is a wonderful way to achieve the
comunication between the host and the guest under traditional scenario.

But from secuirty perspective, virtio-vsock itself may involve too much
of components, which may lead to considerable attack surfaces. In order
for virtio to work properly, the guest must maintain several vrings
placed in shared memory, which in turn depends on swiotlb in current
codebase. As this flow is NOT designed for confidential scenario, there
are some on-going/existing works on virtio hardening [1] [2] [3].

> Do we *REALLY* need specific driver functionality for this? For
> instance, is there no existing virtio device that can send blobs back
> and forth?

And we may indeed have use cases where virtio is not used at all, such
as when we consider Intel-TDX/AMD-SNP protected VMs as a SGX enclave
replacement. In such use cases, end users may expect a smaller
dependency on the GetQuote interface. It doesn't seem natural that the
GetQuote interface cannot be used because virtio-vsock is disabled (via
kernel compile options or the upcoming device-filter), since GetQuote
doesn't inherently depend on it.

Thanks,
Guorui

[1]
https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
[3]
https://lore.kernel.org/linux-iommu/[email protected]/

2022-11-21 02:45:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

On 11/20/22 18:04, Guorui Yu wrote:
> Kindly ping for any comments here?

No comments from me!

Seriously, this is from June. I honestly have no idea what this patch
is for or why you're asking about it. If it were important, I would
have expected a new post with a new cover letter telling me some time in
the past 5-6 months.

Oh, and btw, I haven't seen you reviewing any other x86 submissions
during that past 5-6 months either. I'm tend to be more likely to bump
things up near the front of the queue for folks that are helping to keep
the queue short and manageable.

2022-11-21 02:49:03

by Guorui Yu

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

Hi Dave,

Kindly ping for any comments here?

Sincerely,
Guorui

在 2022/8/9 14:20, Guorui Yu 写道:
> To my opinion, virtio-vsock is a wonderful way to achieve the
> comunication between the host and the guest under traditional scenario.
>
> But from secuirty perspective, virtio-vsock itself may involve too much
> of components, which may lead to considerable attack surfaces. In order
> for virtio to work properly, the guest must maintain several vrings
> placed in shared memory, which in turn depends on swiotlb in current
> codebase. As this flow is NOT designed for confidential scenario, there
> are some on-going/existing works on virtio hardening [1] [2] [3].
>
> > Do we *REALLY* need specific driver functionality for this?  For
> > instance, is there no existing virtio device that can send blobs back
> > and forth?
>
> And we may indeed have use cases where virtio is not used at all, such
> as when we consider Intel-TDX/AMD-SNP protected VMs as a SGX enclave
> replacement. In such use cases, end users may expect a smaller
> dependency on the GetQuote interface. It doesn't seem natural that the
> GetQuote interface cannot be used because virtio-vsock is disabled (via
> kernel compile options or the upcoming device-filter), since GetQuote
> doesn't inherently depend on it.
>
> Thanks,
> Guorui
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/[email protected]/
> [3]
> https://lore.kernel.org/linux-iommu/[email protected]/

2023-01-07 01:18:12

by Erdem Aktas

[permalink] [raw]
Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support

I apologize for bumping an old thread up and getting in the discussion
this late.

Bumping this thread because as Chong mentioned below, it is essential
for us to have TDG.VP.VMCALL<GetQuote> interface as we do not
enable/support virtio in many use cases.
Coming from the "Add TDX Guest Attestation support patch set [1]"
which does not have the Quote support as per Dave's review suggestion,
I am looking for what our options are here.

@Hansen, Dave : as far as I understand your concerns were:

>>Do we *REALLY* need specific driver functionality for this? For
>>instance, is there no existing virtio device that can send blobs back and forth?

As Chong and I said, we have to have a vsock alternative solution if
we want to provide TDX based products for our customers. Otherwise it
will really limit our ability to adopt this technology.

>> You're basically saying that every confidential computing technology
>> should have its own host user <-> guest kernel <-> guest user ABI.
>> That's insanity. If we do this, we need *one* interface that says "talk
>> to the hypervisor" that's common for all hypervisors and hardware
>> vendors, or at least more than *one*.

This is already happening, right? AMD has GHCB ABI while intel has
GHCI. Also there is already a guest driver and IOCTL interface for
TDREPORT, so would it be really that insane to have another IOCTL
interface for GetQuote?

I would like to get your opinion on how we can progress on this?

-Erdem

[1] https://lore.kernel.org/lkml/20221116223820.819090-1-sathyanarayanan.kuppuswamy@linux.intel.com/t/#m341e6475982318d74be160c09a1856ad87e184cf

> From: Chong Cai <[email protected]>
> Date: Fri, Jan 6, 2023 at 2:02 PM
> Subject: Re: [PATCH v8 5/5] x86/tdx: Add Quote generation support
> To: Nakajima, Jun <[email protected]>
> Cc: <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>, <[email protected]>
>
>
> Hi everyone,
>
>
> Sorry for being late in the discussion.
>
>
> We just want to add that we have use cases where virtio/vsock is not supported at all; therefore, getquote TDVMCALL support is required for us to adopt this technology.
>
>
> Is there anything we can help here?
>
>


On Sun, Nov 20, 2022 at 6:26 PM Dave Hansen <[email protected]> wrote:
>
> On 11/20/22 18:04, Guorui Yu wrote:
> > Kindly ping for any comments here?
>
> No comments from me!
>
> Seriously, this is from June. I honestly have no idea what this patch
> is for or why you're asking about it. If it were important, I would
> have expected a new post with a new cover letter telling me some time in
> the past 5-6 months.
>
> Oh, and btw, I haven't seen you reviewing any other x86 submissions
> during that past 5-6 months either. I'm tend to be more likely to bump
> things up near the front of the queue for folks that are helping to keep
> the queue short and manageable.