Nitro Enclaves (NE) is a new Amazon Elastic Compute Cloud (EC2) capability
that allows customers to carve out isolated compute environments within EC2
instances [1].
For example, an application that processes sensitive data and runs in a VM
can be separated from other applications running in the same VM. This
application then runs in a VM separate from the primary VM, namely an enclave.
An enclave runs alongside the VM that spawned it. This setup suits the needs of
low-latency applications. The resources that are allocated for the enclave, such
as memory and CPUs, are carved out of the primary VM. Each enclave is mapped to
a process running in the primary VM, which communicates with the NE driver via
an ioctl interface.
In this sense, there are two components:
1. An enclave abstraction process - a user space process running in the primary
VM guest that uses the provided ioctl interface of the NE driver to spawn an
enclave VM (that's 2 below).
There is a NE emulated PCI device exposed to the primary VM. The driver for this
new PCI device is included in the NE driver.
The ioctl logic is mapped to PCI device commands e.g. the NE_START_ENCLAVE ioctl
maps to an enclave start PCI command. The PCI device commands are then
translated into actions taken on the hypervisor side; that's the Nitro
hypervisor running on the host where the primary VM is running. The Nitro
hypervisor is based on core KVM technology.
2. The enclave itself - a VM running on the same host as the primary VM that
spawned it. Memory and CPUs are carved out of the primary VM and are dedicated
for the enclave VM. An enclave does not have persistent storage attached.
The memory regions carved out of the primary VM and given to an enclave need to
be 2 MiB / 1 GiB aligned, physically contiguous memory regions (or a multiple of
this size, e.g. 8 MiB). The memory can be allocated e.g. by using hugetlbfs from
user space [2][3]. The memory size for an enclave needs to be at least 64 MiB.
The enclave memory and CPUs need to be from the same NUMA node.
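The constraints above can be pre-checked in user space before a region is
handed to the driver. What follows is only a minimal sketch, not the driver's
logic: the helper names and the SKETCH_ constants are hypothetical, and it
assumes 2 MiB huge pages, so both the region size and its user space start
address are checked against 2 MiB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch; they mirror the cover letter text. */
#define SKETCH_MIB             (1024ULL * 1024ULL)
#define SKETCH_REGION_ALIGN    (2ULL * SKETCH_MIB)   /* assumes 2 MiB huge pages */
#define SKETCH_MIN_ENCLAVE_MEM (64ULL * SKETCH_MIB)  /* minimum enclave memory */

/* A region is plausible if its size is a non-zero multiple of 2 MiB and its
 * user space start address is 2 MiB aligned. */
bool sketch_region_ok(uint64_t userspace_addr, uint64_t memory_size)
{
	if (memory_size == 0 || memory_size % SKETCH_REGION_ALIGN != 0)
		return false;

	return userspace_addr % SKETCH_REGION_ALIGN == 0;
}

/* The sum of all regions set for an enclave has to reach at least 64 MiB. */
bool sketch_enclave_mem_ok(const uint64_t *region_sizes, size_t count)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < count; i++)
		total += region_sizes[i];

	return total >= SKETCH_MIN_ENCLAVE_MEM;
}
```

These checks cannot confirm physical contiguity or NUMA placement; those are
only known to the kernel, so the driver remains the authority.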
An enclave runs on dedicated cores. CPU 0 and its CPU siblings need to remain
available for the primary VM. A CPU pool has to be set for NE purposes by a
user with admin capability. See the cpu list section from the kernel
documentation [4] for the CPU pool format.
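To illustrate the pool format, here is a minimal sketch of a membership check
on a cpu list string such as "1,5" or "2-5,9". It is a simplified, hypothetical
helper: it only understands comma-separated CPU ids and "first-last" ranges,
not the full syntax described in [4], and it does not look at CPU siblings.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper for this sketch.
 * Returns 1 if @cpu is part of @list, 0 if it is not, -1 if @list is
 * malformed. Only "N" and "N-M" entries separated by commas are accepted.
 */
int sketch_cpu_list_contains(const char *list, unsigned long cpu)
{
	const char *p = list;
	int found = 0;

	while (*p != '\0') {
		unsigned long first, last;
		char *end;

		if (!isdigit((unsigned char)*p))
			return -1;
		first = strtoul(p, &end, 10);
		last = first;
		p = end;

		if (*p == '-') {
			p++;
			if (!isdigit((unsigned char)*p))
				return -1;
			last = strtoul(p, &end, 10);
			p = end;
			if (last < first)
				return -1;
		}

		if (cpu >= first && cpu <= last)
			found = 1;

		if (*p == ',') {
			p++;
			if (*p == '\0')
				return -1; /* trailing comma */
		} else if (*p != '\0') {
			return -1; /* unexpected character */
		}
	}

	return found;
}
```

A caller could use it to confirm that CPU 0 is not in a candidate pool before
writing the pool, e.g. sketch_cpu_list_contains("2-5", 0) reports that CPU 0 is
left for the primary VM.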
An enclave communicates with the primary VM via a local communication channel,
using virtio-vsock [5]. The primary VM has a virtio-pci vsock emulated device,
while the enclave VM has a virtio-mmio vsock emulated device. The vsock device
uses eventfd for signaling. The enclave VM sees the usual interfaces - local
APIC and IOAPIC - to get interrupts from the virtio-vsock device. The virtio-mmio
device is placed in memory below the typical 4 GiB.
The application that runs in the enclave needs to be packaged in an enclave
image together with the OS ( e.g. kernel, ramdisk, init ) that will run in the
enclave VM. The enclave VM has its own kernel and follows the standard Linux
boot protocol.
The kernel bzImage, the kernel command line and the ramdisk(s) are part of the
Enclave Image Format (EIF), together with an EIF header including metadata such
as the magic number, EIF version, image size and CRC.
Hash values are computed for the entire enclave image (EIF), the kernel and
ramdisk(s). These are used, for example, to check that the enclave image that is
loaded in the enclave VM is the one that was intended to be run.
These crypto measurements are included in a signed attestation document
generated by the Nitro Hypervisor and further used to prove the identity of the
enclave; KMS is an example of a service that NE is integrated with and that
checks the attestation document.
The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
init process in the enclave connects to the vsock CID of the primary VM and a
predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
used to check in the primary VM that the enclave has booted.
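The primary VM side of this boot check could look roughly like the sketch
below. It is an illustration under assumptions, not the sample shipped with the
series: the function names are hypothetical, the listener binds to
VMADDR_CID_ANY, and it expects exactly one byte from the enclave. The port
(9000) and heartbeat value (0xb7) are the ones named above.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

#define SKETCH_HEARTBEAT_PORT  9000
#define SKETCH_HEARTBEAT_VALUE 0xb7

/* Returns non-zero if the received byte is the expected heartbeat value. */
int sketch_heartbeat_is_valid(uint8_t value)
{
	return value == SKETCH_HEARTBEAT_VALUE;
}

/*
 * Listen on the heartbeat port, accept one connection and read one byte.
 * Returns 0 if the expected heartbeat was received, -1 otherwise.
 */
int sketch_wait_for_heartbeat(void)
{
	struct sockaddr_vm addr;
	uint8_t value = 0;
	int server_fd, client_fd;
	int rc = -1;

	server_fd = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (server_fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = VMADDR_CID_ANY;
	addr.svm_port = SKETCH_HEARTBEAT_PORT;

	if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(server_fd, 1) < 0)
		goto out;

	client_fd = accept(server_fd, NULL, NULL);
	if (client_fd < 0)
		goto out;

	if (read(client_fd, &value, sizeof(value)) == (ssize_t)sizeof(value) &&
	    sketch_heartbeat_is_valid(value))
		rc = 0;

	close(client_fd);
out:
	close(server_fd);
	return rc;
}
```

In practice the listening socket would need to be set up before the enclave is
started, so that the connection from the enclave init process is not missed.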
If the enclave VM crashes or gracefully exits, an interrupt event is received by
the NE driver. This event is sent further to the user space enclave process
running in the primary VM via a poll notification mechanism. Then the user space
enclave process can exit.
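A user space process could wait for this notification with poll() on the
enclave file descriptor, as in the minimal sketch below. The function name is
hypothetical, and treating POLLHUP / POLLERR as the exit event is an assumption
of this sketch; the text above does not state which poll event is reported.

```c
#include <poll.h>

/*
 * Hypothetical helper for this sketch.
 * Returns 1 if a hang-up / error event was reported on @enclave_fd,
 * 0 if the timeout expired or only other events were reported,
 * -1 if poll() itself failed (errno is set by poll()).
 */
int sketch_wait_for_enclave_exit(int enclave_fd, int timeout_ms)
{
	struct pollfd pfd = {
		.fd = enclave_fd,
		.events = POLLIN,
	};
	int rc;

	rc = poll(&pfd, 1, timeout_ms);
	if (rc < 0)
		return -1;
	if (rc == 0)
		return 0;

	/* POLLHUP and POLLERR are always reported, even if not requested. */
	return (pfd.revents & (POLLHUP | POLLERR)) ? 1 : 0;
}
```

A dedicated thread could call this in a loop with a timeout and let the main
flow clean up and exit once 1 is returned.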
Thank you.
Andra
[1] https://aws.amazon.com/ec2/nitro/nitro-enclaves/
[2] https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
[3] https://lwn.net/Articles/807108/
[4] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
[5] https://man7.org/linux/man-pages/man7/vsock.7.html
---
Patch Series Changelog
The patch series is built on top of v5.9-rc1.
GitHub repo branch for the latest version of the patch series:
* https://github.com/andraprs/linux/tree/ne-driver-upstream-v7
v6 -> v7
* Rebase on top of v5.9-rc1.
* Use the NE misc device parent field to get the NE PCI device.
* Update the naming and add more comments to clarify the logic of handling
full CPU cores and dedicating them to the enclave.
* Remove, for now, the dependency on ARM64 arch in Kconfig. x86 is currently
supported, with Arm to come afterwards. The NE kernel driver can currently be
built for the aarch64 arch.
* Clarify in the ioctls documentation that the return value is -1 and errno is
set on failure.
* Update the error code value for NE_ERR_INVALID_MEM_REGION_SIZE, as it reached
user space as value 25 (ENOTTY) instead of 515. Update the range of the NE
custom error code values so that it does not overlap with the ones defined in
include/linux/errno.h, although these are not propagated to user space.
* Update the documentation to include references to the NE PCI device id and
MMIO bar.
* Update check for duplicate user space memory regions to cover additional
possible scenarios.
* Calculate the number of threads per core and not use smp_num_siblings that is
x86 specific.
* v6: https://lore.kernel.org/lkml/[email protected]/
v5 -> v6
* Rebase on top of v5.8.
* Update documentation to kernel-doc format.
* Update sample to include the enclave image loading logic.
* Remove the ioctl to query API version.
* Check for invalid provided flags field via ioctl calls args.
* Check for duplicate provided user space memory regions.
* Check for aligned memory regions.
* Include, in the sample, usage info for NUMA-aware hugetlb config.
* v5: https://lore.kernel.org/lkml/[email protected]/
v4 -> v5
* Rebase on top of v5.8-rc5.
* Add more details about the ioctl calls usage e.g. error codes.
* Update the ioctl to set an enclave vCPU to not return a fd.
* Add specific NE error codes.
* Split the NE CPU pool in CPU cores cpumasks.
* Remove log on copy_from_user() / copy_to_user() failure.
* Release the reference to the NE PCI device on failure paths.
* Close enclave fd on copy_to_user() failure.
* Set empty string in case of invalid NE CPU pool sysfs value.
* Early exit on NE CPU pool setup if enclave(s) already running.
* Add more sanity checks for provided vCPUs e.g. maximum possible value.
* Split logic for checking if a vCPU is in pool / getting a vCPU from pool.
* Exit without unpinning the pages on NE PCI dev request failure.
* Add check for the memory region user space address alignment.
* Update the logic to set memory region to not have a hardcoded check for 2 MiB.
* Add arch dependency for Arm / x86.
* v4: https://lore.kernel.org/lkml/[email protected]/
v3 -> v4
* Rebase on top of v5.8-rc2.
* Add NE API version and the corresponding ioctl call.
* Add enclave / image load flags options.
* Decouple NE ioctl interface from KVM API.
* Remove the "packed" attribute and include padding in the NE data structures.
* Update documentation based on the changes from v4.
* Update sample to match the updates in v4.
* Remove the NE CPU pool init during NE kernel module loading.
* Set up the NE CPU pool at runtime via a sysfs file for the kernel parameter.
* Check if the enclave memory and CPUs are from the same NUMA node.
* Add minimum enclave memory size definition.
* v3: https://lore.kernel.org/lkml/[email protected]/
v2 -> v3
* Rebase on top of v5.7-rc7.
* Add changelog to each patch in the series.
* Remove "ratelimited" from the logs that are not in the ioctl call paths.
* Update static calls sanity checks.
* Remove file ops that do nothing for now.
* Remove GPL additional wording as SPDX-License-Identifier is already in place.
* v2: https://lore.kernel.org/lkml/[email protected]/
v1 -> v2
* Rebase on top of v5.7-rc6.
* Adapt codebase based on feedback from v1.
* Update ioctl number definition - major and minor.
* Add sample / documentation for the ioctl interface basic flow usage.
* Update cover letter to include more context on the NE overall.
* Add fix for the enclave / vcpu fd creation error cleanup path.
* Add fix reported by kbuild test robot <[email protected]>.
* v1: https://lore.kernel.org/lkml/[email protected]/
---
Andra Paraschiv (18):
nitro_enclaves: Add ioctl interface definition
nitro_enclaves: Define the PCI device interface
nitro_enclaves: Define enclave info for internal bookkeeping
nitro_enclaves: Init PCI device driver
nitro_enclaves: Handle PCI device command requests
nitro_enclaves: Handle out-of-band PCI device events
nitro_enclaves: Init misc device providing the ioctl interface
nitro_enclaves: Add logic for creating an enclave VM
nitro_enclaves: Add logic for setting an enclave vCPU
nitro_enclaves: Add logic for getting the enclave image load info
nitro_enclaves: Add logic for setting an enclave memory region
nitro_enclaves: Add logic for starting an enclave
nitro_enclaves: Add logic for terminating an enclave
nitro_enclaves: Add Kconfig for the Nitro Enclaves driver
nitro_enclaves: Add Makefile for the Nitro Enclaves driver
nitro_enclaves: Add sample for ioctl interface usage
nitro_enclaves: Add overview documentation
MAINTAINERS: Add entry for the Nitro Enclaves driver
Documentation/nitro_enclaves/ne_overview.rst | 87 +
.../userspace-api/ioctl/ioctl-number.rst | 5 +-
MAINTAINERS | 13 +
drivers/virt/Kconfig | 2 +
drivers/virt/Makefile | 2 +
drivers/virt/nitro_enclaves/Kconfig | 20 +
drivers/virt/nitro_enclaves/Makefile | 11 +
drivers/virt/nitro_enclaves/ne_misc_dev.c | 1648 +++++++++++++++++
drivers/virt/nitro_enclaves/ne_misc_dev.h | 99 +
drivers/virt/nitro_enclaves/ne_pci_dev.c | 606 ++++++
drivers/virt/nitro_enclaves/ne_pci_dev.h | 327 ++++
include/linux/nitro_enclaves.h | 11 +
include/uapi/linux/nitro_enclaves.h | 337 ++++
samples/nitro_enclaves/.gitignore | 2 +
samples/nitro_enclaves/Makefile | 16 +
samples/nitro_enclaves/ne_ioctl_sample.c | 850 +++++++++
16 files changed, 4035 insertions(+), 1 deletion(-)
create mode 100644 Documentation/nitro_enclaves/ne_overview.rst
create mode 100644 drivers/virt/nitro_enclaves/Kconfig
create mode 100644 drivers/virt/nitro_enclaves/Makefile
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.c
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.h
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.c
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.h
create mode 100644 include/linux/nitro_enclaves.h
create mode 100644 include/uapi/linux/nitro_enclaves.h
create mode 100644 samples/nitro_enclaves/.gitignore
create mode 100644 samples/nitro_enclaves/Makefile
create mode 100644 samples/nitro_enclaves/ne_ioctl_sample.c
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
The Nitro Enclaves driver handles the enclave lifetime management. This
includes enclave creation, termination and setting up its resources such
as memory and CPU.
An enclave runs alongside the VM that spawned it. It is abstracted as a
process running in the VM that launched it. The process interacts with
the NE driver, which exposes an ioctl interface for creating an enclave
and setting up its resources.
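As an illustration of the ioctl numbering used by this patch (type 0xAE,
sequence numbers 0x20-0x3F reserved for NE in ioctl-number.rst), the sketch
below decodes a command number and checks that it falls in that range. The
helper and the SKETCH_ names are hypothetical and are not part of the driver.

```c
#include <stdbool.h>
#include <linux/ioctl.h>

/* Values taken from the ioctl-number.rst entry added by this patch. */
#define SKETCH_NE_IOCTL_TYPE     0xAE
#define SKETCH_NE_IOCTL_NR_FIRST 0x20
#define SKETCH_NE_IOCTL_NR_LAST  0x3F

/* Returns true if @cmd uses the 0xAE type and an NE sequence number. */
bool sketch_is_ne_ioctl(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == SKETCH_NE_IOCTL_TYPE &&
	       _IOC_NR(cmd) >= SKETCH_NE_IOCTL_NR_FIRST &&
	       _IOC_NR(cmd) <= SKETCH_NE_IOCTL_NR_LAST;
}
```

For example, a number built as _IOR(0xAE, 0x20, __u64), the NE_CREATE_VM
definition, passes the check, while KVM numbers in 0x00-0x1F do not.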
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
Reviewed-by: Stefan Hajnoczi <[email protected]>
---
Changelog
v6 -> v7
* Clarify in the ioctls documentation that the return value is -1 and
errno is set on failure.
* Update the error code value for NE_ERR_INVALID_MEM_REGION_SIZE, as it
reached user space as value 25 (ENOTTY) instead of 515. Update the
range of the NE custom error code values so that it does not overlap
with the ones defined in include/linux/errno.h, although these are not
propagated to user space.
v5 -> v6
* Fix typo in the description about the NE CPU pool.
* Update documentation to kernel-doc format.
* Remove the ioctl to query API version.
v4 -> v5
* Add more details about the ioctl calls usage e.g. error codes, file
descriptors used.
* Update the ioctl to set an enclave vCPU to not return a file
descriptor.
* Add specific NE error codes.
v3 -> v4
* Decouple NE ioctl interface from KVM API.
* Add NE API version and the corresponding ioctl call.
* Add enclave / image load flags options.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* Add ioctl for getting enclave image load metadata.
* Update NE_ENCLAVE_START ioctl name to NE_START_ENCLAVE.
* Add entry in Documentation/userspace-api/ioctl/ioctl-number.rst for NE
ioctls.
* Update NE ioctls definition based on the updated ioctl range for major
and minor.
---
.../userspace-api/ioctl/ioctl-number.rst | 5 +-
include/linux/nitro_enclaves.h | 11 +
include/uapi/linux/nitro_enclaves.h | 337 ++++++++++++++++++
3 files changed, 352 insertions(+), 1 deletion(-)
create mode 100644 include/linux/nitro_enclaves.h
create mode 100644 include/uapi/linux/nitro_enclaves.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 2a198838fca9..5f7ff00f394e 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -328,8 +328,11 @@ Code Seq# Include File Comments
0xAC 00-1F linux/raw.h
0xAD 00 Netfilter device in development:
<mailto:[email protected]>
-0xAE all linux/kvm.h Kernel-based Virtual Machine
+0xAE 00-1F linux/kvm.h Kernel-based Virtual Machine
<mailto:[email protected]>
+0xAE 40-FF linux/kvm.h Kernel-based Virtual Machine
+ <mailto:[email protected]>
+0xAE 20-3F linux/nitro_enclaves.h Nitro Enclaves
0xAF 00-1F linux/fsl_hypervisor.h Freescale hypervisor
0xB0 all RATIO devices in development:
<mailto:[email protected]>
diff --git a/include/linux/nitro_enclaves.h b/include/linux/nitro_enclaves.h
new file mode 100644
index 000000000000..d91ef2bfdf47
--- /dev/null
+++ b/include/linux/nitro_enclaves.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#ifndef _LINUX_NITRO_ENCLAVES_H_
+#define _LINUX_NITRO_ENCLAVES_H_
+
+#include <uapi/linux/nitro_enclaves.h>
+
+#endif /* _LINUX_NITRO_ENCLAVES_H_ */
diff --git a/include/uapi/linux/nitro_enclaves.h b/include/uapi/linux/nitro_enclaves.h
new file mode 100644
index 000000000000..1f81aa9f94bb
--- /dev/null
+++ b/include/uapi/linux/nitro_enclaves.h
@@ -0,0 +1,337 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#ifndef _UAPI_LINUX_NITRO_ENCLAVES_H_
+#define _UAPI_LINUX_NITRO_ENCLAVES_H_
+
+#include <linux/types.h>
+
+/**
+ * DOC: Nitro Enclaves (NE) Kernel Driver Interface
+ */
+
+/**
+ * NE_CREATE_VM - The command is used to create a slot that is associated with
+ * an enclave VM.
+ * The generated unique slot id is an output parameter.
+ * The ioctl can be invoked on the /dev/nitro_enclaves fd, before
+ * setting any resources, such as memory and vCPUs, for an
+ * enclave. Memory and vCPUs are set for the slot mapped to an enclave.
+ * A NE CPU pool has to be set before calling this function. The
+ * pool can be set after the NE driver load, using
+ * /sys/module/nitro_enclaves/parameters/ne_cpus.
+ * Its format is detailed in the cpu-lists section:
+ * https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
+ * CPU 0 and its siblings have to remain available for the
+ * primary / parent VM, so they cannot be set for enclaves. Full
+ * CPU core(s), from the same NUMA node, need(s) to be included
+ * in the CPU pool.
+ *
+ * Context: Process context.
+ * Return:
+ * * Enclave file descriptor - Enclave file descriptor used with
+ * ioctl calls to set vCPUs and memory
+ * regions, then start the enclave.
+ * * -1 - There was a failure in the ioctl logic.
+ * On failure, errno is set to:
+ * * EFAULT - copy_to_user() failure.
+ * * ENOMEM - Memory allocation failure for internal
+ * bookkeeping variables.
+ * * NE_ERR_NO_CPUS_AVAIL_IN_POOL - No NE CPU pool set / no CPUs available
+ * in the pool.
+ * * Error codes from get_unused_fd_flags() and anon_inode_getfile().
+ * * Error codes from the NE PCI device request.
+ */
+#define NE_CREATE_VM _IOR(0xAE, 0x20, __u64)
+
+/**
+ * NE_ADD_VCPU - The command is used to set a vCPU for an enclave. The vCPU can
+ * be auto-chosen from the NE CPU pool or it can be set by the
+ * caller, with the note that it needs to be available in the NE
+ * CPU pool. Full CPU core(s), from the same NUMA node, need(s) to
+ * be associated with an enclave.
+ * The vCPU id is an input / output parameter. If its value is 0,
+ * then a CPU is chosen from the enclave CPU pool and returned via
+ * this parameter.
+ * The ioctl can be invoked on the enclave fd, before an enclave
+ * is started.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 - Logic successfully completed.
+ * * -1 - There was a failure in the ioctl logic.
+ * On failure, errno is set to:
+ * * EFAULT - copy_from_user() / copy_to_user() failure.
+ * * ENOMEM - Memory allocation failure for internal
+ * bookkeeping variables.
+ * * EIO - Current task mm is not the same as the one
+ * that created the enclave.
+ * * NE_ERR_NO_CPUS_AVAIL_IN_POOL - No CPUs available in the NE CPU pool.
+ * * NE_ERR_VCPU_ALREADY_USED - The provided vCPU is already used.
+ * * NE_ERR_VCPU_NOT_IN_CPU_POOL - The provided vCPU is not available in the
+ * NE CPU pool.
+ * * NE_ERR_VCPU_INVALID_CPU_CORE - The core id of the provided vCPU is invalid
+ * or out of range.
+ * * NE_ERR_NOT_IN_INIT_STATE - The enclave is not in init state
+ * (init = before being started).
+ * * NE_ERR_INVALID_VCPU - The provided vCPU is not in the available
+ * CPUs range.
+ * * Error codes from the NE PCI device request.
+ */
+#define NE_ADD_VCPU _IOWR(0xAE, 0x21, __u32)
+
+/**
+ * NE_GET_IMAGE_LOAD_INFO - The command is used to get information needed for
+ * in-memory enclave image loading e.g. offset in
+ * enclave memory to start placing the enclave image.
+ * The image load info is an input / output parameter.
+ * It includes info provided by the caller - flags -
+ * and returns the offset in enclave memory where to
+ * start placing the enclave image.
+ * The ioctl can be invoked on the enclave fd, before
+ * an enclave is started.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 - Logic successfully completed.
+ * * -1 - There was a failure in the ioctl logic.
+ * On failure, errno is set to:
+ * * EFAULT - copy_from_user() / copy_to_user() failure.
+ * * EINVAL - Invalid flag value.
+ * * NE_ERR_NOT_IN_INIT_STATE - The enclave is not in init state (init =
+ * before being started).
+ */
+#define NE_GET_IMAGE_LOAD_INFO _IOWR(0xAE, 0x22, struct ne_image_load_info)
+
+/**
+ * NE_SET_USER_MEMORY_REGION - The command is used to set a memory region for an
+ * enclave, given the allocated memory from the
+ * userspace. Enclave memory needs to be from the
+ * same NUMA node as the enclave CPUs.
+ * The user memory region is an input parameter. It
+ * includes info provided by the caller - flags,
+ * memory size and userspace address.
+ * The ioctl can be invoked on the enclave fd,
+ * before an enclave is started.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 - Logic successfully completed.
+ * * -1 - There was a failure in the ioctl logic.
+ * On failure, errno is set to:
+ * * EFAULT - copy_from_user() failure.
+ * * EINVAL - Invalid flag value.
+ * * EIO - Current task mm is not the same as
+ * the one that created the enclave.
+ * * ENOMEM - Memory allocation failure for internal
+ * bookkeeping variables.
+ * * NE_ERR_NOT_IN_INIT_STATE - The enclave is not in init state
+ * (init = before being started).
+ * * NE_ERR_INVALID_MEM_REGION_SIZE - The memory size of the region is not
+ * a multiple of 2 MiB.
+ * * NE_ERR_INVALID_MEM_REGION_ADDR - Invalid user space address given.
+ * * NE_ERR_UNALIGNED_MEM_REGION_ADDR - Unaligned user space address given.
+ * * NE_ERR_MEM_REGION_ALREADY_USED - The memory region is already used.
+ * * NE_ERR_MEM_NOT_HUGE_PAGE - The memory region is not backed by
+ * huge pages.
+ * * NE_ERR_MEM_DIFFERENT_NUMA_NODE - The memory region is not from the same
+ * NUMA node as the CPUs.
+ * * NE_ERR_MEM_MAX_REGIONS - The number of memory regions set for
+ * the enclave reached maximum.
+ * * Error codes from get_user_pages().
+ * * Error codes from the NE PCI device request.
+ */
+#define NE_SET_USER_MEMORY_REGION _IOW(0xAE, 0x23, struct ne_user_memory_region)
+
+/**
+ * NE_START_ENCLAVE - The command is used to trigger enclave start after the
+ * enclave resources, such as memory and CPU, have been set.
+ * The enclave start info is an input / output parameter. It
+ * includes info provided by the caller - enclave cid and
+ * flags - and returns the cid (if input cid is 0).
+ * The ioctl can be invoked on the enclave fd, after an
+ * enclave slot is created and resources, such as memory and
+ * vCPUs are set for an enclave.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 - Logic successfully completed.
+ * * -1 - There was a failure in the ioctl logic.
+ * On failure, errno is set to:
+ * * EFAULT - copy_from_user() / copy_to_user() failure.
+ * * EINVAL - Invalid flag value.
+ * * NE_ERR_NOT_IN_INIT_STATE - The enclave is not in init state
+ * (init = before being started).
+ * * NE_ERR_NO_MEM_REGIONS_ADDED - No memory regions are set.
+ * * NE_ERR_NO_VCPUS_ADDED - No vCPUs are set.
+ * * NE_ERR_FULL_CORES_NOT_USED - Full core(s) not set for the enclave.
+ * * NE_ERR_ENCLAVE_MEM_MIN_SIZE - Enclave memory is less than minimum
+ * memory size (64 MiB).
+ * * Error codes from the NE PCI device request.
+ */
+#define NE_START_ENCLAVE _IOWR(0xAE, 0x24, struct ne_enclave_start_info)
+
+/**
+ * DOC: NE specific error codes
+ */
+
+/**
+ * NE_ERR_VCPU_ALREADY_USED - The provided vCPU is already used.
+ */
+#define NE_ERR_VCPU_ALREADY_USED (256)
+/**
+ * NE_ERR_VCPU_NOT_IN_CPU_POOL - The provided vCPU is not available in the
+ * NE CPU pool.
+ */
+#define NE_ERR_VCPU_NOT_IN_CPU_POOL (257)
+/**
+ * NE_ERR_VCPU_INVALID_CPU_CORE - The core id of the provided vCPU is invalid
+ * or out of range of the NE CPU pool.
+ */
+#define NE_ERR_VCPU_INVALID_CPU_CORE (258)
+/**
+ * NE_ERR_INVALID_MEM_REGION_SIZE - The user space memory region size is not
+ * a multiple of 2 MiB.
+ */
+#define NE_ERR_INVALID_MEM_REGION_SIZE (259)
+/**
+ * NE_ERR_INVALID_MEM_REGION_ADDR - The user space memory region address range
+ * is invalid.
+ */
+#define NE_ERR_INVALID_MEM_REGION_ADDR (260)
+/**
+ * NE_ERR_UNALIGNED_MEM_REGION_ADDR - The user space memory region address is
+ * not aligned.
+ */
+#define NE_ERR_UNALIGNED_MEM_REGION_ADDR (261)
+/**
+ * NE_ERR_MEM_REGION_ALREADY_USED - The user space memory region is already used.
+ */
+#define NE_ERR_MEM_REGION_ALREADY_USED (262)
+/**
+ * NE_ERR_MEM_NOT_HUGE_PAGE - The user space memory region is not backed by
+ * contiguous physical huge page(s).
+ */
+#define NE_ERR_MEM_NOT_HUGE_PAGE (263)
+/**
+ * NE_ERR_MEM_DIFFERENT_NUMA_NODE - The user space memory region is backed by
+ * pages from different NUMA nodes than the CPUs.
+ */
+#define NE_ERR_MEM_DIFFERENT_NUMA_NODE (264)
+/**
+ * NE_ERR_MEM_MAX_REGIONS - The supported max memory regions per enclave has
+ * been reached.
+ */
+#define NE_ERR_MEM_MAX_REGIONS (265)
+/**
+ * NE_ERR_NO_MEM_REGIONS_ADDED - The command to start an enclave is triggered
+ * and no memory regions are added.
+ */
+#define NE_ERR_NO_MEM_REGIONS_ADDED (266)
+/**
+ * NE_ERR_NO_VCPUS_ADDED - The command to start an enclave is triggered and no
+ * vCPUs are added.
+ */
+#define NE_ERR_NO_VCPUS_ADDED (267)
+/**
+ * NE_ERR_ENCLAVE_MEM_MIN_SIZE - The enclave memory size is lower than the
+ * minimum supported.
+ */
+#define NE_ERR_ENCLAVE_MEM_MIN_SIZE (268)
+/**
+ * NE_ERR_FULL_CORES_NOT_USED - The command to start an enclave is triggered and
+ * full CPU cores are not set.
+ */
+#define NE_ERR_FULL_CORES_NOT_USED (269)
+/**
+ * NE_ERR_NOT_IN_INIT_STATE - The enclave is not in init state when setting
+ * resources or triggering start.
+ */
+#define NE_ERR_NOT_IN_INIT_STATE (270)
+/**
+ * NE_ERR_INVALID_VCPU - The provided vCPU is out of range of the available CPUs.
+ */
+#define NE_ERR_INVALID_VCPU (271)
+/**
+ * NE_ERR_NO_CPUS_AVAIL_IN_POOL - The command to create an enclave is triggered
+ * and no CPUs are available in the pool.
+ */
+#define NE_ERR_NO_CPUS_AVAIL_IN_POOL (272)
+
+/**
+ * DOC: Image load info flags
+ */
+
+/**
+ * NE_EIF_IMAGE - Enclave Image Format (EIF)
+ */
+#define NE_EIF_IMAGE (0x01)
+
+/**
+ * struct ne_image_load_info - Info necessary for in-memory enclave image
+ * loading (in / out).
+ * @flags: Flags to determine the enclave image type
+ * (e.g. Enclave Image Format - EIF) (in).
+ * @memory_offset: Offset in enclave memory where to start placing the
+ * enclave image (out).
+ */
+struct ne_image_load_info {
+ __u64 flags;
+ __u64 memory_offset;
+};
+
+/**
+ * DOC: User memory region flags
+ */
+
+/**
+ * NE_DEFAULT_MEMORY_REGION - Memory region for enclave general usage.
+ */
+#define NE_DEFAULT_MEMORY_REGION (0x00)
+
+#define NE_MEMORY_REGION_MAX_FLAG_VAL (0x01)
+
+/**
+ * struct ne_user_memory_region - Memory region to be set for an enclave (in).
+ * @flags: Flags to determine the usage for the memory region (in).
+ * @memory_size: The size, in bytes, of the memory region to be set for
+ * an enclave (in).
+ * @userspace_addr: The start address of the userspace allocated memory of
+ * the memory region to set for an enclave (in).
+ */
+struct ne_user_memory_region {
+ __u64 flags;
+ __u64 memory_size;
+ __u64 userspace_addr;
+};
+
+/**
+ * DOC: Enclave start info flags
+ */
+
+/**
+ * NE_ENCLAVE_PRODUCTION_MODE - Start enclave in production mode.
+ */
+#define NE_ENCLAVE_PRODUCTION_MODE (0x00)
+/**
+ * NE_ENCLAVE_DEBUG_MODE - Start enclave in debug mode.
+ */
+#define NE_ENCLAVE_DEBUG_MODE (0x01)
+
+#define NE_ENCLAVE_START_MAX_FLAG_VAL (0x02)
+
+/**
+ * struct ne_enclave_start_info - Setup info necessary for enclave start (in / out).
+ * @flags: Flags for the enclave to start with (e.g. debug mode) (in).
+ * @enclave_cid: Context ID (CID) for the enclave vsock device. If 0 as
+ * input, the CID is autogenerated by the hypervisor and
+ * returned back as output by the driver (in / out).
+ */
+struct ne_enclave_start_info {
+ __u64 flags;
+ __u64 enclave_cid;
+};
+
+#endif /* _UAPI_LINUX_NITRO_ENCLAVES_H_ */
--
2.20.1 (Apple Git-117)
The Nitro Enclaves (NE) driver communicates with a new PCI device, which
is exposed to a virtual machine (VM) and handles commands meant for
handling enclave lifetime e.g. creation, termination, setting memory
regions. The communication with the PCI device is handled using an MMIO
space and MSI-X interrupts.
This device communicates with the hypervisor on the host, where the VM
that spawned the enclave itself runs, e.g. to launch a VM that is used
for the enclave.
Define the MMIO space of the NE PCI device and the commands that are
provided by this device. Add an internal data structure used as private
data for the PCI device driver and the function for the PCI device
command requests handling.
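As an illustration of the register layout defined in this patch, the sketch
below splits a 4-byte NE_EVTCNT value into its two counters: the lower half is
the command reply counter and the higher half is the out-of-band device event
counter. The SKETCH_ names and helper functions are hypothetical stand-ins for
the NE_EVTCNT_REPLY() / NE_EVTCNT_EVENT() macros, so that the example builds
on its own.

```c
#include <stdint.h>

/* Same layout as the NE_EVTCNT_* definitions in ne_pci_dev.h. */
#define SKETCH_EVTCNT_REPLY_SHIFT 0
#define SKETCH_EVTCNT_REPLY_MASK  0x0000ffffU
#define SKETCH_EVTCNT_EVENT_SHIFT 16
#define SKETCH_EVTCNT_EVENT_MASK  0xffff0000U

/* Lower 16 bits: command reply counter. */
uint16_t sketch_evtcnt_reply(uint32_t cnt)
{
	return (uint16_t)((cnt & SKETCH_EVTCNT_REPLY_MASK) >>
			  SKETCH_EVTCNT_REPLY_SHIFT);
}

/* Upper 16 bits: out-of-band device event counter. */
uint16_t sketch_evtcnt_event(uint32_t cnt)
{
	return (uint16_t)((cnt & SKETCH_EVTCNT_EVENT_MASK) >>
			  SKETCH_EVTCNT_EVENT_SHIFT);
}
```

For a register value of 0x00030005, the reply counter is 5 and the event
counter is 3.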
Signed-off-by: Alexandru-Catalin Vasile <[email protected]>
Signed-off-by: Alexandru Ciobotaru <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Update the documentation to include references to the NE PCI device id
and MMIO bar.
v5 -> v6
* Update documentation to kernel-doc format.
v4 -> v5
* Add a TODO for including flags in the request to the NE PCI device to
set a memory region for an enclave. It is not used for now.
v3 -> v4
* Remove the "packed" attribute and include padding in the NE data
structures.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* Update path naming to drivers/virt/nitro_enclaves.
* Update NE_ENABLE_OFF / NE_ENABLE_ON defines.
---
drivers/virt/nitro_enclaves/ne_pci_dev.h | 327 +++++++++++++++++++++++
1 file changed, 327 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.h
diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.h b/drivers/virt/nitro_enclaves/ne_pci_dev.h
new file mode 100644
index 000000000000..336fa344d630
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.h
@@ -0,0 +1,327 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#ifndef _NE_PCI_DEV_H_
+#define _NE_PCI_DEV_H_
+
+#include <linux/atomic.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/wait.h>
+
+/**
+ * DOC: Nitro Enclaves (NE) PCI device
+ */
+
+/**
+ * PCI_DEVICE_ID_NE - Nitro Enclaves PCI device id.
+ */
+#define PCI_DEVICE_ID_NE (0xe4c1)
+/**
+ * PCI_BAR_NE - Nitro Enclaves PCI device MMIO BAR.
+ */
+#define PCI_BAR_NE (0x03)
+
+/**
+ * DOC: Device registers in the NE PCI device MMIO BAR
+ */
+
+/**
+ * NE_ENABLE - (1 byte) Register to notify the device that the driver is using
+ * it (Read/Write).
+ */
+#define NE_ENABLE (0x0000)
+#define NE_ENABLE_OFF (0x00)
+#define NE_ENABLE_ON (0x01)
+
+/**
+ * NE_VERSION - (2 bytes) Register to select the device run-time version
+ * (Read/Write).
+ */
+#define NE_VERSION (0x0002)
+#define NE_VERSION_MAX (0x0001)
+
+/**
+ * NE_COMMAND - (4 bytes) Register to notify the device what command was
+ * requested (Write-Only).
+ */
+#define NE_COMMAND (0x0004)
+
+/**
+ * NE_EVTCNT - (4 bytes) Register to notify the driver that a reply or a device
+ * event is available (Read-Only):
+ * - Lower half - command reply counter
+ * - Higher half - out-of-band device event counter
+ */
+#define NE_EVTCNT (0x000c)
+#define NE_EVTCNT_REPLY_SHIFT (0)
+#define NE_EVTCNT_REPLY_MASK (0x0000ffff)
+#define NE_EVTCNT_REPLY(cnt) (((cnt) & NE_EVTCNT_REPLY_MASK) >> \
+ NE_EVTCNT_REPLY_SHIFT)
+#define NE_EVTCNT_EVENT_SHIFT (16)
+#define NE_EVTCNT_EVENT_MASK (0xffff0000)
+#define NE_EVTCNT_EVENT(cnt) (((cnt) & NE_EVTCNT_EVENT_MASK) >> \
+ NE_EVTCNT_EVENT_SHIFT)
+
+/**
+ * NE_SEND_DATA - (240 bytes) Buffer for sending the command request payload
+ * (Read/Write).
+ */
+#define NE_SEND_DATA (0x0010)
+
+/**
+ * NE_RECV_DATA - (240 bytes) Buffer for receiving the command reply payload
+ * (Read-Only).
+ */
+#define NE_RECV_DATA (0x0100)
+
+/**
+ * DOC: Device MMIO buffer sizes
+ */
+
+/**
+ * NE_SEND_DATA_SIZE / NE_RECV_DATA_SIZE - 240 bytes for send / recv buffer.
+ */
+#define NE_SEND_DATA_SIZE (240)
+#define NE_RECV_DATA_SIZE (240)
+
+/**
+ * DOC: MSI-X interrupt vectors
+ */
+
+/**
+ * NE_VEC_REPLY - MSI-X vector used for command reply notification.
+ */
+#define NE_VEC_REPLY (0)
+
+/**
+ * NE_VEC_EVENT - MSI-X vector used for out-of-band events e.g. enclave crash.
+ */
+#define NE_VEC_EVENT (1)
+
+/**
+ * enum ne_pci_dev_cmd_type - Device command types.
+ * @INVALID_CMD: Invalid command.
+ * @ENCLAVE_START: Start an enclave, after setting its resources.
+ * @ENCLAVE_GET_SLOT: Get the slot uid of an enclave.
+ * @ENCLAVE_STOP: Terminate an enclave.
+ * @SLOT_ALLOC: Allocate a slot for an enclave.
+ * @SLOT_FREE: Free the slot allocated for an enclave.
+ * @SLOT_ADD_MEM: Add a memory region to an enclave slot.
+ * @SLOT_ADD_VCPU: Add a vCPU to an enclave slot.
+ * @SLOT_COUNT: Get the number of allocated slots.
+ * @NEXT_SLOT: Get the next slot in the list of allocated slots.
+ * @SLOT_INFO: Get the info for a slot e.g. slot uid, vCPUs count.
+ * @SLOT_ADD_BULK_VCPUS: Add a number of vCPUs, not providing CPU ids.
+ * @MAX_CMD: A gatekeeper for max possible command type.
+ */
+enum ne_pci_dev_cmd_type {
+ INVALID_CMD = 0,
+ ENCLAVE_START = 1,
+ ENCLAVE_GET_SLOT = 2,
+ ENCLAVE_STOP = 3,
+ SLOT_ALLOC = 4,
+ SLOT_FREE = 5,
+ SLOT_ADD_MEM = 6,
+ SLOT_ADD_VCPU = 7,
+ SLOT_COUNT = 8,
+ NEXT_SLOT = 9,
+ SLOT_INFO = 10,
+ SLOT_ADD_BULK_VCPUS = 11,
+ MAX_CMD,
+};
+
+/**
+ * DOC: Device commands - payload structure for requests and replies.
+ */
+
+/**
+ * struct enclave_start_req - ENCLAVE_START request.
+ * @slot_uid: Slot unique id mapped to the enclave to start.
+ * @enclave_cid: Context ID (CID) for the enclave vsock device.
+ * If 0, CID is autogenerated.
+ * @flags: Flags for the enclave to start with (e.g. debug mode).
+ */
+struct enclave_start_req {
+ u64 slot_uid;
+ u64 enclave_cid;
+ u64 flags;
+};
+
+/**
+ * struct enclave_get_slot_req - ENCLAVE_GET_SLOT request.
+ * @enclave_cid: Context ID (CID) for the enclave vsock device.
+ */
+struct enclave_get_slot_req {
+ u64 enclave_cid;
+};
+
+/**
+ * struct enclave_stop_req - ENCLAVE_STOP request.
+ * @slot_uid: Slot unique id mapped to the enclave to stop.
+ */
+struct enclave_stop_req {
+ u64 slot_uid;
+};
+
+/**
+ * struct slot_alloc_req - SLOT_ALLOC request.
+ * @unused: In order to avoid weird sizeof edge cases.
+ */
+struct slot_alloc_req {
+ u8 unused;
+};
+
+/**
+ * struct slot_free_req - SLOT_FREE request.
+ * @slot_uid: Slot unique id mapped to the slot to free.
+ */
+struct slot_free_req {
+ u64 slot_uid;
+};
+
+/* TODO: Add flags field to the request to add memory region. */
+/**
+ * struct slot_add_mem_req - SLOT_ADD_MEM request.
+ * @slot_uid: Slot unique id mapped to the slot to add the memory region to.
+ * @paddr: Physical address of the memory region to add to the slot.
+ * @size: Memory size, in bytes, of the memory region to add to the slot.
+ */
+struct slot_add_mem_req {
+ u64 slot_uid;
+ u64 paddr;
+ u64 size;
+};
+
+/**
+ * struct slot_add_vcpu_req - SLOT_ADD_VCPU request.
+ * @slot_uid: Slot unique id mapped to the slot to add the vCPU to.
+ * @vcpu_id: vCPU ID of the CPU to add to the enclave.
+ * @padding: Padding for the overall data structure.
+ */
+struct slot_add_vcpu_req {
+ u64 slot_uid;
+ u32 vcpu_id;
+ u8 padding[4];
+};
+
+/**
+ * struct slot_count_req - SLOT_COUNT request.
+ * @unused: In order to avoid weird sizeof edge cases.
+ */
+struct slot_count_req {
+ u8 unused;
+};
+
+/**
+ * struct next_slot_req - NEXT_SLOT request.
+ * @slot_uid: Slot unique id of the next slot in the iteration.
+ */
+struct next_slot_req {
+ u64 slot_uid;
+};
+
+/**
+ * struct slot_info_req - SLOT_INFO request.
+ * @slot_uid: Slot unique id mapped to the slot to get information about.
+ */
+struct slot_info_req {
+ u64 slot_uid;
+};
+
+/**
+ * struct slot_add_bulk_vcpus_req - SLOT_ADD_BULK_VCPUS request.
+ * @slot_uid: Slot unique id mapped to the slot to add vCPUs to.
+ * @nr_vcpus: Number of vCPUs to add to the slot.
+ */
+struct slot_add_bulk_vcpus_req {
+ u64 slot_uid;
+ u64 nr_vcpus;
+};
+
+/**
+ * struct ne_pci_dev_cmd_reply - NE PCI device command reply.
+ * @rc: Return code of the logic that processed the request.
+ * @padding0: Padding for the overall data structure.
+ * @slot_uid: Valid for all commands except SLOT_COUNT.
+ * @enclave_cid: Valid for ENCLAVE_START command.
+ * @slot_count: Valid for SLOT_COUNT command.
+ * @mem_regions: Valid for SLOT_ALLOC and SLOT_INFO commands.
+ * @mem_size: Valid for SLOT_INFO command.
+ * @nr_vcpus: Valid for SLOT_INFO command.
+ * @flags: Valid for SLOT_INFO command.
+ * @state: Valid for SLOT_INFO command.
+ * @padding1: Padding for the overall data structure.
+ */
+struct ne_pci_dev_cmd_reply {
+ s32 rc;
+ u8 padding0[4];
+ u64 slot_uid;
+ u64 enclave_cid;
+ u64 slot_count;
+ u64 mem_regions;
+ u64 mem_size;
+ u64 nr_vcpus;
+ u64 flags;
+ u16 state;
+ u8 padding1[6];
+};
+
+/**
+ * struct ne_pci_dev - Nitro Enclaves (NE) PCI device.
+ * @cmd_reply_avail: Variable set if a reply has been sent by the
+ * PCI device.
+ * @cmd_reply_wait_q: Wait queue for handling command reply from the
+ * PCI device.
+ * @enclaves_list: List of the enclaves managed by the PCI device.
+ * @enclaves_list_mutex: Mutex for accessing the list of enclaves.
+ * @event_wq: Work queue for handling out-of-band events
+ * triggered by the Nitro Hypervisor which require
+ * enclave state scanning and propagation to the
+ * enclave process.
+ * @iomem_base: MMIO region of the PCI device.
+ * @notify_work: Work item for every received out-of-band event.
+ * @pci_dev_mutex: Mutex for accessing the PCI device MMIO space.
+ * @pdev: PCI device data structure.
+ */
+struct ne_pci_dev {
+ atomic_t cmd_reply_avail;
+ wait_queue_head_t cmd_reply_wait_q;
+ struct list_head enclaves_list;
+ struct mutex enclaves_list_mutex;
+ struct workqueue_struct *event_wq;
+ void __iomem *iomem_base;
+ struct work_struct notify_work;
+ struct mutex pci_dev_mutex;
+ struct pci_dev *pdev;
+};
+
+/**
+ * ne_do_request() - Submit command request to the PCI device based on the command
+ * type and retrieve the associated reply.
+ * @pdev: PCI device to send the command to and receive the reply from.
+ * @cmd_type: Command type of the request sent to the PCI device.
+ * @cmd_request: Command request payload.
+ * @cmd_request_size: Size of the command request payload.
+ * @cmd_reply: Command reply payload.
+ * @cmd_reply_size: Size of the command reply payload.
+ *
+ * Context: Process context. This function uses the ne_pci_dev mutex to handle
+ * one command at a time.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+int ne_do_request(struct pci_dev *pdev, enum ne_pci_dev_cmd_type cmd_type,
+ void *cmd_request, size_t cmd_request_size,
+ struct ne_pci_dev_cmd_reply *cmd_reply,
+ size_t cmd_reply_size);
+
+/* Nitro Enclaves (NE) PCI device driver */
+extern struct pci_driver ne_pci_driver;
+
+#endif /* _NE_PCI_DEV_H_ */
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
The Nitro Enclaves PCI device exposes an MMIO space that this driver
uses to submit command requests and to receive command replies e.g. for
enclave creation / termination or setting enclave resources.
Add logic for handling PCI device command requests based on the given
command type.
Register an MSI-X interrupt vector for command reply notifications, so
that the driver can handle this type of communication event.
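The request / reply flow described above can be sketched in user space as follows. This is only an illustration under stated assumptions: the register offsets and buffer sizes are the ones from ne_pci_dev.h, while mock_bar, mock_reply, mock_device_process() and mock_do_request() are hypothetical names, the BAR is plain memory rather than real MMIO, and the wait for the reply interrupt is replaced by a synchronous call into a mock device.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets and sizes as defined in ne_pci_dev.h in this series. */
#define NE_COMMAND        0x0004
#define NE_SEND_DATA      0x0010
#define NE_RECV_DATA      0x0100
#define NE_SEND_DATA_SIZE 240
#define NE_RECV_DATA_SIZE 240

/* Hypothetical stand-in for the device BAR: plain memory, not real MMIO. */
struct mock_bar {
	uint8_t mem[NE_RECV_DATA + NE_RECV_DATA_SIZE];
};

/* Hypothetical, shortened reply: a return code followed by a slot uid. */
struct mock_reply {
	int32_t rc;
	uint8_t padding[4];
	uint64_t slot_uid;
};

/* Stand-in for the device side: echo the slot uid found in the request. */
static void mock_device_process(struct mock_bar *bar)
{
	struct mock_reply reply;

	memset(&reply, 0, sizeof(reply));
	memcpy(&reply.slot_uid, bar->mem + NE_SEND_DATA, sizeof(reply.slot_uid));
	memcpy(bar->mem + NE_RECV_DATA, &reply, sizeof(reply));
}

/* Submit a request, let the mock device answer, then read the reply back. */
static int mock_do_request(struct mock_bar *bar, uint32_t cmd_type,
			   const void *req, size_t req_size,
			   struct mock_reply *reply, size_t reply_size)
{
	/* Same kind of bounds checks as ne_do_request() performs. */
	if (!req || req_size > NE_SEND_DATA_SIZE)
		return -1;
	if (!reply || reply_size > NE_RECV_DATA_SIZE)
		return -1;

	/* Payload first, then the command register, as in ne_submit_request(). */
	memcpy(bar->mem + NE_SEND_DATA, req, req_size);
	memcpy(bar->mem + NE_COMMAND, &cmd_type, sizeof(cmd_type));

	/* The driver sleeps here until the reply IRQ; the mock answers inline. */
	mock_device_process(bar);

	memcpy(reply, bar->mem + NE_RECV_DATA, reply_size);

	return reply->rc < 0 ? reply->rc : 0;
}
```

A caller would zero a struct mock_bar, pass a u64 slot uid as the request payload and read the echoed value from reply.slot_uid. The real driver additionally serializes callers with pci_dev_mutex, which the sketch leaves out.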
Signed-off-by: Alexandru-Catalin Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* Update documentation to kernel-doc format.
v4 -> v5
* Remove sanity checks for situations that can only happen on a buggy
system or with broken logic.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Return IRQ_NONE when interrupts are not handled.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Remove "ratelimited" from the logs that are not in the ioctl call
paths.
v1 -> v2
* Add log pattern for NE.
* Remove the BUG_ON calls.
* Update goto labels to match their purpose.
* Add fix for kbuild report:
https://lore.kernel.org/lkml/202004231644.xTmN4Z1z%[email protected]/
---
drivers/virt/nitro_enclaves/ne_pci_dev.c | 204 +++++++++++++++++++++++
1 file changed, 204 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c b/drivers/virt/nitro_enclaves/ne_pci_dev.c
index 31650dcd592e..77ccbc43bce3 100644
--- a/drivers/virt/nitro_enclaves/ne_pci_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
@@ -33,6 +33,187 @@ static const struct pci_device_id ne_pci_ids[] = {
MODULE_DEVICE_TABLE(pci, ne_pci_ids);
+/**
+ * ne_submit_request() - Submit command request to the PCI device based on the
+ * command type.
+ * @pdev: PCI device to send the command to.
+ * @cmd_type: Command type of the request sent to the PCI device.
+ * @cmd_request: Command request payload.
+ * @cmd_request_size: Size of the command request payload.
+ *
+ * Context: Process context. This function is called with the ne_pci_dev mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_submit_request(struct pci_dev *pdev, enum ne_pci_dev_cmd_type cmd_type,
+ void *cmd_request, size_t cmd_request_size)
+{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+
+ memcpy_toio(ne_pci_dev->iomem_base + NE_SEND_DATA, cmd_request, cmd_request_size);
+
+ iowrite32(cmd_type, ne_pci_dev->iomem_base + NE_COMMAND);
+
+ return 0;
+}
+
+/**
+ * ne_retrieve_reply() - Retrieve reply from the PCI device.
+ * @pdev: PCI device to receive the reply from.
+ * @cmd_reply: Command reply payload.
+ * @cmd_reply_size: Size of the command reply payload.
+ *
+ * Context: Process context. This function is called with the ne_pci_dev mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_retrieve_reply(struct pci_dev *pdev, struct ne_pci_dev_cmd_reply *cmd_reply,
+ size_t cmd_reply_size)
+{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+
+ memcpy_fromio(cmd_reply, ne_pci_dev->iomem_base + NE_RECV_DATA, cmd_reply_size);
+
+ return 0;
+}
+
+/**
+ * ne_wait_for_reply() - Wait for a reply of a PCI device command.
+ * @pdev: PCI device for which a reply is waited.
+ *
+ * Context: Process context. This function is called with the ne_pci_dev mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_wait_for_reply(struct pci_dev *pdev)
+{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+ int rc = -EINVAL;
+
+ /*
+ * TODO: Update to _interruptible and handle interrupted wait event
+ * e.g. -ERESTARTSYS, incoming signals + update timeout, if needed.
+ */
+ rc = wait_event_timeout(ne_pci_dev->cmd_reply_wait_q,
+ atomic_read(&ne_pci_dev->cmd_reply_avail) != 0,
+ msecs_to_jiffies(NE_DEFAULT_TIMEOUT_MSECS));
+ if (!rc)
+ return -ETIMEDOUT;
+
+ return 0;
+}
+
+int ne_do_request(struct pci_dev *pdev, enum ne_pci_dev_cmd_type cmd_type,
+ void *cmd_request, size_t cmd_request_size,
+ struct ne_pci_dev_cmd_reply *cmd_reply, size_t cmd_reply_size)
+{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+ int rc = -EINVAL;
+
+ if (cmd_type <= INVALID_CMD || cmd_type >= MAX_CMD) {
+ dev_err_ratelimited(&pdev->dev, "Invalid cmd type=%u\n", cmd_type);
+
+ return -EINVAL;
+ }
+
+ if (!cmd_request) {
+ dev_err_ratelimited(&pdev->dev, "Null cmd request\n");
+
+ return -EINVAL;
+ }
+
+ if (cmd_request_size > NE_SEND_DATA_SIZE) {
+ dev_err_ratelimited(&pdev->dev, "Invalid req size=%zu for cmd type=%u\n",
+ cmd_request_size, cmd_type);
+
+ return -EINVAL;
+ }
+
+ if (!cmd_reply) {
+ dev_err_ratelimited(&pdev->dev, "Null cmd reply\n");
+
+ return -EINVAL;
+ }
+
+ if (cmd_reply_size > NE_RECV_DATA_SIZE) {
+ dev_err_ratelimited(&pdev->dev, "Invalid reply size=%zu\n", cmd_reply_size);
+
+ return -EINVAL;
+ }
+
+ /*
+ * Use this mutex so that the PCI device handles one command request at
+ * a time.
+ */
+ mutex_lock(&ne_pci_dev->pci_dev_mutex);
+
+ atomic_set(&ne_pci_dev->cmd_reply_avail, 0);
+
+ rc = ne_submit_request(pdev, cmd_type, cmd_request, cmd_request_size);
+ if (rc < 0) {
+ dev_err_ratelimited(&pdev->dev, "Error in submit request [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ rc = ne_wait_for_reply(pdev);
+ if (rc < 0) {
+ dev_err_ratelimited(&pdev->dev, "Error in wait for reply [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ rc = ne_retrieve_reply(pdev, cmd_reply, cmd_reply_size);
+ if (rc < 0) {
+ dev_err_ratelimited(&pdev->dev, "Error in retrieve reply [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ atomic_set(&ne_pci_dev->cmd_reply_avail, 0);
+
+ if (cmd_reply->rc < 0) {
+ rc = cmd_reply->rc;
+
+ dev_err_ratelimited(&pdev->dev, "Error in cmd process logic [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ rc = 0;
+
+unlock_mutex:
+ mutex_unlock(&ne_pci_dev->pci_dev_mutex);
+
+ return rc;
+}
+
+/**
+ * ne_reply_handler() - Interrupt handler for retrieving a reply matching a
+ * request sent to the PCI device for enclave lifetime
+ * management.
+ * @irq: Received interrupt for a reply sent by the PCI device.
+ * @args: PCI device private data structure.
+ *
+ * Context: Interrupt context.
+ * Return:
+ * * IRQ_HANDLED on handled interrupt.
+ */
+static irqreturn_t ne_reply_handler(int irq, void *args)
+{
+ struct ne_pci_dev *ne_pci_dev = (struct ne_pci_dev *)args;
+
+ atomic_set(&ne_pci_dev->cmd_reply_avail, 1);
+
+ /* TODO: Update to _interruptible. */
+ wake_up(&ne_pci_dev->cmd_reply_wait_q);
+
+ return IRQ_HANDLED;
+}
+
/**
* ne_setup_msix() - Setup MSI-X vectors for the PCI device.
* @pdev: PCI device to setup the MSI-X for.
@@ -44,6 +225,7 @@ MODULE_DEVICE_TABLE(pci, ne_pci_ids);
*/
static int ne_setup_msix(struct pci_dev *pdev)
{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
int nr_vecs = 0;
int rc = -EINVAL;
@@ -63,7 +245,25 @@ static int ne_setup_msix(struct pci_dev *pdev)
return rc;
}
+ /*
+ * This IRQ gets triggered every time the PCI device responds to a
+ * command request. The reply is then retrieved, reading from the MMIO
+ * space of the PCI device.
+ */
+ rc = request_irq(pci_irq_vector(pdev, NE_VEC_REPLY), ne_reply_handler,
+ 0, "enclave_cmd", ne_pci_dev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in request irq reply [rc=%d]\n", rc);
+
+ goto free_irq_vectors;
+ }
+
return 0;
+
+free_irq_vectors:
+ pci_free_irq_vectors(pdev);
+
+ return rc;
}
/**
@@ -74,6 +274,10 @@ static int ne_setup_msix(struct pci_dev *pdev)
*/
static void ne_teardown_msix(struct pci_dev *pdev)
{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+
+ free_irq(pci_irq_vector(pdev, NE_VEC_REPLY), ne_pci_dev);
+
pci_free_irq_vectors(pdev);
}
--
2.20.1 (Apple Git-117)
The Nitro Enclaves driver keeps internal information for each enclave.
This is needed to manage the state of enclave resources and enclave
notifications, and to hold a reference to the PCI device that handles
command requests for enclave lifetime management.
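One use of this bookkeeping, mentioned in the v5 -> v6 changelog below, is checking for duplicate user space memory regions. A minimal sketch of such a check follows, assuming that "duplicate" means "overlapping address range". The names mem_region, regions_overlap() and region_is_duplicate() are hypothetical, a flat array replaces the kernel list, and address overflow is ignored for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduction of struct ne_mem_region to the two fields used. */
struct mem_region {
	uint64_t userspace_addr;
	uint64_t memory_size;
};

/* Two half-open ranges [addr, addr + size) overlap if each starts before
 * the other one ends. */
static bool regions_overlap(const struct mem_region *a,
			    const struct mem_region *b)
{
	return a->userspace_addr < b->userspace_addr + b->memory_size &&
	       b->userspace_addr < a->userspace_addr + a->memory_size;
}

/* Return true if the candidate overlaps any region already recorded. */
static bool region_is_duplicate(const struct mem_region *known, size_t nr_known,
				const struct mem_region *candidate)
{
	for (size_t i = 0; i < nr_known; i++) {
		if (regions_overlap(&known[i], candidate))
			return true;
	}

	return false;
}
```

With one recorded 2 MiB region at 0x200000, a candidate starting at 0x300000 is reported as a duplicate, while one starting at 0x400000 is not.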
Signed-off-by: Alexandru-Catalin Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Update the naming and add more comments to clarify the logic
of handling full CPU cores and dedicating them to the enclave.
v5 -> v6
* Update documentation to kernel-doc format.
* Include in the enclave memory region data structure the user space
address and size for duplicate user space memory regions checks.
v4 -> v5
* Include enclave cores field in the enclave metadata.
* Update the vCPU ids data structure to be a cpumask instead of a list.
v3 -> v4
* Add NUMA node field for an enclave metadata as the enclave memory and
CPUs need to be from the same NUMA node.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* Add enclave memory regions and vcpus count for enclave bookkeeping.
* Update ne_state comments to reflect NE_START_ENCLAVE ioctl naming
update.
---
drivers/virt/nitro_enclaves/ne_misc_dev.h | 99 +++++++++++++++++++++++
1 file changed, 99 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.h
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.h b/drivers/virt/nitro_enclaves/ne_misc_dev.h
new file mode 100644
index 000000000000..a907924de7ca
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+#ifndef _NE_MISC_DEV_H_
+#define _NE_MISC_DEV_H_
+
+#include <linux/cpumask.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/wait.h>
+
+/**
+ * struct ne_mem_region - Entry in the enclave user space memory regions list.
+ * @mem_region_list_entry: Entry in the list of enclave memory regions.
+ * @memory_size: Size of the user space memory region.
+ * @nr_pages: Number of pages that make up the memory region.
+ * @pages: Pages that make up the user space memory region.
+ * @userspace_addr: User space address of the memory region.
+ */
+struct ne_mem_region {
+ struct list_head mem_region_list_entry;
+ u64 memory_size;
+ unsigned long nr_pages;
+ struct page **pages;
+ u64 userspace_addr;
+};
+
+/**
+ * struct ne_enclave - Per-enclave data used for enclave lifetime management.
+ * @enclave_info_mutex: Mutex for accessing this internal state.
+ * @enclave_list_entry: Entry in the list of created enclaves.
+ * @eventq: Wait queue used for out-of-band event notifications
+ * triggered from the PCI device event handler to
+ * the enclave process via the poll function.
+ * @has_event: Variable used to determine if the out-of-band event
+ * was triggered.
+ * @max_mem_regions: The maximum number of memory regions that can be
+ * handled by the hypervisor.
+ * @mem_regions_list: Enclave user space memory regions list.
+ * @mem_size: Enclave memory size.
+ * @mm: Enclave process abstraction mm data struct.
+ * @nr_mem_regions: Number of memory regions associated with the enclave.
+ * @nr_parent_vm_cores: The size of the threads per core array. The
+ * total number of CPU cores available on the
+ * parent / primary VM.
+ * @nr_threads_per_core: The number of threads that a full CPU core has.
+ * @nr_vcpus: Number of vcpus associated with the enclave.
+ * @numa_node: NUMA node of the enclave memory and CPUs.
+ * @pdev: PCI device used for enclave lifetime management.
+ * @slot_uid: Slot unique id mapped to the enclave.
+ * @state: Enclave state, updated during enclave lifetime.
+ * @threads_per_core: Enclave full CPU cores array, indexed by core id,
+ * consisting of cpumasks with all their threads.
+ * Full CPU cores are taken from the NE CPU pool
+ * and are available to the enclave.
+ * @vcpu_ids: Cpumask of the vCPUs that are set for the enclave.
+ */
+struct ne_enclave {
+ struct mutex enclave_info_mutex;
+ struct list_head enclave_list_entry;
+ wait_queue_head_t eventq;
+ bool has_event;
+ u64 max_mem_regions;
+ struct list_head mem_regions_list;
+ u64 mem_size;
+ struct mm_struct *mm;
+ unsigned int nr_mem_regions;
+ unsigned int nr_parent_vm_cores;
+ unsigned int nr_threads_per_core;
+ unsigned int nr_vcpus;
+ int numa_node;
+ struct pci_dev *pdev;
+ u64 slot_uid;
+ u16 state;
+ cpumask_var_t *threads_per_core;
+ cpumask_var_t vcpu_ids;
+};
+
+/**
+ * enum ne_state - States available for an enclave.
+ * @NE_STATE_INIT: The enclave has not been started yet.
+ * @NE_STATE_RUNNING: The enclave was started and is running as expected.
+ * @NE_STATE_STOPPED: The enclave exited without userspace interaction.
+ */
+enum ne_state {
+ NE_STATE_INIT = 0,
+ NE_STATE_RUNNING = 2,
+ NE_STATE_STOPPED = U16_MAX,
+};
+
+/* Nitro Enclaves (NE) misc device */
+extern struct miscdevice ne_misc_dev;
+
+#endif /* _NE_MISC_DEV_H_ */
--
2.20.1 (Apple Git-117)
In addition to the replies sent by the Nitro Enclaves PCI device in
response to command requests, out-of-band enclave events can happen e.g.
an enclave crashes. In this case, the Nitro Enclaves driver needs to be
aware of the event and notify the corresponding user space process that
abstracts the enclave.
Register an MSI-X interrupt vector to be used for this kind of
out-of-band event. The interrupt signals that the state of an enclave
changed, and the driver logic scans the state of each running enclave to
identify which one the notification is intended for.
Create a workqueue to handle the out-of-band events. Notify the user space
enclave process, which uses a polling mechanism on the enclave fd.
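The scan step can be sketched in user space as below. It is a simplified model, not the driver code: mock_enclave, slot_state_fn, demo_query() and scan_enclaves() are hypothetical names, the callback stands in for the SLOT_INFO device command, and setting has_event stands in for waking up the poll wait queue. Unlike the patch, which only logs a failed SLOT_INFO request, the sketch skips the enclave when the query fails.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* State values mirror enum ne_state from ne_misc_dev.h. */
enum {
	STATE_INIT = 0,
	STATE_RUNNING = 2,
	STATE_STOPPED = UINT16_MAX,
};

/* Hypothetical reduction of struct ne_enclave to the fields used here. */
struct mock_enclave {
	uint64_t slot_uid;
	uint16_t state;
	bool has_event;
};

/* Hypothetical callback standing in for the SLOT_INFO device command. */
typedef int (*slot_state_fn)(uint64_t slot_uid, uint16_t *state);

/* Demo query: pretend that only the enclave in slot 2 has stopped. */
static int demo_query(uint64_t slot_uid, uint16_t *state)
{
	*state = (slot_uid == 2) ? STATE_STOPPED : STATE_RUNNING;

	return 0;
}

/* Scan all running enclaves and flag those whose state changed.
 * Returns the number of enclaves that were flagged. */
static size_t scan_enclaves(struct mock_enclave *enclaves, size_t nr,
			    slot_state_fn query)
{
	size_t notified = 0;

	for (size_t i = 0; i < nr; i++) {
		uint16_t state;

		/* Enclaves that are not running cannot get such events. */
		if (enclaves[i].state != STATE_RUNNING)
			continue;

		/* Assumption of this sketch: skip the enclave on failure. */
		if (query(enclaves[i].slot_uid, &state) < 0)
			continue;

		if (state != enclaves[i].state) {
			enclaves[i].state = state;
			enclaves[i].has_event = true;
			notified++;
		}
	}

	return notified;
}
```

Because the interrupt carries no payload, every running enclave has to be queried; the cost grows with the number of enclaves, which is one reason to do the work in a workqueue rather than in the interrupt handler.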
Signed-off-by: Alexandru-Catalin Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* Update documentation to kernel-doc format.
v4 -> v5
* Remove sanity checks for situations that can only happen on a buggy
system or with broken logic.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Return IRQ_NONE when interrupts are not handled.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Remove "ratelimited" from the logs that are not in the ioctl call
paths.
v1 -> v2
* Add log pattern for NE.
* Update goto labels to match their purpose.
---
drivers/virt/nitro_enclaves/ne_pci_dev.c | 116 +++++++++++++++++++++++
1 file changed, 116 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c b/drivers/virt/nitro_enclaves/ne_pci_dev.c
index 77ccbc43bce3..a898fae066d9 100644
--- a/drivers/virt/nitro_enclaves/ne_pci_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
@@ -214,6 +214,88 @@ static irqreturn_t ne_reply_handler(int irq, void *args)
return IRQ_HANDLED;
}
+/**
+ * ne_event_work_handler() - Work queue handler for notifying enclaves on a
+ * state change received by the event interrupt
+ * handler.
+ * @work: Item containing the NE PCI device for which an out-of-band event
+ * was issued.
+ *
+ * An out-of-band event is issued by the Nitro Hypervisor when at least
+ * one enclave changes state without client interaction.
+ *
+ * Context: Work queue context.
+ */
+static void ne_event_work_handler(struct work_struct *work)
+{
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ struct ne_enclave *ne_enclave = NULL;
+ struct ne_pci_dev *ne_pci_dev =
+ container_of(work, struct ne_pci_dev, notify_work);
+ int rc = -EINVAL;
+ struct slot_info_req slot_info_req = {};
+
+ mutex_lock(&ne_pci_dev->enclaves_list_mutex);
+
+ /*
+ * Iterate over all enclaves registered for the Nitro Enclaves
+ * PCI device and determine which enclave(s) the out-of-band event
+ * corresponds to.
+ */
+ list_for_each_entry(ne_enclave, &ne_pci_dev->enclaves_list, enclave_list_entry) {
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ /*
+ * Enclaves that were never started cannot receive out-of-band
+ * events.
+ */
+ if (ne_enclave->state != NE_STATE_RUNNING)
+ goto unlock;
+
+ slot_info_req.slot_uid = ne_enclave->slot_uid;
+
+ rc = ne_do_request(ne_enclave->pdev, SLOT_INFO, &slot_info_req,
+ sizeof(slot_info_req), &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0)
+ dev_err(&ne_enclave->pdev->dev, "Error in slot info [rc=%d]\n", rc);
+
+ /* Notify enclave process that the enclave state changed. */
+ if (ne_enclave->state != cmd_reply.state) {
+ ne_enclave->state = cmd_reply.state;
+
+ ne_enclave->has_event = true;
+
+ wake_up_interruptible(&ne_enclave->eventq);
+ }
+
+unlock:
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+ }
+
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+}
+
+/**
+ * ne_event_handler() - Interrupt handler for PCI device out-of-band events.
+ * This interrupt does not supply any data in the MMIO
+ * region. It notifies a change in the state of any of
+ * the launched enclaves.
+ * @irq: Received interrupt for an out-of-band event.
+ * @args: PCI device private data structure.
+ *
+ * Context: Interrupt context.
+ * Return:
+ * * IRQ_HANDLED on handled interrupt.
+ */
+static irqreturn_t ne_event_handler(int irq, void *args)
+{
+ struct ne_pci_dev *ne_pci_dev = (struct ne_pci_dev *)args;
+
+ queue_work(ne_pci_dev->event_wq, &ne_pci_dev->notify_work);
+
+ return IRQ_HANDLED;
+}
+
/**
* ne_setup_msix() - Setup MSI-X vectors for the PCI device.
* @pdev: PCI device to setup the MSI-X for.
@@ -258,8 +340,36 @@ static int ne_setup_msix(struct pci_dev *pdev)
goto free_irq_vectors;
}
+ ne_pci_dev->event_wq = create_singlethread_workqueue("ne_pci_dev_wq");
+ if (!ne_pci_dev->event_wq) {
+ rc = -ENOMEM;
+
+ dev_err(&pdev->dev, "Cannot get wq for dev events [rc=%d]\n", rc);
+
+ goto free_reply_irq_vec;
+ }
+
+ INIT_WORK(&ne_pci_dev->notify_work, ne_event_work_handler);
+
+ /*
+ * This IRQ gets triggered every time any enclave's state changes. Its
+ * handler then scans for the changes and propagates them to the user
+ * space.
+ */
+ rc = request_irq(pci_irq_vector(pdev, NE_VEC_EVENT), ne_event_handler,
+ 0, "enclave_evt", ne_pci_dev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in request irq event [rc=%d]\n", rc);
+
+ goto destroy_wq;
+ }
+
return 0;
+destroy_wq:
+ destroy_workqueue(ne_pci_dev->event_wq);
+free_reply_irq_vec:
+ free_irq(pci_irq_vector(pdev, NE_VEC_REPLY), ne_pci_dev);
free_irq_vectors:
pci_free_irq_vectors(pdev);
@@ -276,6 +386,12 @@ static void ne_teardown_msix(struct pci_dev *pdev)
{
struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+ free_irq(pci_irq_vector(pdev, NE_VEC_EVENT), ne_pci_dev);
+
+ flush_work(&ne_pci_dev->notify_work);
+ flush_workqueue(ne_pci_dev->event_wq);
+ destroy_workqueue(ne_pci_dev->event_wq);
+
free_irq(pci_irq_vector(pdev, NE_VEC_REPLY), ne_pci_dev);
pci_free_irq_vectors(pdev);
--
2.20.1 (Apple Git-117)
The Nitro Enclaves driver provides an ioctl interface to user space
for enclave lifetime management, e.g. enclave creation / termination and
setting enclave resources such as memory and CPUs.
This ioctl interface is mapped to a Nitro Enclaves misc device.
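This patch only introduces the memory size constants; the checks that use them are not part of it. As a rough sketch of how such limits might be applied when user space sets enclave memory, assuming each region must be 2 MiB aligned and a multiple of 2 MiB, and the total must reach 64 MiB (mem_region_is_valid() and enclave_mem_is_enough() are hypothetical helpers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as NE_MIN_MEM_REGION_SIZE and NE_MIN_ENCLAVE_MEM_SIZE. */
#define MIN_MEM_REGION_SIZE  (2 * 1024UL * 1024UL)
#define MIN_ENCLAVE_MEM_SIZE (64 * 1024UL * 1024UL)

/* Hypothetical check: a region must be at least 2 MiB, a multiple of
 * 2 MiB in size, and start at a 2 MiB aligned address. */
static bool mem_region_is_valid(uint64_t addr, uint64_t size)
{
	if (size < MIN_MEM_REGION_SIZE)
		return false;

	if (size % MIN_MEM_REGION_SIZE)
		return false;

	return addr % MIN_MEM_REGION_SIZE == 0;
}

/* Hypothetical check: the sum of all regions must reach the minimum
 * memory size an enclave can be launched with. */
static bool enclave_mem_is_enough(uint64_t total_size)
{
	return total_size >= MIN_ENCLAVE_MEM_SIZE;
}
```

For example, an 8 MiB region at address 0x200000 passes the region check, while a 3 MiB region or one starting at 0x1000 does not.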
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* Set the NE PCI device the parent of the NE misc device to be able to
use it in the ioctl logic.
* Update the naming and add more comments to clarify the logic
of handling full CPU cores and dedicating them to the enclave.
v5 -> v6
* Remove the ioctl to query API version.
* Update documentation to kernel-doc format.
v4 -> v5
* Update the size of the NE CPU pool string from 4096 to 512 chars.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Remove the NE CPU pool init during kernel module loading, as the CPU
pool is now setup at runtime, via a sysfs file for the kernel
parameter.
* Add minimum enclave memory size definition.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
* Remove the WARN_ON calls.
* Remove linux/bug and linux/kvm_host includes that are not needed.
* Remove "ratelimited" from the logs that are not in the ioctl call
paths.
* Remove file ops that do nothing for now - open and release.
v1 -> v2
* Add log pattern for NE.
* Update goto labels to match their purpose.
* Update ne_cpu_pool data structure to include the global mutex.
* Update NE misc device mode to 0660.
* Check if the CPU siblings are included in the NE CPU pool, as full CPU
cores are given for the enclave(s).
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 128 ++++++++++++++++++++++
drivers/virt/nitro_enclaves/ne_pci_dev.c | 17 +++
2 files changed, 145 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/ne_misc_dev.c
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
new file mode 100644
index 000000000000..0776a4b36c61
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+/**
+ * DOC: Enclave lifetime management driver for Nitro Enclaves (NE).
+ * Nitro is a hypervisor that has been developed by Amazon.
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/capability.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/hugetlb.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/nitro_enclaves.h>
+#include <linux/pci.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+
+#include "ne_misc_dev.h"
+#include "ne_pci_dev.h"
+
+/**
+ * NE_CPUS_SIZE - Size for max 128 CPUs, for now, in a cpu-list string, comma
+ * separated. The NE CPU pool includes CPUs from a single NUMA
+ * node.
+ */
+#define NE_CPUS_SIZE (512)
+
+/**
+ * NE_EIF_LOAD_OFFSET - The offset where to copy the Enclave Image Format (EIF)
+ * image in enclave memory.
+ */
+#define NE_EIF_LOAD_OFFSET (8 * 1024UL * 1024UL)
+
+/**
+ * NE_MIN_ENCLAVE_MEM_SIZE - The minimum memory size an enclave can be launched
+ * with.
+ */
+#define NE_MIN_ENCLAVE_MEM_SIZE (64 * 1024UL * 1024UL)
+
+/**
+ * NE_MIN_MEM_REGION_SIZE - The minimum size of an enclave memory region.
+ */
+#define NE_MIN_MEM_REGION_SIZE (2 * 1024UL * 1024UL)
+
+/*
+ * TODO: Update logic to create new sysfs entries instead of using
+ * a kernel parameter e.g. if multiple sysfs files needed.
+ */
+static const struct kernel_param_ops ne_cpu_pool_ops = {
+ .get = param_get_string,
+};
+
+static char ne_cpus[NE_CPUS_SIZE];
+static struct kparam_string ne_cpus_arg = {
+ .maxlen = sizeof(ne_cpus),
+ .string = ne_cpus,
+};
+
+module_param_cb(ne_cpus, &ne_cpu_pool_ops, &ne_cpus_arg, 0644);
+/* https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html#cpu-lists */
+MODULE_PARM_DESC(ne_cpus, "<cpu-list> - CPU pool used for Nitro Enclaves");
+
+/**
+ * struct ne_cpu_pool - CPU pool used for Nitro Enclaves.
+ * @avail_threads_per_core: Available full CPU cores to be dedicated to
+ * enclave(s). The cpumasks from the array, indexed
+ * by core id, contain all the threads from the
+ * available cores, that are not set for created
+ * enclave(s). The full CPU cores are part of the
+ * NE CPU pool.
+ * @mutex: Mutex for the access to the NE CPU pool.
+ * @nr_parent_vm_cores: The size of the available threads per core array.
+ * The total number of CPU cores available on the
+ * parent / primary VM.
+ * @nr_threads_per_core: The number of threads that a full CPU core has.
+ * @numa_node: NUMA node of the CPUs in the pool.
+ */
+struct ne_cpu_pool {
+ cpumask_var_t *avail_threads_per_core;
+ struct mutex mutex;
+ unsigned int nr_parent_vm_cores;
+ unsigned int nr_threads_per_core;
+ int numa_node;
+};
+
+static struct ne_cpu_pool ne_cpu_pool;
+
+static const struct file_operations ne_fops = {
+ .owner = THIS_MODULE,
+ .llseek = noop_llseek,
+};
+
+struct miscdevice ne_misc_dev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "nitro_enclaves",
+ .fops = &ne_fops,
+ .mode = 0660,
+};
+
+static int __init ne_init(void)
+{
+ mutex_init(&ne_cpu_pool.mutex);
+
+ return pci_register_driver(&ne_pci_driver);
+}
+
+static void __exit ne_exit(void)
+{
+ pci_unregister_driver(&ne_pci_driver);
+}
+
+module_init(ne_init);
+module_exit(ne_exit);
+
+MODULE_AUTHOR("Amazon.com, Inc. or its affiliates");
+MODULE_DESCRIPTION("Nitro Enclaves Driver");
+MODULE_LICENSE("GPL v2");
diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c b/drivers/virt/nitro_enclaves/ne_pci_dev.c
index a898fae066d9..65f2814ceae4 100644
--- a/drivers/virt/nitro_enclaves/ne_pci_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
@@ -527,6 +527,16 @@ static int ne_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto teardown_msix;
}
+ /* Set the NE PCI device as parent to use it in the ioctl logic. */
+ ne_misc_dev.parent = &pdev->dev;
+
+ rc = misc_register(&ne_misc_dev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in misc dev register [rc=%d]\n", rc);
+
+ goto disable_ne_pci_dev;
+ }
+
atomic_set(&ne_pci_dev->cmd_reply_avail, 0);
init_waitqueue_head(&ne_pci_dev->cmd_reply_wait_q);
INIT_LIST_HEAD(&ne_pci_dev->enclaves_list);
@@ -536,6 +546,9 @@ static int ne_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return 0;
+disable_ne_pci_dev:
+ ne_misc_dev.parent = NULL;
+ ne_pci_dev_disable(pdev);
teardown_msix:
ne_teardown_msix(pdev);
iounmap_pci_bar:
@@ -561,6 +574,10 @@ static void ne_pci_remove(struct pci_dev *pdev)
{
struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+ misc_deregister(&ne_misc_dev);
+
+ ne_misc_dev.parent = NULL;
+
ne_pci_dev_disable(pdev);
ne_teardown_msix(pdev);
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* Remove, for now, the dependency on ARM64 arch. x86 is currently
supported, with Arm to come afterwards. The NE kernel driver can be
built for aarch64 arch.
v5 -> v6
* No changes.
v4 -> v5
* Add arch dependency for Arm / x86.
v3 -> v4
* Add PCI and SMP dependencies.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* Update path to Kconfig to match the drivers/virt/nitro_enclaves
directory.
* Update help in Kconfig.
---
drivers/virt/Kconfig | 2 ++
drivers/virt/nitro_enclaves/Kconfig | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/Kconfig
diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index cbc1f25c79ab..80c5f9c16ec1 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -32,4 +32,6 @@ config FSL_HV_MANAGER
partition shuts down.
source "drivers/virt/vboxguest/Kconfig"
+
+source "drivers/virt/nitro_enclaves/Kconfig"
endif
diff --git a/drivers/virt/nitro_enclaves/Kconfig b/drivers/virt/nitro_enclaves/Kconfig
new file mode 100644
index 000000000000..8c9387a232df
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/Kconfig
@@ -0,0 +1,20 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+# Amazon Nitro Enclaves (NE) support.
+# Nitro is a hypervisor that has been developed by Amazon.
+
+# TODO: Add dependency for ARM64 once NE is supported on Arm platforms. For now,
+# the NE kernel driver can be built for aarch64 arch.
+# depends on (ARM64 || X86) && HOTPLUG_CPU && PCI && SMP
+
+config NITRO_ENCLAVES
+ tristate "Nitro Enclaves Support"
+ depends on X86 && HOTPLUG_CPU && PCI && SMP
+ help
+ This driver consists of support for enclave lifetime management
+ for Nitro Enclaves (NE).
+
+ To compile this driver as a module, choose M here.
+ The module will be called nitro_enclaves.
--
2.20.1 (Apple Git-117)
Add ioctl command logic for enclave VM creation. It triggers a slot
allocation. The enclave resources will be associated with this slot and
it will be used as an identifier for triggering enclave run.
Return a file descriptor, namely enclave fd. This is further used by the
associated user space enclave process to set enclave resources and
trigger enclave termination.
The poll function is implemented to notify the enclave process when an
enclave exits without an explicit enclave termination command, e.g. when
the enclave crashes.
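
As an illustration of the poll behaviour described above, the following is a
minimal user space sketch, not part of this patch. It assumes the caller
already holds an enclave fd returned by the NE_CREATE_VM ioctl. The names
wait_for_enclave_exit() and poll_sketch_selftest() are hypothetical. Since a
real enclave fd needs the NE device, the self-test uses the read end of a pipe
as a stand-in, as it also reports POLLHUP once its peer goes away.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until the fd reports POLLHUP or the timeout expires.
 * Returns 1 on hang-up (enclave exited), 0 on timeout, -1 on poll() error.
 */
static int wait_for_enclave_exit(int enclave_fd, int timeout_ms)
{
	/* POLLHUP is always reported in revents, so no events are requested. */
	struct pollfd pfd = { .fd = enclave_fd, .events = 0 };
	int rc;

	assert(enclave_fd >= 0);

	do {
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc < 0)
		return -1;

	return (pfd.revents & POLLHUP) ? 1 : 0;
}

/*
 * Stand-in check: a pipe read end plays the role of the enclave fd.
 * Returns 0 if no hang-up is seen before the write end is closed and a
 * hang-up is seen afterwards, -1 otherwise.
 */
static int poll_sketch_selftest(void)
{
	int fds[2];
	int before, after;

	if (pipe(fds) < 0)
		return -1;

	before = wait_for_enclave_exit(fds[0], 0);
	close(fds[1]);
	after = wait_for_enclave_exit(fds[0], 0);
	close(fds[0]);

	return (before == 0 && after == 1) ? 0 : -1;
}
```

In a real enclave process the same wait would typically run in a dedicated
thread or event loop, with a negative timeout to block until the enclave exits.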
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Use the NE misc device parent field to get the NE PCI device.
* Update the naming and add more comments to clarify the logic of
handling full CPU cores and dedicating them to the enclave.
v5 -> v6
* Update the code base to init the ioctl function in this patch.
* Update documentation to kernel-doc format.
v4 -> v5
* Release the reference to the NE PCI device on create VM error.
* Close enclave fd on copy_to_user() failure; rename fd to enclave fd
while at it.
* Remove sanity checks for situations that can only occur on a buggy
system or with broken logic.
* Remove log on copy_to_user() failure.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Update the NE ioctl call to match the decoupling from the KVM API.
* Add metadata for the NUMA node for the enclave memory and CPUs.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Update kzfree() calls to kfree().
* Remove file ops that do nothing for now - open.
v1 -> v2
* Add log pattern for NE.
* Update goto labels to match their purpose.
* Remove the BUG_ON calls.
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 226 ++++++++++++++++++++++
1 file changed, 226 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index 0776a4b36c61..a824a50341dd 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -96,9 +96,235 @@ struct ne_cpu_pool {
static struct ne_cpu_pool ne_cpu_pool;
+/**
+ * ne_enclave_poll() - Poll functionality used for enclave out-of-band events.
+ * @file: File associated with this poll function.
+ * @wait: Poll table data structure.
+ *
+ * Context: Process context.
+ * Return:
+ * * Poll mask.
+ */
+static __poll_t ne_enclave_poll(struct file *file, poll_table *wait)
+{
+ __poll_t mask = 0;
+ struct ne_enclave *ne_enclave = file->private_data;
+
+ poll_wait(file, &ne_enclave->eventq, wait);
+
+ if (!ne_enclave->has_event)
+ return mask;
+
+ mask = POLLHUP;
+
+ return mask;
+}
+
+static const struct file_operations ne_enclave_fops = {
+ .owner = THIS_MODULE,
+ .llseek = noop_llseek,
+ .poll = ne_enclave_poll,
+};
+
+/**
+ * ne_create_vm_ioctl() - Alloc slot to be associated with an enclave. Create
+ * enclave file descriptor to be further used for enclave
+ * resources handling e.g. memory regions and CPUs.
+ * @pdev: PCI device used for enclave lifetime management.
+ * @ne_pci_dev : Private data associated with the PCI device.
+ * @slot_uid: Generated unique slot id associated with an enclave.
+ *
+ * Context: Process context. This function is called with the ne_pci_dev enclave
+ * mutex held.
+ * Return:
+ * * Enclave fd on success.
+ * * Negative return value on failure.
+ */
+static int ne_create_vm_ioctl(struct pci_dev *pdev, struct ne_pci_dev *ne_pci_dev,
+ u64 *slot_uid)
+{
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ int enclave_fd = -1;
+ struct file *enclave_file = NULL;
+ unsigned int i = 0;
+ struct ne_enclave *ne_enclave = NULL;
+ int rc = -EINVAL;
+ struct slot_alloc_req slot_alloc_req = {};
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ if (!cpumask_empty(ne_cpu_pool.avail_threads_per_core[i]))
+ break;
+
+ if (i == ne_cpu_pool.nr_parent_vm_cores) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "No CPUs available in CPU pool\n");
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return -NE_ERR_NO_CPUS_AVAIL_IN_POOL;
+ }
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ ne_enclave = kzalloc(sizeof(*ne_enclave), GFP_KERNEL);
+ if (!ne_enclave)
+ return -ENOMEM;
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ ne_enclave->nr_parent_vm_cores = ne_cpu_pool.nr_parent_vm_cores;
+ ne_enclave->nr_threads_per_core = ne_cpu_pool.nr_threads_per_core;
+ ne_enclave->numa_node = ne_cpu_pool.numa_node;
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ ne_enclave->threads_per_core = kcalloc(ne_enclave->nr_parent_vm_cores,
+ sizeof(*ne_enclave->threads_per_core), GFP_KERNEL);
+ if (!ne_enclave->threads_per_core) {
+ rc = -ENOMEM;
+
+ goto free_ne_enclave;
+ }
+
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++)
+ if (!zalloc_cpumask_var(&ne_enclave->threads_per_core[i], GFP_KERNEL)) {
+ rc = -ENOMEM;
+
+ goto free_cpumask;
+ }
+
+ if (!zalloc_cpumask_var(&ne_enclave->vcpu_ids, GFP_KERNEL)) {
+ rc = -ENOMEM;
+
+ goto free_cpumask;
+ }
+
+ ne_enclave->pdev = pdev;
+
+ enclave_fd = get_unused_fd_flags(O_CLOEXEC);
+ if (enclave_fd < 0) {
+ rc = enclave_fd;
+
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in getting unused fd [rc=%d]\n", rc);
+
+ goto free_cpumask;
+ }
+
+ enclave_file = anon_inode_getfile("ne-vm", &ne_enclave_fops, ne_enclave, O_RDWR);
+ if (IS_ERR(enclave_file)) {
+ rc = PTR_ERR(enclave_file);
+
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in anon inode get file [rc=%d]\n", rc);
+
+ goto put_fd;
+ }
+
+ rc = ne_do_request(ne_enclave->pdev, SLOT_ALLOC, &slot_alloc_req, sizeof(slot_alloc_req),
+ &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in slot alloc [rc=%d]\n", rc);
+
+ goto put_file;
+ }
+
+ init_waitqueue_head(&ne_enclave->eventq);
+ ne_enclave->has_event = false;
+ mutex_init(&ne_enclave->enclave_info_mutex);
+ ne_enclave->max_mem_regions = cmd_reply.mem_regions;
+ INIT_LIST_HEAD(&ne_enclave->mem_regions_list);
+ ne_enclave->mm = current->mm;
+ ne_enclave->slot_uid = cmd_reply.slot_uid;
+ ne_enclave->state = NE_STATE_INIT;
+
+ list_add(&ne_enclave->enclave_list_entry, &ne_pci_dev->enclaves_list);
+
+ *slot_uid = ne_enclave->slot_uid;
+
+ fd_install(enclave_fd, enclave_file);
+
+ return enclave_fd;
+
+put_file:
+ fput(enclave_file);
+put_fd:
+ put_unused_fd(enclave_fd);
+free_cpumask:
+ free_cpumask_var(ne_enclave->vcpu_ids);
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++)
+ free_cpumask_var(ne_enclave->threads_per_core[i]);
+ kfree(ne_enclave->threads_per_core);
+free_ne_enclave:
+ kfree(ne_enclave);
+
+ return rc;
+}
+
+/**
+ * ne_ioctl() - Ioctl function provided by the NE misc device.
+ * @file: File associated with this ioctl function.
+ * @cmd: The command that is set for the ioctl call.
+ * @arg: The argument that is provided for the ioctl call.
+ *
+ * Context: Process context.
+ * Return:
+ * * Ioctl result (e.g. enclave file descriptor) on success.
+ * * Negative return value on failure.
+ */
+static long ne_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ switch (cmd) {
+ case NE_CREATE_VM: {
+ int enclave_fd = -1;
+ struct file *enclave_file = NULL;
+ struct ne_pci_dev *ne_pci_dev = NULL;
+ struct pci_dev *pdev = to_pci_dev(ne_misc_dev.parent);
+ int rc = -EINVAL;
+ u64 slot_uid = 0;
+
+ ne_pci_dev = pci_get_drvdata(pdev);
+
+ mutex_lock(&ne_pci_dev->enclaves_list_mutex);
+
+ enclave_fd = ne_create_vm_ioctl(pdev, ne_pci_dev, &slot_uid);
+ if (enclave_fd < 0) {
+ rc = enclave_fd;
+
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+
+ return rc;
+ }
+
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+
+ if (copy_to_user((void __user *)arg, &slot_uid, sizeof(slot_uid))) {
+ enclave_file = fget(enclave_fd);
+ /* Decrement file refs to have release() called. */
+ fput(enclave_file);
+ fput(enclave_file);
+ put_unused_fd(enclave_fd);
+
+ return -EFAULT;
+ }
+
+ return enclave_fd;
+ }
+
+ default:
+ return -ENOTTY;
+ }
+
+ return 0;
+}
+
static const struct file_operations ne_fops = {
.owner = THIS_MODULE,
.llseek = noop_llseek,
+ .unlocked_ioctl = ne_ioctl,
};
struct miscdevice ne_misc_dev = {
--
2.20.1 (Apple Git-117)
An enclave, before being started, has its resources set. One of its
resources is CPU.
An NE CPU pool is set and enclave CPUs are chosen from it. Offline the
CPUs from the NE CPU pool during the pool setup and online them back
during the NE CPU pool teardown. Offlining is necessary so that the
primary / parent VM does not end up with more vCPUs than physical CPUs
available to it. Otherwise the CPUs would be overcommitted, which would
change the initial configuration of the primary / parent VM, where each
vCPU is dedicated to a physical CPU.
The enclave CPUs need to be full cores and from the same NUMA node. CPU
0 and its siblings have to remain available to the primary / parent VM.
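
The pool constraints above can be modelled in user space roughly as follows.
This is a simplified sketch and not the driver code: it uses plain 32-bit
masks instead of cpumasks, it skips the NUMA node and online checks, and the
names MODEL_NR_CPUS and model_pool_is_valid() as well as the caller-provided
siblings[] topology table are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8u

/*
 * siblings[cpu] is the bitmask of all threads sharing a core with cpu,
 * including cpu itself (hypothetical stand-in for the kernel topology).
 *
 * Returns true if the pool is non-empty, leaves CPU 0 and its siblings to
 * the primary / parent VM and contains only full cores.
 */
static bool model_pool_is_valid(uint32_t pool,
				const uint32_t siblings[MODEL_NR_CPUS])
{
	unsigned int cpu;

	assert(siblings);

	/* Reject an empty pool or CPUs beyond the modelled range. */
	if (pool == 0 || (pool >> MODEL_NR_CPUS) != 0)
		return false;

	/* CPU 0 and its siblings have to remain available. */
	if (pool & (siblings[0] | 1u))
		return false;

	/* Every CPU in the pool must bring all of its siblings along. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (!(pool & (1u << cpu)))
			continue;
		if ((pool & siblings[cpu]) != siblings[cpu])
			return false;
	}

	return true;
}
```

For example, on a model topology where CPU n and CPU n + 4 share a core, the
pool {1, 5} would pass, while {1} alone or any pool containing CPU 4 would be
rejected.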
Add ioctl command logic for setting an enclave vCPU.
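
When user space passes a vCPU id of 0, the driver picks a CPU on its own:
first a remaining sibling of a core the enclave already owns, otherwise the
first thread of a fresh core from the pool. The sketch below models that
ordering in user space. It is a simplification under stated assumptions:
32-bit masks replace cpumasks, locking is omitted, the CPU is marked as used
immediately (the driver only does so after the add-vCPU request to the device
succeeds), and the lowest CPU id is taken where the driver uses cpumask_any().
All model_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CORES 4u

struct model_pool {
	uint32_t avail_threads_per_core[MODEL_NR_CORES];
};

struct model_enclave {
	uint32_t threads_per_core[MODEL_NR_CORES];
	uint32_t vcpu_ids;
};

/* Return the index of the lowest set bit; mask must be non-zero. */
static unsigned int model_lowest_cpu(uint32_t mask)
{
	unsigned int cpu = 0;

	assert(mask != 0);

	while (!(mask & 1u)) {
		mask >>= 1;
		cpu++;
	}

	return cpu;
}

/* Pick a CPU for the enclave; return its id, or -1 if the pool is empty. */
static int model_pick_vcpu(struct model_enclave *e, struct model_pool *p)
{
	unsigned int core, cpu;

	/* Prefer unused siblings of cores already dedicated to the enclave. */
	for (core = 0; core < MODEL_NR_CORES; core++) {
		uint32_t unused = e->threads_per_core[core] & ~e->vcpu_ids;

		if (unused) {
			cpu = model_lowest_cpu(unused);
			e->vcpu_ids |= 1u << cpu;
			return (int)cpu;
		}
	}

	/* Otherwise move a full core from the pool to the enclave. */
	for (core = 0; core < MODEL_NR_CORES; core++) {
		if (!p->avail_threads_per_core[core])
			continue;

		e->threads_per_core[core] = p->avail_threads_per_core[core];
		p->avail_threads_per_core[core] = 0;

		cpu = model_lowest_cpu(e->threads_per_core[core]);
		e->vcpu_ids |= 1u << cpu;
		return (int)cpu;
	}

	return -1;
}
```

Moving the whole core out of the pool on first use is what keeps two enclaves
from ever sharing sibling threads, even if one of them never asks for the
second thread.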
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* Check for error return value when setting the kernel parameter string.
* Use the NE misc device parent field to get the NE PCI device.
* Update the naming and add more comments to clarify the logic of
handling full CPU cores and dedicating them to the enclave.
* Calculate the number of threads per core instead of using
smp_num_siblings, which is x86 specific.
v5 -> v6
* Check CPUs are from the same NUMA node before going through CPU
siblings during the NE CPU pool setup.
* Update documentation to kernel-doc format.
v4 -> v5
* Set empty string in case of invalid NE CPU pool.
* Clear NE CPU pool mask on pool setup failure.
* Setup NE CPU cores out of the NE CPU pool.
* Early exit on NE CPU pool setup if enclave(s) already running.
* Remove sanity checks for situations that can only occur on a buggy
system or with broken logic.
* Add check for maximum vCPU id possible before looking into the CPU
pool.
* Remove log on copy_from_user() / copy_to_user() failure and on admin
capability check for setting the NE CPU pool.
* Update the ioctl call to not create a file descriptor for the vCPU.
* Split the CPU pool usage logic in 2 separate functions - one to get a
CPU from the pool and the other to check the given CPU is available in
the pool.
v3 -> v4
* Setup the NE CPU pool at runtime via a sysfs file for the kernel
parameter.
* Check enclave CPUs to be from the same NUMA node.
* Use dev_err instead of custom NE log pattern.
* Update the NE ioctl call to match the decoupling from the KVM API.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Update kzfree() calls to kfree().
* Remove file ops that do nothing for now - open, ioctl and release.
v1 -> v2
* Add log pattern for NE.
* Update goto labels to match their purpose.
* Remove the BUG_ON calls.
* Check if enclave state is init when setting enclave vCPU.
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 702 ++++++++++++++++++++++
1 file changed, 702 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index a824a50341dd..104c9646ec87 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -57,8 +57,11 @@
* TODO: Update logic to create new sysfs entries instead of using
* a kernel parameter e.g. if multiple sysfs files needed.
*/
+static int ne_set_kernel_param(const char *val, const struct kernel_param *kp);
+
static const struct kernel_param_ops ne_cpu_pool_ops = {
.get = param_get_string,
+ .set = ne_set_kernel_param,
};
static char ne_cpus[NE_CPUS_SIZE];
@@ -96,6 +99,702 @@ struct ne_cpu_pool {
static struct ne_cpu_pool ne_cpu_pool;
+/**
+ * ne_check_enclaves_created() - Verify if at least one enclave has been created.
+ * @void: No parameters provided.
+ *
+ * Context: Process context.
+ * Return:
+ * * True if at least one enclave is created.
+ * * False otherwise.
+ */
+static bool ne_check_enclaves_created(void)
+{
+ struct ne_pci_dev *ne_pci_dev = NULL;
+ struct pci_dev *pdev = NULL;
+ bool ret = false;
+
+ if (!ne_misc_dev.parent)
+ return ret;
+
+ pdev = to_pci_dev(ne_misc_dev.parent);
+ if (!pdev)
+ return ret;
+
+ ne_pci_dev = pci_get_drvdata(pdev);
+ if (!ne_pci_dev)
+ return ret;
+
+ mutex_lock(&ne_pci_dev->enclaves_list_mutex);
+
+ if (!list_empty(&ne_pci_dev->enclaves_list))
+ ret = true;
+
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+
+ return ret;
+}
+
+/**
+ * ne_setup_cpu_pool() - Set the NE CPU pool after handling sanity checks such
+ * as not sharing CPU cores with the primary / parent VM
+ * or not using CPU 0, which should remain available for
+ * the primary / parent VM. Offline the CPUs from the
+ * pool after the checks passed.
+ * @ne_cpu_list: The CPU list used for setting NE CPU pool.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_setup_cpu_pool(const char *ne_cpu_list)
+{
+ int core_id = -1;
+ unsigned int cpu = 0;
+ cpumask_var_t cpu_pool;
+ unsigned int cpu_sibling = 0;
+ unsigned int i = 0;
+ int numa_node = -1;
+ int rc = -EINVAL;
+
+ if (!zalloc_cpumask_var(&cpu_pool, GFP_KERNEL))
+ return -ENOMEM;
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ rc = cpulist_parse(ne_cpu_list, cpu_pool);
+ if (rc < 0) {
+ pr_err("%s: Error in cpulist parse [rc=%d]\n", ne_misc_dev.name, rc);
+
+ goto free_pool_cpumask;
+ }
+
+ cpu = cpumask_any(cpu_pool);
+ if (cpu >= nr_cpu_ids) {
+ pr_err("%s: No CPUs available in CPU pool\n", ne_misc_dev.name);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+
+ /*
+ * Check if the CPUs are online, to further get info about them
+ * e.g. numa node, core id, siblings.
+ */
+ for_each_cpu(cpu, cpu_pool)
+ if (cpu_is_offline(cpu)) {
+ pr_err("%s: CPU %d is offline, has to be online to get its metadata\n",
+ ne_misc_dev.name, cpu);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+
+ /*
+ * Check if the CPUs from the NE CPU pool are from the same NUMA node.
+ */
+ for_each_cpu(cpu, cpu_pool)
+ if (numa_node < 0) {
+ numa_node = cpu_to_node(cpu);
+ if (numa_node < 0) {
+ pr_err("%s: Invalid NUMA node %d\n",
+ ne_misc_dev.name, numa_node);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+ } else {
+ if (numa_node != cpu_to_node(cpu)) {
+ pr_err("%s: CPUs with different NUMA nodes\n",
+ ne_misc_dev.name);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+ }
+
+ /*
+ * Check if CPU 0 and its siblings are included in the provided CPU pool
+ * They should remain available for the primary / parent VM.
+ */
+ if (cpumask_test_cpu(0, cpu_pool)) {
+ pr_err("%s: CPU 0 has to remain available\n", ne_misc_dev.name);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+
+ for_each_cpu(cpu_sibling, topology_sibling_cpumask(0)) {
+ if (cpumask_test_cpu(cpu_sibling, cpu_pool)) {
+ pr_err("%s: CPU sibling %d for CPU 0 is in CPU pool\n",
+ ne_misc_dev.name, cpu_sibling);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+ }
+
+ /*
+ * Check if CPU siblings are included in the provided CPU pool. The
+ * expectation is that full CPU cores are made available in the CPU pool
+ * for enclaves.
+ */
+ for_each_cpu(cpu, cpu_pool) {
+ for_each_cpu(cpu_sibling, topology_sibling_cpumask(cpu)) {
+ if (!cpumask_test_cpu(cpu_sibling, cpu_pool)) {
+ pr_err("%s: CPU %d is not in CPU pool\n",
+ ne_misc_dev.name, cpu_sibling);
+
+ rc = -EINVAL;
+
+ goto free_pool_cpumask;
+ }
+ }
+ }
+
+ /* Calculate the number of threads from a full CPU core. */
+ cpu = cpumask_any(cpu_pool);
+ for_each_cpu(cpu_sibling, topology_sibling_cpumask(cpu))
+ ne_cpu_pool.nr_threads_per_core++;
+
+ ne_cpu_pool.nr_parent_vm_cores = nr_cpu_ids / ne_cpu_pool.nr_threads_per_core;
+
+ ne_cpu_pool.avail_threads_per_core = kcalloc(ne_cpu_pool.nr_parent_vm_cores,
+ sizeof(*ne_cpu_pool.avail_threads_per_core),
+ GFP_KERNEL);
+ if (!ne_cpu_pool.avail_threads_per_core) {
+ rc = -ENOMEM;
+
+ goto free_pool_cpumask;
+ }
+
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ if (!zalloc_cpumask_var(&ne_cpu_pool.avail_threads_per_core[i], GFP_KERNEL)) {
+ rc = -ENOMEM;
+
+ goto free_cores_cpumask;
+ }
+
+ /*
+ * Split the NE CPU pool in threads per core to keep the CPU topology
+ * after offlining the CPUs.
+ */
+ for_each_cpu(cpu, cpu_pool) {
+ core_id = topology_core_id(cpu);
+ if (core_id < 0 || core_id >= ne_cpu_pool.nr_parent_vm_cores) {
+ pr_err("%s: Invalid core id %d for CPU %d\n",
+ ne_misc_dev.name, core_id, cpu);
+
+ rc = -EINVAL;
+
+ goto clear_cpumask;
+ }
+
+ cpumask_set_cpu(cpu, ne_cpu_pool.avail_threads_per_core[core_id]);
+ }
+
+ /*
+ * CPUs that are given to enclave(s) should not be considered online
+ * by Linux anymore, as the hypervisor will degrade them to floating.
+ * The physical CPUs (full cores) are carved out of the primary / parent
+ * VM and given to the enclave VM. The same number of vCPUs would run
+ * on less pCPUs for the primary / parent VM.
+ *
+ * We offline them here, to not degrade performance and expose correct
+ * topology to Linux and user space.
+ */
+ for_each_cpu(cpu, cpu_pool) {
+ rc = remove_cpu(cpu);
+ if (rc != 0) {
+ pr_err("%s: CPU %d is not offlined [rc=%d]\n",
+ ne_misc_dev.name, cpu, rc);
+
+ goto online_cpus;
+ }
+ }
+
+ free_cpumask_var(cpu_pool);
+
+ ne_cpu_pool.numa_node = numa_node;
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return 0;
+
+online_cpus:
+ for_each_cpu(cpu, cpu_pool)
+ add_cpu(cpu);
+clear_cpumask:
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ cpumask_clear(ne_cpu_pool.avail_threads_per_core[i]);
+free_cores_cpumask:
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ free_cpumask_var(ne_cpu_pool.avail_threads_per_core[i]);
+ kfree(ne_cpu_pool.avail_threads_per_core);
+free_pool_cpumask:
+ free_cpumask_var(cpu_pool);
+ ne_cpu_pool.nr_parent_vm_cores = 0;
+ ne_cpu_pool.nr_threads_per_core = 0;
+ ne_cpu_pool.numa_node = -1;
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return rc;
+}
+
+/**
+ * ne_teardown_cpu_pool() - Online the CPUs from the NE CPU pool and cleanup the
+ * CPU pool.
+ * @void: No parameters provided.
+ *
+ * Context: Process context.
+ */
+static void ne_teardown_cpu_pool(void)
+{
+ unsigned int cpu = 0;
+ unsigned int i = 0;
+ int rc = -EINVAL;
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ if (!ne_cpu_pool.nr_parent_vm_cores) {
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return;
+ }
+
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++) {
+ for_each_cpu(cpu, ne_cpu_pool.avail_threads_per_core[i]) {
+ rc = add_cpu(cpu);
+ if (rc != 0)
+ pr_err("%s: CPU %d is not onlined [rc=%d]\n",
+ ne_misc_dev.name, cpu, rc);
+ }
+
+ cpumask_clear(ne_cpu_pool.avail_threads_per_core[i]);
+
+ free_cpumask_var(ne_cpu_pool.avail_threads_per_core[i]);
+ }
+
+ kfree(ne_cpu_pool.avail_threads_per_core);
+ ne_cpu_pool.nr_parent_vm_cores = 0;
+ ne_cpu_pool.nr_threads_per_core = 0;
+ ne_cpu_pool.numa_node = -1;
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+}
+
+/**
+ * ne_set_kernel_param() - Set the NE CPU pool value via the NE kernel parameter.
+ * @val: NE CPU pool string value.
+ * @kp : NE kernel parameter associated with the NE CPU pool.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_set_kernel_param(const char *val, const struct kernel_param *kp)
+{
+ char error_val[] = "";
+ int rc = -EINVAL;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ if (ne_check_enclaves_created()) {
+ pr_err("%s: The CPU pool is used by enclave(s)\n", ne_misc_dev.name);
+
+ return -EPERM;
+ }
+
+ ne_teardown_cpu_pool();
+
+ rc = ne_setup_cpu_pool(val);
+ if (rc < 0) {
+ pr_err("%s: Error in setup CPU pool [rc=%d]\n", ne_misc_dev.name, rc);
+
+ param_set_copystring(error_val, kp);
+
+ return rc;
+ }
+
+ rc = param_set_copystring(val, kp);
+ if (rc < 0) {
+ pr_err("%s: Error in param set copystring [rc=%d]\n", ne_misc_dev.name, rc);
+
+ ne_teardown_cpu_pool();
+
+ param_set_copystring(error_val, kp);
+
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_donated_cpu() - Check if the provided CPU is already used by the enclave.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @cpu: CPU to check if already used.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * True if the provided CPU is already used by the enclave.
+ * * False otherwise.
+ */
+static bool ne_donated_cpu(struct ne_enclave *ne_enclave, unsigned int cpu)
+{
+ if (cpumask_test_cpu(cpu, ne_enclave->vcpu_ids))
+ return true;
+
+ return false;
+}
+
+/**
+ * ne_get_unused_core_from_cpu_pool() - Get the id of a full core from the
+ * NE CPU pool.
+ * @void: No parameters provided.
+ *
+ * Context: Process context. This function is called with the ne_enclave and
+ * ne_cpu_pool mutexes held.
+ * Return:
+ * * Core id.
+ * * -1 if no CPU core available in the pool.
+ */
+static int ne_get_unused_core_from_cpu_pool(void)
+{
+ int core_id = -1;
+ unsigned int i = 0;
+
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ if (!cpumask_empty(ne_cpu_pool.avail_threads_per_core[i])) {
+ core_id = i;
+
+ break;
+ }
+
+ return core_id;
+}
+
+/**
+ * ne_set_enclave_threads_per_core() - Set the threads of the provided core in
+ * the enclave data structure.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @core_id: Core id to get its threads from the NE CPU pool.
+ * @vcpu_id: vCPU id part of the provided core.
+ *
+ * Context: Process context. This function is called with the ne_enclave and
+ * ne_cpu_pool mutexes held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_set_enclave_threads_per_core(struct ne_enclave *ne_enclave,
+ int core_id, u32 vcpu_id)
+{
+ unsigned int cpu = 0;
+
+ if (core_id < 0 && vcpu_id == 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "No CPUs available in NE CPU pool\n");
+
+ return -NE_ERR_NO_CPUS_AVAIL_IN_POOL;
+ }
+
+ if (core_id < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "CPU %d is not in NE CPU pool\n", vcpu_id);
+
+ return -NE_ERR_VCPU_NOT_IN_CPU_POOL;
+ }
+
+ if (core_id >= ne_enclave->nr_parent_vm_cores) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Invalid core id %d - ne_enclave\n", core_id);
+
+ return -NE_ERR_VCPU_INVALID_CPU_CORE;
+ }
+
+ for_each_cpu(cpu, ne_cpu_pool.avail_threads_per_core[core_id])
+ cpumask_set_cpu(cpu, ne_enclave->threads_per_core[core_id]);
+
+ cpumask_clear(ne_cpu_pool.avail_threads_per_core[core_id]);
+
+ return 0;
+}
+
+/**
+ * ne_get_cpu_from_cpu_pool() - Get a CPU from the NE CPU pool, either from the
+ * remaining sibling(s) of a CPU core or the first
+ * sibling of a new CPU core.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @vcpu_id: vCPU to get from the NE CPU pool.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_get_cpu_from_cpu_pool(struct ne_enclave *ne_enclave, u32 *vcpu_id)
+{
+ int core_id = -1;
+ unsigned int cpu = 0;
+ unsigned int i = 0;
+ int rc = -EINVAL;
+
+ /*
+ * If previously allocated a thread of a core to this enclave, first
+ * check remaining sibling(s) for new CPU allocations, so that full
+ * CPU cores are used for the enclave.
+ */
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++)
+ for_each_cpu(cpu, ne_enclave->threads_per_core[i])
+ if (!ne_donated_cpu(ne_enclave, cpu)) {
+ *vcpu_id = cpu;
+
+ return 0;
+ }
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ /*
+ * If no remaining siblings, get a core from the NE CPU pool and keep
+ * track of all the threads in the enclave threads per core data structure.
+ */
+ core_id = ne_get_unused_core_from_cpu_pool();
+
+ rc = ne_set_enclave_threads_per_core(ne_enclave, core_id, *vcpu_id);
+ if (rc < 0)
+ goto unlock_mutex;
+
+ *vcpu_id = cpumask_any(ne_enclave->threads_per_core[core_id]);
+
+ rc = 0;
+
+unlock_mutex:
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return rc;
+}
+
+/**
+ * ne_get_vcpu_core_from_cpu_pool() - Get from the NE CPU pool the id of the
+ * core associated with the provided vCPU.
+ * @vcpu_id: Provided vCPU id to get its associated core id.
+ *
+ * Context: Process context. This function is called with the ne_enclave and
+ * ne_cpu_pool mutexes held.
+ * Return:
+ * * Core id.
+ * * -1 if the provided vCPU is not in the pool.
+ */
+static int ne_get_vcpu_core_from_cpu_pool(u32 vcpu_id)
+{
+ int core_id = -1;
+ unsigned int i = 0;
+
+ for (i = 0; i < ne_cpu_pool.nr_parent_vm_cores; i++)
+ if (cpumask_test_cpu(vcpu_id, ne_cpu_pool.avail_threads_per_core[i])) {
+ core_id = i;
+
+ break;
+ }
+
+ return core_id;
+}
+
+/**
+ * ne_check_cpu_in_cpu_pool() - Check if the given vCPU is in the available CPUs
+ * from the pool.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @vcpu_id: ID of the vCPU to check if available in the NE CPU pool.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_check_cpu_in_cpu_pool(struct ne_enclave *ne_enclave, u32 vcpu_id)
+{
+ int core_id = -1;
+ unsigned int i = 0;
+ int rc = -EINVAL;
+
+ if (ne_donated_cpu(ne_enclave, vcpu_id)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "CPU %d already used\n", vcpu_id);
+
+ return -NE_ERR_VCPU_ALREADY_USED;
+ }
+
+ /*
+ * If previously allocated a thread of a core to this enclave, but not
+ * the full core, first check remaining sibling(s).
+ */
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++)
+ if (cpumask_test_cpu(vcpu_id, ne_enclave->threads_per_core[i]))
+ return 0;
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ /*
+ * If no remaining siblings, get from the NE CPU pool the core
+ * associated with the vCPU and keep track of all the threads in the
+ * enclave threads per core data structure.
+ */
+ core_id = ne_get_vcpu_core_from_cpu_pool(vcpu_id);
+
+ rc = ne_set_enclave_threads_per_core(ne_enclave, core_id, vcpu_id);
+ if (rc < 0)
+ goto unlock_mutex;
+
+ rc = 0;
+
+unlock_mutex:
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ return rc;
+}
+
+/**
+ * ne_add_vcpu_ioctl() - Add a vCPU to the slot associated with the current
+ * enclave.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @vcpu_id: ID of the CPU to be associated with the given slot,
+ * apic id on x86.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_add_vcpu_ioctl(struct ne_enclave *ne_enclave, u32 vcpu_id)
+{
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ int rc = -EINVAL;
+ struct slot_add_vcpu_req slot_add_vcpu_req = {};
+
+ if (ne_enclave->mm != current->mm)
+ return -EIO;
+
+ slot_add_vcpu_req.slot_uid = ne_enclave->slot_uid;
+ slot_add_vcpu_req.vcpu_id = vcpu_id;
+
+ rc = ne_do_request(ne_enclave->pdev, SLOT_ADD_VCPU, &slot_add_vcpu_req,
+ sizeof(slot_add_vcpu_req), &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in slot add vCPU [rc=%d]\n", rc);
+
+ return rc;
+ }
+
+ cpumask_set_cpu(vcpu_id, ne_enclave->vcpu_ids);
+
+ ne_enclave->nr_vcpus++;
+
+ return 0;
+}
+
+/**
+ * ne_enclave_ioctl() - Ioctl function provided by the enclave file.
+ * @file: File associated with this ioctl function.
+ * @cmd: The command that is set for the ioctl call.
+ * @arg: The argument that is provided for the ioctl call.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static long ne_enclave_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct ne_enclave *ne_enclave = file->private_data;
+
+ switch (cmd) {
+ case NE_ADD_VCPU: {
+ int rc = -EINVAL;
+ u32 vcpu_id = 0;
+
+ if (copy_from_user(&vcpu_id, (void __user *)arg, sizeof(vcpu_id)))
+ return -EFAULT;
+
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ if (ne_enclave->state != NE_STATE_INIT) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave is not in init state\n");
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return -NE_ERR_NOT_IN_INIT_STATE;
+ }
+
+ if (vcpu_id >= (ne_enclave->nr_parent_vm_cores *
+ ne_enclave->nr_threads_per_core)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "vCPU id higher than max CPU id\n");
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return -NE_ERR_INVALID_VCPU;
+ }
+
+ if (!vcpu_id) {
+ /* Use the CPU pool for choosing a CPU for the enclave. */
+ rc = ne_get_cpu_from_cpu_pool(ne_enclave, &vcpu_id);
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in get CPU from pool [rc=%d]\n",
+ rc);
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return rc;
+ }
+ } else {
+ /* Check if the provided vCPU is available in the NE CPU pool. */
+ rc = ne_check_cpu_in_cpu_pool(ne_enclave, vcpu_id);
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in check CPU %d in pool [rc=%d]\n",
+ vcpu_id, rc);
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return rc;
+ }
+ }
+
+ rc = ne_add_vcpu_ioctl(ne_enclave, vcpu_id);
+ if (rc < 0) {
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return rc;
+ }
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ if (copy_to_user((void __user *)arg, &vcpu_id, sizeof(vcpu_id)))
+ return -EFAULT;
+
+ return 0;
+ }
+
+ default:
+ return -ENOTTY;
+ }
+
+ return 0;
+}
+
/**
* ne_enclave_poll() - Poll functionality used for enclave out-of-band events.
* @file: File associated with this poll function.
@@ -124,6 +823,7 @@ static const struct file_operations ne_enclave_fops = {
.owner = THIS_MODULE,
.llseek = noop_llseek,
.poll = ne_enclave_poll,
+ .unlocked_ioctl = ne_enclave_ioctl,
};
/**
@@ -344,6 +1044,8 @@ static int __init ne_init(void)
static void __exit ne_exit(void)
{
pci_unregister_driver(&ne_pci_driver);
+
+ ne_teardown_cpu_pool();
}
module_init(ne_init);
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Before setting the memory regions for the enclave, the enclave image
needs to be placed in memory. After the memory regions are set, this
memory cannot be used anymore by the VM, being carved out.
Add ioctl command logic to get the offset in enclave memory where to
place the enclave image. Then the user space tooling copies the enclave
image in the memory using the given memory offset.
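The flow described above can be modeled in plain user space C. This is only a sketch under stated assumptions: the sketch_* names are hypothetical, the flag value 0x01 is assumed, and the 8 MiB offset is taken from the NE overview document rather than from the uapi header. It mirrors the flags check of the NE_GET_IMAGE_LOAD_INFO handler and then copies an image at the returned offset.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed values for illustration; the real ones live in the NE uapi header. */
#define SKETCH_NE_EIF_IMAGE       0x01ULL
#define SKETCH_NE_EIF_LOAD_OFFSET (8ULL * 1024 * 1024) /* 8 MiB */

/* Hypothetical mirror of struct ne_image_load_info. */
struct sketch_image_load_info {
	uint64_t flags;         /* in: type of the enclave image */
	uint64_t memory_offset; /* out: offset where the image is placed */
};

/* Mirrors the flags check done by the NE_GET_IMAGE_LOAD_INFO handler. */
static int sketch_get_image_load_info(struct sketch_image_load_info *info)
{
	if (info->flags != SKETCH_NE_EIF_IMAGE)
		return -EINVAL;

	info->memory_offset = SKETCH_NE_EIF_LOAD_OFFSET;

	return 0;
}

/*
 * Copy the image into the memory that will later be given to the enclave,
 * at the offset reported for this image type. This has to happen before
 * the memory regions are set, as the memory is carved out afterwards.
 */
static int sketch_place_image(uint8_t *enclave_mem, size_t enclave_mem_size,
			      const uint8_t *image, size_t image_size)
{
	struct sketch_image_load_info info = { .flags = SKETCH_NE_EIF_IMAGE };
	int rc = sketch_get_image_load_info(&info);

	if (rc < 0)
		return rc;

	/* The image has to fit between the load offset and the end. */
	if (info.memory_offset > enclave_mem_size ||
	    image_size > enclave_mem_size - info.memory_offset)
		return -ENOMEM;

	memcpy(enclave_mem + info.memory_offset, image, image_size);

	return 0;
}
```

In the real tooling the offset would come from the ioctl call on the enclave fd; the model only shows the ordering and the bounds check a caller may want before copying.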
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* Check for invalid enclave image load flags.
v4 -> v5
* Check for the enclave not being started when invoking this ioctl call.
* Remove log on copy_from_user() / copy_to_user() failure.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Set enclave image load offset based on flags.
* Update the naming for the ioctl command from metadata to info.
v2 -> v3
* No changes.
v1 -> v2
* New in v2.
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 30 +++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index 104c9646ec87..810c4bba424f 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -788,6 +788,36 @@ static long ne_enclave_ioctl(struct file *file, unsigned int cmd, unsigned long
return 0;
}
+ case NE_GET_IMAGE_LOAD_INFO: {
+ struct ne_image_load_info image_load_info = {};
+
+ if (copy_from_user(&image_load_info, (void __user *)arg, sizeof(image_load_info)))
+ return -EFAULT;
+
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ if (ne_enclave->state != NE_STATE_INIT) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave is not in init state\n");
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return -NE_ERR_NOT_IN_INIT_STATE;
+ }
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ if (image_load_info.flags == NE_EIF_IMAGE)
+ image_load_info.memory_offset = NE_EIF_LOAD_OFFSET;
+ else
+ return -EINVAL;
+
+ if (copy_to_user((void __user *)arg, &image_load_info, sizeof(image_load_info)))
+ return -EFAULT;
+
+ return 0;
+ }
+
default:
return -ENOTTY;
}
--
2.20.1 (Apple Git-117)
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* No changes.
v4 -> v5
* No changes.
v3 -> v4
* Update doc type from .txt to .rst.
* Update documentation based on the changes from v4.
v2 -> v3
* No changes.
v1 -> v2
* New in v2.
---
Documentation/nitro_enclaves/ne_overview.rst | 87 ++++++++++++++++++++
1 file changed, 87 insertions(+)
create mode 100644 Documentation/nitro_enclaves/ne_overview.rst
diff --git a/Documentation/nitro_enclaves/ne_overview.rst b/Documentation/nitro_enclaves/ne_overview.rst
new file mode 100644
index 000000000000..9cc7a2720955
--- /dev/null
+++ b/Documentation/nitro_enclaves/ne_overview.rst
@@ -0,0 +1,87 @@
+Nitro Enclaves
+==============
+
+Nitro Enclaves (NE) is a new Amazon Elastic Compute Cloud (EC2) capability
+that allows customers to carve out isolated compute environments within EC2
+instances [1].
+
+For example, an application that processes sensitive data and runs in a VM
+can be separated from other applications running in the same VM. This
+application then runs in a VM separate from the primary VM, namely an enclave.
+
+An enclave runs alongside the VM that spawned it. This setup suits the needs of
+low-latency applications. The resources that are allocated for the enclave, such as
+memory and CPUs, are carved out of the primary VM. Each enclave is mapped to a
+process running in the primary VM, that communicates with the NE driver via an
+ioctl interface.
+
+In this sense, there are two components:
+
+1. An enclave abstraction process - a user space process running in the primary
+VM guest that uses the provided ioctl interface of the NE driver to spawn an
+enclave VM (that's 2 below).
+
+There is a NE emulated PCI device exposed to the primary VM. The driver for this
+new PCI device is included in the NE driver.
+
+The ioctl logic is mapped to PCI device commands e.g. the NE_START_ENCLAVE ioctl
+maps to an enclave start PCI command. The PCI device commands are then
+translated into actions taken on the hypervisor side; that's the Nitro
+hypervisor running on the host where the primary VM is running. The Nitro
+hypervisor is based on core KVM technology.
+
+2. The enclave itself - a VM running on the same host as the primary VM that
+spawned it. Memory and CPUs are carved out of the primary VM and are dedicated
+for the enclave VM. An enclave does not have persistent storage attached.
+
+The memory regions carved out of the primary VM and given to an enclave need to
+be aligned 2 MiB / 1 GiB physically contiguous memory regions (or multiple of
+this size e.g. 8 MiB). The memory can be allocated e.g. by using hugetlbfs from
+user space [2][3]. The memory size for an enclave needs to be at least 64 MiB.
+The enclave memory and CPUs need to be from the same NUMA node.
+
+An enclave runs on dedicated cores. CPU 0 and its CPU siblings need to remain
+available for the primary VM. A CPU pool has to be set for NE purposes by a
+user with admin capability. See the cpu list section from the kernel
+documentation [4] for how a CPU pool format looks.
+
+An enclave communicates with the primary VM via a local communication channel,
+using virtio-vsock [5]. The primary VM has a virtio-pci vsock emulated device,
+while the enclave VM has a virtio-mmio vsock emulated device. The vsock device
+uses eventfd for signaling. The enclave VM sees the usual interfaces - local
+APIC and IOAPIC - to get interrupts from virtio-vsock device. The virtio-mmio
+device is placed in memory below the typical 4 GiB.
+
+The application that runs in the enclave needs to be packaged in an enclave
+image together with the OS (e.g. kernel, ramdisk, init) that will run in the
+enclave VM. The enclave VM has its own kernel and follows the standard Linux
+boot protocol.
+
+The kernel bzImage, the kernel command line, the ramdisk(s) are part of the
+Enclave Image Format (EIF); plus an EIF header including metadata such as magic
+number, eif version, image size and CRC.
+
+Hash values are computed for the entire enclave image (EIF), the kernel and
+ramdisk(s). That's used, for example, to check that the enclave image that is
+loaded in the enclave VM is the one that was intended to be run.
+
+These crypto measurements are included in a signed attestation document
+generated by the Nitro Hypervisor and further used to prove the identity of the
+enclave; KMS is an example of a service that NE is integrated with and that checks
+the attestation doc.
+
+The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
+init process in the enclave connects to the vsock CID of the primary VM and a
+predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
+used to check in the primary VM that the enclave has booted.
+
+If the enclave VM crashes or gracefully exits, an interrupt event is received by
+the NE driver. This event is sent further to the user space enclave process
+running in the primary VM via a poll notification mechanism. Then the user space
+enclave process can exit.
+
+[1] https://aws.amazon.com/ec2/nitro/nitro-enclaves/
+[2] https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
+[3] https://lwn.net/Articles/807108/
+[4] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
+[5] https://man7.org/linux/man-pages/man7/vsock.7.html
--
2.20.1 (Apple Git-117)
After all the enclave resources are set, the enclave is ready to start
running.
Add ioctl command logic for starting an enclave after all its resources,
memory regions and CPUs, have been set.
The enclave start information includes the local channel addressing -
vsock CID - and the flags associated with the enclave.
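The preconditions checked before the start request is sent to the device can be modeled in user space as below. This is a sketch, not the driver code: the sketch_* names are hypothetical, CPU sets are reduced to 64-bit masks instead of cpumasks, and the 64 MiB minimum is taken from the NE overview document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64 MiB, the minimum enclave memory size named in the NE overview. */
#define SKETCH_MIN_ENCLAVE_MEM_SIZE (64ULL * 1024 * 1024)

/* Hypothetical result codes standing in for the NE_ERR_* values. */
enum sketch_start_err {
	SKETCH_START_OK = 0,
	SKETCH_ERR_NO_MEM_REGIONS,
	SKETCH_ERR_MEM_MIN_SIZE,
	SKETCH_ERR_NO_VCPUS,
	SKETCH_ERR_FULL_CORES_NOT_USED,
};

/* Simplified enclave bookkeeping; bit n of a mask stands for CPU n. */
struct sketch_enclave {
	unsigned int nr_mem_regions;
	uint64_t mem_size;
	unsigned int nr_vcpus;
	uint64_t vcpu_ids;             /* CPUs added as enclave vCPUs */
	const uint64_t *claimed_cores; /* per claimed core: sibling threads */
	size_t nr_claimed_cores;
};

/* Same order of checks as in the start path of the patch. */
static enum sketch_start_err sketch_check_start(const struct sketch_enclave *e)
{
	size_t i;

	if (!e->nr_mem_regions)
		return SKETCH_ERR_NO_MEM_REGIONS;

	if (e->mem_size < SKETCH_MIN_ENCLAVE_MEM_SIZE)
		return SKETCH_ERR_MEM_MIN_SIZE;

	if (!e->nr_vcpus)
		return SKETCH_ERR_NO_VCPUS;

	/* Every sibling thread of a claimed core has to be an enclave vCPU. */
	for (i = 0; i < e->nr_claimed_cores; i++)
		if (e->claimed_cores[i] & ~e->vcpu_ids)
			return SKETCH_ERR_FULL_CORES_NOT_USED;

	return SKETCH_START_OK;
}
```

With two claimed cores made of CPUs {2, 3} and {4, 5}, adding only CPUs 2, 3 and 4 leaves CPU 5 outside the enclave, so the model rejects the start in the same way the driver rejects partially used cores.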
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Update the naming and add more comments to clarify the logic of
handling full CPU cores and dedicating them to the enclave.
v5 -> v6
* Check for invalid enclave start flags.
* Update documentation to kernel-doc format.
v4 -> v5
* Add early exit on enclave start ioctl function call error.
* Move sanity checks in the enclave start ioctl function, outside of the
switch-case block.
* Remove log on copy_from_user() / copy_to_user() failure.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Update the naming for the ioctl command from metadata to info.
* Check for minimum enclave memory size.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
v1 -> v2
* Add log pattern for NE.
* Check if enclave state is init when starting an enclave.
* Remove the BUG_ON calls.
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 109 ++++++++++++++++++++++
1 file changed, 109 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index 3d8a771bde1d..be81ff5634af 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -957,6 +957,77 @@ static int ne_set_user_memory_region_ioctl(struct ne_enclave *ne_enclave,
return rc;
}
+/**
+ * ne_start_enclave_ioctl() - Trigger enclave start after the enclave resources,
+ * such as memory and CPU, have been set.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @enclave_start_info : Enclave info that includes enclave cid and flags.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_start_enclave_ioctl(struct ne_enclave *ne_enclave,
+ struct ne_enclave_start_info *enclave_start_info)
+{
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ unsigned int cpu = 0;
+ struct enclave_start_req enclave_start_req = {};
+ unsigned int i = 0;
+ int rc = -EINVAL;
+
+ if (!ne_enclave->nr_mem_regions) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave has no mem regions\n");
+
+ return -NE_ERR_NO_MEM_REGIONS_ADDED;
+ }
+
+ if (ne_enclave->mem_size < NE_MIN_ENCLAVE_MEM_SIZE) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave memory is less than %ld\n",
+ NE_MIN_ENCLAVE_MEM_SIZE);
+
+ return -NE_ERR_ENCLAVE_MEM_MIN_SIZE;
+ }
+
+ if (!ne_enclave->nr_vcpus) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave has no vCPUs\n");
+
+ return -NE_ERR_NO_VCPUS_ADDED;
+ }
+
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++)
+ for_each_cpu(cpu, ne_enclave->threads_per_core[i])
+ if (!cpumask_test_cpu(cpu, ne_enclave->vcpu_ids)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Full CPU cores not used\n");
+
+ return -NE_ERR_FULL_CORES_NOT_USED;
+ }
+
+ enclave_start_req.enclave_cid = enclave_start_info->enclave_cid;
+ enclave_start_req.flags = enclave_start_info->flags;
+ enclave_start_req.slot_uid = ne_enclave->slot_uid;
+
+ rc = ne_do_request(ne_enclave->pdev, ENCLAVE_START, &enclave_start_req,
+ sizeof(enclave_start_req), &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in enclave start [rc=%d]\n", rc);
+
+ return rc;
+ }
+
+ ne_enclave->state = NE_STATE_RUNNING;
+
+ enclave_start_info->enclave_cid = cmd_reply.enclave_cid;
+
+ return 0;
+}
+
/**
* ne_enclave_ioctl() - Ioctl function provided by the enclave file.
* @file: File associated with this ioctl function.
@@ -1105,6 +1176,44 @@ static long ne_enclave_ioctl(struct file *file, unsigned int cmd, unsigned long
return 0;
}
+ case NE_START_ENCLAVE: {
+ struct ne_enclave_start_info enclave_start_info = {};
+ int rc = -EINVAL;
+
+ if (copy_from_user(&enclave_start_info, (void __user *)arg,
+ sizeof(enclave_start_info)))
+ return -EFAULT;
+
+ if (enclave_start_info.flags >= NE_ENCLAVE_START_MAX_FLAG_VAL)
+ return -EINVAL;
+
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ if (ne_enclave->state != NE_STATE_INIT) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave is not in init state\n");
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return -NE_ERR_NOT_IN_INIT_STATE;
+ }
+
+ rc = ne_start_enclave_ioctl(ne_enclave, &enclave_start_info);
+ if (rc < 0) {
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return rc;
+ }
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ if (copy_to_user((void __user *)arg, &enclave_start_info,
+ sizeof(enclave_start_info)))
+ return -EFAULT;
+
+ return 0;
+ }
+
default:
return -ENOTTY;
}
--
2.20.1 (Apple Git-117)
Another resource that is being set for an enclave is memory. User space
memory regions, that need to be backed by contiguous memory regions,
are associated with the enclave.
One solution for allocating / reserving contiguous memory regions, the one
used for integration, is hugetlbfs. The user space process that is
associated with the enclave passes these memory regions to the driver.
The enclave memory regions need to be from the same NUMA node as the
enclave CPUs.
Add ioctl command logic for setting user space memory region for an
enclave.
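The sanity checks applied to a user space memory region (size multiple of 2 MiB, 2 MiB aligned address, no overlap with regions already set) can be sketched in user space C as follows. The sketch_* names and result codes are hypothetical; the overlap test is written in the same two-clause form as the check in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIB                 (1024ULL * 1024)
#define SKETCH_MIN_MEM_REGION_SIZE (2 * SKETCH_MIB) /* 2 MiB */

/* Hypothetical result codes standing in for the NE_ERR_* values. */
enum sketch_region_err {
	SKETCH_REGION_OK = 0,
	SKETCH_REGION_BAD_SIZE,
	SKETCH_REGION_UNALIGNED_ADDR,
	SKETCH_REGION_ALREADY_USED,
};

struct sketch_mem_region {
	uint64_t userspace_addr;
	uint64_t memory_size;
};

/* Two-clause overlap test: candidate starts inside old, or reaches into it. */
static bool sketch_overlaps(const struct sketch_mem_region *old,
			    const struct sketch_mem_region *cand)
{
	return (old->userspace_addr <= cand->userspace_addr &&
		cand->userspace_addr < old->userspace_addr + old->memory_size) ||
	       (cand->userspace_addr <= old->userspace_addr &&
		cand->userspace_addr + cand->memory_size > old->userspace_addr);
}

/* Check a candidate region against the regions already set. */
static enum sketch_region_err
sketch_check_region(const struct sketch_mem_region *existing, size_t nr_existing,
		    const struct sketch_mem_region *cand)
{
	size_t i;

	if (cand->memory_size & (SKETCH_MIN_MEM_REGION_SIZE - 1))
		return SKETCH_REGION_BAD_SIZE;

	if (cand->userspace_addr & (SKETCH_MIN_MEM_REGION_SIZE - 1))
		return SKETCH_REGION_UNALIGNED_ADDR;

	for (i = 0; i < nr_existing; i++)
		if (sketch_overlaps(&existing[i], cand))
			return SKETCH_REGION_ALREADY_USED;

	return SKETCH_REGION_OK;
}
```

Adjacent regions are accepted, since the end address of a region is exclusive in both clauses. The driver additionally checks the address range with access_ok() and verifies the backing pages (hugetlbfs, NUMA node), which this model leaves out.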
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Update check for duplicate user space memory regions to cover
additional possible scenarios.
v5 -> v6
* Check for max number of pages allocated for the internal data
structure for pages.
* Check for invalid memory region flags.
* Check for aligned physical memory regions.
* Update documentation to kernel-doc format.
* Check for duplicate user space memory regions.
* Use directly put_page() instead of unpin_user_pages(), to match the
get_user_pages() calls.
v4 -> v5
* Add early exit on set memory region ioctl function call error.
* Remove log on copy_from_user() failure.
* Exit without unpinning the pages on NE PCI dev request failure as
memory regions from the user space range may have already been added.
* Add check for the memory region user space address to be 2 MiB
aligned.
* Update logic to not have a hardcoded check for 2 MiB memory regions.
v3 -> v4
* Check enclave memory regions are from the same NUMA node as the
enclave CPUs.
* Use dev_err instead of custom NE log pattern.
* Update the NE ioctl call to match the decoupling from the KVM API.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Update kzfree() calls to kfree().
v1 -> v2
* Add log pattern for NE.
* Update goto labels to match their purpose.
* Remove the BUG_ON calls.
* Check if enclave max memory regions is reached when setting an enclave
memory region.
* Check if enclave state is init when setting an enclave memory region.
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 287 ++++++++++++++++++++++
1 file changed, 287 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index 810c4bba424f..3d8a771bde1d 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -703,6 +703,260 @@ static int ne_add_vcpu_ioctl(struct ne_enclave *ne_enclave, u32 vcpu_id)
return 0;
}
+/**
+ * ne_sanity_check_user_mem_region() - Sanity check the user space memory
+ * region received during the set user
+ * memory region ioctl call.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @mem_region : User space memory region to be sanity checked.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_sanity_check_user_mem_region(struct ne_enclave *ne_enclave,
+ struct ne_user_memory_region mem_region)
+{
+ struct ne_mem_region *ne_mem_region = NULL;
+
+ if (ne_enclave->mm != current->mm)
+ return -EIO;
+
+ if (mem_region.memory_size & (NE_MIN_MEM_REGION_SIZE - 1)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "User space memory size is not multiple of 2 MiB\n");
+
+ return -NE_ERR_INVALID_MEM_REGION_SIZE;
+ }
+
+ if (!IS_ALIGNED(mem_region.userspace_addr, NE_MIN_MEM_REGION_SIZE)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "User space address is not 2 MiB aligned\n");
+
+ return -NE_ERR_UNALIGNED_MEM_REGION_ADDR;
+ }
+
+ if ((mem_region.userspace_addr & (NE_MIN_MEM_REGION_SIZE - 1)) ||
+ !access_ok((void __user *)(unsigned long)mem_region.userspace_addr,
+ mem_region.memory_size)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Invalid user space address range\n");
+
+ return -NE_ERR_INVALID_MEM_REGION_ADDR;
+ }
+
+ list_for_each_entry(ne_mem_region, &ne_enclave->mem_regions_list,
+ mem_region_list_entry) {
+ u64 memory_size = ne_mem_region->memory_size;
+ u64 userspace_addr = ne_mem_region->userspace_addr;
+
+ if ((userspace_addr <= mem_region.userspace_addr &&
+ mem_region.userspace_addr < (userspace_addr + memory_size)) ||
+ (mem_region.userspace_addr <= userspace_addr &&
+ (mem_region.userspace_addr + mem_region.memory_size) > userspace_addr)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "User space memory region already used\n");
+
+ return -NE_ERR_MEM_REGION_ALREADY_USED;
+ }
+ }
+
+ return 0;
+}
+
+/**
+ * ne_set_user_memory_region_ioctl() - Add user space memory region to the slot
+ * associated with the current enclave.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @mem_region : User space memory region to be associated with the given slot.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_set_user_memory_region_ioctl(struct ne_enclave *ne_enclave,
+ struct ne_user_memory_region mem_region)
+{
+ long gup_rc = 0;
+ unsigned long i = 0;
+ unsigned long max_nr_pages = 0;
+ unsigned long memory_size = 0;
+ struct ne_mem_region *ne_mem_region = NULL;
+ unsigned long nr_phys_contig_mem_regions = 0;
+ struct page **phys_contig_mem_regions = NULL;
+ int rc = -EINVAL;
+
+ rc = ne_sanity_check_user_mem_region(ne_enclave, mem_region);
+ if (rc < 0)
+ return rc;
+
+ ne_mem_region = kzalloc(sizeof(*ne_mem_region), GFP_KERNEL);
+ if (!ne_mem_region)
+ return -ENOMEM;
+
+ max_nr_pages = mem_region.memory_size / NE_MIN_MEM_REGION_SIZE;
+
+ ne_mem_region->pages = kcalloc(max_nr_pages, sizeof(*ne_mem_region->pages),
+ GFP_KERNEL);
+ if (!ne_mem_region->pages) {
+ rc = -ENOMEM;
+
+ goto free_mem_region;
+ }
+
+ phys_contig_mem_regions = kcalloc(max_nr_pages, sizeof(*phys_contig_mem_regions),
+ GFP_KERNEL);
+ if (!phys_contig_mem_regions) {
+ rc = -ENOMEM;
+
+ goto free_mem_region;
+ }
+
+ do {
+ i = ne_mem_region->nr_pages;
+
+ if (i == max_nr_pages) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Reached max nr of pages in the pages data struct\n");
+
+ rc = -ENOMEM;
+
+ goto put_pages;
+ }
+
+ gup_rc = get_user_pages(mem_region.userspace_addr + memory_size, 1, FOLL_GET,
+ ne_mem_region->pages + i, NULL);
+ if (gup_rc < 0) {
+ rc = gup_rc;
+
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in get user pages [rc=%d]\n", rc);
+
+ goto put_pages;
+ }
+
+ if (!PageHuge(ne_mem_region->pages[i])) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Not a hugetlbfs page\n");
+
+ rc = -NE_ERR_MEM_NOT_HUGE_PAGE;
+
+ goto put_pages;
+ }
+
+ if (ne_enclave->numa_node != page_to_nid(ne_mem_region->pages[i])) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Page is not from NUMA node %d\n",
+ ne_enclave->numa_node);
+
+ rc = -NE_ERR_MEM_DIFFERENT_NUMA_NODE;
+
+ goto put_pages;
+ }
+
+ /*
+ * TODO: Update once non-contiguous memory regions received from
+ * user space, or contiguous physical memory regions larger than
+ * 2 MiB (e.g. 8 MiB), are handled.
+ */
+ phys_contig_mem_regions[i] = ne_mem_region->pages[i];
+
+ memory_size += page_size(ne_mem_region->pages[i]);
+
+ ne_mem_region->nr_pages++;
+ } while (memory_size < mem_region.memory_size);
+
+ /*
+ * TODO: Update once non-contiguous memory regions received from
+ * user space, or contiguous physical memory regions larger than
+ * 2 MiB (e.g. 8 MiB), are handled.
+ */
+ nr_phys_contig_mem_regions = ne_mem_region->nr_pages;
+
+ if ((ne_enclave->nr_mem_regions + nr_phys_contig_mem_regions) >
+ ne_enclave->max_mem_regions) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Reached max memory regions %lld\n",
+ ne_enclave->max_mem_regions);
+
+ rc = -NE_ERR_MEM_MAX_REGIONS;
+
+ goto put_pages;
+ }
+
+ for (i = 0; i < nr_phys_contig_mem_regions; i++) {
+ u64 phys_region_addr = page_to_phys(phys_contig_mem_regions[i]);
+ u64 phys_region_size = page_size(phys_contig_mem_regions[i]);
+
+ if (phys_region_size & (NE_MIN_MEM_REGION_SIZE - 1)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Physical mem region size is not multiple of 2 MiB\n");
+
+ rc = -EINVAL;
+
+ goto put_pages;
+ }
+
+ if (!IS_ALIGNED(phys_region_addr, NE_MIN_MEM_REGION_SIZE)) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Physical mem region address is not 2 MiB aligned\n");
+
+ rc = -EINVAL;
+
+ goto put_pages;
+ }
+ }
+
+ ne_mem_region->memory_size = mem_region.memory_size;
+ ne_mem_region->userspace_addr = mem_region.userspace_addr;
+
+ list_add(&ne_mem_region->mem_region_list_entry, &ne_enclave->mem_regions_list);
+
+ for (i = 0; i < nr_phys_contig_mem_regions; i++) {
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ struct slot_add_mem_req slot_add_mem_req = {};
+
+ slot_add_mem_req.slot_uid = ne_enclave->slot_uid;
+ slot_add_mem_req.paddr = page_to_phys(phys_contig_mem_regions[i]);
+ slot_add_mem_req.size = page_size(phys_contig_mem_regions[i]);
+
+ rc = ne_do_request(ne_enclave->pdev, SLOT_ADD_MEM,
+ &slot_add_mem_req, sizeof(slot_add_mem_req),
+ &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in slot add mem [rc=%d]\n", rc);
+
+ kfree(phys_contig_mem_regions);
+
+ /*
+ * Exit here without putting the pages, as memory regions
+ * may already have been added.
+ */
+ return rc;
+ }
+
+ ne_enclave->mem_size += slot_add_mem_req.size;
+ ne_enclave->nr_mem_regions++;
+ }
+
+ kfree(phys_contig_mem_regions);
+
+ return 0;
+
+put_pages:
+ for (i = 0; i < ne_mem_region->nr_pages; i++)
+ put_page(ne_mem_region->pages[i]);
+free_mem_region:
+ kfree(phys_contig_mem_regions);
+ kfree(ne_mem_region->pages);
+ kfree(ne_mem_region);
+
+ return rc;
+}
+
/**
* ne_enclave_ioctl() - Ioctl function provided by the enclave file.
* @file: File associated with this ioctl function.
@@ -818,6 +1072,39 @@ static long ne_enclave_ioctl(struct file *file, unsigned int cmd, unsigned long
return 0;
}
+ case NE_SET_USER_MEMORY_REGION: {
+ struct ne_user_memory_region mem_region = {};
+ int rc = -EINVAL;
+
+ if (copy_from_user(&mem_region, (void __user *)arg, sizeof(mem_region)))
+ return -EFAULT;
+
+ if (mem_region.flags >= NE_MEMORY_REGION_MAX_FLAG_VAL)
+ return -EINVAL;
+
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ if (ne_enclave->state != NE_STATE_INIT) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Enclave is not in init state\n");
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return -NE_ERR_NOT_IN_INIT_STATE;
+ }
+
+ rc = ne_set_user_memory_region_ioctl(ne_enclave, mem_region);
+ if (rc < 0) {
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return rc;
+ }
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+
+ return 0;
+ }
+
default:
return -ENOTTY;
}
--
2.20.1 (Apple Git-117)
The Nitro Enclaves PCI device is used by the kernel driver as a means of
communication with the hypervisor on the host where the primary VM and
the enclaves run. It handles requests with regard to enclave lifetime.
Set up the PCI device driver and add support for MSI-X interrupts.
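The disable path of the device polls the enable register until the device reports the disabled state or a timeout expires. The user space model below sketches that polling pattern only. The sketch_* names, the register values and the fake device are assumptions for illustration; the 120 s timeout and the 10 ms step are the values used in this patch. Unlike the driver function, which returns void and logs an error, the model returns a result so it can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed register values for illustration. */
#define SKETCH_ENABLE_OFF 0x00
#define SKETCH_ENABLE_ON  0x01

#define SKETCH_TIMEOUT_MSECS 120000u /* NE_DEFAULT_TIMEOUT_MSECS in the patch */
#define SKETCH_SLEEP_MSECS   10u     /* polling step used in the patch */

/* Fake device: after an OFF write it keeps reporting ON for a few reads. */
struct sketch_dev {
	uint8_t enable;               /* value currently visible to reads */
	int off_requested;            /* an OFF write is pending */
	unsigned int reads_until_off; /* reads that still return ON */
};

static void sketch_write_enable(struct sketch_dev *dev, uint8_t val)
{
	if (val == SKETCH_ENABLE_ON) {
		dev->enable = SKETCH_ENABLE_ON;
		dev->off_requested = 0;
	} else {
		dev->off_requested = 1;
	}
}

static uint8_t sketch_read_enable(struct sketch_dev *dev)
{
	if (dev->off_requested) {
		if (dev->reads_until_off == 0)
			dev->enable = SKETCH_ENABLE_OFF;
		else
			dev->reads_until_off--;
	}

	return dev->enable;
}

/* Write OFF, then poll until the device reports OFF or the timeout expires. */
static int sketch_dev_disable(struct sketch_dev *dev)
{
	unsigned int waited = 0;

	sketch_write_enable(dev, SKETCH_ENABLE_OFF);

	while (waited < SKETCH_TIMEOUT_MSECS) {
		if (sketch_read_enable(dev) == SKETCH_ENABLE_OFF)
			return 0;

		/* The driver sleeps here; the model only counts the time. */
		waited += SKETCH_SLEEP_MSECS;
	}

	/* One last read after the timeout, as in the patch. */
	return sketch_read_enable(dev) == SKETCH_ENABLE_OFF ? 0 : -ETIMEDOUT;
}
```

A device that needs a few reads to settle is handled by the loop; one that never settles within the 12000 polling steps ends with a timeout result.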
Signed-off-by: Alexandru-Catalin Vasile <[email protected]>
Signed-off-by: Alexandru Ciobotaru <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* Update documentation to kernel-doc format.
v4 -> v5
* Remove sanity checks for situations that should not happen unless the
system is buggy or the logic is broken.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
* Update NE PCI driver name to "nitro_enclaves".
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
* Remove the WARN_ON calls.
* Remove linux/bug include that is not needed.
* Update static calls sanity checks.
* Remove "ratelimited" from the logs that are not in the ioctl call
paths.
* Update kzfree() calls to kfree().
v1 -> v2
* Add log pattern for NE.
* Update PCI device setup functions to receive PCI device data structure and
then get private data from it inside the functions logic.
* Remove the BUG_ON calls.
* Add teardown function for MSI-X setup.
* Update goto labels to match their purpose.
* Implement TODO for NE PCI device disable state check.
* Update function name for NE PCI device probe / remove.
---
drivers/virt/nitro_enclaves/ne_pci_dev.c | 269 +++++++++++++++++++++++
1 file changed, 269 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/ne_pci_dev.c
diff --git a/drivers/virt/nitro_enclaves/ne_pci_dev.c b/drivers/virt/nitro_enclaves/ne_pci_dev.c
new file mode 100644
index 000000000000..31650dcd592e
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/ne_pci_dev.c
@@ -0,0 +1,269 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+/**
+ * DOC: Nitro Enclaves (NE) PCI device driver.
+ */
+
+#include <linux/delay.h>
+#include <linux/device.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/nitro_enclaves.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/wait.h>
+
+#include "ne_misc_dev.h"
+#include "ne_pci_dev.h"
+
+/**
+ * NE_DEFAULT_TIMEOUT_MSECS - Default timeout to wait for a reply from
+ * the NE PCI device.
+ */
+#define NE_DEFAULT_TIMEOUT_MSECS (120000) /* 120 sec */
+
+static const struct pci_device_id ne_pci_ids[] = {
+ { PCI_DEVICE(PCI_VENDOR_ID_AMAZON, PCI_DEVICE_ID_NE) },
+ { 0, }
+};
+
+MODULE_DEVICE_TABLE(pci, ne_pci_ids);
+
+/**
+ * ne_setup_msix() - Setup MSI-X vectors for the PCI device.
+ * @pdev: PCI device to setup the MSI-X for.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_setup_msix(struct pci_dev *pdev)
+{
+ int nr_vecs = 0;
+ int rc = -EINVAL;
+
+ nr_vecs = pci_msix_vec_count(pdev);
+ if (nr_vecs < 0) {
+ rc = nr_vecs;
+
+ dev_err(&pdev->dev, "Error in getting vec count [rc=%d]\n", rc);
+
+ return rc;
+ }
+
+ rc = pci_alloc_irq_vectors(pdev, nr_vecs, nr_vecs, PCI_IRQ_MSIX);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in alloc MSI-X vecs [rc=%d]\n", rc);
+
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_teardown_msix() - Teardown MSI-X vectors for the PCI device.
+ * @pdev: PCI device to teardown the MSI-X for.
+ *
+ * Context: Process context.
+ */
+static void ne_teardown_msix(struct pci_dev *pdev)
+{
+ pci_free_irq_vectors(pdev);
+}
+
+/**
+ * ne_pci_dev_enable() - Select the PCI device version and enable it.
+ * @pdev: PCI device to select version for and then enable.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_pci_dev_enable(struct pci_dev *pdev)
+{
+ u8 dev_enable_reply = 0;
+ u16 dev_version_reply = 0;
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+
+ iowrite16(NE_VERSION_MAX, ne_pci_dev->iomem_base + NE_VERSION);
+
+ dev_version_reply = ioread16(ne_pci_dev->iomem_base + NE_VERSION);
+ if (dev_version_reply != NE_VERSION_MAX) {
+ dev_err(&pdev->dev, "Error in pci dev version cmd\n");
+
+ return -EIO;
+ }
+
+ iowrite8(NE_ENABLE_ON, ne_pci_dev->iomem_base + NE_ENABLE);
+
+ dev_enable_reply = ioread8(ne_pci_dev->iomem_base + NE_ENABLE);
+ if (dev_enable_reply != NE_ENABLE_ON) {
+ dev_err(&pdev->dev, "Error in pci dev enable cmd\n");
+
+ return -EIO;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_pci_dev_disable() - Disable the PCI device.
+ * @pdev: PCI device to disable.
+ *
+ * Context: Process context.
+ */
+static void ne_pci_dev_disable(struct pci_dev *pdev)
+{
+ u8 dev_disable_reply = 0;
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+ const unsigned int sleep_time = 10; /* 10 ms */
+ unsigned int sleep_time_count = 0;
+
+ iowrite8(NE_ENABLE_OFF, ne_pci_dev->iomem_base + NE_ENABLE);
+
+ /*
+ * Check for NE_ENABLE_OFF in a loop, to handle cases when the device
+ * state is not immediately set to disabled and going through a
+ * transitory state of disabling.
+ */
+ while (sleep_time_count < NE_DEFAULT_TIMEOUT_MSECS) {
+ dev_disable_reply = ioread8(ne_pci_dev->iomem_base + NE_ENABLE);
+ if (dev_disable_reply == NE_ENABLE_OFF)
+ return;
+
+ msleep_interruptible(sleep_time);
+ sleep_time_count += sleep_time;
+ }
+
+ dev_disable_reply = ioread8(ne_pci_dev->iomem_base + NE_ENABLE);
+ if (dev_disable_reply != NE_ENABLE_OFF)
+ dev_err(&pdev->dev, "Error in pci dev disable cmd\n");
+}
+
+/**
+ * ne_pci_probe() - Probe function for the NE PCI device.
+ * @pdev: PCI device to match with the NE PCI driver.
+ * @id : PCI device id table associated with the NE PCI driver.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+ struct ne_pci_dev *ne_pci_dev = NULL;
+ int rc = -EINVAL;
+
+ ne_pci_dev = kzalloc(sizeof(*ne_pci_dev), GFP_KERNEL);
+ if (!ne_pci_dev)
+ return -ENOMEM;
+
+ rc = pci_enable_device(pdev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in pci dev enable [rc=%d]\n", rc);
+
+ goto free_ne_pci_dev;
+ }
+
+ rc = pci_request_regions_exclusive(pdev, "nitro_enclaves");
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in pci request regions [rc=%d]\n", rc);
+
+ goto disable_pci_dev;
+ }
+
+ ne_pci_dev->iomem_base = pci_iomap(pdev, PCI_BAR_NE, 0);
+ if (!ne_pci_dev->iomem_base) {
+ rc = -ENOMEM;
+
+ dev_err(&pdev->dev, "Error in pci iomap [rc=%d]\n", rc);
+
+ goto release_pci_regions;
+ }
+
+ pci_set_drvdata(pdev, ne_pci_dev);
+
+ rc = ne_setup_msix(pdev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in pci dev msix setup [rc=%d]\n", rc);
+
+ goto iounmap_pci_bar;
+ }
+
+ ne_pci_dev_disable(pdev);
+
+ rc = ne_pci_dev_enable(pdev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Error in ne_pci_dev enable [rc=%d]\n", rc);
+
+ goto teardown_msix;
+ }
+
+ atomic_set(&ne_pci_dev->cmd_reply_avail, 0);
+ init_waitqueue_head(&ne_pci_dev->cmd_reply_wait_q);
+ INIT_LIST_HEAD(&ne_pci_dev->enclaves_list);
+ mutex_init(&ne_pci_dev->enclaves_list_mutex);
+ mutex_init(&ne_pci_dev->pci_dev_mutex);
+ ne_pci_dev->pdev = pdev;
+
+ return 0;
+
+teardown_msix:
+ ne_teardown_msix(pdev);
+iounmap_pci_bar:
+ pci_set_drvdata(pdev, NULL);
+ pci_iounmap(pdev, ne_pci_dev->iomem_base);
+release_pci_regions:
+ pci_release_regions(pdev);
+disable_pci_dev:
+ pci_disable_device(pdev);
+free_ne_pci_dev:
+ kfree(ne_pci_dev);
+
+ return rc;
+}
+
+/**
+ * ne_pci_remove() - Remove function for the NE PCI device.
+ * @pdev: PCI device associated with the NE PCI driver.
+ *
+ * Context: Process context.
+ */
+static void ne_pci_remove(struct pci_dev *pdev)
+{
+ struct ne_pci_dev *ne_pci_dev = pci_get_drvdata(pdev);
+
+ ne_pci_dev_disable(pdev);
+
+ ne_teardown_msix(pdev);
+
+ pci_set_drvdata(pdev, NULL);
+
+ pci_iounmap(pdev, ne_pci_dev->iomem_base);
+
+ pci_release_regions(pdev);
+
+ pci_disable_device(pdev);
+
+ kfree(ne_pci_dev);
+}
+
+/*
+ * TODO: Add suspend / resume functions for power management w/ CONFIG_PM, if
+ * needed.
+ */
+/* NE PCI device driver. */
+struct pci_driver ne_pci_driver = {
+ .name = "nitro_enclaves",
+ .id_table = ne_pci_ids,
+ .probe = ne_pci_probe,
+ .remove = ne_pci_remove,
+};
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* No changes.
v4 -> v5
* No changes.
v3 -> v4
* No changes.
v2 -> v3
* Update file entries to be in alphabetical order.
v1 -> v2
* No changes.
---
MAINTAINERS | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index deaafb617361..06247ca41e5e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12268,6 +12268,19 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2.git
F: arch/nios2/
+NITRO ENCLAVES (NE)
+M: Andra Paraschiv <[email protected]>
+M: Alexandru Vasile <[email protected]>
+M: Alexandru Ciobotaru <[email protected]>
+L: [email protected]
+S: Supported
+W: https://aws.amazon.com/ec2/nitro/nitro-enclaves/
+F: Documentation/nitro_enclaves/
+F: drivers/virt/nitro_enclaves/
+F: include/linux/nitro_enclaves.h
+F: include/uapi/linux/nitro_enclaves.h
+F: samples/nitro_enclaves/
+
NOHZ, DYNTICKS SUPPORT
M: Frederic Weisbecker <[email protected]>
M: Thomas Gleixner <[email protected]>
--
2.20.1 (Apple Git-117)
An enclave is associated with an fd that is returned after the enclave
creation logic is completed. This enclave fd is further used to set up
enclave resources. Once the enclave needs to be terminated, the enclave
fd is closed.
Add logic for enclave termination, that is mapped to the enclave fd
release callback. Free the internal enclave info used for bookkeeping.
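
Since termination is tied to the release callback, it is triggered when the
last reference to the enclave file goes away, which is not necessarily the
first close() call if the fd was duplicated or inherited. The following is a
minimal user space sketch of that behaviour only; it uses a pipe as a stand-in
for the enclave fd and does not touch the NE driver. The hup_seen() and
last_close_demo() names are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Illustration only: returns 1 if the read end of the pipe currently reports
 * POLLHUP, 0 if it does not, -1 on poll() error.
 */
static int hup_seen(int read_fd)
{
	struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
	int rc;

	assert(read_fd >= 0);

	rc = poll(&pfd, 1, 0);
	if (rc < 0)
		return -1;

	return (pfd.revents & POLLHUP) ? 1 : 0;
}

/*
 * Returns 0 if the hang-up is reported only after the last reference to the
 * write end is closed, -1 otherwise.
 */
static int last_close_demo(void)
{
	int fds[2];
	int dup_fd = -1;
	int before = -1;
	int after = -1;

	if (pipe(fds) < 0)
		return -1;

	dup_fd = dup(fds[1]);
	if (dup_fd < 0) {
		close(fds[0]);
		close(fds[1]);

		return -1;
	}

	/* First close: the duplicated fd still holds a reference. */
	close(fds[1]);
	before = hup_seen(fds[0]);

	/* Last close: only now is the underlying file released. */
	close(dup_fd);
	after = hup_seen(fds[0]);

	close(fds[0]);

	return (before == 0 && after == 1) ? 0 : -1;
}
```

Under this assumption, a process that wants the enclave to be terminated has
to make sure no duplicated or inherited copy of the enclave fd stays open.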
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* Remove the pci_dev_put() call as the NE misc device parent field is
used now to get the NE PCI device.
* Update the naming and add more comments to make the logic of handling
full CPU cores and dedicating them to the enclave clearer.
v5 -> v6
* Update documentation to kernel-doc format.
* Use directly put_page() instead of unpin_user_pages(), to match the
get_user_pages() calls.
v4 -> v5
* Release the reference to the NE PCI device on enclave fd release.
* Adapt the logic to cpumask enclave vCPU ids and CPU cores.
* Remove sanity checks for situations that shouldn't happen unless the
system is buggy or the logic is broken.
v3 -> v4
* Use dev_err instead of custom NE log pattern.
v2 -> v3
* Remove the WARN_ON calls.
* Update static calls sanity checks.
* Update kzfree() calls to kfree().
v1 -> v2
* Add log pattern for NE.
* Remove the BUG_ON calls.
* Update goto labels to match their purpose.
* Add early exit in release() if there was a slot alloc error in the fd
creation path.
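
For reference, the goto label convention mentioned in the changelog (each label
named after the cleanup step it starts, with steps undone in reverse order of
setup) can be sketched in plain user space C as below. This is an illustration
of the pattern only, not the driver code; setup_demo(), log_step() and
teardown_log are hypothetical names, and the "resources" are just letters
recorded in a buffer.

```c
#include <assert.h>
#include <string.h>

/* Records the order in which teardown steps ran, e.g. "ba". */
static char teardown_log[8];

static void log_step(char step)
{
	size_t len = strlen(teardown_log);

	/* The buffer is zeroed before use, so it stays NUL-terminated. */
	if (len + 1 < sizeof(teardown_log))
		teardown_log[len] = step;
}

/*
 * fail_at: 0 means no failure, 1..3 means the setup step with that number
 * fails. Resource "a" is acquired by step 1, resource "b" by step 2.
 */
static int setup_demo(int fail_at)
{
	int rc = -1;

	assert(fail_at >= 0 && fail_at <= 3);

	memset(teardown_log, 0, sizeof(teardown_log));

	if (fail_at == 1)
		goto out;

	/* Resource "a" is held from here on. */
	if (fail_at == 2)
		goto release_a;

	/* Resource "b" is held from here on. */
	if (fail_at == 3)
		goto release_b;

	return 0;

release_b:
	log_step('b');
release_a:
	log_step('a');
out:
	return rc;
}
```

With this layout, a failure in step 3 undoes "b" and then "a", a failure in
step 2 undoes only "a", and a failure in step 1 has nothing to undo.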
---
drivers/virt/nitro_enclaves/ne_misc_dev.c | 166 ++++++++++++++++++++++
1 file changed, 166 insertions(+)
diff --git a/drivers/virt/nitro_enclaves/ne_misc_dev.c b/drivers/virt/nitro_enclaves/ne_misc_dev.c
index be81ff5634af..787428390d94 100644
--- a/drivers/virt/nitro_enclaves/ne_misc_dev.c
+++ b/drivers/virt/nitro_enclaves/ne_misc_dev.c
@@ -1221,6 +1221,171 @@ static long ne_enclave_ioctl(struct file *file, unsigned int cmd, unsigned long
return 0;
}
+/**
+ * ne_enclave_remove_all_mem_region_entries() - Remove all memory region entries
+ * from the enclave data structure.
+ * @ne_enclave : Private data associated with the current enclave.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ */
+static void ne_enclave_remove_all_mem_region_entries(struct ne_enclave *ne_enclave)
+{
+ unsigned long i = 0;
+ struct ne_mem_region *ne_mem_region = NULL;
+ struct ne_mem_region *ne_mem_region_tmp = NULL;
+
+ list_for_each_entry_safe(ne_mem_region, ne_mem_region_tmp,
+ &ne_enclave->mem_regions_list,
+ mem_region_list_entry) {
+ list_del(&ne_mem_region->mem_region_list_entry);
+
+ for (i = 0; i < ne_mem_region->nr_pages; i++)
+ put_page(ne_mem_region->pages[i]);
+
+ kfree(ne_mem_region->pages);
+
+ kfree(ne_mem_region);
+ }
+}
+
+/**
+ * ne_enclave_remove_all_vcpu_id_entries() - Remove all vCPU id entries from
+ * the enclave data structure.
+ * @ne_enclave : Private data associated with the current enclave.
+ *
+ * Context: Process context. This function is called with the ne_enclave mutex held.
+ */
+static void ne_enclave_remove_all_vcpu_id_entries(struct ne_enclave *ne_enclave)
+{
+ unsigned int cpu = 0;
+ unsigned int i = 0;
+
+ mutex_lock(&ne_cpu_pool.mutex);
+
+ for (i = 0; i < ne_enclave->nr_parent_vm_cores; i++) {
+ for_each_cpu(cpu, ne_enclave->threads_per_core[i])
+ /* Update the available NE CPU pool. */
+ cpumask_set_cpu(cpu, ne_cpu_pool.avail_threads_per_core[i]);
+
+ free_cpumask_var(ne_enclave->threads_per_core[i]);
+ }
+
+ mutex_unlock(&ne_cpu_pool.mutex);
+
+ kfree(ne_enclave->threads_per_core);
+
+ free_cpumask_var(ne_enclave->vcpu_ids);
+}
+
+/**
+ * ne_pci_dev_remove_enclave_entry() - Remove the enclave entry from the data
+ * structure that is part of the NE PCI
+ * device private data.
+ * @ne_enclave : Private data associated with the current enclave.
+ * @ne_pci_dev : Private data associated with the PCI device.
+ *
+ * Context: Process context. This function is called with the ne_pci_dev enclave
+ * mutex held.
+ */
+static void ne_pci_dev_remove_enclave_entry(struct ne_enclave *ne_enclave,
+ struct ne_pci_dev *ne_pci_dev)
+{
+ struct ne_enclave *ne_enclave_entry = NULL;
+ struct ne_enclave *ne_enclave_entry_tmp = NULL;
+
+ list_for_each_entry_safe(ne_enclave_entry, ne_enclave_entry_tmp,
+ &ne_pci_dev->enclaves_list, enclave_list_entry) {
+ if (ne_enclave_entry->slot_uid == ne_enclave->slot_uid) {
+ list_del(&ne_enclave_entry->enclave_list_entry);
+
+ break;
+ }
+ }
+}
+
+/**
+ * ne_enclave_release() - Release function provided by the enclave file.
+ * @inode: Inode associated with this file release function.
+ * @file: File associated with this release function.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_enclave_release(struct inode *inode, struct file *file)
+{
+ struct ne_pci_dev_cmd_reply cmd_reply = {};
+ struct enclave_stop_req enclave_stop_request = {};
+ struct ne_enclave *ne_enclave = file->private_data;
+ struct ne_pci_dev *ne_pci_dev = NULL;
+ int rc = -EINVAL;
+ struct slot_free_req slot_free_req = {};
+
+ if (!ne_enclave)
+ return 0;
+
+ /*
+ * Early exit in case there is an error in the enclave creation logic
+ * and fput() is called on the cleanup path.
+ */
+ if (!ne_enclave->slot_uid)
+ return 0;
+
+ ne_pci_dev = pci_get_drvdata(ne_enclave->pdev);
+
+ /*
+ * Acquire the enclave list mutex before the enclave mutex
+ * in order to avoid deadlocks with @ref ne_event_work_handler.
+ */
+ mutex_lock(&ne_pci_dev->enclaves_list_mutex);
+ mutex_lock(&ne_enclave->enclave_info_mutex);
+
+ if (ne_enclave->state != NE_STATE_INIT && ne_enclave->state != NE_STATE_STOPPED) {
+ enclave_stop_request.slot_uid = ne_enclave->slot_uid;
+
+ rc = ne_do_request(ne_enclave->pdev, ENCLAVE_STOP,
+ &enclave_stop_request, sizeof(enclave_stop_request),
+ &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in enclave stop [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ memset(&cmd_reply, 0, sizeof(cmd_reply));
+ }
+
+ slot_free_req.slot_uid = ne_enclave->slot_uid;
+
+ rc = ne_do_request(ne_enclave->pdev, SLOT_FREE, &slot_free_req, sizeof(slot_free_req),
+ &cmd_reply, sizeof(cmd_reply));
+ if (rc < 0) {
+ dev_err_ratelimited(ne_misc_dev.this_device,
+ "Error in slot free [rc=%d]\n", rc);
+
+ goto unlock_mutex;
+ }
+
+ ne_pci_dev_remove_enclave_entry(ne_enclave, ne_pci_dev);
+ ne_enclave_remove_all_mem_region_entries(ne_enclave);
+ ne_enclave_remove_all_vcpu_id_entries(ne_enclave);
+
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+
+ kfree(ne_enclave);
+
+ return 0;
+
+unlock_mutex:
+ mutex_unlock(&ne_enclave->enclave_info_mutex);
+ mutex_unlock(&ne_pci_dev->enclaves_list_mutex);
+
+ return rc;
+}
+
/**
* ne_enclave_poll() - Poll functionality used for enclave out-of-band events.
* @file: File associated with this poll function.
@@ -1250,6 +1415,7 @@ static const struct file_operations ne_enclave_fops = {
.llseek = noop_llseek,
.poll = ne_enclave_poll,
.unlocked_ioctl = ne_enclave_ioctl,
+ .release = ne_enclave_release,
};
/**
--
2.20.1 (Apple Git-117)
Signed-off-by: Andra Paraschiv <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
---
Changelog
v6 -> v7
* No changes.
v5 -> v6
* No changes.
v4 -> v5
* No changes.
v3 -> v4
* No changes.
v2 -> v3
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* Update path to Makefile to match the drivers/virt/nitro_enclaves
directory.
---
drivers/virt/Makefile | 2 ++
drivers/virt/nitro_enclaves/Makefile | 11 +++++++++++
2 files changed, 13 insertions(+)
create mode 100644 drivers/virt/nitro_enclaves/Makefile
diff --git a/drivers/virt/Makefile b/drivers/virt/Makefile
index fd331247c27a..f28425ce4b39 100644
--- a/drivers/virt/Makefile
+++ b/drivers/virt/Makefile
@@ -5,3 +5,5 @@
obj-$(CONFIG_FSL_HV_MANAGER) += fsl_hypervisor.o
obj-y += vboxguest/
+
+obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves/
diff --git a/drivers/virt/nitro_enclaves/Makefile b/drivers/virt/nitro_enclaves/Makefile
new file mode 100644
index 000000000000..e9f4fcd1591e
--- /dev/null
+++ b/drivers/virt/nitro_enclaves/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+# Enclave lifetime management support for Nitro Enclaves (NE).
+
+obj-$(CONFIG_NITRO_ENCLAVES) += nitro_enclaves.o
+
+nitro_enclaves-y := ne_pci_dev.o ne_misc_dev.o
+
+ccflags-y += -Wall
--
2.20.1 (Apple Git-117)
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Andra Paraschiv <[email protected]>
---
Changelog
v6 -> v7
* Track POLLNVAL as poll event in addition to POLLHUP.
v5 -> v6
* Remove "rc" mentioning when printing errno string.
* Remove the ioctl to query API version.
* Include usage info for NUMA-aware hugetlb configuration.
* Update documentation to kernel-doc format.
* Add logic for enclave image loading.
v4 -> v5
* Print enclave vCPU ids when they are created.
* Update logic to map the modified vCPU ioctl call.
* Add check for the path to the enclave image to be less than PATH_MAX.
* Update the ioctl calls error checking logic to match the NE specific
error codes.
v3 -> v4
* Update usage details to match the updates in v4.
* Update NE ioctl interface usage.
v2 -> v3
* Remove the include directory to use the uapi from the kernel.
* Remove the GPL additional wording as SPDX-License-Identifier is
already in place.
v1 -> v2
* New in v2.
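
The enclave image loading logic in the sample treats the user space memory
regions as one linear range: regions are skipped until the image offset is
reached, then the image is copied chunk by chunk. A reduced, self-contained
sketch of that walk, assuming small stack buffers instead of hugetlbfs
regions, is given below. The demo_region, copy_image_to_regions() and
copy_demo() names are hypothetical and are not part of the sample.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a user space memory region. */
struct demo_region {
	unsigned char *addr;
	size_t size;
};

/*
 * Copy the image into the regions, starting at the given offset in the
 * linear range formed by all regions. Returns the number of bytes written.
 */
static size_t copy_image_to_regions(struct demo_region *regions, size_t nr_regions,
				    size_t offset, const unsigned char *image,
				    size_t image_size)
{
	size_t written = 0;
	size_t i = 0;

	assert(regions && image);

	for (i = 0; i < nr_regions && written < image_size; i++) {
		size_t start = 0;
		size_t room = regions[i].size;
		size_t chunk = 0;

		/* The offset lies beyond this region, skip it entirely. */
		if (offset >= room) {
			offset -= room;

			continue;
		}

		/* The offset falls inside this region (or is already 0). */
		start = offset;
		room -= offset;
		offset = 0;

		chunk = image_size - written;
		if (chunk > room)
			chunk = room;

		memcpy(regions[i].addr + start, image + written, chunk);
		written += chunk;
	}

	return written;
}

/*
 * Returns 0 if a 5 byte image placed at offset 6 lands as expected in three
 * regions of 4 bytes each, -1 otherwise.
 */
static int copy_demo(void)
{
	unsigned char a[4] = { 0 }, b[4] = { 0 }, c[4] = { 0 };
	struct demo_region regions[3] = { { a, 4 }, { b, 4 }, { c, 4 } };
	const unsigned char image[5] = { 1, 2, 3, 4, 5 };

	if (copy_image_to_regions(regions, 3, 6, image, 5) != 5)
		return -1;

	/* Offsets 6..7 are in region b, offsets 8..10 are in region c. */
	if (a[3] != 0 || b[1] != 0 || b[2] != 1 || b[3] != 2)
		return -1;

	if (c[0] != 3 || c[1] != 4 || c[2] != 5 || c[3] != 0)
		return -1;

	return 0;
}
```

A return value smaller than the image size would mean the regions ran out
before the whole image was written, which is why the sample compares the total
enclave memory size with the image size up front.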
---
samples/nitro_enclaves/.gitignore | 2 +
samples/nitro_enclaves/Makefile | 16 +
samples/nitro_enclaves/ne_ioctl_sample.c | 850 +++++++++++++++++++++++
3 files changed, 868 insertions(+)
create mode 100644 samples/nitro_enclaves/.gitignore
create mode 100644 samples/nitro_enclaves/Makefile
create mode 100644 samples/nitro_enclaves/ne_ioctl_sample.c
diff --git a/samples/nitro_enclaves/.gitignore b/samples/nitro_enclaves/.gitignore
new file mode 100644
index 000000000000..827934129c90
--- /dev/null
+++ b/samples/nitro_enclaves/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+ne_ioctl_sample
diff --git a/samples/nitro_enclaves/Makefile b/samples/nitro_enclaves/Makefile
new file mode 100644
index 000000000000..a3ec78fefb52
--- /dev/null
+++ b/samples/nitro_enclaves/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+
+# Enclave lifetime management support for Nitro Enclaves (NE) - ioctl sample
+# usage.
+
+.PHONY: all clean
+
+CFLAGS += -Wall
+
+all:
+ $(CC) $(CFLAGS) -o ne_ioctl_sample ne_ioctl_sample.c -lpthread
+
+clean:
+ rm -f ne_ioctl_sample
diff --git a/samples/nitro_enclaves/ne_ioctl_sample.c b/samples/nitro_enclaves/ne_ioctl_sample.c
new file mode 100644
index 000000000000..1c4ee3132e11
--- /dev/null
+++ b/samples/nitro_enclaves/ne_ioctl_sample.c
@@ -0,0 +1,850 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ */
+
+/**
+ * DOC: Sample flow of using the ioctl interface provided by the Nitro Enclaves (NE)
+ * kernel driver.
+ *
+ * Usage
+ * -----
+ *
+ * Load the nitro_enclaves module, setting also the enclave CPU pool. The
+ * enclave CPUs need to be full cores from the same NUMA node. CPU 0 and its
+ * siblings have to remain available for the primary / parent VM, so they
+ * cannot be included in the enclave CPU pool.
+ *
+ * See the cpu list section from the kernel documentation.
+ * https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html#cpu-lists
+ *
+ * insmod drivers/virt/nitro_enclaves/nitro_enclaves.ko
+ * lsmod
+ *
+ * The CPU pool can be set at runtime, after the kernel module is loaded.
+ *
+ * echo <cpu-list> > /sys/module/nitro_enclaves/parameters/ne_cpus
+ *
+ * NUMA and CPU siblings information can be found using:
+ *
+ * lscpu
+ * /proc/cpuinfo
+ *
+ * Check the online / offline CPU list. The CPUs from the pool should be
+ * offlined.
+ *
+ * lscpu
+ *
+ * Check dmesg for any warnings / errors through the NE driver lifetime / usage.
+ * The NE logs contain the "nitro_enclaves" or "pci 0000:00:02.0" pattern.
+ *
+ * dmesg
+ *
+ * Set up hugetlbfs huge pages. The memory needs to be from the same NUMA node as
+ * the enclave CPUs.
+ * https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
+ * By default, the allocation of hugetlb pages is distributed on all possible
+ * NUMA nodes. Use the following configuration files to set the number of huge
+ * pages from a NUMA node:
+ *
+ * /sys/devices/system/node/node<X>/hugepages/hugepages-2048kB/nr_hugepages
+ * /sys/devices/system/node/node<X>/hugepages/hugepages-1048576kB/nr_hugepages
+ *
+ * or, if not on a system with multiple NUMA nodes, the number of 2 MiB / 1 GiB
+ * huge pages can also be set using
+ *
+ * /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
+ * /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
+ *
+ * In this example 256 hugepages of 2 MiB are used.
+ *
+ * Build and run the NE sample.
+ *
+ * make -C samples/nitro_enclaves clean
+ * make -C samples/nitro_enclaves
+ * ./samples/nitro_enclaves/ne_ioctl_sample <path_to_enclave_image>
+ *
+ * Unload the nitro_enclaves module.
+ *
+ * rmmod nitro_enclaves
+ * lsmod
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <poll.h>
+#include <pthread.h>
+#include <string.h>
+#include <sys/eventfd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <linux/mman.h>
+#include <linux/nitro_enclaves.h>
+#include <linux/vm_sockets.h>
+
+/**
+ * NE_DEV_NAME - Nitro Enclaves (NE) misc device that provides the ioctl interface.
+ */
+#define NE_DEV_NAME "/dev/nitro_enclaves"
+
+/**
+ * NE_POLL_WAIT_TIME - Timeout in seconds for each poll event.
+ */
+#define NE_POLL_WAIT_TIME (60)
+/**
+ * NE_POLL_WAIT_TIME_MS - Timeout in milliseconds for each poll event.
+ */
+#define NE_POLL_WAIT_TIME_MS (NE_POLL_WAIT_TIME * 1000)
+
+/**
+ * NE_SLEEP_TIME - Amount of time in seconds for the process to keep the enclave alive.
+ */
+#define NE_SLEEP_TIME (300)
+
+/**
+ * NE_DEFAULT_NR_VCPUS - Default number of vCPUs set for an enclave.
+ */
+#define NE_DEFAULT_NR_VCPUS (2)
+
+/**
+ * NE_MIN_MEM_REGION_SIZE - Minimum size of a memory region - 2 MiB.
+ */
+#define NE_MIN_MEM_REGION_SIZE (2 * 1024 * 1024)
+
+/**
+ * NE_DEFAULT_NR_MEM_REGIONS - Default number of memory regions of 2 MiB set for
+ * an enclave.
+ */
+#define NE_DEFAULT_NR_MEM_REGIONS (256)
+
+/**
+ * NE_IMAGE_LOAD_HEARTBEAT_CID - Vsock CID for enclave image loading heartbeat logic.
+ */
+#define NE_IMAGE_LOAD_HEARTBEAT_CID (3)
+/**
+ * NE_IMAGE_LOAD_HEARTBEAT_PORT - Vsock port for enclave image loading heartbeat logic.
+ */
+#define NE_IMAGE_LOAD_HEARTBEAT_PORT (9000)
+/**
+ * NE_IMAGE_LOAD_HEARTBEAT_VALUE - Heartbeat value for enclave image loading.
+ */
+#define NE_IMAGE_LOAD_HEARTBEAT_VALUE (0xb7)
+
+/**
+ * struct ne_user_mem_region - User space memory region set for an enclave.
+ * @userspace_addr: Address of the user space memory region.
+ * @memory_size: Size of the user space memory region.
+ */
+struct ne_user_mem_region {
+ void *userspace_addr;
+ size_t memory_size;
+};
+
+/**
+ * ne_create_vm() - Create a slot for the enclave VM.
+ * @ne_dev_fd: The file descriptor of the NE misc device.
+ * @slot_uid: The generated slot uid for the enclave.
+ * @enclave_fd : The generated file descriptor for the enclave.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_create_vm(int ne_dev_fd, unsigned long *slot_uid, int *enclave_fd)
+{
+ int rc = -EINVAL;
+ *enclave_fd = ioctl(ne_dev_fd, NE_CREATE_VM, slot_uid);
+
+ if (*enclave_fd < 0) {
+ rc = *enclave_fd;
+ switch (errno) {
+ case NE_ERR_NO_CPUS_AVAIL_IN_POOL: {
+ printf("Error in create VM, no CPUs available in the NE CPU pool\n");
+
+ break;
+ }
+
+ default:
+ printf("Error in create VM [%m]\n");
+ }
+
+ return rc;
+ }
+
+ return 0;
+}
+
+
+/**
+ * ne_poll_enclave_fd() - Thread function for polling the enclave fd.
+ * @data: Argument provided for the polling function.
+ *
+ * Context: Process context.
+ * Return:
+ * * NULL on success / failure.
+ */
+void *ne_poll_enclave_fd(void *data)
+{
+ int enclave_fd = *(int *)data;
+ struct pollfd fds[1] = {};
+ int i = 0;
+ int rc = -EINVAL;
+
+ printf("Running from poll thread, enclave fd %d\n", enclave_fd);
+
+ fds[0].fd = enclave_fd;
+ fds[0].events = POLLIN | POLLERR | POLLHUP;
+
+ /* Keep on polling until the current process is terminated. */
+ while (1) {
+ printf("[iter %d] Polling ...\n", i);
+
+ rc = poll(fds, 1, NE_POLL_WAIT_TIME_MS);
+ if (rc < 0) {
+ printf("Error in poll [%m]\n");
+
+ return NULL;
+ }
+
+ i++;
+
+ if (!rc) {
+ printf("Poll: %d seconds elapsed\n",
+ i * NE_POLL_WAIT_TIME);
+
+ continue;
+ }
+
+ printf("Poll received value 0x%x\n", fds[0].revents);
+
+ if (fds[0].revents & POLLHUP) {
+ printf("Received POLLHUP\n");
+
+ return NULL;
+ }
+
+ if (fds[0].revents & POLLNVAL) {
+ printf("Received POLLNVAL\n");
+
+ return NULL;
+ }
+ }
+
+ return NULL;
+}
+
+/**
+ * ne_alloc_user_mem_region() - Allocate a user space memory region for an enclave.
+ * @ne_user_mem_region: User space memory region allocated using hugetlbfs.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_alloc_user_mem_region(struct ne_user_mem_region *ne_user_mem_region)
+{
+ /**
+ * Check available hugetlb encodings for different huge page sizes in
+ * include/uapi/linux/mman.h.
+ */
+ ne_user_mem_region->userspace_addr = mmap(NULL, ne_user_mem_region->memory_size,
+ PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS |
+ MAP_HUGETLB | MAP_HUGE_2MB, -1, 0);
+ if (ne_user_mem_region->userspace_addr == MAP_FAILED) {
+ printf("Error in mmap memory [%m]\n");
+
+ return -1;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_load_enclave_image() - Place the enclave image in the enclave memory.
+ * @enclave_fd : The file descriptor associated with the enclave.
+ * @ne_user_mem_regions: User space memory regions allocated for the enclave.
+ * @enclave_image_path : The file path of the enclave image.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_load_enclave_image(int enclave_fd, struct ne_user_mem_region ne_user_mem_regions[],
+ char *enclave_image_path)
+{
+ unsigned char *enclave_image = NULL;
+ int enclave_image_fd = -1;
+ size_t enclave_image_size = 0;
+ size_t enclave_memory_size = 0;
+ unsigned long i = 0;
+ size_t image_written_bytes = 0;
+ struct ne_image_load_info image_load_info = {
+ .flags = NE_EIF_IMAGE,
+ };
+ struct stat image_stat_buf = {};
+ int rc = -EINVAL;
+ size_t temp_image_offset = 0;
+
+ for (i = 0; i < NE_DEFAULT_NR_MEM_REGIONS; i++)
+ enclave_memory_size += ne_user_mem_regions[i].memory_size;
+
+ rc = stat(enclave_image_path, &image_stat_buf);
+ if (rc < 0) {
+ printf("Error in get image stat info [%m]\n");
+
+ return rc;
+ }
+
+ enclave_image_size = image_stat_buf.st_size;
+
+ if (enclave_memory_size < enclave_image_size) {
+ printf("The enclave memory is smaller than the enclave image size\n");
+
+ return -ENOMEM;
+ }
+
+ rc = ioctl(enclave_fd, NE_GET_IMAGE_LOAD_INFO, &image_load_info);
+ if (rc < 0) {
+ switch (errno) {
+ case NE_ERR_NOT_IN_INIT_STATE: {
+ printf("Error in get image load info, enclave not in init state\n");
+
+ break;
+ }
+
+ default:
+ printf("Error in get image load info [%m]\n");
+ }
+
+ return rc;
+ }
+
+ printf("Enclave image offset in enclave memory is %lld\n",
+ image_load_info.memory_offset);
+
+ enclave_image_fd = open(enclave_image_path, O_RDONLY);
+ if (enclave_image_fd < 0) {
+ printf("Error in open enclave image file [%m]\n");
+
+ return enclave_image_fd;
+ }
+
+ enclave_image = mmap(NULL, enclave_image_size, PROT_READ,
+ MAP_PRIVATE, enclave_image_fd, 0);
+ if (enclave_image == MAP_FAILED) {
+ printf("Error in mmap enclave image [%m]\n");
+
+ close(enclave_image_fd);
+
+ return -1;
+ }
+
+ temp_image_offset = image_load_info.memory_offset;
+
+ for (i = 0; i < NE_DEFAULT_NR_MEM_REGIONS; i++) {
+ size_t bytes_to_write = 0;
+ size_t memory_offset = 0;
+ size_t memory_size = ne_user_mem_regions[i].memory_size;
+ size_t remaining_bytes = 0;
+ void *userspace_addr = ne_user_mem_regions[i].userspace_addr;
+
+ if (temp_image_offset >= memory_size) {
+ temp_image_offset -= memory_size;
+
+ continue;
+ } else if (temp_image_offset != 0) {
+ memory_offset = temp_image_offset;
+ memory_size -= temp_image_offset;
+ temp_image_offset = 0;
+ }
+
+ remaining_bytes = enclave_image_size - image_written_bytes;
+ bytes_to_write = memory_size < remaining_bytes ?
+ memory_size : remaining_bytes;
+
+ memcpy(userspace_addr + memory_offset,
+ enclave_image + image_written_bytes, bytes_to_write);
+
+ image_written_bytes += bytes_to_write;
+
+ if (image_written_bytes == enclave_image_size)
+ break;
+ }
+
+ munmap(enclave_image, enclave_image_size);
+
+ close(enclave_image_fd);
+
+ return 0;
+}
+
+/**
+ * ne_set_user_mem_region() - Set a user space memory region for the given enclave.
+ * @enclave_fd : The file descriptor associated with the enclave.
+ * @ne_user_mem_region : User space memory region to be set for the enclave.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_set_user_mem_region(int enclave_fd, struct ne_user_mem_region ne_user_mem_region)
+{
+ struct ne_user_memory_region mem_region = {
+ .flags = NE_DEFAULT_MEMORY_REGION,
+ .memory_size = ne_user_mem_region.memory_size,
+ .userspace_addr = (__u64)ne_user_mem_region.userspace_addr,
+ };
+ int rc = -EINVAL;
+
+ rc = ioctl(enclave_fd, NE_SET_USER_MEMORY_REGION, &mem_region);
+ if (rc < 0) {
+ switch (errno) {
+ case NE_ERR_NOT_IN_INIT_STATE: {
+ printf("Error in set user memory region, enclave not in init state\n");
+
+ break;
+ }
+
+ case NE_ERR_INVALID_MEM_REGION_SIZE: {
+ printf("Error in set user memory region, mem size not multiple of 2 MiB\n");
+
+ break;
+ }
+
+ case NE_ERR_INVALID_MEM_REGION_ADDR: {
+ printf("Error in set user memory region, invalid user space address\n");
+
+ break;
+ }
+
+ case NE_ERR_UNALIGNED_MEM_REGION_ADDR: {
+ printf("Error in set user memory region, unaligned user space address\n");
+
+ break;
+ }
+
+ case NE_ERR_MEM_REGION_ALREADY_USED: {
+ printf("Error in set user memory region, memory region already used\n");
+
+ break;
+ }
+
+ case NE_ERR_MEM_NOT_HUGE_PAGE: {
+ printf("Error in set user memory region, not backed by huge pages\n");
+
+ break;
+ }
+
+ case NE_ERR_MEM_DIFFERENT_NUMA_NODE: {
+ printf("Error in set user memory region, different NUMA node than CPUs\n");
+
+ break;
+ }
+
+ case NE_ERR_MEM_MAX_REGIONS: {
+ printf("Error in set user memory region, max memory regions reached\n");
+
+ break;
+ }
+
+ default:
+ printf("Error in set user memory region [%m]\n");
+ }
+
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_free_mem_regions() - Unmap all the user space memory regions that were set
+ * aside for the enclave.
+ * @ne_user_mem_regions: The user space memory regions associated with an enclave.
+ *
+ * Context: Process context.
+ */
+static void ne_free_mem_regions(struct ne_user_mem_region ne_user_mem_regions[])
+{
+ unsigned int i = 0;
+
+ for (i = 0; i < NE_DEFAULT_NR_MEM_REGIONS; i++)
+ munmap(ne_user_mem_regions[i].userspace_addr,
+ ne_user_mem_regions[i].memory_size);
+}
+
+/**
+ * ne_add_vcpu() - Add a vCPU to the given enclave.
+ * @enclave_fd : The file descriptor associated with the enclave.
+ * @vcpu_id: vCPU id to be set for the enclave, either provided or
+ * auto-generated (if provided vCPU id is 0).
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_add_vcpu(int enclave_fd, unsigned int *vcpu_id)
+{
+ int rc = -EINVAL;
+
+ rc = ioctl(enclave_fd, NE_ADD_VCPU, vcpu_id);
+ if (rc < 0) {
+ switch (errno) {
+ case NE_ERR_NO_CPUS_AVAIL_IN_POOL: {
+ printf("Error in add vcpu, no CPUs available in the NE CPU pool\n");
+
+ break;
+ }
+
+ case NE_ERR_VCPU_ALREADY_USED: {
+ printf("Error in add vcpu, the provided vCPU is already used\n");
+
+ break;
+ }
+
+ case NE_ERR_VCPU_NOT_IN_CPU_POOL: {
+ printf("Error in add vcpu, the provided vCPU is not in the NE CPU pool\n");
+
+ break;
+ }
+
+ case NE_ERR_VCPU_INVALID_CPU_CORE: {
+ printf("Error in add vcpu, the core id of the provided vCPU is invalid\n");
+
+ break;
+ }
+
+ case NE_ERR_NOT_IN_INIT_STATE: {
+ printf("Error in add vcpu, enclave not in init state\n");
+
+ break;
+ }
+
+ case NE_ERR_INVALID_VCPU: {
+ printf("Error in add vcpu, the provided vCPU is out of avail CPUs range\n");
+
+ break;
+ }
+
+ default:
+ printf("Error in add vcpu [%m]\n");
+
+ }
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_start_enclave() - Start the given enclave.
+ * @enclave_fd : The file descriptor associated with the enclave.
+ * @enclave_start_info : Enclave metadata used for starting e.g. vsock CID.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_start_enclave(int enclave_fd, struct ne_enclave_start_info *enclave_start_info)
+{
+ int rc = -EINVAL;
+
+ rc = ioctl(enclave_fd, NE_START_ENCLAVE, enclave_start_info);
+ if (rc < 0) {
+ switch (errno) {
+ case NE_ERR_NOT_IN_INIT_STATE: {
+ printf("Error in start enclave, enclave not in init state\n");
+
+ break;
+ }
+
+ case NE_ERR_NO_MEM_REGIONS_ADDED: {
+ printf("Error in start enclave, no memory regions have been added\n");
+
+ break;
+ }
+
+ case NE_ERR_NO_VCPUS_ADDED: {
+ printf("Error in start enclave, no vCPUs have been added\n");
+
+ break;
+ }
+
+ case NE_ERR_FULL_CORES_NOT_USED: {
+ printf("Error in start enclave, enclave has no full cores set\n");
+
+ break;
+ }
+
+ case NE_ERR_ENCLAVE_MEM_MIN_SIZE: {
+ printf("Error in start enclave, enclave memory is less than min size\n");
+
+ break;
+ }
+
+ default:
+ printf("Error in start enclave [%m]\n");
+ }
+
+ return rc;
+ }
+
+ return 0;
+}
+
+/**
+ * ne_check_enclave_booted() - Wait for a heartbeat from the enclave on a newly
+ * created vsock channel to check it has booted.
+ * @void: No parameters provided.
+ *
+ * Context: Process context.
+ * Return:
+ * * 0 on success.
+ * * Negative return value on failure.
+ */
+static int ne_check_enclave_booted(void)
+{
+ struct sockaddr_vm client_vsock_addr = {};
+ int client_vsock_fd = -1;
+ socklen_t client_vsock_len = sizeof(client_vsock_addr);
+ struct pollfd fds[1] = {};
+ int rc = -EINVAL;
+ unsigned char recv_buf = 0;
+ struct sockaddr_vm server_vsock_addr = {
+ .svm_family = AF_VSOCK,
+ .svm_cid = NE_IMAGE_LOAD_HEARTBEAT_CID,
+ .svm_port = NE_IMAGE_LOAD_HEARTBEAT_PORT,
+ };
+ int server_vsock_fd = -1;
+
+ server_vsock_fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+ if (server_vsock_fd < 0) {
+ rc = server_vsock_fd;
+
+ printf("Error in socket [%m]\n");
+
+ return rc;
+ }
+
+ rc = bind(server_vsock_fd, (struct sockaddr *)&server_vsock_addr,
+ sizeof(server_vsock_addr));
+ if (rc < 0) {
+ printf("Error in bind [%m]\n");
+
+ goto out;
+ }
+
+ rc = listen(server_vsock_fd, 1);
+ if (rc < 0) {
+ printf("Error in listen [%m]\n");
+
+ goto out;
+ }
+
+ fds[0].fd = server_vsock_fd;
+ fds[0].events = POLLIN;
+
+ rc = poll(fds, 1, NE_POLL_WAIT_TIME_MS);
+ if (rc < 0) {
+ printf("Error in poll [%m]\n");
+
+ goto out;
+ }
+
+ if (!rc) {
+ printf("Poll timeout, %d seconds elapsed\n", NE_POLL_WAIT_TIME);
+
+ rc = -ETIMEDOUT;
+
+ goto out;
+ }
+
+ if ((fds[0].revents & POLLIN) == 0) {
+ printf("Poll received value %d\n", fds[0].revents);
+
+ rc = -EINVAL;
+
+ goto out;
+ }
+
+ rc = accept(server_vsock_fd, (struct sockaddr *)&client_vsock_addr,
+ &client_vsock_len);
+ if (rc < 0) {
+ printf("Error in accept [%m]\n");
+
+ goto out;
+ }
+
+ client_vsock_fd = rc;
+
+ /*
+ * Read the heartbeat value that the init process in the enclave sends
+ * after vsock connect.
+ */
+ rc = read(client_vsock_fd, &recv_buf, sizeof(recv_buf));
+ if (rc < 0) {
+ printf("Error in read [%m]\n");
+
+ goto out;
+ }
+
+ if (rc != sizeof(recv_buf) || recv_buf != NE_IMAGE_LOAD_HEARTBEAT_VALUE) {
+ printf("Read %d instead of %d\n", recv_buf,
+ NE_IMAGE_LOAD_HEARTBEAT_VALUE);
+
+ /* A short read or a wrong value must not be reported as success. */
+ rc = -EINVAL;
+
+ goto out;
+ }
+
+ /* Write the heartbeat value back. */
+ rc = write(client_vsock_fd, &recv_buf, sizeof(recv_buf));
+ if (rc < 0) {
+ printf("Error in write [%m]\n");
+
+ goto out;
+ }
+
+ rc = 0;
+
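+out:
+ /* Close the accepted connection too, if there is one. */
+ if (client_vsock_fd >= 0)
+ close(client_vsock_fd);
+
+ close(server_vsock_fd);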
+
+ return rc;
+}
+
+int main(int argc, char *argv[])
+{
+ int enclave_fd = -1;
+ struct ne_enclave_start_info enclave_start_info = {};
+ unsigned int i = 0;
+ int ne_dev_fd = -1;
+ struct ne_user_mem_region ne_user_mem_regions[NE_DEFAULT_NR_MEM_REGIONS] = {};
+ unsigned int ne_vcpus[NE_DEFAULT_NR_VCPUS] = {};
+ int rc = -EINVAL;
+ pthread_t thread_id = 0;
+ unsigned long slot_uid = 0;
+
+ if (argc != 2) {
+ printf("Usage: %s <path_to_enclave_image>\n", argv[0]);
+
+ exit(EXIT_FAILURE);
+ }
+
+ if (strlen(argv[1]) >= PATH_MAX) {
+ printf("The size of the path to enclave image is higher than max path\n");
+
+ exit(EXIT_FAILURE);
+ }
+
+ ne_dev_fd = open(NE_DEV_NAME, O_RDWR | O_CLOEXEC);
+ if (ne_dev_fd < 0) {
+ printf("Error in open NE device [%m]\n");
+
+ exit(EXIT_FAILURE);
+ }
+
+ printf("Creating enclave slot ...\n");
+
+ rc = ne_create_vm(ne_dev_fd, &slot_uid, &enclave_fd);
+
+ close(ne_dev_fd);
+
+ if (rc < 0)
+ exit(EXIT_FAILURE);
+
+ printf("Enclave fd %d\n", enclave_fd);
+
+ rc = pthread_create(&thread_id, NULL, ne_poll_enclave_fd, (void *)&enclave_fd);
+ /* pthread_create() returns a positive error number and does not set errno. */
+ if (rc != 0) {
+ printf("Error in thread create [%s]\n", strerror(rc));
+
+ close(enclave_fd);
+
+ exit(EXIT_FAILURE);
+ }
+
+ for (i = 0; i < NE_DEFAULT_NR_MEM_REGIONS; i++) {
+ ne_user_mem_regions[i].memory_size = NE_MIN_MEM_REGION_SIZE;
+
+ rc = ne_alloc_user_mem_region(&ne_user_mem_regions[i]);
+ if (rc < 0) {
+ printf("Error in alloc userspace memory region, iter %u\n", i);
+
+ goto release_enclave_fd;
+ }
+ }
+
+ rc = ne_load_enclave_image(enclave_fd, ne_user_mem_regions, argv[1]);
+ if (rc < 0)
+ goto release_enclave_fd;
+
+ for (i = 0; i < NE_DEFAULT_NR_MEM_REGIONS; i++) {
+ rc = ne_set_user_mem_region(enclave_fd, ne_user_mem_regions[i]);
+ if (rc < 0) {
+ printf("Error in set memory region, iter %u\n", i);
+
+ goto release_enclave_fd;
+ }
+ }
+
+ printf("Enclave memory regions were added\n");
+
+ for (i = 0; i < NE_DEFAULT_NR_VCPUS; i++) {
+ /*
+ * The vCPU is chosen from the enclave vCPU pool, if the value
+ * of the vcpu_id is 0.
+ */
+ ne_vcpus[i] = 0;
+ rc = ne_add_vcpu(enclave_fd, &ne_vcpus[i]);
+ if (rc < 0) {
+ printf("Error in add vcpu, iter %u\n", i);
+
+ goto release_enclave_fd;
+ }
+
+ printf("Added vCPU %u to the enclave\n", ne_vcpus[i]);
+ }
+
+ printf("Enclave vCPUs were added\n");
+
+ rc = ne_start_enclave(enclave_fd, &enclave_start_info);
+ if (rc < 0)
+ goto release_enclave_fd;
+
+ printf("Enclave started, CID %llu\n", enclave_start_info.enclave_cid);
+
+ rc = ne_check_enclave_booted();
+ if (rc < 0) {
+ printf("Error in the enclave image loading heartbeat logic [rc=%d]\n", rc);
+
+ goto release_enclave_fd;
+ }
+
+ printf("Entering sleep for %d seconds ...\n", NE_SLEEP_TIME);
+
+ sleep(NE_SLEEP_TIME);
+
+ close(enclave_fd);
+
+ ne_free_mem_regions(ne_user_mem_regions);
+
+ exit(EXIT_SUCCESS);
+
+release_enclave_fd:
+ close(enclave_fd);
+ ne_free_mem_regions(ne_user_mem_regions);
+
+ exit(EXIT_FAILURE);
+}
--
2.20.1 (Apple Git-117)
Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
On 17.08.20 15:09, Andra Paraschiv wrote:
> Nitro Enclaves (NE) is a new Amazon Elastic Compute Cloud (EC2) capability
> that allows customers to carve out isolated compute environments within EC2
> instances [1].
>
> For example, an application that processes sensitive data and runs in a VM,
> can be separated from other applications running in the same VM. This
> application then runs in a separate VM than the primary VM, namely an enclave.
>
> An enclave runs alongside the VM that spawned it. This setup suits the needs of
> low-latency applications. The resources that are allocated for the enclave, such as
> memory and CPUs, are carved out of the primary VM. Each enclave is mapped to a
> process running in the primary VM, that communicates with the NE driver via an
> ioctl interface.
>
> In this sense, there are two components:
>
> 1. An enclave abstraction process - a user space process running in the primary
> VM guest that uses the provided ioctl interface of the NE driver to spawn an
> enclave VM (that's 2 below).
>
> There is a NE emulated PCI device exposed to the primary VM. The driver for this
> new PCI device is included in the NE driver.
>
> The ioctl logic is mapped to PCI device commands e.g. the NE_START_ENCLAVE ioctl
> maps to an enclave start PCI command. The PCI device commands are then
> translated into actions taken on the hypervisor side; that's the Nitro
> hypervisor running on the host where the primary VM is running. The Nitro
> hypervisor is based on core KVM technology.
>
> 2. The enclave itself - a VM running on the same host as the primary VM that
> spawned it. Memory and CPUs are carved out of the primary VM and are dedicated
> for the enclave VM. An enclave does not have persistent storage attached.
>
> The memory regions carved out of the primary VM and given to an enclave need to
> be aligned 2 MiB / 1 GiB physically contiguous memory regions (or multiple of
> this size e.g. 8 MiB). The memory can be allocated e.g. by using hugetlbfs from
> user space [2][3]. The memory size for an enclave needs to be at least 64 MiB.
> The enclave memory and CPUs need to be from the same NUMA node.
>
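The memory constraints above can be sketched as a user space pre-check. This is only an illustration: struct region, region_is_aligned() and regions_are_valid() are hypothetical names (the sample itself uses struct ne_user_mem_region), and the check covers only what user space can see. Physical contiguity is assumed to come from the hugetlbfs backing and is not verified here. Since 1 GiB is itself a multiple of 2 MiB, a single 2 MiB alignment test covers both page sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define REGION_ALIGN (2ULL * MIB)      /* 2 MiB hugepage granularity */
#define ENCLAVE_MIN_MEM (64ULL * MIB)  /* minimum total enclave memory */

/* Hypothetical descriptor, used only for this sketch. */
struct region {
	uint64_t addr; /* user space virtual address of the region */
	uint64_t size; /* region size in bytes */
};

/* A region qualifies if it is non-empty and both start and size are 2 MiB multiples. */
bool region_is_aligned(const struct region *r)
{
	return r->size != 0 &&
	       r->addr % REGION_ALIGN == 0 &&
	       r->size % REGION_ALIGN == 0;
}

/* Every region must qualify and together they must provide at least 64 MiB. */
bool regions_are_valid(const struct region *regions, size_t nr)
{
	uint64_t total = 0;

	assert(regions != NULL || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (!region_is_aligned(&regions[i]))
			return false;
		total += regions[i].size;
	}

	return total >= ENCLAVE_MIN_MEM;
}
```

A check like this only catches obvious mistakes early; the NE driver remains the authority on whether a region is accepted.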
> An enclave runs on dedicated cores. CPU 0 and its CPU siblings need to remain
> available for the primary VM. A CPU pool has to be set for NE purposes by a
> user with admin capability. See the cpu list section from the kernel
> documentation [4] for how a CPU pool format looks.
>
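To illustrate the CPU pool format, here is a minimal parser sketch for the simple forms of a cpu list ("1,3" or "2-5,9"). It is not the driver's parser: parse_cpu_pool() and MAX_CPUS are hypothetical, the stride syntax from the kernel documentation is not handled, and only CPU 0 itself is rejected. Excluding the siblings of CPU 0 would need topology information that this sketch does not have.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_CPUS 512 /* arbitrary bound for this sketch */

/*
 * Parse a simple cpu list such as "1,3" or "2-5,9" into a per-CPU flag array.
 * Returns the number of CPUs selected, or -EINVAL on malformed input.
 */
int parse_cpu_pool(const char *list, bool in_pool[MAX_CPUS])
{
	const char *p = list;
	int count = 0;

	assert(list != NULL && in_pool != NULL);

	for (int i = 0; i < MAX_CPUS; i++)
		in_pool[i] = false;

	while (*p != '\0') {
		unsigned long first, last;
		char *end = NULL;

		/* Each entry has to start with a digit (no sign, no space). */
		if (*p < '0' || *p > '9')
			return -EINVAL;
		first = strtoul(p, &end, 10);
		last = first;
		p = end;

		/* Optional "-M" part of an "N-M" range. */
		if (*p == '-') {
			p++;
			if (*p < '0' || *p > '9')
				return -EINVAL;
			last = strtoul(p, &end, 10);
			p = end;
		}

		if (first > last || last >= MAX_CPUS)
			return -EINVAL;

		/* CPU 0 has to remain available for the primary VM. */
		if (first == 0)
			return -EINVAL;

		for (unsigned long cpu = first; cpu <= last; cpu++) {
			if (!in_pool[cpu]) {
				in_pool[cpu] = true;
				count++;
			}
		}

		if (*p == ',') {
			p++;
			if (*p == '\0')
				return -EINVAL; /* trailing comma */
		} else if (*p != '\0') {
			return -EINVAL;
		}
	}

	return count;
}
```

With this sketch, "2-5,9" selects five CPUs, while "0-3" is rejected because it would take CPU 0 away from the primary VM.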
> An enclave communicates with the primary VM via a local communication channel,
> using virtio-vsock [5]. The primary VM has a virtio-pci vsock emulated device,
> while the enclave VM has a virtio-mmio vsock emulated device. The vsock device
> uses eventfd for signaling. The enclave VM sees the usual interfaces - local
> APIC and IOAPIC - to get interrupts from virtio-vsock device. The virtio-mmio
> device is placed in memory below the typical 4 GiB.
>
> The application that runs in the enclave needs to be packaged in an enclave
> image together with the OS (e.g. kernel, ramdisk, init) that will run in the
> enclave VM. The enclave VM has its own kernel and follows the standard Linux
> boot protocol.
>
> The kernel bzImage, the kernel command line, the ramdisk(s) are part of the
> Enclave Image Format (EIF); plus an EIF header including metadata such as magic
> number, eif version, image size and CRC.
>
> Hash values are computed for the entire enclave image (EIF), the kernel and
> ramdisk(s). That's used, for example, to check that the enclave image that is
> loaded in the enclave VM is the one that was intended to be run.
>
> These crypto measurements are included in a signed attestation document
> generated by the Nitro Hypervisor and further used to prove the identity of the
> enclave; KMS is an example of a service that NE is integrated with and that checks
> the attestation doc.
>
> The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
> init process in the enclave connects to the vsock CID of the primary VM and a
> predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
> used to check in the primary VM that the enclave has booted.
>
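For reference, the enclave side of this heartbeat exchange might look like the sketch below. It is an assumption, not the actual init code: send_heartbeat() is a hypothetical helper that takes an already connected stream socket, sends 0xb7, and waits for the echo that ne_check_enclave_booted() in the sample writes back. In the enclave the descriptor would be an AF_VSOCK socket connected to the primary VM's CID on port 9000; the helper is kept transport-agnostic so it can be exercised over any stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

#define HEARTBEAT_VALUE 0xb7 /* heartbeat value named in the cover letter */

/*
 * Send the heartbeat byte on a connected stream socket and wait for the
 * peer to echo it back. Returns 0 on success, a negative errno otherwise.
 */
int send_heartbeat(int fd)
{
	unsigned char out = HEARTBEAT_VALUE;
	unsigned char in = 0;
	ssize_t n;

	assert(fd >= 0);

	n = send(fd, &out, sizeof(out), MSG_NOSIGNAL);
	if (n < 0)
		return -errno;
	if ((size_t)n != sizeof(out))
		return -EIO;

	n = recv(fd, &in, sizeof(in), 0);
	if (n < 0)
		return -errno;

	/* A closed connection (n == 0) or a wrong echo is a protocol error. */
	if ((size_t)n != sizeof(in) || in != HEARTBEAT_VALUE)
		return -EPROTO;

	return 0;
}
```

Treating a wrong or missing echo as an error mirrors what the primary VM side needs to do with the value it reads.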
> If the enclave VM crashes or gracefully exits, an interrupt event is received by
> the NE driver. This event is sent further to the user space enclave process
> running in the primary VM via a poll notification mechanism. Then the user space
> enclave process can exit.
>
> Thank you.
>
This version reads very well, thanks a lot Andra!
Greg, would you mind to have another look over it?
Reviewed-by: Alexander Graf <[email protected]>
Alex
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
On Wed, Aug 19, 2020 at 01:15:59PM +0200, Alexander Graf wrote:
> This version reads very well, thanks a lot Andra!
>
> Greg, would you mind to have another look over it?
Will do, it's in my to-review queue, behind lots of other patches...
On 19/08/2020 14:26, Greg KH wrote:
>
> On Wed, Aug 19, 2020 at 01:15:59PM +0200, Alexander Graf wrote:
>> This version reads very well, thanks a lot Andra!
Glad that the review experience has been improved and the patch series
is in a better shape.
>>
>> Greg, would you mind to have another look over it?
> Will do, it's in my to-review queue, behind lots of other patches...
>
Thanks both for taking time to go through the patch series.
Andra
On 19/08/2020 14:26, Greg KH wrote:
>
> On Wed, Aug 19, 2020 at 01:15:59PM +0200, Alexander Graf wrote:
>> This version reads very well, thanks a lot Andra!
>>
>> Greg, would you mind to have another look over it?
> Will do, it's in my to-review queue, behind lots of other patches...
>
I have a set of updates that can be included in a new revision (v8), e.g.
new NE custom error codes for invalid flags / enclave CID, a "shutdown"
function for the NE PCI device driver, a couple more checks for invalid
flags and the enclave vsock CID, and documentation and sample updates.
There is also the option to send these updates as follow-up patches.
Greg, let me know what would work best for you with regard to the review
of the patch series.
Thanks,
Andra
On Mon, Aug 31, 2020 at 11:19:19AM +0300, Paraschiv, Andra-Irina wrote:
>
>
> On 19/08/2020 14:26, Greg KH wrote:
> >
> > On Wed, Aug 19, 2020 at 01:15:59PM +0200, Alexander Graf wrote:
> > > This version reads very well, thanks a lot Andra!
> > >
> > > Greg, would you mind to have another look over it?
> > Will do, it's in my to-review queue, behind lots of other patches...
> >
>
> I have a set of updates that can be included in a new revision (v8), e.g. new
> NE custom error codes for invalid flags / enclave CID, a "shutdown" function
> for the NE PCI device driver, a couple more checks for invalid flags and the
> enclave vsock CID, and documentation and sample updates. There is also the
> option to send these updates as follow-up patches.
>
> Greg, let me know what would work best for you with regard to the review of
> the patch series.
A new series is always fine with me...
thanks,
greg k-h
On 04/09/2020 19:13, Greg KH wrote:
> On Mon, Aug 31, 2020 at 11:19:19AM +0300, Paraschiv, Andra-Irina wrote:
>>
>> On 19/08/2020 14:26, Greg KH wrote:
>>> On Wed, Aug 19, 2020 at 01:15:59PM +0200, Alexander Graf wrote:
>>>> On 17.08.20 15:09, Andra Paraschiv wrote:
>>>>> Nitro Enclaves (NE) is a new Amazon Elastic Compute Cloud (EC2) capability
>>>>> that allows customers to carve out isolated compute environments within EC2
>>>>> instances [1].
>>>>>
>>>>> For example, an application that processes sensitive data and runs in a VM,
>>>>> can be separated from other applications running in the same VM. This
>>>>> application then runs in a separate VM than the primary VM, namely an enclave.
>>>>>
>>>>> An enclave runs alongside the VM that spawned it. This setup matches low latency
>>>>> applications needs. The resources that are allocated for the enclave, such as
>>>>> memory and CPUs, are carved out of the primary VM. Each enclave is mapped to a
>>>>> process running in the primary VM, that communicates with the NE driver via an
>>>>> ioctl interface.
>>>>>
>>>>> In this sense, there are two components:
>>>>>
>>>>> 1. An enclave abstraction process - a user space process running in the primary
>>>>> VM guest that uses the provided ioctl interface of the NE driver to spawn an
>>>>> enclave VM (that's 2 below).
>>>>>
>>>>> There is a NE emulated PCI device exposed to the primary VM. The driver for this
>>>>> new PCI device is included in the NE driver.
>>>>>
>>>>> The ioctl logic is mapped to PCI device commands e.g. the NE_START_ENCLAVE ioctl
>>>>> maps to an enclave start PCI command. The PCI device commands are then
>>>>> translated into actions taken on the hypervisor side; that's the Nitro
>>>>> hypervisor running on the host where the primary VM is running. The Nitro
>>>>> hypervisor is based on core KVM technology.
>>>>>
>>>>> 2. The enclave itself - a VM running on the same host as the primary VM that
>>>>> spawned it. Memory and CPUs are carved out of the primary VM and are dedicated
>>>>> for the enclave VM. An enclave does not have persistent storage attached.
>>>>>
>>>>> The memory regions carved out of the primary VM and given to an enclave need to
>>>>> be aligned 2 MiB / 1 GiB physically contiguous memory regions (or multiple of
>>>>> this size e.g. 8 MiB). The memory can be allocated e.g. by using hugetlbfs from
>>>>> user space [2][3]. The memory size for an enclave needs to be at least 64 MiB.
>>>>> The enclave memory and CPUs need to be from the same NUMA node.
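
The size and alignment rules above can be sketched as a user space pre-check
done before handing memory to the driver. This is only an illustration, not
code from the patch series: the function and macro names are hypothetical, and
the check uses the 2 MiB granularity (a 1 GiB region is a multiple of 2 MiB,
so it passes as well). Physical contiguity and NUMA placement cannot be
verified this way and are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values come from the description above. */
#define NE_SKETCH_REGION_ALIGN	(2ULL << 20)	/* 2 MiB */
#define NE_SKETCH_MIN_MEM_SIZE	(64ULL << 20)	/* 64 MiB */

/*
 * A region qualifies if it is non-empty and both its start address and
 * its size are multiples of 2 MiB.
 */
bool ne_sketch_region_ok(uint64_t addr, uint64_t size)
{
	return size != 0 &&
	       addr % NE_SKETCH_REGION_ALIGN == 0 &&
	       size % NE_SKETCH_REGION_ALIGN == 0;
}

/* Every region must qualify and together they must reach 64 MiB. */
bool ne_sketch_regions_ok(const uint64_t *addrs, const uint64_t *sizes,
			  size_t count)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (!ne_sketch_region_ok(addrs[i], sizes[i]))
			return false;
		total += sizes[i];
	}

	return total >= NE_SKETCH_MIN_MEM_SIZE;
}
```
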
>>>>>
>>>>> An enclave runs on dedicated cores. CPU 0 and its CPU siblings need to remain
>>>>> available for the primary VM. A CPU pool has to be set for NE purposes by a
>>>>> user with admin capability. See the cpu list section from the kernel
>>>>> documentation [4] for how a CPU pool format looks.
>>>>>
>>>>> An enclave communicates with the primary VM via a local communication channel,
>>>>> using virtio-vsock [5]. The primary VM has a virtio-pci vsock emulated device,
>>>>> while the enclave VM has a virtio-mmio vsock emulated device. The vsock device
>>>>> uses eventfd for signaling. The enclave VM sees the usual interfaces - local
>>>>> APIC and IOAPIC - to get interrupts from virtio-vsock device. The virtio-mmio
>>>>> device is placed in memory below the typical 4 GiB.
>>>>>
>>>>> The application that runs in the enclave needs to be packaged in an enclave
>>>>> image together with the OS (e.g. kernel, ramdisk, init) that will run in the
>>>>> enclave VM. The enclave VM has its own kernel and follows the standard Linux
>>>>> boot protocol.
>>>>>
>>>>> The kernel bzImage, the kernel command line, the ramdisk(s) are part of the
>>>>> Enclave Image Format (EIF); plus an EIF header including metadata such as magic
>>>>> number, eif version, image size and CRC.
>>>>>
>>>>> Hash values are computed for the entire enclave image (EIF), the kernel and
>>>>> ramdisk(s). That's used, for example, to check that the enclave image that is
>>>>> loaded in the enclave VM is the one that was intended to be run.
>>>>>
>>>>> These crypto measurements are included in a signed attestation document
>>>>> generated by the Nitro Hypervisor and further used to prove the identity of the
>>>>> enclave; KMS is an example of a service that NE is integrated with and that checks
>>>>> the attestation doc.
>>>>>
>>>>> The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
>>>>> init process in the enclave connects to the vsock CID of the primary VM and a
>>>>> predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
>>>>> used to check in the primary VM that the enclave has booted.
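
For illustration, the primary VM side of this heartbeat check could look
roughly like the sketch below. It is not the code from the sample or the
driver: the function and macro names are hypothetical, only the port (9000)
and the value (0xb7) come from the text above, and the listener blocks with no
timeout. It assumes a Linux host with AF_VSOCK support.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

#define NE_SKETCH_HEARTBEAT_PORT	9000
#define NE_SKETCH_HEARTBEAT_VALUE	0xb7

/*
 * Read one byte from a connected descriptor and compare it with the
 * heartbeat value. Returns 0 on a match, -1 otherwise.
 */
int ne_sketch_check_heartbeat(int fd)
{
	uint8_t byte = 0;
	ssize_t n = read(fd, &byte, sizeof(byte));

	return (n == 1 && byte == NE_SKETCH_HEARTBEAT_VALUE) ? 0 : -1;
}

/*
 * Listen on the heartbeat port, accept a single connection from the
 * enclave init process and check the byte it sends.
 */
int ne_sketch_wait_for_boot(void)
{
	struct sockaddr_vm addr;
	int listen_fd, conn_fd, rc;

	listen_fd = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (listen_fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = VMADDR_CID_ANY;
	addr.svm_port = NE_SKETCH_HEARTBEAT_PORT;

	if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(listen_fd, 1) < 0) {
		close(listen_fd);
		return -1;
	}

	conn_fd = accept(listen_fd, NULL, NULL);
	if (conn_fd < 0) {
		close(listen_fd);
		return -1;
	}

	rc = ne_sketch_check_heartbeat(conn_fd);
	close(conn_fd);
	close(listen_fd);

	return rc;
}
```

The byte check is kept separate from the socket setup so that it can be
exercised with any descriptor, e.g. a pipe, without a running enclave.
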
>>>>>
>>>>> If the enclave VM crashes or gracefully exits, an interrupt event is received by
>>>>> the NE driver. This event is sent further to the user space enclave process
>>>>> running in the primary VM via a poll notification mechanism. Then the user space
>>>>> enclave process can exit.
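
A rough sketch of how the user space enclave process might wait for that
notification with poll(). The text above does not say which descriptor is
polled or which event flag is raised, so both are assumptions here:
enclave_fd stands for a per-enclave file descriptor obtained through the ioctl
interface, and POLLHUP is assumed to be the exit event. The function name is
hypothetical.

```c
#include <poll.h>

/*
 * Returns 1 if the exit event was signalled, 0 if the timeout expired
 * first and -1 on error.
 */
int ne_sketch_wait_for_exit(int enclave_fd, int timeout_ms)
{
	/* POLLHUP is always reported, so no events need to be requested. */
	struct pollfd pfd = { .fd = enclave_fd, .events = 0 };
	int rc = poll(&pfd, 1, timeout_ms);

	if (rc <= 0)
		return rc;	/* 0: timeout, -1: poll() failed */

	if (pfd.revents & POLLHUP)
		return 1;

	return -1;		/* POLLERR or POLLNVAL */
}
```
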
>>>>>
>>>>> Thank you.
>>>>>
>>>> This version reads very well, thanks a lot Andra!
>>>>
>>>> Greg, would you mind having another look at it?
>>> Will do, it's in my to-review queue, behind lots of other patches...
>>>
>> I have a set of updates that can be included in a new revision (v8), e.g. new
>> NE custom error codes for invalid flags / enclave CID, a "shutdown" function
>> for the NE PCI device driver, a couple more checks wrt invalid flags and
>> enclave vsock CID, and documentation and sample updates. There is also the
>> option to have these updates as follow-up patches.
>>
>> Greg, let me know what would work best for you with regard to the review of
>> the patch series.
> A new series is always fine with me...
>
Alright, thank you. I sent out the new revision.
Andra