2022-02-09 11:19:46

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

V1: https://lore.kernel.org/linux-sgx/[email protected]/

Changes since V1 that directly impact user space:
- SGX2 permission changes changed from a single ioctl() named
SGX_IOC_PAGE_MODP to two new ioctl()s:
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, supported by two different
parameter structures (SGX_IOC_ENCLAVE_RELAX_PERMISSIONS does
not support a result output parameter) (Jarkko).

User space flow impact: After user space runs ENCLU[EMODPE] it
needs to call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to have PTEs
updated. Previously running SGX_IOC_PAGE_MODP in this scenario
resulted in EPCM.PR being set but calling
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS will not result in EPCM.PR
being set anymore and thus no need for an additional
ENCLU[EACCEPT].

- SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
obtain new permissions from secinfo as parameter instead of
the permissions directly (Jarkko).

- ioctl() supporting SGX2 page type change is renamed from
SGX_IOC_PAGE_MODT to SGX_IOC_ENCLAVE_MODIFY_TYPE (Jarkko).

- SGX_IOC_ENCLAVE_MODIFY_TYPE obtains new page type from secinfo
as parameter instead of the page type directly (Jarkko).

- ioctl() supporting SGX2 page removal is renamed from
SGX_IOC_PAGE_REMOVE to SGX_IOC_ENCLAVE_REMOVE_PAGES (Jarkko).

- All ioctl() parameter structures have been renamed as a result of the
ioctl() renaming:
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS => struct sgx_enclave_relax_perm
SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS => struct sgx_enclave_restrict_perm
SGX_IOC_ENCLAVE_MODIFY_TYPE => struct sgx_enclave_modt
SGX_IOC_ENCLAVE_REMOVE_PAGES => struct sgx_enclave_remove_pages

Changes since V1 that do not directly impact user space:
- Number of patches in series increased from 25 to 32 primarily because
of splitting the original submission:
- Wrappers for the new SGX2 functions are introduced in three separate
patches replacing the original "x86/sgx: Add wrappers for SGX2
functions"
(Jarkko).
- Moving and renaming sgx_encl_ewb_cpumask() is done with two patches
replacing the original "x86/sgx: Use more generic name for enclave
cpumask function" (Jarkko).
- Support for SGX2 EPCM permission changes is split into two ioctls(),
one for relaxing and one for restricting permissions, each introduced
by a new patch replacing the original "x86/sgx: Support enclave page
permission changes" (Jarkko).
- Extracted code used by existing ioctls() for usage by new ioctl()s
into a new utility in new patch "x86/sgx: Create utility to validate
user provided offset and length" (Dave did not specifically ask for
this but it addresses his review feedback).
- Two new Documentation patches to support the SGX2 work
("Documentation/x86: Introduce enclave runtime management") and
a dedicated section on the enclave permission management
("Documentation/x86: Document SGX permission details") (Andy).
- Most patches were reworked to improve the language by:
* aiming to refer to exact item instead of English rephrasing (Jarkko).
* use ioctl() instead of ioctl throughout (Dave).
* Use "relaxed" instead of "exceed" when referring to permissions
(Dave).
- Improved documentation with several additions to
Documentation/x86/sgx.rst.
- Many smaller changes, please refer to individual patches.

Hi Everybody,

The current Linux kernel support for SGX includes support for SGX1 that
requires that an enclave be created with properties that accommodate all
usages over its (the enclave's) lifetime. This includes properties such
as permissions of enclave pages, the number of enclave pages, and the
number of threads supported by the enclave.

Consequences of this requirement to have the enclave be created to
accommodate all usages include:
* pages needing to support relocated code are required to have RWX
permissions for their entire lifetime,
* an enclave needs to be created with the maximum stack and heap
projected to be needed during the enclave's entire lifetime which
can be longer than the processes running within it,
* an enclave needs to be created with support for the maximum number
of threads projected to run in the enclave.

Since SGX1 a few more functions were introduced, collectively called
SGX2, that support modifications to an initialized enclave. Hardware
supporting these functions are already available as listed on
https://github.com/ayeks/SGX-hardware

This series adds support for SGX2, also referred to as Enclave Dynamic
Memory Management (EDMM). This includes:

* Support modifying permissions of regular enclave pages belonging to an
initialized enclave. New permissions are not allowed to exceed the
originally vetted permissions. For example, RX isn't allowed unless
the page was originally added with RX or RWX.
Modifying permissions is accomplished with two new ioctl()s:
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS.

* Support dynamic addition of regular enclave pages to an initialized
enclave. Pages are added with RW permissions as their "originally
vetted permissions" (see previous bullet) and thus not allowed to
be made executable at this time. Enabling dynamically added pages
to obtain executable permissions require integration with user space
policy that is deferred until the core SGX2 enabling is complete.
Pages are dynamically added to an initialized enclave from the SGX
page fault handler.

* Support expanding an initialized enclave to accommodate more threads.
More threads can be accommodated by an enclave with the addition of
Thread Control Structure (TCS) pages that is done by changing the
type of regular enclave pages to TCS pages using a new ioctl()
SGX_IOC_ENCLAVE_MODIFY_TYPE.

* Support removing regular and TCS pages from an initialized enclave.
Removing pages is accomplished in two stages as supported by two new
ioctl()s SGX_IOC_ENCLAVE_MODIFY_TYPE (same ioctl() as mentioned in
previous bullet) and SGX_IOC_ENCLAVE_REMOVE_PAGES.

* Tests covering all the new flows, some edge cases, and one
comprehensive stress scenario.

No additional work is needed to support SGX2 in a virtualized
environment. The tests included in this series can also be run from
a guest and was tested with the recent QEMU release based on 6.2.0
that supports SGX.

Patches 1 to 14 prepares the existing code for SGX2 support by
introducing the SGX2 functions, making sure pages remain accessible
after their enclave permissions are changed, and tracking enclave page
types as well as runtime permissions as needed by SGX2.

Patches 15 through 32 are a mix of x86/sgx and selftests/sgx patches
that follow the format where first an SGX2 feature is
enabled and then followed by tests of the new feature and/or
tests of scenarios that combine SGX2 features enabled up to that point.

In two cases (patches 20 and 31) code in support of SGX2 is separated
out with detailed motivation to support the review.

This series is based on v5.17-rc2 with the following fixes additionally
applied:

"selftests/sgx: Remove extra newlines in test output"
https://lore.kernel.org/linux-sgx/16317683a1822bbd44ab3ca48b60a9a217ac24de.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Ensure enclave data available during debug print"
https://lore.kernel.org/linux-sgx/eaaeeb9122916d831942fc8a3043c687137314c1.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Do not attempt enclave build without valid enclave"
https://lore.kernel.org/linux-sgx/4e4ea6d70c286c209964bec1e8d29ac8e692748b.1643754040.git.reinette.chatre@intel.com/
"selftests/sgx: Fix NULL-pointer-dereference upon early test failure"
https://lore.kernel.org/linux-sgx/89824888783fd8e770bfc64530c7549650a41851.1643754040.git.reinette.chatre@intel.com/
"x86/sgx: Add poison handling to reclaimer"
https://lore.kernel.org/linux-sgx/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com/
"x86/sgx: Silence softlockup detection when releasing large enclaves"
https://lore.kernel.org/linux-sgx/b5e9f218064aa76e3026f778e1ad0a1d823e3db8.1643133224.git.reinette.chatre@intel.com/

Your feedback will be greatly appreciated.

Regards,

Reinette

Reinette Chatre (32):
x86/sgx: Add short descriptions to ENCLS wrappers
x86/sgx: Add wrapper for SGX2 EMODPR function
x86/sgx: Add wrapper for SGX2 EMODT function
x86/sgx: Add wrapper for SGX2 EAUG function
Documentation/x86: Document SGX permission details
x86/sgx: Support VMA permissions more relaxed than enclave permissions
x86/sgx: Add pfn_mkwrite() handler for present PTEs
x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic
permission changes
x86/sgx: Export sgx_encl_ewb_cpumask()
x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask()
x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()
x86/sgx: Make sgx_ipi_cb() available internally
x86/sgx: Create utility to validate user provided offset and length
x86/sgx: Keep record of SGX page type
x86/sgx: Support relaxing of enclave page permissions
x86/sgx: Support restricting of enclave page permissions
selftests/sgx: Add test for EPCM permission changes
selftests/sgx: Add test for TCS page permission changes
x86/sgx: Support adding of pages to an initialized enclave
x86/sgx: Tighten accessible memory range after enclave initialization
selftests/sgx: Test two different SGX2 EAUG flows
x86/sgx: Support modifying SGX page type
x86/sgx: Support complete page removal
Documentation/x86: Introduce enclave runtime management section
selftests/sgx: Introduce dynamic entry point
selftests/sgx: Introduce TCS initialization enclave operation
selftests/sgx: Test complete changing of page type flow
selftests/sgx: Test faulty enclave behavior
selftests/sgx: Test invalid access to removed enclave page
selftests/sgx: Test reclaiming of untouched page
x86/sgx: Free up EPC pages directly to support large page ranges
selftests/sgx: Page removal stress test

Documentation/x86/sgx.rst | 64 +-
arch/x86/include/asm/sgx.h | 8 +
arch/x86/include/uapi/asm/sgx.h | 81 +
arch/x86/kernel/cpu/sgx/encl.c | 334 +++-
arch/x86/kernel/cpu/sgx/encl.h | 12 +-
arch/x86/kernel/cpu/sgx/encls.h | 33 +
arch/x86/kernel/cpu/sgx/ioctl.c | 831 ++++++++-
arch/x86/kernel/cpu/sgx/main.c | 70 +-
arch/x86/kernel/cpu/sgx/sgx.h | 3 +
tools/testing/selftests/sgx/defines.h | 23 +
tools/testing/selftests/sgx/load.c | 41 +
tools/testing/selftests/sgx/main.c | 1484 +++++++++++++++++
tools/testing/selftests/sgx/main.h | 1 +
tools/testing/selftests/sgx/test_encl.c | 68 +
.../selftests/sgx/test_encl_bootstrap.S | 6 +
15 files changed, 2963 insertions(+), 96 deletions(-)


base-commit: 26291c54e111ff6ba87a164d85d4a4e134b7315c
prerequisite-patch-id: 3c3908f1c3536cc04ba020fb3e81f51395b44223
prerequisite-patch-id: e860923423c3387cf6fdcceb2fa41dc5e9454ef4
prerequisite-patch-id: 986260c8bc4255eb61e2c4afa88d2b723e376423
prerequisite-patch-id: ba014a99fced2b57d5d9e2dfb9d80ddf4333c13e
prerequisite-patch-id: 65cbb72889b6353a5639b984615d12019136b270
prerequisite-patch-id: e3296a2f0345a77c8a7ca91f76697ae2e1dca21f
--
2.25.1



2022-02-09 11:44:12

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V2 12/32] x86/sgx: Make sgx_ipi_cb() available internally

The ETRACK function followed by an IPI to all CPUs within an enclave
is a common pattern with more frequent use in support of SGX2.

Make the (empty) IPI callback function available internally in
preparation for usage by SGX2.

Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V1:
- Replace "for more usages" by "for usage by SGX2" (Jarkko)

arch/x86/kernel/cpu/sgx/main.c | 2 +-
arch/x86/kernel/cpu/sgx/sgx.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index ce9e87d5f8ec..6e2cb7564080 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -172,7 +172,7 @@ static int __sgx_encl_ewb(struct sgx_epc_page *epc_page, void *va_slot,
return ret;
}

-static void sgx_ipi_cb(void *info)
+void sgx_ipi_cb(void *info)
{
}

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0f17def9fe6f..b30cee4de903 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -90,6 +90,8 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);

+void sgx_ipi_cb(void *info);
+
#ifdef CONFIG_X86_SGX_KVM
int __init sgx_vepc_init(void);
#else
--
2.25.1


2022-02-09 11:48:51

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V2 11/32] x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()

The SGX reclaimer removes page table entries pointing to pages that are
moved to swap.

SGX2 enables changes to pages belonging to an initialized enclave, thus
enclave pages may have their permission or type changed while the page
is being accessed by an enclave. Supporting SGX2 requires page table
entries to be removed so that any cached mappings to changed pages
are removed. For example, with the ability to change enclave page types
a regular enclave page may be changed to a Thread Control Structure
(TCS) page that may not be accessed by an enclave.

Factor out the code removing page table entries to a separate function
sgx_zap_enclave_ptes(), fixing accuracy of comments in the process,
and make it available to the upcoming SGX2 code.

Place sgx_zap_enclave_ptes() with the rest of the enclave code in
encl.c interacting with the page table since this code is no longer
unique to the reclaimer.

Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V1:
- Elaborate why SGX2 needs this ability (Jarkko).
- More specific subject.
- Fix kernel-doc to have brackets in function name.

arch/x86/kernel/cpu/sgx/encl.c | 45 +++++++++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/encl.h | 2 +-
arch/x86/kernel/cpu/sgx/main.c | 31 ++---------------------
3 files changed, 47 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 6f5d01121766..8da813504249 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -589,7 +589,7 @@ int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm)

spin_lock(&encl->mm_lock);
list_add_rcu(&encl_mm->list, &encl->mm_list);
- /* Pairs with smp_rmb() in sgx_reclaimer_block(). */
+ /* Pairs with smp_rmb() in sgx_zap_enclave_ptes(). */
smp_wmb();
encl->mm_list_version++;
spin_unlock(&encl->mm_lock);
@@ -778,6 +778,49 @@ int sgx_encl_test_and_clear_young(struct mm_struct *mm,
return ret;
}

+/**
+ * sgx_zap_enclave_ptes() - remove PTEs mapping the address from enclave
+ * @encl: the enclave
+ * @addr: page aligned pointer to single page for which PTEs will be removed
+ *
+ * Multiple VMAs may have an enclave page mapped. Remove the PTE mapping
+ * @addr from each VMA. Ensure that page fault handler is ready to handle
+ * new mappings of @addr before calling this function.
+ */
+void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr)
+{
+ unsigned long mm_list_version;
+ struct sgx_encl_mm *encl_mm;
+ struct vm_area_struct *vma;
+ int idx, ret;
+
+ do {
+ mm_list_version = encl->mm_list_version;
+
+ /* Pairs with smp_wmb() in sgx_encl_mm_add(). */
+ smp_rmb();
+
+ idx = srcu_read_lock(&encl->srcu);
+
+ list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
+ if (!mmget_not_zero(encl_mm->mm))
+ continue;
+
+ mmap_read_lock(encl_mm->mm);
+
+ ret = sgx_encl_find(encl_mm->mm, addr, &vma);
+ if (!ret && encl == vma->vm_private_data)
+ zap_vma_ptes(vma, addr, PAGE_SIZE);
+
+ mmap_read_unlock(encl_mm->mm);
+
+ mmput_async(encl_mm->mm);
+ }
+
+ srcu_read_unlock(&encl->srcu, idx);
+ } while (unlikely(encl->mm_list_version != mm_list_version));
+}
+
/**
* sgx_alloc_va_page() - Allocate a Version Array (VA) page
*
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index becb68503baa..82e21088e68b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -112,7 +112,7 @@ int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index,
void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write);
int sgx_encl_test_and_clear_young(struct mm_struct *mm,
struct sgx_encl_page *page);
-
+void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr);
struct sgx_epc_page *sgx_alloc_va_page(void);
unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page);
void sgx_free_va_slot(struct sgx_va_page *va_page, unsigned int offset);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index fa33922879bf..ce9e87d5f8ec 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -137,36 +137,9 @@ static void sgx_reclaimer_block(struct sgx_epc_page *epc_page)
struct sgx_encl_page *page = epc_page->owner;
unsigned long addr = page->desc & PAGE_MASK;
struct sgx_encl *encl = page->encl;
- unsigned long mm_list_version;
- struct sgx_encl_mm *encl_mm;
- struct vm_area_struct *vma;
- int idx, ret;
-
- do {
- mm_list_version = encl->mm_list_version;
-
- /* Pairs with smp_rmb() in sgx_encl_mm_add(). */
- smp_rmb();
-
- idx = srcu_read_lock(&encl->srcu);
-
- list_for_each_entry_rcu(encl_mm, &encl->mm_list, list) {
- if (!mmget_not_zero(encl_mm->mm))
- continue;
-
- mmap_read_lock(encl_mm->mm);
-
- ret = sgx_encl_find(encl_mm->mm, addr, &vma);
- if (!ret && encl == vma->vm_private_data)
- zap_vma_ptes(vma, addr, PAGE_SIZE);
-
- mmap_read_unlock(encl_mm->mm);
-
- mmput_async(encl_mm->mm);
- }
+ int ret;

- srcu_read_unlock(&encl->srcu, idx);
- } while (unlikely(encl->mm_list_version != mm_list_version));
+ sgx_zap_enclave_ptes(encl, addr);

mutex_lock(&encl->lock);

--
2.25.1


2022-02-09 11:49:31

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V2 14/32] x86/sgx: Keep record of SGX page type

SGX2 functions are not allowed on all page types. For example,
ENCLS[EMODPR] is only allowed on regular SGX enclave pages and
ENCLS[EMODPT] is only allowed on TCS and regular pages. If these
functions are attempted on another type of page the hardware would
trigger a fault.

Keep a record of the SGX page type so that there is more
certainty whether an SGX2 instruction can succeed and faults
can be treated as real failures.

The page type is a property of struct sgx_encl_page
and thus does not cover the VA page type. VA pages are maintained
in separate structures and their type can be determined in
a different way. The SGX2 instructions needing the page type do not
operate on VA pages and this is thus not a scenario needing to
be covered at this time.

With the protection bits consuming 16 bits of the unsigned long
there is room available in the bitfield to include the page type
information without increasing the space consumed by the struct.

Acked-by: Jarkko Sakkinen <[email protected]>
Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V1:
- Add Acked-by from Jarkko.

arch/x86/include/asm/sgx.h | 3 +++
arch/x86/kernel/cpu/sgx/encl.h | 1 +
arch/x86/kernel/cpu/sgx/ioctl.c | 2 ++
3 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index d67810b50a81..eae20fa52b93 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -239,6 +239,9 @@ struct sgx_pageinfo {
* %SGX_PAGE_TYPE_REG: a regular page
* %SGX_PAGE_TYPE_VA: a VA page
* %SGX_PAGE_TYPE_TRIM: a page in trimmed state
+ *
+ * Make sure when making changes to this enum that its values can still fit
+ * in the bitfield within &struct sgx_encl_page
*/
enum sgx_page_type {
SGX_PAGE_TYPE_SECS,
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index 82e21088e68b..cb9f16d457ac 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -29,6 +29,7 @@ struct sgx_encl_page {
unsigned long desc;
unsigned long vm_max_prot_bits:8;
unsigned long vm_run_prot_bits:8;
+ enum sgx_page_type type:16;
struct sgx_epc_page *epc_page;
struct sgx_encl *encl;
struct sgx_va_page *va_page;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 6e7cc441156b..b8336d5d9029 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -107,6 +107,7 @@ static int sgx_encl_create(struct sgx_encl *encl, struct sgx_secs *secs)
set_bit(SGX_ENCL_DEBUG, &encl->flags);

encl->secs.encl = encl;
+ encl->secs.type = SGX_PAGE_TYPE_SECS;
encl->base = secs->base;
encl->size = secs->size;
encl->attributes = secs->attributes;
@@ -350,6 +351,7 @@ static int sgx_encl_add_page(struct sgx_encl *encl, unsigned long src,
*/
encl_page->encl = encl;
encl_page->epc_page = epc_page;
+ encl_page->type = (secinfo->flags & SGX_SECINFO_PAGE_TYPE_MASK) >> 8;
encl->secs_child_cnt++;

if (flags & SGX_PAGE_MEASURE) {
--
2.25.1


2022-02-09 13:21:15

by Reinette Chatre

[permalink] [raw]
Subject: [PATCH V2 18/32] selftests/sgx: Add test for TCS page permission changes

Kernel should not allow permission changes on TCS pages. Add test to
confirm this behavior.

Signed-off-by: Reinette Chatre <[email protected]>
---
Changes since V1:
- Adapt test to the kernel interface changes: the ioctl() name change
and providing entire secinfo as parameter.
- Rewrite error path to reduce line lengths.

tools/testing/selftests/sgx/main.c | 74 ++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/sgx/main.c b/tools/testing/selftests/sgx/main.c
index 4f348ed1dc29..1398cd1b0983 100644
--- a/tools/testing/selftests/sgx/main.c
+++ b/tools/testing/selftests/sgx/main.c
@@ -121,6 +121,24 @@ static Elf64_Sym *vdso_symtab_get(struct vdso_symtab *symtab, const char *name)
return NULL;
}

+/*
+ * Return the offset in the enclave where the TCS segment can be found.
+ * The first RW segment loaded is the TCS.
+ */
+static off_t encl_get_tcs_offset(struct encl *encl)
+{
+ int i;
+
+ for (i = 0; i < encl->nr_segments; i++) {
+ struct encl_segment *seg = &encl->segment_tbl[i];
+
+ if (i == 0 && seg->prot == (PROT_READ | PROT_WRITE))
+ return seg->offset;
+ }
+
+ return -1;
+}
+
/*
* Return the offset in the enclave where the data segment can be found.
* The first RW segment loaded is the TCS, skip that to get info on the
@@ -567,6 +585,62 @@ TEST_F(enclave, pte_permissions)
EXPECT_EQ(self->run.exception_addr, 0);
}

+/*
+ * Modifying permissions of TCS page should not be possible.
+ */
+TEST_F(enclave, tcs_permissions)
+{
+ struct sgx_enclave_restrict_perm ioc;
+ struct sgx_secinfo secinfo;
+ int ret, errno_save;
+
+ ASSERT_TRUE(setup_test_encl(ENCL_HEAP_SIZE_DEFAULT, &self->encl, _metadata));
+
+ memset(&self->run, 0, sizeof(self->run));
+ self->run.tcs = self->encl.encl_base;
+
+ memset(&ioc, 0, sizeof(ioc));
+ memset(&secinfo, 0, sizeof(secinfo));
+
+ /*
+ * Ensure kernel supports needed ioctl() and system supports needed
+ * commands.
+ */
+
+ ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, &ioc);
+ errno_save = ret == -1 ? errno : 0;
+
+ /*
+ * Invalid parameters were provided during sanity check,
+ * expect command to fail.
+ */
+ ASSERT_EQ(ret, -1);
+
+ /* ret == -1 */
+ if (errno_save == ENOTTY)
+ SKIP(return,
+ "Kernel does not support SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS ioctl()");
+ else if (errno_save == ENODEV)
+ SKIP(return, "System does not support SGX2");
+
+ /*
+ * Attempt to make TCS page read-only. This is not allowed and
+ * should be prevented by the kernel.
+ */
+ secinfo.flags = PROT_READ;
+ ioc.offset = encl_get_tcs_offset(&self->encl);
+ ioc.length = PAGE_SIZE;
+ ioc.secinfo = (unsigned long)&secinfo;
+
+ ret = ioctl(self->encl.fd, SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, &ioc);
+ errno_save = ret == -1 ? errno : 0;
+
+ EXPECT_EQ(ret, -1);
+ EXPECT_EQ(errno_save, EINVAL);
+ EXPECT_EQ(ioc.result, 0);
+ EXPECT_EQ(ioc.count, 0);
+}
+
/*
* Enclave page permission test.
*
--
2.25.1


2022-02-22 20:57:36

by Nathaniel McCallum

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

1. This interface looks very odd to me. mmap() is the kernel interface
for changing user space memory maps. Why are we introducing a new
interface for this? You can just simply add a new mmap flag (i.e.
MAP_SGX_TCS*) and then figure out which SGX instructions to execute
based on the desired state of the memory maps. If you do this, none of
the following ioctls are needed:

* SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
* SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
* SGX_IOC_ENCLAVE_REMOVE_PAGES
* SGX_IOC_ENCLAVE_MODIFY_TYPE

It also means that languages don't have to grow support for all these
ioctls. Instead, they can just reuse the existing mmap() bindings with
the new flag. Also, multiple operations can be combined into a single
mmap() call, amortizing the changes over a single context switch.

2. Automatically adding pages with hard-coded permissions in a fault
handler seems like a really bad idea. How do you distinguish between
accesses which should result in an updated mapping and accesses that
should result in a fault? IMHO, all unmapped page accesses should
result in a page fault. mmap() should be called first to identify the
correct permissions for these pages. Then the page handler should be
updated to use the permissions from the mapping when backfilling
physical pages. If I understand correctly, this should also obviate
the need for the weird userspace callback to allow for execute
permissions.

3. Implementing as I've suggested also means that we can lock down an
enclave, for example - after code has been JITed, by closing the file
descriptor. Once the file descriptor used to create the enclave is
closed, no further mmap() can be performed on the enclave. Attempting
to do EACCEPT on an unmapped page will generate a page fault.

* - I'm aware that a new flag might be frowned upon. I see a few other options:
1. reuse an existing flag which doesn't make sense in this context
2. communicate the page type in the offset argument
3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE

On Mon, Feb 7, 2022 at 8:07 PM Reinette Chatre
<[email protected]> wrote:
>
> V1: https://lore.kernel.org/linux-sgx/[email protected]/
>
> Changes since V1 that directly impact user space:
> - SGX2 permission changes changed from a single ioctl() named
> SGX_IOC_PAGE_MODP to two new ioctl()s:
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
> SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS, supported by two different
> parameter structures (SGX_IOC_ENCLAVE_RELAX_PERMISSIONS does
> not support a result output parameter) (Jarkko).
>
> User space flow impact: After user space runs ENCLU[EMODPE] it
> needs to call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to have PTEs
> updated. Previously running SGX_IOC_PAGE_MODP in this scenario
> resulted in EPCM.PR being set but calling
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS will not result in EPCM.PR
> being set anymore and thus no need for an additional
> ENCLU[EACCEPT].
>
> - SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
> SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> obtain new permissions from secinfo as parameter instead of
> the permissions directly (Jarkko).
>
> - ioctl() supporting SGX2 page type change is renamed from
> SGX_IOC_PAGE_MODT to SGX_IOC_ENCLAVE_MODIFY_TYPE (Jarkko).
>
> - SGX_IOC_ENCLAVE_MODIFY_TYPE obtains new page type from secinfo
> as parameter instead of the page type directly (Jarkko).
>
> - ioctl() supporting SGX2 page removal is renamed from
> SGX_IOC_PAGE_REMOVE to SGX_IOC_ENCLAVE_REMOVE_PAGES (Jarkko).
>
> - All ioctl() parameter structures have been renamed as a result of the
> ioctl() renaming:
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS => struct sgx_enclave_relax_perm
> SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS => struct sgx_enclave_restrict_perm
> SGX_IOC_ENCLAVE_MODIFY_TYPE => struct sgx_enclave_modt
> SGX_IOC_ENCLAVE_REMOVE_PAGES => struct sgx_enclave_remove_pages
>
> Changes since V1 that do not directly impact user space:
> - Number of patches in series increased from 25 to 32 primarily because
> of splitting the original submission:
> - Wrappers for the new SGX2 functions are introduced in three separate
> patches replacing the original "x86/sgx: Add wrappers for SGX2
> functions"
> (Jarkko).
> - Moving and renaming sgx_encl_ewb_cpumask() is done with two patches
> replacing the original "x86/sgx: Use more generic name for enclave
> cpumask function" (Jarkko).
> - Support for SGX2 EPCM permission changes is split into two ioctls(),
> one for relaxing and one for restricting permissions, each introduced
> by a new patch replacing the original "x86/sgx: Support enclave page
> permission changes" (Jarkko).
> - Extracted code used by existing ioctls() for usage by new ioctl()s
> into a new utility in new patch "x86/sgx: Create utility to validate
> user provided offset and length" (Dave did not specifically ask for
> this but it addresses his review feedback).
> - Two new Documentation patches to support the SGX2 work
> ("Documentation/x86: Introduce enclave runtime management") and
> a dedicated section on the enclave permission management
> ("Documentation/x86: Document SGX permission details") (Andy).
> - Most patches were reworked to improve the language by:
> * aiming to refer to exact item instead of English rephrasing (Jarkko).
> * use ioctl() instead of ioctl throughout (Dave).
> * Use "relaxed" instead of "exceed" when referring to permissions
> (Dave).
> - Improved documentation with several additions to
> Documentation/x86/sgx.rst.
> - Many smaller changes, please refer to individual patches.
>
> Hi Everybody,
>
> The current Linux kernel support for SGX includes support for SGX1 that
> requires that an enclave be created with properties that accommodate all
> usages over its (the enclave's) lifetime. This includes properties such
> as permissions of enclave pages, the number of enclave pages, and the
> number of threads supported by the enclave.
>
> Consequences of this requirement to have the enclave be created to
> accommodate all usages include:
> * pages needing to support relocated code are required to have RWX
> permissions for their entire lifetime,
> * an enclave needs to be created with the maximum stack and heap
> projected to be needed during the enclave's entire lifetime which
> can be longer than the processes running within it,
> * an enclave needs to be created with support for the maximum number
> of threads projected to run in the enclave.
>
> Since SGX1 a few more functions were introduced, collectively called
> SGX2, that support modifications to an initialized enclave. Hardware
> supporting these functions are already available as listed on
> https://github.com/ayeks/SGX-hardware
>
> This series adds support for SGX2, also referred to as Enclave Dynamic
> Memory Management (EDMM). This includes:
>
> * Support modifying permissions of regular enclave pages belonging to an
> initialized enclave. New permissions are not allowed to exceed the
> originally vetted permissions. For example, RX isn't allowed unless
> the page was originally added with RX or RWX.
> Modifying permissions is accomplished with two new ioctl()s:
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS and
> SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS.
>
> * Support dynamic addition of regular enclave pages to an initialized
> enclave. Pages are added with RW permissions as their "originally
> vetted permissions" (see previous bullet) and thus not allowed to
> be made executable at this time. Enabling dynamically added pages
> to obtain executable permissions require integration with user space
> policy that is deferred until the core SGX2 enabling is complete.
> Pages are dynamically added to an initialized enclave from the SGX
> page fault handler.
>
> * Support expanding an initialized enclave to accommodate more threads.
> More threads can be accommodated by an enclave with the addition of
> Thread Control Structure (TCS) pages that is done by changing the
> type of regular enclave pages to TCS pages using a new ioctl()
> SGX_IOC_ENCLAVE_MODIFY_TYPE.
>
> * Support removing regular and TCS pages from an initialized enclave.
> Removing pages is accomplished in two stages as supported by two new
> ioctl()s SGX_IOC_ENCLAVE_MODIFY_TYPE (same ioctl() as mentioned in
> previous bullet) and SGX_IOC_ENCLAVE_REMOVE_PAGES.
>
> * Tests covering all the new flows, some edge cases, and one
> comprehensive stress scenario.
>
> No additional work is needed to support SGX2 in a virtualized
> environment. The tests included in this series can also be run from
> a guest and was tested with the recent QEMU release based on 6.2.0
> that supports SGX.
>
> Patches 1 to 14 prepares the existing code for SGX2 support by
> introducing the SGX2 functions, making sure pages remain accessible
> after their enclave permissions are changed, and tracking enclave page
> types as well as runtime permissions as needed by SGX2.
>
> Patches 15 through 32 are a mix of x86/sgx and selftests/sgx patches
> that follow the format where first an SGX2 feature is
> enabled and then followed by tests of the new feature and/or
> tests of scenarios that combine SGX2 features enabled up to that point.
>
> In two cases (patches 20 and 31) code in support of SGX2 is separated
> out with detailed motivation to support the review.
>
> This series is based on v5.17-rc2 with the following fixes additionally
> applied:
>
> "selftests/sgx: Remove extra newlines in test output"
> https://lore.kernel.org/linux-sgx/16317683a1822bbd44ab3ca48b60a9a217ac24de.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Ensure enclave data available during debug print"
> https://lore.kernel.org/linux-sgx/eaaeeb9122916d831942fc8a3043c687137314c1.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Do not attempt enclave build without valid enclave"
> https://lore.kernel.org/linux-sgx/4e4ea6d70c286c209964bec1e8d29ac8e692748b.1643754040.git.reinette.chatre@intel.com/
> "selftests/sgx: Fix NULL-pointer-dereference upon early test failure"
> https://lore.kernel.org/linux-sgx/89824888783fd8e770bfc64530c7549650a41851.1643754040.git.reinette.chatre@intel.com/
> "x86/sgx: Add poison handling to reclaimer"
> https://lore.kernel.org/linux-sgx/dcc95eb2aaefb042527ac50d0a50738c7c160dac.1643830353.git.reinette.chatre@intel.com/
> "x86/sgx: Silence softlockup detection when releasing large enclaves"
> https://lore.kernel.org/linux-sgx/b5e9f218064aa76e3026f778e1ad0a1d823e3db8.1643133224.git.reinette.chatre@intel.com/
>
> Your feedback will be greatly appreciated.
>
> Regards,
>
> Reinette
>
> Reinette Chatre (32):
> x86/sgx: Add short descriptions to ENCLS wrappers
> x86/sgx: Add wrapper for SGX2 EMODPR function
> x86/sgx: Add wrapper for SGX2 EMODT function
> x86/sgx: Add wrapper for SGX2 EAUG function
> Documentation/x86: Document SGX permission details
> x86/sgx: Support VMA permissions more relaxed than enclave permissions
> x86/sgx: Add pfn_mkwrite() handler for present PTEs
> x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic
> permission changes
> x86/sgx: Export sgx_encl_ewb_cpumask()
> x86/sgx: Rename sgx_encl_ewb_cpumask() as sgx_encl_cpumask()
> x86/sgx: Move PTE zap code to new sgx_zap_enclave_ptes()
> x86/sgx: Make sgx_ipi_cb() available internally
> x86/sgx: Create utility to validate user provided offset and length
> x86/sgx: Keep record of SGX page type
> x86/sgx: Support relaxing of enclave page permissions
> x86/sgx: Support restricting of enclave page permissions
> selftests/sgx: Add test for EPCM permission changes
> selftests/sgx: Add test for TCS page permission changes
> x86/sgx: Support adding of pages to an initialized enclave
> x86/sgx: Tighten accessible memory range after enclave initialization
> selftests/sgx: Test two different SGX2 EAUG flows
> x86/sgx: Support modifying SGX page type
> x86/sgx: Support complete page removal
> Documentation/x86: Introduce enclave runtime management section
> selftests/sgx: Introduce dynamic entry point
> selftests/sgx: Introduce TCS initialization enclave operation
> selftests/sgx: Test complete changing of page type flow
> selftests/sgx: Test faulty enclave behavior
> selftests/sgx: Test invalid access to removed enclave page
> selftests/sgx: Test reclaiming of untouched page
> x86/sgx: Free up EPC pages directly to support large page ranges
> selftests/sgx: Page removal stress test
>
> Documentation/x86/sgx.rst | 64 +-
> arch/x86/include/asm/sgx.h | 8 +
> arch/x86/include/uapi/asm/sgx.h | 81 +
> arch/x86/kernel/cpu/sgx/encl.c | 334 +++-
> arch/x86/kernel/cpu/sgx/encl.h | 12 +-
> arch/x86/kernel/cpu/sgx/encls.h | 33 +
> arch/x86/kernel/cpu/sgx/ioctl.c | 831 ++++++++-
> arch/x86/kernel/cpu/sgx/main.c | 70 +-
> arch/x86/kernel/cpu/sgx/sgx.h | 3 +
> tools/testing/selftests/sgx/defines.h | 23 +
> tools/testing/selftests/sgx/load.c | 41 +
> tools/testing/selftests/sgx/main.c | 1484 +++++++++++++++++
> tools/testing/selftests/sgx/main.h | 1 +
> tools/testing/selftests/sgx/test_encl.c | 68 +
> .../selftests/sgx/test_encl_bootstrap.S | 6 +
> 15 files changed, 2963 insertions(+), 96 deletions(-)
>
>
> base-commit: 26291c54e111ff6ba87a164d85d4a4e134b7315c
> prerequisite-patch-id: 3c3908f1c3536cc04ba020fb3e81f51395b44223
> prerequisite-patch-id: e860923423c3387cf6fdcceb2fa41dc5e9454ef4
> prerequisite-patch-id: 986260c8bc4255eb61e2c4afa88d2b723e376423
> prerequisite-patch-id: ba014a99fced2b57d5d9e2dfb9d80ddf4333c13e
> prerequisite-patch-id: 65cbb72889b6353a5639b984615d12019136b270
> prerequisite-patch-id: e3296a2f0345a77c8a7ca91f76697ae2e1dca21f
> --
> 2.25.1
>

2022-02-23 01:43:17

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

Hi Nathaniel,

On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> 1. This interface looks very odd to me. mmap() is the kernel interface
> for changing user space memory maps. Why are we introducing a new
> interface for this?

mmap() is the kernel interface used to create new mappings in the
virtual address space of the calling process. This is different from
the permissions and properties of the underlying file/memory being mapped.

A new interface is introduced because changes need to be made to the
permissions and properties of the underlying enclave. A new virtual
address space is not needed nor should existing VMAs be impacted.

This is similar to how mmap() is not used to change file permissions.

VMA permissions are separate from enclave page permissions as found in
the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
distinguishes between the VMA and EPCM permissions - for example, it is
already possible to create a read-only VMA from enclave pages that have
RW EPCM permissions. mmap() of a portion of EPC memory with a particular
permission does not imply that the underlying EPCM permissions (should)have
that permission.

> You can just simply add a new mmap flag (i.e.
> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> based on the desired state of the memory maps. If you do this, none of
> the following ioctls are needed:
>
> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> * SGX_IOC_ENCLAVE_REMOVE_PAGES
> * SGX_IOC_ENCLAVE_MODIFY_TYPE
>
> It also means that languages don't have to grow support for all these
> ioctls. Instead, they can just reuse the existing mmap() bindings with
> the new flag. Also, multiple operations can be combined into a single
> mmap() call, amortizing the changes over a single context switch.
>
> 2. Automatically adding pages with hard-coded permissions in a fault
> handler seems like a really bad idea.

Could you please elaborate why this is a bad idea?

> How do you distinguish between
> accesses which should result in an updated mapping and accesses that
> should result in a fault?

Accesses that should result in an updated mapping have two requirements:
(a) address accessed belongs to the enclave based on the address
range specified during enclave create
(b) there is no backing enclave page for the address

> IMHO, all unmapped page accesses should
> result in a page fault. mmap() should be called first to identify the
> correct permissions for these pages.
> Then the page handler should be
> updated to use the permissions from the mapping when backfilling
> physical pages. If I understand correctly, this should also obviate

Regular enclave pages can _only_ be dynamically added with RW permission.

SGX2's support for adding regular pages to an enclave via the EAUG
instruction is architecturally set at RW. The OS cannot change those permissions
via the EAUG instruction nor can the OS do so with a different/additional
instruction because:
* the OS is not able to relax permissions since that can only be done from
within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
dynamically add pages via EAUG as RW and then relax permissions to RWX.
* the OS is not able to EAUG a page and immediately attempt an EMODPR either
as Jarkko also recently inquired about:
https://lore.kernel.org/linux-sgx/[email protected]/

> the need for the weird userspace callback to allow for execute
> permissions.

User policy integration would always be required to allow execute
permissions on a writable page. This is not expected to be a userspace
callback but instead integration with existing user policy subsystem(s).

>
> 3. Implementing as I've suggested also means that we can lock down an
> enclave, for example - after code has been JITed, by closing the file
> descriptor. Once the file descriptor used to create the enclave is
> closed, no further mmap() can be performed on the enclave. Attempting
> to do EACCEPT on an unmapped page will generate a page fault.

This is not clear to me. If the file descriptor is closed and no further
mmap() is allowed then how would a process be able to enter the enclave
to execute code within it?

This series does indeed lock down the address range to ensure that it is
not possible to map memory that does not belong to the enclave after the
enclave is created. Please see:
https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/

>
> * - I'm aware that a new flag might be frowned upon. I see a few other options:
> 1. reuse an existing flag which doesn't make sense in this context
> 2. communicate the page type in the offset argument
> 3. keep SGX_IOC_ENCLAVE_MODIFY_TYPE
>

Reinette

2022-02-23 21:54:13

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

Hi Nathaniel,

On 2/23/2022 5:24 AM, Nathaniel McCallum wrote:
> On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
> <[email protected]> wrote:
>>
>> Hi Nathaniel,
>>
>> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
>>> 1. This interface looks very odd to me. mmap() is the kernel interface
>>> for changing user space memory maps. Why are we introducing a new
>>> interface for this?
>>
>> mmap() is the kernel interface used to create new mappings in the
>> virtual address space of the calling process. This is different from
>> the permissions and properties of the underlying file/memory being mapped.
>>
>> A new interface is introduced because changes need to be made to the
>> permissions and properties of the underlying enclave. A new virtual
>> address space is not needed nor should existing VMAs be impacted.
>>
>> This is similar to how mmap() is not used to change file permissions.
>>
>> VMA permissions are separate from enclave page permissions as found in
>> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
>> distinguishes between the VMA and EPCM permissions - for example, it is
>> already possible to create a read-only VMA from enclave pages that have
>> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
>> permission does not imply that the underlying EPCM permissions (should)have
>> that permission.
>
> Yes. BUT... unlike the file permissions, this leaks an implementation detail.

Not really - just like a RW file can be mapped read-only or RW, RW enclave
memory can be mapped read-only or RW.

>
> The user process is governed by VMA permissions. And during enclave
> creation, it had to mmap() all the enclave regions to their final VMA
> permissions. So during enclave creation you have to use mmap() but
> after enclave creation you use custom APIs? That's inconsistent at
> best.

No. ioctl()s are consistently used to manage enclave memory.

The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES,
and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory.

The new ioctls() are used to manage enclave memory after enclave initialization.

The enclave memory is thus managed with a consistent interface.

mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address
for the enclave that is required by the ioctl(). The rest of the ioctl()s,
existing and new, are consistent in interface by not requiring a memory
mapping but instead work from an offset from the base address.

> Forcing userspace to worry about the (mostly undocumented!)
> interactions between EPC, PTE and VMA permissions makes these APIs
> hard to use and difficult to reason about.

This is not new. The current SGX1 user space is already prevented from
creating a mapping of enclave memory that is more relaxed than the enclave
memory. For example, if the enclave memory has RW EPCM permissions then it
is not possible to mmap() that memory as RWX.

>
> When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
> mmap() to update the VMA permissions to match? It isn't clear. Nor is

mprotect() may be the better call to use.

> it really clear why I'm calling completely separate APIs.
>
>>> You can just simply add a new mmap flag (i.e.
>>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
>>> based on the desired state of the memory maps. If you do this, none of
>>> the following ioctls are needed:
>>>
>>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
>>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
>>> * SGX_IOC_ENCLAVE_REMOVE_PAGES
>>> * SGX_IOC_ENCLAVE_MODIFY_TYPE
>>>
>>> It also means that languages don't have to grow support for all these
>>> ioctls. Instead, they can just reuse the existing mmap() bindings with
>>> the new flag. Also, multiple operations can be combined into a single
>>> mmap() call, amortizing the changes over a single context switch.
>>>
>>> 2. Automatically adding pages with hard-coded permissions in a fault
>>> handler seems like a really bad idea.
>>
>> Could you please elaborate why this is a bad idea?
>
> Because implementations that miss this subtlety suddenly have pages
> with magic permissions. Magic is bad. Explicit is good.
>

There is no magic. Any new pages have to be accepted by the enclave.
The enclave will not be able to access these pages unless explicitly
accepted, ENCLU[EACCEPT], from within the enclave.

>>> How do you distinguish between
>>> accesses which should result in an updated mapping and accesses that
>>> should result in a fault?
>>
>> Accesses that should result in an updated mapping have two requirements:
>> (a) address accessed belongs to the enclave based on the address
>> range specified during enclave create
>> (b) there is no backing enclave page for the address
>
> What happens if the enclave is buggy? Or has been compromised. In both
> of those cases, there should be a userspace visible fault and pages
> should not be added.

If user space accesses a memory address with a regular read/write that
results in a new page added then there is indeed a user space visible
fault. You can see this flow in action in the "augment" test case in
https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/

If user space indeed wants the page after encountering such a fault then
it needs to enter the enclave again, from a different entry point, to
run ENCLU[EACCEPT], before it can return to the original entry point to
continue execution from the instruction that triggered the original read/write.

The only flow where a page is added without a user space visible fault
is when user space explicitly runs the ENCLU[EACCEPT] to do so.

>
>>> IMHO, all unmapped page accesses should
>>> result in a page fault. mmap() should be called first to identify the
>>> correct permissions for these pages.
>>> Then the page handler should be
>>> updated to use the permissions from the mapping when backfilling
>>> physical pages. If I understand correctly, this should also obviate
>>
>> Regular enclave pages can _only_ be dynamically added with RW permission.
>>
>> SGX2's support for adding regular pages to an enclave via the EAUG
>> instruction is architecturally set at RW. The OS cannot change those permissions
>> via the EAUG instruction nor can the OS do so with a different/additional
>> instruction because:
>> * the OS is not able to relax permissions since that can only be done from
>> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
>> dynamically add pages via EAUG as RW and then relax permissions to RWX.
>> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
>> as Jarkko also recently inquired about:
>> https://lore.kernel.org/linux-sgx/[email protected]/
>
> This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
> PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
> permissions. Why doesn't EAUG just respect the permissions in
> PAGEINFO.SECINFO? We aren't told.

This design is finished and respects the hardware specification. You can find
the details in the SDM's documentation of the EAUG function.

If the SECINFO field has a value then the hardware requires it to indicate
that it is a new shadow stack page being added, not a regular page. Support for
shadow stack pages is not in scope for this work. Attempting to dynamically
add a regular page with explicit permissions will result in a #GP(0).

The only way to add a regular enclave page is to make the SECINFO field empty
and doing so forces the page type to be a regular page and the permissions to
be RW.

>
> Further, if the enclave can do EMODPE, why does
> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
> documentation explains what this ioctl even does. Does it update PTE
> permissions? VMA permissions? Nobody knows without reading the source
> code.

Build the documentation (after applying this series) and it should
contain all the information you are searching for. As is the current custom
in the SGX documentation the built documentation pulls its content from
the kernel doc of the functions that implement the core of the
user space interactions.

>
> Userspace should not be bothered with the subtle details of the
> interaction between EPC, PTE and VMA permissions. But this API does
> everything it can do to expose all these details to userspace. And it
> doesn't bother to document them (probably because it is hard). It
> would be much better to avoid exposing these details to userspace.
>
> IMHO, there should be a simple flow like this (if EAUG respects
> PAGEINFO.SECINFO):

EAUG does not respect PAGEINFO.SECINFO for regular pages.

>
> 1. Non-enclave calls mmap()/munmap().
> 2. Enclave issues EACCEPT, if necessary.
> 3. Enclave issues EMODPE, if necessary.
>
> Notice that in the second step above, during the mmap() call, the
> kernel ensures that EPC, PTE and VMA are in sync and fails if they
> cannot be made to be compatible. Also note that in the above flow EAUG
> instructions can be efficiently batched.
>
> Given the current poor state of the EAUG instruction, we might need to
> do this flow instead:
>
> 1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
> 2. Non-enclave calls mmap()/munmap().
> 3. Enclave issues EACCEPT, if necessary.
> 4. Enclave issues EMODPE, if necessary.
>
> However, doing EAUG only via the page access handler means that there
> is no way to batch EAUG instructions and this forces a context switch
> for every page you want to add. This has to be terrible for
> performance. Note specifically that the SDM calls out batching, which
> is currently impossible under this patch set. 35.5.7 - "Page
> allocation operations may be batched to improve efficiency."

These page functions are all per-page so it is not possible to add multiple
pages with a single instruction. It is indeed possible to pre-fault pages.

> As it stands today, if I want to add 256MiB of pages to an enclave,
> I'll have to do 2^16 context switches. That doesn't seem scalable.

No. Running ENCLU[EACCEPT] on each of the pages within that range should not
need any explicit context switch out of the enclave. See the "augment_via_eaccept"
test case in:
https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/


>>> the need for the weird userspace callback to allow for execute
>>> permissions.
>>
>> User policy integration would always be required to allow execute
>> permissions on a writable page. This is not expected to be a userspace
>> callback but instead integration with existing user policy subsystem(s).
>
> Why? This isn't documented.

This is similar to the existing policies involved in managing the permissions
of mapped memory. When user space calls mprotect() to change permissions
of a mapped region then the kernel will not blindly allow the permissions but
instead ensure that it is allowed based on user policy by calling the LSM
(Linux Security Module) hooks.

You can learn more about LSM and various security modules at:
Documentation/security/lsm.rst
Documentation/admin-guide/LSM/*

You can compare what is needed here to what is currently done when user space
attempts to make some memory executable (see:
mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs
to help the kernel determine if this is allowed. For example, when SELinux is
the security module of choice then the process or file (depending on what type
of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP,
PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this.

Integration with user space policy is required for RWX of dynamically added pages
to be supported. In this series dynamically added pages will not be allowed to
be made executable, a follow-up series will add support for user policy
integration to support RWX permissions of dynamically added pages.

>>> 3. Implementing as I've suggested also means that we can lock down an
>>> enclave, for example - after code has been JITed, by closing the file
>>> descriptor. Once the file descriptor used to create the enclave is
>>> closed, no further mmap() can be performed on the enclave. Attempting
>>> to do EACCEPT on an unmapped page will generate a page fault.
>>
>> This is not clear to me. If the file descriptor is closed and no further
>> mmap() is allowed then how would a process be able to enter the enclave
>> to execute code within it?
>
> EENTER (or the vdso function) with the address of a TCS page, like
> normal. In Enarx, we don't retain the enclave fd after the final
> mmap() following EINIT. Everything works just fine.

The OS fault handler is responsible for managing the PTEs that is required
for the enclave to be able to access the memory within the enclave.
The OS fault handler is attached to a VMA that is created with mmap().

>
>> This series does indeed lock down the address range to ensure that it is
>> not possible to map memory that does not belong to the enclave after the
>> enclave is created. Please see:
>> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/
>
> That's not what I'm talking about. I'm talking about a workflow like this:
>
> 1. Enclave initialization: ECREATE ... EINIT
> 2. EENTER
> 3. Enclave JITs some code (changes page permissions)
> 4. EEXIT
> 5. Close enclave fd.
> 6. EENTER
> 7. If an enclave attempts page modifications, a fault occurs.

The original fd that was created to obtain the enclave base address
may be closed at (5) but the executable and data portions of the enclave
still needs to be mapped afterwards to be able to have OS support for
managing the PTEs that the enclave depends on to access those pages.

>
> Think of this similar to seccomp(). The enclave wants to do some
> dynamic page table manipulation. But then it wants to lock down page
> table modification so that, if compromised, attackers have no ability
> to obtain RWX permissions.

Reinette

2022-02-24 01:46:02

by Nathaniel McCallum

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Nathaniel,
>
> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> > 1. This interface looks very odd to me. mmap() is the kernel interface
> > for changing user space memory maps. Why are we introducing a new
> > interface for this?
>
> mmap() is the kernel interface used to create new mappings in the
> virtual address space of the calling process. This is different from
> the permissions and properties of the underlying file/memory being mapped.
>
> A new interface is introduced because changes need to be made to the
> permissions and properties of the underlying enclave. A new virtual
> address space is not needed nor should existing VMAs be impacted.
>
> This is similar to how mmap() is not used to change file permissions.
>
> VMA permissions are separate from enclave page permissions as found in
> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
> distinguishes between the VMA and EPCM permissions - for example, it is
> already possible to create a read-only VMA from enclave pages that have
> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
> permission does not imply that the underlying EPCM permissions (should)have
> that permission.

Yes. BUT... unlike the file permissions, this leaks an implementation detail.

The user process is governed by VMA permissions. And during enclave
creation, it had to mmap() all the enclave regions to their final VMA
permissions. So during enclave creation you have to use mmap() but
after enclave creation you use custom APIs? That's inconsistent at
best.

Forcing userspace to worry about the (mostly undocumented!)
interactions between EPC, PTE and VMA permissions makes these APIs
hard to use and difficult to reason about.

When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
mmap() to update the VMA permissions to match? It isn't clear. Nor is
it really clear why I'm calling completely separate APIs.

> > You can just simply add a new mmap flag (i.e.
> > MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> > based on the desired state of the memory maps. If you do this, none of
> > the following ioctls are needed:
> >
> > * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> > * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> > * SGX_IOC_ENCLAVE_REMOVE_PAGES
> > * SGX_IOC_ENCLAVE_MODIFY_TYPE
> >
> > It also means that languages don't have to grow support for all these
> > ioctls. Instead, they can just reuse the existing mmap() bindings with
> > the new flag. Also, multiple operations can be combined into a single
> > mmap() call, amortizing the changes over a single context switch.
> >
> > 2. Automatically adding pages with hard-coded permissions in a fault
> > handler seems like a really bad idea.
>
> Could you please elaborate why this is a bad idea?

Because implementations that miss this subtlety suddenly have pages
with magic permissions. Magic is bad. Explicit is good.

> > How do you distinguish between
> > accesses which should result in an updated mapping and accesses that
> > should result in a fault?
>
> Accesses that should result in an updated mapping have two requirements:
> (a) address accessed belongs to the enclave based on the address
> range specified during enclave create
> (b) there is no backing enclave page for the address

What happens if the enclave is buggy? Or has been compromised. In both
of those cases, there should be a userspace visible fault and pages
should not be added.

> > IMHO, all unmapped page accesses should
> > result in a page fault. mmap() should be called first to identify the
> > correct permissions for these pages.
> > Then the page handler should be
> > updated to use the permissions from the mapping when backfilling
> > physical pages. If I understand correctly, this should also obviate
>
> Regular enclave pages can _only_ be dynamically added with RW permission.
>
> SGX2's support for adding regular pages to an enclave via the EAUG
> instruction is architecturally set at RW. The OS cannot change those permissions
> via the EAUG instruction nor can the OS do so with a different/additional
> instruction because:
> * the OS is not able to relax permissions since that can only be done from
> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
> dynamically add pages via EAUG as RW and then relax permissions to RWX.
> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
> as Jarkko also recently inquired about:
> https://lore.kernel.org/linux-sgx/[email protected]/

This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
permissions. Why doesn't EAUG just respect the permissions in
PAGEINFO.SECINFO? We aren't told.

Further, if the enclave can do EMODPE, why does
SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
documentation explains what this ioctl even does. Does it update PTE
permissions? VMA permissions? Nobody knows without reading the source
code.

Userspace should not be bothered with the subtle details of the
interaction between EPC, PTE and VMA permissions. But this API does
everything it can do to expose all these details to userspace. And it
doesn't bother to document them (probably because it is hard). It
would be much better to avoid exposing these details to userspace.

IMHO, there should be a simple flow like this (if EAUG respects
PAGEINFO.SECINFO):

1. Non-enclave calls mmap()/munmap().
2. Enclave issues EACCEPT, if necessary.
3. Enclave issues EMODPE, if necessary.

Notice that in the second step above, during the mmap() call, the
kernel ensures that EPC, PTE and VMA are in sync and fails if they
cannot be made to be compatible. Also note that in the above flow EAUG
instructions can be efficiently batched.

Given the current poor state of the EAUG instruction, we might need to
do this flow instead:

1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
2. Non-enclave calls mmap()/munmap().
3. Enclave issues EACCEPT, if necessary.
4. Enclave issues EMODPE, if necessary.

However, doing EAUG only via the page access handler means that there
is no way to batch EAUG instructions and this forces a context switch
for every page you want to add. This has to be terrible for
performance. Note specifically that the SDM calls out batching, which
is currently impossible under this patch set. 35.5.7 - "Page
allocation operations may be batched to improve efficiency."

As it stands today, if I want to add 256MiB of pages to an enclave,
I'll have to do 2^16 context switches. That doesn't seem scalable.

> > the need for the weird userspace callback to allow for execute
> > permissions.
>
> User policy integration would always be required to allow execute
> permissions on a writable page. This is not expected to be a userspace
> callback but instead integration with existing user policy subsystem(s).

Why? This isn't documented.

> > 3. Implementing as I've suggested also means that we can lock down an
> > enclave, for example - after code has been JITed, by closing the file
> > descriptor. Once the file descriptor used to create the enclave is
> > closed, no further mmap() can be performed on the enclave. Attempting
> > to do EACCEPT on an unmapped page will generate a page fault.
>
> This is not clear to me. If the file descriptor is closed and no further
> mmap() is allowed then how would a process be able to enter the enclave
> to execute code within it?

EENTER (or the vdso function) with the address of a TCS page, like
normal. In Enarx, we don't retain the enclave fd after the final
mmap() following EINIT. Everything works just fine.

> This series does indeed lock down the address range to ensure that it is
> not possible to map memory that does not belong to the enclave after the
> enclave is created. Please see:
> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/

That's not what I'm talking about. I'm talking about a workflow like this:

1. Enclave initialization: ECREATE ... EINIT
2. EENTER
3. Enclave JITs some code (changes page permissions)
4. EEXIT
5. Close enclave fd.
6. EENTER
7. If an enclave attempts page modifications, a fault occurs.

Think of this similar to seccomp(). The enclave wants to do some
dynamic page table manipulation. But then it wants to lock down page
table modification so that, if compromised, attackers have no ability
to obtain RWX permissions.

2022-03-02 23:28:35

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

Hi Nathaniel,

On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> Perhaps it would be better for us to have a shared understanding on
> how the patches as posted are supposed to work in the most common
> cases? I'm thinking here of projects such as Enarx, Gramine and
> Occulum, which all have a similar process. Namely they execute an
> executable (called exec in the below chart) which has things like
> syscalls handled by a shim. These two components (shim and exec) are
> supported by a non-enclave userspace runtime. Given this common
> architectural pattern, this is how I understand adding pages via an
> exec call to mmap() to work.
>
> https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
>
> This only covers adding RW pages. I haven't even tackled permission
> changes yet. Is that understanding correct? If not, please provide an
> alternative sequence diagram to explain how you expect this to be
> used.

Please find my attempt linked below:

https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH

The changes include:
* Move mmap() to occur before attempting EACCEPT on the addresses. This is
required for EACCEPT (as well as any subsequent access from within the enclave)
to be able to access the pages.
* Remove AEX[1] to the runtime within the loop. After EAUG returns execution
will return to the instruction pointer that triggered the #PF, EACCEPT,
this will cause the EACCEPT to be run again, this time succeeding.

This is based on the implementation within this series. When supporting
the new ioctl() requested by Jarkko there will be an additional ioctl()
required before the loop.

Reinette

2022-03-02 23:59:52

by Nathaniel McCallum

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

Reinette,

Perhaps it would be better for us to have a shared understanding on
how the patches as posted are supposed to work in the most common
cases? I'm thinking here of projects such as Enarx, Gramine and
Occulum, which all have a similar process. Namely they execute an
executable (called exec in the below chart) which has things like
syscalls handled by a shim. These two components (shim and exec) are
supported by a non-enclave userspace runtime. Given this common
architectural pattern, this is how I understand adding pages via an
exec call to mmap() to work.

https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg

This only covers adding RW pages. I haven't even tackled permission
changes yet. Is that understanding correct? If not, please provide an
alternative sequence diagram to explain how you expect this to be
used.

On Wed, Feb 23, 2022 at 1:25 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Nathaniel,
>
> On 2/23/2022 5:24 AM, Nathaniel McCallum wrote:
> > On Tue, Feb 22, 2022 at 5:39 PM Reinette Chatre
> > <[email protected]> wrote:
> >>
> >> Hi Nathaniel,
> >>
> >> On 2/22/2022 12:27 PM, Nathaniel McCallum wrote:
> >>> 1. This interface looks very odd to me. mmap() is the kernel interface
> >>> for changing user space memory maps. Why are we introducing a new
> >>> interface for this?
> >>
> >> mmap() is the kernel interface used to create new mappings in the
> >> virtual address space of the calling process. This is different from
> >> the permissions and properties of the underlying file/memory being mapped.
> >>
> >> A new interface is introduced because changes need to be made to the
> >> permissions and properties of the underlying enclave. A new virtual
> >> address space is not needed nor should existing VMAs be impacted.
> >>
> >> This is similar to how mmap() is not used to change file permissions.
> >>
> >> VMA permissions are separate from enclave page permissions as found in
> >> the EPCM (Enclave Page Cache Map). The current implementation (SGX1) already
> >> distinguishes between the VMA and EPCM permissions - for example, it is
> >> already possible to create a read-only VMA from enclave pages that have
> >> RW EPCM permissions. mmap() of a portion of EPC memory with a particular
> >> permission does not imply that the underlying EPCM permissions (should)have
> >> that permission.
> >
> > Yes. BUT... unlike the file permissions, this leaks an implementation detail.
>
> Not really - just like a RW file can be mapped read-only or RW, RW enclave
> memory can be mapped read-only or RW.
>
> >
> > The user process is governed by VMA permissions. And during enclave
> > creation, it had to mmap() all the enclave regions to their final VMA
> > permissions. So during enclave creation you have to use mmap() but
> > after enclave creation you use custom APIs? That's inconsistent at
> > best.
>
> No. ioctl()s are consistently used to manage enclave memory.
>
> The existing ioctls() SGX_IOC_ENCLAVE_CREATE, SGX_IOC_ENCLAVE_ADD_PAGES,
> and SGX_IOC_ENCLAVE_INIT are used to set up to initialize the enclave memory.
>
> The new ioctls() are used to manage enclave memory after enclave initialization.
>
> The enclave memory is thus managed with a consistent interface.
>
> mmap() is required before SGX_IOC_ENCLAVE_CREATE to obtain a base address
> for the enclave that is required by the ioctl(). The rest of the ioctl()s,
> existing and new, are consistent in interface by not requiring a memory
> mapping but instead work from an offset from the base address.
>
> > Forcing userspace to worry about the (mostly undocumented!)
> > interactions between EPC, PTE and VMA permissions makes these APIs
> > hard to use and difficult to reason about.
>
> This is not new. The current SGX1 user space is already prevented from
> creating a mapping of enclave memory that is more relaxed than the enclave
> memory. For example, if the enclave memory has RW EPCM permissions then it
> is not possible to mmap() that memory as RWX.
>
> >
> > When I call SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, do I also have to call
> > mmap() to update the VMA permissions to match? It isn't clear. Nor is
>
> mprotect() may be the better call to use.
>
> > it really clear why I'm calling completely separate APIs.
> >
> >>> You can just simply add a new mmap flag (i.e.
> >>> MAP_SGX_TCS*) and then figure out which SGX instructions to execute
> >>> based on the desired state of the memory maps. If you do this, none of
> >>> the following ioctls are needed:
> >>>
> >>> * SGX_IOC_ENCLAVE_RELAX_PERMISSIONS
> >>> * SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS
> >>> * SGX_IOC_ENCLAVE_REMOVE_PAGES
> >>> * SGX_IOC_ENCLAVE_MODIFY_TYPE
> >>>
> >>> It also means that languages don't have to grow support for all these
> >>> ioctls. Instead, they can just reuse the existing mmap() bindings with
> >>> the new flag. Also, multiple operations can be combined into a single
> >>> mmap() call, amortizing the changes over a single context switch.
> >>>
> >>> 2. Automatically adding pages with hard-coded permissions in a fault
> >>> handler seems like a really bad idea.
> >>
> >> Could you please elaborate why this is a bad idea?
> >
> > Because implementations that miss this subtlety suddenly have pages
> > with magic permissions. Magic is bad. Explicit is good.
> >
>
> There is no magic. Any new pages have to be accepted by the enclave.
> The enclave will not be able to access these pages unless explicitly
> accepted, ENCLU[EACCEPT], from within the enclave.
>
> >>> How do you distinguish between
> >>> accesses which should result in an updated mapping and accesses that
> >>> should result in a fault?
> >>
> >> Accesses that should result in an updated mapping have two requirements:
> >> (a) address accessed belongs to the enclave based on the address
> >> range specified during enclave create
> >> (b) there is no backing enclave page for the address
> >
> > What happens if the enclave is buggy? Or has been compromised. In both
> > of those cases, there should be a userspace visible fault and pages
> > should not be added.
>
> If user space accesses a memory address with a regular read/write that
> results in a new page added then there is indeed a user space visible
> fault. You can see this flow in action in the "augment" test case in
> https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/
>
> If user space indeed wants the page after encountering such a fault then
> it needs to enter the enclave again, from a different entry point, to
> run ENCLU[EACCEPT], before it can return to the original entry point to
> continue execution from the instruction that triggered the original read/write.
>
> The only flow where a page is added without a user space visible fault
> is when user space explicitly runs the ENCLU[EACCEPT] to do so.
>
> >
> >>> IMHO, all unmapped page accesses should
> >>> result in a page fault. mmap() should be called first to identify the
> >>> correct permissions for these pages.
> >>> Then the page handler should be
> >>> updated to use the permissions from the mapping when backfilling
> >>> physical pages. If I understand correctly, this should also obviate
> >>
> >> Regular enclave pages can _only_ be dynamically added with RW permission.
> >>
> >> SGX2's support for adding regular pages to an enclave via the EAUG
> >> instruction is architecturally set at RW. The OS cannot change those permissions
> >> via the EAUG instruction nor can the OS do so with a different/additional
> >> instruction because:
> >> * the OS is not able to relax permissions since that can only be done from
> >> within the enclave with ENCLU[EMODPE], thus it is not possible for the OS to
> >> dynamically add pages via EAUG as RW and then relax permissions to RWX.
> >> * the OS is not able to EAUG a page and immediately attempt an EMODPR either
> >> as Jarkko also recently inquired about:
> >> https://lore.kernel.org/linux-sgx/[email protected]/
> >
> > This design looks... unfinished. EAUG takes a PAGEINFO in RBX, but
> > PAGEINFO.SECINFO must be zeroed and EAUG instead sets magic hard-coded
> > permissions. Why doesn't EAUG just respect the permissions in
> > PAGEINFO.SECINFO? We aren't told.
>
> This design is finished and respects the hardware specification. You can find
> the details in the SDM's documentation of the EAUG function.
>
> If the SECINFO field has a value then the hardware requires it to indicate
> that it is a new shadow stack page being added, not a regular page. Support for
> shadow stack pages is not in scope for this work. Attempting to dynamically
> add a regular page with explicit permissions will result in a #GP(0).
>
> The only way to add a regular enclave page is to make the SECINFO field empty
> and doing so forces the page type to be a regular page and the permissions to
> be RW.
>
> >
> > Further, if the enclave can do EMODPE, why does
> > SGX_IOC_ENCLAVE_RELAX_PERMISSIONS even exist? None of the
> > documentation explains what this ioctl even does. Does it update PTE
> > permissions? VMA permissions? Nobody knows without reading the source
> > code.
>
> Build the documentation (after applying this series) and it should
> contain all the information you are searching for. As is the current custom
> in the SGX documentation the built documentation pulls its content from
> the kernel doc of the functions that implement the core of the
> user space interactions.
>
> >
> > Userspace should not be bothered with the subtle details of the
> > interaction between EPC, PTE and VMA permissions. But this API does
> > everything it can do to expose all these details to userspace. And it
> > doesn't bother to document them (probably because it is hard). It
> > would be much better to avoid exposing these details to userspace.
> >
> > IMHO, there should be a simple flow like this (if EAUG respects
> > PAGEINFO.SECINFO):
>
> EAUG does not respect PAGEINFO.SECINFO for regular pages.
>
> >
> > 1. Non-enclave calls mmap()/munmap().
> > 2. Enclave issues EACCEPT, if necessary.
> > 3. Enclave issues EMODPE, if necessary.
> >
> > Notice that in the second step above, during the mmap() call, the
> > kernel ensures that EPC, PTE and VMA are in sync and fails if they
> > cannot be made to be compatible. Also note that in the above flow EAUG
> > instructions can be efficiently batched.
> >
> > Given the current poor state of the EAUG instruction, we might need to
> > do this flow instead:
> >
> > 1. Enclave issues EACCEPT, if necessary. (Add RW pages...)
> > 2. Non-enclave calls mmap()/munmap().
> > 3. Enclave issues EACCEPT, if necessary.
> > 4. Enclave issues EMODPE, if necessary.
> >
> > However, doing EAUG only via the page access handler means that there
> > is no way to batch EAUG instructions and this forces a context switch
> > for every page you want to add. This has to be terrible for
> > performance. Note specifically that the SDM calls out batching, which
> > is currently impossible under this patch set. 35.5.7 - "Page
> > allocation operations may be batched to improve efficiency."
>
> These page functions are all per-page so it is not possible to add multiple
> pages with a single instruction. It is indeed possible to pre-fault pages.
>
> > As it stands today, if I want to add 256MiB of pages to an enclave,
> > I'll have to do 2^16 context switches. That doesn't seem scalable.
>
> No. Running ENCLU[EACCEPT] on each of the pages within that range should not
> need any explicit context switch out of the enclave. See the "augment_via_eaccept"
> test case in:
> https://lore.kernel.org/linux-sgx/32c1116934a588bd3e6c174684e3e36a05c0a4d4.1644274683.git.reinette.chatre@intel.com/
>
>
> >>> the need for the weird userspace callback to allow for execute
> >>> permissions.
> >>
> >> User policy integration would always be required to allow execute
> >> permissions on a writable page. This is not expected to be a userspace
> >> callback but instead integration with existing user policy subsystem(s).
> >
> > Why? This isn't documented.
>
> This is similar to the existing policies involved in managing the permissions
> of mapped memory. When user space calls mprotect() to change permissions
> of a mapped region then the kernel will not blindly allow the permissions but
> instead ensure that it is allowed based on user policy by calling the LSM
> (Linux Security Module) hooks.
>
> You can learn more about LSM and various security modules at:
> Documentation/security/lsm.rst
> Documentation/admin-guide/LSM/*
>
> You can compare what is needed here to what is currently done when user space
> attempts to make some memory executable (see:
> mm/mprotect.c:do_mprotect_key()->security_file_mprotect()). User policy needs
> to help the kernel determine if this is allowed. For example, when SELinux is
> the security module of choice then the process or file (depending on what type
> of memory is being changed) needs to have a special permission (PROCESS__EXECHEAP,
> PROCESS__EXECSTACK, or FILE__EXECMOD) assigned by user space to allow this.
>
> Integration with user space policy is required for RWX of dynamically added pages
> to be supported. In this series dynamically added pages will not be allowed to
> be made executable, a follow-up series will add support for user policy
> integration to support RWX permissions of dynamically added pages.
>
> >>> 3. Implementing as I've suggested also means that we can lock down an
> >>> enclave, for example - after code has been JITed, by closing the file
> >>> descriptor. Once the file descriptor used to create the enclave is
> >>> closed, no further mmap() can be performed on the enclave. Attempting
> >>> to do EACCEPT on an unmapped page will generate a page fault.
> >>
> >> This is not clear to me. If the file descriptor is closed and no further
> >> mmap() is allowed then how would a process be able to enter the enclave
> >> to execute code within it?
> >
> > EENTER (or the vdso function) with the address of a TCS page, like
> > normal. In Enarx, we don't retain the enclave fd after the final
> > mmap() following EINIT. Everything works just fine.
>
> The OS fault handler is responsible for managing the PTEs that is required
> for the enclave to be able to access the memory within the enclave.
> The OS fault handler is attached to a VMA that is created with mmap().
>
> >
> >> This series does indeed lock down the address range to ensure that it is
> >> not possible to map memory that does not belong to the enclave after the
> >> enclave is created. Please see:
> >> https://lore.kernel.org/linux-sgx/1b833dbce6c937f71523f4aaf4b2181b9673519f.1644274683.git.reinette.chatre@intel.com/
> >
> > That's not what I'm talking about. I'm talking about a workflow like this:
> >
> > 1. Enclave initialization: ECREATE ... EINIT
> > 2. EENTER
> > 3. Enclave JITs some code (changes page permissions)
> > 4. EEXIT
> > 5. Close enclave fd.
> > 6. EENTER
> > 7. If an enclave attempts page modifications, a fault occurs.
>
> The original fd that was created to obtain the enclave base address
> may be closed at (5) but the executable and data portions of the enclave
> still needs to be mapped afterwards to be able to have OS support for
> managing the PTEs that the enclave depends on to access those pages.
>
> >
> > Think of this similar to seccomp(). The enclave wants to do some
> > dynamic page table manipulation. But then it wants to lock down page
> > table modification so that, if compromised, attackers have no ability
> > to obtain RWX permissions.
>
> Reinette

2022-03-03 01:32:47

by Nathaniel McCallum

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
<[email protected]> wrote:
>
> Hi Nathaniel,
>
> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> > Perhaps it would be better for us to have a shared understanding on
> > how the patches as posted are supposed to work in the most common
> > cases? I'm thinking here of projects such as Enarx, Gramine and
> > Occulum, which all have a similar process. Namely they execute an
> > executable (called exec in the below chart) which has things like
> > syscalls handled by a shim. These two components (shim and exec) are
> > supported by a non-enclave userspace runtime. Given this common
> > architectural pattern, this is how I understand adding pages via an
> > exec call to mmap() to work.
> >
> > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> >
> > This only covers adding RW pages. I haven't even tackled permission
> > changes yet. Is that understanding correct? If not, please provide an
> > alternative sequence diagram to explain how you expect this to be
> > used.
>
> Please find my attempt linked below:
>
> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
>
> The changes include:
> * Move mmap() to occur before attempting EACCEPT on the addresses. This is
> required for EACCEPT (as well as any subsequent access from within the enclave)
> to be able to access the pages.
> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
> will return to the instruction pointer that triggered the #PF, EACCEPT,
> this will cause the EACCEPT to be run again, this time succeeding.
>
> This is based on the implementation within this series. When supporting
> the new ioctl() requested by Jarkko there will be an additional ioctl()
> required before the loop.

https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L

My comments below correspond to the arrow numbers in the diagram.

2. When the runtime receives the AEX, it doesn't have enough knowledge
to know whether or not to ask the kernel for an mmap(). So it has to
reenter the shim.

3. The shim has to handle the syscall instruction routing it to the
enclave's memory management subsystem.

4. The shim has to do bookkeeping and decide if additional pages are
even needed. If pages are already allocated, for example, it can skip
directly to step 13. However, if modifications are needed, it will go
to steps 5-12.

5-12. This is the part that represents new code from the kernel's
perspective for SGX2. It is also in a performance critical path and
should be evaluated with greater scrutiny. The number of context
switches is O(2N + 4) for each new allocated block, where N is the
number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and
12. However, this can be reduced to O(4) for each new allocated block
with a simple modification:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk

The interesting thing about this pattern is that this can be done for
all page modification types except EMODT. For example, here's the same
process for changing a mapping from RW to RX:

https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY

My point in this thread has always been that it is an anti-feature to
presume that there is a need to treat EPC and VLA permissions
separately. This is a performance sink and it optimizes for a use case
which doesn't exist. Nobody actually wants there to be a mismatch
between EPC and VLA permissions.

So, besides EMODT, the only userspace interface we need is
mmap()/mprotect()/munmap(). The kernel should either succeed the
mmap()/mprotect()/munmap() syscall if the EPC permissions can be made
compatible or should fail otherwise.

Another interesting property arises from this flow. Since the EPC and
VLA permissions are always synchronized from the perspective of
userspace, in cases where the memory state between the kernel and the
exec layer is roughly synchronous, bookkeeping in the shim can be
implemented without any persistent memory between syscall handling
events. So, for example, the shim can implement brk() and
mmap()/munmap()/mprotect() with just two pointers: one to the break
position and one to the lowest mmap().

It is true that this basically commits enclave authors to doing all
EACCEPT calls immediately after modifications. But I suspect everyone
will do this anyway since there is no efficient (read: performant) way
for shims to handle page faults. So trying to do this lazily will just
result in a huge decrease in performance.

2022-03-03 19:14:13

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

Hi Nathaniel,

On 3/2/2022 5:13 PM, Nathaniel McCallum wrote:
> On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
> <[email protected]> wrote:
>>
>> Hi Nathaniel,
>>
>> On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
>>> Perhaps it would be better for us to have a shared understanding on
>>> how the patches as posted are supposed to work in the most common
>>> cases? I'm thinking here of projects such as Enarx, Gramine and
>>> Occulum, which all have a similar process. Namely they execute an
>>> executable (called exec in the below chart) which has things like
>>> syscalls handled by a shim. These two components (shim and exec) are
>>> supported by a non-enclave userspace runtime. Given this common
>>> architectural pattern, this is how I understand adding pages via an
>>> exec call to mmap() to work.
>>>
>>> https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
>>>
>>> This only covers adding RW pages. I haven't even tackled permission
>>> changes yet. Is that understanding correct? If not, please provide an
>>> alternative sequence diagram to explain how you expect this to be
>>> used.
>>
>> Please find my attempt linked below:
>>
>> https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
>>
>> The changes include:
>> * Move mmap() to occur before attempting EACCEPT on the addresses. This is
>> required for EACCEPT (as well as any subsequent access from within the enclave)
>> to be able to access the pages.
>> * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
>> will return to the instruction pointer that triggered the #PF, EACCEPT,
>> this will cause the EACCEPT to be run again, this time succeeding.
>>
>> This is based on the implementation within this series. When supporting
>> the new ioctl() requested by Jarkko there will be an additional ioctl()
>> required before the loop.
>
> https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L
>
> My comments below correspond to the arrow numbers in the diagram.
>
> 2. When the runtime receives the AEX, it doesn't have enough knowledge
> to know whether or not to ask the kernel for an mmap(). So it has to
> reenter the shim.
>
> 3. The shim has to handle the syscall instruction routing it to the
> enclave's memory management subsystem.
>
> 4. The shim has to do bookkeeping and decide if additional pages are
> even needed. If pages are already allocated, for example, it can skip
> directly to step 13. However, if modifications are needed, it will go
> to steps 5-12.
>
> 5-12. This is the part that represents new code from the kernel's
> perspective for SGX2. It is also in a performance critical path and
> should be evaluated with greater scrutiny. The number of context
> switches is O(2N + 4) for each new allocated block, where N is the
> number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and
> 12. However, this can be reduced to O(4) for each new allocated block
> with a simple modification:
>
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk

Your optimized proposal is possible in the current implementation as
follows:

https://mermaid.live/edit#pako:eNp1k11vgjAUhv_KSa-2RJPtlmQmxvViFzOLuMxEdlHbgzTSlrVlmzH-9xUBUWFclfc8nI-X0wPhRiCJiMOvEjXHZ8m2lqlEQ3hY6Y0u1QZt_V4w6yWXBdMeMmbFD7PYj-zQasz7ui21l2rgA5dJ1VfxF3mia31uPIL5RntSI1CKFXeLj3twe8dZnrdcFYXxeDJpi0Uwpav1w2cdbkSogKpoBJTOl3SxfmyASryIZkyLHLsiA8jGmN0OsZB62zZhCg8yDbNsEZQRMpWceWm0A40oUNTUVa5zt5SuXpbndm57rp3txu-oOnKd62ySRVfmfjhlz4YOy40pIDXBc8az0zhdbMAJOp3N6NuyY1A3o54Og-7F8TT8HHiCwjg_bnwG55nHG_4fhy5HqVeDLmj8_kpDWjIiCq1iUoT9PlR8QnyGYQNJFI4CU1bZQhJ9DGhZiFCVCumNJVHKcocjUl2AeK85ibwtsYWaO9JQxz-gBQs-

You can think of that EACCEPT instruction similar to a current (SGX1)
enclave memory read or write when the enclave page is not currently in
the EPC, for example, if the enclave memory being accessed is swapped
out and need to be decrypted and loaded back. Instead of ENCLS[ELDU]
incorporated to load the enclave page back into EPC, ENCLS[EAUG] is
incorporated to create a new EPC page.

You can find an example of such a flow involving EACCEPT in the
"augment_via_eaccept" test found in "[PATCH V2 21/32] selftests/sgx: Test
two different SGX2 EAUG flows"


> The interesting thing about this pattern is that this can be done for
> all page modification types except EMODT. For example, here's the same
> process for changing a mapping from RW to RX:
>
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY
>
> My point in this thread has always been that it is an anti-feature to
> presume that there is a need to treat EPC and VLA permissions
> separately. This is a performance sink and it optimizes for a use case
> which doesn't exist. Nobody actually wants there to be a mismatch
> between EPC and VLA permissions.

I assume you mean VMA permissions. It is hard for me to trust the statement
that nobody wants there to be a mismatch since VMA permissions being separate
from EPC permissions is an intentional (as documented) and integral part of the
current SGX ABI. Current SGX implementation explicitly checks for and supports
VMA mappings with permissions different from EPC permissions.

This SGX2 implementation follows and respects the current ABI and changing ABI
cannot be taken lightly.

Reinette

2022-03-04 12:50:42

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH V2 00/32] x86/sgx and selftests/sgx: Support SGX2

On Wed, Mar 02, 2022 at 08:13:55PM -0500, Nathaniel McCallum wrote:
> On Wed, Mar 2, 2022 at 4:20 PM Reinette Chatre
> <[email protected]> wrote:
> >
> > Hi Nathaniel,
> >
> > On 3/2/2022 8:57 AM, Nathaniel McCallum wrote:
> > > Perhaps it would be better for us to have a shared understanding on
> > > how the patches as posted are supposed to work in the most common
> > > cases? I'm thinking here of projects such as Enarx, Gramine and
> > > Occulum, which all have a similar process. Namely they execute an
> > > executable (called exec in the below chart) which has things like
> > > syscalls handled by a shim. These two components (shim and exec) are
> > > supported by a non-enclave userspace runtime. Given this common
> > > architectural pattern, this is how I understand adding pages via an
> > > exec call to mmap() to work.
> > >
> > > https://mermaid.live/edit#pako:eNp1k81qwzAQhF9F6NRCAu1Vh0BIRemhoeSHBuIettYmFpElVZZLQ8i7144sJ8aOT2bmY3d2vT7R1AikjBb4U6JO8UXC3kGeaFI9FpyXqbSgPTmg06j6uiu1lzn2jSKTA2XwD9NEB31uPBLzi-6iMpLnYB8Wn4-kOBYpKBW52iXj8WQSmzEy5Zvt01ewG5HUQN2UEc7nK77YPjdALd64GWih8NpkALGwR_JtzOGAaKXexyTKGEt2pgoMaXahgj5Qgk9nM_6xGvDDJpsmOyiVv0LB62B8un4dBDrLiLPeWciCL9fvvKVQizhSG6stFz9Df7sxUpcYitR-SodFO2A_Vw-7l4nzzduqjX9bKJxOHDDeBB3RHF0OUlS3faq1hPoMqzulrHoVGPZOE32u0NIK8MiF9MZRtgNV4IhC6c3yqFPKvCsxQs3_0VDnfzf-CPg
> > >
> > > This only covers adding RW pages. I haven't even tackled permission
> > > changes yet. Is that understanding correct? If not, please provide an
> > > alternative sequence diagram to explain how you expect this to be
> > > used.
> >
> > Please find my attempt linked below:
> >
> > https://mermaid.live/edit#pako:eNqFUsFqAjEQ_ZWQUwsK7XUPgthQeqiUVang9jAkoxu6m2yzWVsR_72J2WTbKnSOb97MvPeSI-VaIM1oix8dKo4PEnYG6kIRVw0YK7lsQFlSghGfYPCy845GYXWJm05ZWV8ZaEt55QB-IS9UwOfaItF7NGc0I3UNzU3-ekvaQ8uhqiLPd8l4PJnEYxmZsvXm7i20e5B4QlA5rAqMgJJfG9Ixg21X2ctVXn9GGJsvWb65729FSZXWDdlqpxx46Qzu-gB8-cHzhhim2zKdzdjLcuAAt3IPzv6Qkq84EdxGM3492UJS-cdSpLHp6nEgCPz3RjI5NPvAlRisJjspOsbWT8sUyc_MwjuynC1Wzyw9EB3RGk0NUrgvePRYQW2J7tNQd5sKDN5ooU6O2jXCiWZCWm1otoWqxRGFzurFQXGaWdNhJPXfuGedvgFejOuH
> >
> > The changes include:
> > * Move mmap() to occur before attempting EACCEPT on the addresses. This is
> > required for EACCEPT (as well as any subsequent access from within the enclave)
> > to be able to access the pages.
> > * Remove AEX[1] to the runtime within the loop. After EAUG returns execution
> > will return to the instruction pointer that triggered the #PF, EACCEPT,
> > this will cause the EACCEPT to be run again, this time succeeding.
> >
> > This is based on the implementation within this series. When supporting
> > the new ioctl() requested by Jarkko there will be an additional ioctl()
> > required before the loop.
>
> https://mermaid.live/edit/#pako:eNp1U9FqgzAU_ZWQpw1a2F6FFaQLYw8ro7asUPeQmWsNNYlL4rZS-u-LRmut1ie953jvOecmR5woBjjABr5LkAk8c7rTVMQSuYeWVslSfIH23wXVlie8oNKijGr2SzUMkT1oCfmwrktpuRj5wWRcDKvwB0ksfX2hLCD1A7quBkgIWtwtP-6ROZiE5nnLq1A0nc5m7bAAhWSzffj0cFNEFaEaGiBCFiuy3D42hKp4gWZUshy6ISOUL6X2e4CCy10rQhUW8dR52QESivGUJ9RyJQ2SAAyYZ_V6ndUSsnldneVca_bJdvY7lkf6vc4haTBlbsdbDmLoaLlSBUqVy5wmWW2nw3rq26Pg-oTzOXlf9Xkt7BfTeqjjSWlP2JWTlkrC9cutlmcLlUlxoRBkE3T9Mrq7KArd0UBPqFDGTpstI2OphSv-jf1cBukPJlmSaP1GXFs8wQK0oJy523Ws-DG2GTiJOHCvDLx3HMuTo5YFc1MJ41ZpHKQ0NzDB1fWLDjLBgdUltKTmhjas0z-kWy8L
>
> My comments below correspond to the arrow numbers in the diagram.
>
> 2. When the runtime receives the AEX, it doesn't have enough knowledge
> to know whether or not to ask the kernel for an mmap(). So it has to
> reenter the shim.
>
> 3. The shim has to handle the syscall instruction routing it to the
> enclave's memory management subsystem.
>
> 4. The shim has to do bookkeeping and decide if additional pages are
> even needed. If pages are already allocated, for example, it can skip
> directly to step 13. However, if modifications are needed, it will go
> to steps 5-12.
>
> 5-12. This is the part that represents new code from the kernel's
> perspective for SGX2. It is also in a performance critical path and
> should be evaluated with greater scrutiny. The number of context
> switches is O(2N + 4) for each new allocated block, where N is the
> number of pages: a context switch occurs at step 5, 6, 7, 8, 9/10 and
> 12. However, this can be reduced to O(4) for each new allocated block
> with a simple modification:
>
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIXQydjFymhaVmh24fSkkUbN1Gwrpf99pvlsk8G80nMez3l91SNmmgMOsIXPAhSDR0F3hspYIT9o4bQq5AeYap1T4wQTOVUOpdTwb2pgmNmDUZAN46ZQTsiRDTYVchiFH2CxquIL7QDpLzDnaICkpPnN8u0W2YNlNMsarsyi6XQ2a5oFKCSb7d17la6DqATKpgEiZLEiy-19DZTBXjalimfQNRlBPrTe7wFyoXaNCJ07JBJ_lh0gqblIBKNOaGWRAuDAK-qiVquWkM3zqpVzrblytjt-R2Va5yjR3h_K0nPrLleOaudFERKunzoIVE9Xj26VtZYbsEXmxgUOTP2_witfSTifk9fViMDzZPQuoij0V40eUK6tm9a3hqyjDq74P_zuH6V6aGRJovUL8WXxBEswkgruf8ux5GPsUvDvGQd-yiGhpS04ViePFjn3XQkXThscJDSzMMHld4oOiuHAmQIaqP5xNXX6BeBJIEk
>
> The interesting thing about this pattern is that this can be done for
> all page modification types except EMODT. For example, here's the same
> process for changing a mapping from RW to RX:
>
> https://mermaid.live/edit/#pako:eNqNk11rwyAUhv-KeLVBC9ttYIVCvdhFu5F0UGh24fSkkUbN1Gwrpf995jttMphXes7jOa-vesZMc8ABtvBZgGKwEvRgqIwV8oMWTqtCfoCp1zk1TjCRU-VQSg3_pgbGmSMYBdk4bgrlhJzYYFMhx1H4ARarOr7RDpD-AlNFAyQlze_C3T2yJ8tolrVcmUXz-WLRNgvQkuz2D-91ugmiEiibBoiQzZaE-8cGKIODbEoVz6BvMoF8aH08AuRCHVoROndIJP4sB0BSc5EIRp3QyiIFwIHX1FWtTi0hu-dtJ-dWc-1sf_yeyrTOUaK9P5SlVes-V45651URsn5ZvYY9BmqgbMB32jrTDdgic9MSR7b-X-ONs5U-MqGvmkxeRhQt_V2jJ5Rr6-bNtSHrqIMb_g_DhyepXxoJSfS2Jr4snmEJRlLB_Xc5l3yMXQr-QePATzkktHQFx-ri0SLnvivhwmmDg4RmFma4_E_RSTEcOFNACzVfrqEuvytQILY
>
> My point in this thread has always been that it is an anti-feature to
> presume that there is a need to treat EPC and VLA permissions
> separately. This is a performance sink and it optimizes for a use case
> which doesn't exist. Nobody actually wants there to be a mismatch
> between EPC and VLA permissions.

I would not touch pre-initialization, EADD'd pages. The reason is
backwards compatibility.

For post-initialization, options are still open.

> So, besides EMODT, the only userspace interface we need is
> mmap()/mprotect()/munmap(). The kernel should either succeed the
> mmap()/mprotect()/munmap() syscall if the EPC permissions can be made
> compatible or should fail otherwise.

For mmap() it is the enclave who sets the permissions, not kernel, i.e. you
get a half-broken mmap() implementation. Kernel does EAUG, enclave does
EACCEPTCOPY.

I think what you're asking is too simple to be true, and even if we could
do it, it might limit possibilities to optimize user space, e.g. because
there is two ENCLU leaf functions (EMODPE, EACCEPTCOPY) and one ENCLS
leaf function (EMODPR), which can modify permissions.

A kernel syscall is essentially something that can be fully serviced
by the kernel. This is not such situation. The work is split.

BR, Jarkko