2022-11-11 19:13:32

by Kristen Carlson Accardi

Subject: [PATCH 00/26] Add Cgroup support for SGX EPC memory

Utilize the Miscellaneous cgroup controller to regulate the distribution
of SGX EPC memory, which is a subset of system RAM that is used to provide
SGX-enabled applications with protected memory, and is otherwise inaccessible.

SGX EPC memory allocations are separate from normal RAM allocations
and are managed solely by the SGX subsystem. The existing cgroup memory
controller cannot be used to limit or account for SGX EPC memory.

This patchset implements support for sgx_epc memory within the
misc cgroup controller, and then uses the misc cgroup controller
to set the total system capacity, enforce a per-cgroup max limit,
and report events.

This work was originally authored by Sean Christopherson a few years
ago. It has been updated to work with more recent kernels, and reworked
to use the misc cgroup controller rather than a custom controller. It
is currently based on top of the MCA patches.

Here's the MCA patchset for reference.
https://lore.kernel.org/linux-sgx/[email protected]/T/#t

The patchset adds support for multiple LRUs to track both reclaimable
EPC pages (i.e. pages the reclaimer knows about) and unreclaimable
EPC pages (i.e. pages the reclaimer isn't aware of, such as VA pages).
These pages are assigned to an LRU, as well as to an enclave, so that an
enclave's full EPC usage can be tracked and limited to a max value. During
OOM events, an enclave can have its memory zapped, and all the EPC pages
not tracked by the reclaimer can be freed.
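For illustration, the dual-list LRU arrangement can be sketched as standalone C. This is a mock, not the kernel code: `struct list_head` and the spinlock are replaced with minimal stand-ins, and the type and function names merely echo the patchset.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly-linked list, standing in for struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_node *n, struct list_node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static int list_empty(const struct list_node *h) { return h->next == h; }

/* One LRU per tracking domain. Reclaimable pages are scanned by the
 * reclaimer; unreclaimable pages (e.g. VA pages) are only freed via the
 * OOM path. The kernel version guards both lists with a spinlock. */
struct sgx_epc_lru {
	struct list_node reclaimable;
	struct list_node unreclaimable;
};

static void sgx_lru_init(struct sgx_epc_lru *lru)
{
	list_init(&lru->reclaimable);
	list_init(&lru->unreclaimable);
}
```

With every page on exactly one of the two lists, an enclave's full EPC footprint stays visible even when the reclaimer cannot touch part of it.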

I appreciate your comments and feedback.

Kristen Carlson Accardi (13):
x86/sgx: Add 'struct sgx_epc_lru' to encapsulate lru list(s)
x86/sgx: Use sgx_epc_lru for existing active page list
x86/sgx: Track epc pages on reclaimable or unreclaimable lists
cgroup/misc: Add notifier block list support for css events
cgroup/misc: Expose root_misc
cgroup/misc: Expose parent_misc()
cgroup/misc: allow users of misc cgroup to read specific cgroup usage
cgroup/misc: allow misc cgroup consumers to read the max value
cgroup/misc: Add private per cgroup data to struct misc_cg
cgroup/misc: Add tryget functionality for misc controller
cgroup/misc: Add SGX EPC resource type
x86/sgx: Add support for misc cgroup controller
Docs/x86/sgx: Add description for cgroup support

Sean Christopherson (13):
x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()
x86/sgx: Store struct sgx_encl when allocating new va pages
x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages
x86/sgx: Use a list to track to-be-reclaimed pages during reclaim
x86/sgx: Add EPC page flags to identify type of page
x86/sgx: Allow reclaiming up to 32 pages, but scan 16 by default
x86/sgx: Return the number of EPC pages that were successfully
reclaimed
x86/sgx: Add option to ignore age of page during EPC reclaim
x86/sgx: Add helper to retrieve SGX EPC LRU given an EPC page
x86/sgx: Prepare for multiple LRUs
x86/sgx: Expose sgx_reclaim_pages() for use by EPC cgroup
x86/sgx: Add helper to grab pages from an arbitrary EPC LRU
x86/sgx: Add EPC OOM path to forcefully reclaim EPC

Documentation/x86/sgx.rst | 77 ++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 89 ++++-
arch/x86/kernel/cpu/sgx/encl.h | 4 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 561 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 59 +++
arch/x86/kernel/cpu/sgx/ioctl.c | 13 +-
arch/x86/kernel/cpu/sgx/main.c | 405 +++++++++++++++----
arch/x86/kernel/cpu/sgx/sgx.h | 96 ++++-
arch/x86/kernel/cpu/sgx/virt.c | 28 +-
include/linux/misc_cgroup.h | 71 ++++
kernel/cgroup/misc.c | 145 ++++++-
13 files changed, 1446 insertions(+), 116 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

--
2.37.3



2022-11-11 19:17:06

by Kristen Carlson Accardi

Subject: [PATCH 22/26] cgroup/misc: Add private per cgroup data to struct misc_cg

The SGX driver needs to be able to store additional per cgroup data
specific to SGX along with the misc_cg struct. Add the ability to get
and set this data in struct misc_cg.
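The accessors can be sketched standalone as follows. Type and field names follow the diff below, but `valid_type()` and the rest of `struct misc_cg` are stubbed out here for illustration; this is a userspace mock, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

enum misc_res_type { MISC_CG_RES_SGX_EPC, MISC_CG_RES_TYPES };

struct misc_res {
	unsigned long max;
	long usage;
	void *priv;	/* per-resource consumer data, added by this patch */
};

struct misc_cg {
	struct misc_res res[MISC_CG_RES_TYPES];
};

static int valid_type(enum misc_res_type type)
{
	return (int)type >= 0 && type < MISC_CG_RES_TYPES;
}

/* Return the consumer's private data, or NULL for a bad type/cgroup. */
static void *misc_cg_get_priv(enum misc_res_type type, struct misc_cg *cg)
{
	if (!(valid_type(type) && cg))
		return NULL;
	return cg->res[type].priv;
}

/* Store consumer private data; silently ignored for a bad type/cgroup. */
static void misc_cg_set_priv(enum misc_res_type type, struct misc_cg *cg,
			     void *priv)
{
	if (!(valid_type(type) && cg))
		return;
	cg->res[type].priv = priv;
}
```

The SGX driver later uses this pair to hang its `struct sgx_epc_cgroup` off each misc cgroup.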

Signed-off-by: Kristen Carlson Accardi <[email protected]>
---
include/linux/misc_cgroup.h | 12 ++++++++++++
kernel/cgroup/misc.c | 39 +++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index c00deae4d2df..7fbf3efb0f62 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -43,6 +43,7 @@ struct misc_res {
unsigned long max;
atomic_long_t usage;
atomic_long_t events;
+ void *priv;
};

/**
@@ -63,6 +64,8 @@ struct misc_cg *root_misc(void);
struct misc_cg *parent_misc(struct misc_cg *cg);
unsigned long misc_cg_read(enum misc_res_type type, struct misc_cg *cg);
unsigned long misc_cg_max(enum misc_res_type type, struct misc_cg *cg);
+void *misc_cg_get_priv(enum misc_res_type type, struct misc_cg *cg);
+void misc_cg_set_priv(enum misc_res_type type, struct misc_cg *cg, void *priv);
unsigned long misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, unsigned long capacity);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
@@ -130,6 +133,15 @@ static inline unsigned long misc_cg_max(enum misc_res_type type, struct misc_cg
return 0;
}

+static inline void *misc_cg_get_priv(enum misc_res_type type, struct misc_cg *cg)
+{
+ return NULL;
+}
+
+static inline void misc_cg_set_priv(enum misc_res_type type, struct misc_cg *cg, void *priv)
+{
+}
+
static inline unsigned long misc_cg_res_total_usage(enum misc_res_type type)
{
return 0;
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 18d0bec7d609..642879ad136f 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -251,6 +251,45 @@ unsigned long misc_cg_max(enum misc_res_type type, struct misc_cg *cg)
}
EXPORT_SYMBOL_GPL(misc_cg_max);

+/**
+ * misc_cg_get_priv() - Return the priv value of the misc cgroup res.
+ * @type: Type of the misc res.
+ * @cg: Misc cgroup whose priv will be read
+ *
+ * Context: Any context.
+ * Return:
+ * The value of the priv field for the specified misc cgroup.
+ * If an invalid misc_res_type is given, NULL will be returned.
+ */
+void *misc_cg_get_priv(enum misc_res_type type, struct misc_cg *cg)
+{
+ if (!(valid_type(type) && cg))
+ return NULL;
+
+ return cg->res[type].priv;
+}
+EXPORT_SYMBOL_GPL(misc_cg_get_priv);
+
+/**
+ * misc_cg_set_priv() - Set the priv value of the misc cgroup res.
+ * @type: Type of the misc res.
+ * @cg: Misc cgroup whose priv will be written
+ * @priv: Value to store in the priv field of the struct misc_cg
+ *
+ * If an invalid misc_res_type is given, the priv data will not be
+ * stored.
+ *
+ * Context: Any context.
+ */
+void misc_cg_set_priv(enum misc_res_type type, struct misc_cg *cg, void *priv)
+{
+ if (!(valid_type(type) && cg))
+ return;
+
+ cg->res[type].priv = priv;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_priv);
+
/**
* misc_cg_max_show() - Show the misc cgroup max limit.
* @sf: Interface file
--
2.37.3


2022-11-11 19:22:49

by Kristen Carlson Accardi

Subject: [PATCH 01/26] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()

From: Sean Christopherson <[email protected]>

In order to avoid repetition of cond_resched() in ksgxd() and
sgx_alloc_epc_page(), move the invocation of post-reclaim cond_resched()
inside sgx_reclaim_pages(). Except in the case of sgx_reclaim_direct(),
sgx_reclaim_pages() is always called in a loop and is always followed
by a call to cond_resched(). This will hold true for the EPC cgroup
as well, which adds even more calls to sgx_reclaim_pages() and thus
cond_resched(). Calls to sgx_reclaim_direct() may be performance
sensitive. Allow sgx_reclaim_direct() to avoid the cond_resched()
call by moving the original sgx_reclaim_pages() body to
__sgx_reclaim_pages() and making sgx_reclaim_pages() a thin wrapper
that calls __sgx_reclaim_pages() followed by cond_resched().
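The refactoring pattern, reduced to a standalone sketch with counters standing in for the real reclaim work and for cond_resched() (names mirror the diff below, but this is illustrative, not the kernel code):

```c
#include <assert.h>

/* Counters stand in for the real work so the call shape is observable. */
static int reclaim_calls;
static int resched_calls;

static void cond_resched_stub(void) { resched_calls++; } /* mock cond_resched() */

/* The original reclaim body moves here. */
static void __reclaim_pages(void) { reclaim_calls++; }

/* Looping callers (ksgxd, the allocation slow path) use the wrapper and
 * get the post-reclaim scheduling point for free. */
static void reclaim_pages(void)
{
	__reclaim_pages();
	cond_resched_stub();
}

/* The performance-sensitive direct path skips the scheduling point. */
static void reclaim_direct(void)
{
	__reclaim_pages();
}
```

The wrapper keeps the common case terse while letting the one caller that cares opt out, rather than sprinkling cond_resched() at every call site.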

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 160c8dbee0ab..ffce6fc70a1f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
*/
-static void sgx_reclaim_pages(void)
+static void __sgx_reclaim_pages(void)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
}
}

+static void sgx_reclaim_pages(void)
+{
+ __sgx_reclaim_pages();
+ cond_resched();
+}
+
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
@@ -378,12 +384,14 @@ static bool sgx_should_reclaim(unsigned long watermark)
/*
* sgx_reclaim_direct() should be called (without enclave's mutex held)
* in locations where SGX memory resources might be low and might be
- * needed in order to make forward progress.
+ * needed in order to make forward progress. This call to
+ * __sgx_reclaim_pages() avoids the cond_resched() in sgx_reclaim_pages()
+ * to improve performance.
*/
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ __sgx_reclaim_pages();
}

static int ksgxd(void *p)
@@ -410,8 +418,6 @@ static int ksgxd(void *p)

if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
sgx_reclaim_pages();
-
- cond_resched();
}

return 0;
@@ -582,7 +588,6 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
}

sgx_reclaim_pages();
- cond_resched();
}

if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
--
2.37.3


2022-11-11 19:25:17

by Kristen Carlson Accardi

Subject: [PATCH 15/26] x86/sgx: Add helper to grab pages from an arbitrary EPC LRU

From: Sean Christopherson <[email protected]>

Move the isolation loop into a standalone helper, sgx_isolate_pages(),
in preparation for existence of multiple LRUs. Expose the helper to
other SGX code so that it can be called from the EPC cgroup code, e.g.
to isolate pages from a single cgroup LRU. Exposing the isolation loop
allows the cgroup iteration logic to be wholly encapsulated within the
cgroup code.
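The `*nr_to_scan` contract that makes the helper chainable can be shown in a standalone sketch. Pages are modeled as plain ints here; the real helper also takes enclave references and manipulates page flags.

```c
#include <assert.h>

/* Move up to *nr_to_scan entries from one "LRU" to dst, decrementing the
 * shared budget in place so successive calls over multiple LRUs (e.g. one
 * per cgroup in a hierarchy walk) split a single scan budget. */
static void isolate_pages(const int *lru, int lru_len, int *nr_to_scan,
			  int *dst, int *dst_len)
{
	for (int i = 0; i < lru_len && *nr_to_scan > 0; i++, --(*nr_to_scan))
		dst[(*dst_len)++] = lru[i];
}
```

Because the budget is passed by pointer, the cgroup iteration logic can simply call the helper per LRU and stop once the budget hits zero, keeping the walk wholly inside the cgroup code.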

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
arch/x86/kernel/cpu/sgx/main.c | 68 +++++++++++++++++++++-------------
arch/x86/kernel/cpu/sgx/sgx.h | 2 +
2 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index cb6f57caf24c..f8f1451b0a11 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -280,7 +280,46 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
}

/**
- * sgx_reclaim_pages() - Reclaim EPC pages from the consumers
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru: LRU from which to reclaim
+ * @nr_to_scan: Number of pages to scan for reclaim
+ * @dst: Destination list to hold the isolated pages
+ */
+void sgx_isolate_epc_pages(struct sgx_epc_lru *lru, int *nr_to_scan,
+ struct list_head *dst)
+{
+ struct sgx_encl_page *encl_page;
+ struct sgx_epc_page *epc_page;
+
+ spin_lock(&lru->lock);
+ for (; *nr_to_scan > 0; --(*nr_to_scan)) {
+ if (list_empty(&lru->reclaimable))
+ break;
+
+ epc_page = sgx_epc_peek_reclaimable(lru);
+ if (!epc_page)
+ break;
+
+ encl_page = epc_page->encl_owner;
+
+ if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
+ continue;
+
+ if (kref_get_unless_zero(&encl_page->encl->refcount)) {
+ epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
+ list_move_tail(&epc_page->list, dst);
+ } else {
+ /* The owner is freeing the page, remove it from the
+ * LRU list
+ */
+ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ list_del_init(&epc_page->list);
+ }
+ }
+ spin_unlock(&lru->lock);
+}
+
+/**
* sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
* @nr_to_scan: Number of EPC pages to scan for reclaim
* @ignore_age: Reclaim a page even if it is young
@@ -305,37 +344,14 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
struct sgx_epc_lru *lru;
pgoff_t page_index;
LIST_HEAD(iso);
+ int i = 0;
int ret;
- int i;
-
- spin_lock(&sgx_global_lru.lock);
- for (i = 0; i < nr_to_scan; i++) {
- epc_page = sgx_epc_peek_reclaimable(&sgx_global_lru);
- if (!epc_page)
- break;
-
- encl_page = epc_page->encl_owner;

- if (WARN_ON_ONCE(!(epc_page->flags & SGX_EPC_PAGE_ENCLAVE)))
- continue;
-
- if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
- epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
- list_move_tail(&epc_page->list, &iso);
- } else {
- /* The owner is freeing the page, remove it from the
- * LRU list
- */
- epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_del_init(&epc_page->list);
- }
- }
- spin_unlock(&sgx_global_lru.lock);
+ sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);

if (list_empty(&iso))
return 0;

- i = 0;
list_for_each_entry_safe(epc_page, tmp, &iso, list) {
encl_page = epc_page->encl_owner;

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index ca51b3c7d905..29c37f20792c 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -182,6 +182,8 @@ void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
int sgx_drop_epc_page(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
+void sgx_isolate_epc_pages(struct sgx_epc_lru *lru, int *nr_to_scan,
+ struct list_head *dst);

void sgx_ipi_cb(void *info);

--
2.37.3


2022-11-11 19:46:17

by Kristen Carlson Accardi

Subject: [PATCH 06/26] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages

From: Sean Christopherson <[email protected]>

Keep track of whether an EPC page is in the middle of being reclaimed,
and do not delete the page from its LRU if it has not yet finished
being reclaimed.
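The new flag and the changed sgx_drop_epc_page() check can be sketched standalone. Flag values mirror the sgx.h hunk below; SGX_EPC_PAGE_RECLAIMER_TRACKED is assumed to be BIT(0), and the locking and list handling of the real function are omitted.

```c
#include <assert.h>
#include <errno.h>

#define BIT(n)	(1UL << (n))

#define SGX_EPC_PAGE_RECLAIMER_TRACKED		BIT(0)
#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS	BIT(3)

struct epc_page { unsigned long flags; };

/* Mirrors the check this patch adds to sgx_drop_epc_page(): a page the
 * reclaimer has already isolated cannot be dropped and reports -EBUSY,
 * while a merely-tracked page is untracked and dropped normally. */
static int drop_epc_page(struct epc_page *page)
{
	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
		if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)
			return -EBUSY;
		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
	}
	return 0;
}
```

Testing an explicit flag replaces the previous, more fragile list_empty() heuristic for "the reclaimer owns this page right now".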

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
arch/x86/kernel/cpu/sgx/main.c | 14 +++++++++-----
arch/x86/kernel/cpu/sgx/sgx.h | 4 ++++
2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 3b09433ffd85..8c451071fa91 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -305,13 +305,15 @@ static void __sgx_reclaim_pages(void)

encl_page = epc_page->encl_owner;

- if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+ if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+ epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
chunk[cnt++] = epc_page;
- else
+ } else {
/* The owner is freeing the page. No need to add the
* page back to the list of reclaimable pages.
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ }
}
spin_unlock(&sgx_global_lru.lock);

@@ -337,6 +339,7 @@ static void __sgx_reclaim_pages(void)

skip:
spin_lock(&sgx_global_lru.lock);
+ epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
spin_unlock(&sgx_global_lru.lock);

@@ -360,7 +363,8 @@ static void __sgx_reclaim_pages(void)
sgx_reclaimer_write(epc_page, &backing[i]);

kref_put(&encl_page->encl->refcount, sgx_encl_release);
- epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
+ SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);

sgx_free_epc_page(epc_page);
}
@@ -508,7 +512,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
{
spin_lock(&sgx_global_lru.lock);
- WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+ WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIM_FLAGS);
page->flags |= flags;
if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
sgx_epc_push_reclaimable(&sgx_global_lru, page);
@@ -532,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
- if (list_empty(&page->list)) {
+ if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 969606615211..04ca644928a8 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -30,6 +30,10 @@
#define SGX_EPC_PAGE_IS_FREE BIT(1)
/* Pages allocated for KVM guest */
#define SGX_EPC_PAGE_KVM_GUEST BIT(2)
+/* page flag to indicate reclaim is in progress */
+#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
+#define SGX_EPC_PAGE_RECLAIM_FLAGS (SGX_EPC_PAGE_RECLAIMER_TRACKED | \
+ SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)

struct sgx_epc_page {
unsigned int section;
--
2.37.3


2022-11-11 19:47:04

by Kristen Carlson Accardi

Subject: [PATCH 25/26] x86/sgx: Add support for misc cgroup controller

Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.

Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem). The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.

The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".
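The misc controller accounts the "sgx_epc" resource in bytes, while the EPC code works in pages. The conversion, mirroring the sgx_epc_cgroup_max_pages() helper in the diff below with a fixed 4 KiB page size assumed, is simply:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL	/* x86 EPC pages are 4 KiB */

/* misc.max is written in bytes, e.g. a 64 MiB cap; internally the EPC
 * cgroup code converts that to a page count. */
static unsigned long epc_bytes_to_pages(unsigned long bytes)
{
	return bytes / PAGE_SIZE;
}
```

Using the standard misc controller syntax, an administrator would write something like `echo "sgx_epc 67108864" > misc.max` to cap a cgroup at 64 MiB of EPC, which the EPC code sees as 16384 pages.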

This patch was modified from its original version to use the misc cgroup
controller instead of a custom controller.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 561 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 59 +++
arch/x86/kernel/cpu/sgx/main.c | 86 +++-
arch/x86/kernel/cpu/sgx/sgx.h | 5 +-
6 files changed, 709 insertions(+), 16 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f9920f1341c8..0eeae4ebe1c3 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1936,6 +1936,19 @@ config X86_SGX

If unsure, say N.

+config CGROUP_SGX_EPC
+ bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+ depends on X86_SGX && CGROUP_MISC
+ help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..03c0fa42880c
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,561 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/ratelimit.h>
+#include <linux/sched/signal.h>
+#include <linux/slab.h>
+#include <linux/threads.h>
+
+#include "epc_cgroup.h"
+
+#define SGX_EPC_RECLAIM_MIN_PAGES 16UL
+#define SGX_EPC_RECLAIM_MAX_PAGES 64UL
+#define SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD 5
+#define SGX_EPC_RECLAIM_OOM_THRESHOLD 5
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+struct sgx_epc_reclaim_control {
+ struct sgx_epc_cgroup *epc_cg;
+ int nr_fails;
+ bool ignore_age;
+};
+
+static inline unsigned long sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+ return misc_cg_read(MISC_CG_RES_SGX_EPC, epc_cg->cg) / PAGE_SIZE;
+}
+
+static inline unsigned long sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+ return misc_cg_max(MISC_CG_RES_SGX_EPC, epc_cg->cg) / PAGE_SIZE;
+}
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_epc_cgroup *)misc_cg_get_priv(MISC_CG_RES_SGX_EPC, cg);
+}
+
+static inline struct sgx_epc_cgroup *parent_epc_cgroup(struct sgx_epc_cgroup *epc_cg)
+{
+ return sgx_epc_cgroup_from_misc_cg(parent_misc(epc_cg->cg));
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+ return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_iter - iterate over the EPC cgroup hierarchy
+ * @root: hierarchy root
+ * @prev: previously returned epc_cg, NULL on first invocation
+ * @reclaim_epoch: epoch for shared reclaim walks, NULL for full walks
+ *
+ * Return: references to children of the hierarchy below @root, or
+ * @root itself, or %NULL after a full round-trip.
+ *
+ * Caller must pass the return value in @prev on subsequent invocations
+ * for reference counting, or use sgx_epc_cgroup_iter_break() to cancel
+ * a hierarchy walk before the round-trip is complete.
+ */
+static struct sgx_epc_cgroup *sgx_epc_cgroup_iter(struct sgx_epc_cgroup *prev,
+ struct sgx_epc_cgroup *root,
+ unsigned long *reclaim_epoch)
+{
+ struct cgroup_subsys_state *css = NULL;
+ struct sgx_epc_cgroup *epc_cg = NULL;
+ struct sgx_epc_cgroup *pos = NULL;
+ bool inc_epoch = false;
+
+ if (sgx_epc_cgroup_disabled())
+ return NULL;
+
+ if (!root)
+ root = sgx_epc_cgroup_from_misc_cg(root_misc());
+
+ if (prev && !reclaim_epoch)
+ pos = prev;
+
+ rcu_read_lock();
+
+start:
+ if (reclaim_epoch) {
+ /*
+ * Abort the walk if a reclaimer working from the same root has
+ * started a new walk after this reclaimer has already scanned
+ * at least one cgroup.
+ */
+ if (prev && *reclaim_epoch != root->epoch)
+ goto out;
+
+ while (1) {
+ pos = READ_ONCE(root->reclaim_iter);
+ if (!pos || misc_cg_tryget(pos->cg))
+ break;
+
+ /*
+ * The css is dying, clear the reclaim_iter immediately
+ * instead of waiting for ->css_released to be called.
+ * Busy waiting serves no purpose and attempting to wait
+ * for ->css_released may actually block it from being
+ * called.
+ */
+ (void)cmpxchg(&root->reclaim_iter, pos, NULL);
+ }
+ }
+
+ if (pos)
+ css = &pos->cg->css;
+
+ while (!epc_cg) {
+ struct misc_cg *cg;
+
+ css = css_next_descendant_pre(css, &root->cg->css);
+ if (!css) {
+ /*
+ * Increment the epoch as we've reached the end of the
+ * tree and the next call to css_next_descendant_pre
+ * will restart at root. Do not update root->epoch
+ * directly as we should only do so if we update the
+ * reclaim_iter, i.e. a different thread may win the
+ * race and update the epoch for us.
+ */
+ inc_epoch = true;
+
+ /*
+ * Reclaimers share the hierarchy walk, and a new one
+ * might jump in at the end of the hierarchy. Restart
+ * at root so that we don't return NULL on a thread's
+ * initial call.
+ */
+ if (!prev)
+ continue;
+ break;
+ }
+
+ cg = css_misc(css);
+ /*
+ * Verify the css and acquire a reference. Don't take an
+ * extra reference to root as it's either the global root
+ * or is provided by the caller and so is guaranteed to be
+ * alive. Keep walking if this css is dying.
+ */
+ if (cg != root->cg && !misc_cg_tryget(cg))
+ continue;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+ }
+
+ if (reclaim_epoch) {
+ /*
+ * reclaim_iter could have already been updated by a competing
+ * thread; check that the value hasn't changed since we read
+ * it to avoid reclaiming from the same cgroup twice. If the
+ * value did change, put all of our references and restart the
+ * entire process, for all intents and purposes we're making a
+ * new call.
+ */
+ if (cmpxchg(&root->reclaim_iter, pos, epc_cg) != pos) {
+ if (epc_cg && epc_cg != root)
+ put_misc_cg(epc_cg->cg);
+ if (pos)
+ put_misc_cg(pos->cg);
+ css = NULL;
+ epc_cg = NULL;
+ inc_epoch = false;
+ goto start;
+ }
+
+ if (inc_epoch)
+ root->epoch++;
+ if (!prev)
+ *reclaim_epoch = root->epoch;
+
+ if (pos)
+ put_misc_cg(pos->cg);
+ }
+
+out:
+ rcu_read_unlock();
+ if (prev && prev != root)
+ put_misc_cg(prev->cg);
+
+ return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_iter_break - abort a hierarchy walk prematurely
+ * @prev: last visited cgroup as returned by sgx_epc_cgroup_iter()
+ * @root: hierarchy root
+ */
+static void sgx_epc_cgroup_iter_break(struct sgx_epc_cgroup *prev,
+ struct sgx_epc_cgroup *root)
+{
+ if (!root)
+ root = sgx_epc_cgroup_from_misc_cg(root_misc());
+ if (prev && prev != root)
+ put_misc_cg(prev->cg);
+}
+
+/**
+ * sgx_epc_cgroup_lru_empty - check if a cgroup tree has no pages on its lrus
+ * @root: root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+ epc_cg;
+ epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+ if (!list_empty(&epc_cg->lru.reclaimable)) {
+ sgx_epc_cgroup_iter_break(epc_cg, root);
+ return false;
+ }
+ }
+ return true;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages - walk a cgroup tree and separate pages
+ * @root: root of the tree to start walking
+ * @nr_to_scan: The number of pages that need to be isolated
+ * @dst: Destination list to hold the isolated pages
+ *
+ * Walk the cgroup tree and isolate the pages in the hierarchy
+ * for reclaiming.
+ */
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+ int *nr_to_scan, struct list_head *dst)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ unsigned long epoch;
+
+ if (!*nr_to_scan)
+ return;
+
+ for (epc_cg = sgx_epc_cgroup_iter(NULL, root, &epoch);
+ epc_cg;
+ epc_cg = sgx_epc_cgroup_iter(epc_cg, root, &epoch)) {
+ sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+ if (!*nr_to_scan) {
+ sgx_epc_cgroup_iter_break(epc_cg, root);
+ break;
+ }
+ }
+}
+
+static int sgx_epc_cgroup_reclaim_pages(unsigned long nr_pages,
+ struct sgx_epc_reclaim_control *rc)
+{
+ /*
+ * Ensure sgx_reclaim_pages is called with a minimum and maximum
+ * number of pages. Attempting to reclaim only a few pages will
+ * often fail and is inefficient, while reclaiming a huge number
+ * of pages can result in soft lockups due to holding various
+ * locks for an extended duration.
+ */
+ nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+ nr_pages = min(nr_pages, SGX_EPC_RECLAIM_MAX_PAGES);
+
+ return sgx_reclaim_epc_pages(nr_pages, rc->ignore_age, rc->epc_cg);
+}
+
+static int sgx_epc_cgroup_reclaim_failed(struct sgx_epc_reclaim_control *rc)
+{
+ if (sgx_epc_cgroup_lru_empty(rc->epc_cg))
+ return -ENOMEM;
+
+ ++rc->nr_fails;
+ if (rc->nr_fails > SGX_EPC_RECLAIM_IGNORE_AGE_THRESHOLD)
+ rc->ignore_age = true;
+
+ return 0;
+}
+
+static inline
+void sgx_epc_reclaim_control_init(struct sgx_epc_reclaim_control *rc,
+ struct sgx_epc_cgroup *epc_cg)
+{
+ rc->epc_cg = epc_cg;
+ rc->nr_fails = 0;
+ rc->ignore_age = false;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the
+ * cgroup when the cgroup is at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_epc_reclaim_control rc;
+ struct sgx_epc_cgroup *epc_cg;
+ unsigned long cur, max;
+
+ epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+ sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+ for (;;) {
+ max = sgx_epc_cgroup_max_pages(epc_cg);
+
+ /*
+ * Adjust the limit down by one page, the goal is to free up
+ * pages for fault allocations, not to simply obey the limit.
+ * Conditionally decrementing max also means the cur vs. max
+ * check will correctly handle the case where both are zero.
+ */
+ if (max)
+ max--;
+
+ /*
+ * Unless the limit is extremely low, in which case forcing
+ * reclaim will likely cause thrashing, force the cgroup to
+ * reclaim at least once if it's operating *near* its maximum
+ * limit by adjusting @max down by half the min reclaim size.
+ * This work func is scheduled by sgx_epc_cgroup_try_charge
+ * when it cannot directly reclaim due to being in an atomic
+ * context, e.g. EPC allocation in a fault handler. Waiting
+ * to reclaim until the cgroup is actually at its limit is less
+ * performant as it means the faulting task is effectively
+ * blocked until a worker makes its way through the global work
+ * queue.
+ */
+ if (max > SGX_EPC_RECLAIM_MAX_PAGES)
+ max -= (SGX_EPC_RECLAIM_MIN_PAGES/2);
+
+ cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+ if (cur <= max)
+ break;
+
+ if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+ if (sgx_epc_cgroup_reclaim_failed(&rc))
+ break;
+ }
+ }
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+ unsigned long nr_pages, bool reclaim)
+{
+ struct sgx_epc_reclaim_control rc;
+ unsigned long cur, max, over;
+ unsigned int nr_empty = 0;
+
+ if (epc_cg == sgx_epc_cgroup_from_misc_cg(root_misc())) {
+ misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+ nr_pages * PAGE_SIZE);
+ return 0;
+ }
+
+ sgx_epc_reclaim_control_init(&rc, NULL);
+
+ for (;;) {
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+ nr_pages * PAGE_SIZE))
+ break;
+
+ rc.epc_cg = epc_cg;
+ max = sgx_epc_cgroup_max_pages(rc.epc_cg);
+ if (nr_pages > max)
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ if (!reclaim) {
+ queue_work(sgx_epc_cg_wq, &rc.epc_cg->reclaim_work);
+ return -EBUSY;
+ }
+
+ cur = sgx_epc_cgroup_page_counter_read(rc.epc_cg);
+ over = ((cur + nr_pages) > max) ?
+ (cur + nr_pages) - max : SGX_EPC_RECLAIM_MIN_PAGES;
+
+ if (!sgx_epc_cgroup_reclaim_pages(over, &rc)) {
+ if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+ if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+ return -ENOMEM;
+ schedule();
+ }
+ }
+ }
+
+ css_get_many(&epc_cg->cg->css, nr_pages);
+
+ return 0;
+}
+
+
+/**
+ * sgx_epc_cgroup_try_charge - hierarchically try to charge a single EPC page
+ * @mm: the mm_struct of the process to charge
+ * @reclaim: whether or not synchronous reclaim is allowed
+ *
+ * Returns the EPC cgroup (or NULL if disabled) on success, or an
+ * ERR_PTR() encoded -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+ bool reclaim)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ int ret;
+
+ if (sgx_epc_cgroup_disabled())
+ return NULL;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+ ret = __sgx_epc_cgroup_try_charge(epc_cg, 1, reclaim);
+ put_misc_cg(epc_cg->cg);
+
+ if (ret)
+ return ERR_PTR(ret);
+
+ return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge - hierarchically uncharge a single EPC page
+ * @epc_cg: the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+ if (sgx_epc_cgroup_disabled())
+ return;
+
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+ if (epc_cg->cg != root_misc())
+ put_misc_cg(epc_cg->cg);
+}
+
+static void sgx_epc_cgroup_oom(struct sgx_epc_cgroup *root)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ for (epc_cg = sgx_epc_cgroup_iter(NULL, root, NULL);
+ epc_cg;
+ epc_cg = sgx_epc_cgroup_iter(epc_cg, root, NULL)) {
+ if (sgx_epc_oom(&epc_cg->lru)) {
+ sgx_epc_cgroup_iter_break(epc_cg, root);
+ return;
+ }
+ }
+}
+
+static void sgx_epc_cgroup_release(struct sgx_epc_cgroup *epc_cg)
+{
+ struct sgx_epc_cgroup *dead_cg = epc_cg;
+
+ while ((epc_cg = parent_epc_cgroup(epc_cg)))
+ cmpxchg(&epc_cg->reclaim_iter, dead_cg, NULL);
+}
+
+static void sgx_epc_cgroup_free(struct sgx_epc_cgroup *epc_cg)
+{
+ cancel_work_sync(&epc_cg->reclaim_work);
+ kfree(epc_cg);
+}
+
+static struct sgx_epc_cgroup *sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = kzalloc(sizeof(struct sgx_epc_cgroup), GFP_KERNEL);
+ if (!epc_cg)
+ return ERR_PTR(-ENOMEM);
+
+ sgx_lru_init(&epc_cg->lru);
+ INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
+ epc_cg->cg = cg;
+ misc_cg_set_priv(MISC_CG_RES_SGX_EPC, cg, epc_cg);
+
+ return epc_cg;
+}
+
+static void sgx_epc_cgroup_max_write(struct sgx_epc_cgroup *epc_cg)
+{
+ struct sgx_epc_reclaim_control rc;
+ unsigned int nr_empty = 0;
+ unsigned long cur, max;
+
+ sgx_epc_reclaim_control_init(&rc, epc_cg);
+
+ max = sgx_epc_cgroup_max_pages(epc_cg);
+
+ for (;;) {
+ cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+ if (cur <= max)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ if (!sgx_epc_cgroup_reclaim_pages(cur - max, &rc)) {
+ if (sgx_epc_cgroup_reclaim_failed(&rc)) {
+ if (++nr_empty > SGX_EPC_RECLAIM_OOM_THRESHOLD)
+ sgx_epc_cgroup_oom(epc_cg);
+ schedule();
+ }
+ }
+ }
+}
+
+static int sgx_epc_cgroup_callback(struct notifier_block *nb,
+ unsigned long val, void *data)
+{
+ struct misc_cg *cg = data;
+ struct sgx_epc_cgroup *epc_cg;
+
+ if (val == MISC_CG_ALLOC) {
+ epc_cg = sgx_epc_cgroup_alloc(cg);
+ if (IS_ERR(epc_cg))
+ return NOTIFY_BAD;
+
+ return NOTIFY_OK;
+ }
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+
+ if (val == MISC_CG_FREE) {
+ sgx_epc_cgroup_free(epc_cg);
+ return NOTIFY_OK;
+ } else if (val == MISC_CG_CHANGE) {
+ sgx_epc_cgroup_max_write(epc_cg);
+ return NOTIFY_OK;
+ } else if (val == MISC_CG_RELEASED) {
+ sgx_epc_cgroup_release(epc_cg);
+ return NOTIFY_OK;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block sgx_epc_cg_nb = {
+ .notifier_call = sgx_epc_cgroup_callback,
+ .priority = 0,
+};
+
+static int __init sgx_epc_cgroup_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_SGX))
+ return 0;
+
+ sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+ WQ_UNBOUND | WQ_FREEZABLE,
+ WQ_UNBOUND_MAX_ACTIVE);
+ BUG_ON(!sgx_epc_cg_wq);
+
+ sgx_epc_cgroup_alloc(root_misc());
+
+ register_misc_cg_notifier(&sgx_epc_cg_nb);
+
+ return 0;
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..a8c631ee6fac
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+ bool reclaim)
+{
+ return NULL;
+}
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+static inline void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+ int *nr_to_scan,
+ struct list_head *dst) { }
+static inline struct sgx_epc_lru *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+ return NULL;
+}
+static inline bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root)
+{
+ return true;
+}
+#else
+struct sgx_epc_cgroup {
+ struct misc_cg *cg;
+ struct sgx_epc_lru lru;
+ struct sgx_epc_cgroup *reclaim_iter;
+ struct work_struct reclaim_work;
+ unsigned int epoch;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(struct mm_struct *mm,
+ bool reclaim);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct sgx_epc_cgroup *root);
+void sgx_epc_cgroup_isolate_pages(struct sgx_epc_cgroup *root,
+ int *nr_to_scan, struct list_head *dst);
+static inline struct sgx_epc_lru *epc_cg_lru(struct sgx_epc_cgroup *epc_cg)
+{
+ if (epc_cg)
+ return &epc_cg->lru;
+ return NULL;
+}
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 5a511046ad38..b9b55068f87f 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"

#define SGX_MAX_NR_TO_RECLAIM 32

@@ -33,9 +35,20 @@ static DEFINE_XARRAY(sgx_epc_address_space);
static struct sgx_epc_lru sgx_global_lru;
static inline struct sgx_epc_lru *sgx_lru(struct sgx_epc_page *epc_page)
{
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ return epc_cg_lru(epc_page->epc_cg);
+
return &sgx_global_lru;
}

+static inline bool sgx_can_reclaim(void)
+{
+ if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ return !list_empty(&sgx_global_lru.reclaimable);
+
+ return !sgx_epc_cgroup_lru_empty(NULL);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);

/* Nodes with one or more EPC sections. */
@@ -320,9 +333,10 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru *lru, int *nr_to_scan,
}

/**
- * sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
+ * __sgx_reclaim_epc_pages() - Reclaim EPC pages from the consumers
* @nr_to_scan: Number of EPC pages to scan for reclaim
* @ignore_age: Reclaim a page even if it is young
+ * @epc_cg: EPC cgroup from which to reclaim
*
* Take a fixed number of pages from the head of the active page pool and
* reclaim them to the enclave's private shmem files. Skip the pages, which have
@@ -336,7 +350,8 @@ void sgx_isolate_epc_pages(struct sgx_epc_lru *lru, int *nr_to_scan,
* problematic as it would increase the lock contention too much, which would
* halt forward progress.
*/
-static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
+static int __sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+ struct sgx_epc_cgroup *epc_cg)
{
struct sgx_backing backing[SGX_MAX_NR_TO_RECLAIM];
struct sgx_epc_page *epc_page, *tmp;
@@ -347,7 +362,15 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
int i = 0;
int ret;

- sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+ /*
+ * If a specific cgroup is not being targeted, take from the global
+ * list first, even when cgroups are enabled. If there are
+ * pages on the global LRU then they should get reclaimed asap.
+ */
+ if (!IS_ENABLED(CONFIG_CGROUP_SGX_EPC) || !epc_cg)
+ sgx_isolate_epc_pages(&sgx_global_lru, &nr_to_scan, &iso);
+
+ sgx_epc_cgroup_isolate_pages(epc_cg, &nr_to_scan, &iso);

if (list_empty(&iso))
return 0;
@@ -394,25 +417,33 @@ static int __sgx_reclaim_pages(int nr_to_scan, bool ignore_age)
kref_put(&encl_page->encl->refcount, sgx_encl_release);
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_FLAGS;

+ if (epc_page->epc_cg) {
+ sgx_epc_cgroup_uncharge(epc_page->epc_cg);
+ epc_page->epc_cg = NULL;
+ }
+
sgx_free_epc_page(epc_page);
}
return i;
}

-int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age)
+/**
+ * sgx_reclaim_epc_pages() - wrapper around __sgx_reclaim_epc_pages() that
+ * calls cond_resched() upon completion.
+ * @nr_to_scan: Number of EPC pages to scan for reclaim
+ * @ignore_age: Reclaim a page even if it is young
+ * @epc_cg: EPC cgroup from which to reclaim
+ */
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+ struct sgx_epc_cgroup *epc_cg)
{
int ret;

- ret = __sgx_reclaim_pages(nr_to_scan, ignore_age);
+ ret = __sgx_reclaim_epc_pages(nr_to_scan, ignore_age, epc_cg);
cond_resched();
return ret;
}

-static bool sgx_can_reclaim(void)
-{
- return !list_empty(&sgx_global_lru.reclaimable);
-}
-
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
@@ -429,7 +460,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- __sgx_reclaim_pages(SGX_NR_TO_SCAN, false);
+ __sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
}

static int ksgxd(void *p)
@@ -455,7 +486,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));

if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+ sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
}

return 0;
@@ -613,6 +644,11 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
struct sgx_epc_page *page;
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = sgx_epc_cgroup_try_charge(current->mm, reclaim);
+ if (IS_ERR(epc_cg))
+ return ERR_CAST(epc_cg);

for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -621,8 +657,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}

- if (!sgx_can_reclaim())
- return ERR_PTR(-ENOMEM);
+ if (!sgx_can_reclaim()) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }

if (!reclaim) {
page = ERR_PTR(-EBUSY);
@@ -634,7 +672,14 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}

- sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false);
+ sgx_reclaim_epc_pages(SGX_NR_TO_SCAN, false, NULL);
+ }
+
+ if (!IS_ERR(page)) {
+ WARN_ON(page->epc_cg);
+ page->epc_cg = epc_cg;
+ } else {
+ sgx_epc_cgroup_uncharge(epc_cg);
}

if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
@@ -667,6 +712,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
page->flags = SGX_EPC_PAGE_IS_FREE;

spin_unlock(&node->lock);
+
+ if (page->epc_cg) {
+ sgx_epc_cgroup_uncharge(page->epc_cg);
+ page->epc_cg = NULL;
+ }
+
atomic_long_inc(&sgx_nr_free_pages);
}

@@ -831,6 +882,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].encl_owner = NULL;
section->pages[i].poison = 0;
+ section->pages[i].epc_cg = NULL;
list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
}

@@ -995,6 +1047,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -1045,6 +1098,7 @@ static bool __init sgx_page_cache_init(void)

sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;

sgx_nr_epc_sections++;
}
@@ -1054,6 +1108,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}

+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}

diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index db09a8a0ea6e..4059dd74b0d4 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -40,6 +40,7 @@
SGX_EPC_PAGE_RECLAIM_IN_PROGRESS | \
SGX_EPC_PAGE_ENCLAVE | \
SGX_EPC_PAGE_VERSION_ARRAY)
+struct sgx_epc_cgroup;

struct sgx_epc_page {
unsigned int section;
@@ -53,6 +54,7 @@ struct sgx_epc_page {
struct sgx_encl *encl;
};
struct list_head list;
+ struct sgx_epc_cgroup *epc_cg;
};

/*
@@ -181,7 +183,8 @@ void sgx_reclaim_direct(void);
void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags);
int sgx_drop_epc_page(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
-int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age);
+int sgx_reclaim_epc_pages(int nr_to_scan, bool ignore_age,
+ struct sgx_epc_cgroup *epc_cg);
void sgx_isolate_epc_pages(struct sgx_epc_lru *lru, int *nr_to_scan,
struct list_head *dst);
bool sgx_epc_oom(struct sgx_epc_lru *lru);
--
2.37.3


2022-11-14 22:46:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 25/26] x86/sgx: Add support for misc cgroup controller

On Fri, Nov 11, 2022 at 10:35:30AM -0800, Kristen Carlson Accardi wrote:
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem). The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from its original version to use the misc cgroup
> controller instead of a custom controller.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Cc: Sean Christopherson <[email protected]>

This looks fine from cgroup POV.

Thanks.

--
tejun

2022-11-14 23:01:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 22/26] cgroup/misc: Add private per cgroup data to struct misc_cg

On Fri, Nov 11, 2022 at 10:35:27AM -0800, Kristen Carlson Accardi wrote:
> +void *misc_cg_get_priv(enum misc_res_type type, struct misc_cg *cg)
> +{
> + if (!(valid_type(type) && cg))
> + return NULL;
> +
> + return cg->res[type].priv;
> +}
> +EXPORT_SYMBOL_GPL(misc_cg_get_priv);

Yeah, just deref it. I'm not sure what all these accessors are contributing.

Thanks.

--
tejun

2022-11-15 23:40:42

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 01/26] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()

On Fri, Nov 11, 2022 at 10:35:06AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <[email protected]>
>
> In order to avoid repetition of cond_resched() in ksgxd() and
> sgx_alloc_epc_page(), move the invocation of post-reclaim cond_resched()
> inside sgx_reclaim_pages(). Except in the case of sgx_reclaim_direct(),
> sgx_reclaim_pages() is always called in a loop and is always followed
> by a call to cond_resched(). This will hold true for the EPC cgroup
> as well, which adds even more calls to sgx_reclaim_pages() and thus
> cond_resched(). Calls to sgx_reclaim_direct() may be performance
> sensitive. Allow sgx_reclaim_direct() to avoid the cond_resched()
> call by moving the original sgx_reclaim_pages() call to
> __sgx_reclaim_pages() and then have sgx_reclaim_pages() become a
> wrapper around that call with a cond_resched().
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
> 1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 160c8dbee0ab..ffce6fc70a1f 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> * problematic as it would increase the lock contention too much, which would
> * halt forward progress.
> */
> -static void sgx_reclaim_pages(void)
> +static void __sgx_reclaim_pages(void)
> {
> struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
> }
> }
>
> +static void sgx_reclaim_pages(void)
> +{
> + __sgx_reclaim_pages();
> + cond_resched();
> +}
> +
> static bool sgx_should_reclaim(unsigned long watermark)
> {
> return atomic_long_read(&sgx_nr_free_pages) < watermark &&
> @@ -378,12 +384,14 @@ static bool sgx_should_reclaim(unsigned long watermark)
> /*
> * sgx_reclaim_direct() should be called (without enclave's mutex held)
> * in locations where SGX memory resources might be low and might be
> - * needed in order to make forward progress.
> + * needed in order to make forward progress. This call to
> + * __sgx_reclaim_pages() avoids the cond_resched() in sgx_reclaim_pages()
> + * to improve performance.
> */
> void sgx_reclaim_direct(void)
> {
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> - sgx_reclaim_pages();
> + __sgx_reclaim_pages();

Is it a big deal to have "extra" cond_resched?

> }
>
> static int ksgxd(void *p)
> @@ -410,8 +418,6 @@ static int ksgxd(void *p)
>
> if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
> sgx_reclaim_pages();
> -
> - cond_resched();
> }
>
> return 0;
> @@ -582,7 +588,6 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> }
>
> sgx_reclaim_pages();
> - cond_resched();
> }
>
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> --
> 2.37.3
>

BR, Jarkko

2022-11-16 00:10:21

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 06/26] x86/sgx: Introduce RECLAIM_IN_PROGRESS flag for EPC pages

On Fri, Nov 11, 2022 at 10:35:11AM -0800, Kristen Carlson Accardi wrote:
> From: Sean Christopherson <[email protected]>
>
> Keep track of whether the EPC page is in the middle of being reclaimed
> and do not delete the page off the it's LRU if it has not yet finished

"off the it's LRU" ?

> being reclaimed.

I'm not sure how the description makes the change understandable.

>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/main.c | 14 +++++++++-----
> arch/x86/kernel/cpu/sgx/sgx.h | 4 ++++
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 3b09433ffd85..8c451071fa91 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -305,13 +305,15 @@ static void __sgx_reclaim_pages(void)
>
> encl_page = epc_page->encl_owner;
>
> - if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> + if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> + epc_page->flags |= SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> chunk[cnt++] = epc_page;
> - else
> + } else {
> /* The owner is freeing the page. No need to add the
> * page back to the list of reclaimable pages.
> */
> epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + }
> }
> spin_unlock(&sgx_global_lru.lock);
>
> @@ -337,6 +339,7 @@ static void __sgx_reclaim_pages(void)
>
> skip:
> spin_lock(&sgx_global_lru.lock);
> + epc_page->flags &= ~SGX_EPC_PAGE_RECLAIM_IN_PROGRESS;
> sgx_epc_push_reclaimable(&sgx_global_lru, epc_page);
> spin_unlock(&sgx_global_lru.lock);
>
> @@ -360,7 +363,8 @@ static void __sgx_reclaim_pages(void)
> sgx_reclaimer_write(epc_page, &backing[i]);
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + epc_page->flags &= ~(SGX_EPC_PAGE_RECLAIMER_TRACKED |
> + SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
>
> sgx_free_epc_page(epc_page);
> }
> @@ -508,7 +512,7 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> void sgx_record_epc_page(struct sgx_epc_page *page, unsigned long flags)
> {
> spin_lock(&sgx_global_lru.lock);
> - WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + WARN_ON(page->flags & SGX_EPC_PAGE_RECLAIM_FLAGS);

Please, open code SGX_EPC_PAGE_RECLAIM_FLAGS. It only adds an unnecessary
need to cross-reference the header file.

Also, please describe the changes on how state flags are used before
and after this patch to the commit message.

> page->flags |= flags;
> if (flags & SGX_EPC_PAGE_RECLAIMER_TRACKED)
> sgx_epc_push_reclaimable(&sgx_global_lru, page);
> @@ -532,7 +536,7 @@ int sgx_drop_epc_page(struct sgx_epc_page *page)
> spin_lock(&sgx_global_lru.lock);
> if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> /* The page is being reclaimed. */
> - if (list_empty(&page->list)) {
> + if (page->flags & SGX_EPC_PAGE_RECLAIM_IN_PROGRESS) {
> spin_unlock(&sgx_global_lru.lock);
> return -EBUSY;
> }
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 969606615211..04ca644928a8 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -30,6 +30,10 @@
> #define SGX_EPC_PAGE_IS_FREE BIT(1)
> /* Pages allocated for KVM guest */
> #define SGX_EPC_PAGE_KVM_GUEST BIT(2)
> +/* page flag to indicate reclaim is in progress */
> +#define SGX_EPC_PAGE_RECLAIM_IN_PROGRESS BIT(3)
> +#define SGX_EPC_PAGE_RECLAIM_FLAGS (SGX_EPC_PAGE_RECLAIMER_TRACKED | \
> + SGX_EPC_PAGE_RECLAIM_IN_PROGRESS)
>
> struct sgx_epc_page {
> unsigned int section;
> --
> 2.37.3
>

BR, Jarkko

2022-11-16 01:39:21

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH 01/26] x86/sgx: Call cond_resched() at the end of sgx_reclaim_pages()

Hi Jarkko,

On 11/15/2022 3:27 PM, Jarkko Sakkinen wrote:
> On Fri, Nov 11, 2022 at 10:35:06AM -0800, Kristen Carlson Accardi wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> In order to avoid repetition of cond_resched() in ksgxd() and
>> sgx_alloc_epc_page(), move the invocation of post-reclaim cond_resched()
>> inside sgx_reclaim_pages(). Except in the case of sgx_reclaim_direct(),
>> sgx_reclaim_pages() is always called in a loop and is always followed
>> by a call to cond_resched(). This will hold true for the EPC cgroup
>> as well, which adds even more calls to sgx_reclaim_pages() and thus
>> cond_resched(). Calls to sgx_reclaim_direct() may be performance
>> sensitive. Allow sgx_reclaim_direct() to avoid the cond_resched()
>> call by moving the original sgx_reclaim_pages() call to
>> __sgx_reclaim_pages() and then have sgx_reclaim_pages() become a
>> wrapper around that call with a cond_resched().
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Cc: Sean Christopherson <[email protected]>
>> ---
>> arch/x86/kernel/cpu/sgx/main.c | 17 +++++++++++------
>> 1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
>> index 160c8dbee0ab..ffce6fc70a1f 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -287,7 +287,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
>> * problematic as it would increase the lock contention too much, which would
>> * halt forward progress.
>> */
>> -static void sgx_reclaim_pages(void)
>> +static void __sgx_reclaim_pages(void)
>> {
>> struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>> struct sgx_backing backing[SGX_NR_TO_SCAN];
>> @@ -369,6 +369,12 @@ static void sgx_reclaim_pages(void)
>> }
>> }
>>
>> +static void sgx_reclaim_pages(void)
>> +{
>> + __sgx_reclaim_pages();
>> + cond_resched();
>> +}
>> +
>> static bool sgx_should_reclaim(unsigned long watermark)
>> {
>> return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>> @@ -378,12 +384,14 @@ static bool sgx_should_reclaim(unsigned long watermark)
>> /*
>> * sgx_reclaim_direct() should be called (without enclave's mutex held)
>> * in locations where SGX memory resources might be low and might be
>> - * needed in order to make forward progress.
>> + * needed in order to make forward progress. This call to
>> + * __sgx_reclaim_pages() avoids the cond_resched() in sgx_reclaim_pages()
>> + * to improve performance.
>> */
>> void sgx_reclaim_direct(void)
>> {
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> - sgx_reclaim_pages();
>> + __sgx_reclaim_pages();
>
> Is it a big deal to have "extra" cond_resched?
>

sgx_reclaim_direct() is used to ensure that there is enough
SGX memory available to make forward progress within a loop that
may span a large range of pages. sgx_reclaim_direct()
ensures that there is enough memory right before it depends on
that available memory. I think that giving other tasks an opportunity
to run in the middle is risky since these other tasks may end
up consuming the SGX memory that was just freed up and thus increase the
likelihood that the operation fails with the user getting an EAGAIN error.
Additionally, in a constrained environment where sgx_reclaim_direct()
is needed to reclaim pages, an additional cond_resched() could cause a
user-visible slowdown when operating on a large memory range.

Reinette