SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and
are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used
to limit or account for SGX EPC memory, which is a desirable feature in some environments,
e.g., support for pod-level control in a Kubernetes cluster on a VM or bare-metal host [1,2].
This patchset implements support for sgx_epc memory within the misc cgroup controller. The
user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per
cgroup. The implementation reports current usage and events of reaching the limit per cgroup as
well as the total system capacity.
With the EPC misc controller enabled, every EPC page allocation is accounted toward a cgroup's
usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface file of the cgroup.
Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques
and pages can be swapped out of the EPC to their backing store (normal system memory allocated
via shmem, accounted by the memory controller). When the EPC usage of a cgroup reaches its hard
limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a reclamation process to swap
out some EPC pages within the same cgroup and its descendants to their backing store. Although
the SGX architecture supports swapping for all pages, to avoid extra complexity, this
implementation does not support swapping for certain page types, e.g., Version Array (VA) pages,
and treats them as unreclaimable. When the limit is reached but nothing is left in the
cgroup for reclamation, i.e., only unreclaimable pages remain, any new EPC allocation in the
cgroup will result in an ENOMEM error.
The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host
kernel [5]. Therefore they are also treated as unreclaimable from the cgroup's point of view,
and the virtual EPC driver translates an ENOMEM error resulting from an EPC allocation request
into a SIGBUS to the user process.
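For illustration, an administrator could use the controller roughly as sketched below; the
cgroup name and the limit value here are hypothetical, not part of this patchset:

    # enable the misc controller for child cgroups, then cap EPC usage for one of them
    echo "+misc" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir /sys/fs/cgroup/sgx_app
    echo "sgx_epc 67108864" > /sys/fs/cgroup/sgx_app/misc.max   # 64 MiB hard limit
    grep sgx_epc /sys/fs/cgroup/sgx_app/misc.current            # current EPC usage
    cat /sys/fs/cgroup/sgx_app/misc.events                      # limit-hit event counts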
This work was originally authored by Sean Christopherson a few years ago, and previously
modified by Kristen C. Accardi to utilize the misc cgroup controller rather than a custom
controller. I have been updating the patches based on review comments since V2 [3, 4, 10],
simplifying the implementation/design and fixing some stability issues found in testing.
The patches are organized as follows:
- Patches 1-3 are prerequisite misc cgroup changes for adding new APIs, structs, resource
types.
- Patch 4 implements basic misc controller for EPC without reclamation.
- Patches 5-9 prepare for per-cgroup reclamation.
* Separate out the existing infrastructure of tracking reclaimable pages
from the global reclaimer(ksgxd) to a newly created LRU list struct.
* Separate out reusable top-level functions for reclamation.
- Patch 10 adds support for per-cgroup reclamation.
- Patch 11 adds documentation for the EPC cgroup.
- Patch 12 adds test scripts.
I appreciate your review and providing tags if appropriate.
---
V6:
- Dropped the OOM killing path; only implement non-preemptive enforcement of the max limit (Dave, Michal)
- Simplified the reclamation flow by taking out sgx_epc_reclaim_control and the forced
  reclamation that ignored page 'age'.
- Restructured patches: split misc API + resource types patch and the big EPC cgroup patch
(Kai, Michal)
- Dropped some Tested-by/Reviewed-by tags due to significant changes
- Added more selftests
v5:
- Replace the manual test script with a selftest script.
- Restore the "From" tag for some patches to Sean (Kai)
- Style fixes (Jarkko)
v4:
- Collected "Tested-by" from Mikko. I kept it for now as no functional changes in v4.
- Rebased on to v6.6_rc1 and reordered patches as described above.
- Separated out the bug fixes [7,8,9]. This series depends on those patches. (Dave, Jarkko)
- Added comments in commit messages to preview what's to come next. (Jarkko)
- Fixed some documentation errors, gaps, and style issues (Mikko, Randy)
- Fixed some comments, typos, and style issues in code (Mikko, Kai)
- Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
- Fixed typo (Pavel)
- Excluded the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6]
- Used the same To list for the cover letter and all patches. (Solo)
v3:
- Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
- Unrolled wrappers for cond_resched, list (Dave)
- Separate patches for adding reclaimable and unreclaimable lists. (Dave)
- Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
- Simplified the cgroup tree walking with plain css_for_each_descendant_pre().
- Fixed race conditions and crashes.
- OOM killer to wait for the victim enclave pages being reclaimed.
- Unblock the user by handling misc_max_write callback asynchronously.
- Rebased onto 6.4; this series is no longer based on the MCA patchset.
- Fix an overflow in misc_try_charge.
- Fix a NULL pointer in SGX PF handler.
- Updated and included the SGX selftest patches previously reviewed. Those
  patches fix issues triggered under the high EPC pressure required for cgroup
  testing.
- Added test scripts to help setup and test SGX EPC cgroups.
[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/all/[email protected]/
[4]https://lore.kernel.org/linux-sgx/[email protected]/
[5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[6]https://lore.kernel.org/linux-sgx/[email protected]/
[7]https://lore.kernel.org/linux-sgx/[email protected]/
[8]https://lore.kernel.org/linux-sgx/[email protected]/
[9]https://lore.kernel.org/linux-sgx/[email protected]/
[10]https://lore.kernel.org/all/[email protected]/
Haitao Huang (2):
x86/sgx: Introduce EPC page states
selftests/sgx: Add scripts for EPC cgroup testing
Kristen Carlson Accardi (5):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Implement EPC reclamation for cgroup
Sean Christopherson (5):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
x86/sgx: Use sgx_epc_lru_list for existing active page list
x86/sgx: Use a list to track to-be-reclaimed pages
x86/sgx: Restructure top-level EPC reclaim function
Docs/x86/sgx: Add description for cgroup support
Documentation/arch/x86/sgx.rst | 74 ++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 2 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 319 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 49 +++
arch/x86/kernel/cpu/sgx/main.c | 245 +++++++++-----
arch/x86/kernel/cpu/sgx/sgx.h | 88 ++++-
include/linux/misc_cgroup.h | 42 +++
kernel/cgroup/misc.c | 52 ++-
.../selftests/sgx/run_epc_cg_selftests.sh | 196 +++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
12 files changed, 996 insertions(+), 98 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Export misc_cg_root() so the SGX EPC cgroup can access it and do extra
setup during initialization, e.g., set callbacks and private data
previously defined.
The SGX EPC cgroup will reclaim EPC pages when usage in a cgroup reaches
its own or an ancestor's limit. This requires a walk from the current
cgroup up to the root, similar to misc_cg_try_charge(). Export
misc_cg_parent() to enable this walk.
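As a sketch of the intended use, the EPC cgroup can compute the tightest limit along the
hierarchy with a walk like the one below. This is illustrative only; the actual helper (and
the MISC_CG_RES_SGX_EPC resource type it uses) is added by later patches in this series:

	/* Illustrative only: find the lowest "max" from @cg up to the root. */
	static u64 sgx_epc_min_max_to_root(struct misc_cg *cg)
	{
		u64 min_max = U64_MAX;
		struct misc_cg *i;

		for (i = cg; i; i = misc_cg_parent(i))
			min_max = min(min_max, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));

		return min_max;
	}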
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V6:
- Make commit messages more concise and split the original patch into two (Kai)
---
include/linux/misc_cgroup.h | 24 ++++++++++++++++++++++++
kernel/cgroup/misc.c | 21 ++++++++-------------
2 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 5dc509c27c3d..2a3b1f8dc669 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -68,6 +68,7 @@ struct misc_cg {
struct misc_res res[MISC_CG_RES_TYPES];
};
+struct misc_cg *misc_cg_root(void);
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
@@ -87,6 +88,20 @@ static inline struct misc_cg *css_misc(struct cgroup_subsys_state *css)
return css ? container_of(css, struct misc_cg, css) : NULL;
}
+/**
+ * misc_cg_parent() - Get the parent of the passed misc cgroup.
+ * @cgroup: cgroup whose parent needs to be fetched.
+ *
+ * Context: Any context.
+ * Return:
+ * * struct misc_cg* - Parent of the @cgroup.
+ * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ */
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cgroup)
+{
+ return cgroup ? css_misc(cgroup->css.parent) : NULL;
+}
+
/*
* get_current_misc_cg() - Find and get the misc cgroup of the current task.
*
@@ -111,6 +126,15 @@ static inline void put_misc_cg(struct misc_cg *cg)
}
#else /* !CONFIG_CGROUP_MISC */
+static inline struct misc_cg *misc_cg_root(void)
+{
+ return NULL;
+}
+
+static inline struct misc_cg *misc_cg_parent(struct misc_cg *cg)
+{
+ return NULL;
+}
static inline u64 misc_cg_res_total_usage(enum misc_res_type type)
{
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index d971ede44ebf..fa464324ccf8 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -40,18 +40,13 @@ static struct misc_cg root_cg;
static u64 misc_res_capacity[MISC_CG_RES_TYPES];
/**
- * parent_misc() - Get the parent of the passed misc cgroup.
- * @cgroup: cgroup whose parent needs to be fetched.
- *
- * Context: Any context.
- * Return:
- * * struct misc_cg* - Parent of the @cgroup.
- * * %NULL - If @cgroup is null or the passed cgroup does not have a parent.
+ * misc_cg_root() - Return the root misc cgroup.
*/
-static struct misc_cg *parent_misc(struct misc_cg *cgroup)
+struct misc_cg *misc_cg_root(void)
{
- return cgroup ? css_misc(cgroup->css.parent) : NULL;
+ return &root_cg;
}
+EXPORT_SYMBOL_GPL(misc_cg_root);
/**
* valid_type() - Check if @type is valid or not.
@@ -150,7 +145,7 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!amount)
return 0;
- for (i = cg; i; i = parent_misc(i)) {
+ for (i = cg; i; i = misc_cg_parent(i)) {
res = &i->res[type];
new_usage = atomic64_add_return(amount, &res->usage);
@@ -163,12 +158,12 @@ int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
return 0;
err_charge:
- for (j = i; j; j = parent_misc(j)) {
+ for (j = i; j; j = misc_cg_parent(j)) {
atomic64_inc(&j->res[type].events);
cgroup_file_notify(&j->events_file);
}
- for (j = cg; j != i; j = parent_misc(j))
+ for (j = cg; j != i; j = misc_cg_parent(j))
misc_cg_cancel_charge(type, j, amount);
misc_cg_cancel_charge(type, i, amount);
return ret;
@@ -190,7 +185,7 @@ void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount)
if (!(amount && valid_type(type) && cg))
return;
- for (i = cg; i; i = parent_misc(i))
+ for (i = cg; i; i = misc_cg_parent(i))
misc_cg_cancel_charge(type, i, amount);
}
EXPORT_SYMBOL_GPL(misc_cg_uncharge);
--
2.25.1
From: Sean Christopherson <[email protected]>
In the future, each cgroup will need an LRU list to track reclaimable
pages. For now, just replace the existing sgx_active_page_list in the
reclaimer and its spinlock with a global sgx_epc_lru_list struct.
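For reference, the sgx_epc_lru_list struct and its init helper come from the preceding patch
in this series and are not shown in this hunk; they are roughly a lock paired with the
reclaimable list used below:

	struct sgx_epc_lru_list {
		spinlock_t lock;
		struct list_head reclaimable;
	};

	static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
	{
		spin_lock_init(&lru->lock);
		INIT_LIST_HEAD(&lru->reclaimable);
	}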
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V5:
- Spelled out SECS, VA (Jarkko)
V4:
- No change, only reordered the patch.
V3:
- Remove usage of list wrapper
---
arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 07606f391540..d347acd717fd 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
/*
* These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
*/
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- if (list_empty(&sgx_active_page_list))
+ epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+ struct sgx_epc_page, list);
+ if (!epc_page)
break;
- epc_page = list_first_entry(&sgx_active_page_list,
- struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_reclaimer_lock);
- list_add_tail(&epc_page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
+ list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_active_page_list);
+ !list_empty(&sgx_global_lru.reclaimable);
}
/*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
ksgxd_tsk = tsk;
+ sgx_lru_init(&sgx_global_lru);
+
return true;
}
@@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
}
/**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return 0;
}
@@ -574,7 +575,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_active_page_list))
+ if (list_empty(&sgx_global_lru.reclaimable))
return ERR_PTR(-ENOMEM);
if (!reclaim) {
--
2.25.1
From: Sean Christopherson <[email protected]>
Change sgx_reclaim_pages() to use a list rather than an array for
storing the EPC pages to be reclaimed. This change is needed to
transition to the LRU implementation for EPC cgroup support.
When the EPC cgroup is implemented, the reclaiming process will do a
pre-order tree walk for the subtree starting from the limit-violating
cgroup. When each node is visited, candidate pages are selected from
its "reclaimable" LRU list and moved into this temporary list. Passing a
list from node to node for temporary storage in this walk is more
straightforward than using an array.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- Remove extra list_del_init and style fix (Kai)
V4:
- Changes needed for patch reordering
- Revised commit message
V3:
- Removed list wrappers
---
arch/x86/kernel/cpu/sgx/main.c | 35 +++++++++++++++-------------------
1 file changed, 15 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e27ac73d8843..33bcba313d40 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -296,12 +296,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
*/
static void sgx_reclaim_pages(void)
{
- struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
+ struct sgx_epc_page *epc_page, *tmp;
struct sgx_encl_page *encl_page;
- struct sgx_epc_page *epc_page;
pgoff_t page_index;
- int cnt = 0;
+ LIST_HEAD(iso);
int ret;
int i;
@@ -317,7 +316,7 @@ static void sgx_reclaim_pages(void)
if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
- chunk[cnt++] = epc_page;
+ list_move_tail(&epc_page->list, &iso);
} else
/* The owner is freeing the page. No need to add the
* page back to the list of reclaimable pages.
@@ -326,8 +325,11 @@ static void sgx_reclaim_pages(void)
}
spin_unlock(&sgx_global_lru.lock);
- for (i = 0; i < cnt; i++) {
- epc_page = chunk[i];
+ if (list_empty(&iso))
+ return;
+
+ i = 0;
+ list_for_each_entry_safe(epc_page, tmp, &iso, list) {
encl_page = epc_page->owner;
if (!sgx_reclaimer_age(epc_page))
@@ -342,6 +344,7 @@ static void sgx_reclaim_pages(void)
goto skip;
}
+ i++;
encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
mutex_unlock(&encl_page->encl->lock);
continue;
@@ -349,27 +352,19 @@ static void sgx_reclaim_pages(void)
skip:
spin_lock(&sgx_global_lru.lock);
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
- list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
spin_unlock(&sgx_global_lru.lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
-
- chunk[i] = NULL;
- }
-
- for (i = 0; i < cnt; i++) {
- epc_page = chunk[i];
- if (epc_page)
- sgx_reclaimer_block(epc_page);
}
- for (i = 0; i < cnt; i++) {
- epc_page = chunk[i];
- if (!epc_page)
- continue;
+ list_for_each_entry(epc_page, &iso, list)
+ sgx_reclaimer_block(epc_page);
+ i = 0;
+ list_for_each_entry_safe(epc_page, tmp, &iso, list) {
encl_page = epc_page->owner;
- sgx_reclaimer_write(epc_page, &backing[i]);
+ sgx_reclaimer_write(epc_page, &backing[i++]);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
sgx_epc_page_reset_state(epc_page);
--
2.25.1
From: Sean Christopherson <[email protected]>
Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- Removed mention of VMM-specific behavior for handling SIGBUS
- Removed the statement about forced reclamation; added a statement specifying
  that ENOMEM is returned when no reclamation is possible.
- Added statements on the non-preemptive nature of the max limit
- Dropped Reviewed-by tag because of changes
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when a VMM overcommits EPC within a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 74 ++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..dfc8fac13ab2 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,77 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates distribution of SGX
+EPC memory, which is a subset of system RAM that is used to provide SGX-enabled applications
+with protected memory, and is otherwise inaccessible, i.e. shows up as reserved in /proc/iomem
+and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all intents and
+purposes the EPC is independent from normal system memory, e.g. must be reserved at boot from
+RAM and cannot be converted between EPC and normal memory while the system is running. The EPC
+is managed by the SGX subsystem and is not accounted by the memory controller. Note that this
+is true only for EPC memory itself, i.e. normal memory allocations related to SGX and EPC
+memory, e.g. the backing memory for evicted EPC pages, are accounted, limited and protected by
+the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques
+and pages can be swapped out of the EPC to their backing store (normal system memory allocated
+via shmem). The SGX EPC subsystem is analogous to the memory subsystem, and it implements
+limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files, please see
+Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If a value which
+is not PAGE_SIZE aligned is written, the actual value used by the controller will be rounded
+down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc resource will
+ show the total amount of EPC memory available on the platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc resource will
+ show the current active EPC memory usage of the cgroup and its descendants. EPC pages
+ that are swapped out to backing RAM are not included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The sgx_epc resource
+ will show the EPC usage hard limit. The default is "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g. for page fault
+ handling, will be blocked until EPC can be reclaimed from the cgroup. If there are no
+ pages left that are reclaimable within the same group, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not reclaimable by
+ the host kernel. In case the guest cgroup's limit is reached and no reclaimable pages
+ left in the same cgroup, the virtual EPC driver returns SIGBUS to the user space
+ process to indicate failure on new EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower than the current
+ usage to this file, the cgroup will not preemptively deallocate pages currently in use,
+ and will only start blocking the next allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim
+ due to its EPC usage approaching (or exceeding) its max
+ EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged to the original
+cgroup until the page is released or reclaimed. Migrating a process to a different cgroup
+doesn't move the EPC charges that it incurred while in the previous cgroup to its new cgroup.
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Currently all reclaimable pages are tracked only in the global LRU list,
and only the global reclaimer (ksgxd) performs reclamation when the
global free page count is lower than a threshold. When a cgroup limit
is reached, the cgroup also needs to try to reclaim pages allocated
within the group. This patch enables per-cgroup reclamation.
Add a helper function, sgx_lru_list(), which for a given EPC page
returns the LRU list of the cgroup assigned to the page at allocation
time. This helper is used to replace the hard-coded global LRU wherever
appropriate: modify sgx_mark/unmark_page_reclaimable() to track EPC
pages in the LRU list of the appropriate cgroup; modify
sgx_do_epc_reclamation() to return unreclaimed pages back to the proper
cgroup's LRU.
Implement the reclamation flow for cgroup, encapsulated in the top-level
function sgx_epc_cgroup_reclaim_pages(). Just like the global reclaimer,
the cgroup reclaimer first isolates candidate pages for reclaim, then
invokes sgx_do_epc_reclamation(). The only difference is that a cgroup
does a pre-order walk on its subtree to scan for candidate pages from
its own LRU and LRUs in its descendants.
In some contexts, e.g. page fault handling, only asynchronous
reclamation is allowed. Create a workqueue, 'sgx_epc_cg_wq', and the
corresponding work item and function definitions to support asynchronous
reclamation. Add a boolean parameter to sgx_epc_cgroup_try_charge() to
indicate whether synchronous reclaim is allowed. Both synchronous and
asynchronous flows invoke the same top-level reclaim function,
sgx_epc_cgroup_reclaim_pages().
When the cgroup controller is enabled, all reclaimable pages are
tracked in per-cgroup LRUs. Update the original global reclaimer to
reclaim from the root of the cgroup hierarchy in that case, isolating
pages via sgx_epc_cgroup_isolate_pages().
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V6:
- Drop EPC OOM killer. (Dave, Michal)
- Patch restructuring: this patch contains the part split out of "Limit process EPC
  usage with misc cgroup controller", combined with "Prepare for multiple LRUs"
- Removed forced reclamation that ignored the 'youngness' of pages
- Removed checking for capacity in reclamation loop.
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 19 ++-
arch/x86/kernel/cpu/sgx/main.c | 71 ++++++---
3 files changed, 289 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 500627d0563f..110d44c0ef7c 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -5,6 +5,38 @@
#include <linux/kernel.h>
#include "epc_cgroup.h"
+#define SGX_EPC_RECLAIM_MIN_PAGES 16U
+
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+ return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+ return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is over its limit
+ * or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+ struct misc_cg *i = epc_cg->cg;
+ u64 m = U64_MAX;
+
+ while (i) {
+ m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+ i = misc_cg_parent(i);
+ }
+
+ return m / PAGE_SIZE;
+}
+
static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
{
return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
@@ -15,12 +47,188 @@ static inline bool sgx_epc_cgroup_disabled(void)
return !cgroup_subsys_enabled(misc_cgrp_subsys);
}
+/**
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: root of the tree to check
+ *
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+ bool ret = true;
+
+ /*
+ * Caller ensure css_root ref acquired
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&epc_cg->lru.lock);
+ ret = list_empty(&epc_cg->lru.reclaimable);
+ spin_unlock(&epc_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * sgx_epc_cgroup_isolate_pages() - walk a cgroup tree and scan LRUs to select pages for
+ * reclamation
+ * @root: root of the tree to start walking
+ * @nr_to_scan: The number of pages to scan
+ * @dst: Destination list to hold the isolated pages
+ */
+void sgx_epc_cgroup_isolate_pages(struct misc_cg *root,
+ unsigned int nr_to_scan, struct list_head *dst)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+
+ if (!nr_to_scan)
+ return;
+
+ /* Caller ensure css_root ref acquired */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+ nr_to_scan = sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, dst);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!nr_to_scan)
+ break;
+ }
+
+ rcu_read_unlock();
+}
+
+static unsigned int sgx_epc_cgroup_reclaim_pages(unsigned int nr_pages,
+ struct misc_cg *root)
+{
+ LIST_HEAD(iso);
+ /*
+ * Attempting to reclaim only a few pages will often fail and is inefficient, while
+ * reclaiming a huge number of pages can result in soft lockups due to holding various
+ * locks for an extended duration.
+ */
+ nr_pages = max(nr_pages, SGX_EPC_RECLAIM_MIN_PAGES);
+ nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
+ sgx_epc_cgroup_isolate_pages(root, nr_pages, &iso);
+
+ return sgx_do_epc_reclamation(&iso);
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ u64 cur, max;
+
+ epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+ for (;;) {
+ max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+ /*
+ * Adjust the limit down by one page, the goal is to free up
+ * pages for fault allocations, not to simply obey the limit.
+ * Conditionally decrementing max also means the cur vs. max
+ * check will correctly handle the case where both are zero.
+ */
+ if (max)
+ max--;
+
+ /*
+ * Unless the limit is extremely low, in which case forcing
+ * reclaim will likely cause thrashing, force the cgroup to
+ * reclaim at least once if it's operating *near* its maximum
+ * limit by adjusting @max down by half the min reclaim size.
+ * This work func is scheduled by sgx_epc_cgroup_try_charge
+ * when it cannot directly reclaim due to being in an atomic
+ * context, e.g. EPC allocation in a fault handler. Waiting
+ * to reclaim until the cgroup is actually at its limit is less
+ * performant as it means the faulting task is effectively
+ * blocked until a worker makes its way through the global work
+ * queue.
+ */
+ if (max > SGX_NR_TO_SCAN_MAX)
+ max -= (SGX_EPC_RECLAIM_MIN_PAGES / 2);
+
+ cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+
+ if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ break;
+
+ /* Keep reclaiming until above condition is met. */
+ sgx_epc_cgroup_reclaim_pages((unsigned int)(cur - max), epc_cg->cg);
+ }
+}
+
+static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
+ bool reclaim)
+{
+ for (;;) {
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+ PAGE_SIZE))
+ break;
+
+ if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ if (!reclaim) {
+ queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+ return -EBUSY;
+ }
+
+ if (!sgx_epc_cgroup_reclaim_pages(1, epc_cg->cg))
+ /* All pages were too young to reclaim, try again */
+ schedule();
+ }
+
+ return 0;
+}
+
/**
* sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
+ * @reclaim: whether or not synchronous reclaim is allowed
*
* Returns EPC cgroup or NULL on success, -errno on failure.
*/
-struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
{
struct sgx_epc_cgroup *epc_cg;
int ret;
@@ -29,12 +237,12 @@ struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
return NULL;
epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
- ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+ ret = __sgx_epc_cgroup_try_charge(epc_cg, reclaim);
- if (!ret) {
+ if (ret) {
/* No epc_cg returned, release ref from get_current_misc_cg() */
put_misc_cg(epc_cg->cg);
- return ERR_PTR(-ENOMEM);
+ return ERR_PTR(ret);
}
/* Ref released in sgx_epc_cgroup_uncharge() */
@@ -64,6 +272,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
if (!epc_cg)
return;
+ cancel_work_sync(&epc_cg->reclaim_work);
kfree(epc_cg);
}
@@ -82,6 +291,8 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
if (!epc_cg)
return -ENOMEM;
+ sgx_lru_init(&epc_cg->lru);
+ INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
epc_cg->cg = cg;
@@ -95,6 +306,11 @@ static int __init sgx_epc_cgroup_init(void)
if (!boot_cpu_has(X86_FEATURE_SGX))
return 0;
+ sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+ WQ_UNBOUND | WQ_FREEZABLE,
+ WQ_UNBOUND_MAX_ACTIVE);
+ BUG_ON(!sgx_epc_cg_wq);
+
cg = misc_cg_root();
BUG_ON(!cg);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index c3abfe82be15..ddc1b89f2805 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -16,20 +16,33 @@
#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
struct sgx_epc_cgroup;
-static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim)
{
return NULL;
}
static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_isolate_pages(struct misc_cg *root,
+ unsigned int nr_to_scan,
+ struct list_head *dst) { }
+
+static bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+ return true;
+}
#else
struct sgx_epc_cgroup {
- struct misc_cg *cg;
+ struct misc_cg *cg;
+ struct sgx_epc_lru_list lru;
+ struct work_struct reclaim_work;
};
-struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(bool reclaim);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+void sgx_epc_cgroup_isolate_pages(struct misc_cg *root, unsigned int nr_to_scan,
+ struct list_head *dst);
#endif
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index e8848b493eb7..c496b8f15b54 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -32,6 +32,31 @@ static DEFINE_XARRAY(sgx_epc_address_space);
*/
static struct sgx_epc_lru_list sgx_global_lru;
+#ifndef CONFIG_CGROUP_SGX_EPC
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+ return &sgx_global_lru;
+}
+#else
+static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
+{
+ if (epc_page->epc_cg)
+ return &epc_page->epc_cg->lru;
+
+ /* This should not happen if kernel is configured correctly */
+ WARN_ON_ONCE(1);
+ return &sgx_global_lru;
+}
+#endif
+
+static inline bool sgx_can_reclaim(void)
+{
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ return !sgx_epc_cgroup_lru_empty(misc_cg_root());
+
+ return !list_empty(&sgx_global_lru.reclaimable);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -342,6 +367,7 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
struct sgx_epc_page *epc_page, *tmp;
struct sgx_encl_page *encl_page;
+ struct sgx_epc_lru_list *lru;
pgoff_t page_index;
size_t ret, i;
@@ -370,10 +396,11 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
continue;
skip:
- spin_lock(&sgx_global_lru.lock);
+ lru = sgx_lru_list(epc_page);
+ spin_lock(&lru->lock);
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
- list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ list_move_tail(&epc_page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
}
@@ -397,9 +424,13 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
static void sgx_reclaim_epc_pages_global(void)
{
+ unsigned int nr_to_scan = SGX_NR_TO_SCAN;
LIST_HEAD(iso);
- sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ sgx_epc_cgroup_isolate_pages(misc_cg_root(), nr_to_scan, &iso);
+ else
+ sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
sgx_do_epc_reclamation(&iso);
}
@@ -407,7 +438,7 @@ static void sgx_reclaim_epc_pages_global(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_global_lru.reclaimable);
+ sgx_can_reclaim();
}
/*
@@ -528,26 +559,26 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
}
/**
- * sgx_mark_page_reclaimable() - Mark a page as reclaimable
+ * sgx_mark_page_reclaimable() - Mark a page as reclaimable and add it to an appropriate LRU
* @page: EPC page
*
- * Mark a page as reclaimable and add it to the active page list. Pages
- * are automatically removed from the active list when freed.
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
page->flags |= SGX_EPC_PAGE_RECLAIMABLE;
- list_add_tail(&page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ list_add_tail(&page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
}
/**
* sgx_unmark_page_reclaimable() - Remove a page from the reclaim list
* @page: EPC page
*
- * Clear the reclaimable flag and remove the page from the active page list.
+ * Clear the reclaimable flag if set and remove the page from its LRU.
*
* Return:
* 0 on success,
@@ -555,15 +586,17 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_global_lru.lock);
+ struct sgx_epc_lru_list *lru = sgx_lru_list(page);
+
+ spin_lock(&lru->lock);
if (sgx_epc_page_reclaim_in_progress(page->flags)) {
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return -EBUSY;
}
list_del(&page->list);
sgx_epc_page_reset_state(page);
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
return 0;
}
@@ -590,7 +623,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
struct sgx_epc_page *page;
struct sgx_epc_cgroup *epc_cg;
- epc_cg = sgx_epc_cgroup_try_charge();
+ epc_cg = sgx_epc_cgroup_try_charge(reclaim);
if (IS_ERR(epc_cg))
return ERR_CAST(epc_cg);
@@ -601,8 +634,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_global_lru.reclaimable))
- return ERR_PTR(-ENOMEM);
+ if (!sgx_can_reclaim()) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }
if (!reclaim) {
page = ERR_PTR(-EBUSY);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Implement support for cgroup control of SGX Enclave Page Cache (EPC)
memory using the misc cgroup controller. EPC memory is independent
from normal system memory, e.g. must be reserved at boot from RAM and
cannot be converted between EPC and normal memory while the system is
running. EPC is managed by the SGX subsystem and is not accounted by
the memory controller.
Much like normal system memory, EPC memory can be overcommitted via
virtual memory techniques and pages can be swapped out of the EPC to
their backing store (normal system memory, e.g. shmem). The SGX EPC
subsystem is analogous to the memory subsystem and the SGX EPC controller
is in turn analogous to the memory controller; it implements limit and
protection models for EPC memory.
The misc controller provides a mechanism to set a hard limit of EPC
usage via the "sgx_epc" resource in "misc.max". The total EPC memory
available on the system is reported via the "sgx_epc" resource in
"misc.capacity".
Compared to the previous version, this patch only adds the basic EPC
cgroup structure, accounts allocations for cgroup usage
(charge/uncharge), sets up misc cgroup callbacks, and sets the total EPC capacity.
For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.
Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
arch/x86/Kconfig | 13 ++++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++
arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++
arch/x86/kernel/cpu/sgx/sgx.h | 3 +
6 files changed, 184 insertions(+)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 66bfabae8814..e17c5dc3aea4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1921,6 +1921,19 @@ config X86_SGX
If unsure, say N.
+config CGROUP_SGX_EPC
+ bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+ depends on X86_SGX && CGROUP_MISC
+ help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..500627d0563f
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,103 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+static inline bool sgx_epc_cgroup_disabled(void)
+{
+ return !cgroup_subsys_enabled(misc_cgrp_subsys);
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
+ *
+ * Returns EPC cgroup or NULL on success, -errno on failure.
+ */
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ int ret;
+
+ if (sgx_epc_cgroup_disabled())
+ return NULL;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+ ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+ if (!ret) {
+ /* No epc_cg returned, release ref from get_current_misc_cg() */
+ put_misc_cg(epc_cg->cg);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Ref released in sgx_epc_cgroup_uncharge() */
+ return epc_cg;
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
+ * @epc_cg: the charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+ if (sgx_epc_cgroup_disabled())
+ return;
+
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+
+ /* Ref got from sgx_epc_cgroup_try_charge() */
+ put_misc_cg(epc_cg->cg);
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+ if (!epc_cg)
+ return;
+
+ kfree(epc_cg);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_operations_struct sgx_epc_cgroup_ops = {
+ .alloc = sgx_epc_cgroup_alloc,
+ .free = sgx_epc_cgroup_free,
+};
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+ if (!epc_cg)
+ return -ENOMEM;
+
+ cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
+ cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+ epc_cg->cg = cg;
+ return 0;
+}
+
+static int __init sgx_epc_cgroup_init(void)
+{
+ struct misc_cg *cg;
+
+ if (!boot_cpu_has(X86_FEATURE_SGX))
+ return 0;
+
+ cg = misc_cg_root();
+ BUG_ON(!cg);
+
+ return sgx_epc_cgroup_alloc(cg);
+}
+subsys_initcall(sgx_epc_cgroup_init);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..c3abfe82be15
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
+{
+ return NULL;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+#else
+struct sgx_epc_cgroup {
+ struct misc_cg *cg;
+};
+
+struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..07606f391540 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"
struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
static int sgx_nr_epc_sections;
@@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
struct sgx_epc_page *page;
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = sgx_epc_cgroup_try_charge();
+ if (IS_ERR(epc_cg))
+ return ERR_CAST(epc_cg);
for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
+ /*
+ * Need to do a global reclamation if cgroup was not full but free
+ * physical pages run out, causing __sgx_alloc_epc_page() to fail.
+ */
sgx_reclaim_pages();
cond_resched();
}
+ if (!IS_ERR(page)) {
+ WARN_ON_ONCE(page->epc_cg);
+ page->epc_cg = epc_cg;
+ } else {
+ sgx_epc_cgroup_uncharge(epc_cg);
+ }
+
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);
@@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;
+ if (page->epc_cg) {
+ sgx_epc_cgroup_uncharge(page->epc_cg);
+ page->epc_cg = NULL;
+ }
+
spin_lock(&node->lock);
page->owner = NULL;
@@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].owner = NULL;
section->pages[i].poison = 0;
+ section->pages[i].epc_cg = NULL;
list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
}
@@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;
sgx_nr_epc_sections++;
}
@@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}
+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..b1786774b8d2 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,12 +29,15 @@
/* Pages on free list */
#define SGX_EPC_PAGE_IS_FREE BIT(1)
+struct sgx_epc_cgroup;
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
u16 poison;
struct sgx_encl_page *owner;
struct list_head list;
+ struct sgx_epc_cgroup *epc_cg;
};
/*
--
2.25.1
From: Sean Christopherson <[email protected]>
To prepare for per-cgroup reclamation, split the top-level reclaim
function, sgx_reclaim_pages(), into two functions:
- sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given LRU list.
- sgx_do_epc_reclamation() performs the actual reclamation for the already isolated pages.
Create a new function, sgx_reclaim_epc_pages_global(), calling those two
in succession, to replace the original sgx_reclaim_pages(). The above
two functions will serve as building blocks for the reclamation flows
in the later EPC cgroup implementation.
sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
cgroup will use the result to track reclaiming progress.
sgx_isolate_epc_pages() returns the number of pages remaining to scan
for the current epoch of reclamation. The EPC cgroup will use the result
to determine whether more scanning needs to be done in the LRUs of its
child cgroups.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- Restructure patches to make it easier to review. (Kai)
- Fix unused nr_to_scan (Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 97 ++++++++++++++++++++++------------
arch/x86/kernel/cpu/sgx/sgx.h | 8 +++
2 files changed, 72 insertions(+), 33 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 33bcba313d40..e8848b493eb7 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -281,33 +281,23 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_isolate_epc_pages() - Isolate pages from an LRU for reclaim
+ * @lru: LRU from which to reclaim
+ * @nr_to_scan: Number of pages to scan for reclaim
+ * @dst: Destination list to hold the isolated pages
*
- * Batch process a chunk of pages (at the moment 16) in order to degrade amount
- * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
- * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
- * + EWB) but not sufficiently. Reclaiming one page at a time would also be
- * problematic as it would increase the lock contention too much, which would
- * halt forward progress.
+ * Return: remaining pages to scan, i.e, @nr_to_scan minus the number of pages scanned.
*/
-static void sgx_reclaim_pages(void)
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+ struct list_head *dst)
{
- struct sgx_backing backing[SGX_NR_TO_SCAN];
- struct sgx_epc_page *epc_page, *tmp;
struct sgx_encl_page *encl_page;
- pgoff_t page_index;
- LIST_HEAD(iso);
- int ret;
- int i;
+ struct sgx_epc_page *epc_page;
- spin_lock(&sgx_global_lru.lock);
- for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
+ spin_lock(&lru->lock);
+ for (; nr_to_scan > 0; --nr_to_scan) {
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
@@ -316,23 +306,53 @@ static void sgx_reclaim_pages(void)
if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
- list_move_tail(&epc_page->list, &iso);
+ list_move_tail(&epc_page->list, dst);
} else
/* The owner is freeing the page. No need to add the
* page back to the list of reclaimable pages.
*/
sgx_epc_page_reset_state(epc_page);
}
- spin_unlock(&sgx_global_lru.lock);
+ spin_unlock(&lru->lock);
+
+ return nr_to_scan;
+}
+
+/**
+ * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
+ * @iso: List of isolated pages for reclamation
+ *
+ * Take a list of EPC pages and reclaim them to the enclave's private shmem files. Do not
+ * reclaim the pages that have been accessed since the last scan, and move each of those pages
+ * to the tail of its tracking LRU list.
+ *
+ * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
+ * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
+ * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
+ * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
+ * the lock contention too much, which would halt forward progress.
+ *
+ * Extra pages in the list beyond the SGX_NR_TO_SCAN_MAX limit are skipped and returned back to
+ * their tracking LRU lists.
+ *
+ * Return: number of pages successfully reclaimed.
+ */
+unsigned int sgx_do_epc_reclamation(struct list_head *iso)
+{
+ struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
+ struct sgx_epc_page *epc_page, *tmp;
+ struct sgx_encl_page *encl_page;
+ pgoff_t page_index;
+ size_t ret, i;
- if (list_empty(&iso))
- return;
+ if (list_empty(iso))
+ return 0;
i = 0;
- list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+ list_for_each_entry_safe(epc_page, tmp, iso, list) {
encl_page = epc_page->owner;
- if (!sgx_reclaimer_age(epc_page))
+ if (i == SGX_NR_TO_SCAN_MAX || !sgx_reclaimer_age(epc_page))
goto skip;
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
@@ -358,11 +378,11 @@ static void sgx_reclaim_pages(void)
kref_put(&encl_page->encl->refcount, sgx_encl_release);
}
- list_for_each_entry(epc_page, &iso, list)
+ list_for_each_entry(epc_page, iso, list)
sgx_reclaimer_block(epc_page);
i = 0;
- list_for_each_entry_safe(epc_page, tmp, &iso, list) {
+ list_for_each_entry_safe(epc_page, tmp, iso, list) {
encl_page = epc_page->owner;
sgx_reclaimer_write(epc_page, &backing[i++]);
@@ -371,6 +391,17 @@ static void sgx_reclaim_pages(void)
sgx_free_epc_page(epc_page);
}
+
+ return i;
+}
+
+static void sgx_reclaim_epc_pages_global(void)
+{
+ LIST_HEAD(iso);
+
+ sgx_isolate_epc_pages(&sgx_global_lru, SGX_NR_TO_SCAN, &iso);
+
+ sgx_do_epc_reclamation(&iso);
}
static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,7 +418,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_epc_pages_global();
}
static int ksgxd(void *p)
@@ -410,7 +441,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_epc_pages_global();
cond_resched();
}
@@ -587,7 +618,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages();
+ sgx_reclaim_epc_pages_global();
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index dd7ab65b5b27..6a40f70ed96f 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -19,6 +19,11 @@
#define SGX_MAX_EPC_SECTIONS 8
#define SGX_EEXTEND_BLOCK_SIZE 256
+
+/*
+ * Maximum number of pages to scan for reclaiming.
+ */
+#define SGX_NR_TO_SCAN_MAX 32U
#define SGX_NR_TO_SCAN 16
#define SGX_NR_LOW_PAGES 32
#define SGX_NR_HIGH_PAGES 64
@@ -162,6 +167,9 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+unsigned int sgx_do_epc_reclamation(struct list_head *iso);
+unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
+ struct list_head *dst);
void sgx_ipi_cb(void *info);
--
2.25.1
The scripts rely on the cgroup-tools package from libcgroup [1].
To run selftests for epc cgroup:
sudo ./run_epc_cg_selftests.sh
With different cgroups, the script starts one or multiple concurrent SGX
selftests, each running one unclobbered_vdso_oversubscribed test. Each
such test tries to load an enclave of EPC size equal to the EPC
capacity available on the platform. The script checks results against
the expectation set for each cgroup and reports success or failure.
The script creates 3 different cgroups at the beginning with the following
expectations:
1) SMALL - intentionally small enough to fail the test loading an
enclave of size equal to the capacity.
2) LARGE - large enough to run up to 4 concurrent tests, but some should fail
if more than 4 concurrent tests are run. The script starts 4 tests expecting
at least one to pass, and then starts 5 expecting at least one to fail.
3) LARGER - the limit is the same as the capacity, large enough to run lots of
concurrent tests. The script starts 10 of them and expects them all to pass.
Then it reruns the same test with one process randomly killed and checks that
the usage is zero after all processes exit.
To watch misc cgroup 'current' changes during testing, run this in a
separate terminal:
./watch_misc_for_tests.sh current
[1] https://github.com/libcgroup/libcgroup/blob/main/README
Signed-off-by: Haitao Huang <[email protected]>
---
V5:
- Added a script with automatic results checking; removed the interactive
script.
- The script can run independently of the series below.
---
.../selftests/sgx/run_epc_cg_selftests.sh | 196 ++++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 ++
2 files changed, 209 insertions(+)
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
new file mode 100755
index 000000000000..72b93f694753
--- /dev/null
+++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
@@ -0,0 +1,196 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+TEST_ROOT_CG=selftest
+cgcreate -g misc:$TEST_ROOT_CG
+if [ $? -ne 0 ]; then
+ echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
+ exit 1
+fi
+TEST_CG_SUB1=$TEST_ROOT_CG/test1
+TEST_CG_SUB2=$TEST_ROOT_CG/test2
+TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
+TEST_CG_SUB4=$TEST_ROOT_CG/test4
+
+cgcreate -g misc:$TEST_CG_SUB1
+cgcreate -g misc:$TEST_CG_SUB2
+cgcreate -g misc:$TEST_CG_SUB3
+cgcreate -g misc:$TEST_CG_SUB4
+
+# Default to V2
+CG_ROOT=/sys/fs/cgroup
+if [ ! -d "/sys/fs/cgroup/misc" ]; then
+ echo "# cgroup V2 is in use."
+else
+ echo "# cgroup V1 is in use."
+ CG_ROOT=/sys/fs/cgroup/misc
+fi
+
+CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
+# This is below the number of VA pages needed for an enclave of capacity
+# size, so the oversubscribed test cases should fail.
+SMALL=$(( CAPACITY / 512 ))
+
+# At least one enclave of capacity size should load successfully, maybe up to 4,
+# but some may fail if more than 4 concurrent enclaves of capacity size are run.
+LARGE=$(( SMALL * 4 ))
+
+# Load lots of enclaves
+LARGER=$CAPACITY
+echo "# Setting up limits."
+echo "sgx_epc $SMALL" | tee $CG_ROOT/$TEST_CG_SUB1/misc.max
+echo "sgx_epc $LARGE" | tee $CG_ROOT/$TEST_CG_SUB2/misc.max
+echo "sgx_epc $LARGER" | tee $CG_ROOT/$TEST_CG_SUB4/misc.max
+
+timestamp=$(date +%Y%m%d_%H%M%S)
+
+test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
+
+echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
+# Always use a leaf node of the misc cgroups so it works for both v1 and v2.
+# These tests may fail on OOM.
+cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
+if [[ $? -eq 0 ]]; then
+ echo "# Fail on SMALL limit, not expecting any test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+else
+ echo "# Test failed as expected."
+fi
+
+echo "# PASSED SMALL limit."
+
+echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one success...."
+pids=()
+for i in {1..4}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+any_success=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [[ $status -eq 0 ]]; then
+ any_success=1
+ echo "# Process $pid returned successfully."
+ fi
+done
+
+if [[ $any_success -eq 0 ]]; then
+ echo "# Failed on LARGE limit positive testing, no test passes."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# PASSED LARGE limit positive testing."
+
+echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
+ expecting at least one failure...."
+pids=()
+for i in {1..5}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+any_failure=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [[ $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+done
+
+if [[ $any_failure -eq 0 ]]; then
+ echo "# Failed on LARGE limit negative testing, no test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# PASSED LARGE limit negative testing."
+
+echo "# Start 10 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ expecting no failure...."
+pids=()
+for i in {1..10}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+any_failure=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [[ $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+done
+
+if [[ $any_failure -ne 0 ]]; then
+ echo "# Failed on LARGER limit, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+echo "# PASSED LARGER limit tests."
+
+
+echo "# Start 10 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
+ randomly kill one, expecting no failure...."
+pids=()
+for i in {1..10}; do
+ (
+ cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
+ ) &
+ pids+=($!)
+done
+
+sleep $((RANDOM % 10 + 5))
+
+# Randomly select a PID to kill
+RANDOM_INDEX=$((RANDOM % 10))
+PID_TO_KILL=${pids[RANDOM_INDEX]}
+
+kill $PID_TO_KILL
+echo "# Killed process with PID: $PID_TO_KILL"
+
+any_failure=0
+for pid in "${pids[@]}"; do
+ wait "$pid"
+ status=$?
+ if [ "$pid" != "$PID_TO_KILL" ]; then
+ if [[ $status -ne 0 ]]; then
+ echo "# Process $pid returned failure."
+ any_failure=1
+ fi
+ fi
+done
+
+if [[ $any_failure -ne 0 ]]; then
+ echo "# Failed on random killing, at least one test fails."
+ cgdelete -r -g misc:$TEST_ROOT_CG
+ exit 1
+fi
+
+sleep 1
+
+USAGE=$(grep '^sgx_epc' "$CG_ROOT/$TEST_ROOT_CG/misc.current" | awk '{print $2}')
+if [ "$USAGE" -ne 0 ]; then
+ echo "# Failed: Final usage is $USAGE, not 0."
+else
+ echo "# PASSED leakage check."
+ echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
+fi
+cgdelete -r -g misc:$TEST_ROOT_CG
+echo "# done."
diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
new file mode 100755
index 000000000000..dbd38f346e7b
--- /dev/null
+++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright(c) 2023 Intel Corporation.
+
+if [ -z "$1" ]
+ then
+ echo "No argument supplied, please provide 'max', 'current' or 'events'"
+ exit 1
+fi
+
+watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
+ 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
+
--
2.25.1
Use the lower 2 bits in the flags field of sgx_epc_page struct to track
EPC states and define an enum for possible states for EPC pages tracked
for reclamation.
Add the RECLAIM_IN_PROGRESS state to explicitly indicate a page that is
identified as a candidate for reclaiming, but has not yet been
reclaimed, instead of relying on list_empty(&epc_page->list). A later
patch will replace the array on stack with a temporary list to store the
candidate pages, so list_empty() should no longer be used for this
purpose.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- Drop UNRECLAIMABLE and use only 2 bits for states (Kai)
- Combine the patch for RECLAIM_IN_PROGRESS
- Style fixes (Jarkko and Kai)
---
arch/x86/kernel/cpu/sgx/encl.c | 2 +-
arch/x86/kernel/cpu/sgx/main.c | 33 +++++++++---------
arch/x86/kernel/cpu/sgx/sgx.h | 62 +++++++++++++++++++++++++++++++---
3 files changed, 76 insertions(+), 21 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..17dc108d3ff7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -1315,7 +1315,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
{
int ret;
- WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
+ WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
ret = __eremove(sgx_get_epc_virt_addr(page));
if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index d347acd717fd..e27ac73d8843 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -315,13 +315,14 @@ static void sgx_reclaim_pages(void)
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
- if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
+ if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
+ sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
chunk[cnt++] = epc_page;
- else
+ } else
/* The owner is freeing the page. No need to add the
* page back to the list of reclaimable pages.
*/
- epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ sgx_epc_page_reset_state(epc_page);
}
spin_unlock(&sgx_global_lru.lock);
@@ -347,6 +348,7 @@ static void sgx_reclaim_pages(void)
skip:
spin_lock(&sgx_global_lru.lock);
+ sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
spin_unlock(&sgx_global_lru.lock);
@@ -370,7 +372,7 @@ static void sgx_reclaim_pages(void)
sgx_reclaimer_write(epc_page, &backing[i]);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
- epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ sgx_epc_page_reset_state(epc_page);
sgx_free_epc_page(epc_page);
}
@@ -509,7 +511,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
spin_lock(&sgx_global_lru.lock);
- page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
+ page->flags |= SGX_EPC_PAGE_RECLAIMABLE;
list_add_tail(&page->list, &sgx_global_lru.reclaimable);
spin_unlock(&sgx_global_lru.lock);
}
@@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
spin_lock(&sgx_global_lru.lock);
- if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
- /* The page is being reclaimed. */
- if (list_empty(&page->list)) {
- spin_unlock(&sgx_global_lru.lock);
- return -EBUSY;
- }
-
- list_del(&page->list);
- page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ if (sgx_epc_page_reclaim_in_progress(page->flags)) {
+ spin_unlock(&sgx_global_lru.lock);
+ return -EBUSY;
}
+
+ list_del(&page->list);
+ sgx_epc_page_reset_state(page);
spin_unlock(&sgx_global_lru.lock);
return 0;
@@ -623,6 +623,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;
+ WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
if (page->epc_cg) {
sgx_epc_cgroup_uncharge(page->epc_cg);
page->epc_cg = NULL;
@@ -635,7 +636,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
list_add(&page->list, &node->sgx_poison_page_list);
else
list_add_tail(&page->list, &node->free_page_list);
- page->flags = SGX_EPC_PAGE_IS_FREE;
+ page->flags = SGX_EPC_PAGE_FREE;
spin_unlock(&node->lock);
atomic_long_inc(&sgx_nr_free_pages);
@@ -737,7 +738,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
* If the page is on a free list, move it to the per-node
* poison page list.
*/
- if (page->flags & SGX_EPC_PAGE_IS_FREE) {
+ if (page->flags == SGX_EPC_PAGE_FREE) {
list_move(&page->list, &node->sgx_poison_page_list);
goto out;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0fbe6a2a159b..dd7ab65b5b27 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -23,11 +23,44 @@
#define SGX_NR_LOW_PAGES 32
#define SGX_NR_HIGH_PAGES 64
-/* Pages, which are being tracked by the page reclaimer. */
-#define SGX_EPC_PAGE_RECLAIMER_TRACKED BIT(0)
+enum sgx_epc_page_state {
+ /*
+ * Allocated but not tracked by the reclaimer.
+ *
+ * Pages allocated for virtual EPC which are never tracked by the host
+ * reclaimer; pages just allocated from free list but not yet put in
+ * use; pages just reclaimed, but not yet returned to the free list.
+ * Becomes FREE after sgx_free_epc_page().
+ * Becomes RECLAIMABLE after sgx_mark_page_reclaimable().
+ */
+ SGX_EPC_PAGE_NOT_TRACKED = 0,
+
+ /*
+ * Page is in the free list, ready for allocation.
+ *
+ * Becomes NOT_TRACKED after sgx_alloc_epc_page().
+ */
+ SGX_EPC_PAGE_FREE = 1,
+
+ /*
+ * Page is in use and tracked in a reclaimable LRU list.
+ *
+ * Becomes NOT_TRACKED after sgx_unmark_page_reclaimable().
+ * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
+ * for reclaiming.
+ */
+ SGX_EPC_PAGE_RECLAIMABLE = 2,
+
+ /*
+ * Page is in the middle of reclamation.
+ *
+ * Back to RECLAIMABLE if reclamation fails for any reason.
+ * Becomes NOT_TRACKED if reclaimed successfully.
+ */
+ SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 3,
+};
-/* Pages on free list */
-#define SGX_EPC_PAGE_IS_FREE BIT(1)
+#define SGX_EPC_PAGE_STATE_MASK GENMASK(1, 0)
struct sgx_epc_cgroup;
@@ -40,6 +73,27 @@ struct sgx_epc_page {
struct sgx_epc_cgroup *epc_cg;
};
+static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
+{
+ page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+}
+
+static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
+{
+ page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
+ page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
+static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
+{
+ return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
+static inline bool sgx_epc_page_reclaimable(unsigned long flags)
+{
+ return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
+}
+
/*
* Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
* the free page list local to the node is stored here.
--
2.25.1
From: Sean Christopherson <[email protected]>
Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- removed introduction to unreclaimables in commit message.
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.
V3:
- Removed the helper functions and revised commit messages.
---
arch/x86/kernel/cpu/sgx/sgx.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index b1786774b8d2..0fbe6a2a159b 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -86,6 +86,21 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+ spinlock_t lock;
+ struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->reclaimable);
+}
+
struct sgx_epc_page *__sgx_alloc_epc_page(void);
void sgx_free_epc_page(struct sgx_epc_page *page);
--
2.25.1
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and
> are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used
> to limit or account for SGX EPC memory, which is a desirable feature in some environments,
> e.g., support for pod level control in a Kubernates cluster on a VM or baremetal host [1,2].
>
> This patchset implements the support for sgx_epc memory within the misc cgroup controller. The
> user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per
> cgroup. The implementation reports current usage and events of reaching the limit per cgroup as
> well as the total system capacity.
>
> With the EPC misc controller enabled, every EPC page allocation is accounted for a cgroup's
> usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface file of the cgroup.
> Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques
> and pages can be swapped out of the EPC to their backing store (normal system memory allocated
> via shmem, accounted by the memory controller). When the EPC usage of a cgroup reaches its hard
> limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a reclamation process to swap
> out some EPC pages within the same cgroup and its descendant to their backing store. Although
> the SGX architecture supports swapping for all pages, to avoid extra complexities, this
> implementation does not support swapping for certain page types, e.g. Version Array(VA) pages,
> and treat them as unreclaimable pages. When the limit is reached but nothing left in the
> cgroup for reclamation, i.e., only unreclaimable pages left, any new EPC allocation in the
> cgroup will result in an ENOMEM error.
>
> The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host
> kernel [5]. Therefore they are also treated as unreclaimable from cgroup's point of view. And
> the virtual EPC driver translates an ENOMEM error resulted from an EPC allocation request into
> a SIGBUS to the user process.
>
> This work was originally authored by Sean Christopherson a few years ago, and previously
> modified by Kristen C. Accardi to utilize the misc cgroup controller rather than a custom
> controller. I have been updating the patches based on review comments since V2 [3, 4, 10],
> simplified the implementation/design and fixed some stability issues found from testing.
>
> The patches are organized as following:
> - Patches 1-3 are prerequisite misc cgroup changes for adding new APIs, structs, resource
> types.
> - Patch 4 implements basic misc controller for EPC without reclamation.
> - Patches 5-9 prepare for per-cgroup reclamation.
> * Separate out the existing infrastructure of tracking reclaimable pages
> from the global reclaimer(ksgxd) to a newly created LRU list struct.
> * Separate out reusable top-level functions for reclamation.
> - Patch 10 adds support for per-cgroup reclamation.
> - Patch 11 adds documentation for the EPC cgroup.
> - Patch 12 adds test scripts.
>
> I appreciate your review and providing tags if appropriate.
>
> ---
> V6:
> - Dropped OOM killing path, only implement non-preemptive enforcement of max limit (Dave, Michal)
> - Simplified reclamation flow by taking out sgx_epc_reclaim_control, forced reclamation by
> ignoring 'age".
> - Restructured patches: split misc API + resource types patch and the big EPC cgroup patch
> (Kai, Michal)
> - Dropped some Tested-by/Reviewed-by tags due to significant changes
> - Added more selftests
>
> v5:
> - Replace the manual test script with a selftest script.
> - Restore the "From" tag for some patches to Sean (Kai)
> - Style fixes (Jarkko)
>
> v4:
> - Collected "Tested-by" from Mikko. I kept it for now as no functional changes in v4.
> - Rebased on to v6.6_rc1 and reordered patches as described above.
> - Separated out the bug fixes [7,8,9]. This series depend on those patches. (Dave, Jarkko)
> - Added comments in commit message to give more preview what's to come next. (Jarkko)
> - Fixed some documentation error, gap, style (Mikko, Randy)
> - Fixed some comments, typo, style in code (Mikko, Kai)
> - Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
> - Fixed typo (Pavel)
> - Exclude the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6]
> - Use the same to list for cover and all patches. (Solo)
>
> v3:
>
> - Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
> - Unrolled wrappers for cond_resched, list (Dave)
> - Separate patches for adding reclaimable and unreclaimable lists. (Dave)
> - Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
> - Simplified the cgroup tree walking with plain
> css_for_each_descendant_pre.
> - Fixed race conditions and crashes.
> - OOM killer to wait for the victim enclave pages being reclaimed.
> - Unblock the user by handling misc_max_write callback asynchronously.
> - Rebased onto 6.4 and no longer base this series on the MCA patchset.
> - Fix an overflow in misc_try_charge.
> - Fix a NULL pointer in SGX PF handler.
> - Updated and included the SGX selftest patches previously reviewed. Those
> patches fix issues triggered in high EPC pressure required for cgroup
> testing.
> - Added test scripts to help setup and test SGX EPC cgroups.
>
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/all/[email protected]/
> [4]https://lore.kernel.org/linux-sgx/[email protected]/
> [5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [6]https://lore.kernel.org/linux-sgx/[email protected]/
> [7]https://lore.kernel.org/linux-sgx/[email protected]/
> [8]https://lore.kernel.org/linux-sgx/[email protected]/
> [9]https://lore.kernel.org/linux-sgx/[email protected]/
> [10]https://lore.kernel.org/all/[email protected]/
>
> Haitao Huang (2):
> x86/sgx: Introduce EPC page states
> selftests/sgx: Add scripts for EPC cgroup testing
>
> Kristen Carlson Accardi (5):
> cgroup/misc: Add per resource callbacks for CSS events
> cgroup/misc: Export APIs for SGX driver
> cgroup/misc: Add SGX EPC resource type
> x86/sgx: Implement basic EPC misc cgroup functionality
> x86/sgx: Implement EPC reclamation for cgroup
>
> Sean Christopherson (5):
> x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
> x86/sgx: Use sgx_epc_lru_list for existing active page list
> x86/sgx: Use a list to track to-be-reclaimed pages
> x86/sgx: Restructure top-level EPC reclaim function
> Docs/x86/sgx: Add description for cgroup support
>
> Documentation/arch/x86/sgx.rst | 74 ++++
> arch/x86/Kconfig | 13 +
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/encl.c | 2 +-
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 319 ++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 49 +++
> arch/x86/kernel/cpu/sgx/main.c | 245 +++++++++-----
> arch/x86/kernel/cpu/sgx/sgx.h | 88 ++++-
> include/linux/misc_cgroup.h | 42 +++
> kernel/cgroup/misc.c | 52 ++-
> .../selftests/sgx/run_epc_cg_selftests.sh | 196 +++++++++++
> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
> 12 files changed, 996 insertions(+), 98 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
Is this expected to work on NUC7?
Planning to test this next week (no time this week).
BR, Jarkko
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem). The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
Nit:
The above two paragraphs talk about what EPC is and that EPC resource control needs
to be done separately, etc., but IMHO they lack some background about "why" EPC
resource control is needed, e.g., from a use case's perspective.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
Please separate what the current misc cgroup provides, and how this patch is
going to utilize it.
Please describe the changes in imperative mood. E.g., "report total EPC memory
via ...", instead of "... is reported via ...".
>
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.
This isn't changelog material. Please focus on describing the high level design
and why you chose such a design.
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> globale reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller" and restructure it (Kai)
> ---
> arch/x86/Kconfig | 13 ++++
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++
> arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++
> arch/x86/kernel/cpu/sgx/sgx.h | 3 +
> 6 files changed, 184 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>
> If unsure, say N.
>
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> + Provides control over the EPC footprint of tasks in a cgroup via
> + the Miscellaneous cgroup controller.
> +
> + EPC is a subset of regular memory that is usable only by SGX
> + enclaves and is very limited in quantity, e.g. less than 1%
> + of total DRAM.
> +
> + Say N if unsure.
> +
> config X86_USER_SHADOW_STACK
> bool "X86 userspace shadow stack"
> depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
> ioctl.o \
> main.o
> obj-$(CONFIG_X86_SGX_KVM) += virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);
From below, the root EPC cgroup is dynamically allocated. Shouldn't it also
check whether the root EPC cgroup is valid?
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + int ret;
> +
> + if (sgx_epc_cgroup_disabled())
> + return NULL;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> + if (!ret) {
> + /* No epc_cg returned, release ref from get_current_misc_cg() */
> + put_misc_cg(epc_cg->cg);
> + return ERR_PTR(-ENOMEM);
misc_cg_try_charge() returns 0 when successfully charged, no?
> + }
> +
> + /* Ref released in sgx_epc_cgroup_uncharge() */
> + return epc_cg;
> +}
IMHO the above _try_charge() returning a pointer to the EPC cgroup is a little bit
odd, because it doesn't match the existing misc_cg_try_charge(), which returns
whether the charge is successful or not. sev_misc_cg_try_charge() matches
misc_cg_try_charge() too.
I think it's better to split the "getting EPC cgroup" part out as a separate helper,
and make this _try_charge() match the existing pattern:
struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
{
if (sgx_epc_cgroup_disabled())
return NULL;
return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
}
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
{
if (!epc_cg)
return -EINVAL;
return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
}
Having sgx_get_current_epc_cg() also makes the caller easier to read, because we
can immediately know we are going to charge the *current* EPC cgroup, rather than
some cgroup hidden within sgx_epc_cgroup_try_charge().
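For example, the allocation path in sgx_alloc_epc_page() could then start with
something like this (a rough, untested sketch on top of the two helpers above):

	struct sgx_epc_cgroup *epc_cg;
	int ret;

	epc_cg = sgx_get_current_epc_cg();
	if (epc_cg) {
		ret = sgx_epc_cgroup_try_charge(epc_cg);
		if (ret) {
			/* Drop the ref taken by get_current_misc_cg() */
			put_misc_cg(epc_cg->cg);
			return ERR_PTR(ret);
		}
	}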
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg: the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (sgx_epc_cgroup_disabled())
> + return;
If going with the above change, check for !epc_cg instead.
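i.e., roughly like this (same body as the function below, just with the disabled
check replaced; untested):

void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
{
	if (!epc_cg)
		return;

	misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);

	/* Ref got from sgx_get_current_epc_cg() */
	put_misc_cg(epc_cg->cg);
}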
> +
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> + /* Ref got from sgx_epc_cgroup_try_charge() */
> + put_misc_cg(epc_cg->cg);
> +}
>
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> + if (!epc_cg)
> + return;
> +
> + kfree(epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_operations_struct sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> + if (!epc_cg)
> + return -ENOMEM;
> +
> + cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> + return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> + struct misc_cg *cg;
> +
> + if (!boot_cpu_has(X86_FEATURE_SGX))
> + return 0;
> +
> + cg = misc_cg_root();
> + BUG_ON(!cg);
BUG_ON() will catch some eyeball, but it cannot be NULL in practice IIUC.
I am not sure whether you can just make misc @root_cg visible (instead of having
the misc_cg_root() helper) and directly use @root_cg here to avoid using the
BUG(). No opinion here.
> +
> + return sgx_epc_cgroup_alloc(cg);
As mentioned above, the memory allocation can fail, in which case the EPC cgroup
is effectively disabled IIUC?
One way is to manually check whether the root EPC cgroup is valid in
sgx_epc_cgroup_disabled(). Alternatively, you can have a static root EPC cgroup
here:
static struct sgx_epc_cgroup root_epc_cg;
In this way you can have a sgx_epc_cgroup_init(&epc_cg), and call it from
sgx_epc_cgroup_alloc().
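e.g., roughly like below (untested; the helper name is only for illustration, and
it just reuses the body of sgx_epc_cgroup_alloc() above):

static struct sgx_epc_cgroup epc_cg_root;

static void sgx_epc_cgroup_init_one(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
{
	cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
	epc_cg->cg = cg;
}

static int __init sgx_epc_cgroup_init(void)
{
	if (!boot_cpu_has(X86_FEATURE_SGX))
		return 0;

	sgx_epc_cgroup_init_one(misc_cg_root(), &epc_cg_root);
	return 0;
}

and sgx_epc_cgroup_alloc() would then kzalloc() a sgx_epc_cgroup and call the same
sgx_epc_cgroup_init_one() for non-root cgroups.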
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..c3abfe82be15
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
Do you need this macro?
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +#else
> +struct sgx_epc_cgroup {
> + struct misc_cg *cg;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
Why do you need sgx_epc_cgroup_lru_empty() here?
> +
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..07606f391540 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
> #include <linux/node.h>
> #include <linux/pagemap.h>
> #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
> #include "driver.h"
> #include "encl.h"
> #include "encls.h"
> +#include "epc_cgroup.h"
>
> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> static int sgx_nr_epc_sections;
> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> {
> struct sgx_epc_page *page;
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_try_charge();
> + if (IS_ERR(epc_cg))
> + return ERR_CAST(epc_cg);
>
> for ( ; ; ) {
> page = __sgx_alloc_epc_page();
> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> + /*
> + * Need to do a global reclamation if cgroup was not full but free
> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> + */
> sgx_reclaim_pages();
What's the final behaviour? IIUC it should be reclaiming from the *current* EPC
cgroup? If so shouldn't we just pass the @epc_cg to it here?
I think we can make this patch a "structure" patch w/o actually having the EPC
cgroup enabled, i.e., sgx_get_current_epc_cg() always returns NULL.
And we can have one patch to change sgx_reclaim_pages() to take the 'struct
sgx_epc_lru_list *' as argument:
void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
{
...
}
Then here we can have something like:
void sgx_reclaim_pages(struct sgx_epc_cgroup *epc_cg)
{
struct sgx_epc_lru_list *lru =
epc_cg ? &epc_cg->lru : &sgx_global_lru;
sgx_reclaim_pages_lru(lru);
}
Makes sense?
> cond_resched();
> }
>
> + if (!IS_ERR(page)) {
> + WARN_ON_ONCE(page->epc_cg);
> + page->epc_cg = epc_cg;
> + } else {
> + sgx_epc_cgroup_uncharge(epc_cg);
> + }
> +
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> wake_up(&ksgxd_waitq);
>
> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> struct sgx_numa_node *node = section->node;
>
> + if (page->epc_cg) {
> + sgx_epc_cgroup_uncharge(page->epc_cg);
> + page->epc_cg = NULL;
> + }
> +
> spin_lock(&node->lock);
>
> page->owner = NULL;
> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> section->pages[i].flags = 0;
> section->pages[i].owner = NULL;
> section->pages[i].poison = 0;
> + section->pages[i].epc_cg = NULL;
> + list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> }
>
> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
> static bool __init sgx_page_cache_init(void)
> {
> u32 eax, ebx, ecx, edx, type;
> + u64 capacity = 0;
> u64 pa, size;
> int nid;
> int i;
> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>
> sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> sgx_numa_nodes[nid].size += size;
> + capacity += size;
>
> sgx_nr_epc_sections++;
> }
> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
> return false;
> }
>
> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +
> return true;
> }
I would split setting up the capacity into a separate patch.
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..b1786774b8d2 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -29,12 +29,15 @@
> /* Pages on free list */
> #define SGX_EPC_PAGE_IS_FREE BIT(1)
>
> +struct sgx_epc_cgroup;
> +
> struct sgx_epc_page {
> unsigned int section;
> u16 flags;
> u16 poison;
> struct sgx_encl_page *owner;
> struct list_head list;
> + struct sgx_epc_cgroup *epc_cg;
> };
>
Adding @epc_cg unconditionally means that even with !CONFIG_CGROUP_SGX_EPC the
memory is still occupied. IMHO that would bring non-trivial memory waste as it's
8 bytes for each EPC page.
If it's not good to have "#ifdef CONFIG_CGROUP_SGX_EPC" in the .c file, then
perhaps we can have some helper here, e.g.,
static inline void sgx_epc_page_set_cg(struct sgx_epc_page *epc_page,
struct sgx_epc_cgroup *epc_cg)
{
#ifdef CONFIG_CGROUP_SGX_EPC
epc_page->epc_cg = epc_cg;
#endif
}
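and, correspondingly, a getter so the rest of the code never dereferences the
field directly (again only a sketch, the name is illustrative):

static inline struct sgx_epc_cgroup *sgx_epc_page_get_cg(struct sgx_epc_page *epc_page)
{
#ifdef CONFIG_CGROUP_SGX_EPC
	return epc_page->epc_cg;
#else
	return NULL;
#endif
}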
On Sun, 05 Nov 2023 21:26:44 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> SGX Enclave Page Cache (EPC) memory allocations are separate from
>> normal RAM allocations, and
>> are managed solely by the SGX subsystem. The existing cgroup memory
>> controller cannot be used
>> to limit or account for SGX EPC memory, which is a desirable feature in
>> some environments,
>> e.g., support for pod level control in a Kubernates cluster on a VM or
>> baremetal host [1,2].
>> This patchset implements the support for sgx_epc memory within the
>> misc cgroup controller. The
>> user can use the misc cgroup controller to set and enforce a max limit
>> on total EPC usage per
>> cgroup. The implementation reports current usage and events of reaching
>> the limit per cgroup as
>> well as the total system capacity.
>> With the EPC misc controller enabled, every EPC page allocation is
>> accounted for a cgroup's
>> usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface
>> file of the cgroup.
>> Much like normal system memory, EPC memory can be overcommitted via
>> virtual memory techniques
>> and pages can be swapped out of the EPC to their backing store (normal
>> system memory allocated
>> via shmem, accounted by the memory controller). When the EPC usage of a
>> cgroup reaches its hard
>> limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a
>> reclamation process to swap
>> out some EPC pages within the same cgroup and its descendant to their
>> backing store. Although
>> the SGX architecture supports swapping for all pages, to avoid extra
>> complexities, this
>> implementation does not support swapping for certain page types, e.g.
>> Version Array(VA) pages,
>> and treat them as unreclaimable pages. When the limit is reached but
>> nothing left in the
>> cgroup for reclamation, i.e., only unreclaimable pages left, any new
>> EPC allocation in the
>> cgroup will result in an ENOMEM error.
>>
>> The EPC pages allocated for guest VMs by the virtual EPC driver are not
>> reclaimable by the host
>> kernel [5]. Therefore they are also treated as unreclaimable from
>> cgroup's point of view. And
>> the virtual EPC driver translates an ENOMEM error resulted from an EPC
>> allocation request into
>> a SIGBUS to the user process.
>>
>> This work was originally authored by Sean Christopherson a few years
>> ago, and previously
>> modified by Kristen C. Accardi to utilize the misc cgroup controller
>> rather than a custom
>> controller. I have been updating the patches based on review comments
>> since V2 [3, 4, 10],
>> simplified the implementation/design and fixed some stability issues
>> found from testing.
>> The patches are organized as following:
>> - Patches 1-3 are prerequisite misc cgroup changes for adding new APIs,
>> structs, resource
>> types.
>> - Patch 4 implements basic misc controller for EPC without reclamation.
>> - Patches 5-9 prepare for per-cgroup reclamation.
>> * Separate out the existing infrastructure of tracking reclaimable
>> pages
>> from the global reclaimer(ksgxd) to a newly created LRU list
>> struct.
>> * Separate out reusable top-level functions for reclamation.
>> - Patch 10 adds support for per-cgroup reclamation.
>> - Patch 11 adds documentation for the EPC cgroup.
>> - Patch 12 adds test scripts.
>>
>> I appreciate your review and providing tags if appropriate.
>>
>> ---
>> V6:
>> - Dropped OOM killing path, only implement non-preemptive enforcement
>> of max limit (Dave, Michal)
>> - Simplified reclamation flow by taking out sgx_epc_reclaim_control,
>> forced reclamation by
>> ignoring 'age".
>> - Restructured patches: split misc API + resource types patch and the
>> big EPC cgroup patch
>> (Kai, Michal)
>> - Dropped some Tested-by/Reviewed-by tags due to significant changes
>> - Added more selftests
>>
>> v5:
>> - Replace the manual test script with a selftest script.
>> - Restore the "From" tag for some patches to Sean (Kai)
>> - Style fixes (Jarkko)
>>
>> v4:
>> - Collected "Tested-by" from Mikko. I kept it for now as no functional
>> changes in v4.
>> - Rebased on to v6.6_rc1 and reordered patches as described above.
>> - Separated out the bug fixes [7,8,9]. This series depend on those
>> patches. (Dave, Jarkko)
>> - Added comments in commit message to give more preview what's to come
>> next. (Jarkko)
>> - Fixed some documentation error, gap, style (Mikko, Randy)
>> - Fixed some comments, typo, style in code (Mikko, Kai)
>> - Patch format and background for reclaimable vs unreclaimable (Kai,
>> Jarkko)
>> - Fixed typo (Pavel)
>> - Exclude the previous fixes/enhancements for self-tests. Patch 18 now
>> depends on series [6]
>> - Use the same to list for cover and all patches. (Solo)
>> v3:
>> - Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
>> - Unrolled wrappers for cond_resched, list (Dave)
>> - Separate patches for adding reclaimable and unreclaimable lists.
>> (Dave)
>> - Other improvements on patch flow, commit messages, styles. (Dave,
>> Jarkko)
>> - Simplified the cgroup tree walking with plain
>> css_for_each_descendant_pre.
>> - Fixed race conditions and crashes.
>> - OOM killer to wait for the victim enclave pages being reclaimed.
>> - Unblock the user by handling misc_max_write callback asynchronously.
>> - Rebased onto 6.4 and no longer base this series on the MCA patchset.
>> - Fix an overflow in misc_try_charge.
>> - Fix a NULL pointer in SGX PF handler.
>> - Updated and included the SGX selftest patches previously reviewed.
>> Those
>> patches fix issues triggered in high EPC pressure required for cgroup
>> testing.
>> - Added test scripts to help setup and test SGX EPC cgroups.
>>
>> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
>> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
>> [3]https://lore.kernel.org/all/[email protected]/
>> [4]https://lore.kernel.org/linux-sgx/[email protected]/
>> [5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
>> [6]https://lore.kernel.org/linux-sgx/[email protected]/
>> [7]https://lore.kernel.org/linux-sgx/[email protected]/
>> [8]https://lore.kernel.org/linux-sgx/[email protected]/
>> [9]https://lore.kernel.org/linux-sgx/[email protected]/
>> [10]https://lore.kernel.org/all/[email protected]/
>>
>> Haitao Huang (2):
>> x86/sgx: Introduce EPC page states
>> selftests/sgx: Add scripts for EPC cgroup testing
>>
>> Kristen Carlson Accardi (5):
>> cgroup/misc: Add per resource callbacks for CSS events
>> cgroup/misc: Export APIs for SGX driver
>> cgroup/misc: Add SGX EPC resource type
>> x86/sgx: Implement basic EPC misc cgroup functionality
>> x86/sgx: Implement EPC reclamation for cgroup
>>
>> Sean Christopherson (5):
>> x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
>> x86/sgx: Use sgx_epc_lru_list for existing active page list
>> x86/sgx: Use a list to track to-be-reclaimed pages
>> x86/sgx: Restructure top-level EPC reclaim function
>> Docs/x86/sgx: Add description for cgroup support
>>
>> Documentation/arch/x86/sgx.rst | 74 ++++
>> arch/x86/Kconfig | 13 +
>> arch/x86/kernel/cpu/sgx/Makefile | 1 +
>> arch/x86/kernel/cpu/sgx/encl.c | 2 +-
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 319 ++++++++++++++++++
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 49 +++
>> arch/x86/kernel/cpu/sgx/main.c | 245 +++++++++-----
>> arch/x86/kernel/cpu/sgx/sgx.h | 88 ++++-
>> include/linux/misc_cgroup.h | 42 +++
>> kernel/cgroup/misc.c | 52 ++-
>> .../selftests/sgx/run_epc_cg_selftests.sh | 196 +++++++++++
>> .../selftests/sgx/watch_misc_for_tests.sh | 13 +
>> 12 files changed, 996 insertions(+), 98 deletions(-)
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>>
>
> Is this expected to work on NUC7?
>
> Planning to test this next week (no time this week).
>
> BR, Jarkko
I don't see a reason why it would not work on a NUC. I'll try to get
access to one and test it too. The only difference I can think of is that you
may have lower capacity. The selftest scripts try not to hard-code the limit
numbers, so I expect they should also work. Also, the changes do not depend on
EDMM, so any NUC should be fine.
BTW I'll also send a fixup patch for an issue Mikko found in his testing where
the memcg is not charged for cgroup workqueue reclamation. Please apply that
on top of the series when testing.
Thanks
Haitao
Enclave Page Cache (EPC) memory can be swapped out to regular system
memory, and the consumed memory should be charged to a proper
mem_cgroup. Currently the selection of the mem_cgroup to charge is done in
sgx_encl_get_mem_cgroup(), but it only considers two contexts in which
the swapping can be done: a normal task and the ksgxd kthread.
With the new EPC cgroup implementation, the swapping can also happen in
EPC cgroup work-queue threads. In those cases, it improperly selects the
root mem_cgroup to charge for the RAM usage.
Change sgx_encl_get_mem_cgroup() to handle non-task contexts only and
return the mem_cgroup of an mm_struct associated with the enclave. The
returned mem_cgroup is used to charge for EPC backing pages in all
kthread cases.
Pass a flag into the top level reclamation function,
sgx_do_epc_reclamation(), to explicitly indicate whether it is called
from a background kthread. Internally, if the flag is true, switch the
active mem_cgroup to the one returned from sgx_encl_get_mem_cgroup(),
prior to any backing page allocation, in order to ensure that shmem page
allocations are charged to the enclave's cgroup.
Remove current_is_ksgxd() as it is no longer needed.
Signed-off-by: Haitao Huang <[email protected]>
Reported-by: Mikko Ylinen <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.c | 43 ++++++++++++++--------------
arch/x86/kernel/cpu/sgx/encl.h | 3 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 8 +++---
arch/x86/kernel/cpu/sgx/main.c | 26 +++++++----------
arch/x86/kernel/cpu/sgx/sgx.h | 2 +-
5 files changed, 38 insertions(+), 44 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 17dc108d3ff7..930f5bd4b752 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -993,9 +993,7 @@ static int __sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_inde
}
/*
- * When called from ksgxd, returns the mem_cgroup of a struct mm stored
- * in the enclave's mm_list. When not called from ksgxd, just returns
- * the mem_cgroup of the current task.
+ * Returns the mem_cgroup of a struct mm stored in the enclave's mm_list.
*/
static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
{
@@ -1003,14 +1001,6 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
struct sgx_encl_mm *encl_mm;
int idx;
- /*
- * If called from normal task context, return the mem_cgroup
- * of the current task's mm. The remainder of the handling is for
- * ksgxd.
- */
- if (!current_is_ksgxd())
- return get_mem_cgroup_from_mm(current->mm);
-
/*
* Search the enclave's mm_list to find an mm associated with
* this enclave to charge the allocation to.
@@ -1047,29 +1037,38 @@ static struct mem_cgroup *sgx_encl_get_mem_cgroup(struct sgx_encl *encl)
* @encl: an enclave pointer
* @page_index: enclave page index
* @backing: data for accessing backing storage for the page
+ * @indirect: in ksgxd or EPC cgroup work queue context
+ *
+ * Create a backing page for loading data back into an EPC page with ELDU. This function takes
+ * a reference on a new backing page which must be dropped with a corresponding call to
+ * sgx_encl_put_backing().
*
- * When called from ksgxd, sets the active memcg from one of the
- * mms in the enclave's mm_list prior to any backing page allocation,
- * in order to ensure that shmem page allocations are charged to the
- * enclave. Create a backing page for loading data back into an EPC page with
- * ELDU. This function takes a reference on a new backing page which
- * must be dropped with a corresponding call to sgx_encl_put_backing().
+ * When @indirect is true, sets the active memcg from one of the mms in the enclave's mm_list
+ * prior to any backing page allocation, in order to ensure that shmem page allocations are
+ * charged to the enclave.
*
* Return:
* 0 on success,
* -errno otherwise.
*/
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, bool indirect)
{
- struct mem_cgroup *encl_memcg = sgx_encl_get_mem_cgroup(encl);
- struct mem_cgroup *memcg = set_active_memcg(encl_memcg);
+ struct mem_cgroup *encl_memcg;
+ struct mem_cgroup *memcg;
int ret;
+ if (indirect) {
+ encl_memcg = sgx_encl_get_mem_cgroup(encl);
+ memcg = set_active_memcg(encl_memcg);
+ }
+
ret = __sgx_encl_get_backing(encl, page_index, backing);
- set_active_memcg(memcg);
- mem_cgroup_put(encl_memcg);
+ if (indirect) {
+ set_active_memcg(memcg);
+ mem_cgroup_put(encl_memcg);
+ }
return ret;
}
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..549cd2e8d98b 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -103,12 +103,11 @@ static inline int sgx_encl_find(struct mm_struct *mm, unsigned long addr,
int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start,
unsigned long end, unsigned long vm_flags);
-bool current_is_ksgxd(void);
void sgx_encl_release(struct kref *ref);
int sgx_encl_mm_add(struct sgx_encl *encl, struct mm_struct *mm);
const cpumask_t *sgx_encl_cpumask(struct sgx_encl *encl);
int sgx_encl_alloc_backing(struct sgx_encl *encl, unsigned long page_index,
- struct sgx_backing *backing);
+ struct sgx_backing *backing, bool indirect);
void sgx_encl_put_backing(struct sgx_backing *backing);
int sgx_encl_test_and_clear_young(struct mm_struct *mm,
struct sgx_encl_page *page);
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 110d44c0ef7c..9c82bfd531e6 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -131,7 +131,7 @@ void sgx_epc_cgroup_isolate_pages(struct misc_cg *root,
}
static unsigned int sgx_epc_cgroup_reclaim_pages(unsigned int nr_pages,
- struct misc_cg *root)
+ struct misc_cg *root, bool indirect)
{
LIST_HEAD(iso);
/*
@@ -143,7 +143,7 @@ static unsigned int sgx_epc_cgroup_reclaim_pages(unsigned int nr_pages,
nr_pages = min(nr_pages, SGX_NR_TO_SCAN_MAX);
sgx_epc_cgroup_isolate_pages(root, nr_pages, &iso);
- return sgx_do_epc_reclamation(&iso);
+ return sgx_do_epc_reclamation(&iso, indirect);
}
/*
@@ -191,7 +191,7 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
break;
/* Keep reclaiming until above condition is met. */
- sgx_epc_cgroup_reclaim_pages((unsigned int)(cur - max), epc_cg->cg);
+ sgx_epc_cgroup_reclaim_pages((unsigned int)(cur - max), epc_cg->cg, true);
}
}
@@ -214,7 +214,7 @@ static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
return -EBUSY;
}
- if (!sgx_epc_cgroup_reclaim_pages(1, epc_cg->cg))
+ if (!sgx_epc_cgroup_reclaim_pages(1, epc_cg->cg, false))
/* All pages were too young to reclaim, try again */
schedule();
}
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c496b8f15b54..45036d41c797 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -274,7 +274,7 @@ static void sgx_encl_ewb(struct sgx_epc_page *epc_page,
}
static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
- struct sgx_backing *backing)
+ struct sgx_backing *backing, bool indirect)
{
struct sgx_encl_page *encl_page = epc_page->owner;
struct sgx_encl *encl = encl_page->encl;
@@ -290,7 +290,7 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
if (!encl->secs_child_cnt && test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) {
ret = sgx_encl_alloc_backing(encl, PFN_DOWN(encl->size),
- &secs_backing);
+ &secs_backing, indirect);
if (ret)
goto out;
@@ -346,6 +346,7 @@ unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int n
/**
* sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
* @iso: List of isolated pages for reclamation
+ * @indirect: In ksgxd or EPC cgroup workqueue threads
*
* Take a list of EPC pages and reclaim them to the enclave's private shmem files. Do not
* reclaim the pages that have been accessed since the last scan, and move each of those pages
@@ -362,7 +363,7 @@ unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int n
*
* Return: number of pages successfully reclaimed.
*/
-unsigned int sgx_do_epc_reclamation(struct list_head *iso)
+unsigned int sgx_do_epc_reclamation(struct list_head *iso, bool indirect)
{
struct sgx_backing backing[SGX_NR_TO_SCAN_MAX];
struct sgx_epc_page *epc_page, *tmp;
@@ -384,7 +385,7 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
page_index = PFN_DOWN(encl_page->desc - encl_page->encl->base);
mutex_lock(&encl_page->encl->lock);
- ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i]);
+ ret = sgx_encl_alloc_backing(encl_page->encl, page_index, &backing[i], indirect);
if (ret) {
mutex_unlock(&encl_page->encl->lock);
goto skip;
@@ -411,7 +412,7 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
i = 0;
list_for_each_entry_safe(epc_page, tmp, iso, list) {
encl_page = epc_page->owner;
- sgx_reclaimer_write(epc_page, &backing[i++]);
+ sgx_reclaimer_write(epc_page, &backing[i++], indirect);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
sgx_epc_page_reset_state(epc_page);
@@ -422,7 +423,7 @@ unsigned int sgx_do_epc_reclamation(struct list_head *iso)
return i;
}
-static void sgx_reclaim_epc_pages_global(void)
+static void sgx_reclaim_epc_pages_global(bool indirect)
{
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
LIST_HEAD(iso);
@@ -432,7 +433,7 @@ static void sgx_reclaim_epc_pages_global(void)
else
sgx_isolate_epc_pages(&sgx_global_lru, nr_to_scan, &iso);
- sgx_do_epc_reclamation(&iso);
+ sgx_do_epc_reclamation(&iso, indirect);
}
static bool sgx_should_reclaim(unsigned long watermark)
@@ -449,7 +450,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_epc_pages_global();
+ sgx_reclaim_epc_pages_global(false);
}
static int ksgxd(void *p)
@@ -472,7 +473,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_epc_pages_global();
+ sgx_reclaim_epc_pages_global(true);
cond_resched();
}
@@ -495,11 +496,6 @@ static bool __init sgx_page_reclaimer_init(void)
return true;
}
-bool current_is_ksgxd(void)
-{
- return current == ksgxd_tsk;
-}
-
static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
{
struct sgx_numa_node *node = &sgx_numa_nodes[nid];
@@ -653,7 +649,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_epc_pages_global();
+ sgx_reclaim_epc_pages_global(false);
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 6a40f70ed96f..377625e2ba1d 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -167,7 +167,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
-unsigned int sgx_do_epc_reclamation(struct list_head *iso);
+unsigned int sgx_do_epc_reclamation(struct list_head *iso, bool indirect);
unsigned int sgx_isolate_epc_pages(struct sgx_epc_lru_list *lru, unsigned int nr_to_scan,
struct list_head *dst);
--
2.25.1
There is an issue WRT charging the proper mem_cgroups for backing pages once
per-cgroup reclamation is implemented.
Please apply the fix-up patch [1] on top of this patch or the series.
Thanks
Haitao
[1]
https://lore.kernel.org/all/[email protected]/
On Mon, 06 Nov 2023 06:09:45 -0600, Huang, Kai <[email protected]> wrote:
> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
>> memory using the misc cgroup controller. EPC memory is independent
>> from normal system memory, e.g. must be reserved at boot from RAM and
>> cannot be converted between EPC and normal memory while the system is
>> running. EPC is managed by the SGX subsystem and is not accounted by
>> the memory controller.
>>
>> Much like normal system memory, EPC memory can be overcommitted via
>> virtual memory techniques and pages can be swapped out of the EPC to
>> their backing store (normal system memory, e.g. shmem). The SGX EPC
>> subsystem is analogous to the memory subsystem and the SGX EPC
>> controller
>> is in turn analogous to the memory controller; it implements limit and
>> protection models for EPC memory.
>
> Nit:
>
> The above two paragraphs talk about what is EPC and EPC resource control
> needs
> to be done separately, etc, but IMHO it lacks some background about
> "why" EPC
> resource control is needed, e.g, from use case's perspective.
>
>>
>> The misc controller provides a mechanism to set a hard limit of EPC
>> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
>> available on the system is reported via the "sgx_epc" resource in
>> "misc.capacity".
>
> Please separate what the current misc cgroup provides, and how this
> patch is
> going to utilize.
>
> Please describe the changes in imperative mood. E.g, "report total EPC
> memory
> via ...", instead of "... is reported via ...".
>
Will update
>>
>> This patch was modified from the previous version to only add basic EPC
>> cgroup structure, accounting allocations for cgroup usage
>> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.
>
> This isn't changelog material. Please focus on describing the high
> level design
> and why you chose such design.
>
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> ---
>> V6:
>> - Split the original large patch"Limit process EPC usage with misc
>> cgroup controller" and restructure it (Kai)
>> ---
>> arch/x86/Kconfig | 13 ++++
>> arch/x86/kernel/cpu/sgx/Makefile | 1 +
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++
>> arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++
>> arch/x86/kernel/cpu/sgx/sgx.h | 3 +
>> 6 files changed, 184 insertions(+)
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 66bfabae8814..e17c5dc3aea4 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1921,6 +1921,19 @@ config X86_SGX
>>
>> If unsure, say N.
>>
>> +config CGROUP_SGX_EPC
>> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC)
>> for Intel SGX"
>> + depends on X86_SGX && CGROUP_MISC
>> + help
>> + Provides control over the EPC footprint of tasks in a cgroup via
>> + the Miscellaneous cgroup controller.
>> +
>> + EPC is a subset of regular memory that is usable only by SGX
>> + enclaves and is very limited in quantity, e.g. less than 1%
>> + of total DRAM.
>> +
>> + Say N if unsure.
>> +
>> config X86_USER_SHADOW_STACK
>> bool "X86 userspace shadow stack"
>> depends on AS_WRUSS
>> diff --git a/arch/x86/kernel/cpu/sgx/Makefile
>> b/arch/x86/kernel/cpu/sgx/Makefile
>> index 9c1656779b2a..12901a488da7 100644
>> --- a/arch/x86/kernel/cpu/sgx/Makefile
>> +++ b/arch/x86/kernel/cpu/sgx/Makefile
>> @@ -4,3 +4,4 @@ obj-y += \
>> ioctl.o \
>> main.o
>> obj-$(CONFIG_X86_SGX_KVM) += virt.o
>> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> new file mode 100644
>> index 000000000000..500627d0563f
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -0,0 +1,103 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright(c) 2022 Intel Corporation.
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/kernel.h>
>> +#include "epc_cgroup.h"
>> +
>> +static inline struct sgx_epc_cgroup
>> *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
>> +{
>> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
>> +}
>> +
>> +static inline bool sgx_epc_cgroup_disabled(void)
>> +{
>> + return !cgroup_subsys_enabled(misc_cgrp_subsys);
>
> From below, the root EPC cgroup is dynamically allocated. Shouldn't it
> also
> check whether the root EPC cgroup is valid?
>
Good point. I think I'll go with the static instance approach below.
>> +}
>> +
>> +/**
>> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single
>> EPC page
>> + *
>> + * Returns EPC cgroup or NULL on success, -errno on failure.
>> + */
>> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
>> +{
>> + struct sgx_epc_cgroup *epc_cg;
>> + int ret;
>> +
>> + if (sgx_epc_cgroup_disabled())
>> + return NULL;
>> +
>> + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>> + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> +
>> + if (!ret) {
>> + /* No epc_cg returned, release ref from get_current_misc_cg() */
>> + put_misc_cg(epc_cg->cg);
>> + return ERR_PTR(-ENOMEM);
>
> misc_cg_try_charge() returns 0 when successfully charged, no?
Right. I really made some mess in rebasing :-(
>
>> + }
>> +
>> + /* Ref released in sgx_epc_cgroup_uncharge() */
>> + return epc_cg;
>> +}
>
> IMHO the above _try_charge() returning a pointer of EPC cgroup is a
> little bit
> odd, because it doesn't match the existing misc_cg_try_charge() which
> returns
> whether the charge is successful or not. sev_misc_cg_try_charge()
> matches
> misc_cg_try_charge() too.
>
> I think it's better to split "getting EPC cgroup" part out as a separate
> helper,
> and make this _try_charge() match existing pattern:
>
> struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> {
> if (sgx_epc_cgroup_disabled())
> return NULL;
>
> return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> }
>
> int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> {
> if (!epc_cg)
> return -EINVAL;
>
> return misc_cg_try_charge(epc_cg->cg);
> }
>
> Having sgx_get_current_epc_cg() also makes the caller easier to read,
> because we
> can immediately know we are going to charge the *current* EPC cgroup,
> but not
> some cgroup hidden within sgx_epc_cgroup_try_charge().
>
Actually, unlike other misc controllers, we need to charge and get the epc_cg
reference at the same time. That's why it was returning the pointer. How about
renaming them sgx_{charge_and_get, uncharge_and_put}_epc_cg()? In the final
version, there is a __sgx_epc_cgroup_try_charge() that wraps
misc_cg_try_charge().
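
Roughly what I have in mind, as a sketch only (the names are just the
proposals above, and the inverted return-value check from the current patch
is fixed here):

/* Charge the current cgroup and return it with a reference held. */
static struct sgx_epc_cgroup *sgx_charge_and_get_epc_cg(void)
{
	struct sgx_epc_cgroup *epc_cg;

	if (sgx_epc_cgroup_disabled())
		return NULL;

	/* Ref from get_current_misc_cg(), dropped in the uncharge path. */
	epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());

	if (misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE)) {
		put_misc_cg(epc_cg->cg);
		return ERR_PTR(-ENOMEM);
	}

	return epc_cg;
}

The matching sgx_uncharge_and_put_epc_cg() would then drop both the charge
and the reference in one place, as the current uncharge helper already does.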
>> +
>> +/**
>> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
>> + * @epc_cg: the charged epc cgroup
>> + */
>> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
>> +{
>> + if (sgx_epc_cgroup_disabled())
>> + return;
>
> If with above change, check !epc_cg instead.
>
>> +
>> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
>> +
>> + /* Ref got from sgx_epc_cgroup_try_charge() */
>> + put_misc_cg(epc_cg->cg);
>> +}
>>
>> +
>> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
>> +{
>> + struct sgx_epc_cgroup *epc_cg;
>> +
>> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
>> + if (!epc_cg)
>> + return;
>> +
>> + kfree(epc_cg);
>> +}
>> +
>> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
>> +
>> +const struct misc_operations_struct sgx_epc_cgroup_ops = {
>> + .alloc = sgx_epc_cgroup_alloc,
>> + .free = sgx_epc_cgroup_free,
>> +};
>> +
>> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
>> +{
>> + struct sgx_epc_cgroup *epc_cg;
>> +
>> + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
>> + if (!epc_cg)
>> + return -ENOMEM;
>> +
>> + cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
>> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
>> + epc_cg->cg = cg;
>> + return 0;
>> +}
>> +
>> +static int __init sgx_epc_cgroup_init(void)
>> +{
>> + struct misc_cg *cg;
>> +
>> + if (!boot_cpu_has(X86_FEATURE_SGX))
>> + return 0;
>> +
>> + cg = misc_cg_root();
>> + BUG_ON(!cg);
>
> BUG_ON() will catch some eyeball, but it cannot be NULL in practice IIUC.
>
> I am not sure whether you can just make misc @root_cg visible (instead
> of having
> the misc_cg_root() helper) and directly use @root_cg here to avoid using
> the
> BUG(). No opinion here.
>
I can remove BUG_ON(). It should never happen anyway.
>> +
>> + return sgx_epc_cgroup_alloc(cg);
>
> As mentioned above the memory allocation can fail, in which case EPC
> cgroup is
> effectively disabled IIUC?
>
> One way is to manually check whether root EPC cgroup is valid in
> sgx_epc_cgroup_disabled(). Alternatively, you can have a static root
> EPC cgroup
> here:
>
> static struct sgx_epc_cgroup root_epc_cg;
>
> In this way you can have a sgx_epc_cgroup_init(&epc_cg), and call it from
> sgx_epc_cgroup_alloc().
>
Yeah, I think that is reasonable.
>> +}
>> +subsys_initcall(sgx_epc_cgroup_init);
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> new file mode 100644
>> index 000000000000..c3abfe82be15
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
>> @@ -0,0 +1,36 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright(c) 2022 Intel Corporation. */
>> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
>> +#define _INTEL_SGX_EPC_CGROUP_H_
>> +
>> +#include <asm/sgx.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/list.h>
>> +#include <linux/misc_cgroup.h>
>> +#include <linux/page_counter.h>
>> +#include <linux/workqueue.h>
>> +
>> +#include "sgx.h"
>> +
>> +#ifndef CONFIG_CGROUP_SGX_EPC
>> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
>
> Do you need this macro?
I remember I got a compile error without it, but I don't see why it should be
needed. I'll double check next round. Thanks.
>
>> +struct sgx_epc_cgroup;
>> +
>> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup
>> *epc_cg) { }
>> +#else
>> +struct sgx_epc_cgroup {
>> + struct misc_cg *cg;
>> +};
>> +
>> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
>> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
>> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
>
> Why do you need sgx_epc_cgroup_lru_empty() here?
>
Leftover from rebasing. Will remove.
>> +
>> +#endif
>> +
>> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c
>> b/arch/x86/kernel/cpu/sgx/main.c
>> index 166692f2d501..07606f391540 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -6,6 +6,7 @@
>> #include <linux/highmem.h>
>> #include <linux/kthread.h>
>> #include <linux/miscdevice.h>
>> +#include <linux/misc_cgroup.h>
>> #include <linux/node.h>
>> #include <linux/pagemap.h>
>> #include <linux/ratelimit.h>
>> @@ -17,6 +18,7 @@
>> #include "driver.h"
>> #include "encl.h"
>> #include "encls.h"
>> +#include "epc_cgroup.h"
>>
>> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
>> static int sgx_nr_epc_sections;
>> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct
>> sgx_epc_page *page)
>> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> {
>> struct sgx_epc_page *page;
>> + struct sgx_epc_cgroup *epc_cg;
>> +
>> + epc_cg = sgx_epc_cgroup_try_charge();
>> + if (IS_ERR(epc_cg))
>> + return ERR_CAST(epc_cg);
>>
>> for ( ; ; ) {
>> page = __sgx_alloc_epc_page();
>> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> *owner, bool reclaim)
>> break;
>> }
>>
>> + /*
>> + * Need to do a global reclamation if cgroup was not full but free
>> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
>> + */
>> sgx_reclaim_pages();
>
> What's the final behaviour? IIUC it should be reclaiming from the
> *current* EPC
> cgroup? If so shouldn't we just pass the @epc_cg to it here?
>
> I think we can make this patch as "structure" patch w/o actually having
> EPC
> cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
>
> And we can have one patch to change sgx_reclaim_pages() to take the
> 'struct
> sgx_epc_lru_list *' as argument:
>
> void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> {
> ...
> }
>
> Then here we can have something like:
>
> void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> {
> struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
> &sgx_global_lru;
>
> sgx_reclaim_pages_lru(lru);
> }
>
> Makes sense?
>
This is purely global reclamation. No cgroup is involved. You can see this
later in the changes in patch 10/12. For now I just add a comment there but
make no real changes. Cgroup reclamation will be done as part of the
_try_charge call.
>> cond_resched();
>> }
>>
>> + if (!IS_ERR(page)) {
>> + WARN_ON_ONCE(page->epc_cg);
>> + page->epc_cg = epc_cg;
>> + } else {
>> + sgx_epc_cgroup_uncharge(epc_cg);
>> + }
>> +
>> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> wake_up(&ksgxd_waitq);
>>
>> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
>> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
>> struct sgx_numa_node *node = section->node;
>>
>> + if (page->epc_cg) {
>> + sgx_epc_cgroup_uncharge(page->epc_cg);
>> + page->epc_cg = NULL;
>> + }
>> +
>> spin_lock(&node->lock);
>>
>> page->owner = NULL;
>> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64
>> phys_addr, u64 size,
>> section->pages[i].flags = 0;
>> section->pages[i].owner = NULL;
>> section->pages[i].poison = 0;
>> + section->pages[i].epc_cg = NULL;
>> list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>> }
>>
>> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int
>> nid) {}
>> static bool __init sgx_page_cache_init(void)
>> {
>> u32 eax, ebx, ecx, edx, type;
>> + u64 capacity = 0;
>> u64 pa, size;
>> int nid;
>> int i;
>> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>>
>> sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
>> sgx_numa_nodes[nid].size += size;
>> + capacity += size;
>>
>> sgx_nr_epc_sections++;
>> }
>> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
>> return false;
>> }
>>
>> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
>> +
>> return true;
>> }
>
> I would separate setting up capacity as a separate patch.
I thought about that, but again it was only 3-4 lines, all in this function,
and it's also a necessary part of the basic setup for the misc controller...
>
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h
>> b/arch/x86/kernel/cpu/sgx/sgx.h
>> index d2dad21259a8..b1786774b8d2 100644
>> --- a/arch/x86/kernel/cpu/sgx/sgx.h
>> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
>> @@ -29,12 +29,15 @@
>> /* Pages on free list */
>> #define SGX_EPC_PAGE_IS_FREE BIT(1)
>>
>> +struct sgx_epc_cgroup;
>> +
>> struct sgx_epc_page {
>> unsigned int section;
>> u16 flags;
>> u16 poison;
>> struct sgx_encl_page *owner;
>> struct list_head list;
>> + struct sgx_epc_cgroup *epc_cg;
>> };
>>
>
> Adding @epc_cg unconditionally means even with !CONFIG_CGROUP_SGX_EPC
> the memory
> is still occupied. IMHO that would bring non-trivial memory waste as
> it's 8-
> bytes for each EPC page.
>
Ok, I'll add an #ifdef.
>
> > > +/**
> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single
> > > EPC page
> > > + *
> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
> > > + */
> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> > > +{
> > > + struct sgx_epc_cgroup *epc_cg;
> > > + int ret;
> > > +
> > > + if (sgx_epc_cgroup_disabled())
> > > + return NULL;
> > > +
> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> > > +
> > > + if (!ret) {
> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */
> > > + put_misc_cg(epc_cg->cg);
> > > + return ERR_PTR(-ENOMEM);
> >
> > misc_cg_try_charge() returns 0 when successfully charged, no?
>
> Right. I really made some mess in rebasing :-(
>
> >
> > > + }
> > > +
> > > + /* Ref released in sgx_epc_cgroup_uncharge() */
> > > + return epc_cg;
> > > +}
> >
> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a
> > little bit
> > odd, because it doesn't match the existing misc_cg_try_charge() which
> > returns
> > whether the charge is successful or not. sev_misc_cg_try_charge()
> > matches
> > misc_cg_try_charge() too.
> >
> > I think it's better to split "getting EPC cgroup" part out as a separate
> > helper,
> > and make this _try_charge() match existing pattern:
> >
> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> > {
> > if (sgx_epc_cgroup_disabled())
> > return NULL;
> >
> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> > }
> >
> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> > {
> > if (!epc_cg)
> > return -EINVAL;
> >
> > return misc_cg_try_charge(epc_cg->cg);
> > }
> >
> > Having sgx_get_current_epc_cg() also makes the caller easier to read,
> > because we
> > can immediately know we are going to charge the *current* EPC cgroup,
> > but not
> > some cgroup hidden within sgx_epc_cgroup_try_charge().
> >
>
> Actually, unlike other misc controllers, we need charge and get the epc_cg
> reference at the same time.
>
Can you elaborate?
And in practice you always call sgx_epc_cgroup_try_charge() right after
sgx_get_current_epc_cg() anyway. The only difference is the whole thing is done
in one function or in separate functions.
[...]
> > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > {
> > > struct sgx_epc_page *page;
> > > + struct sgx_epc_cgroup *epc_cg;
> > > +
> > > + epc_cg = sgx_epc_cgroup_try_charge();
> > > + if (IS_ERR(epc_cg))
> > > + return ERR_CAST(epc_cg);
> > >
> > > for ( ; ; ) {
> > > page = __sgx_alloc_epc_page();
> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > > *owner, bool reclaim)
> > > break;
> > > }
> > >
> > > + /*
> > > + * Need to do a global reclamation if cgroup was not full but free
> > > + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> > > + */
> > > sgx_reclaim_pages();
> >
> > What's the final behaviour? IIUC it should be reclaiming from the
> > *current* EPC
> > cgroup? If so shouldn't we just pass the @epc_cg to it here?
> >
> > I think we can make this patch as "structure" patch w/o actually having
> > EPC
> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> >
> > And we can have one patch to change sgx_reclaim_pages() to take the
> > 'struct
> > sgx_epc_lru_list *' as argument:
> >
> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > {
> > ...
> > }
> >
> > Then here we can have something like:
> >
> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > {
> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
> > &sgx_global_lru;
> >
> > sgx_reclaim_pages_lru(lru);
> > }
> >
> > Makes sense?
> >
>
> This is purely global reclamation. No cgroup involved.
>
Again why? Here you are allocating one EPC page for enclave in a particular EPC
cgroup. When that fails, shouldn't you try only to reclaim from the *current*
EPC cgroup? Or at least you should try to reclaim from the *current* EPC cgroup
first?
> You can see it
> later in changes in patch 10/12. For now I just make a comment there but
> no real changes. Cgroup reclamation will be done as part of _try_charge
> call.
>
> > > cond_resched();
> > > }
> > >
> > > + if (!IS_ERR(page)) {
> > > + WARN_ON_ONCE(page->epc_cg);
> > > + page->epc_cg = epc_cg;
> > > + } else {
> > > + sgx_epc_cgroup_uncharge(epc_cg);
> > > + }
> > > +
> > > if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> > > wake_up(&ksgxd_waitq);
> > >
> > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> > > struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> > > struct sgx_numa_node *node = section->node;
> > >
> > > + if (page->epc_cg) {
> > > + sgx_epc_cgroup_uncharge(page->epc_cg);
> > > + page->epc_cg = NULL;
> > > + }
> > > +
> > > spin_lock(&node->lock);
> > >
> > > page->owner = NULL;
> > > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64
> > > phys_addr, u64 size,
> > > section->pages[i].flags = 0;
> > > section->pages[i].owner = NULL;
> > > section->pages[i].poison = 0;
> > > + section->pages[i].epc_cg = NULL;
> > > list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> > > }
> > >
> > > @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int
> > > nid) {}
> > > static bool __init sgx_page_cache_init(void)
> > > {
> > > u32 eax, ebx, ecx, edx, type;
> > > + u64 capacity = 0;
> > > u64 pa, size;
> > > int nid;
> > > int i;
> > > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
> > >
> > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> > > sgx_numa_nodes[nid].size += size;
> > > + capacity += size;
> > >
> > > sgx_nr_epc_sections++;
> > > }
> > > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
> > > return false;
> > > }
> > >
> > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
Hmm.. I think this is why MISC_CG_RES_SGX_EPC is needed when
!CONFIG_CGROUP_SGX_EPC.
> > > +
> > > return true;
> > > }
> >
> > I would separate setting up capacity as a separate patch.
>
> I thought about that, but again it was only 3-4 lines all in this function
> and it's also necessary part of basic setup for misc controller...
Fine. Anyway it depends on what things you want to do on this patch. It's fine
to include the capacity if this patch is some "structure" patch that shows the
high level flow of how EPC cgroup works.
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> +static int __init sgx_epc_cgroup_init(void)
> +{
> + struct misc_cg *cg;
> +
> + if (!boot_cpu_has(X86_FEATURE_SGX))
> + return 0;
> +
> + cg = misc_cg_root();
> + BUG_ON(!cg);
> +
> + return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
This should be called from sgx_init(), which is the place to init SGX related
stuff. In case you didn't notice, sgx_init() is actually a device_initcall(),
which is called _after_ the subsys_initcall() you used above.
On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai <[email protected]> wrote:
>>
>> > > +/**
>> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a
>> single
>> > > EPC page
>> > > + *
>> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
>> > > + */
>> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
>> > > +{
>> > > + struct sgx_epc_cgroup *epc_cg;
>> > > + int ret;
>> > > +
>> > > + if (sgx_epc_cgroup_disabled())
>> > > + return NULL;
>> > > +
>> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> PAGE_SIZE);
>> > > +
>> > > + if (!ret) {
>> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */
>> > > + put_misc_cg(epc_cg->cg);
>> > > + return ERR_PTR(-ENOMEM);
>> >
>> > misc_cg_try_charge() returns 0 when successfully charged, no?
>>
>> Right. I really made some mess in rebasing :-(
>>
>> >
>> > > + }
>> > > +
>> > > + /* Ref released in sgx_epc_cgroup_uncharge() */
>> > > + return epc_cg;
>> > > +}
>> >
>> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a
>> > little bit
>> > odd, because it doesn't match the existing misc_cg_try_charge() which
>> > returns
>> > whether the charge is successful or not. sev_misc_cg_try_charge()
>> > matches
>> > misc_cg_try_charge() too.
>> >
>> > I think it's better to split "getting EPC cgroup" part out as a
>> separate
>> > helper,
>> > and make this _try_charge() match existing pattern:
>> >
>> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
>> > {
>> > if (sgx_epc_cgroup_disabled())
>> > return NULL;
>> >
>> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>> > }
>> >
>> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> > {
>> > if (!epc_cg)
>> > return -EINVAL;
>> >
>> > return misc_cg_try_charge(epc_cg->cg);
>> > }
>> >
>> > Having sgx_get_current_epc_cg() also makes the caller easier to read,
>> > because we
>> > can immediately know we are going to charge the *current* EPC cgroup,
>> > but not
>> > some cgroup hidden within sgx_epc_cgroup_try_charge().
>> >
>>
>> Actually, unlike other misc controllers, we need charge and get the
>> epc_cg
>> reference at the same time.
>
> Can you elaborate?
>
> And in practice you always call sgx_epc_cgroup_try_charge() right after
> sgx_get_current_epc_cg() anyway. The only difference is the whole thing
> is done
> in one function or in separate functions.
>
> [...]
>
That's true. I was thinking there is no need to have them done in separate
calls. The caller has to check the return value for the epc_cg instance first,
then check the result of try_charge. But there is really only one caller,
sgx_alloc_epc_page() below, so I don't have strong opinions now.
With them separate, the checks will look like this:

	/* NULL means the cgroup is disabled; continue with the allocation. */
	epc_cg = sgx_get_current_epc_cg();
	if (epc_cg) {
		ret = sgx_epc_cgroup_try_charge(epc_cg);
		if (ret)
			return ret;
	}
	/* continue... */
I can go either way.
>
>> > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > > {
>> > > struct sgx_epc_page *page;
>> > > + struct sgx_epc_cgroup *epc_cg;
>> > > +
>> > > + epc_cg = sgx_epc_cgroup_try_charge();
>> > > + if (IS_ERR(epc_cg))
>> > > + return ERR_CAST(epc_cg);
>> > >
>> > > for ( ; ; ) {
>> > > page = __sgx_alloc_epc_page();
>> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>> > > *owner, bool reclaim)
>> > > break;
>> > > }
>> > >
>> > > + /*
>> > > + * Need to do a global reclamation if cgroup was not full but
>> free
>> > > + * physical pages run out, causing __sgx_alloc_epc_page() to
>> fail.
>> > > + */
>> > > sgx_reclaim_pages();
>> >
>> > What's the final behaviour? IIUC it should be reclaiming from the
>> > *current* EPC
>> > cgroup? If so shouldn't we just pass the @epc_cg to it here?
>> >
>> > I think we can make this patch as "structure" patch w/o actually
>> having
>> > EPC
>> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
>> >
>> > And we can have one patch to change sgx_reclaim_pages() to take the
>> > 'struct
>> > sgx_epc_lru_list *' as argument:
>> >
>> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
>> > {
>> > ...
>> > }
>> >
>> > Then here we can have something like:
>> >
>> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>> > {
>> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
>> > &sgx_global_lru;
>> >
>> > sgx_reclaim_pages_lru(lru);
>> > }
>> >
>> > Makes sense?
>> >
>>
>> This is purely global reclamation. No cgroup involved.
>
> Again why? Here you are allocating one EPC page for enclave in a
> particular EPC
> cgroup. When that fails, shouldn't you try only to reclaim from the
> *current*
> EPC cgroup? Or at least you should try to reclaim from the *current*
> EPC cgroup
> first?
>
Later, sgx_epc_cg_try_charge() will take a 'reclaim' flag: if true, the cgroup
reclaims synchronously; otherwise reclamation runs in the background and
-EBUSY is returned. The function also returns early if no valid epc_cg pointer
is obtained.
All reclamation for the *current* cgroup is done in sgx_epc_cg_try_charge().
So, by the time we reach this point, a valid epc_cg pointer was returned,
which means allocation is allowed for the cgroup (it has reclaimed if
necessary, and its usage is not above the limit after charging).
But the system-level free count may still be low (e.g., the limits of all
cgroups may add up to more than the capacity), so we need to do a global
reclamation here, which may involve reclaiming a few pages (from the current
or other cgroups) so the system stays in a performant state with a minimal
free count (the current behavior of ksgxd).
Note this patch does not do per-cgroup reclamation. If we had stopped with
this patch and without the next patches, cgroups would only block allocation
by returning an invalid epc_cg pointer (-ENOMEM) from sgx_epc_cg_try_charge().
Reclamation would only happen when the cgroup is not full but the system-level
free count is below the threshold.
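
To make the flow concrete, here is a simplified sketch of how the final
__sgx_epc_cgroup_try_charge() handles the 'reclaim' flag (not the exact code
in patch 10; the background work-queueing details are elided):

static int __sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg,
				       bool reclaim)
{
	for (;;) {
		if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
					PAGE_SIZE))
			return 0;

		/* Nothing reclaimable left in this subtree: hard failure. */
		if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
			return -ENOMEM;

		if (!reclaim) {
			/*
			 * Let the caller back off; per-cgroup reclamation is
			 * kicked off in the background (work item not shown).
			 */
			return -EBUSY;
		}

		if (!sgx_epc_cgroup_reclaim_pages(1, epc_cg->cg, false))
			/* All pages were too young to reclaim, try again. */
			schedule();
	}
}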
>> You can see it
>> later in changes in patch 10/12. For now I just make a comment there but
>> no real changes. Cgroup reclamation will be done as part of _try_charge
>> call.
>>
>> > > cond_resched();
>> > > }
>> > >
>> > > + if (!IS_ERR(page)) {
>> > > + WARN_ON_ONCE(page->epc_cg);
>> > > + page->epc_cg = epc_cg;
>> > > + } else {
>> > > + sgx_epc_cgroup_uncharge(epc_cg);
>> > > + }
>> > > +
>> > > if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
>> > > wake_up(&ksgxd_waitq);
>> > >
>> > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page
>> *page)
>> > > struct sgx_epc_section *section =
>> &sgx_epc_sections[page->section];
>> > > struct sgx_numa_node *node = section->node;
>> > >
>> > > + if (page->epc_cg) {
>> > > + sgx_epc_cgroup_uncharge(page->epc_cg);
>> > > + page->epc_cg = NULL;
>> > > + }
>> > > +
>> > > spin_lock(&node->lock);
>> > >
>> > > page->owner = NULL;
>> > > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64
>> > > phys_addr, u64 size,
>> > > section->pages[i].flags = 0;
>> > > section->pages[i].owner = NULL;
>> > > section->pages[i].poison = 0;
>> > > + section->pages[i].epc_cg = NULL;
>> > > list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
>> > > }
>> > >
>> > > @@ -787,6 +811,7 @@ static void __init
>> arch_update_sysfs_visibility(int
>> > > nid) {}
>> > > static bool __init sgx_page_cache_init(void)
>> > > {
>> > > u32 eax, ebx, ecx, edx, type;
>> > > + u64 capacity = 0;
>> > > u64 pa, size;
>> > > int nid;
>> > > int i;
>> > > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>> > >
>> > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
>> > > sgx_numa_nodes[nid].size += size;
>> > > + capacity += size;
>> > >
>> > > sgx_nr_epc_sections++;
>> > > }
>> > > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
>> > > return false;
>> > > }
>> > >
>> > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
>
> Hmm.. I think this is why MISC_CG_RES_SGX_EPC is needed when
> !CONFIG_CGROUP_SGX_EPC.
right, that was it :-)
>
>> > > +
>> > > return true;
>> > > }
>> >
>> > I would separate setting up capacity as a separate patch.
>>
>> I thought about that, but again it was only 3-4 lines all in this
>> function
>> and it's also necessary part of basic setup for misc controller...
>
> Fine. Anyway it depends on what things you want to do on this patch.
> It's fine
> to include the capacity if this patch is some "structure" patch that
> shows the
> high level flow of how EPC cgroup works.
Ok, I'll keep it this way for now unless there are any objections.
Thanks
Haitao
On Mon, 06 Nov 2023 19:16:30 -0600, Haitao Huang
<[email protected]> wrote:
> On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai <[email protected]>
> wrote:
>
>>>
>>> > > +/**
>>> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a
>>> single
>>> > > EPC page
>>> > > + *
>>> > > + * Returns EPC cgroup or NULL on success, -errno on failure.
>>> > > + */
>>> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
>>> > > +{
>>> > > + struct sgx_epc_cgroup *epc_cg;
>>> > > + int ret;
>>> > > +
>>> > > + if (sgx_epc_cgroup_disabled())
>>> > > + return NULL;
>>> > > +
>>> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>>> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>>> PAGE_SIZE);
>>> > > +
>>> > > + if (!ret) {
>>> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */
>>> > > + put_misc_cg(epc_cg->cg);
>>> > > + return ERR_PTR(-ENOMEM);
>>> >
>>> > misc_cg_try_charge() returns 0 when successfully charged, no?
>>>
>>> Right. I really made some mess in rebasing :-(
>>>
>>> >
>>> > > + }
>>> > > +
>>> > > + /* Ref released in sgx_epc_cgroup_uncharge() */
>>> > > + return epc_cg;
>>> > > +}
>>> >
>>> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a
>>> > little bit
>>> > odd, because it doesn't match the existing misc_cg_try_charge() which
>>> > returns
>>> > whether the charge is successful or not. sev_misc_cg_try_charge()
>>> > matches
>>> > misc_cg_try_charge() too.
>>> >
>>> > I think it's better to split "getting EPC cgroup" part out as a
>>> separate
>>> > helper,
>>> > and make this _try_charge() match existing pattern:
>>> >
>>> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
>>> > {
>>> > if (sgx_epc_cgroup_disabled())
>>> > return NULL;
>>> >
>>> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
>>> > }
>>> >
>>> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>>> > {
>>> > if (!epc_cg)
>>> > return -EINVAL;
>>> >
>>> > return misc_cg_try_charge(epc_cg->cg);
>>> > }
>>> >
>>> > Having sgx_get_current_epc_cg() also makes the caller easier to read,
>>> > because we
>>> > can immediately know we are going to charge the *current* EPC cgroup,
>>> > but not
>>> > some cgroup hidden within sgx_epc_cgroup_try_charge().
>>> >
>>>
>>> Actually, unlike other misc controllers, we need charge and get the
>>> epc_cg
>>> reference at the same time.
>>
>> Can you elaborate?
>>
>> And in practice you always call sgx_epc_cgroup_try_charge() right after
>> sgx_get_current_epc_cg() anyway. The only difference is the whole
>> thing is done
>> in one function or in separate functions.
>>
>> [...]
>>
>
> That's true. I was thinking no need to have them done in separate calls.
> The caller has to check the return value for epc_cg instance first, then
> check result of try_charge. But there is really only one caller,
> sgx_alloc_epc_page() below, so I don't have strong opinions now.
>
> With them separate, the checks will look like this:
> if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,
> should continue for allocation
> {
> if (ret = sgx_epc_cgroup_try_charge())
> return ret
> }
> // continue...
>
> I can go either way.
>
>>
>>> > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>>> > > {
>>> > > struct sgx_epc_page *page;
>>> > > + struct sgx_epc_cgroup *epc_cg;
>>> > > +
>>> > > + epc_cg = sgx_epc_cgroup_try_charge();
>>> > > + if (IS_ERR(epc_cg))
>>> > > + return ERR_CAST(epc_cg);
>>> > >
>>> > > for ( ; ; ) {
>>> > > page = __sgx_alloc_epc_page();
>>> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
>>> > > *owner, bool reclaim)
>>> > > break;
>>> > > }
>>> > >
>>> > > + /*
>>> > > + * Need to do a global reclamation if cgroup was not full but
>>> free
>>> > > + * physical pages run out, causing __sgx_alloc_epc_page() to
>>> fail.
>>> > > + */
>>> > > sgx_reclaim_pages();
>>> >
>>> > What's the final behaviour? IIUC it should be reclaiming from the
>>> > *current* EPC
>>> > cgroup? If so shouldn't we just pass the @epc_cg to it here?
>>> >
>>> > I think we can make this patch as "structure" patch w/o actually
>>> having
>>> > EPC
>>> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
>>> >
>>> > And we can have one patch to change sgx_reclaim_pages() to take the
>>> > 'struct
>>> > sgx_epc_lru_list *' as argument:
>>> >
>>> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
>>> > {
>>> > ...
>>> > }
>>> >
>>> > Then here we can have something like:
>>> >
>>> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>>> > {
>>> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
>>> > &sgx_global_lru;
>>> >
>>> > sgx_reclaim_pages_lru(lru);
>>> > }
>>> >
>>> > Makes sense?
>>> >
>>>
>>> This is purely global reclamation. No cgroup involved.
>>
>> Again why? Here you are allocating one EPC page for enclave in a
>> particular EPC
>> cgroup. When that fails, shouldn't you try only to reclaim from the
>> *current*
>> EPC cgroup? Or at least you should try to reclaim from the *current*
>> EPC cgroup
>> first?
>>
>
> Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup
> reclaims synchronously, otherwise in background and returns -EBUSY in
> that case. This function also returns if no valid epc_cg pointer
> returned.
>
> All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge().
>
> So, by reaching to this point, a valid epc_cg pointer was returned,
> that means allocation is allowed for the cgroup (it has reclaimed if
> necessary, and its usage is not above limit after charging).
>
> But the system level free count may be low (e.g., limits of all cgroups
> may add up to be more than capacity). so we need to do a global
> reclamation here, which may involve reclaiming a few pages (from current
> or other groups) so the system can be at a performant state with minimal
> free count. (current behavior of ksgxd).
>
I should have stuck to the original comment added in the code. Actually,
__sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the real
reason for the global reclaim. The free-count enforcement is near the end of
this method, after the should_reclaim() check.
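
For reference, this is the shape of the ordering I mean, condensed from the
patch with the intermediate checks elided:

	epc_cg = sgx_epc_cgroup_try_charge();
	if (IS_ERR(epc_cg))
		return ERR_CAST(epc_cg);

	for ( ; ; ) {
		page = __sgx_alloc_epc_page();
		if (!IS_ERR(page))
			break;

		/* (reclaimability and wait/backoff checks elided) */

		/*
		 * The cgroup was not over its limit, but physical EPC ran
		 * out and __sgx_alloc_epc_page() failed, so reclaim globally.
		 */
		sgx_reclaim_pages();
		cond_resched();
	}

	/* The free-count threshold is enforced only after the loop. */
	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
		wake_up(&ksgxd_waitq);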
Haitao
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> Use the lower 2 bits in the flags field of sgx_epc_page struct to track
> EPC states and define an enum for possible states for EPC pages tracked
> for reclamation.
>
> Add the RECLAIM_IN_PROGRESS state to explicitly indicate a page that is
> identified as a candidate for reclaiming, but has not yet been
> reclaimed, instead of relying on list_empty(&epc_page->list). A later
> patch will replace the array on stack with a temporary list to store the
> candidate pages, so list_empty() should no longer be used for this
> purpose.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Kristen Carlson Accardi <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> ---
> V6:
> - Drop UNRECLAIMABLE and use only 2 bits for states (Kai)
> - Combine the patch for RECLAIM_IN_PROGRESS
> - Style fixes (Jarkko and Kai)
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 2 +-
> arch/x86/kernel/cpu/sgx/main.c | 33 +++++++++---------
> arch/x86/kernel/cpu/sgx/sgx.h | 62 +++++++++++++++++++++++++++++++---
> 3 files changed, 76 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..17dc108d3ff7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -1315,7 +1315,7 @@ void sgx_encl_free_epc_page(struct sgx_epc_page *page)
> {
> int ret;
>
> - WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> + WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_STATE_MASK);
>
> ret = __eremove(sgx_get_epc_virt_addr(page));
> if (WARN_ONCE(ret, EREMOVE_ERROR_MESSAGE, ret, ret))
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index d347acd717fd..e27ac73d8843 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -315,13 +315,14 @@ static void sgx_reclaim_pages(void)
> list_del_init(&epc_page->list);
> encl_page = epc_page->owner;
>
> - if (kref_get_unless_zero(&encl_page->encl->refcount) != 0)
> + if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> + sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> chunk[cnt++] = epc_page;
> - else
> + } else
> /* The owner is freeing the page. No need to add the
> * page back to the list of reclaimable pages.
> */
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(epc_page);
> }
> spin_unlock(&sgx_global_lru.lock);
>
> @@ -347,6 +348,7 @@ static void sgx_reclaim_pages(void)
>
> skip:
> spin_lock(&sgx_global_lru.lock);
> + sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
> list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> spin_unlock(&sgx_global_lru.lock);
>
> @@ -370,7 +372,7 @@ static void sgx_reclaim_pages(void)
> sgx_reclaimer_write(epc_page, &backing[i]);
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
> - epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(epc_page);
>
> sgx_free_epc_page(epc_page);
> }
> @@ -509,7 +511,8 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
> void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> {
> spin_lock(&sgx_global_lru.lock);
> - page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + WARN_ON_ONCE(sgx_epc_page_reclaimable(page->flags));
> + page->flags |= SGX_EPC_PAGE_RECLAIMABLE;
> list_add_tail(&page->list, &sgx_global_lru.reclaimable);
> spin_unlock(&sgx_global_lru.lock);
> }
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> {
> spin_lock(&sgx_global_lru.lock);
> - if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> - /* The page is being reclaimed. */
> - if (list_empty(&page->list)) {
> - spin_unlock(&sgx_global_lru.lock);
> - return -EBUSY;
> - }
> -
> - list_del(&page->list);
> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> + spin_unlock(&sgx_global_lru.lock);
> + return -EBUSY;
> }
> +
> + list_del(&page->list);
> + sgx_epc_page_reset_state(page);
> spin_unlock(&sgx_global_lru.lock);
>
> return 0;
> @@ -623,6 +623,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> struct sgx_numa_node *node = section->node;
>
> + WARN_ON_ONCE(page->flags & (SGX_EPC_PAGE_STATE_MASK));
> if (page->epc_cg) {
> sgx_epc_cgroup_uncharge(page->epc_cg);
> page->epc_cg = NULL;
> @@ -635,7 +636,7 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> list_add(&page->list, &node->sgx_poison_page_list);
> else
> list_add_tail(&page->list, &node->free_page_list);
> - page->flags = SGX_EPC_PAGE_IS_FREE;
> + page->flags = SGX_EPC_PAGE_FREE;
>
> spin_unlock(&node->lock);
> atomic_long_inc(&sgx_nr_free_pages);
> @@ -737,7 +738,7 @@ int arch_memory_failure(unsigned long pfn, int flags)
> * If the page is on a free list, move it to the per-node
> * poison page list.
> */
> - if (page->flags & SGX_EPC_PAGE_IS_FREE) {
> + if (page->flags == SGX_EPC_PAGE_FREE) {
> list_move(&page->list, &node->sgx_poison_page_list);
> goto out;
> }
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index 0fbe6a2a159b..dd7ab65b5b27 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -23,11 +23,44 @@
> #define SGX_NR_LOW_PAGES 32
> #define SGX_NR_HIGH_PAGES 64
>
> -/* Pages, which are being tracked by the page reclaimer. */
> -#define SGX_EPC_PAGE_RECLAIMER_TRACKED BIT(0)
> +enum sgx_epc_page_state {
> + /*
> + * Allocated but not tracked by the reclaimer.
> + *
> + * Pages allocated for virtual EPC which are never tracked by the host
> + * reclaimer; pages just allocated from free list but not yet put in
> + * use; pages just reclaimed, but not yet returned to the free list.
> + * Becomes FREE after sgx_free_epc().
> + * Becomes RECLAIMABLE after sgx_mark_page_reclaimable().
> + */
> + SGX_EPC_PAGE_NOT_TRACKED = 0,
> +
> + /*
> + * Page is in the free list, ready for allocation.
> + *
> + * Becomes NOT_TRACKED after sgx_alloc_epc_page().
> + */
> + SGX_EPC_PAGE_FREE = 1,
> +
> + /*
> + * Page is in use and tracked in a reclaimable LRU list.
> + *
> + * Becomes NOT_TRACKED after sgx_unmark_page_reclaimable().
> + * Becomes RECLAIM_IN_PROGRESS in sgx_reclaim_pages() when identified
> + * for reclaiming.
> + */
> + SGX_EPC_PAGE_RECLAIMABLE = 2,
> +
> + /*
> + * Page is in the middle of reclamation.
> + *
> + * Back to RECLAIMABLE if reclamation fails for any reason.
> + * Becomes NOT_TRACKED if reclaimed successfully.
> + */
> + SGX_EPC_PAGE_RECLAIM_IN_PROGRESS = 3,
> +};
>
> -/* Pages on free list */
> -#define SGX_EPC_PAGE_IS_FREE BIT(1)
> +#define SGX_EPC_PAGE_STATE_MASK GENMASK(1, 0)
>
> struct sgx_epc_cgroup;
>
> @@ -40,6 +73,27 @@ struct sgx_epc_page {
> struct sgx_epc_cgroup *epc_cg;
> };
>
> +static inline void sgx_epc_page_reset_state(struct sgx_epc_page *page)
> +{
> + page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> +}
> +
> +static inline void sgx_epc_page_set_state(struct sgx_epc_page *page, unsigned long flags)
> +{
> + page->flags &= ~SGX_EPC_PAGE_STATE_MASK;
> + page->flags |= (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
> +static inline bool sgx_epc_page_reclaim_in_progress(unsigned long flags)
> +{
> + return SGX_EPC_PAGE_RECLAIM_IN_PROGRESS == (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
> +static inline bool sgx_epc_page_reclaimable(unsigned long flags)
> +{
> + return SGX_EPC_PAGE_RECLAIMABLE == (flags & SGX_EPC_PAGE_STATE_MASK);
> +}
> +
> /*
> * Contains the tracking data for NUMA nodes having EPC pages. Most importantly,
> * the free page list local to the node is stored here.
Looks pretty good to me. I'll hold acks a bit until everything looks good as a
whole, but I agree with the general idea in this patch.
BR, Jarkko
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> Implement support for cgroup control of SGX Enclave Page Cache (EPC)
> memory using the misc cgroup controller. EPC memory is independent
> from normal system memory, e.g. must be reserved at boot from RAM and
> cannot be converted between EPC and normal memory while the system is
> running. EPC is managed by the SGX subsystem and is not accounted by
> the memory controller.
>
> Much like normal system memory, EPC memory can be overcommitted via
> virtual memory techniques and pages can be swapped out of the EPC to
> their backing store (normal system memory, e.g. shmem). The SGX EPC
> subsystem is analogous to the memory subsystem and the SGX EPC controller
> is in turn analogous to the memory controller; it implements limit and
> protection models for EPC memory.
>
> The misc controller provides a mechanism to set a hard limit of EPC
> usage via the "sgx_epc" resource in "misc.max". The total EPC memory
> available on the system is reported via the "sgx_epc" resource in
> "misc.capacity".
>
> This patch was modified from the previous version to only add basic EPC
> cgroup structure, accounting allocations for cgroup usage
> (charge/uncharge), setup misc cgroup callbacks, set total EPC capacity.
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller" and restructure it (Kai)
> ---
> arch/x86/Kconfig | 13 ++++
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 103 +++++++++++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 36 ++++++++++
> arch/x86/kernel/cpu/sgx/main.c | 28 ++++++++
> arch/x86/kernel/cpu/sgx/sgx.h | 3 +
> 6 files changed, 184 insertions(+)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 66bfabae8814..e17c5dc3aea4 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1921,6 +1921,19 @@ config X86_SGX
>
> If unsure, say N.
>
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> + Provides control over the EPC footprint of tasks in a cgroup via
> + the Miscellaneous cgroup controller.
> +
> + EPC is a subset of regular memory that is usable only by SGX
> + enclaves and is very limited in quantity, e.g. less than 1%
> + of total DRAM.
> +
> + Say N if unsure.
> +
> config X86_USER_SHADOW_STACK
> bool "X86 userspace shadow stack"
> depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
> ioctl.o \
> main.o
> obj-$(CONFIG_X86_SGX_KVM) += virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..500627d0563f
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,103 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +static inline bool sgx_epc_cgroup_disabled(void)
> +{
> + return !cgroup_subsys_enabled(misc_cgrp_subsys);
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a single EPC page
> + *
> + * Returns EPC cgroup or NULL on success, -errno on failure.
Should have a description explaining what "charging hierarchically" is
all about. This is too cryptic as it is.
E.g. consider what non-hierarchical charging would mean. There must be an
opposite end in order for the term to have a meaning (as for anything
expressed with a language).
> + */
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> + int ret;
> +
> + if (sgx_epc_cgroup_disabled())
> + return NULL;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> + if (!ret) {
> + /* No epc_cg returned, release ref from get_current_misc_cg() */
> + put_misc_cg(epc_cg->cg);
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* Ref released in sgx_epc_cgroup_uncharge() */
> + return epc_cg;
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - hierarchically uncharge EPC pages
> + * @epc_cg: the charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (sgx_epc_cgroup_disabled())
> + return;
> +
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +
> + /* Ref got from sgx_epc_cgroup_try_charge() */
> + put_misc_cg(epc_cg->cg);
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> + if (!epc_cg)
> + return;
> +
> + kfree(epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_operations_struct sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> + if (!epc_cg)
> + return -ENOMEM;
> +
> + cg->res[MISC_CG_RES_SGX_EPC].misc_ops = &sgx_epc_cgroup_ops;
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> + return 0;
> +}
> +
> +static int __init sgx_epc_cgroup_init(void)
> +{
> + struct misc_cg *cg;
> +
> + if (!boot_cpu_has(X86_FEATURE_SGX))
> + return 0;
> +
> + cg = misc_cg_root();
> + BUG_ON(!cg);
> +
> + return sgx_epc_cgroup_alloc(cg);
> +}
> +subsys_initcall(sgx_epc_cgroup_init);
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..c3abfe82be15
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void)
> +{
> + return NULL;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +#else
> +struct sgx_epc_cgroup {
> + struct misc_cg *cg;
> +};
> +
> +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
> +
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..07606f391540 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
> #include <linux/node.h>
> #include <linux/pagemap.h>
> #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
> #include "driver.h"
> #include "encl.h"
> #include "encls.h"
> +#include "epc_cgroup.h"
>
> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> static int sgx_nr_epc_sections;
> @@ -559,6 +561,11 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> {
> struct sgx_epc_page *page;
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_try_charge();
> + if (IS_ERR(epc_cg))
> + return ERR_CAST(epc_cg);
>
> for ( ; ; ) {
> page = __sgx_alloc_epc_page();
> @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> + /*
> + * Need to do a global reclamation if cgroup was not full but free
> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> + */
> sgx_reclaim_pages();
> cond_resched();
> }
>
> + if (!IS_ERR(page)) {
> + WARN_ON_ONCE(page->epc_cg);
> + page->epc_cg = epc_cg;
> + } else {
> + sgx_epc_cgroup_uncharge(epc_cg);
> + }
> +
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> wake_up(&ksgxd_waitq);
>
> @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> struct sgx_numa_node *node = section->node;
>
> + if (page->epc_cg) {
> + sgx_epc_cgroup_uncharge(page->epc_cg);
> + page->epc_cg = NULL;
> + }
> +
> spin_lock(&node->lock);
>
> page->owner = NULL;
> @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> section->pages[i].flags = 0;
> section->pages[i].owner = NULL;
> section->pages[i].poison = 0;
> + section->pages[i].epc_cg = NULL;
> list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> }
>
> @@ -787,6 +811,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
> static bool __init sgx_page_cache_init(void)
> {
> u32 eax, ebx, ecx, edx, type;
> + u64 capacity = 0;
> u64 pa, size;
> int nid;
> int i;
> @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void)
>
> sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> sgx_numa_nodes[nid].size += size;
> + capacity += size;
>
> sgx_nr_epc_sections++;
> }
> @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void)
> return false;
> }
>
> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +
> return true;
> }
>
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..b1786774b8d2 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -29,12 +29,15 @@
> /* Pages on free list */
> #define SGX_EPC_PAGE_IS_FREE BIT(1)
>
> +struct sgx_epc_cgroup;
> +
> struct sgx_epc_page {
> unsigned int section;
> u16 flags;
> u16 poison;
> struct sgx_encl_page *owner;
> struct list_head list;
> + struct sgx_epc_cgroup *epc_cg;
> };
>
> /*
BR, Jarkko
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> From: Sean Christopherson <[email protected]>
>
> Change sgx_reclaim_pages() to use a list rather than an array for
> storing the epc_pages which will be reclaimed. This change is needed
> to transition to the LRU implementation for EPC cgroup support.
>
> When the EPC cgroup is implemented, the reclaiming process will do a
> pre-order tree walk for the subtree starting from the limit-violating
> cgroup. When each node is visited, candidate pages are selected from
> its "reclaimable" LRU list and moved into this temporary list. Passing a
> list from node to node for temporary storage in this walk is more
> straightforward than using an array.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Kristen Carlson Accardi <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang<[email protected]>
> Signed-off-by: Haitao Huang<[email protected]>
> Cc: Sean Christopherson <[email protected]>
> ---
> V6:
> - Remove extra list_del_init and style fix (Kai)
>
> V4:
> - Changes needed for patch reordering
> - Revised commit message
>
> V3:
> - Removed list wrappers
> ---
> arch/x86/kernel/cpu/sgx/main.c | 35 +++++++++++++++-------------------
> 1 file changed, 15 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index e27ac73d8843..33bcba313d40 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -296,12 +296,11 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
> */
> static void sgx_reclaim_pages(void)
> {
> - struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
> struct sgx_backing backing[SGX_NR_TO_SCAN];
> + struct sgx_epc_page *epc_page, *tmp;
> struct sgx_encl_page *encl_page;
> - struct sgx_epc_page *epc_page;
> pgoff_t page_index;
> - int cnt = 0;
> + LIST_HEAD(iso);
> int ret;
> int i;
>
> @@ -317,7 +316,7 @@ static void sgx_reclaim_pages(void)
>
> if (kref_get_unless_zero(&encl_page->encl->refcount) != 0) {
> sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIM_IN_PROGRESS);
> - chunk[cnt++] = epc_page;
> + list_move_tail(&epc_page->list, &iso);
> } else
> /* The owner is freeing the page. No need to add the
> * page back to the list of reclaimable pages.
> @@ -326,8 +325,11 @@ static void sgx_reclaim_pages(void)
> }
> spin_unlock(&sgx_global_lru.lock);
>
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> + if (list_empty(&iso))
> + return;
> +
> + i = 0;
> + list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> encl_page = epc_page->owner;
>
> if (!sgx_reclaimer_age(epc_page))
> @@ -342,6 +344,7 @@ static void sgx_reclaim_pages(void)
> goto skip;
> }
>
> + i++;
> encl_page->desc |= SGX_ENCL_PAGE_BEING_RECLAIMED;
> mutex_unlock(&encl_page->encl->lock);
> continue;
> @@ -349,27 +352,19 @@ static void sgx_reclaim_pages(void)
> skip:
> spin_lock(&sgx_global_lru.lock);
> sgx_epc_page_set_state(epc_page, SGX_EPC_PAGE_RECLAIMABLE);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> + list_move_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> spin_unlock(&sgx_global_lru.lock);
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
> -
> - chunk[i] = NULL;
> - }
> -
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> - if (epc_page)
> - sgx_reclaimer_block(epc_page);
> }
>
> - for (i = 0; i < cnt; i++) {
> - epc_page = chunk[i];
> - if (!epc_page)
> - continue;
> + list_for_each_entry(epc_page, &iso, list)
> + sgx_reclaimer_block(epc_page);
>
> + i = 0;
> + list_for_each_entry_safe(epc_page, tmp, &iso, list) {
> encl_page = epc_page->owner;
> - sgx_reclaimer_write(epc_page, &backing[i]);
> + sgx_reclaimer_write(epc_page, &backing[i++]);
Couldn't you alternatively "&backing[--i]" and not reset i to zero
before the loop?
>
> kref_put(&encl_page->encl->refcount, sgx_encl_release);
> sgx_epc_page_reset_state(epc_page);
BR, Jarkko
On Mon Oct 30, 2023 at 8:20 PM EET, Haitao Huang wrote:
> The scripts rely on cgroup-tools package from libcgroup [1].
>
> To run selftests for epc cgroup:
>
> sudo ./run_epc_cg_selftests.sh
>
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests, each to run one unclobbered_vdso_oversubscribed test. Each
> of such test tries to load an enclave of EPC size equal to the EPC
> capacity available on the platform. The script checks results against
> the expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) SMALL - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) LARGE - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) LARGER - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 10 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all process exit.
>
> To watch misc cgroup 'current' changes during testing, run this in a
> separate terminal:
>
> ./watch_misc_for_tests.sh current
>
> [1] https://github.com/libcgroup/libcgroup/blob/main/README
>
> Signed-off-by: Haitao Huang <[email protected]>
> ---
> V5:
>
> - Added script with automatic results checking, remove the interactive
> script.
> - The script can run independent from the series below.
> ---
> .../selftests/sgx/run_epc_cg_selftests.sh | 196 ++++++++++++++++++
> .../selftests/sgx/watch_misc_for_tests.sh | 13 ++
> 2 files changed, 209 insertions(+)
> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
> diff --git a/tools/testing/selftests/sgx/run_epc_cg_selftests.sh b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> new file mode 100755
> index 000000000000..72b93f694753
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> @@ -0,0 +1,196 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +TEST_ROOT_CG=selftest
> +cgcreate -g misc:$TEST_ROOT_CG
> +if [ $? -ne 0 ]; then
> + echo "# Please make sure cgroup-tools is installed, and misc cgroup is mounted."
> + exit 1
> +fi
> +TEST_CG_SUB1=$TEST_ROOT_CG/test1
> +TEST_CG_SUB2=$TEST_ROOT_CG/test2
> +TEST_CG_SUB3=$TEST_ROOT_CG/test1/test3
> +TEST_CG_SUB4=$TEST_ROOT_CG/test4
> +
> +cgcreate -g misc:$TEST_CG_SUB1
> +cgcreate -g misc:$TEST_CG_SUB2
> +cgcreate -g misc:$TEST_CG_SUB3
> +cgcreate -g misc:$TEST_CG_SUB4
> +
> +# Default to V2
> +CG_ROOT=/sys/fs/cgroup
> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
> + echo "# cgroup V2 is in use."
> +else
> + echo "# cgroup V1 is in use."
> + CG_ROOT=/sys/fs/cgroup/misc
> +fi
Does the test need to support v1 cgroups?
> +
> +CAPACITY=$(grep "sgx_epc" "$CG_ROOT/misc.capacity" | awk '{print $2}')
> +# This is below number of VA pages needed for enclave of capacity size. So
> +# should fail oversubscribed cases
> +SMALL=$(( CAPACITY / 512 ))
> +
> +# At least load one enclave of capacity size successfully, maybe up to 4.
> +# But some may fail if we run more than 4 concurrent enclaves of capacity size.
> +LARGE=$(( SMALL * 4 ))
> +
> +# Load lots of enclaves
> +LARGER=$CAPACITY
> +echo "# Setting up limits."
> +echo "sgx_epc $SMALL" | tee $CG_ROOT/$TEST_CG_SUB1/misc.max
> +echo "sgx_epc $LARGE" | tee $CG_ROOT/$TEST_CG_SUB2/misc.max
> +echo "sgx_epc $LARGER" | tee $CG_ROOT/$TEST_CG_SUB4/misc.max
> +
> +timestamp=$(date +%Y%m%d_%H%M%S)
> +
> +test_cmd="./test_sgx -t unclobbered_vdso_oversubscribed"
> +
> +echo "# Start unclobbered_vdso_oversubscribed with SMALL limit, expecting failure..."
> +# Always use leaf node of misc cgroups so it works for both v1 and v2
> +# these may fail on OOM
> +cgexec -g misc:$TEST_CG_SUB3 $test_cmd >cgtest_small_$timestamp.log 2>&1
> +if [[ $? -eq 0 ]]; then
> + echo "# Fail on SMALL limit, not expecting any test passes."
> + cgdelete -r -g misc:$TEST_ROOT_CG
> + exit 1
> +else
> + echo "# Test failed as expected."
> +fi
> +
> +echo "# PASSED SMALL limit."
> +
> +echo "# Start 4 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
> + expecting at least one success...."
> +pids=()
> +for i in {1..4}; do
> + (
> + cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_positive_$timestamp.$i.log 2>&1
> + ) &
> + pids+=($!)
> +done
> +
> +any_success=0
> +for pid in "${pids[@]}"; do
> + wait "$pid"
> + status=$?
> + if [[ $status -eq 0 ]]; then
> + any_success=1
> + echo "# Process $pid returned successfully."
> + fi
> +done
> +
> +if [[ $any_success -eq 0 ]]; then
> + echo "# Failed on LARGE limit positive testing, no test passes."
> + cgdelete -r -g misc:$TEST_ROOT_CG
> + exit 1
> +fi
> +
> +echo "# PASSED LARGE limit positive testing."
> +
> +echo "# Start 5 concurrent unclobbered_vdso_oversubscribed tests with LARGE limit,
> + expecting at least one failure...."
> +pids=()
> +for i in {1..5}; do
> + (
> + cgexec -g misc:$TEST_CG_SUB2 $test_cmd >cgtest_large_negative_$timestamp.$i.log 2>&1
> + ) &
> + pids+=($!)
> +done
> +
> +any_failure=0
> +for pid in "${pids[@]}"; do
> + wait "$pid"
> + status=$?
> + if [[ $status -ne 0 ]]; then
> + echo "# Process $pid returned failure."
> + any_failure=1
> + fi
> +done
> +
> +if [[ $any_failure -eq 0 ]]; then
> + echo "# Failed on LARGE limit negative testing, no test fails."
> + cgdelete -r -g misc:$TEST_ROOT_CG
> + exit 1
> +fi
> +
> +echo "# PASSED LARGE limit negative testing."
> +
> +echo "# Start 10 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
> + expecting no failure...."
> +pids=()
> +for i in {1..10}; do
> + (
> + cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
> + ) &
> + pids+=($!)
> +done
> +
> +any_failure=0
> +for pid in "${pids[@]}"; do
> + wait "$pid"
> + status=$?
> + if [[ $status -ne 0 ]]; then
> + echo "# Process $pid returned failure."
> + any_failure=1
> + fi
> +done
> +
> +if [[ $any_failure -ne 0 ]]; then
> + echo "# Failed on LARGER limit, at least one test fails."
> + cgdelete -r -g misc:$TEST_ROOT_CG
> + exit 1
> +fi
> +
> +echo "# PASSED LARGER limit tests."
> +
> +
> +echo "# Start 10 concurrent unclobbered_vdso_oversubscribed tests with LARGER limit,
> + randomly kill one, expecting no failure...."
> +pids=()
> +for i in {1..10}; do
> + (
> + cgexec -g misc:$TEST_CG_SUB4 $test_cmd >cgtest_larger_$timestamp.$i.log 2>&1
> + ) &
> + pids+=($!)
> +done
> +
> +sleep $((RANDOM % 10 + 5))
> +
> +# Randomly select a PID to kill
> +RANDOM_INDEX=$((RANDOM % 10))
> +PID_TO_KILL=${pids[RANDOM_INDEX]}
> +
> +kill $PID_TO_KILL
> +echo "# Killed process with PID: $PID_TO_KILL"
> +
> +any_failure=0
> +for pid in "${pids[@]}"; do
> + wait "$pid"
> + status=$?
> + if [ "$pid" != "$PID_TO_KILL" ]; then
> + if [[ $status -ne 0 ]]; then
> + echo "# Process $pid returned failure."
> + any_failure=1
> + fi
> + fi
> +done
> +
> +if [[ $any_failure -ne 0 ]]; then
> + echo "# Failed on random killing, at least one test fails."
> + cgdelete -r -g misc:$TEST_ROOT_CG
> + exit 1
> +fi
> +
> +sleep 1
> +
> +USAGE=$(grep '^sgx_epc' "$CG_ROOT/$TEST_ROOT_CG/misc.current" | awk '{print $2}')
> +if [ "$USAGE" -ne 0 ]; then
> + echo "# Failed: Final usage is $USAGE, not 0."
> +else
> + echo "# PASSED leakage check."
> + echo "# PASSED ALL cgroup limit tests, cleanup cgroups..."
> +fi
> +cgdelete -r -g misc:$TEST_ROOT_CG
> +echo "# done."
> diff --git a/tools/testing/selftests/sgx/watch_misc_for_tests.sh b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
> new file mode 100755
> index 000000000000..dbd38f346e7b
> --- /dev/null
> +++ b/tools/testing/selftests/sgx/watch_misc_for_tests.sh
> @@ -0,0 +1,13 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright(c) 2023 Intel Corporation.
> +
> +if [ -z "$1" ]
> + then
> + echo "No argument supplied, please provide 'max', 'current' or 'events'"
> + exit 1
> +fi
> +
> +watch -n 1 "find /sys/fs/cgroup -wholename */test*/misc.$1 -exec sh -c \
> + 'echo \"\$1:\"; cat \"\$1\"' _ {} \;"
> +
BR, Jarkko
>> +CG_ROOT=/sys/fs/cgroup
>> +if [ ! -d "/sys/fs/cgroup/misc" ]; then
>> + echo "# cgroup V2 is in use."
>> +else
>> + echo "# cgroup V1 is in use."
>> + CG_ROOT=/sys/fs/cgroup/misc
>> +fi
>
> Does the test need to support v1 cgroups?
>
I thought some distros may still only support V1. I do most of my work on
Ubuntu 22.04, which defaults to v1, so it's convenient for me to test. But
no strong opinions.
Thanks
Haitao
> > >
> >
> > That's true. I was thinking no need to have them done in separate calls.
> > The caller has to check the return value for epc_cg instance first, then
> > check result of try_charge. But there is really only one caller,
> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
> >
> > With them separate, the checks will look like this:
> > if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,
> > should continue for allocation
> > {
> > if (ret = sgx_epc_cgroup_try_charge())
> > return ret
> > }
> > // continue...
> >
> > I can go either way.
Let's keep this aligned with other _try_charge() variants: return 'int' to
indicate whether the charge is done or not.
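I.e., something like this (only a sketch; sgx_get_current_epc_cg() and
sgx_put_epc_cg() are hypothetical names for the split discussed above):

	struct sgx_epc_cgroup *epc_cg;
	int ret;

	/* NULL when the misc controller is disabled */
	epc_cg = sgx_get_current_epc_cg();
	if (epc_cg) {
		ret = sgx_epc_cgroup_try_charge(epc_cg);
		if (ret) {
			/* drop the ref taken by sgx_get_current_epc_cg() */
			sgx_put_epc_cg(epc_cg);
			return ERR_PTR(ret);
		}
	}
	/* continue with the actual EPC page allocation... */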
> >
> > >
> > > > > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> > > > > > {
> > > > > > struct sgx_epc_page *page;
> > > > > > + struct sgx_epc_cgroup *epc_cg;
> > > > > > +
> > > > > > + epc_cg = sgx_epc_cgroup_try_charge();
> > > > > > + if (IS_ERR(epc_cg))
> > > > > > + return ERR_CAST(epc_cg);
> > > > > >
> > > > > > for ( ; ; ) {
> > > > > > page = __sgx_alloc_epc_page();
> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void
> > > > > > *owner, bool reclaim)
> > > > > > break;
> > > > > > }
> > > > > >
> > > > > > + /*
> > > > > > + * Need to do a global reclamation if cgroup was not full but
> > > > free
> > > > > > + * physical pages run out, causing __sgx_alloc_epc_page() to
> > > > fail.
> > > > > > + */
> > > > > > sgx_reclaim_pages();
> > > > >
> > > > > What's the final behaviour? IIUC it should be reclaiming from the
> > > > > *current* EPC
> > > > > cgroup? If so shouldn't we just pass the @epc_cg to it here?
> > > > >
> > > > > I think we can make this patch as "structure" patch w/o actually
> > > > having
> > > > > EPC
> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL.
> > > > >
> > > > > And we can have one patch to change sgx_reclaim_pages() to take the
> > > > > 'struct
> > > > > sgx_epc_lru_list *' as argument:
> > > > >
> > > > > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
> > > > > {
> > > > > ...
> > > > > }
> > > > >
> > > > > Then here we can have something like:
> > > > >
> > > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
> > > > > {
> > > > > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
> > > > > &sgx_global_lru;
> > > > >
> > > > > sgx_reclaim_pages_lru(lru);
> > > > > }
> > > > >
> > > > > Makes sense?
> > > > >
> > > >
> > > > This is purely global reclamation. No cgroup involved.
> > >
> > > Again why? Here you are allocating one EPC page for enclave in a
> > > particular EPC
> > > cgroup. When that fails, shouldn't you try only to reclaim from the
> > > *current*
> > > EPC cgroup? Or at least you should try to reclaim from the *current*
> > > EPC cgroup
> > > first?
> > >
> >
> > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup
> > reclaims synchronously, otherwise in background and returns -EBUSY in
> > that case. This function also returns if no valid epc_cg pointer
> > returned.
> >
> > All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge().
This is fine, but I believe my question above is about where to reclaim when
"allocation" fails, not when "try charge" fails.
And as for "reclaim for the current cgroup when charge fails", I don't think it's even
necessary in this initial implementation of the EPC cgroup. You can just fail the
allocation when the charge fails (reaching the limit). Trying to reclaim when the limit
is hit can be done later.
Please see Dave and Michal's replies here:
https://lore.kernel.org/lkml/[email protected]/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
> >
> > So, by reaching to this point, a valid epc_cg pointer was returned,
> > that means allocation is allowed for the cgroup (it has reclaimed if
> > necessary, and its usage is not above limit after charging).
I found the memory cgroup uses different logic -- allocate first and then charge:
For instance:
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
......
folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
if (!folio)
goto oom;
if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
......
}
Why does EPC need to "charge first" and "then allocate"?
> >
> > But the system level free count may be low (e.g., limits of all cgroups
> > may add up to be more than capacity). so we need to do a global
> > reclamation here, which may involve reclaiming a few pages (from current
> > or other groups) so the system can be at a performant state with minimal
> > free count. (current behavior of ksgxd).
> >
> I should have sticked to the orignial comment added in code. Actually
> __sgx_alloc_epc_page() can fail if system runs out of EPC. That's the
> really reason for global reclaim.
>
I don't see how this can work. With the EPC cgroup, likely all EPC pages will go to
the individual LRU of each cgroup, and sgx_global_lru will basically be empty.
How can you reclaim from sgx_global_lru?
I am probably missing something, but anyway, it looks like some high-level design
material is really missing in the changelog.
On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> From: Sean Christopherson <[email protected]>
>
> To prepare for per-cgroup reclamation, separate the top-level reclaim
> function, sgx_reclaim_epc_pages(), into two separate functions:
>
> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a given LRU list.
> - sgx_do_epc_reclamation() performs the real reclamation for the already isolated pages.
>
> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> in succession, to replace the original sgx_reclaim_epc_pages(). The
> above two functions will serve as building blocks for the reclamation
> flows in later EPC cgroup implementation.
>
> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> cgroup will use the result to track reclaiming progress.
>
> sgx_isolate_epc_pages() returns the additional number of pages to scan
> for current epoch of reclamation. The EPC cgroup will use the result to
> determine if more scanning to be done in LRUs in its children groups.
This changelog says nothing about "why", but only mentions the "implementation".
For instance, assuming we need to reclaim @npages_to_reclaim from the
@epc_cgrp_to_reclaim and its descendants, why cannot we do:
for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
if (npages_to_reclaim <= 0)
return;
npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
npages_to_reclaim);
}
Is there any difference to have "isolate" + "reclaim"?
>
> Signed-off-by: Sean Christopherson <[email protected]>
> Co-developed-by: Kristen Carlson Accardi <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Cc: Sean Christopherson <[email protected]>
> ---
>
[...]
> +/**
> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC pages.
> + * @iso: List of isolated pages for reclamation
> + *
> + * Take a list of EPC pages and reclaim them to the enclave's private shmem files. Do not
> + * reclaim the pages that have been accessed since the last scan, and move each of those pages
> + * to the tail of its tracking LRU list.
> + *
> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX per call in order to
> + * degrade amount of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not
> + * sufficiently. Reclaiming one page at a time would also be problematic as it would increase
> + * the lock contention too much, which would halt forward progress.
This is kind of an optimization, correct? Is there any real performance data to
justify this? If this optimization is useful, shouldn't we bring this
optimization to the current sgx_reclaim_pages() instead, e.g., just increase
SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
On Mon, 20 Nov 2023 11:16:42 +0800, Huang, Kai <[email protected]> wrote:
>> > >
>> >
>> > That's true. I was thinking no need to have them done in separate
>> calls.
>> > The caller has to check the return value for epc_cg instance first,
>> then
>> > check result of try_charge. But there is really only one caller,
>> > sgx_alloc_epc_page() below, so I don't have strong opinions now.
>> >
>> > With them separate, the checks will look like this:
>> > if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,
>> > should continue for allocation
>> > {
>> > if (ret = sgx_epc_cgroup_try_charge())
>> > return ret
>> > }
>> > // continue...
>> >
>> > I can go either way.
>
> Let's keep this aligned with other _try_charge() variants: return 'int'
> to
> indicate whether the charge is done or not.
>
Fine with me if no objections from maintainers.
>> >
>> > >
>> > > > > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool
>> reclaim)
>> > > > > > {
>> > > > > > struct sgx_epc_page *page;
>> > > > > > + struct sgx_epc_cgroup *epc_cg;
>> > > > > > +
>> > > > > > + epc_cg = sgx_epc_cgroup_try_charge();
>> > > > > > + if (IS_ERR(epc_cg))
>> > > > > > + return ERR_CAST(epc_cg);
>> > > > > >
>> > > > > > for ( ; ; ) {
>> > > > > > page = __sgx_alloc_epc_page();
>> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page
>> *sgx_alloc_epc_page(void
>> > > > > > *owner, bool reclaim)
>> > > > > > break;
>> > > > > > }
>> > > > > >
>> > > > > > + /*
>> > > > > > + * Need to do a global reclamation if cgroup was not full
>> but
>> > > > free
>> > > > > > + * physical pages run out, causing __sgx_alloc_epc_page()
>> to
>> > > > fail.
>> > > > > > + */
>> > > > > > sgx_reclaim_pages();
>> > > > >
>> > > > > What's the final behaviour? IIUC it should be reclaiming from
>> the
>> > > > > *current* EPC
>> > > > > cgroup? If so shouldn't we just pass the @epc_cg to it here?
>> > > > >
>> > > > > I think we can make this patch as "structure" patch w/o actually
>> > > > having
>> > > > > EPC
>> > > > > cgroup enabled, i.e., sgx_get_current_epc_cg() always return
>> NULL.
>> > > > >
>> > > > > And we can have one patch to change sgx_reclaim_pages() to take
>> the
>> > > > > 'struct
>> > > > > sgx_epc_lru_list *' as argument:
>> > > > >
>> > > > > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru)
>> > > > > {
>> > > > > ...
>> > > > > }
>> > > > >
>> > > > > Then here we can have something like:
>> > > > >
>> > > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>> > > > > {
>> > > > > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
>> > > > > &sgx_global_lru;
>> > > > >
>> > > > > sgx_reclaim_pages_lru(lru);
>> > > > > }
>> > > > >
>> > > > > Makes sense?
>> > > > >
The reason we 'isolate' first and then do the real 'reclaim' is that the actual
reclaim is expensive, especially the eblock, etrack, etc.
>> > > >
>> > > > This is purely global reclamation. No cgroup involved.
>> > >
>> > > Again why? Here you are allocating one EPC page for enclave in a
>> > > particular EPC
>> > > cgroup. When that fails, shouldn't you try only to reclaim from the
>> > > *current*
>> > > EPC cgroup? Or at least you should try to reclaim from the
>> *current*
>> > > EPC cgroup
>> > > first?
>> > >
>> >
>> > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true,
>> cgroup
>> > reclaims synchronously, otherwise in background and returns -EBUSY in
>> > that case. This function also returns if no valid epc_cg pointer
>> > returned.
>> >
>> > All reclamation for *current* cgroup is done in
>> sgx_epc_cg_try_charge().
>
> This is fine, but I believe my question above is about where to reclaim
> when
> "allocation" fails, but not "try charge" fails.
>
I mean "will be done" :-) Currently no reclaim in try_charge.
> And for "reclaim for current cgroup when charge fails", I don't think
> its even
> necessary in this initial implementation of EPC cgroup. You can just
> fail the
> allocation when charge fails (reaching the limit). Trying to reclaim
> when limit
> is hit can be done later.
>
Yes. It is done later.
> Please see Dave and Michal's replies here:
>
> https://lore.kernel.org/lkml/[email protected]/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
>
>> >
>> > So, by reaching to this point, a valid epc_cg pointer was returned,
>> > that means allocation is allowed for the cgroup (it has reclaimed if
>> > necessary, and its usage is not above limit after charging).
>
> I found memory cgroup uses different logic -- allocation first and then
> charge:
>
> For instance:
>
> static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> {
> ......
>
> folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> if (!folio)
> goto oom;
> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> goto oom_free_page;
>
> ...... }
>
> Why EPC needs to "charge first" and "then allocate"?
>
EPC allocation can involve reclaiming, which is more expensive than regular
RAM reclamation. Also, misc only has a hard max limit.
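To illustrate the order being defended here (a sketch only, mirroring the
hunk quoted earlier in the thread): the charge is made, and possibly
rejected, before any page is taken from the free pool, and it is dropped
again if the physical allocation ultimately fails:

	epc_cg = sgx_epc_cgroup_try_charge();	/* enforce the cgroup limit first */
	if (IS_ERR(epc_cg))
		return ERR_CAST(epc_cg);

	page = __sgx_alloc_epc_page();		/* may trigger global reclaim */
	if (IS_ERR(page))
		sgx_epc_cgroup_uncharge(epc_cg);	/* undo the charge on failure */

Charging first means a cgroup that is already at its limit never enters the
(potentially expensive) EPC allocation/reclaim path at all.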
Thanks
Haitao
>> >
>> > But the system level free count may be low (e.g., limits of all
>> cgroups
>> > may add up to be more than capacity). so we need to do a global
>> > reclamation here, which may involve reclaiming a few pages (from
>> current
>> > or other groups) so the system can be at a performant state with
>> minimal
>> > free count. (current behavior of ksgxd).
>> >
>> I should have sticked to the orignial comment added in code. Actually
>> __sgx_alloc_epc_page() can fail if system runs out of EPC. That's the
>> really reason for global reclaim.
>
> I don't see how this can work. With EPC cgroup likely all EPC pages
> will go to
> the individual LRU of each cgroup, and the sgx_global_lru will basically
> empty.
> How can you reclaim from the sgx_global_lru?
Currently, there is nothing in the cgroup LRUs; all EPC pages are in the global list.
On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <[email protected]> wrote:
> On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> From: Sean Christopherson <[email protected]>
>>
>> To prepare for per-cgroup reclamation, separate the top-level reclaim
>> function, sgx_reclaim_epc_pages(), into two separate functions:
>>
>> - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
>> given LRU list.
>> - sgx_do_epc_reclamation() performs the real reclamation for the
>> already isolated pages.
>>
>> Create a new function, sgx_reclaim_epc_pages_global(), calling those two
>> in succession, to replace the original sgx_reclaim_epc_pages(). The
>> above two functions will serve as building blocks for the reclamation
>> flows in later EPC cgroup implementation.
>>
>> sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
>> cgroup will use the result to track reclaiming progress.
>>
>> sgx_isolate_epc_pages() returns the additional number of pages to scan
>> for current epoch of reclamation. The EPC cgroup will use the result to
>> determine if more scanning to be done in LRUs in its children groups.
>
> This changelog says nothing about "why", but only mentions the
> "implementation".
>
> For instance, assuming we need to reclaim @npages_to_reclaim from the
> @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>
> for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> if (npages_to_reclaim <= 0)
> return;
>
> npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> npages_to_reclaim);
> }
>
> Is there any difference to have "isolate" + "reclaim"?
>
This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
Here we just follow the same design as ksgxd for each reclamation cycle.
>>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Co-developed-by: Kristen Carlson Accardi <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>> Cc: Sean Christopherson <[email protected]>
>> ---
>>
>
> [...]
>
>> +/**
>> + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC
>> pages.
>> + * @iso: List of isolated pages for reclamation
>> + *
>> + * Take a list of EPC pages and reclaim them to the enclave's private
>> shmem files. Do not
>> + * reclaim the pages that have been accessed since the last scan, and
>> move each of those pages
>> + * to the tail of its tracking LRU list.
>> + *
>> + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX
>> per call in order to
>> + * degrade amount of IPI's and ETRACK's potentially required.
>> sgx_encl_ewb() does degrade a bit
>> + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +
>> EWB and IPI + EWB) but not
>> + * sufficiently. Reclaiming one page at a time would also be
>> problematic as it would increase
>> + * the lock contention too much, which would halt forward progress.
>
> This is kinda optimization, correct? Is there any real performance data
> to
> justify this?
The above sentences were there originally. This optimization was justified.
> If this optimization is useful, shouldn't we bring this
> optimization to the current sgx_reclaim_pages() instead, e.g., just
> increase
> SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
>
SGX_NR_TO_SCAN_MAX might have been designed earlier for other reasons I don't
know. Currently it is really the buffer size allocated in
sgx_reclaim_pages(). Both the cgroup and ksgxd scan 16 pages at a time. Maybe we
should just use SGX_NR_TO_SCAN; no _MAX needed. The point was to batch
reclamation to a certain number to minimize the impact on the EWB pipeline. 16 was
the original design.
Thanks
Haitao
On Mon, 27 Nov 2023 00:01:56 +0800, Haitao Huang
<[email protected]> wrote:
>>> > > > > Then here we can have something like:
>>> > > > >
>>> > > > > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>>> > > > > {
>>> > > > > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru :
>>> > > > > &sgx_global_lru;
>>> > > > >
>>> > > > > sgx_reclaim_pages_lru(lru);
>>> > > > > }
>>> > > > >
>>> > > > > Makes sense?
>>> > > > >
>
> The reason we 'isolate' first then do real 'reclaim' is that the actual
> reclaim is expensive and especially for eblock, etrack, etc.
Sorry this is out of context. It was meant to be in the other response for
patch 9/12.
Also FYI I'm traveling for a vacation and email access may be sporadic.
BR
Haitao
On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <[email protected]> wrote:
>
> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > From: Sean Christopherson <[email protected]>
> > >
> > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > >
> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
> > > given LRU list.
> > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > already isolated pages.
> > >
> > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > above two functions will serve as building blocks for the reclamation
> > > flows in later EPC cgroup implementation.
> > >
> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The EPC
> > > cgroup will use the result to track reclaiming progress.
> > >
> > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > determine if more scanning to be done in LRUs in its children groups.
> >
> > This changelog says nothing about "why", but only mentions the
> > "implementation".
> >
> > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> >
> > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > if (npages_to_reclaim <= 0)
> > return;
> >
> > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > npages_to_reclaim);
> > }
> >
> > Is there any difference to have "isolate" + "reclaim"?
> >
>
> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> Here we just follow the same design as ksgxd for each reclamation cycle.
I don't see how you "followed" ksgxd. If I am guessing correctly, you are
afraid that there might be fewer than 16 pages in a given EPC cgroup, thus w/o
splitting into "isolate" + "reclaim" you might feed the "reclaim" fewer than 16
pages, which might cause some performance degradation?
But is this a common case? Should we even worry about this?
I suppose for such a new feature we should bring functionality first and then
optimization, if you have real performance data to show.
>
> > >
> > > Signed-off-by: Sean Christopherson <[email protected]>
> > > Co-developed-by: Kristen Carlson Accardi <[email protected]>
> > > Signed-off-by: Kristen Carlson Accardi <[email protected]>
> > > Co-developed-by: Haitao Huang <[email protected]>
> > > Signed-off-by: Haitao Huang <[email protected]>
> > > Cc: Sean Christopherson <[email protected]>
> > > ---
> > >
> >
> > [...]
> >
> > > +/**
> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC
> > > pages.
> > > + * @iso: List of isolated pages for reclamation
> > > + *
> > > + * Take a list of EPC pages and reclaim them to the enclave's private
> > > shmem files. Do not
> > > + * reclaim the pages that have been accessed since the last scan, and
> > > move each of those pages
> > > + * to the tail of its tracking LRU list.
> > > + *
> > > + * Limit the number of pages to be processed up to SGX_NR_TO_SCAN_MAX
> > > per call in order to
> > > + * degrade amount of IPI's and ETRACK's potentially required.
> > > sgx_encl_ewb() does degrade a bit
> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK +
> > > EWB and IPI + EWB) but not
> > > + * sufficiently. Reclaiming one page at a time would also be
> > > problematic as it would increase
> > > + * the lock contention too much, which would halt forward progress.
> >
> > This is kinda optimization, correct? Is there any real performance data
> > to
> > justify this?
>
> The above sentences were there originally. This optimization was justified.
I am talking about 16 -> 32.
>
> > If this optimization is useful, shouldn't we bring this
> > optimization to the current sgx_reclaim_pages() instead, e.g., just
> > increase
> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
> >
>
> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
> know. Currently it is really the buffer size allocated in
> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time.Maybe we
> should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch
> reclamation to certain number to minimize impact of EWB pipeline. 16 was
> the original design.
>
Please don't leave it to the reviewers to figure out why you are trying to do this.
If you don't know, then just drop this.
Hi Kai
On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <[email protected]> wrote:
> On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
>> On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <[email protected]>
>> wrote:
>>
>> > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> > > From: Sean Christopherson <[email protected]>
>> > >
>> > > To prepare for per-cgroup reclamation, separate the top-level
>> reclaim
>> > > function, sgx_reclaim_epc_pages(), into two separate functions:
>> > >
>> > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from
>> a
>> > > given LRU list.
>> > > - sgx_do_epc_reclamation() performs the real reclamation for the
>> > > already isolated pages.
>> > >
>> > > Create a new function, sgx_reclaim_epc_pages_global(), calling
>> those two
>> > > in succession, to replace the original sgx_reclaim_epc_pages(). The
>> > > above two functions will serve as building blocks for the
>> reclamation
>> > > flows in later EPC cgroup implementation.
>> > >
>> > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The
>> EPC
>> > > cgroup will use the result to track reclaiming progress.
>> > >
>> > > sgx_isolate_epc_pages() returns the additional number of pages to
>> scan
>> > > for current epoch of reclamation. The EPC cgroup will use the
>> result to
>> > > determine if more scanning to be done in LRUs in its children
>> groups.
>> >
>> > This changelog says nothing about "why", but only mentions the
>> > "implementation".
>> >
>> > For instance, assuming we need to reclaim @npages_to_reclaim from the
>> > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>> >
>> > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
>> > if (npages_to_reclaim <= 0)
>> > return;
>> >
>> > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
>> > npages_to_reclaim);
>> > }
>> >
>> > Is there any difference to have "isolate" + "reclaim"?
>> >
>>
>> This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
>> Here we just follow the same design as ksgxd for each reclamation cycle.
>
> I don't see how did you "follow" ksgxd. If I am guessing correctly, you
> are
> afraid of there might be less than 16 pages in a given EPC cgroup, thus
> w/o
> splitting into "isolate" + "reclaim" you might feed the "reclaim" less
> than 16
> pages, which might cause some performance degrade?
>
> But is this a common case? Should we even worry about this?
>
> I suppose for such new feature we should bring functionality first and
> then
> optimization if you have real performance data to show.
>
The concern is not about "reclaiming less than 16".
I mean this is just refactoring with exactly the same design of ksgxd
preserved, in that we first isolate as many candidate EPC pages as we can (up to
16; ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then do the EWB in
one shot without anything else done in between. As described in the original
comments for the functions sgx_reclaim_pages and sgx_encl_ewb, this is to
finish all EWBs quickly while minimizing the impact of IPIs.
The way you proposed will work but alters the current design and behavior
when cgroups are enabled and the EPC pages of an enclave are tracked across multiple
LRUs within the descendant cgroups, in that you will have the isolation loop,
backing store allocation loop, and eblock loop interleaved with the EWB loop.
>>
>> > >
>> > > Signed-off-by: Sean Christopherson <[email protected]>
>> > > Co-developed-by: Kristen Carlson Accardi <[email protected]>
>> > > Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> > > Co-developed-by: Haitao Huang <[email protected]>
>> > > Signed-off-by: Haitao Huang <[email protected]>
>> > > Cc: Sean Christopherson <[email protected]>
>> > > ---
>> > >
>> >
>> > [...]
>> >
>> > > +/**
>> > > + * sgx_do_epc_reclamation() - Perform reclamation for isolated EPC
>> > > pages.
>> > > + * @iso: List of isolated pages for reclamation
>> > > + *
>> > > + * Take a list of EPC pages and reclaim them to the enclave's
>> private
>> > > shmem files. Do not
>> > > + * reclaim the pages that have been accessed since the last scan,
>> and
>> > > move each of those pages
>> > > + * to the tail of its tracking LRU list.
>> > > + *
>> > > + * Limit the number of pages to be processed up to
>> SGX_NR_TO_SCAN_MAX
>> > > per call in order to
>> > > + * degrade amount of IPI's and ETRACK's potentially required.
>> > > sgx_encl_ewb() does degrade a bit
>> > > + * among the HW threads with three stage EWB pipeline (EWB, ETRACK
>> +
>> > > EWB and IPI + EWB) but not
>> > > + * sufficiently. Reclaiming one page at a time would also be
>> > > problematic as it would increase
>> > > + * the lock contention too much, which would halt forward progress.
>> >
>> > This is kinda optimization, correct? Is there any real performance
>> data
>> > to
>> > justify this?
>>
>> The above sentences were there originally. This optimization was
>> justified.
>
> I am talking about 16 -> 32.
>
>>
>> > If this optimization is useful, shouldn't we bring this
>> > optimization to the current sgx_reclaim_pages() instead, e.g., just
>> > increase
>> > SGX_NR_TO_SCAN (16) to SGX_NR_TO_SCAN_MAX (32)?
>> >
>>
>> SGX_NR_TO_SCAN_MAX might be designed earlier for other reasons I don't
>> know. Currently it is really the buffer size allocated in
>> sgx_reclaim_pages(). Both cgroup and ksgxd scan 16 pages a time.Maybe we
>> should just use SGX_NR_TO_SCAN. No _MAX needed. The point was to batch
>> reclamation to certain number to minimize impact of EWB pipeline. 16 was
>> the original design.
>>
>
> Please don't leave why you are trying to do this to the reviewers. If
> you don't
> know, then just drop this.
>
Fair enough. This was my oversight when doing all the changes and rebase.
Will drop it.
Thanks
Haitao
On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
> Hi Kai
>
> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <[email protected]> wrote:
>
> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <[email protected]>
> > > wrote:
> > >
> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > > From: Sean Christopherson <[email protected]>
> > > > >
> > > > > To prepare for per-cgroup reclamation, separate the top-level
> > > reclaim
> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > > >
> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from
> > > a
> > > > > given LRU list.
> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > > already isolated pages.
> > > > >
> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling
> > > those two
> > > > > in succession, to replace the original sgx_reclaim_epc_pages(). The
> > > > > above two functions will serve as building blocks for the
> > > reclamation
> > > > > flows in later EPC cgroup implementation.
> > > > >
> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages. The
> > > EPC
> > > > > cgroup will use the result to track reclaiming progress.
> > > > >
> > > > > sgx_isolate_epc_pages() returns the additional number of pages to
> > > scan
> > > > > for current epoch of reclamation. The EPC cgroup will use the
> > > result to
> > > > > determine if more scanning to be done in LRUs in its children
> > > groups.
> > > >
> > > > This changelog says nothing about "why", but only mentions the
> > > > "implementation".
> > > >
> > > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > > >
> > > > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > > if (npages_to_reclaim <= 0)
> > > > return;
> > > >
> > > > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > > npages_to_reclaim);
> > > > }
> > > >
> > > > Is there any difference to have "isolate" + "reclaim"?
> > > >
> > >
> > > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > > Here we just follow the same design as ksgxd for each reclamation cycle.
> >
> > I don't see how did you "follow" ksgxd. If I am guessing correctly, you
> > are
> > afraid of there might be less than 16 pages in a given EPC cgroup, thus
> > w/o
> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less
> > than 16
> > pages, which might cause some performance degrade?
> >
> > But is this a common case? Should we even worry about this?
> >
> > I suppose for such new feature we should bring functionality first and
> > then
> > optimization if you have real performance data to show.
> >
> The concern is not about "reclaim less than 16".
> I mean this is just refactoring with exactly the same design of ksgxd
> preserved,
>
I literally have no idea what you are talking about here. ksgxd() just calls
sgx_reclaim_pages(), which tries to reclaim 16 pages at once.
> in that we first isolate as many candidate EPC pages (up to
> 16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in
> one shot without anything else done in between.
>
Assuming you are referring to the implementation of sgx_reclaim_pages(), and
assuming by "isolate" you mean removing EPC pages from the list (which is
exactly what the sgx_isolate_epc_pages() in this patch does), what happens to
the loops of "backing store allocation" and "EBLOCK" before the loop of EWB?
Eaten by you?
> As described in original
> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
> finish all ewb quickly while minimizing impact of IPI.
>
> The way you proposed will work but alters the current design and behavior
> if cgroups is enabled and EPCs of an enclave are tracked across multiple
> LRUs within the descendant cgroups, in that you will have isolation loop,
> backing store allocation loop, eblock loop interleaved with the ewb loop.
>
I have no idea what you are talking about.
The point is, with or w/o this patch, you can only reclaim 16 EPC pages in one
function call (as you have said you are going to remove SGX_NR_TO_SCAN_MAX,
which is a cipher to both of us). The only difference I can see is, with this
patch, you can have multiple calls of "isolate" and then call the "do_reclaim"
once.
But what's the good of having the "isolate" if the "do_reclaim" can only reclaim
16 pages anyway?
Back to my last reply, are you afraid that some LRU has fewer than 16 pages to
"isolate", and therefore you need to loop over the LRUs of descendants to get 16?
Because I really cannot think of any other reason why you are doing this.
> >
On Wed, 13 Dec 2023 05:17:11 -0600, Huang, Kai <[email protected]> wrote:
> On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:
>> Hi Kai
>>
>> On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai <[email protected]>
>> wrote:
>>
>> > On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
>> > > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai <[email protected]>
>> > > wrote:
>> > >
>> > > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
>> > > > > From: Sean Christopherson <[email protected]>
>> > > > >
>> > > > > To prepare for per-cgroup reclamation, separate the top-level
>> > > reclaim
>> > > > > function, sgx_reclaim_epc_pages(), into two separate functions:
>> > > > >
>> > > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages
>> from
>> > > a
>> > > > > given LRU list.
>> > > > > - sgx_do_epc_reclamation() performs the real reclamation for the
>> > > > > already isolated pages.
>> > > > >
>> > > > > Create a new function, sgx_reclaim_epc_pages_global(), calling
>> > > those two
>> > > > > in succession, to replace the original sgx_reclaim_epc_pages().
>> The
>> > > > > above two functions will serve as building blocks for the
>> > > reclamation
>> > > > > flows in later EPC cgroup implementation.
>> > > > >
>> > > > > sgx_do_epc_reclamation() returns the number of reclaimed pages.
>> The
>> > > EPC
>> > > > > cgroup will use the result to track reclaiming progress.
>> > > > >
>> > > > > sgx_isolate_epc_pages() returns the additional number of pages
>> to
>> > > scan
>> > > > > for current epoch of reclamation. The EPC cgroup will use the
>> > > result to
>> > > > > determine if more scanning to be done in LRUs in its children
>> > > groups.
>> > > >
>> > > > This changelog says nothing about "why", but only mentions the
>> > > > "implementation".
>> > > >
>> > > > For instance, assuming we need to reclaim @npages_to_reclaim from
>> the
>> > > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
>> > > >
>> > > > for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp)
>> {
>> > > > if (npages_to_reclaim <= 0)
>> > > > return;
>> > > >
>> > > > npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
>> > > > npages_to_reclaim);
>> > > > }
>> > > >
>> > > > Is there any difference to have "isolate" + "reclaim"?
>> > > >
>> > >
>> > > This is to optimize "reclaim". See how etrack was done in
>> sgx_encl_ewb.
>> > > Here we just follow the same design as ksgxd for each reclamation
>> cycle.
>> >
>> > I don't see how did you "follow" ksgxd. If I am guessing correctly,
>> you
>> > are
>> > afraid of there might be less than 16 pages in a given EPC cgroup,
>> thus
>> > w/o
>> > splitting into "isolate" + "reclaim" you might feed the "reclaim" less
>> > than 16
>> > pages, which might cause some performance degrade?
>> >
>> > But is this a common case? Should we even worry about this?
>> >
>> > I suppose for such new feature we should bring functionality first and
>> > then
>> > optimization if you have real performance data to show.
>> >
>> The concern is not about "reclaim less than 16".
>> I mean this is just refactoring with exactly the same design of ksgxd
>> preserved,
>
> I literally have no idea what you are talking about here. ksgxd() just
> calls
> sgx_reclaim_pages(), which tries to reclaim 16 pages at once.
>
>> in that we first isolate as many candidate EPC pages (up to
>> 16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb
>> in
>> one shot without anything else done in between.
>
> Assuming you are referring the implementation of sgx_reclaim_pages(), and
> assuming the "isolate" you mean removing EPC pages from the list (which
> is
> exactly what the sgx_isolate_epc_pages() in this patch does), what
> happens to
> the loops of "backing store allocation" and "EBLOCK", before the loop of
> EWB?Eaten by you?
>
I skipped those because what really matters is to keep the EWB loop separate and
run it in one shot for each reclaiming cycle, independent of the number of
LRUs. All the loops in the original sgx_reclaim_pages() except the
"isolate" loop do not need to deal with the multiple LRUs of cgroups later. That's
the reason to refactor out only the "isolate" part and loop it through the
cgroup LRUs in later patches.
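Roughly, the per-cgroup flow I have in mind looks like this (a sketch only;
exact signatures will differ, and for_each_cgroup_and_descendants() is the
hypothetical walk from your earlier pseudo-code):

	LIST_HEAD(iso);
	size_t nr_to_scan = SGX_NR_TO_SCAN;

	/* Isolate candidates from each LRU in the subtree... */
	for_each_cgroup_and_descendants(epc_cg_to_reclaim, epc_cg) {
		if (!nr_to_scan)
			break;
		nr_to_scan = sgx_isolate_epc_pages(&epc_cg->lru, nr_to_scan, &iso);
	}

	/* ...then do backing store allocation, EBLOCK and EWB over the whole
	 * isolated list in one shot, independent of how many LRUs it came from. */
	sgx_do_epc_reclamation(&iso);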
>
>> As described in original
>> comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
>> finish all ewb quickly while minimizing impact of IPI.
>>
>> The way you proposed will work but alters the current design and
>> behavior
>> if cgroups is enabled and EPCs of an enclave are tracked across multiple
>> LRUs within the descendant cgroups, in that you will have isolation
>> loop,
>> backing store allocation loop, eblock loop interleaved with the ewb
>> loop.
>>
>
> I have no idea what you are talking about.
>
> The point is, with or w/o this patch, you can only reclaim 16 EPC pages
> in one
> function call (as you have said you are going to remove
> SGX_NR_TO_SCAN_MAX,
> which is a cipher to both of us). The only difference I can see is,
> with this
> patch, you can have multiple calls of "isolate" and then call the
> "do_reclaim"
> once.
>
> But what's the good of having the "isolate" if the "do_reclaim" can only
> reclaim
> 16 pages anyway?
>
> Back to my last reply, are you afraid of any LRU has less than 16 pages
> to
> "isolate", therefore you need to loop LRUs of descendants to get 16?
> Cause I
> really cannot think of any other reason why you are doing this.
>
>
I think I see your point. With the pages reclaimed per cycle capped at 16,
there is not much difference even if those 16 pages are spread across separate
LRUs. The difference only becomes significant if we ever raise that cap. To
preserve the current behavior, where the EWB loop is independent of the number
of LRUs walked in each reclaiming cycle regardless of the exact value of the
page cap, I would still consider the current approach in the patch a
reasonable choice. What do you think?
Thanks
Haitao
> >
> > The point is, with or w/o this patch, you can only reclaim 16 EPC pages
> > in one
> > function call (as you have said you are going to remove
> > SGX_NR_TO_SCAN_MAX,
> > which is a cipher to both of us). The only difference I can see is,
> > with this
> > patch, you can have multiple calls of "isolate" and then call the
> > "do_reclaim"
> > once.
> >
> > But what's the good of having the "isolate" if the "do_reclaim" can only
> > reclaim
> > 16 pages anyway?
> >
> > Back to my last reply, are you afraid of any LRU has less than 16 pages
> > to
> > "isolate", therefore you need to loop LRUs of descendants to get 16?
> > Cause I
> > really cannot think of any other reason why you are doing this.
> >
> >
>
> I think I see your point. By capping pages reclaimed per cycle to 16,
> there is not much difference even if those 16 pages are spread in separate
> LRUs . The difference is only significant when we ever raise that cap. To
> preserve the current behavior of ewb loops independent on number of LRUs
> to loop through for each reclaiming cycle, regardless of the exact value
> of the page cap, I would still think current approach in the patch is
> reasonable choice. What do you think?
To me, I wouldn't bother to do that. Having less than 16 pages in one LRU is
*extremely rare* and should never happen in practice. It's pointless to make
such a code adjustment at this stage.
Let's focus on enabling functionality first. When you have some real
performance issue that is related to this, we can come back then.
Btw, I think you need to step back even further. IIUC the whole multiple LRU
thing isn't mandatory in this initial support.
Please (again) take a look at the comments from Dave and Michal:
https://lore.kernel.org/lkml/[email protected]/#t
https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
On Mon, Dec 18, 2023 at 01:44:56AM +0000, Huang, Kai wrote:
>
> Let's focus on enabling functionality first. When you have some real
> performance issue that is related to this, we can come back then.
>
> Btw, I think you need to step back even further. IIUC the whole multiple LRU
> thing isn't mandatory in this initial support.
>
> Please (again) take a look at the comments from Dave and Michal:
>
> https://lore.kernel.org/lkml/[email protected]/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
I don't think setting a hard limit without any reclaiming is preferable.
I'd rather see this work similarly to what "sgx_epc.high" was in the RFC
patchset: misc.max for sgx_epc becomes the max value for EPC usage, but
enclaves larger than the limit would still run OK. Per-cgroup reclaiming
allows additional controls via memory.high/max in the same cgroup.
If this reclaim flexibility were not there, the sgx_epc limit would always
have to be set based on some "peak" EPC consumption which may not even
be known at the time the limit is set.
From a container runtime perspective (which is what I'm working on for Kubernetes)
the current proposal seems best to me: a container is guaranteed at most
the amount of EPC set as the limit and no other container gets to use it.
Also, each container gets charged for reclaiming independently if a low
max value is used (which might be desirable to get more containers to run on the
same node/system). In this model, the sum of containers' max values would be
the capacity.
-- Mikko
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <[email protected]> wrote:
>
>> >
>> > The point is, with or w/o this patch, you can only reclaim 16 EPC
>> pages
>> > in one
>> > function call (as you have said you are going to remove
>> > SGX_NR_TO_SCAN_MAX,
>> > which is a cipher to both of us). The only difference I can see is,
>> > with this
>> > patch, you can have multiple calls of "isolate" and then call the
>> > "do_reclaim"
>> > once.
>> >
>> > But what's the good of having the "isolate" if the "do_reclaim" can
>> only
>> > reclaim
>> > 16 pages anyway?
>> >
>> > Back to my last reply, are you afraid of any LRU has less than 16
>> pages
>> > to
>> > "isolate", therefore you need to loop LRUs of descendants to get 16?
>> > Cause I
>> > really cannot think of any other reason why you are doing this.
>> >
>> >
>>
>> I think I see your point. By capping pages reclaimed per cycle to 16,
>> there is not much difference even if those 16 pages are spread in
>> separate
>> LRUs . The difference is only significant when we ever raise that cap.
>> To
>> preserve the current behavior of ewb loops independent on number of LRUs
>> to loop through for each reclaiming cycle, regardless of the exact value
>> of the page cap, I would still think current approach in the patch is
>> reasonable choice. What do you think?
>
> To me I won't bother to do that. Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice. It's pointless
> to make
> such code adjustment at this stage.
>
> Let's focus on enabling functionality first. When you have some real
> performance issue that is related to this, we can come back then.
>
> Btw, I think you need to step back even further. IIUC the whole
> multiple LRU
> thing isn't mandatory in this initial support.
>
> Please (again) take a look at the comments from Dave and Michal:
>
> https://lore.kernel.org/lkml/[email protected]/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
Thanks for raising this. Actually, my understanding was that the above
discussion was mainly about not doing reclaiming by killing enclaves, i.e., I
assumed "reclaiming" in that context only meant that particular kind.
As Mikko pointed out, without per-cgroup reclaiming, the max limit of each
cgroup needs to accommodate the peak usage of the enclaves within that cgroup.
That may be inconvenient in practice, and limits could be forced to be set
larger than necessary for enclaves to run performantly. For example, compared
with a current kernel loaded with enclaves whose total peak usage is greater
than the EPC capacity, we can observe the following undesired consequences.
1) If a user wants to load the exact same enclaves but in separate cgroups,
then the sum of the cgroup limits must be higher than the capacity and the
system will end up doing the same old global reclaiming it is currently doing.
Cgroups are not useful at all for isolating EPC consumption.
2) To isolate the impact of each cgroup's usage on other cgroups and yet still
be able to load each enclave, the user basically has to carefully plan to
ensure the sum of the cgroup max limits, i.e., the sum of the peak usage of
the enclaves, does not exceed the capacity. That means no over-committing is
allowed, and the same system may not be able to load as many enclaves as with
the current kernel.
@Dave and @Michal, Your thoughts? Or could you confirm we should not do
reclaim per cgroup at all?
If confirmed as desired, then this series can stop at patch 4.
Thanks
Haitao
On 12/18/23 13:24, Haitao Huang wrote:
> @Dave and @Michal, Your thoughts? Or could you confirm we should not
> do reclaim per cgroup at all?
What's the benefit of doing reclaim per cgroup? Is that worth the extra
complexity?
The key question here is whether we want the SGX VM to be complex and
more like the real VM or simple when a cgroup hits its limit. Right?
If stopping at patch 5 and having less code is even remotely an option,
why not do _that_?
Hello.
On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang <[email protected]> wrote:
> Thanks for raising this. Actually my understanding the above discussion was
> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
> "reclaiming" within that context only meant for that particular kind.
>
> As Mikko pointed out, without reclaiming per-cgroup, the max limit of each
> cgroup needs to accommodate the peak usage of enclaves within that cgroup.
> That may be inconvenient for practical usage and limits could be forced to
> be set larger than necessary to run enclaves performantly. For example, we
> can observe following undesired consequences comparing a system with current
> kernel loaded with enclaves whose total peak usage is greater than the EPC
> capacity.
>
> 1) If a user wants to load the same exact enclaves but in separate cgroups,
> then the sum of cgroup limits must be higher than the capacity and the
> system will end up doing the same old global reclaiming as it is currently
> doing. Cgroup is not useful at all for isolating EPC consumptions.
That is the use of limits to prevent a runaway cgroup smothering the
system. Overcommitted values in such a config are fine because the more
simultaneous runaways it would take, the less likely that is.
The peak consumption comes at the fair expense of others (some efficiency)
and the limit contains the runaway (hence the isolation).
> 2) To isolate impact of usage of each cgroup on other cgroups and yet still
> being able to load each enclave, the user basically has to carefully plan to
> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
> enclaves, is not reaching over the capacity. That means no over-commiting
> allowed and the same system may not be able to load as many enclaves as with
> current kernel.
Another "config layout" of limits is to achieve partitioning (sum ==
capacity). That is perfect isolation but it naturally goes against
efficient utilization. The way other controllers approach this trade-off
is with weights (cpu, io) or protections (memory). I'm afraid misc
controller is not ready for this.
My opinion is to start with the simple limits (first patches) and think
of prioritization/guarantee mechanism based on real cases.
HTH,
Michal
Hi Dave,
On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <[email protected]>
wrote:
> On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your
> thoughts? Or could you confirm we should not
>> do reclaim per cgroup at all?
> What's the benefit of doing reclaim per cgroup? Is that worth the extra
> complexity?
>
Without reclaiming per cgroup, we always have to set the limit to the
enclave's peak usage. This may not be efficient utilization, as in many cases
each enclave can perform fine with an EPC limit set below its peak. Basically,
each group cannot give up some pages for the greater good without dying :-)
Also, with enclaves using EDMM, the peak usage is not static, so it is hard to
determine upfront. Hence it might be an operational/deployment inconvenience.
In case of over-committing (sum of limits > total capacity), one cgroup at
peak usage may require swapping out pages in a different cgroup if the system
is overloaded at that time.
> The key question here is whether we want the SGX VM to be complex and
> more like the real VM or simple when a cgroup hits its limit. Right?
>
Although it's fair to say the majority of the complexity of this series is in
the support for reclaiming per cgroup, I think it's manageable and much less
than the real VM now that we have removed the enclave-killing parts: the only
extra effort is to track pages in separate lists and reclaim them separately,
as opposed to tracking them on one global list and reclaiming them together.
The main reclaiming loop code is still pretty much the same as before.
> If stopping at patch 5 and having less code is even remotely an option,
> why not do _that_?
>
I hope I described the limitations clearly enough above.
If those are OK with users and also make it acceptable to merge quickly,
I'm happy to do that :-)
Thanks
Haitao
On Thu Jan 4, 2024 at 9:11 PM EET, Haitao Huang wrote:
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit. Right?
> >
>
> Although it's fair to say the majority of complexity of this series is in
> support for reclaiming per cgroup, I think it's manageable and much less
> than real VM after we removed the enclave killing parts: the only extra
> effort is to track pages in separate list and reclaim them in separately
> as opposed to track in on global list and reclaim together. The main
> reclaiming loop code is still pretty much the same as before.
I'm not seeing any unmanageable complexity on SGX side, and also
cgroups specific changes are somewhat clean to me at least...
BR, Jarkko
Hi Michal,
On Thu, 04 Jan 2024 06:38:41 -0600, Michal Koutný <[email protected]> wrote:
> Hello.
>
> On Mon, Dec 18, 2023 at 03:24:40PM -0600, Haitao Huang
> <[email protected]> wrote:
>> Thanks for raising this. Actually my understanding the above discussion
>> was
>> mainly about not doing reclaiming by killing enclaves, i.e., I assumed
>> "reclaiming" within that context only meant for that particular kind.
>>
>> As Mikko pointed out, without reclaiming per-cgroup, the max limit of
>> each
>> cgroup needs to accommodate the peak usage of enclaves within that
>> cgroup.
>> That may be inconvenient for practical usage and limits could be forced
>> to
>> be set larger than necessary to run enclaves performantly. For example,
>> we
>> can observe following undesired consequences comparing a system with
>> current
>> kernel loaded with enclaves whose total peak usage is greater than the
>> EPC
>> capacity.
>>
>> 1) If a user wants to load the same exact enclaves but in separate
>> cgroups,
>> then the sum of cgroup limits must be higher than the capacity and the
>> system will end up doing the same old global reclaiming as it is
>> currently
>> doing. Cgroup is not useful at all for isolating EPC consumptions.
>
> That is the use of limits to prevent a runaway cgroup smothering the
> system. Overcommited values in such a config are fine because the more
> simultaneous runaways, the less likely.
> The peak consumption is on the fair expense of others (some efficiency)
> and the limit contains the runaway (hence the isolation).
>
This makes sense to me in theory. Mikko, Chris Y/Bo Z, your thoughts on
whether this is good enough for your intended usages?
>> 2) To isolate impact of usage of each cgroup on other cgroups and yet
>> still
>> being able to load each enclave, the user basically has to carefully
>> plan to
>> ensure the sum of cgroup max limits, i.e., the sum of peak usage of
>> enclaves, is not reaching over the capacity. That means no
>> over-commiting
>> allowed and the same system may not be able to load as many enclaves as
>> with
>> current kernel.
>
> Another "config layout" of limits is to achieve partitioning (sum ==
> capacity). That is perfect isolation but it naturally goes against
> efficient utilization. The way other controllers approach this trade-off
> is with weights (cpu, io) or protections (memory). I'm afraid misc
> controller is not ready for this.
>
> My opinion is to start with the simple limits (first patches) and think
> of prioritization/guarantee mechanism based on real cases.
>
We moved away from a memcg-like custom controller with (low, high, max) to
the misc controller. But if we need to add those down the road, then the
interface needs to be changed. So my concern with this route would be whether
misc would allow any of those extensions. On the other hand, it might turn out
to be less complex to just do the reclamation per cgroup.
Thanks a lot for your comments and they are really helpful!
Haitao
On 1/4/24 11:11, Haitao Huang wrote:
> If those are OK with users and also make it acceptable for merge
> quickly, I'm happy to do that :-)
How about we put some actual numbers behind this? How much complexity
are we talking about here? What's the diffstat for the utterly
bare-bones implementation and what does it cost on top of that to do the
per-cgroup reclaim?
On Thu, 04 Jan 2024 13:27:07 -0600, Dave Hansen <[email protected]>
wrote:
> On 1/4/24 11:11, Haitao Huang wrote:
>> If those are OK with users and also make it acceptable for merge
>> quickly, I'm happy to do that :-)
>
> How about we put some actual numbers behind this? How much complexity
> are we talking about here? What's the diffstat for the utterly
> bare-bones implementation and what does it cost on top of that to do the
> per-cgroup reclaim?
>
For bare-bone:
 arch/x86/Kconfig                     |  13 ++++
 arch/x86/kernel/cpu/sgx/Makefile     |   1 +
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 104 +++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  39 ++++++++++++
 arch/x86/kernel/cpu/sgx/main.c       |  41 ++++++++++++
 arch/x86/kernel/cpu/sgx/sgx.h        |   5 ++
 include/linux/misc_cgroup.h          |  42 ++++++++++++
 kernel/cgroup/misc.c                 |  52 +++++++++------
 8 files changed, 281 insertions(+), 16 deletions(-)
Additional for per-cgroup reclaim:
 arch/x86/kernel/cpu/sgx/encl.c       |  41 +++++-----
 arch/x86/kernel/cpu/sgx/encl.h       |   3 +-
 arch/x86/kernel/cpu/sgx/epc_cgroup.c | 224 ++++++++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/sgx/epc_cgroup.h |  18 +++-
 arch/x86/kernel/cpu/sgx/main.c       | 226 ++++++++++++++++++-----------------
 arch/x86/kernel/cpu/sgx/sgx.h        |  85 ++++++++++++++++---
 6 files changed, 480 insertions(+), 117 deletions(-)
Thanks
Haitao
On Thu, Jan 04, 2024 at 01:11:15PM -0600, Haitao Huang wrote:
> Hi Dave,
>
> On Wed, 03 Jan 2024 10:37:35 -0600, Dave Hansen <[email protected]>
> wrote:
>
> > On 12/18/23 13:24, Haitao Huang wrote:> @Dave and @Michal, Your
> > thoughts? Or could you confirm we should not
> > > do reclaim per cgroup at all?
> > What's the benefit of doing reclaim per cgroup? Is that worth the extra
> > complexity?
> >
>
> Without reclaiming per cgroup, then we have to always set the limit to
> enclave's peak usage. This may not be efficient utilization as in many cases
> each enclave can perform fine with EPC limit set less than peak. Basically
> each group can not give up some pages for greater good without dying :-)
+1. This is exactly my thinking too. The per-cgroup reclaiming is
important for the container use case we are working on. I also think
it makes the limit more meaningful: the per-container pool of EPC pages
to use (which is independent of the enclave size).
>
> Also with enclaves enabled with EDMM, the peak usage is not static so hard
> to determine upfront. Hence it might be an operation/deployment
> inconvenience.
>
> In case of over-committing (sum of limits > total capacity), one cgroup at
> peak usage may require swapping pages out in a different cgroup if system is
> overloaded at that time.
>
> > The key question here is whether we want the SGX VM to be complex and
> > more like the real VM or simple when a cgroup hits its limit. Right?
> >
>
> Although it's fair to say the majority of complexity of this series is in
> support for reclaiming per cgroup, I think it's manageable and much less
> than real VM after we removed the enclave killing parts: the only extra
> effort is to track pages in separate list and reclaim them in separately as
> opposed to track in on global list and reclaim together. The main reclaiming
> loop code is still pretty much the same as before.
>
>
> > If stopping at patch 5 and having less code is even remotely an option,
> > why not do _that_?
> >
> I hope I described limitations clear enough above.
> If those are OK with users and also make it acceptable for merge quickly,
You explained the gaps very well already. I don't think the simple
version without per-cgroup reclaiming is enough for the container case.
Mikko
On 10/30/23 11:20, Haitao Huang wrote:
> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
> int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> {
> spin_lock(&sgx_global_lru.lock);
> - if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
> - /* The page is being reclaimed. */
> - if (list_empty(&page->list)) {
> - spin_unlock(&sgx_global_lru.lock);
> - return -EBUSY;
> - }
> -
> - list_del(&page->list);
> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
> + spin_unlock(&sgx_global_lru.lock);
> + return -EBUSY;
> }
> +
> + list_del(&page->list);
> + sgx_epc_page_reset_state(page);
I want to know how much of this series is basically line-for-line
abstraction shifting like:
- page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
+ sgx_epc_page_reset_state(page);
versus actually adding complexity. That way, I might be able to offer
some advice on where this can be pared down. That's really hard to do
with the current series.
Please don't just "introduce new page states". This should have first
abstracted out the sgx_epc_page_reclaim_in_progress() operation, using
the list_empty() check as the implementation.
Then, in a separate patch, introduce the concept of the "reclaim in
progress" flag and finally flip the implementation over.
Ditto for the sgx_epc_page_reset_state() abstraction. It should have
been introduced separately as a concept and then have the implementation
changed.
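A sketch of what that first step might look like (illustrative only; the
helper keeps the existing list_empty() check as its implementation, so the
callers' behavior does not change):

	/* Step 1 (sketch): add the helper; list_empty() stays the implementation. */
	static inline bool sgx_epc_page_reclaim_in_progress(struct sgx_epc_page *page)
	{
		/* Tracked but off the LRU list means the reclaimer has isolated it. */
		return list_empty(&page->list);
	}

	...
	if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
		if (sgx_epc_page_reclaim_in_progress(page)) {
			spin_unlock(&sgx_global_lru.lock);
			return -EBUSY;
		}
		list_del(&page->list);
		page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
	}

A later patch could then add the "reclaim in progress" flag and flip the
helper over to test that flag, without touching the callers again.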
Then on to patch 10 (which is much too big), which introduces the
sgx_lru_list() abstraction.
There's very little about how the LRU design came to be in this cover
letter. Let's add some details.
How's this?
Writing this up, I'm a lot more convinced that this series is, in
general, taking the right approach. I honestly don't see any other
alternatives. As much as I'd love to do something stupidly simple like
just killing enclaves at the moment they hit the limit, that would be a
horrid experience for users _and_ a departure from what the existing
reclaim support does.
That said, there's still a lot of work to do to refactor this series.
It's in need of some love to make it clearer what is going on and to
make the eventual switch over to per-cgroup LRUs more gradual. Each
patch in the series is still doing way too much, _especially_ patch 10.
==
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
This series attempts to implement that same logic for cgroups, for the
same reasons: it's preferable to wait for memory to become available and
let reclaim happen than to do things that are fatal to enclaves.
There is currently a global reclaimable page SGX LRU list. That list
(and the existing scanning algorithm) is essentially useless for doing
reclaim when a cgroup hits its limit because the cgroup's pages are
scattered around that LRU. It is unspeakably inefficient to scan a
linked list with millions of entries for what could be dozens of pages
from a cgroup that needs reclaim.
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It
would either need to hold the list locks for unreasonable amounts of
time, or be taught to scan the list in pieces, which has its own challenges.
tl;dr: A cgroup hitting its limit should be as similar as possible to
the system running out of EPC memory. The only two choices for implementing
that are nasty changes to the existing LRU scanning algorithm, or adding
new LRUs. The result: add a new LRU for each cgroup and scan those
instead. Replace the existing global LRU with the root cgroup's LRU
(only when this new support is compiled in, obviously).
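In rough code terms, the shape of that could be something like the following
sketch (the struct layout, the config symbol and the page->epc_cg field are
illustrative, not lifted from the series):

	/* Sketch: one LRU per cgroup, with the same shape as the global one. */
	struct sgx_epc_lru_list {
		spinlock_t lock;
		struct list_head reclaimable;
	};

	/* Pick the LRU a page belongs to: its cgroup's LRU, or the global/root one. */
	static struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *page)
	{
	#ifdef CONFIG_CGROUP_SGX_EPC	/* illustrative config name */
		if (page->epc_cg)
			return &page->epc_cg->lru;
	#endif
		return &sgx_global_lru;
	}

Reclaim triggered by a cgroup hitting its limit then scans that cgroup's LRU
(and its descendants') rather than the single global list.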
On Fri, 05 Jan 2024 12:29:05 -0600, Dave Hansen <[email protected]>
wrote:
> There's very little about how the LRU design came to be in this cover
> letter. Let's add some details.
>
> How's this?
>
> Writing this up, I'm a lot more convinced that this series is, in
> general, taking the right approach. I honestly don't see any other
> alternatives. As much as I'd love to do something stupidly simple like
> just killing enclaves at the moment they hit the limit, that would be a
> horrid experience for users _and_ a departure from what the existing
> reclaim support does.
>
> That said, there's still a lot of work to do to refactor this series.
> It's in need of some love to make it clearer what is going on and to
> make the eventual switch over to per-cgroup LRUs more gradual. Each
> patch in the series is still doing way too much, _especially_ patch 10.
>
> ==
>
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>
> This series attempts to implement that same logic for cgroups, for the
> same reasons: it's preferable to wait for memory to become available and
> let reclaim happen than to do things that are fatal to enclaves.
>
> There is currently a global reclaimable page SGX LRU list. That list
> (and the existing scanning algorithm) is essentially useless for doing
> reclaim when a cgroup hits its limit because the cgroup's pages are
> scattered around that LRU. It is unspeakably inefficient to scan a
> linked list with millions of entries for what could be dozens of pages
> from a cgroup that needs reclaim.
>
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It
> would either need to hold the list locks for unreasonable amounts of
> time, or be taught to scan the list in pieces, which has its own
> challenges.
>
> tl;dr: A cgroup hitting its limit should be as similar as possible to
> the system running out of EPC memory. The only two choices for implementing
> that are nasty changes to the existing LRU scanning algorithm, or adding
> new LRUs. The result: add a new LRU for each cgroup and scan those
> instead. Replace the existing global LRU with the root cgroup's LRU
> (only when this new support is compiled in, obviously).
>
I'll add this to the cover letter as a section justifying the LRU design
for per-cgroup reclaiming.
Thank you very much.
Haitao
On Fri, 05 Jan 2024 11:57:03 -0600, Dave Hansen <[email protected]>
wrote:
> On 10/30/23 11:20, Haitao Huang wrote:
>> @@ -527,16 +530,13 @@ void sgx_mark_page_reclaimable(struct
>> sgx_epc_page *page)
>> int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
>> {
>> spin_lock(&sgx_global_lru.lock);
>> - if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
>> - /* The page is being reclaimed. */
>> - if (list_empty(&page->list)) {
>> - spin_unlock(&sgx_global_lru.lock);
>> - return -EBUSY;
>> - }
>> -
>> - list_del(&page->list);
>> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>> + if (sgx_epc_page_reclaim_in_progress(page->flags)) {
>> + spin_unlock(&sgx_global_lru.lock);
>> + return -EBUSY;
>> }
>> +
>> + list_del(&page->list);
>> + sgx_epc_page_reset_state(page);
>
> I want to know how much of this series is basically line-for-line
> abstraction shifting like:
>
> - page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
> + sgx_epc_page_reset_state(page);
>
> versus actually adding complexity. That way, I might be able to offer
> some advice on where this can be pared down. That's really hard to do
> with the current series.
>
> Please don't just "introduce new page states". This should have first
> abstracted out the sgx_epc_page_reclaim_in_progress() operation, using
> the list_empty() check as the implementation.
>
> Then, in a separate patch, introduce the concept of the "reclaim in
> progress" flag and finally flip the implementation over.
>
> Ditto for the sgx_epc_page_reset_state() abstraction. It should have
> been introduced separately as a concept and then have the implementation
> changed.
>
> Then on to patch 10 (which is much too big), which introduces the
> sgx_lru_list() abstraction.
>
Sure. I'll try to refactor according to this plan.
Thanks
Haitao
On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <[email protected]> wrote:
>
>> >
>> > The point is, with or w/o this patch, you can only reclaim 16 EPC
>> pages
>> > in one
>> > function call (as you have said you are going to remove
>> > SGX_NR_TO_SCAN_MAX,
>> > which is a cipher to both of us). The only difference I can see is,
>> > with this
>> > patch, you can have multiple calls of "isolate" and then call the
>> > "do_reclaim"
>> > once.
>> >
>> > But what's the good of having the "isolate" if the "do_reclaim" can
>> only
>> > reclaim
>> > 16 pages anyway?
>> >
>> > Back to my last reply, are you afraid of any LRU has less than 16
>> pages
>> > to
>> > "isolate", therefore you need to loop LRUs of descendants to get 16?
>> > Cause I
>> > really cannot think of any other reason why you are doing this.
>> >
>> >
>>
>> I think I see your point. By capping pages reclaimed per cycle to 16,
>> there is not much difference even if those 16 pages are spread in
>> separate
>> LRUs . The difference is only significant when we ever raise that cap.
>> To
>> preserve the current behavior of ewb loops independent on number of LRUs
>> to loop through for each reclaiming cycle, regardless of the exact value
>> of the page cap, I would still think current approach in the patch is
>> reasonable choice. What do you think?
>
> To me I won't bother to do that. Having less than 16 pages in one LRU is
> *extremely rare* that should never happen in practice. It's pointless
> to make
> such code adjustment at this stage.
>
> Let's focus on enabling functionality first. When you have some real
> performance issue that is related to this, we can come back then.
>
I have done some rethinking about this and realized it would save quite a
bit of work: without breaking the isolation part out of sgx_reclaim_pages(),
I can remove the changes that use a list for isolated pages, and there is no
need to introduce a "state" such as RECLAIM_IN_PROGRESS. About 1/3 of the
changes for per-cgroup reclamation will be gone.
So I think I'll go this route now. The only downside may be performance if an
enclave spreads its pages across different cgroups, and even that has minimal
impact as we limit reclamation to 16 pages at a time. Let me know if anyone
feels strongly that we need to deal with that, or sees some other potential
issues I may have missed.
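Roughly, the per-cgroup reclaim path would then collapse to something like
this sketch (the walk helper is a placeholder and the exact
sgx_reclaim_pages() signature may differ from what ends up in the series):

	/*
	 * Sketch of the simplified flow: reclaim directly from each LRU in
	 * the subtree, up to SGX_NR_TO_SCAN (16) pages per call, instead of
	 * isolating pages from all LRUs into a shared list first.
	 */
	static void sgx_epc_cgroup_reclaim_pages(struct sgx_epc_cgroup *root,
						 unsigned int nr_pages)
	{
		struct sgx_epc_cgroup *epc_cg;

		for_each_epc_cgroup_and_descendant(root, epc_cg) {	/* placeholder */
			unsigned int reclaimed = sgx_reclaim_pages(&epc_cg->lru);

			if (reclaimed >= nr_pages)
				break;
			nr_pages -= reclaimed;
		}
	}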
Thanks
Haitao
On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <[email protected]> wrote:
>
> >
> >> >
> >> > The point is, with or w/o this patch, you can only reclaim 16 EPC
> >> pages
> >> > in one
> >> > function call (as you have said you are going to remove
> >> > SGX_NR_TO_SCAN_MAX,
> >> > which is a cipher to both of us). The only difference I can see is,
> >> > with this
> >> > patch, you can have multiple calls of "isolate" and then call the
> >> > "do_reclaim"
> >> > once.
> >> >
> >> > But what's the good of having the "isolate" if the "do_reclaim" can
> >> only
> >> > reclaim
> >> > 16 pages anyway?
> >> >
> >> > Back to my last reply, are you afraid of any LRU has less than 16
> >> pages
> >> > to
> >> > "isolate", therefore you need to loop LRUs of descendants to get 16?
> >> > Cause I
> >> > really cannot think of any other reason why you are doing this.
> >> >
> >> >
> >>
> >> I think I see your point. By capping pages reclaimed per cycle to 16,
> >> there is not much difference even if those 16 pages are spread in
> >> separate
> >> LRUs . The difference is only significant when we ever raise that cap.
> >> To
> >> preserve the current behavior of ewb loops independent on number of LRUs
> >> to loop through for each reclaiming cycle, regardless of the exact value
> >> of the page cap, I would still think current approach in the patch is
> >> reasonable choice. What do you think?
> >
> > To me I won't bother to do that. Having less than 16 pages in one LRU is
> > *extremely rare* that should never happen in practice. It's pointless
> > to make
> > such code adjustment at this stage.
> >
> > Let's focus on enabling functionality first. When you have some real
> > performance issue that is related to this, we can come back then.
> >
>
> I have done some rethinking about this and realize this does save quite
> some significant work: without breaking out isolation part from
> sgx_reclaim_pages(), I can remove the changes to use a list for isolated
> pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About
> 1/3 of changes for per-cgroup reclamation will be gone.
>
> So I think I'll go this route now. The only downside may be performance if
> a enclave spreads its pages in different cgroups and even that is minimum
> impact as we limit reclamation to 16 pages a time. Let me know if someone
> feel strongly we need dealt with that and see some other potential issues
> I may have missed.
We could deal with a possible performance regression later on (if there
is a need). I mean, there should be a workload first that would provide
that sort of stimulus...
> Thanks
>
> Haitao
BR, Jarkko
On Sat Jan 13, 2024 at 11:04 PM EET, Jarkko Sakkinen wrote:
> On Fri Jan 12, 2024 at 7:07 PM EET, Haitao Huang wrote:
> > On Sun, 17 Dec 2023 19:44:56 -0600, Huang, Kai <[email protected]> wrote:
> >
> > >
> > >> >
> > >> > The point is, with or w/o this patch, you can only reclaim 16 EPC
> > >> pages
> > >> > in one
> > >> > function call (as you have said you are going to remove
> > >> > SGX_NR_TO_SCAN_MAX,
> > >> > which is a cipher to both of us). The only difference I can see is,
> > >> > with this
> > >> > patch, you can have multiple calls of "isolate" and then call the
> > >> > "do_reclaim"
> > >> > once.
> > >> >
> > >> > But what's the good of having the "isolate" if the "do_reclaim" can
> > >> only
> > >> > reclaim
> > >> > 16 pages anyway?
> > >> >
> > >> > Back to my last reply, are you afraid of any LRU has less than 16
> > >> pages
> > >> > to
> > >> > "isolate", therefore you need to loop LRUs of descendants to get 16?
> > >> > Cause I
> > >> > really cannot think of any other reason why you are doing this.
> > >> >
> > >> >
> > >>
> > >> I think I see your point. By capping pages reclaimed per cycle to 16,
> > >> there is not much difference even if those 16 pages are spread in
> > >> separate
> > >> LRUs . The difference is only significant when we ever raise that cap.
> > >> To
> > >> preserve the current behavior of ewb loops independent on number of LRUs
> > >> to loop through for each reclaiming cycle, regardless of the exact value
> > >> of the page cap, I would still think current approach in the patch is
> > >> reasonable choice. What do you think?
> > >
> > > To me I won't bother to do that. Having less than 16 pages in one LRU is
> > > *extremely rare* that should never happen in practice. It's pointless
> > > to make
> > > such code adjustment at this stage.
> > >
> > > Let's focus on enabling functionality first. When you have some real
> > > performance issue that is related to this, we can come back then.
> > >
> >
> > I have done some rethinking about this and realize this does save quite
> > some significant work: without breaking out isolation part from
> > sgx_reclaim_pages(), I can remove the changes to use a list for isolated
> > pages, and no need to introduce "state" such as RECLAIM_IN_PROGRESS. About
> > 1/3 of changes for per-cgroup reclamation will be gone.
> >
> > So I think I'll go this route now. The only downside may be performance if
> > a enclave spreads its pages in different cgroups and even that is minimum
> > impact as we limit reclamation to 16 pages a time. Let me know if someone
> > feel strongly we need dealt with that and see some other potential issues
> > I may have missed.
>
> We could deal with a possible performance regression later on (if there
> is a need). I mean, there should be a workload first that would provide
> that sort of stimulus...
I.e., no reason to deal with an imaginary workload :-) Go ahead and we'll
go through it.
BR, Jarkko