2024-04-10 22:29:31

by Haitao Huang

Subject: [PATCH v11 00/14] Add Cgroup support for SGX EPC memory

SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod level control in a Kubernetes cluster on a VM or bare-metal host
[1,2].

This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.

Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which is normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.

For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design.

Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------

tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes to the existing LRU scanning algorithm, or to add new LRUs.
The result: Add a new LRU for each cgroup and scan those instead. Replace
the existing global LRU with the root cgroup's LRU (only when this new
support is compiled in, obviously).

The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.

This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.

There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.

Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
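
The argument above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: the names (epc_page, epc_lru,
epc_cgroup, lru_push, lru_pop, cgroup_reclaim) are hypothetical and are not
claimed to match the kernel's structures, and "reclaiming" a page is reduced
to removing it from the list. The point it shows is that with one LRU per
cgroup, reclaim for a cgroup only ever touches that cgroup's pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's data structures. */
struct epc_page {
	int id;
	struct epc_page *next;
};

struct epc_lru {
	struct epc_page *head;	/* oldest page, reclaimed first */
	struct epc_page *tail;	/* most recently added page */
	size_t nr_pages;
};

struct epc_cgroup {
	struct epc_lru lru;	/* holds only this cgroup's reclaimable pages */
};

/* Append a page at the tail of an LRU. */
void lru_push(struct epc_lru *lru, struct epc_page *page)
{
	assert(page != NULL);
	page->next = NULL;
	if (lru->tail)
		lru->tail->next = page;
	else
		lru->head = page;
	lru->tail = page;
	lru->nr_pages++;
}

/* Detach and return the oldest page, or NULL if the LRU is empty. */
struct epc_page *lru_pop(struct epc_lru *lru)
{
	struct epc_page *page = lru->head;

	if (!page)
		return NULL;
	lru->head = page->next;
	if (!lru->head)
		lru->tail = NULL;
	page->next = NULL;
	lru->nr_pages--;
	return page;
}

/*
 * Take up to nr_to_scan pages off the head of one cgroup's LRU.
 * The cost depends only on nr_to_scan, never on how many pages
 * other cgroups own. Returns the number of pages taken.
 */
size_t cgroup_reclaim(struct epc_cgroup *cg, size_t nr_to_scan)
{
	size_t reclaimed = 0;

	while (reclaimed < nr_to_scan && lru_pop(&cg->lru))
		reclaimed++;
	return reclaimed;
}
```

In this model, calling cgroup_reclaim() on one cgroup leaves every other
cgroup's list untouched, which is the property a single global list cannot
provide without scanning past unrelated pages.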

Unreclaimable Enclave Pages
---------------------------

There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array (VA) and Secure Enclave
Control Structure (SECS) [5], hold metadata of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.

Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
and the limit is only enforced at the time of a new EPC allocation request.
When a cgroup hits its limit but nothing is left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.
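
The enforcement behaviour described above can be sketched as a userspace
model. The names here (epc_cg_model, model_try_charge, model_set_max) are
hypothetical, the unit is pages, and reclaim is simplified to dropping one
reclaimable page at a time; it is not the series' actual charging code. It
shows two things: lowering the limit does not evict anything by itself, and
a charge fails with -ENOMEM only when the cgroup is at its limit and has
nothing reclaimable left.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one cgroup's EPC accounting, in pages. */
struct epc_cg_model {
	uint64_t max;		/* limit */
	uint64_t usage;		/* all charged pages */
	uint64_t reclaimable;	/* charged pages that sit on the LRU */
};

/* Non-preemptive: changing the limit never evicts pages by itself. */
void model_set_max(struct epc_cg_model *cg, uint64_t new_max)
{
	cg->max = new_max;
}

/*
 * Charge one page. While the cgroup is at or over its limit, swap out
 * one reclaimable page at a time. If nothing is reclaimable, fail.
 * page_is_reclaimable is nonzero for regular enclave pages and zero
 * for pages the model treats as unreclaimable (e.g. VA, SECS, vEPC).
 */
int model_try_charge(struct epc_cg_model *cg, int page_is_reclaimable)
{
	assert(cg->reclaimable <= cg->usage);

	while (cg->usage >= cg->max) {
		if (cg->reclaimable == 0)
			return -ENOMEM;
		cg->reclaimable--;
		cg->usage--;
	}

	cg->usage++;
	if (page_is_reclaimable)
		cg->reclaimable++;
	return 0;
}
```

For example, a cgroup holding four pages whose limit is lowered to two keeps
all four until its next allocation, which then reclaims until usage drops
below the new limit.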

Unreclaimable Guest VM EPC Pages
--------------------------------

The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable is left within the cgroup. The virtual EPC driver translates the
ENOMEM error resulting from an EPC allocation request into a SIGBUS to the
user process, exactly the same way it handles the host running out of
physical EPC.

This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-15], simplifying the
implementation/design, adding selftest scripts, and fixing some stability
issues found in testing.

Thanks to all for the review/test/tags/feedback provided on the previous
versions.

I appreciate your further reviewing/testing and providing tags if
appropriate.

---
V11:
- Update copyright years and use c style (Kai)
- Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
(Jarkko, Michal)
- Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
- Revise commit message for patch #8 to clarify design rationale (Kai)
- Print error instead of WARN for init failure. (Kai)
- Add check for need to queue an async reclamation before returning from
sgx_cgroup_try_charge(), do so if needed.

V10:
- Use enum instead of boolean for the 'reclaim' parameters in
sgx_alloc_epc_page(). (Dave, Jarkko)
- Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
- Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
- Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
(Kai)
- Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
- Does not return a value for callers to check.
- Its usage pattern is similar to that of sgx_reclaim_pages() now
- Add cond_resched() in the loop in the cgroup reclaimer to improve
liveness.
- Add logic for cgroup level reclamation in sgx_reclaim_direct()
- Restructure V9 patches 7-10 to make them flow better. (Kai)
- Disable cgroup if workqueue allocation failed during init. (Kai)
- Shorten names for EPC cgroup functions, structures and variables.
(Jarkko)
- Separate out a helper for addressing a single iteration of the loop in
sgx_cgroup_try_charge(). (Jarkko)
- More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)

V9:
- Add comments for static variables outside functions. (Jarkko)
- Remove unnecessary ifs. (Tim)
- Add more Reviewed-By: tags from Jarkko and TJ.

V8:
- Style fixes. (Jarkko)
- Abstract _misc_res_free/alloc() (Jarkko)
- Remove unneeded NULL checks. (Jarkko)

V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)

[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/[email protected]/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section "Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/[email protected]/
[8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
[9]v4: https://lore.kernel.org/all/[email protected]/
[10]v5: https://lore.kernel.org/all/[email protected]/
[11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
[12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
[14]v9: https://lore.kernel.org/lkml/[email protected]/T/
[15]v10: https://lore.kernel.org/linux-sgx/[email protected]/T/#t

Haitao Huang (3):
x86/sgx: Replace boolean parameters with enums
x86/sgx: Charge mem_cgroup for per-cgroup reclamation
selftests/sgx: Add scripts for EPC cgroup testing

Kristen Carlson Accardi (9):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Abstract tracking reclaimable pages in LRU
x86/sgx: Add basic EPC reclamation flow for cgroup
x86/sgx: Implement async reclamation for cgroup
x86/sgx: Abstract check for global reclaimable pages
x86/sgx: Turn on per-cgroup EPC reclamation

Sean Christopherson (2):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
Docs/x86/sgx: Add description for cgroup support

Documentation/arch/x86/sgx.rst | 83 +++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 41 +--
arch/x86/kernel/cpu/sgx/encl.h | 7 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 312 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 101 ++++++
arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
arch/x86/kernel/cpu/sgx/main.c | 205 +++++++++---
arch/x86/kernel/cpu/sgx/sgx.h | 50 ++-
arch/x86/kernel/cpu/sgx/virt.c | 2 +-
include/linux/misc_cgroup.h | 41 +++
kernel/cgroup/misc.c | 109 ++++--
tools/testing/selftests/sgx/ash_cgexec.sh | 16 +
.../selftests/sgx/run_epc_cg_selftests.sh | 275 +++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 11 +
16 files changed, 1172 insertions(+), 105 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh


base-commit: 4cece764965020c22cff7665b18a012006359095
--
2.25.1



2024-04-10 22:29:39

by Haitao Huang

Subject: [PATCH v11 02/14] cgroup/misc: Add per resource callbacks for CSS events

From: Kristen Carlson Accardi <[email protected]>

The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.

Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.

Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Reviewed-by: Jarkko Sakkinen <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
V8:
- Abstract out _misc_cg_res_free() and _misc_cg_res_alloc() (Jarkko)

V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)

V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
include/linux/misc_cgroup.h | 11 +++++
kernel/cgroup/misc.c | 84 +++++++++++++++++++++++++++++++++----
2 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;

#include <linux/cgroup.h>

+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+ int (*alloc)(struct misc_cg *cg);
+ void (*free)(struct misc_cg *cg);
+};
+
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {

u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);

diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..14ab13ef3bc7 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
*/
static u64 misc_res_capacity[MISC_CG_RES_TYPES];

+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
/**
* parent_misc() - Get the parent of the passed misc cgroup.
* @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
}
EXPORT_SYMBOL_GPL(misc_cg_set_capacity);

+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ if (!ops->alloc) {
+ pr_err("%s: alloc missing\n", __func__);
+ return -EINVAL;
+ }
+
+ if (!ops->free) {
+ pr_err("%s: free missing\n", __func__);
+ return -EINVAL;
+ }
+
+ misc_res_ops[type] = ops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
/**
* misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
* @type: Misc res type in misc cg to cancel the charge from.
@@ -371,6 +404,33 @@ static struct cftype misc_cg_files[] = {
{}
};

+static inline int _misc_cg_res_alloc(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+ int ret;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++) {
+ WRITE_ONCE(cg->res[i].max, MAX_NUM);
+ atomic64_set(&cg->res[i].usage, 0);
+ if (misc_res_ops[i]) {
+ ret = misc_res_ops[i]->alloc(cg);
+ if (ret)
+ return ret;
+ }
+ }
+
+ return 0;
+}
+
+static inline void _misc_cg_res_free(struct misc_cg *cg)
+{
+ enum misc_res_type i;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+}
+
/**
* misc_cg_alloc() - Allocate misc cgroup.
* @parent_css: Parent cgroup.
@@ -383,20 +443,25 @@ static struct cftype misc_cg_files[] = {
static struct cgroup_subsys_state *
misc_cg_alloc(struct cgroup_subsys_state *parent_css)
{
- enum misc_res_type i;
- struct misc_cg *cg;
+ struct misc_cg *parent_cg, *cg;
+ int ret;

- if (!parent_css) {
- cg = &root_cg;
+ if (unlikely(!parent_css)) {
+ parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+ parent_cg = css_misc(parent_css);
}

- for (i = 0; i < MISC_CG_RES_TYPES; i++) {
- WRITE_ONCE(cg->res[i].max, MAX_NUM);
- atomic64_set(&cg->res[i].usage, 0);
+ ret = _misc_cg_res_alloc(cg);
+ if (ret) {
+ _misc_cg_res_free(cg);
+ if (likely(parent_css))
+ kfree(cg);
+
+ return ERR_PTR(ret);
}

return &cg->css;
@@ -410,7 +475,10 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
*/
static void misc_cg_free(struct cgroup_subsys_state *css)
{
- kfree(css_misc(css));
+ struct misc_cg *cg = css_misc(css);
+
+ _misc_cg_res_free(cg);
+ kfree(cg);
}

/* Cgroup controller callbacks */
--
2.25.1
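
For readers who want to see the shape of the callback registration from
patch 02/14 outside the kernel, here is a userspace mock. It is a sketch
only: every mock_* and epc_* name is hypothetical, the cgroup structure is
reduced to a single flag, and the mock additionally rejects a NULL ops
pointer, which the patch above does not check. It mirrors two ideas from the
diff: both callbacks are required at registration time, and allocation walks
the per-type table and stops at the first failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical resource types standing in for enum misc_res_type. */
enum mock_res_type { MOCK_RES_OTHER, MOCK_RES_SGX_EPC, MOCK_RES_TYPES };

/* Reduced stand-in for struct misc_cg. */
struct mock_cg {
	int epc_ready;	/* set by the EPC provider's alloc callback */
};

struct mock_res_ops {
	int (*alloc)(struct mock_cg *cg);
	void (*free)(struct mock_cg *cg);
};

/* One ops pointer per resource type; NULL means "no callbacks". */
static const struct mock_res_ops *mock_ops[MOCK_RES_TYPES];

/* Register callbacks for a type; both callbacks are mandatory. */
int mock_set_ops(enum mock_res_type type, const struct mock_res_ops *ops)
{
	if ((unsigned int)type >= MOCK_RES_TYPES)
		return -EINVAL;
	if (!ops || !ops->alloc || !ops->free)
		return -EINVAL;
	mock_ops[type] = ops;
	return 0;
}

/* Run every registered alloc callback; stop at the first error. */
int mock_cg_res_alloc(struct mock_cg *cg)
{
	assert(cg != NULL);
	for (int i = 0; i < MOCK_RES_TYPES; i++) {
		if (mock_ops[i]) {
			int ret = mock_ops[i]->alloc(cg);

			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Run every registered free callback. */
void mock_cg_res_free(struct mock_cg *cg)
{
	for (int i = 0; i < MOCK_RES_TYPES; i++)
		if (mock_ops[i])
			mock_ops[i]->free(cg);
}

/* Example provider, playing the role of the SGX driver. */
int epc_alloc(struct mock_cg *cg) { cg->epc_ready = 1; return 0; }
void epc_free(struct mock_cg *cg) { cg->epc_ready = 0; }

const struct mock_res_ops epc_ops = { .alloc = epc_alloc, .free = epc_free };
```

A provider would call mock_set_ops() once during its own initialization;
after that, every cgroup creation and destruction in the mock reaches the
provider through the table.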


2024-04-10 22:38:06

by Haitao Huang

Subject: [PATCH v11 13/14] Docs/x86/sgx: Add description for cgroup support

From: Sean Christopherson <[email protected]>

Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.

Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V8:
- Limit text width to 80 characters to be consistent.

V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes

V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 83 ++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)

diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..c537e6a9aa65 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,86 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates
+distribution of SGX EPC memory, which is a subset of system RAM that is used to
+provide SGX-enabled applications with protected memory, and is otherwise
+inaccessible, i.e. shows up as reserved in /proc/iomem and cannot be
+read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all
+intents and purposes the EPC is independent from normal system memory, e.g. must
+be reserved at boot from RAM and cannot be converted between EPC and normal
+memory while the system is running. The EPC is managed by the SGX subsystem and
+is not accounted by the memory controller. Note that this is true only for EPC
+memory itself, i.e. normal memory allocations related to SGX and EPC memory,
+e.g. the backing memory for evicted EPC pages, are accounted, limited and
+protected by the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual
+memory techniques and pages can be swapped out of the EPC to their backing store
+(normal system memory allocated via shmem). The SGX EPC subsystem is analogous
+to the memory subsystem, and it implements limit and protection models for EPC
+memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files,
+please see Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If
+a value which is not PAGE_SIZE aligned is written, the actual value used by the
+controller will be rounded down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc
+ resource will show the total amount of EPC memory available on the
+ platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc
+ resource will show the current active EPC memory usage of the cgroup and
+ its descendants. EPC pages that are swapped out to backing RAM are not
+ included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The
+ sgx_epc resource will show the EPC usage hard limit. The default is
+ "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g., for
+ page fault handling, will be blocked until EPC can be reclaimed from the
+ cgroup. If there are no pages left that are reclaimable within the same
+ group, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not
+ reclaimable by the host kernel. In case the guest cgroup's limit is
+ reached and no reclaimable pages left in the same cgroup, the virtual
+ EPC driver returns SIGBUS to the user space process to indicate failure
+ on new EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower
+ than the current usage to this file, the cgroup will not preemptively
+ deallocate pages currently in use, and will only start blocking the next
+ allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim due to
+ its EPC usage approaching (or exceeding) its max EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged
+to the original cgroup until the page is released or reclaimed. Migrating a
+process to a different cgroup doesn't move the EPC charges that it incurred
+while in the previous cgroup to its new cgroup.
--
2.25.1
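
The documentation in patch 13/14 says that a limit which is not PAGE_SIZE
aligned is rounded down to the closest PAGE_SIZE multiple. A small userspace
sketch of that arithmetic, and of the "sgx_epc <bytes>" line one would write
to misc.max, is given below. The helper names and MODEL_PAGE_SIZE are
hypothetical, and the 4 KiB page size is an assumption rather than something
the patch states.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4 KiB pages. */
#define MODEL_PAGE_SIZE 4096ULL

/* Effective limit for a requested byte count: rounded down to a page. */
uint64_t epc_limit_round_down(uint64_t bytes)
{
	return bytes & ~(MODEL_PAGE_SIZE - 1);
}

/*
 * Build the key/value line for the sgx_epc entry of misc.max, using the
 * rounded-down value so that what is written matches what takes effect.
 * Returns what snprintf() returns.
 */
int format_sgx_epc_max(char *buf, size_t len, uint64_t bytes)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "sgx_epc %" PRIu64 "\n",
			epc_limit_round_down(bytes));
}
```

Under these assumptions, a request for 8193 bytes yields an effective limit
of 8192 bytes, i.e. two pages.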


2024-04-13 06:48:52

by Mikko Ylinen

Subject: Re: [PATCH v11 00/14] Add Cgroup support for SGX EPC memory

On Wed, Apr 10, 2024 at 11:25:44AM -0700, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The existing
> cgroup memory controller cannot be used to limit or account for SGX EPC
> memory, which is a desirable feature in some environments, e.g., support
> for pod level control in a Kubernetes cluster on a VM or bare-metal host
> [1,2].
>
> This patchset implements the support for sgx_epc memory within the misc
> cgroup controller. A user can use the misc cgroup controller to set and
> enforce a max limit on total EPC usage per cgroup. The implementation
> reports current usage and events of reaching the limit per cgroup as well
> as the total system capacity.
>
> Much like normal system memory, EPC memory can be overcommitted via virtual
> memory techniques and pages can be swapped out of the EPC to their backing
> store, which is normal system memory allocated via shmem and accounted by
> the memory controller. Similar to per-cgroup reclamation done by the memory
> controller, the EPC misc controller needs to implement a per-cgroup EPC
> reclaiming process: when the EPC usage of a cgroup reaches its hard limit
> ('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
> some EPC pages within the same cgroup to make room for new allocations.
>
> For that, this implementation tracks reclaimable EPC pages in a separate
> LRU list in each cgroup, and below are more details and justification of
> this design.
>
> Track EPC pages in per-cgroup LRUs (from Dave)
> ----------------------------------------------
>
> tl;dr: A cgroup hitting its limit should be as similar as possible to the
> system running out of EPC memory. The only two choices to implement that
> are nasty changes to the existing LRU scanning algorithm, or to add new LRUs.
> The result: Add a new LRU for each cgroup and scan those instead. Replace
> the existing global LRU with the root cgroup's LRU (only when this new
> support is compiled in, obviously).
>
> The existing EPC memory management aims to be a miniature version of the
> core VM where EPC memory can be overcommitted and reclaimed. EPC
> allocations can wait for reclaim. The alternative to waiting would have
> been to send a signal and let the enclave die.
>
> This series attempts to implement that same logic for cgroups, for the same
> reasons: it's preferable to wait for memory to become available and let
> reclaim happen than to do things that are fatal to enclaves.
>
> There is currently a global reclaimable page SGX LRU list. That list (and
> the existing scanning algorithm) is essentially useless for doing reclaim
> when a cgroup hits its limit because the cgroup's pages are scattered
> around that LRU. It is unspeakably inefficient to scan a linked list with
> millions of entries for what could be dozens of pages from a cgroup that
> needs reclaim.
>
> Even if unspeakably slow reclaim was accepted, the existing scanning
> algorithm only picks a few pages off the head of the global LRU. It would
> either need to hold the list locks for unreasonable amounts of time, or be
> taught to scan the list in pieces, which has its own challenges.
>
> Unreclaimable Enclave Pages
> ---------------------------
>
> There are a variety of page types for enclaves, each serving different
> purposes [5]. Although the SGX architecture supports swapping for all
> types, some special pages, e.g., Version Array (VA) and Secure Enclave
> Control Structure (SECS) [5], hold metadata of reclaimed pages and
> enclaves. That makes reclamation of such pages more intricate to manage.
> The SGX driver global reclaimer currently does not swap out VA pages. It
> only swaps the SECS page of an enclave when all other associated pages have
> been swapped out. The cgroup reclaimer follows the same approach and does
> not track those in per-cgroup LRUs and considers them as unreclaimable
> pages. The allocation of these pages is counted towards the usage of a
> specific cgroup and is subject to the cgroup's set EPC limits.
>
> Earlier versions of this series implemented forced enclave-killing to
> reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
> particularly in scenarios where a user or administrator reduces this limit
> post-launch of enclaves. However, subsequent discussions [3, 4] indicated
> that such preemptive enforcement is not necessary for the misc-controllers.
> Therefore, reclaiming SECS/VA pages by force-killing enclaves was removed,
> and the limit is only enforced at the time of a new EPC allocation request.
> When a cgroup hits its limit but nothing is left in the LRUs of the subtree,
> i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
> within that cgroup will result in an 'ENOMEM'.
>
> Unreclaimable Guest VM EPC Pages
> --------------------------------
>
> The EPC pages allocated for guest VMs by the virtual EPC driver are not
> reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
> those as unreclaimable and returns ENOMEM when its limit is hit and nothing
> reclaimable is left within the cgroup. The virtual EPC driver translates the
> ENOMEM error resulting from an EPC allocation request into a SIGBUS to the
> user process, exactly the same way it handles the host running out of
> physical EPC.
>
> This work was originally authored by Sean Christopherson a few years ago,
> and previously modified by Kristen C. Accardi to utilize the misc cgroup
> controller rather than a custom controller. I have been updating the
> patches based on review comments since V2 [7-15], simplifying the
> implementation/design, adding selftest scripts, and fixing some stability
> issues found in testing.
>
> Thanks to all for the review/test/tags/feedback provided on the previous
> versions.
>
> I appreciate your further reviewing/testing and providing tags if
> appropriate.

I've continued to use/test this series in one of its main target
environments: running "misc.max sgx_epc" limited containers in a
Kubernetes cluster. Everything is working as expected when using
a test suite built for that env (tests similar to what Haitao has here)
running Gramine-SGX "stress" containers.

Tested-by: Mikko Ylinen <[email protected]>

>
> ---
> V11:
> - Update copyright years and use c style (Kai)
> - Improve and simplify test scripts: remove cgroup-tools and bash dependency, drop cgroup v1.
> (Jarkko, Michal)
> - Add more stub/wrapper functions to minimize #ifdefs in c file. (Kai)
> - Revise commit message for patch #8 to clarify design rationale (Kai)
> - Print error instead of WARN for init failure. (Kai)
> - Add check for need to queue an async reclamation before returning from
> sgx_cgroup_try_charge(), do so if needed.
>
> V10:
> - Use enum instead of boolean for the 'reclaim' parameters in
> sgx_alloc_epc_page(). (Dave, Jarkko)
> - Pass mm struct instead of a boolean 'indirect'. (Dave, Jarkko)
> - Add comments/macros to clarify the cgroup async reclaimer design. (Kai)
> - Simplify sgx_reclaim_pages() signature, removing a pointer passed in.
> (Kai)
> - Clarify design of sgx_cgroup_reclaim_pages(). (Kai)
> - Does not return a value for callers to check.
> - Its usage pattern is similar to that of sgx_reclaim_pages() now
> - Add cond_resched() in the loop in the cgroup reclaimer to improve
> liveliness.
> - Add logic for cgroup level reclamation in sgx_reclaim_direct()
> - Restructure V9 patches 7-10 to make them flow better. (Kai)
> - Disable cgroup if workqueue allocation failed during init. (Kai)
> - Shorten names for EPC cgroup functions, structures and variables.
> (Jarkko)
> - Separate out a helper for addressing a single iteration of the loop in
> sgx_cgroup_try_charge(). (Jarkko)
> - More cleanup/clarifying/comments/style fixes. (Kai, Jarkko)
>
> V9:
> - Add comments for static variables outside functions. (Jarkko)
> - Remove unnecessary ifs. (Tim)
> - Add more Reviewed-By: tags from Jarkko and TJ.
>
> V8:
> - Style fixes. (Jarkko)
> - Abstract _misc_res_free/alloc() (Jarkko)
> - Remove unneeded NULL checks. (Jarkko)
>
> V7:
> - Split the large patch for the final EPC implementation, #10 in V6, into
> smaller ones. (Dave, Kai)
> - Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
> into two functions (Kai)
> - Removed patches to introduce the EPC page states, list for storing
> candidate pages for reclamation. (not needed due to above changes)
> - Make ops one per resource type and store them in array (Michal)
> - Rename the ops struct to misc_res_ops, and enforce the constraints of
> required callback functions (Jarkko)
> - Initialize epc cgroup in sgx driver init function. (Kai)
> - Moved addition of priv field to patch 4 where it was used first. (Jarkko)
> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
> - Use a static for root cgroup (Kai)
>
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/lkml/[email protected]/
> [4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
> [5]Documentation/arch/x86/sgx.rst, Section "Enclave Page Types"
> [6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [7]v2: https://lore.kernel.org/all/[email protected]/
> [8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
> [9]v4: https://lore.kernel.org/all/[email protected]/
> [10]v5: https://lore.kernel.org/all/[email protected]/
> [11]v6: https://lore.kernel.org/linux-sgx/[email protected]/
> [12]v7: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
> [13]v8: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
> [14]v9: https://lore.kernel.org/lkml/[email protected]/T/
> [15]v10: https://lore.kernel.org/linux-sgx/[email protected]/T/#t
>
> Haitao Huang (3):
> x86/sgx: Replace boolean parameters with enums
> x86/sgx: Charge mem_cgroup for per-cgroup reclamation
> selftests/sgx: Add scripts for EPC cgroup testing
>
> Kristen Carlson Accardi (9):
> cgroup/misc: Add per resource callbacks for CSS events
> cgroup/misc: Export APIs for SGX driver
> cgroup/misc: Add SGX EPC resource type
> x86/sgx: Implement basic EPC misc cgroup functionality
> x86/sgx: Abstract tracking reclaimable pages in LRU
> x86/sgx: Add basic EPC reclamation flow for cgroup
> x86/sgx: Implement async reclamation for cgroup
> x86/sgx: Abstract check for global reclaimable pages
> x86/sgx: Turn on per-cgroup EPC reclamation
>
> Sean Christopherson (2):
> x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
> Docs/x86/sgx: Add description for cgroup support
>
> Documentation/arch/x86/sgx.rst | 83 +++++
> arch/x86/Kconfig | 13 +
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/encl.c | 41 +--
> arch/x86/kernel/cpu/sgx/encl.h | 7 +-
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 312 ++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 101 ++++++
> arch/x86/kernel/cpu/sgx/ioctl.c | 10 +-
> arch/x86/kernel/cpu/sgx/main.c | 205 +++++++++---
> arch/x86/kernel/cpu/sgx/sgx.h | 50 ++-
> arch/x86/kernel/cpu/sgx/virt.c | 2 +-
> include/linux/misc_cgroup.h | 41 +++
> kernel/cgroup/misc.c | 109 ++++--
> tools/testing/selftests/sgx/ash_cgexec.sh | 16 +
> .../selftests/sgx/run_epc_cg_selftests.sh | 275 +++++++++++++++
> .../selftests/sgx/watch_misc_for_tests.sh | 11 +
> 16 files changed, 1172 insertions(+), 105 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
> create mode 100755 tools/testing/selftests/sgx/ash_cgexec.sh
> create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
> create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>
>
> base-commit: 4cece764965020c22cff7665b18a012006359095
> --
> 2.25.1
>

2024-04-15 13:46:18

by Huang, Kai

Subject: Re: [PATCH v11 02/14] cgroup/misc: Add per resource callbacks for CSS events

On Wed, 2024-04-10 at 11:25 -0700, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> The misc cgroup controller (subsystem) currently does not perform
> resource type specific action for Cgroups Subsystem State (CSS) events:
> the 'css_alloc' event when a cgroup is created and the 'css_free' event
> when a cgroup is destroyed.
>
> Define callbacks for those events and allow resource providers to
> register the callbacks per resource type as needed. This will be
> utilized later by the EPC misc cgroup support implemented in the SGX
> driver.
>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
> Reviewed-by: Jarkko Sakkinen <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
>

Reviewed-by: Kai Huang <[email protected]>

Nitpicks below:

>
> +/**
> + * struct misc_res_ops: per resource type callback ops.
> + * @alloc: invoked for resource specific initialization when cgroup is allocated.
> + * @free: invoked for resource specific cleanup when cgroup is deallocated.
> + */
> +struct misc_res_ops {
> + int (*alloc)(struct misc_cg *cg);
> + void (*free)(struct misc_cg *cg);
> +};
> +

Perhaps you can mention in the changelog why the callbacks take
'struct misc_cg *cg' as the parameter rather than 'struct misc_res *res'.

It's not very clear in this patch.
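
For illustration only (not part of the patch): a minimal user-space sketch of
how per-resource ops with this signature might be stored in an array and
dispatched. The struct layouts, the 'priv' slot, and the helpers set_res_ops(),
cg_alloc_res() and cg_free_res() are assumptions made up for this example, not
the kernel definitions. One possible reading of the 'struct misc_cg *cg'
parameter is that each callback can then pick its own slot out of the cgroup.

```c
/* User-space sketch only: simplified stand-ins, not the kernel definitions. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum misc_res_type { MISC_CG_RES_SGX_EPC, MISC_CG_RES_TYPES };

struct misc_cg;

/* Same shape as the struct quoted above. */
struct misc_res_ops {
	int (*alloc)(struct misc_cg *cg);
	void (*free)(struct misc_cg *cg);
};

/* Assumed layout: one slot per resource type, each with a private pointer. */
struct misc_res { void *priv; };
struct misc_cg { struct misc_res res[MISC_CG_RES_TYPES]; };

/* One ops pointer per resource type, stored in an array. */
static const struct misc_res_ops *ops_table[MISC_CG_RES_TYPES];

/* Hypothetical registration helper: both callbacks are required. */
int set_res_ops(enum misc_res_type type, const struct misc_res_ops *ops)
{
	if ((int)type < 0 || type >= MISC_CG_RES_TYPES)
		return -1;
	if (!ops || !ops->alloc || !ops->free)
		return -1;
	ops_table[type] = ops;
	return 0;
}

/* Run every registered free callback; callbacks must tolerate empty slots. */
void cg_free_res(struct misc_cg *cg)
{
	for (int i = 0; i < MISC_CG_RES_TYPES; i++)
		if (ops_table[i])
			ops_table[i]->free(cg);
}

/* Run every registered alloc callback; undo everything on failure. */
int cg_alloc_res(struct misc_cg *cg)
{
	for (int i = 0; i < MISC_CG_RES_TYPES; i++) {
		if (ops_table[i] && ops_table[i]->alloc(cg)) {
			cg_free_res(cg);
			return -1;
		}
	}
	return 0;
}

/* Example provider: the callback finds its own slot through the cgroup. */
static int epc_alloc(struct misc_cg *cg)
{
	struct misc_res *res = &cg->res[MISC_CG_RES_SGX_EPC];

	res->priv = calloc(1, sizeof(long));
	return res->priv ? 0 : -1;
}

static void epc_free(struct misc_cg *cg)
{
	struct misc_res *res = &cg->res[MISC_CG_RES_SGX_EPC];

	free(res->priv);	/* free(NULL) is a no-op */
	res->priv = NULL;
}

static const struct misc_res_ops epc_ops = {
	.alloc = epc_alloc,
	.free = epc_free,
};

/* Registers the provider, then walks one cgroup through alloc and free. */
int demo_callbacks(void)
{
	struct misc_cg cg = { 0 };

	if (set_res_ops(MISC_CG_RES_SGX_EPC, &epc_ops))
		return -1;
	if (cg_alloc_res(&cg))
		return -1;
	assert(cg.res[MISC_CG_RES_SGX_EPC].priv != NULL);
	cg_free_res(&cg);
	return cg.res[MISC_CG_RES_SGX_EPC].priv == NULL ? 0 : -1;
}
```

In this sketch a provider that received only its own 'struct misc_res *' could
still manage 'priv', so the example does not settle the question; it only shows
what the cgroup-level signature makes possible.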


[...]

> static struct cgroup_subsys_state *
> misc_cg_alloc(struct cgroup_subsys_state *parent_css)
> {
> - enum misc_res_type i;
> - struct misc_cg *cg;
> + struct misc_cg *parent_cg, *cg;
> + int ret;
>
> - if (!parent_css) {
> - cg = &root_cg;
> + if (unlikely(!parent_css)) {
> + parent_cg = cg = &root_cg;

The 'unlikely' seems to be new.

I think you can remove it, since it is unrelated to this patch.
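
As a small illustration of why the hint is separable from the functional
change: unlikely() is only a branch-prediction annotation, built on the
GCC/Clang __builtin_expect() builtin, and does not alter which branch is
taken. The sketch below is user-space code with made-up names (struct css,
pick_plain(), pick_hinted()), not the kernel function under review.

```c
#include <stddef.h>

/* GCC/Clang builtin; the kernel's unlikely() is defined along these lines. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for a cgroup subsystem state. */
struct css { int id; };

static struct css root_css = { 0 };

/* Falls back to the root when no parent is given, without a hint. */
struct css *pick_plain(struct css *parent)
{
	if (!parent)
		return &root_css;
	return parent;
}

/* Same logic with the hint: only the compiler's code layout may differ. */
struct css *pick_hinted(struct css *parent)
{
	if (unlikely(!parent))
		return &root_css;
	return parent;
}
```

Both functions return the same pointer for every input, which is why adding or
dropping the annotation can be done in a separate cleanup.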