SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The existing
cgroup memory controller cannot be used to limit or account for SGX EPC
memory, which is a desirable feature in some environments, e.g., support
for pod level control in a Kubernates cluster on a VM or bare-metal host
[1,2].
This patchset implements the support for sgx_epc memory within the misc
cgroup controller. A user can use the misc cgroup controller to set and
enforce a max limit on total EPC usage per cgroup. The implementation
reports current usage and events of reaching the limit per cgroup as well
as the total system capacity.
Much like normal system memory, EPC memory can be overcommitted via virtual
memory techniques and pages can be swapped out of the EPC to their backing
store, which are normal system memory allocated via shmem and accounted by
the memory controller. Similar to per-cgroup reclamation done by the memory
controller, the EPC misc controller needs to implement a per-cgroup EPC
reclaiming process: when the EPC usage of a cgroup reaches its hard limit
('sgx_epc' entry in the 'misc.max' file), the cgroup starts swapping out
some EPC pages within the same cgroup to make room for new allocations.
For that, this implementation tracks reclaimable EPC pages in a separate
LRU list in each cgroup, and below are more details and justification of
this design.
Track EPC pages in per-cgroup LRUs (from Dave)
----------------------------------------------
tl;dr: A cgroup hitting its limit should be as similar as possible to the
system running out of EPC memory. The only two choices to implement that
are nasty changes the existing LRU scanning algorithm, or to add new LRUs.
The result: Add a new LRU for each cgroup and scans those instead. Replace
the existing global cgroup with the root cgroup's LRU (only when this new
support is compiled in, obviously).
The existing EPC memory management aims to be a miniature version of the
core VM where EPC memory can be overcommitted and reclaimed. EPC
allocations can wait for reclaim. The alternative to waiting would have
been to send a signal and let the enclave die.
This series attempts to implement that same logic for cgroups, for the same
reasons: it's preferable to wait for memory to become available and let
reclaim happen than to do things that are fatal to enclaves.
There is currently a global reclaimable page SGX LRU list. That list (and
the existing scanning algorithm) is essentially useless for doing reclaim
when a cgroup hits its limit because the cgroup's pages are scattered
around that LRU. It is unspeakably inefficient to scan a linked list with
millions of entries for what could be dozens of pages from a cgroup that
needs reclaim.
Even if unspeakably slow reclaim was accepted, the existing scanning
algorithm only picks a few pages off the head of the global LRU. It would
either need to hold the list locks for unreasonable amounts of time, or be
taught to scan the list in pieces, which has its own challenges.
Unreclaimable Enclave Pages
---------------------------
There are a variety of page types for enclaves, each serving different
purposes [5]. Although the SGX architecture supports swapping for all
types, some special pages, e.g., Version Array(VA) and Secure Enclave
Control Structure (SECS)[5], holds meta data of reclaimed pages and
enclaves. That makes reclamation of such pages more intricate to manage.
The SGX driver global reclaimer currently does not swap out VA pages. It
only swaps the SECS page of an enclave when all other associated pages have
been swapped out. The cgroup reclaimer follows the same approach and does
not track those in per-cgroup LRUs and considers them as unreclaimable
pages. The allocation of these pages is counted towards the usage of a
specific cgroup and is subject to the cgroup's set EPC limits.
Earlier versions of this series implemented forced enclave-killing to
reclaim VA and SECS pages. That was designed to enforce the 'max' limit,
particularly in scenarios where a user or administrator reduces this limit
post-launch of enclaves. However, subsequent discussions [3, 4] indicated
that such preemptive enforcement is not necessary for the misc-controllers.
Therefore, reclaiming SECS/VA pages by force-killing enclaves were removed,
and the limit is only enforced at the time of new EPC allocation request.
When a cgroup hits its limit but nothing left in the LRUs of the subtree,
i.e., nothing to reclaim in the cgroup, any new attempt to allocate EPC
within that cgroup will result in an 'ENOMEM'.
Unreclaimable Guest VM EPC Pages
--------------------------------
The EPC pages allocated for guest VMs by the virtual EPC driver are not
reclaimable by the host kernel [6]. Therefore an EPC cgroup also treats
those as unreclaimable and returns ENOMEM when its limit is hit and nothing
reclaimable left within the cgroup. The virtual EPC driver translates the
ENOMEM error resulted from an EPC allocation request into a SIGBUS to the
user process exactly the same way handling host running out of physical
EPC.
This work was originally authored by Sean Christopherson a few years ago,
and previously modified by Kristen C. Accardi to utilize the misc cgroup
controller rather than a custom controller. I have been updating the
patches based on review comments since V2 [7-11], simplified the
implementation/design, added selftest scripts, fixed some stability issues
found from testing.
Thanks to all for the review/test/tags/feedback provided on the previous
versions.
I appreciate your further reviewing/testing and providing tags if
appropriate.
---
V7:
- Split the large patch for the final EPC implementation, #10 in V6, into
smaller ones. (Dave, Kai)
- Scan and reclaim one cgroup at a time, don't split sgx_reclaim_pages()
into two functions (Kai)
- Removed patches to introduce the EPC page states, list for storing
candidate pages for reclamation. (not needed due to above changes)
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of
required callback functions (Jarkko)
- Initialize epc cgroup in sgx driver init function. (Kai)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
- Use a static for root cgroup (Kai)
[1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@DM6PR21MB1177.namprd21.prod.outlook.com/
[2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
[3]https://lore.kernel.org/lkml/[email protected]/
[4]https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
[5]Documentation/arch/x86/sgx.rst, Section"Enclave Page Types"
[6]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
[7]v2: https://lore.kernel.org/all/[email protected]/
[8]v3: https://lore.kernel.org/linux-sgx/[email protected]/
[9]v4: https://lore.kernel.org/all/[email protected]/
[10]v5: https://lore.kernel.org/all/[email protected]/
[11]v6:https://lore.kernel.org/linux-sgx/[email protected]/
Haitao Huang (2):
x86/sgx: Charge mem_cgroup for per-cgroup reclamation
selftests/sgx: Add scripts for EPC cgroup testing
Kristen Carlson Accardi (10):
cgroup/misc: Add per resource callbacks for CSS events
cgroup/misc: Export APIs for SGX driver
cgroup/misc: Add SGX EPC resource type
x86/sgx: Implement basic EPC misc cgroup functionality
x86/sgx: Abstract tracking reclaimable pages in LRU
x86/sgx: Implement EPC reclamation flows for cgroup
x86/sgx: Add EPC reclamation in cgroup try_charge()
x86/sgx: Abstract check for global reclaimable pages
x86/sgx: Expose sgx_epc_cgroup_reclaim_pages() for global reclaimer
x86/sgx: Turn on per-cgroup EPC reclamation
Sean Christopherson (3):
x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
x86/sgx: Expose sgx_reclaim_pages() for cgroup
Docs/x86/sgx: Add description for cgroup support
Documentation/arch/x86/sgx.rst | 74 +++++
arch/x86/Kconfig | 13 +
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/encl.c | 43 ++-
arch/x86/kernel/cpu/sgx/encl.h | 3 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 274 ++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 85 ++++++
arch/x86/kernel/cpu/sgx/main.c | 186 ++++++++----
arch/x86/kernel/cpu/sgx/sgx.h | 22 ++
include/linux/misc_cgroup.h | 41 +++
kernel/cgroup/misc.c | 85 +++++-
.../selftests/sgx/run_epc_cg_selftests.sh | 246 ++++++++++++++++
.../selftests/sgx/watch_misc_for_tests.sh | 13 +
13 files changed, 997 insertions(+), 89 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
--
2.25.1
From: Sean Christopherson <[email protected]>
Add initial documentation of how to regulate the distribution of
SGX Enclave Page Cache (EPC) memory via the Miscellaneous cgroup
controller.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang<[email protected]>
Signed-off-by: Haitao Huang<[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- Remove mentioning of VMM specific behavior on handling SIGBUS
- Remove statement of forced reclamation, add statement to specify
ENOMEM returned when no reclamation possible.
- Added statements on the non-preemptive nature for the max limit
- Dropped Reviewed-by tag because of changes
V4:
- Fix indentation (Randy)
- Change misc.events file to be read-only
- Fix a typo for 'subsystem'
- Add behavior when VMM overcommit EPC with a cgroup (Mikko)
---
Documentation/arch/x86/sgx.rst | 74 ++++++++++++++++++++++++++++++++++
1 file changed, 74 insertions(+)
diff --git a/Documentation/arch/x86/sgx.rst b/Documentation/arch/x86/sgx.rst
index d90796adc2ec..dfc8fac13ab2 100644
--- a/Documentation/arch/x86/sgx.rst
+++ b/Documentation/arch/x86/sgx.rst
@@ -300,3 +300,77 @@ to expected failures and handle them as follows:
first call. It indicates a bug in the kernel or the userspace client
if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
a return code other than 0.
+
+
+Cgroup Support
+==============
+
+The "sgx_epc" resource within the Miscellaneous cgroup controller regulates distribution of SGX
+EPC memory, which is a subset of system RAM that is used to provide SGX-enabled applications
+with protected memory, and is otherwise inaccessible, i.e. shows up as reserved in /proc/iomem
+and cannot be read/written outside of an SGX enclave.
+
+Although current systems implement EPC by stealing memory from RAM, for all intents and
+purposes the EPC is independent from normal system memory, e.g. must be reserved at boot from
+RAM and cannot be converted between EPC and normal memory while the system is running. The EPC
+is managed by the SGX subsystem and is not accounted by the memory controller. Note that this
+is true only for EPC memory itself, i.e. normal memory allocations related to SGX and EPC
+memory, e.g. the backing memory for evicted EPC pages, are accounted, limited and protected by
+the memory controller.
+
+Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques
+and pages can be swapped out of the EPC to their backing store (normal system memory allocated
+via shmem). The SGX EPC subsystem is analogous to the memory subsystem, and it implements
+limit and protection models for EPC memory.
+
+SGX EPC Interface Files
+-----------------------
+
+For a generic description of the Miscellaneous controller interface files, please see
+Documentation/admin-guide/cgroup-v2.rst
+
+All SGX EPC memory amounts are in bytes unless explicitly stated otherwise. If a value which
+is not PAGE_SIZE aligned is written, the actual value used by the controller will be rounded
+down to the closest PAGE_SIZE multiple.
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. The sgx_epc resource will
+ show the total amount of EPC memory available on the platform.
+
+ misc.current
+ A read-only flat-keyed file shown in the non-root cgroups. The sgx_epc resource will
+ show the current active EPC memory usage of the cgroup and its descendants. EPC pages
+ that are swapped out to backing RAM are not included in the current count.
+
+ misc.max
+ A read-write single value file which exists on non-root cgroups. The sgx_epc resource
+ will show the EPC usage hard limit. The default is "max".
+
+ If a cgroup's EPC usage reaches this limit, EPC allocations, e.g. for page fault
+ handling, will be blocked until EPC can be reclaimed from the cgroup. If there are no
+ pages left that are reclaimable within the same group, the kernel returns ENOMEM.
+
+ The EPC pages allocated for a guest VM by the virtual EPC driver are not reclaimable by
+ the host kernel. In case the guest cgroup's limit is reached and no reclaimable pages
+ left in the same cgroup, the virtual EPC driver returns SIGBUS to the user space
+ process to indicate failure on new EPC allocation requests.
+
+ The misc.max limit is non-preemptive. If a user writes a limit lower than the current
+ usage to this file, the cgroup will not preemptively deallocate pages currently in use,
+ and will only start blocking the next allocation and reclaiming EPC at that time.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+ A value change in this file generates a file modified event.
+
+ max
+ The number of times the cgroup has triggered a reclaim
+ due to its EPC usage approaching (or exceeding) its max
+ EPC boundary.
+
+Migration
+---------
+
+Once an EPC page is charged to a cgroup (during allocation), it remains charged to the original
+cgroup until the page is released or reclaimed. Migrating a process to a different cgroup
+doesn't move the EPC charges that it incurred while in the previous cgroup to its new cgroup.
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
SGX Enclave Page Cache (EPC) memory allocations are separate from normal
RAM allocations, and are managed solely by the SGX subsystem. The
existing cgroup memory controller cannot be used to limit or account for
SGX EPC memory, which is a desirable feature in some environments. For
example, in a Kubernates environment, a user can request certain EPC
quota for a pod but the orchestrator can not enforce the quota to limit
runtime EPC usage of the pod without an EPC cgroup controller.
Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
limit and track EPC allocations per cgroup. Earlier patches have added
the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
support in SGX driver as the "sgx_epc" resource provider:
- Set "capacity" of EPC by calling misc_cg_set_capacity()
- Update EPC usage counter, "current", by calling charge and uncharge
APIs for EPC allocation and deallocation, respectively.
- Setup sgx_epc resource type specific callbacks, which perform
initialization and cleanup during cgroup allocation and deallocation,
respectively.
With these changes, the misc cgroup controller enables user to set a hard
limit for EPC usage in the "misc.max" interface file. It reports current
usage in "misc.current", the total EPC memory available in
"misc.capacity", and the number of times EPC usage reached the max limit
in "misc.events".
For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.
Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Use a static for root cgroup (Kai)
- Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
- Correct check for charge API return (Kai)
- Start initialization in SGX device driver init (Kai)
- Remove unneeded BUG_ON (Kai)
- Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
V6:
- Split the original large patch"Limit process EPC usage with misc
cgroup controller" and restructure it (Kai)
---
arch/x86/Kconfig | 13 +++++
arch/x86/kernel/cpu/sgx/Makefile | 1 +
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 79 ++++++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++++++++++++++++++++++++
arch/x86/kernel/cpu/sgx/main.c | 52 +++++++++++++++++-
arch/x86/kernel/cpu/sgx/sgx.h | 5 ++
include/linux/misc_cgroup.h | 2 +
7 files changed, 223 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5edec175b9bf..10c3d1d099b2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1947,6 +1947,19 @@ config X86_SGX
If unsure, say N.
+config CGROUP_SGX_EPC
+ bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
+ depends on X86_SGX && CGROUP_MISC
+ help
+ Provides control over the EPC footprint of tasks in a cgroup via
+ the Miscellaneous cgroup controller.
+
+ EPC is a subset of regular memory that is usable only by SGX
+ enclaves and is very limited in quantity, e.g. less than 1%
+ of total DRAM.
+
+ Say N if unsure.
+
config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 9c1656779b2a..12901a488da7 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -4,3 +4,4 @@ obj-y += \
ioctl.o \
main.o
obj-$(CONFIG_X86_SGX_KVM) += virt.o
+obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
new file mode 100644
index 000000000000..938695816a9e
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright(c) 2022 Intel Corporation.
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include "epc_cgroup.h"
+
+static struct sgx_epc_cgroup epc_cg_root;
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ *
+ * @epc_cg: The EPC cgroup to be charged for the page.
+ * Return:
+ * * %0 - If successfully charged.
+ * * -errno - for failures.
+ */
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+{
+ if (!epc_cg)
+ return -EINVAL;
+
+ return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+}
+
+/**
+ * sgx_epc_cgroup_uncharge() - uncharge a cgroup for an EPC page
+ * @epc_cg: The charged epc cgroup
+ */
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
+{
+ if (!epc_cg)
+ return;
+
+ misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+}
+
+static void sgx_epc_cgroup_free(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
+ if (!epc_cg)
+ return;
+
+ kfree(epc_cg);
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
+
+const struct misc_res_ops sgx_epc_cgroup_ops = {
+ .alloc = sgx_epc_cgroup_alloc,
+ .free = sgx_epc_cgroup_free,
+};
+
+static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
+{
+ cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
+ epc_cg->cg = cg;
+}
+
+static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
+{
+ struct sgx_epc_cgroup *epc_cg;
+
+ epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
+ if (!epc_cg)
+ return -ENOMEM;
+
+ sgx_epc_misc_init(cg, epc_cg);
+
+ return 0;
+}
+
+void sgx_epc_cgroup_init(void)
+{
+ misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
+ sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
+}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
new file mode 100644
index 000000000000..971df34f27d8
--- /dev/null
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2022 Intel Corporation. */
+#ifndef _INTEL_SGX_EPC_CGROUP_H_
+#define _INTEL_SGX_EPC_CGROUP_H_
+
+#include <asm/sgx.h>
+#include <linux/cgroup.h>
+#include <linux/list.h>
+#include <linux/misc_cgroup.h>
+#include <linux/page_counter.h>
+#include <linux/workqueue.h>
+
+#include "sgx.h"
+
+#ifndef CONFIG_CGROUP_SGX_EPC
+#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
+struct sgx_epc_cgroup;
+
+static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
+{
+ return NULL;
+}
+
+static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+{
+ return 0;
+}
+
+static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
+
+static inline void sgx_epc_cgroup_init(void) { }
+#else
+struct sgx_epc_cgroup {
+ struct misc_cg *cg;
+};
+
+static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
+{
+ return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
+}
+
+/**
+ * sgx_get_current_epc_cg() - get the EPC cgroup of current process.
+ *
+ * Returned cgroup has its ref count increased by 1. Caller must call sgx_put_epc_cg()
+ * to return the reference.
+ *
+ * Return: EPC cgroup to which the current task belongs to.
+ */
+static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
+{
+ return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
+}
+
+/**
+ * sgx_put_epc_cg() - Put the EPC cgroup and reduce its ref count.
+ * @epc_cg - EPC cgroup to put.
+ */
+static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
+{
+ if (epc_cg)
+ put_misc_cg(epc_cg->cg);
+}
+
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+void sgx_epc_cgroup_init(void);
+
+#endif
+
+#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 166692f2d501..c32f18b70c73 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -6,6 +6,7 @@
#include <linux/highmem.h>
#include <linux/kthread.h>
#include <linux/miscdevice.h>
+#include <linux/misc_cgroup.h>
#include <linux/node.h>
#include <linux/pagemap.h>
#include <linux/ratelimit.h>
@@ -17,6 +18,7 @@
#include "driver.h"
#include "encl.h"
#include "encls.h"
+#include "epc_cgroup.h"
struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
static int sgx_nr_epc_sections;
@@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
*/
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
+ struct sgx_epc_cgroup *epc_cg;
struct sgx_epc_page *page;
+ int ret;
+
+ epc_cg = sgx_get_current_epc_cg();
+ ret = sgx_epc_cgroup_try_charge(epc_cg);
+ if (ret) {
+ sgx_put_epc_cg(epc_cg);
+ return ERR_PTR(ret);
+ }
for ( ; ; ) {
page = __sgx_alloc_epc_page();
@@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_active_page_list))
- return ERR_PTR(-ENOMEM);
+ if (list_empty(&sgx_active_page_list)) {
+ page = ERR_PTR(-ENOMEM);
+ break;
+ }
if (!reclaim) {
page = ERR_PTR(-EBUSY);
@@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
+ /*
+ * Need to do a global reclamation if cgroup was not full but free
+ * physical pages run out, causing __sgx_alloc_epc_page() to fail.
+ */
sgx_reclaim_pages();
cond_resched();
}
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (!IS_ERR(page)) {
+ WARN_ON_ONCE(page->epc_cg);
+ /* sgx_put_epc_cg() in sgx_free_epc_page() */
+ page->epc_cg = epc_cg;
+ } else {
+ sgx_epc_cgroup_uncharge(epc_cg);
+ sgx_put_epc_cg(epc_cg);
+ }
+#endif
+
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
wake_up(&ksgxd_waitq);
@@ -604,6 +632,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
struct sgx_epc_section *section = &sgx_epc_sections[page->section];
struct sgx_numa_node *node = section->node;
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (page->epc_cg) {
+ sgx_epc_cgroup_uncharge(page->epc_cg);
+ sgx_put_epc_cg(page->epc_cg);
+ page->epc_cg = NULL;
+ }
+#endif
+
spin_lock(&node->lock);
page->owner = NULL;
@@ -643,6 +679,11 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
section->pages[i].flags = 0;
section->pages[i].owner = NULL;
section->pages[i].poison = 0;
+
+#ifdef CONFIG_CGROUP_SGX_EPC
+ section->pages[i].epc_cg = NULL;
+#endif
+
list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list);
}
@@ -787,6 +828,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
static bool __init sgx_page_cache_init(void)
{
u32 eax, ebx, ecx, edx, type;
+ u64 capacity = 0;
u64 pa, size;
int nid;
int i;
@@ -837,6 +879,7 @@ static bool __init sgx_page_cache_init(void)
sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
sgx_numa_nodes[nid].size += size;
+ capacity += size;
sgx_nr_epc_sections++;
}
@@ -846,6 +889,8 @@ static bool __init sgx_page_cache_init(void)
return false;
}
+ misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
+
return true;
}
@@ -942,6 +987,9 @@ static int __init sgx_init(void)
if (sgx_vepc_init() && ret)
goto err_provision;
+ /* Setup cgroup if either the native or vepc driver is active */
+ sgx_epc_cgroup_init();
+
return 0;
err_provision:
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index d2dad21259a8..a898d86dead0 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -29,12 +29,17 @@
/* Pages on free list */
#define SGX_EPC_PAGE_IS_FREE BIT(1)
+struct sgx_epc_cgroup;
+
struct sgx_epc_page {
unsigned int section;
u16 flags;
u16 poison;
struct sgx_encl_page *owner;
struct list_head list;
+#ifdef CONFIG_CGROUP_SGX_EPC
+ struct sgx_epc_cgroup *epc_cg;
+#endif
};
/*
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index 2f6cc3a0ad23..1a16efdfcd3d 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -46,11 +46,13 @@ struct misc_res_ops {
* @max: Maximum limit on the resource.
* @usage: Current usage of the resource.
* @events: Number of times, the resource limit exceeded.
+ * @priv: resource specific data.
*/
struct misc_res {
u64 max;
atomic64_t usage;
atomic64_t events;
+ void *priv;
};
/**
--
2.25.1
On Mon Jan 22, 2024 at 7:20 PM EET, Haitao Huang wrote:
> From: Kristen Carlson Accardi <[email protected]>
>
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> RAM allocations, and are managed solely by the SGX subsystem. The
> existing cgroup memory controller cannot be used to limit or account for
> SGX EPC memory, which is a desirable feature in some environments. For
> example, in a Kubernates environment, a user can request certain EPC
> quota for a pod but the orchestrator can not enforce the quota to limit
> runtime EPC usage of the pod without an EPC cgroup controller.
>
> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> limit and track EPC allocations per cgroup. Earlier patches have added
> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> support in SGX driver as the "sgx_epc" resource provider:
>
> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> - Update EPC usage counter, "current", by calling charge and uncharge
> APIs for EPC allocation and deallocation, respectively.
> - Setup sgx_epc resource type specific callbacks, which perform
> initialization and cleanup during cgroup allocation and deallocation,
> respectively.
>
> With these changes, the misc cgroup controller enables user to set a hard
> limit for EPC usage in the "misc.max" interface file. It reports current
> usage in "misc.current", the total EPC memory available in
> "misc.capacity", and the number of times EPC usage reached the max limit
> in "misc.events".
>
> For now, the EPC cgroup simply blocks additional EPC allocation in
> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> still tracked in the global active list, only reclaimed by the global
> reclaimer when the total free page count is lower than a threshold.
>
> Later patches will reorganize the tracking and reclamation code in the
> global reclaimer and implement per-cgroup tracking and reclaiming.
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> Co-developed-by: Haitao Huang <[email protected]>
> Signed-off-by: Haitao Huang <[email protected]>
For consistency sake I'd also add co-developed-by for Kristen. This is
at least the format suggested by kernel documentation.
> ---
> V7:
> - Use a static for root cgroup (Kai)
> - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
> - Correct check for charge API return (Kai)
> - Start initialization in SGX device driver init (Kai)
> - Remove unneeded BUG_ON (Kai)
> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
>
> V6:
> - Split the original large patch"Limit process EPC usage with misc
> cgroup controller" and restructure it (Kai)
> ---
> arch/x86/Kconfig | 13 +++++
> arch/x86/kernel/cpu/sgx/Makefile | 1 +
> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 79 ++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++++++++++++++++++++++++
> arch/x86/kernel/cpu/sgx/main.c | 52 +++++++++++++++++-
> arch/x86/kernel/cpu/sgx/sgx.h | 5 ++
> include/linux/misc_cgroup.h | 2 +
> 7 files changed, 223 insertions(+), 2 deletions(-)
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 5edec175b9bf..10c3d1d099b2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1947,6 +1947,19 @@ config X86_SGX
>
> If unsure, say N.
>
> +config CGROUP_SGX_EPC
> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC) for Intel SGX"
> + depends on X86_SGX && CGROUP_MISC
> + help
> + Provides control over the EPC footprint of tasks in a cgroup via
> + the Miscellaneous cgroup controller.
> +
> + EPC is a subset of regular memory that is usable only by SGX
> + enclaves and is very limited in quantity, e.g. less than 1%
> + of total DRAM.
> +
> + Say N if unsure.
> +
> config X86_USER_SHADOW_STACK
> bool "X86 userspace shadow stack"
> depends on AS_WRUSS
> diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
> index 9c1656779b2a..12901a488da7 100644
> --- a/arch/x86/kernel/cpu/sgx/Makefile
> +++ b/arch/x86/kernel/cpu/sgx/Makefile
> @@ -4,3 +4,4 @@ obj-y += \
> ioctl.o \
> main.o
> obj-$(CONFIG_X86_SGX_KVM) += virt.o
> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> new file mode 100644
> index 000000000000..938695816a9e
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
> @@ -0,0 +1,79 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright(c) 2022 Intel Corporation.
> +
> +#include <linux/atomic.h>
> +#include <linux/kernel.h>
> +#include "epc_cgroup.h"
> +
> +static struct sgx_epc_cgroup epc_cg_root;
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + *
> + * @epc_cg: The EPC cgroup to be charged for the page.
> + * Return:
> + * * %0 - If successfully charged.
> + * * -errno - for failures.
> + */
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (!epc_cg)
> + return -EINVAL;
Is there legit flow where the function is called with nil?
> +
> + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
~
extra space
> +}
> +
> +/**
> + * sgx_epc_cgroup_uncharge() - uncharge a cgroup for an EPC page
> + * @epc_cg: The charged epc cgroup
> + */
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (!epc_cg)
> + return;
If there was, this function also should have a return value (i.e. return
-EINVAL).
This API does not look good tbh.
Perhaps you want to emit error message in both functions? Now there is
asymmetry that other goes silent and other returns error. I'm neither
not sure why exactly -EINVAL was picked (does not mean the same that
I would ultimately oppose picking that).
> +
> + misc_cg_uncharge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
> +}
> +
> +static void sgx_epc_cgroup_free(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(cg);
> + if (!epc_cg)
> + return;
> +
> + kfree(epc_cg);
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg);
> +
> +const struct misc_res_ops sgx_epc_cgroup_ops = {
> + .alloc = sgx_epc_cgroup_alloc,
> + .free = sgx_epc_cgroup_free,
> +};
> +
> +static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
> +{
> + cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
> + epc_cg->cg = cg;
> +}
> +
> +static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
> +{
> + struct sgx_epc_cgroup *epc_cg;
> +
> + epc_cg = kzalloc(sizeof(*epc_cg), GFP_KERNEL);
> + if (!epc_cg)
> + return -ENOMEM;
> +
> + sgx_epc_misc_init(cg, epc_cg);
> +
> + return 0;
> +}
> +
> +void sgx_epc_cgroup_init(void)
> +{
> + misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
> + sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
> +}
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> new file mode 100644
> index 000000000000..971df34f27d8
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright(c) 2022 Intel Corporation. */
> +#ifndef _INTEL_SGX_EPC_CGROUP_H_
> +#define _INTEL_SGX_EPC_CGROUP_H_
s/_INTEL//
> +
> +#include <asm/sgx.h>
> +#include <linux/cgroup.h>
> +#include <linux/list.h>
> +#include <linux/misc_cgroup.h>
> +#include <linux/page_counter.h>
> +#include <linux/workqueue.h>
> +
> +#include "sgx.h"
> +
> +#ifndef CONFIG_CGROUP_SGX_EPC
> +#define MISC_CG_RES_SGX_EPC MISC_CG_RES_TYPES
> +struct sgx_epc_cgroup;
> +
> +static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> +{
> + return NULL;
> +}
> +
> +static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
> +
> +static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
> +{
> + return 0;
> +}
> +
> +static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
> +
> +static inline void sgx_epc_cgroup_init(void) { }
> +#else
> +struct sgx_epc_cgroup {
> + struct misc_cg *cg;
> +};
> +
> +static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> +{
> + return (struct sgx_epc_cgroup *)(cg->res[MISC_CG_RES_SGX_EPC].priv);
> +}
> +
> +/**
> + * sgx_get_current_epc_cg() - get the EPC cgroup of current process.
> + *
> + * Returned cgroup has its ref count increased by 1. Caller must call sgx_put_epc_cg()
> + * to return the reference.
> + *
> + * Return: EPC cgroup to which the current task belongs to.
> + */
> +static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
> +{
> + return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg());
> +}
> +
> +/**
> + * sgx_put_epc_cg() - Put the EPC cgroup and reduce its ref count.
> + * @epc_cg - EPC cgroup to put.
> + */
> +static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
> +{
> + if (epc_cg)
> + put_misc_cg(epc_cg->cg);
> +}
> +
> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +void sgx_epc_cgroup_init(void);
> +
> +#endif
> +
> +#endif /* _INTEL_SGX_EPC_CGROUP_H_ */
> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> index 166692f2d501..c32f18b70c73 100644
> --- a/arch/x86/kernel/cpu/sgx/main.c
> +++ b/arch/x86/kernel/cpu/sgx/main.c
> @@ -6,6 +6,7 @@
> #include <linux/highmem.h>
> #include <linux/kthread.h>
> #include <linux/miscdevice.h>
> +#include <linux/misc_cgroup.h>
> #include <linux/node.h>
> #include <linux/pagemap.h>
> #include <linux/ratelimit.h>
> @@ -17,6 +18,7 @@
> #include "driver.h"
> #include "encl.h"
> #include "encls.h"
> +#include "epc_cgroup.h"
>
> struct sgx_epc_section sgx_epc_sections[SGX_MAX_EPC_SECTIONS];
> static int sgx_nr_epc_sections;
> @@ -558,7 +560,16 @@ int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
> */
> struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> {
> + struct sgx_epc_cgroup *epc_cg;
> struct sgx_epc_page *page;
> + int ret;
> +
> + epc_cg = sgx_get_current_epc_cg();
> + ret = sgx_epc_cgroup_try_charge(epc_cg);
> + if (ret) {
> + sgx_put_epc_cg(epc_cg);
> + return ERR_PTR(ret);
> + }
>
> for ( ; ; ) {
> page = __sgx_alloc_epc_page();
> @@ -567,8 +578,10 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> - if (list_empty(&sgx_active_page_list))
> - return ERR_PTR(-ENOMEM);
> + if (list_empty(&sgx_active_page_list)) {
> + page = ERR_PTR(-ENOMEM);
> + break;
> + }
>
> if (!reclaim) {
> page = ERR_PTR(-EBUSY);
> @@ -580,10 +593,25 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> break;
> }
>
> + /*
> + * Need to do a global reclamation if cgroup was not full but free
> + * physical pages run out, causing __sgx_alloc_epc_page() to fail.
> + */
> sgx_reclaim_pages();
> cond_resched();
> }
>
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (!IS_ERR(page)) {
> + WARN_ON_ONCE(page->epc_cg);
> + /* sgx_put_epc_cg() in sgx_free_epc_page() */
> + page->epc_cg = epc_cg;
> + } else {
> + sgx_epc_cgroup_uncharge(epc_cg);
> + sgx_put_epc_cg(epc_cg);
> + }
> +#endif
> +
> if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> wake_up(&ksgxd_waitq);
>
> @@ -604,6 +632,14 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> struct sgx_numa_node *node = section->node;
>
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + if (page->epc_cg) {
> + sgx_epc_cgroup_uncharge(page->epc_cg);
> + sgx_put_epc_cg(page->epc_cg);
> + page->epc_cg = NULL;
> + }
> +#endif
> +
> spin_lock(&node->lock);
>
> page->owner = NULL;
> @@ -643,6 +679,11 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> section->pages[i].flags = 0;
> section->pages[i].owner = NULL;
> section->pages[i].poison = 0;
> +
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + section->pages[i].epc_cg = NULL;
> +#endif
> +
> list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list);
> }
>
> @@ -787,6 +828,7 @@ static void __init arch_update_sysfs_visibility(int nid) {}
> static bool __init sgx_page_cache_init(void)
> {
> u32 eax, ebx, ecx, edx, type;
> + u64 capacity = 0;
> u64 pa, size;
> int nid;
> int i;
> @@ -837,6 +879,7 @@ static bool __init sgx_page_cache_init(void)
>
> sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> sgx_numa_nodes[nid].size += size;
> + capacity += size;
>
> sgx_nr_epc_sections++;
> }
> @@ -846,6 +889,8 @@ static bool __init sgx_page_cache_init(void)
> return false;
> }
>
> + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity);
> +
> return true;
> }
>
> @@ -942,6 +987,9 @@ static int __init sgx_init(void)
> if (sgx_vepc_init() && ret)
> goto err_provision;
>
> + /* Setup cgroup if either the native or vepc driver is active */
> + sgx_epc_cgroup_init();
> +
> return 0;
>
> err_provision:
> diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
> index d2dad21259a8..a898d86dead0 100644
> --- a/arch/x86/kernel/cpu/sgx/sgx.h
> +++ b/arch/x86/kernel/cpu/sgx/sgx.h
> @@ -29,12 +29,17 @@
> /* Pages on free list */
> #define SGX_EPC_PAGE_IS_FREE BIT(1)
>
> +struct sgx_epc_cgroup;
> +
> struct sgx_epc_page {
> unsigned int section;
> u16 flags;
> u16 poison;
> struct sgx_encl_page *owner;
> struct list_head list;
> +#ifdef CONFIG_CGROUP_SGX_EPC
> + struct sgx_epc_cgroup *epc_cg;
> +#endif
> };
>
> /*
> diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
> index 2f6cc3a0ad23..1a16efdfcd3d 100644
> --- a/include/linux/misc_cgroup.h
> +++ b/include/linux/misc_cgroup.h
> @@ -46,11 +46,13 @@ struct misc_res_ops {
> * @max: Maximum limit on the resource.
> * @usage: Current usage of the resource.
> * @events: Number of times, the resource limit exceeded.
> + * @priv: resource specific data.
> */
> struct misc_res {
> u64 max;
> atomic64_t usage;
> atomic64_t events;
> + void *priv;
> };
>
> /**
BR, Jarkko
From: Kristen Carlson Accardi <[email protected]>
The misc cgroup controller (subsystem) currently does not perform
resource type specific action for Cgroups Subsystem State (CSS) events:
the 'css_alloc' event when a cgroup is created and the 'css_free' event
when a cgroup is destroyed.
Define callbacks for those events and allow resource providers to
register the callbacks per resource type as needed. This will be
utilized later by the EPC misc cgroup support implemented in the SGX
driver.
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Make ops one per resource type and store them in array (Michal)
- Rename the ops struct to misc_res_ops, and enforce the constraints of required callback
functions (Jarkko)
- Moved addition of priv field to patch 4 where it was used first. (Jarkko)
V6:
- Create ops struct for per resource callbacks (Jarkko)
- Drop max_write callback (Dave, Michal)
- Style fixes (Kai)
---
include/linux/misc_cgroup.h | 11 +++++++
kernel/cgroup/misc.c | 60 +++++++++++++++++++++++++++++++++++--
2 files changed, 68 insertions(+), 3 deletions(-)
diff --git a/include/linux/misc_cgroup.h b/include/linux/misc_cgroup.h
index e799b1f8d05b..0806d4436208 100644
--- a/include/linux/misc_cgroup.h
+++ b/include/linux/misc_cgroup.h
@@ -27,6 +27,16 @@ struct misc_cg;
#include <linux/cgroup.h>
+/**
+ * struct misc_res_ops: per resource type callback ops.
+ * @alloc: invoked for resource specific initialization when cgroup is allocated.
+ * @free: invoked for resource specific cleanup when cgroup is deallocated.
+ */
+struct misc_res_ops {
+ int (*alloc)(struct misc_cg *cg);
+ void (*free)(struct misc_cg *cg);
+};
+
/**
* struct misc_res: Per cgroup per misc type resource
* @max: Maximum limit on the resource.
@@ -56,6 +66,7 @@ struct misc_cg {
u64 misc_cg_res_total_usage(enum misc_res_type type);
int misc_cg_set_capacity(enum misc_res_type type, u64 capacity);
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops);
int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
void misc_cg_uncharge(enum misc_res_type type, struct misc_cg *cg, u64 amount);
diff --git a/kernel/cgroup/misc.c b/kernel/cgroup/misc.c
index 79a3717a5803..b8c32791334c 100644
--- a/kernel/cgroup/misc.c
+++ b/kernel/cgroup/misc.c
@@ -39,6 +39,9 @@ static struct misc_cg root_cg;
*/
static u64 misc_res_capacity[MISC_CG_RES_TYPES];
+/* Resource type specific operations */
+static const struct misc_res_ops *misc_res_ops[MISC_CG_RES_TYPES];
+
/**
* parent_misc() - Get the parent of the passed misc cgroup.
* @cgroup: cgroup whose parent needs to be fetched.
@@ -105,6 +108,36 @@ int misc_cg_set_capacity(enum misc_res_type type, u64 capacity)
}
EXPORT_SYMBOL_GPL(misc_cg_set_capacity);
+/**
+ * misc_cg_set_ops() - set resource specific operations.
+ * @type: Type of the misc res.
+ * @ops: Operations for the given type.
+ *
+ * Context: Any context.
+ * Return:
+ * * %0 - Successfully registered the operations.
+ * * %-EINVAL - If @type is invalid, or the operations missing any required callbacks.
+ */
+int misc_cg_set_ops(enum misc_res_type type, const struct misc_res_ops *ops)
+{
+ if (!valid_type(type))
+ return -EINVAL;
+
+ if (!ops->alloc) {
+ pr_err("%s: alloc missing\n", __func__);
+ return -EINVAL;
+ }
+
+ if (!ops->free) {
+ pr_err("%s: free missing\n", __func__);
+ return -EINVAL;
+ }
+
+ misc_res_ops[type] = ops;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(misc_cg_set_ops);
+
/**
* misc_cg_cancel_charge() - Cancel the charge from the misc cgroup.
* @type: Misc res type in misc cg to cancel the charge from.
@@ -383,23 +416,37 @@ static struct cftype misc_cg_files[] = {
static struct cgroup_subsys_state *
misc_cg_alloc(struct cgroup_subsys_state *parent_css)
{
+ struct misc_cg *parent_cg, *cg;
enum misc_res_type i;
- struct misc_cg *cg;
+ int ret;
if (!parent_css) {
- cg = &root_cg;
+ parent_cg = cg = &root_cg;
} else {
cg = kzalloc(sizeof(*cg), GFP_KERNEL);
if (!cg)
return ERR_PTR(-ENOMEM);
+ parent_cg = css_misc(parent_css);
}
for (i = 0; i < MISC_CG_RES_TYPES; i++) {
WRITE_ONCE(cg->res[i].max, MAX_NUM);
atomic64_set(&cg->res[i].usage, 0);
+ if (misc_res_ops[i]) {
+ ret = misc_res_ops[i]->alloc(cg);
+ if (ret)
+ goto alloc_err;
+ }
}
return &cg->css;
+
+alloc_err:
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+ kfree(cg);
+ return ERR_PTR(ret);
}
/**
@@ -410,7 +457,14 @@ misc_cg_alloc(struct cgroup_subsys_state *parent_css)
*/
static void misc_cg_free(struct cgroup_subsys_state *css)
{
- kfree(css_misc(css));
+ struct misc_cg *cg = css_misc(css);
+ enum misc_res_type i;
+
+ for (i = 0; i < MISC_CG_RES_TYPES; i++)
+ if (misc_res_ops[i])
+ misc_res_ops[i]->free(cg);
+
+ kfree(cg);
}
/* Cgroup controller callbacks */
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
When the EPC usage of a cgroup is near its limit, the cgroup needs to
reclaim pages used in the same cgroup to make room for new allocations.
This is analogous to the behavior that the global reclaimer is triggered
when the global usage is close to total available EPC.
Add a Boolean parameter for sgx_epc_cgroup_try_charge() to indicate
whether synchronous reclaim is allowed or not. And trigger the
synchronous/asynchronous reclamation flow accordingly.
Note at this point, all reclaimable EPC pages are still tracked in the
global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
is activated yet.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 26 ++++++++++++++++++++++++--
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 4 ++--
arch/x86/kernel/cpu/sgx/main.c | 2 +-
3 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 44265f62b2a4..c28ed12ff864 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -176,16 +176,38 @@ static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
/**
* sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @epc_cg: The EPC cgroup to be charged for the page.
+ * @reclaim: Whether or not synchronous reclaim is allowed
* Return:
* * %0 - If successfully charged.
* * -errno - for failures.
*/
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
{
if (!epc_cg)
return -EINVAL;
- return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, PAGE_SIZE);
+ for (;;) {
+ if (!misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
+ PAGE_SIZE))
+ break;
+
+ if (sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ return -ENOMEM;
+
+ if (signal_pending(current))
+ return -ERESTARTSYS;
+
+ if (!reclaim) {
+ queue_work(sgx_epc_cg_wq, &epc_cg->reclaim_work);
+ return -EBUSY;
+ }
+
+ if (!sgx_epc_cgroup_reclaim_pages(epc_cg->cg, false))
+ /* All pages were too young to reclaim, try again a little later */
+ schedule();
+ }
+
+ return 0;
}
/**
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 9b77b51a2839..6e156de5f7ff 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -23,7 +23,7 @@ static inline struct sgx_epc_cgroup *sgx_get_current_epc_cg(void)
static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg) { }
-static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
+static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim)
{
return 0;
}
@@ -66,7 +66,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
put_misc_cg(epc_cg->cg);
}
-int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
+int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
void sgx_epc_cgroup_init(void);
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 14314f25880d..b43d51eff5ef 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -586,7 +586,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
int ret;
epc_cg = sgx_get_current_epc_cg();
- ret = sgx_epc_cgroup_try_charge(epc_cg);
+ ret = sgx_epc_cgroup_try_charge(epc_cg, reclaim);
if (ret) {
sgx_put_epc_cg(epc_cg);
return ERR_PTR(ret);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
To determine if any page available for reclamation at the global level,
only checking for emptiness of the global LRU is not adequate when pages
are tracked in multiple LRUs, one per cgroup. For this purpose, create a
new helper, sgx_can_reclaim(), currently only checks the global LRU,
later will check emptiness of LRUs of all cgroups when per-cgroup
tracking is turned on. Replace all the checks of the global LRU,
list_empty(&sgx_global_lru.reclaimable), with calls to
sgx_can_reclaim().
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
v7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index b43d51eff5ef..7b13bcf3e75d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -37,6 +37,11 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag
return &sgx_global_lru;
}
+static inline bool sgx_can_reclaim(void)
+{
+ return !list_empty(&sgx_global_lru.reclaimable);
+}
+
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
/* Nodes with one or more EPC sections. */
@@ -395,7 +400,7 @@ unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_global_lru.reclaimable);
+ sgx_can_reclaim();
}
static void sgx_reclaim_pages_global(bool indirect)
@@ -599,7 +604,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_global_lru.reclaimable)) {
+ if (!sgx_can_reclaim()) {
page = ERR_PTR(-ENOMEM);
break;
}
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Previous patches have implemented all infrastructure needed for
per-cgroup EPC page tracking and reclaiming. But all reclaimable EPC
pages are still tracked in the global LRU as sgx_lru_list() returns hard
coded reference to the global LRU.
Change sgx_lru_list() to return the LRU of the cgroup in which the given
EPC page is allocated.
This makes all EPC pages tracked in per-cgroup LRUs and the global
reclaimer (ksgxd) will not be able to reclaim any pages from the global
LRU. However, in cases of over-committing, i.e., sum of cgroup limits
greater than the total capacity, cgroups may never reclaim but the total
usage can still be near the capacity. Therefore global reclamation is
still needed in those cases and it should reclaim from the root cgroup.
Modify sgx_reclaim_pages_global(), to reclaim from the root EPC cgroup
when cgroup is enabled, otherwise from the global LRU.
Similarly, modify sgx_can_reclaim(), to check emptiness of LRUs of all
cgroups when EPC cgroup is enabled, otherwise only check the global LRU.
With these changes, the global reclamation and per-cgroup reclamation
both work properly with all pages tracked in per-cgroup LRUs.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/main.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 7b13bcf3e75d..15bf2ca5e67a 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -34,12 +34,23 @@ static struct sgx_epc_lru_list sgx_global_lru;
static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ if (epc_page->epc_cg)
+ return &epc_page->epc_cg->lru;
+
+ /* This should not happen if kernel is configured correctly */
+ WARN_ON_ONCE(1);
+#endif
return &sgx_global_lru;
}
static inline bool sgx_can_reclaim(void)
{
+#ifdef CONFIG_CGROUP_SGX_EPC
+ return !sgx_epc_cgroup_lru_empty(misc_cg_root());
+#else
return !list_empty(&sgx_global_lru.reclaimable);
+#endif
}
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -407,7 +418,10 @@ static void sgx_reclaim_pages_global(bool indirect)
{
unsigned int nr_to_scan = SGX_NR_TO_SCAN;
- sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
+ if (IS_ENABLED(CONFIG_CGROUP_SGX_EPC))
+ sgx_epc_cgroup_reclaim_pages(misc_cg_root(), indirect);
+ else
+ sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan, indirect);
}
/*
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
Implement the reclamation flow for cgroup, encapsulated in the top-level
function sgx_epc_cgroup_reclaim_pages(). It does a pre-order walk on its
subtree, and make calls to sgx_reclaim_pages() at each node passing in
the LRU of that node. It keeps track of total reclaimed pages, and pages
left to attempt. It stops the walk if desired number of pages are
attempted.
In some contexts, e.g. page fault handling, only asynchronous
reclamation is allowed. Create a work-queue, corresponding work item and
function definitions to support the asynchronous reclamation. Both
synchronous and asynchronous flows invoke the same top level reclaim
function, and will be triggered later by sgx_epc_cgroup_try_charge()
when usage of the cgroup is at or near its limit.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 174 ++++++++++++++++++++++++++-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 5 +-
2 files changed, 177 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index 938695816a9e..71570c346d95 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -7,9 +7,173 @@
static struct sgx_epc_cgroup epc_cg_root;
+static struct workqueue_struct *sgx_epc_cg_wq;
+
+static inline u64 sgx_epc_cgroup_page_counter_read(struct sgx_epc_cgroup *epc_cg)
+{
+ return atomic64_read(&epc_cg->cg->res[MISC_CG_RES_SGX_EPC].usage) / PAGE_SIZE;
+}
+
+static inline u64 sgx_epc_cgroup_max_pages(struct sgx_epc_cgroup *epc_cg)
+{
+ return READ_ONCE(epc_cg->cg->res[MISC_CG_RES_SGX_EPC].max) / PAGE_SIZE;
+}
+
+/*
+ * Get the lower bound of limits of a cgroup and its ancestors. Used in
+ * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is over its limit
+ * or its ancestors' hence reclamation is needed.
+ */
+static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
+{
+ struct misc_cg *i = epc_cg->cg;
+ u64 m = U64_MAX;
+
+ while (i) {
+ m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
+ i = misc_cg_parent(i);
+ }
+
+ return m / PAGE_SIZE;
+}
+
/**
- * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
+ * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
+ * @root: Root of the tree to check
*
+ * Return: %true if all cgroups under the specified root have empty LRU lists.
+ * Used to avoid livelocks due to a cgroup having a non-zero charge count but
+ * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
+ * because all pages in the cgroup are unreclaimable.
+ */
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
+{
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+ bool ret = true;
+
+ /*
+ * Caller ensure css_root ref acquired
+ */
+ css_root = &root->css;
+
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+
+ spin_lock(&epc_cg->lru.lock);
+ ret = list_empty(&epc_cg->lru.reclaimable);
+ spin_unlock(&epc_cg->lru.lock);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!ret)
+ break;
+ }
+
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
+ * @root: Root of the tree to start walking
+ * Return: Number of pages reclaimed.
+ */
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
+{
+ /*
+ * Attempting to reclaim only a few pages will often fail and is inefficient, while
+ * reclaiming a huge number of pages can result in soft lockups due to holding various
+ * locks for an extended duration.
+ */
+ unsigned int nr_to_scan = SGX_NR_TO_SCAN;
+ struct cgroup_subsys_state *css_root;
+ struct cgroup_subsys_state *pos;
+ struct sgx_epc_cgroup *epc_cg;
+ unsigned int cnt;
+
+ /* Caller ensure css_root ref acquired */
+ css_root = &root->css;
+
+ cnt = 0;
+ rcu_read_lock();
+ css_for_each_descendant_pre(pos, css_root) {
+ if (!css_tryget(pos))
+ break;
+ rcu_read_unlock();
+
+ epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
+ cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
+
+ rcu_read_lock();
+ css_put(pos);
+ if (!nr_to_scan)
+ break;
+ }
+
+ rcu_read_unlock();
+ return cnt;
+}
+
+/*
+ * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup when the cgroup is
+ * at/near its maximum capacity
+ */
+static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
+{
+ struct sgx_epc_cgroup *epc_cg;
+ u64 cur, max;
+
+ epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
+
+ for (;;) {
+ max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
+
+ /*
+ * Adjust the limit down by one page, the goal is to free up
+ * pages for fault allocations, not to simply obey the limit.
+ * Conditionally decrementing max also means the cur vs. max
+ * check will correctly handle the case where both are zero.
+ */
+ if (max)
+ max--;
+
+ /*
+ * Unless the limit is extremely low, in which case forcing
+ * reclaim will likely cause thrashing, force the cgroup to
+ * reclaim at least once if it's operating *near* its maximum
+ * limit by adjusting @max down by half the min reclaim size.
+ * This work func is scheduled by sgx_epc_cgroup_try_charge
+ * when it cannot directly reclaim due to being in an atomic
+ * context, e.g. EPC allocation in a fault handler. Waiting
+ * to reclaim until the cgroup is actually at its limit is less
+ * performant as it means the faulting task is effectively
+ * blocked until a worker makes its way through the global work
+ * queue.
+ */
+ if (max > SGX_NR_TO_SCAN * 2)
+ max -= (SGX_NR_TO_SCAN / 2);
+
+ cur = sgx_epc_cgroup_page_counter_read(epc_cg);
+
+ if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
+ break;
+
+ /* Keep reclaiming until above condition is met. */
+ sgx_epc_cgroup_reclaim_pages(epc_cg->cg);
+ }
+}
+
+/**
+ * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
* @epc_cg: The EPC cgroup to be charged for the page.
* Return:
* * %0 - If successfully charged.
@@ -43,6 +207,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
if (!epc_cg)
return;
+ cancel_work_sync(&epc_cg->reclaim_work);
kfree(epc_cg);
}
@@ -55,6 +220,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
{
+ sgx_lru_init(&epc_cg->lru);
+ INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
epc_cg->cg = cg;
}
@@ -74,6 +241,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
void sgx_epc_cgroup_init(void)
{
+ sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
+ WQ_UNBOUND | WQ_FREEZABLE,
+ WQ_UNBOUND_MAX_ACTIVE);
+ BUG_ON(!sgx_epc_cg_wq);
+
misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
}
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 971df34f27d8..9b77b51a2839 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -33,7 +33,9 @@ static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
static inline void sgx_epc_cgroup_init(void) { }
#else
struct sgx_epc_cgroup {
- struct misc_cg *cg;
+ struct misc_cg *cg;
+ struct sgx_epc_lru_list lru;
+ struct work_struct reclaim_work;
};
static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
@@ -66,6 +68,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
+bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
void sgx_epc_cgroup_init(void);
#endif
--
2.25.1
From: Sean Christopherson <[email protected]>
Introduce a data structure to wrap the existing reclaimable list and its
spinlock. Each cgroup later will have one instance of this structure to
track EPC pages allocated for processes associated with the same cgroup.
Just like the global SGX reclaimer (ksgxd), an EPC cgroup reclaims pages
from the reclaimable list in this structure when its usage reaches near
its limit.
Use this structure to encapsulate the LRU list and its lock used by the
global reclaimer.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
Cc: Sean Christopherson <[email protected]>
---
V6:
- removed introduction to unreclaimables in commit message.
V4:
- Removed unneeded comments for the spinlock and the non-reclaimables.
(Kai, Jarkko)
- Revised the commit to add introduction comments for unreclaimables and
multiple LRU lists.(Kai)
- Reordered the patches: delay all changes for unreclaimables to
later, and this one becomes the first change in the SGX subsystem.
V3:
- Removed the helper functions and revised commit messages.
---
arch/x86/kernel/cpu/sgx/main.c | 39 +++++++++++++++++-----------------
arch/x86/kernel/cpu/sgx/sgx.h | 15 +++++++++++++
2 files changed, 35 insertions(+), 19 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index c32f18b70c73..912959c7ecc9 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -28,10 +28,9 @@ static DEFINE_XARRAY(sgx_epc_address_space);
/*
* These variables are part of the state of the reclaimer, and must be accessed
- * with sgx_reclaimer_lock acquired.
+ * with sgx_global_lru.lock acquired.
*/
-static LIST_HEAD(sgx_active_page_list);
-static DEFINE_SPINLOCK(sgx_reclaimer_lock);
+static struct sgx_epc_lru_list sgx_global_lru;
static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -306,13 +305,13 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- if (list_empty(&sgx_active_page_list))
+ epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
+ struct sgx_epc_page, list);
+ if (!epc_page)
break;
- epc_page = list_first_entry(&sgx_active_page_list,
- struct sgx_epc_page, list);
list_del_init(&epc_page->list);
encl_page = epc_page->owner;
@@ -324,7 +323,7 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -347,9 +346,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_reclaimer_lock);
- list_add_tail(&epc_page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
+ list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -380,7 +379,7 @@ static void sgx_reclaim_pages(void)
static bool sgx_should_reclaim(unsigned long watermark)
{
return atomic_long_read(&sgx_nr_free_pages) < watermark &&
- !list_empty(&sgx_active_page_list);
+ !list_empty(&sgx_global_lru.reclaimable);
}
/*
@@ -432,6 +431,8 @@ static bool __init sgx_page_reclaimer_init(void)
ksgxd_tsk = tsk;
+ sgx_lru_init(&sgx_global_lru);
+
return true;
}
@@ -507,10 +508,10 @@ struct sgx_epc_page *__sgx_alloc_epc_page(void)
*/
void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
page->flags |= SGX_EPC_PAGE_RECLAIMER_TRACKED;
- list_add_tail(&page->list, &sgx_active_page_list);
- spin_unlock(&sgx_reclaimer_lock);
+ list_add_tail(&page->list, &sgx_global_lru.reclaimable);
+ spin_unlock(&sgx_global_lru.lock);
}
/**
@@ -525,18 +526,18 @@ void sgx_mark_page_reclaimable(struct sgx_epc_page *page)
*/
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page)
{
- spin_lock(&sgx_reclaimer_lock);
+ spin_lock(&sgx_global_lru.lock);
if (page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED) {
/* The page is being reclaimed. */
if (list_empty(&page->list)) {
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return -EBUSY;
}
list_del(&page->list);
page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_reclaimer_lock);
+ spin_unlock(&sgx_global_lru.lock);
return 0;
}
@@ -578,7 +579,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
break;
}
- if (list_empty(&sgx_active_page_list)) {
+ if (list_empty(&sgx_global_lru.reclaimable)) {
page = ERR_PTR(-ENOMEM);
break;
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index a898d86dead0..0e99e9ae3a67 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -88,6 +88,21 @@ static inline void *sgx_get_epc_virt_addr(struct sgx_epc_page *page)
return section->virt_addr + index * PAGE_SIZE;
}
+/*
+ * Contains EPC pages tracked by the global reclaimer (ksgxd) or an EPC
+ * cgroup.
+ */
+struct sgx_epc_lru_list {
+ spinlock_t lock;
+ struct list_head reclaimable;
+};
+
+static inline void sgx_lru_init(struct sgx_epc_lru_list *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->reclaimable);
+}
+
struct sgx_epc_page *__sgx_alloc_epc_page(void);
void sgx_free_epc_page(struct sgx_epc_page *page);
--
2.25.1
From: Kristen Carlson Accardi <[email protected]>
When cgroup is enabled, all reclaimable pages will be tracked in cgroup
LRUs. The global reclaimer needs to start reclamation from the root
cgroup. Expose the top level cgroup reclamation function so the global
reclaimer can reuse it.
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Split this out from the big patch, #10 in V6. (Dave, Kai)
---
arch/x86/kernel/cpu/sgx/epc_cgroup.c | 2 +-
arch/x86/kernel/cpu/sgx/epc_cgroup.h | 9 +++++++++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
index c28ed12ff864..fdf1417d9ade 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
@@ -88,7 +88,7 @@ bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
* @indirect: In ksgxd or EPC cgroup work queue context.
* Return: Number of pages reclaimed.
*/
-static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root, bool indirect)
{
/*
* Attempting to reclaim only a few pages will often fail and is inefficient, while
diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
index 6e156de5f7ff..05a4de9f7024 100644
--- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
+++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
@@ -31,6 +31,12 @@ static inline int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool
static inline void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg) { }
static inline void sgx_epc_cgroup_init(void) { }
+
+static inline unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root,
+ bool indirect)
+{
+ return 0;
+}
#else
struct sgx_epc_cgroup {
struct misc_cg *cg;
@@ -69,6 +75,9 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg, bool reclaim);
void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);
+unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root,
+ bool indirect);
+
void sgx_epc_cgroup_init(void);
#endif
--
2.25.1
From: Sean Christopherson <[email protected]>
Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
from its LRU or LRUs of its descendants to make room for any new
allocations.
To prepare for reclamation per cgroup, expose the top level reclamation
function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
to the function to pass in an LRU so cgroups can pass in different
tracking LRUs later. Add another parameter for passing in the number of
pages to scan and make the function return the number of pages reclaimed
as a cgroup reclaimer may need to track reclamation progress from its
descendants, change number of pages to scan in subsequent calls.
Create a wrapper for the global reclaimer, sgx_reclaim_pages_global(),
to just call this function with the global LRU passed in. When
per-cgroup LRU is added later, the wrapper will perform global
reclamation from the root cgroup.
Signed-off-by: Sean Christopherson <[email protected]>
Co-developed-by: Kristen Carlson Accardi <[email protected]>
Signed-off-by: Kristen Carlson Accardi <[email protected]>
Co-developed-by: Haitao Huang <[email protected]>
Signed-off-by: Haitao Huang <[email protected]>
---
V7:
- Reworked from patch 9 of V6, "x86/sgx: Restructure top-level EPC reclaim
function". Do not split the top level function (Kai)
- Dropped patches 7 and 8 of V6.
---
arch/x86/kernel/cpu/sgx/main.c | 62 +++++++++++++++++++++-------------
arch/x86/kernel/cpu/sgx/sgx.h | 1 +
2 files changed, 40 insertions(+), 23 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index cde750688e62..60cb3a7b3001 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -286,20 +286,24 @@ static void sgx_reclaimer_write(struct sgx_epc_page *epc_page,
mutex_unlock(&encl->lock);
}
-/*
- * Take a fixed number of pages from the head of the active page pool and
- * reclaim them to the enclave's private shmem files. Skip the pages, which have
- * been accessed since the last scan. Move those pages to the tail of active
- * page pool so that the pages get scanned in LRU like fashion.
+/**
+ * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
+ *
+ * Take a fixed number of pages from the head of a given LRU and reclaim them to the enclave's
+ * private shmem files. Skip the pages, which have been accessed since the last scan. Move
+ * those pages to the tail of the list so that the pages get scanned in LRU like fashion.
+ *
+ * Batch process a chunk of pages (at the moment 16) in order to degrade amount of IPI's and
+ * ETRACK's potentially required. sgx_encl_ewb() does degrade a bit among the HW threads with
+ * three stage EWB pipeline (EWB, ETRACK + EWB and IPI + EWB) but not sufficiently. Reclaiming
+ * one page at a time would also be problematic as it would increase the lock contention too
+ * much, which would halt forward progress.
*
- * Batch process a chunk of pages (at the moment 16) in order to degrade amount
- * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a bit
- * among the HW threads with three stage EWB pipeline (EWB, ETRACK + EWB and IPI
- * + EWB) but not sufficiently. Reclaiming one page at a time would also be
- * problematic as it would increase the lock contention too much, which would
- * halt forward progress.
+ * @lru: The LRU from which pages are reclaimed.
+ * @nr_to_scan: Pointer to the target number of pages to scan, must be less than SGX_NR_TO_SCAN.
+ * Return: Number of pages reclaimed.
*/
-static void sgx_reclaim_pages(void)
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan)
{
struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
struct sgx_backing backing[SGX_NR_TO_SCAN];
@@ -310,10 +314,10 @@ static void sgx_reclaim_pages(void)
int ret;
int i;
- spin_lock(&sgx_global_lru.lock);
- for (i = 0; i < SGX_NR_TO_SCAN; i++) {
- epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
- struct sgx_epc_page, list);
+ spin_lock(&lru->lock);
+
+ for (; *nr_to_scan > 0; --(*nr_to_scan)) {
+ epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
if (!epc_page)
break;
@@ -328,7 +332,8 @@ static void sgx_reclaim_pages(void)
*/
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
}
- spin_unlock(&sgx_global_lru.lock);
+
+ spin_unlock(&lru->lock);
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
@@ -351,9 +356,9 @@ static void sgx_reclaim_pages(void)
continue;
skip:
- spin_lock(&sgx_global_lru.lock);
- list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
- spin_unlock(&sgx_global_lru.lock);
+ spin_lock(&lru->lock);
+ list_add_tail(&epc_page->list, &lru->reclaimable);
+ spin_unlock(&lru->lock);
kref_put(&encl_page->encl->refcount, sgx_encl_release);
@@ -366,6 +371,7 @@ static void sgx_reclaim_pages(void)
sgx_reclaimer_block(epc_page);
}
+ ret = 0;
for (i = 0; i < cnt; i++) {
epc_page = chunk[i];
if (!epc_page)
@@ -378,7 +384,10 @@ static void sgx_reclaim_pages(void)
epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
sgx_free_epc_page(epc_page);
+ ret++;
}
+
+ return (unsigned int)ret;
}
static bool sgx_should_reclaim(unsigned long watermark)
@@ -387,6 +396,13 @@ static bool sgx_should_reclaim(unsigned long watermark)
!list_empty(&sgx_global_lru.reclaimable);
}
+static void sgx_reclaim_pages_global(void)
+{
+ unsigned int nr_to_scan = SGX_NR_TO_SCAN;
+
+ sgx_reclaim_pages(&sgx_global_lru, &nr_to_scan);
+}
+
/*
* sgx_reclaim_direct() should be called (without enclave's mutex held)
* in locations where SGX memory resources might be low and might be
@@ -395,7 +411,7 @@ static bool sgx_should_reclaim(unsigned long watermark)
void sgx_reclaim_direct(void)
{
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
}
static int ksgxd(void *p)
@@ -418,7 +434,7 @@ static int ksgxd(void *p)
sgx_should_reclaim(SGX_NR_HIGH_PAGES));
if (sgx_should_reclaim(SGX_NR_HIGH_PAGES))
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
@@ -605,7 +621,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
* Need to do a global reclamation if cgroup was not full but free
* physical pages run out, causing __sgx_alloc_epc_page() to fail.
*/
- sgx_reclaim_pages();
+ sgx_reclaim_pages_global();
cond_resched();
}
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 0e99e9ae3a67..2593c013d091 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/kernel/cpu/sgx/sgx.h
@@ -110,6 +110,7 @@ void sgx_reclaim_direct(void);
void sgx_mark_page_reclaimable(struct sgx_epc_page *page);
int sgx_unmark_page_reclaimable(struct sgx_epc_page *page);
struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim);
+unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int *nr_to_scan);
void sgx_ipi_cb(void *info);
--
2.25.1
On Mon, 22 Jan 2024 14:25:53 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Jan 22, 2024 at 7:20 PM EET, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
>> RAM allocations, and are managed solely by the SGX subsystem. The
>> existing cgroup memory controller cannot be used to limit or account for
>> SGX EPC memory, which is a desirable feature in some environments. For
>> example, in a Kubernates environment, a user can request certain EPC
>> quota for a pod but the orchestrator can not enforce the quota to limit
>> runtime EPC usage of the pod without an EPC cgroup controller.
>>
>> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
>> limit and track EPC allocations per cgroup. Earlier patches have added
>> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
>> support in SGX driver as the "sgx_epc" resource provider:
>>
>> - Set "capacity" of EPC by calling misc_cg_set_capacity()
>> - Update EPC usage counter, "current", by calling charge and uncharge
>> APIs for EPC allocation and deallocation, respectively.
>> - Setup sgx_epc resource type specific callbacks, which perform
>> initialization and cleanup during cgroup allocation and deallocation,
>> respectively.
>>
>> With these changes, the misc cgroup controller enables user to set a
>> hard
>> limit for EPC usage in the "misc.max" interface file. It reports current
>> usage in "misc.current", the total EPC memory available in
>> "misc.capacity", and the number of times EPC usage reached the max limit
>> in "misc.events".
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>
> For consistency sake I'd also add co-developed-by for Kristen. This is
> at least the format suggested by kernel documentation.
>
>> ---
>> V7:
>> - Use a static for root cgroup (Kai)
>> - Wrap epc_cg field in sgx_epc_page struct with #ifdef (Kai)
>> - Correct check for charge API return (Kai)
>> - Start initialization in SGX device driver init (Kai)
>> - Remove unneeded BUG_ON (Kai)
>> - Split sgx_get_current_epc_cg() out of sgx_epc_cg_try_charge() (Kai)
>>
>> V6:
>> - Split the original large patch"Limit process EPC usage with misc
>> cgroup controller" and restructure it (Kai)
>> ---
>> arch/x86/Kconfig | 13 +++++
>> arch/x86/kernel/cpu/sgx/Makefile | 1 +
>> arch/x86/kernel/cpu/sgx/epc_cgroup.c | 79 ++++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/sgx/epc_cgroup.h | 73 +++++++++++++++++++++++++
>> arch/x86/kernel/cpu/sgx/main.c | 52 +++++++++++++++++-
>> arch/x86/kernel/cpu/sgx/sgx.h | 5 ++
>> include/linux/misc_cgroup.h | 2 +
>> 7 files changed, 223 insertions(+), 2 deletions(-)
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index 5edec175b9bf..10c3d1d099b2 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1947,6 +1947,19 @@ config X86_SGX
>>
>> If unsure, say N.
>>
>> +config CGROUP_SGX_EPC
>> + bool "Miscellaneous Cgroup Controller for Enclave Page Cache (EPC)
>> for Intel SGX"
>> + depends on X86_SGX && CGROUP_MISC
>> + help
>> + Provides control over the EPC footprint of tasks in a cgroup via
>> + the Miscellaneous cgroup controller.
>> +
>> + EPC is a subset of regular memory that is usable only by SGX
>> + enclaves and is very limited in quantity, e.g. less than 1%
>> + of total DRAM.
>> +
>> + Say N if unsure.
>> +
>> config X86_USER_SHADOW_STACK
>> bool "X86 userspace shadow stack"
>> depends on AS_WRUSS
>> diff --git a/arch/x86/kernel/cpu/sgx/Makefile
>> b/arch/x86/kernel/cpu/sgx/Makefile
>> index 9c1656779b2a..12901a488da7 100644
>> --- a/arch/x86/kernel/cpu/sgx/Makefile
>> +++ b/arch/x86/kernel/cpu/sgx/Makefile
>> @@ -4,3 +4,4 @@ obj-y += \
>> ioctl.o \
>> main.o
>> obj-$(CONFIG_X86_SGX_KVM) += virt.o
>> +obj-$(CONFIG_CGROUP_SGX_EPC) += epc_cgroup.o
>> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> new file mode 100644
>> index 000000000000..938695816a9e
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.c
>> @@ -0,0 +1,79 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright(c) 2022 Intel Corporation.
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/kernel.h>
>> +#include "epc_cgroup.h"
>> +
>> +static struct sgx_epc_cgroup epc_cg_root;
>> +
>> +/**
>> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC
>> page
>> + *
>> + * @epc_cg: The EPC cgroup to be charged for the page.
>> + * Return:
>> + * * %0 - If successfully charged.
>> + * * -errno - for failures.
>> + */
>> +int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg)
>> +{
>> + if (!epc_cg)
>> + return -EINVAL;
>
> Is there legit flow where the function is called with nil?
>
>> +
>> + return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg,
>> PAGE_SIZE);
> ~
> extra space
>
>> +}
>> +
>> +/**
>> + * sgx_epc_cgroup_uncharge() - uncharge a cgroup for an EPC page
>> + * @epc_cg: The charged epc cgroup
>> + */
>> +void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg)
>> +{
>> + if (!epc_cg)
>> + return;
>
> If there was, this function also should have a return value (i.e. return
> -EINVAL).
>
> This API does not look good tbh.
>
> Perhaps you want to emit error message in both functions? Now there is
> asymmetry that other goes silent and other returns error. I'm neither
> not sure why exactly -EINVAL was picked (does not mean the same that
> I would ultimately oppose picking that).
>
>
Good points.
I'll remove the NULL check for both cases. They should not happen if
kernel is configured with EPC cgroup. There are separate versions when EPC
cgroup disabled.
Thanks for your quick response.
Haitao
On Mon, 22 Jan 2024 14:25:53 -0600, Jarkko Sakkinen <[email protected]>
wrote:
> On Mon Jan 22, 2024 at 7:20 PM EET, Haitao Huang wrote:
>> From: Kristen Carlson Accardi <[email protected]>
>>
>> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
>> RAM allocations, and are managed solely by the SGX subsystem. The
>> existing cgroup memory controller cannot be used to limit or account for
>> SGX EPC memory, which is a desirable feature in some environments. For
>> example, in a Kubernates environment, a user can request certain EPC
>> quota for a pod but the orchestrator can not enforce the quota to limit
>> runtime EPC usage of the pod without an EPC cgroup controller.
>>
>> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
>> limit and track EPC allocations per cgroup. Earlier patches have added
>> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
>> support in SGX driver as the "sgx_epc" resource provider:
>>
>> - Set "capacity" of EPC by calling misc_cg_set_capacity()
>> - Update EPC usage counter, "current", by calling charge and uncharge
>> APIs for EPC allocation and deallocation, respectively.
>> - Setup sgx_epc resource type specific callbacks, which perform
>> initialization and cleanup during cgroup allocation and deallocation,
>> respectively.
>>
>> With these changes, the misc cgroup controller enables user to set a
>> hard
>> limit for EPC usage in the "misc.max" interface file. It reports current
>> usage in "misc.current", the total EPC memory available in
>> "misc.capacity", and the number of times EPC usage reached the max limit
>> in "misc.events".
>>
>> For now, the EPC cgroup simply blocks additional EPC allocation in
>> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
>> still tracked in the global active list, only reclaimed by the global
>> reclaimer when the total free page count is lower than a threshold.
>>
>> Later patches will reorganize the tracking and reclamation code in the
>> global reclaimer and implement per-cgroup tracking and reclaiming.
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Kristen Carlson Accardi <[email protected]>
>> Co-developed-by: Haitao Huang <[email protected]>
>> Signed-off-by: Haitao Huang <[email protected]>
>
> For consistency sake I'd also add co-developed-by for Kristen. This is
> at least the format suggested by kernel documentation.
>
She is the "From Author", so only Signed-off-by is needed for her
according to the second example in the doc[1]?
Thanks
Haitao
[1]https://docs.kernel.org/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by
On Wed Jan 24, 2024 at 5:29 AM EET, Haitao Huang wrote:
> On Mon, 22 Jan 2024 14:25:53 -0600, Jarkko Sakkinen <[email protected]>
> wrote:
>
> > On Mon Jan 22, 2024 at 7:20 PM EET, Haitao Huang wrote:
> >> From: Kristen Carlson Accardi <[email protected]>
> >>
> >> SGX Enclave Page Cache (EPC) memory allocations are separate from normal
> >> RAM allocations, and are managed solely by the SGX subsystem. The
> >> existing cgroup memory controller cannot be used to limit or account for
> >> SGX EPC memory, which is a desirable feature in some environments. For
> >> example, in a Kubernates environment, a user can request certain EPC
> >> quota for a pod but the orchestrator can not enforce the quota to limit
> >> runtime EPC usage of the pod without an EPC cgroup controller.
> >>
> >> Utilize the misc controller [admin-guide/cgroup-v2.rst, 5-9. Misc] to
> >> limit and track EPC allocations per cgroup. Earlier patches have added
> >> the "sgx_epc" resource type in the misc cgroup subsystem. Add basic
> >> support in SGX driver as the "sgx_epc" resource provider:
> >>
> >> - Set "capacity" of EPC by calling misc_cg_set_capacity()
> >> - Update EPC usage counter, "current", by calling charge and uncharge
> >> APIs for EPC allocation and deallocation, respectively.
> >> - Setup sgx_epc resource type specific callbacks, which perform
> >> initialization and cleanup during cgroup allocation and deallocation,
> >> respectively.
> >>
> >> With these changes, the misc cgroup controller enables user to set a
> >> hard
> >> limit for EPC usage in the "misc.max" interface file. It reports current
> >> usage in "misc.current", the total EPC memory available in
> >> "misc.capacity", and the number of times EPC usage reached the max limit
> >> in "misc.events".
> >>
> >> For now, the EPC cgroup simply blocks additional EPC allocation in
> >> sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
> >> still tracked in the global active list, only reclaimed by the global
> >> reclaimer when the total free page count is lower than a threshold.
> >>
> >> Later patches will reorganize the tracking and reclamation code in the
> >> global reclaimer and implement per-cgroup tracking and reclaiming.
> >>
> >> Co-developed-by: Sean Christopherson <[email protected]>
> >> Signed-off-by: Sean Christopherson <[email protected]>
> >> Signed-off-by: Kristen Carlson Accardi <[email protected]>
> >> Co-developed-by: Haitao Huang <[email protected]>
> >> Signed-off-by: Haitao Huang <[email protected]>
> >
> > For consistency sake I'd also add co-developed-by for Kristen. This is
> > at least the format suggested by kernel documentation.
> >
> She is the "From Author", so only Signed-off-by is needed for her
> according to the second example in the doc[1]?
Right but okay then Kristen afaik be the last sob so only re-ordering
is needed I guess.
>
> Thanks
> Haitao
> [1]https://docs.kernel.org/process/submitting-patches.html#when-to-use-acked-by-cc-and-co-developed-by
BR, Jarkko